NSX 6.3.6 to 6.4.5 – Controller problem encountered

NSX upgrades can be delicate, even when everything appears to be in perfect shape.

After we had successfully upgraded the NSX Managers, we proceeded with upgrading the NSX Controllers. As a pre-check we issued the command “show control-cluster status”, and it looked fine. The upgrade to 6.4.6 went well, and we could vMotion VMs around after the controllers were booted. But the post-checks were not OK: “show control-cluster status” did not return the expected output, and we were not confident enough to proceed with the host upgrades.

After some troubleshooting we found that the /var/log partition was full on two of the three controllers. Without any other evidence, we concluded that this was the problem. Some google-fu did not turn up any KB article or blog post on how to purge the logs.
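
The check that led us to this finding can be sketched as follows. This is only an illustration in Python, assuming a Python 3 interpreter is available wherever you run it (we did not verify that on the controllers themselves; plain `df -h /var/log` from a shell answers the same question). The function names are hypothetical, and the percentage is approximate because it ignores filesystem-reserved blocks.

```python
import os
import shutil


def partition_usage_percent(path="/var/log"):
    """Return roughly how full the filesystem holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Some pseudo filesystems report a size of zero.
        return 0.0
    return 100.0 * usage.used / usage.total


def largest_files(path="/var/log", top=5):
    """Return the `top` largest regular files under `path` as (size, path) pairs."""
    sizes = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                # Skip symlinks so a link to a big file is not counted twice.
                if not os.path.islink(full) and os.path.isfile(full):
                    sizes.append((os.path.getsize(full), full))
            except OSError:
                # Files can be rotated away while we walk the tree.
                continue
    return sorted(sizes, reverse=True)[:top]
```

Calling `partition_usage_percent()` shows whether the partition is close to full, and `largest_files()` lists the files that are worth looking at first. In our case that was a rotated syslog file.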

But we found out that we could get into an engineering mode that would give us shell access. Long story short, we did the following:

1. Followed https://kb.vmware.com/s/article/2149630 to gain shell access on the manager.
1.1 The password is IAmOnThePhoneWithTechSupport
2. Extracted the root passwords for the controllers with /home/secureall/secureall/sem/WEB-INF/classes/GetNvpApiPassword.sh controller-nn
3. Logged into each controller and issued: debug os-shell, thereby gaining root shell access.
4. Deleted /var/log/syslog.1 on each node.
5. Did a rolling restart of the controllers; after they booted, they all joined the cluster.

 

After this we got the status we wanted. In the meantime we had created a case with VMware support, and the support engineer was on a remote session with us. We told him what we had done, and together we verified that the controllers were healthy, which they were.

Next step, VIB upgrade on the hosts.

Good commands to know:

Edit: This article from VMware describes the exact problem we encountered. We also contacted VMware Support, but before they were able to assist us we had the problem solved. 🙂
https://kb.vmware.com/s/article/59509

Process of getting the root password for controllers.

NSX Edge PowerShell manipulation

This is from a VMware support experience. A customer could not change the DNS server parameters of the NSX Edge IP pool. The actual cause turned out to be a bug in VCD 9.5: an Edge XML config was missing some tags, so the XML could not be validated when VCD posted the edited XML config back to NSX Manager.

I have attached VMware support's answer at the bottom of the post.

The script gets all edges from the NSX Manager; you then find the correct one and fill it into the next part of the script. Next you download the XML to a file on your local machine, edit the file to put in the missing tags, and lastly PUT the XML back to the NSX Manager. After this operation, it works from the GUI again.
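
The same flow can be sketched in Python. This is not the PowerShell script referred to above, only an illustration of the GET, edit, PUT round trip against the NSX for vSphere edge API (`/api/4.0/edges`). The helper names, the manager hostname and the credentials are hypothetical. The layout of the edge list response (`edgeSummary` elements with `objectId` and `name` children) is an assumption that you should check against your own NSX Manager.

```python
import base64
import ssl
import urllib.request
import xml.etree.ElementTree as ET


def edge_url(manager, edge_id=None):
    """URL of the edge collection, or of a single edge when edge_id is given."""
    base = f"https://{manager}/api/4.0/edges"
    return base if edge_id is None else f"{base}/{edge_id}"


def build_request(url, user, password, method="GET", body=None):
    """Build a basic-auth request; `body` is the raw XML bytes for a PUT."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(url, data=body, method=method)
    req.add_header("Authorization", f"Basic {token}")
    if body is not None:
        req.add_header("Content-Type", "application/xml")
    return req


def list_edges(xml_text):
    """Extract (objectId, name) pairs from the edge list XML (assumed layout)."""
    root = ET.fromstring(xml_text)
    return [(s.findtext("objectId"), s.findtext("name"))
            for s in root.iter("edgeSummary")]


def send(req, verify_tls=True):
    """Send the request and return the response body as bytes."""
    ctx = ssl.create_default_context()
    if not verify_tls:
        # Only for lab managers with self-signed certificates.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(req, context=ctx) as resp:
        return resp.read()


# Hypothetical usage, with "edge-nn" standing in for the edge you picked:
#   edges = list_edges(send(build_request(edge_url(mgr), user, pw)).decode())
#   xml = send(build_request(edge_url(mgr, "edge-nn"), user, pw))
#   ... save xml to a file, add the missing tags, read it back as bytes ...
#   send(build_request(edge_url(mgr, "edge-nn"), user, pw, method="PUT", body=fixed))
```

Keep an untouched copy of the downloaded XML before editing it, so the original config can be PUT back if the edited one is rejected.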

From Support:
– The issue you are seeing is a known issue in 9.5.
– Like I mentioned in the previous email, this is due to missing elements from the xml.
– From the xml in the logs, I could see there are 52 NAT rules on that edge. Correct me if I am wrong. The following 2 rules had the elements missing

I have attached the file with the list of all the NAT rules seen from the logs if you need to cross-verify.

Plan:
– To fix the issue, please follow https://kb.vmware.com/s/article/67193

If you have any further questions, let me know.

Have a good evening,

Best regards,

Deepthy