VCD – Force delete network

In our V2T conversion, the NSX for Cloud Director migration tool has had some issues during cleanup. One of them is that it can't delete the old NSX-V backed network even though nothing left in VCD is using it. The error message can be seen below.

2023-05-22 10:54:28,551 [connectionpool]:[_make_request]:452 [DEBUG] [tenant.01] | "DELETE /cloudapi/1.0.0/orgVdcNetworks/urn:vcloud:network:ce108a33-fa5c-4cae-8c16-60edd536ad20 HTTP/1.1" 400 None
2023-05-22 10:54:28,556 [vcdOperations]:[deleteOrgVDCNetworks]:1090 [DEBUG] [tenant.01] | Failed to delete Organization VDC Network lan.[ 1ca6fd03-de82-4835-b12e-58c5c043b2bc ] Network lan cannot be deleted, because it is in use by the following vApp Networks: lan.
2023-05-22 10:54:28,556 [vcdNSXMigratorCleanup]:[run]:230 [ERROR] [tenant.01] | Failed to delete Org VDC networks ['lan'] - as it is in use
Traceback (most recent call last):
  File "src\", line 218, in run
  File "<string>", line 1, in <module>
  File "src\core\vcd\", line 53, in inner
  File "src\core\vcd\", line 1094, in deleteOrgVDCNetworks
Exception: Failed to delete Org VDC networks ['lan'] - as it is in use

I found someone else with the same problem who had discovered a forceful way to delete the network. I have used their approach but wrapped it in PowerShell instead. In my case, I could get the network URN from the migration tool's log. Otherwise, you can easily read the URN from the browser URL when you are in the context of the network in the GUI.
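If you would rather not copy the URN by hand, a small helper can pull it out of the log text. This is only a sketch: the function name Get-NetworkUrnFromLog is made up for this post, and it simply matches anything shaped like urn:vcloud:network: followed by a GUID.

```powershell
# Hypothetical helper (not part of the migration tool): extract network URNs from log text.
function Get-NetworkUrnFromLog {
    param(
        [Parameter(Mandatory)]
        [string[]]$LogLine
    )
    # urn:vcloud:network: followed by a GUID (8-4-4-4-12 hex digits)
    $pattern = 'urn:vcloud:network:[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}'
    $LogLine |
        Select-String -Pattern $pattern -AllMatches |
        ForEach-Object { $_.Matches } |
        ForEach-Object { $_.Value } |
        Sort-Object -Unique
}

# The DELETE line from the log above, used as sample input
$sample = '"DELETE /cloudapi/1.0.0/orgVdcNetworks/urn:vcloud:network:ce108a33-fa5c-4cae-8c16-60edd536ad20 HTTP/1.1" 400 None'
$networkUrn = Get-NetworkUrnFromLog -LogLine $sample
$networkUrn
```

Against a real log you would pass the output of Get-Content instead of the sample string. If more than one network failed to delete, the function returns several URNs, so pick the one you want before using it.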

### Variables
$vcdUrl = ""
$apiusername = "@system"
$password = ''
$networkUrn = "urn:vcloud:network:ce108a33-fa5c-4cae-8c16-60edd536ad20"

### Auth against API and enable TLS1.2 for PowerShell
$base64AuthInfo = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(("{0}:{1}" -f $apiusername,$password)))
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12
$auth =Invoke-WebRequest -Uri "$vcdUrl/api/sessions" -Headers @{Accept = "application/*;version=36.0";Authorization="Basic $base64AuthInfo"} -Method Post

### Get VirtualWire
$virtualWire = Invoke-RestMethod -Uri "$vcdUrl/cloudapi/1.0.0/orgVdcNetworks/$($networkUrn)" -Headers @{Accept = "application/json;version=36.0";Authorization="Bearer $($auth.Headers.'X-VMWARE-VCLOUD-ACCESS-TOKEN')"} -Method GET

### Delete VirtualWire
$deleteStatus = Invoke-RestMethod -Uri "$vcdUrl/cloudapi/1.0.0/orgVdcNetworks/$($networkUrn)?force=true" -Headers @{Accept = "application/json;version=36.0";Authorization="Bearer $($auth.Headers.'X-VMWARE-VCLOUD-ACCESS-TOKEN')"} -Method DELETE

Use the PowerShell above at your own risk 🙂

NSX 4.0.1 > 4.1.0 upgrade problems

The precheck returned a warning for every edge node, stating the problem below.

Edge node 4006d386-a394-43a4-6b04b242f8b3 vmId is not found on NSX Manager. Please refer to

The KB article states that the NSX managers are missing the VM_ID for the edge nodes, and gives an example of how to manually find the edge VM moref and post it to the NSX API.

Using PowerShell to update the VM_ID

Instead of the manual procedure from the KB, I made a small script.

### Login details
$nsxUsername = "admin"
$nsxPassword = "Yi....kes12!"
$nsxmanager = "nsxt.home.lab"

### Connect to vcenter so that we can fetch moref
Connect-VIServer vcsa1.home.lab

### NSX Manager auth header
$Type = "application/json;charset=UTF-8"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword)) }
$nsxUri = "https://$($nsxmanager)"

### Edge Vm moref update
$edgenodes =  (Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes" -Headers $Header -Method GET -ContentType $Type).results | Where-Object {$_.node_deployment_info.deployment_type -eq "VIRTUAL_MACHINE"}

### Loop through the edge nodes
foreach($edgenode in $edgenodes){
    write-host "Updating edge node - $($edgenode.display_name)"
    $specEdge =  (Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($" -Headers $Header -Method GET -ContentType $Type).node_deployment_info.deployment_config

    # PowerCLI returns the Id as "VirtualMachine-vm-<number>"; keep only the number
    $vmid = ((get-vm $edgenode.display_name).Id).Split("-")[-1]
    write-host "Found edge node moref in vcenter - vm-$vmid"

    write-host "Removing form factor and adding vm_id to object"
    $specEdge | Add-Member -NotePropertyName vm_id -NotePropertyValue "vm-$vmid"
    $specEdge = $specEdge | Select-Object -Property * -ExcludeProperty form_factor

    try {
        write-host "Updating against NSX API"
        Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($" -Headers $Header -Method POST -ContentType "application/json" -Body $($specEdge|ConvertTo-Json -Depth 10)
        write-host "$($edgenode.display_name) updated with success"
    }
    catch {
        # Read the response body to get the error message returned by NSX
        $streamReader = [System.IO.StreamReader]::new($_.Exception.Response.GetResponseStream())
        $ErrResp = $streamReader.ReadToEnd() | ConvertFrom-Json
        write-host "$($ErrResp.error_message) - $($edgenode.display_name) not updated with success" -ForegroundColor Red
    }
}
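The moref handling is the part of the script that is easiest to get wrong, so here it is in isolation. This is a sketch of an alternative to the Split/re-prefix approach: it strips the type prefix from the PowerCLI Id and keeps the rest. ConvertTo-MoRef is a name I made up, and the Id in the example is invented.

```powershell
# Hypothetical helper: convert a PowerCLI object Id into the bare managed object reference.
function ConvertTo-MoRef {
    param(
        [Parameter(Mandatory)]
        [string]$Id
    )
    # PowerCLI Ids look like "<Type>-<moref>", for example "VirtualMachine-vm-1234";
    # drop everything up to and including the first dash.
    $Id.Substring($Id.IndexOf('-') + 1)
}

ConvertTo-MoRef -Id 'VirtualMachine-vm-1234'   # vm-1234
```

With a live vCenter connection the input would be (Get-VM $edgenode.display_name).Id, and the result is the value that goes into vm_id.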

Unfortunately, even after updating vm_id, the precheck still failed with the same error. The NSX API accepted the POST with code 200, but nothing happened behind the scenes.

VMware support:

We opened a VMware SR and gave them logs, along with the information that we had already tried to update the vm_id. They asked us to try a couple of other things.

Refreshing the edge node config data

VMware support asked us to try an API call that refreshes the edge node config data; it is described in the NSX API info on VMware code. With the variables from the previous script, we can do a POST against the NSX API.

Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($" -Headers $Header -Method POST

Reboot of edge nodes

Put an edge node into NSX maintenance mode and then reboot it from vCenter with “Restart Guest OS”. The reboot went fine and the edge node was put back into production.

Reboot of NSX Managers

Reboot the NSX Managers one at a time, and of course wait for the rebooted node to come back online with no errors before continuing with the next one.
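The "wait until it is back" part can be scripted with a generic polling loop. The helper below is a minimal sketch; Wait-Until is a made-up name, and the commented-out usage assumes the NSX cluster status endpoint and its mgmt_cluster_status.status field, which you should check against the API guide for your version.

```powershell
# Hypothetical helper: poll a condition until it returns $true or the timeout expires.
function Wait-Until {
    param(
        [Parameter(Mandatory)]
        [scriptblock]$Condition,
        [int]$TimeoutSeconds = 900,
        [int]$IntervalSeconds = 15
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ((Get-Date) -lt $deadline) {
        try {
            if (& $Condition) { return $true }
        }
        catch {
            # The API is unreachable while the node reboots; treat errors as "not ready yet"
        }
        Start-Sleep -Seconds $IntervalSeconds
    }
    return $false
}

# Assumed usage against NSX (endpoint and field names are assumptions, not from this post):
# Wait-Until -Condition {
#     $status = Invoke-RestMethod -Uri "$nsxUri/api/v1/cluster/status" -Headers $Header -Method GET
#     $status.mgmt_cluster_status.status -eq 'STABLE'
# }
```

The function returns $false on timeout instead of throwing, so the calling script can decide whether to stop or to keep waiting before touching the next manager.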


Unfortunately, none of the above helped; the precheck still gave the same error.

Further troubleshooting:

Looking at the API guide, I stumbled over an edge node redeployment call. I have redeployed edge nodes manually before, and I have to say, it’s a pain! It’s not hard, but it takes a lot of time. This call, however, redeploys the node in a way that doesn’t affect the data plane.

  • The edge is put into NSX maintenance mode.
  • The edge is then deleted and a new one is deployed with the same name as the old one.
  • After the new edge node is deployed and registered in the manager, it exits maintenance mode and goes into production again.

Executing the API call

First, we get the config for the specific edge. Afterward, we post to the redeploy API with the body that we got from the config get request.

The try-catch helps you get a better error description if something goes wrong.

$redeployBody = (Invoke-RestMethod -Uri $nsxUri/api/v1/transport-nodes/$($ -Headers $Header -Method GET) 

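try {
    Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($" -Headers $Header -Method POST -ContentType "application/json" -Body ($redeployBody | ConvertTo-Json -Depth 10)
}
catch {
    # Read the response body to get the error message returned by NSX
    $streamReader = [System.IO.StreamReader]::new($_.Exception.Response.GetResponseStream())
    $ErrResp = $streamReader.ReadToEnd() | ConvertFrom-Json
    write-host "$($ErrResp.error_message)" -ForegroundColor Red
}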

After redeployment, more problems….

After the redeployment of one edge node, the vm_id error was gone for the redeployed node. Great! But now the upgrade coordinator gave a new precheck error… I can’t remember it exactly, but it said something like “The Host Upgrade Unit Groups are not suitable for a T0.”

Google results pointed in the direction of a VMware KB on resetting the upgrade plan, but that was not successful. (I later found out that there is a “Reset plan” button on the next page of the upgrade wizard.)

I continued to redeploy the remaining edge nodes, which cleared all the errors from the upgrade precheck. The upgrade could then begin 😀

And another problem…

After the upgrade of the edge nodes that held the T0 gateway, one of the edge nodes would not establish BGP with the physical fabric. I tried a lot of things.

  • Redeployment of the edge node with the redeploy API action didn’t help.
  • Migrating the edge node to the same hosts as the working edge node was on, didn’t help.
  • Ping from working edge node to non-working on the VLAN uplink IPs worked.
  • Ping from non-working to physical fabric didn’t work.

I then tried removing and re-adding the T0 interfaces on the non-working edge node, and that made everything work again. A quite random bug?


NSX upgrades seem to be quite troublesome compared to NSX-V upgrades, or let’s say there is room for improvement. The good thing is that when the edge nodes are upgraded, all tenants are upgraded as well, and the upgrade happens with zero downtime. The T0/T1 robustness is amazing: no drops, and no IPsec tunnels going down.

VCD CPI/CCM – Load balancer

When a TKG cluster is deployed by Container Service Extension (CSE), it lives within VMware Cloud Director (VCD).

Inside this TKG cluster, you will find a Cloud Controller Manager (CCM) pod under kube-system called “vmware-cloud-director-ccm”. The CCM pod is part of the Cloud Provider Interface (CPI), which gives you capabilities such as adding a Persistent Volume (PV) or doing load balancing with NSX Advanced Load Balancer (ALB). Basically, the CCM contacts the VCD CPI API and from there orchestrates what you requested in your Kubernetes YAML files.

At the time of writing, L4 load balancing (LB) through ALB is the only option available, because the CPI does not yet support creating L7 LBs with ALB. It’s on the roadmap though.

One-arm vs. two-arm…

I found that there are two concepts worth knowing about: one-arm and two-arm load balancers.

The two LB methods are described here. It’s an old article, so AVI/ALB was not in the VMware portfolio back then. NSX-T has also since moved away from the LB service that lived as haproxy within the tier-1 and over to AVI/ALB service engines.

The default setting of load balancing with VCD CPI is two-arm, meaning that it will tell VCD to create a DNAT rule towards a 192.168.8.x internal subnet used to create ALB VIPs.

WAN > T1 DNAT(185.139.232.x:80)> 192.168.8.x(LB internal subnet) > ALB SE > LB Pool members

Since L7 LB features in VCD are not yet available, and would also become very costly, most customers will probably choose to run an Nginx or Apache ingress controller inside their own Kubernetes cluster.

Since VCD 10.4 the two-arm config has been working, and it is the more desirable option because you can use multiple ports on a single public IP, whereas a one-arm config would allocate one public IP per Kubernetes service (correct me if I’m wrong).

If you are running your own ingress controller then some find the one-arm approach more desirable since ALB will then hold the public IP address.

WAN > T1 Static Route to ALB (185.139.232.x) > ALB SE > LB Pool members

How to change to one arm LB?

Hugo Phan has done a good write-up on his blog.

Basically, you download the existing config and change the config map of the VCD CPI, removing the part where the 192.168.8.x subnet is defined. After this, you delete the existing CPI and then add it again from the YAML file that you edited.

Snip from Hugo Phan blog –

How can I use/test this with my VCD TKG cluster?

If you don’t have a demo app that you prefer, then I can recommend either yelb or retrogames. Here I will do it with yelb. William Lam has done a good write-up and also hosts deployment files for yelb.

Step 1 – Deploy the application

kubectl create ns yelb
kubectl apply -f

Step 2 – Check that all pods are running

jeram@QL4QJP2F4N ~ % kubectl -n yelb get pods
NAME                             READY   STATUS    RESTARTS   AGE
redis-server-74556bbcb7-f8c8f    1/1     Running   0          6s
yelb-appserver-d584bb889-6f2gr   1/1     Running   0          6s
yelb-db-694586cd78-27hl5         1/1     Running   0          6s
yelb-ui-8f54fd88c-cdvqq          1/1     Running   0          6s
jeram@QL4QJP2F4N ~ % 

The deployment file asks k8s for a service of type LoadBalancer. The CCM picks this up and asks the VCD CPI to have ALB create the L4 load balancer.

Task view from VCD

Step 3 – Get the IP and go check out the yelb site

jeram@QL4QJP2F4N ~ % kubectl -n yelb get svc/yelb-ui
NAME      TYPE           CLUSTER-IP     EXTERNAL-IP       PORT(S)        AGE
yelb-ui   LoadBalancer   185.177.x.x   80:32047/TCP   5m52s
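If you want the IP in a script rather than reading it off the table, you can parse the JSON form of the service. A minimal sketch, assuming the usual Kubernetes Service layout where the address appears under status.loadBalancer.ingress; Get-ServiceExternalIp is a made-up name and the sample JSON is trimmed down, with a documentation address in place of the real IP.

```powershell
# Hypothetical helper: read the load balancer IPs from "kubectl get svc -o json" output.
function Get-ServiceExternalIp {
    param(
        [Parameter(Mandatory)]
        [string]$Json
    )
    $svc = $Json | ConvertFrom-Json
    # status.loadBalancer.ingress stays empty until the load balancer has been provisioned
    foreach ($ingress in @($svc.status.loadBalancer.ingress)) {
        if ($ingress.ip) { $ingress.ip }
    }
}

# Trimmed-down sample of a Service object; 192.0.2.10 is a documentation address
$sampleJson = '{"kind":"Service","status":{"loadBalancer":{"ingress":[{"ip":"192.0.2.10"}]}}}'
Get-ServiceExternalIp -Json $sampleJson
```

Against the cluster, the input would come from something like (kubectl -n yelb get svc yelb-ui -o json | Out-String). An empty result means the CPI has not finished creating the load balancer yet.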

Step 4 – Scale the UI

Let’s see how many instances we have from the initial deployment.

jeram@QL4QJP2F4N ~ % kubectl get rs --namespace yelb
NAME                       DESIRED   CURRENT   READY   AGE
redis-server-74556bbcb7    1         1         1       8m11s
yelb-appserver-d584bb889   1         1         1       8m11s
yelb-db-694586cd78         1         1         1       8m11s
yelb-ui-8f54fd88c          1         1         1       8m11s
jeram@QL4QJP2F4N ~ % 

We can then scale the UI to 3 and check again to see if that happens.

jeram@QL4QJP2F4N ~ % kubectl scale deployment yelb-ui --replicas=3 --namespace yelb
deployment.apps/yelb-ui scaled

jeram@QL4QJP2F4N ~ % kubectl get rs --namespace yelb
NAME                       DESIRED   CURRENT   READY   AGE
redis-server-74556bbcb7    1         1         1       9m45s
yelb-appserver-d584bb889   1         1         1       9m45s
yelb-db-694586cd78         1         1         1       9m45s
yelb-ui-8f54fd88c          3         3         3       9m45s
jeram@QL4QJP2F4N ~ % 

The UI is now scaled to 3 replicas. The Load Balancer view in VCD will still only show the worker nodes, since k8s does its own load balancing across the pod instances.

Step 5 – Cleanup

jeram@QL4QJP2F4N ~ % kubectl -n yelb delete pod,svc --all && kubectl delete namespace yelb
pod "redis-server-74556bbcb7-f8c8f" deleted
pod "yelb-appserver-d584bb889-6f2gr" deleted
pod "yelb-db-694586cd78-27hl5" deleted
pod "yelb-ui-8f54fd88c-6llf7" deleted
pod "yelb-ui-8f54fd88c-9r6wf" deleted
pod "yelb-ui-8f54fd88c-cdvqq" deleted
service "redis-server" deleted
service "yelb-appserver" deleted
service "yelb-db" deleted
service "yelb-ui" deleted
namespace "yelb" deleted
jeram@QL4QJP2F4N ~ % 

Again, CCM will instruct VCD CPI to clean up. NICE!


We now have a good idea of how load balancing works with K8s clusters deployed by VCD TKG. Of course we are looking forward to the L7 features, but it’s a good start, and VMware is working hard to make k8s deployment and day-2 operations easier.

NSX-T – Topology view troubleshooting

Ever seen the picture below when trying to open the cool topology view in NSX-T?

If so, there is an API call that resyncs the topology view so you can see it again.

# Login details
$nsxUsername = "admin"
$nsxPassword = ""
$nsxmanager = "nsxt.home.lab"

# NSX Manager auth header
$Type = "application/json;charset=UTF-8"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword)) }
$nsxUri = "https://$($nsxmanager)"

# Request topology resync
Invoke-RestMethod -Uri "$nsxUri/policy/api/v1/ui/network-topology/resync" -Headers $Header -Method POST -ContentType $Type

You don’t get any feedback from the API call, but after a few minutes you will be able to see the topology again.
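The Basic auth header is built the same way in several of the scripts here, so it could be wrapped in a small function. This is just a sketch; New-BasicAuthHeader is a made-up name and the credentials are placeholders.

```powershell
# Hypothetical helper: build the HTTP Basic auth header used by the scripts in this post.
function New-BasicAuthHeader {
    param(
        [Parameter(Mandatory)]
        [string]$Username,
        [Parameter(Mandatory)]
        [AllowEmptyString()]
        [string]$Password
    )
    # Basic auth is base64("username:password")
    $pair  = '{0}:{1}' -f $Username, $Password
    $token = [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($pair))
    @{ Authorization = "Basic $token" }
}

# Placeholder credentials, not real ones
$Header = New-BasicAuthHeader -Username 'admin' -Password 'not-a-real-password'
```

The returned hashtable can be passed straight to -Headers on Invoke-RestMethod. Keep in mind that base64 is an encoding and not encryption, so only send it over HTTPS.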

NSX-T Edge node password management

The default NSX-T password expiration time is set to 90 days. But in a lab environment, this is not required. So here is a bit on how to disable the timer but also how to recover from an expired or forgotten password.

Reset expired password

If the password for any of the users audit, root, or admin has expired, you will see it when you try to log in with SSH. It will prompt you to enter the current password followed by the new one twice. Since this is only a home lab and I would like to keep the previous password, I first set a new, quick-to-remember password: Fimmer_old_password1. The SSH session then disconnects and you start a new connection with the new password.

nsx-edge> set user admin password My-New_VMware1!_Password old-password Fimmer_old_password1

After the reset and re-reset you have 90 days of password validity again. Or you could disable the password expiration…

If you find yourself in the situation of a forgotten admin password.

You will most likely be able to log in with the root account. Even if its password has expired, using the console of the Edge VM will always work. From there you can use the normal Linux password reset command to reset the admin account password.

passwd admin

And if you have tried the wrong password too many times, you can unlock the account with pam_tally2.

pam_tally2 --user admin --reset

Another note: when you are logged in as root, you can still use nsxcli. Just run your nsxcli commands through su admin, as shown below.

su admin '-c clear user audit password-expiration'

If you find yourself completely locked out of NSX-T

VMware has some good documentation on this. Basically, it is:

  • Connect to the console of the appliance and reboot the system. When the GRUB boot menu appears, press the left SHIFT or ESC key quickly. Press e to edit the menu. Press e to edit the selected option.
  • Search for the line starting with linux and add systemd.wants=PasswordRecovery.service to the end of the line. Press Ctrl-X to boot.

Set password to never expire

SSH to the edge node with the admin account. Using nsxcli we can adjust the expiration up to a maximum of 9999 days. The commands below set the password expiration to 9999 days and clear the expiration if it has already happened. VMware has it in their documentation here

nsx-edge> set user admin password-expiration 9999
nsx-edge> set user root password-expiration 9999
nsx-edge> set user audit password-expiration 9999
nsx-edge> clear user admin password-expiration
nsx-edge> clear user root password-expiration
nsx-edge> clear user audit password-expiration
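With several edge nodes it gets tedious to type these, and the six lines follow one pattern. As a small convenience sketch (my own addition, nothing NSX provides), PowerShell can print the command list for pasting into each SSH session:

```powershell
# Print the nsxcli commands for each local user so they can be pasted into an SSH session.
$users = 'admin', 'root', 'audit'
$days  = 9999

$commands = foreach ($user in $users) {
    "set user $user password-expiration $days"
    "clear user $user password-expiration"
}
$commands
```

Change $days if your security policy calls for something shorter than 9999.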


Something that is always better than passwords is SSH keys. You can add multiple SSH keys to the same user in NSX-T. The cool thing is that each key has a label, so multiple people can have access with their own SSH key. This way you avoid some of the hassle of using passwords with your SSH connections.

nsx-edge> set user admin ssh-keys label jr type ssh-rsa value AAAAB3NzaC1yc2EAAAADAQABAAACAQC/VPq30qzyJHr8v6qh1vF1CVY8R9U09iCkqnIs9H6d9hBOeDu/e52rPj2BOQUfHwBmGRPVqZUyuOO20hDgT/BzP0QxISv9l2OpFariz8AmHu9m4kUwAdrBDvplw8fFeafppUwQF/aFsIF+t1PtFluz0Bp3N/sp3NQGWfkez7myctGc9X3eMc6oUAYrPPJeDZz1x5JoGdwdH/w6wjr3uK03kRx6TX1kNqxSypIQQ/8lYg1TG7yAuF5DhX4fJrPjpiLau1H6z0vChVpqY1q8oMntzHHtYtByFMrNtWFfAvG94BT27h/Lkmz5JM5d41TbL0YdZT8zCTrXzUG87wdEaRiB5ZeKy9LENgfxKO66scSU2gjiXwpyJTrHKZYz9g5EERH/41w+qMT90HAM3ArSIvk7pROoKhZy0IeOwfWbmMlQvKQFjS7OtKnFEeVUYRqnLvi3XeUiFbLxmW3ID8IqQy3iDNuESiVNRcp/PoN7lxL9cfGJdXBuJ3PBcaQZx/vQpePRqW9eBSmhhS1beIUlLV0UOFdRGTMMMjOlp7m7jaw5EnvztbInfPOdMPoUuSL9iGut7M0SVMgEzo0MiJDHNdLQYK0EKO8qrWMz76UHhpdnhOQNdi3/wtVVzxVUR/D9zBa1q2oL8ml7jKVubVbBd6Vm0lEEquDEN3I9Dan/Ev0j9w==


Having an expired password will cause you all sorts of trouble.

If you don’t have a PAM solution that can help you to automatically change the password, then setting the expiration to 9999 days will for sure help your manageability.

Putting your SSH key onto the nodes and managers will help you in the long run, and is in my opinion also a more secure solution than having passwords.

NSX API – DLR L2 bridging

Here is a script for mass DLR L2 bridge creation. I had to bridge a couple of hundred VLANs to VXLANs, and while it might have been faster to create them by hand, I would not have learned anything.

The script reads from a CSV file where I have all my info. It then loops through the entries, creates a distributed port group for each, and then initiates an L2 bridge. The VXLANs had been created prior to this operation.
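The CSV itself is not shown here. Judging from how the script uses $net.acl and $net.mitvlan, it needs at least those two columns, separated by semicolons. The sketch below uses made-up rows to show the expected shape and the port group name each row turns into.

```powershell
# Made-up sample rows; only the column names acl and mitvlan come from the script.
$sampleCsv = @'
acl;mitvlan
acl-10344;344
acl-10345;345
'@

$rows = $sampleCsv | ConvertFrom-Csv -Delimiter ';'
foreach ($net in $rows) {
    # Same naming rule the script uses for the distributed port group
    $vdportgroup = ("zitmit-$($net.acl)").ToLower()
    '{0} -> VLAN {1}' -f $vdportgroup, $net.mitvlan
}
```

Running a dry pass like this over the real file before creating anything is a cheap way to catch a wrong delimiter or a misspelled column header.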

$csv = Import-Csv "D:\temp\VLAN.csv" -Delimiter ";"
Import-Module PowerNSX
get-module -name vmware* -ListAvailable | Import-Module

$cred = get-credential
connect-viserver -server -Credential $cred

# NOTE: create-nsxl2bridge (defined below) must be loaded before this loop runs
foreach ($net in $csv) {
    $vdportgroup = ("zitmit-$($net.acl)").ToLower()

    $exists = Get-VDSwitch -Name "DSMpls01-EX" | Get-VDPortgroup -Name $vdportgroup -ErrorAction SilentlyContinue
    if (!$exists) {
        Get-VDSwitch -Name "DSMpls01-EX" | New-VDPortgroup -Name $vdportgroup -VLanId $net.mitvlan -NumPorts 2
        # Read the port group back to confirm that it was created
        $created = Get-VDSwitch -Name "DSMpls01-EX" | Get-VDPortgroup -Name $vdportgroup
        if ($created) {
            Write-Host -ForegroundColor Green "Portgroup created: $vdportgroup"

            $vdportgroupId = ($created.Id).Replace("DistributedVirtualPortgroup-","")
            $vdportgrpupName = $created.Name

            create-nsxl2bridge -aclname $($net.acl) -dvportGroup $($created.key)
        }
    }
    else {
        Write-Host -ForegroundColor Yellow "Portgroup has already been created: $vdportgroup"
        #Get-VDSwitch -Name "DSMpls01-EX" | New-VDPortgroup -Name $vdportgroup -VLanId $net.mitvlan -NumPorts 2
    }
}

Function create-nsxl2bridge {
    param(
        [string]$aclname,
        [string]$dvportGroup
    )

    # Login info
    $nsxUsername = 
    $nsxPassword = 

    # Allow all SSL protocols
    $AllProtocols = [System.Net.SecurityProtocolType]'Ssl3,Tls,Tls11,Tls12' 
    [System.Net.ServicePointManager]::SecurityProtocol = $AllProtocols

    # Connect to NSX manager
    $connection = Connect-NsxServer -Username $nsxUsername -Password $nsxPassword -WarningAction SilentlyContinue
    $virtualwire = Get-NsxLogicalSwitch | Where-Object { $ -match "$aclname" -and $ -notmatch "lan" }

    if ($virtualwire.count -gt 1) {
        $message = "Something could be wrong - $aclname"
        write-host $message -ForegroundColor yellow
        $message | Out-File C:\log\create-nsxl2bridge.txt -Append
        $virtualwire = $virtualwire[0]
    }
    elseif (!$virtualwire) {
        $message = "virtualwire was not found: $($virtualwire.objectId) - acl: $aclname"
        write-host $message -ForegroundColor yellow
        $message | Out-File C:\log\create-nsxl2bridge.txt -Append
    }

    # Edge info
    $edgeId = "edge-1120"
    $Type = "Accept: application/xml"
    $Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword)) }
    $nsxUri = "$edgeId/bridging/config"

    # Getting edge config
    $currentL2Config = $null
    $currentL2Config = Invoke-RestMethod -Uri $nsxUri -Headers $Header -Method GET -ContentType $Type

    # Check if already there
    foreach ($z in $currentL2Config.SelectNodes("//name")) {
        if ($z.'#text' -match $aclname ) {
            write-host "Already exists: $aclname" -ForegroundColor yellow
        }
    }

    # Add extra xml node to currentconfig
    $handler1 = $currentL2Config.CreateNode('element', "bridge", '')
    $attr = $currentL2Config.CreateNode('element', "bridgeId", '')
    $attr.InnerText = "$null"
    $handler1.AppendChild($attr) | Out-Null
    $attr = $currentL2Config.CreateNode('element', "name", '')
    $attr.InnerText = "$aclname"
    $handler1.AppendChild($attr) | Out-Null
    $attr = $currentL2Config.CreateNode('element', "virtualWire", '')
    $attr.InnerText = "$($virtualwire.objectId)"
    $handler1.AppendChild($attr) | Out-Null
    $attr = $currentL2Config.CreateNode('element', "dvportGroup", '')
    $attr.InnerText = "$dvportGroup"
    $handler1.AppendChild($attr) | Out-Null

    # Remove nodes from existing XML
    $currentL2Config.SelectNodes("//virtualWireName") | ForEach-Object { $_.ParentNode.RemoveChild($_) }
    $currentL2Config.SelectNodes("//isSharedNetwork") | ForEach-Object { $_.ParentNode.RemoveChild($_) }
    $currentL2Config.SelectNodes("//dvportGroupName") | ForEach-Object { $_.ParentNode.RemoveChild($_) }

    # Add the newly created node to existing XML
    # (assumes the bridge elements sit directly under the document root)
    $currentL2Config.DocumentElement.AppendChild($handler1) | Out-Null

    # PUT edge config
    $respons = Invoke-RestMethod -Uri $nsxUri -Headers $Header -Method PUT -ContentType 'application/xml' -Body $currentL2Config.OuterXml
    write-host "L2 Created: $($virtualwire.objectId) - acl: $aclname" -ForegroundColor Green
}

NSX 6.3.6 to 6.4.5 – Controller problem encountered

NSX upgrades can be a delicate thing, even when everything is in its finest shape.

After we had successfully upgraded the NSX Managers, we proceeded with upgrading the NSX Controllers. We did a pre-check and issued the command “show control-cluster status”, and it looked fine. The upgrade to 6.4.6 went well and we could vMotion VMs around after the controllers had booted. But the post-checks were not OK: “show control-cluster status” did not return what we expected, and we were not confident enough to proceed with the host upgrades.

After some troubleshooting we found that the /var/log partition was full on 2 of the 3 controllers. Without any other evidence, we concluded that this was the problem. Some google-fu did not turn up any KB or blog on how to purge the logs.

But we found out that we could get into an engineering mode that would give us shell access. Long story short, we did the following:

1. to gain shell access on manager
1.1 password is IAmOnThePhoneWithTechSupport
2. Extracting root passwords for controllers with /home/secureall/secureall/sem/WEB-INF/classes/ controller-nn
3. Logged into each controller and issued debug os-shell to gain root shell access.
4. Deleted /var/log/syslog.1 on each node.
5. Rolling restart of controllers and after they booted they all joined the cluster.


After this we got the status we wanted. In the meantime we had created a case with VMware support, and the supporter was on a remote session with us. We told him what we had done and verified that the controllers were healthy, and they were.

Next step, VIB upgrade on the hosts.

Good commands to know:

show process monitor
show controller list all
show control-cluster status

Edit: This article from VMware describes the exact problem we encountered. We also contacted VMware Support, but before they were able to assist us we had solved the problem. 🙂

Process of getting the root password for controllers.

NSX Edge PowerShell manipulation

This is from a VMware support experience. A customer could not change the DNS server parameters of the NSX Edge IP pool. It turned out to be caused by a bug in VCD 9.5, where the Edge XML config was missing some tags and therefore could not be validated when VCD posted the edited XML config back to NSX Manager.

I have attached the VMware support answer at the bottom of the post.

The script gets all edges from the NSX Manager; you then find the correct one and fill its ID into the next part of the script. Then you download the XML to a file on your local machine, edit the file to put in the missing tags, and lastly PUT the XML back to NSX Manager. After this operation, it works from the GUI again.
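The edit itself can be done by hand in a text editor, but with many rules it may be easier to let PowerShell add the missing elements. The sketch below runs against a tiny made-up document. The element name exampleTag and its value are placeholders, since the real missing tags come from the support case and are not reproduced here.

```powershell
# Sketch only: the element names below are placeholders, not the real tags from the support case.
[xml]$edgeXml = @'
<edge>
  <natRules>
    <natRule><ruleId>1</ruleId><exampleTag>x</exampleTag></natRule>
    <natRule><ruleId>2</ruleId></natRule>
  </natRules>
</edge>
'@

foreach ($rule in $edgeXml.SelectNodes('//natRule')) {
    # Add the element only where it is missing
    if (-not $rule.SelectSingleNode('exampleTag')) {
        $node = $edgeXml.CreateElement('exampleTag')
        $node.InnerText = 'x'
        [void]$rule.AppendChild($node)
    }
}
$edgeXml.OuterXml
```

For the real file you would load it with [xml]$edgeXml = Get-Content $tempFile -Raw and write it back with $edgeXml.Save($tempFile) before the PUT. Diff the file before and after so you know exactly what is being sent to NSX Manager.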

# Import credential module and login information
$ReturnObj = import-credentials vmwareSSO
$nsxUsername = $ReturnObj.Username
$nsxPassword = $ReturnObj.Password

# Other variables
$tempFile = "C:\temp\edge-747_jvr.xml"

# Allow all SSL protocols
$AllProtocols = [System.Net.SecurityProtocolType]'Ssl3,Tls,Tls11,Tls12' 
[System.Net.ServicePointManager]::SecurityProtocol = $AllProtocols

# Ignore certificate validation errors (for self-signed certificates)
Add-Type @"
    using System;
    using System.Net;
    using System.Net.Security;
    using System.Security.Cryptography.X509Certificates;
    public class ServerCertificateValidationCallback
    {
        public static void Ignore()
        {
            ServicePointManager.ServerCertificateValidationCallback +=
                delegate
                (
                    Object obj,
                    X509Certificate certificate,
                    X509Chain chain,
                    SslPolicyErrors errors
                )
                {
                    return true;
                };
        }
    }
"@
[ServerCertificateValidationCallback]::Ignore()

# Getting all edges
$Type = "Accept: application/xml"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword))}
$nsxUri = ""

[xml]$edges = (Invoke-WebRequest -Uri $nsxUri -Headers $Header -Method GET -ContentType $Type).Content
foreach ($edge in $edges.pagedEdgeList.edgePage.edgeSummary) {
    $edgeInfo = "name: {0} - ID: {1}" -f $, $edge.objectId
}

# Getting specific edge config
$edgeId = "edge-747"
$Type = "Accept: application/xml"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword))}
$nsxUri = "$edgeId"

(Invoke-WebRequest -Uri $nsxUri -Headers $Header -Method GET -ContentType $Type).Content | out-file $tempFile

# PUT edge config after edit
$Type = 'application/xml'
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword))}
$nsxUri = "$edgeId"
# -Raw reads the file as a single string instead of an array of lines
$edgeConfigAltered = Get-Content $tempFile -Raw

$respons = Invoke-WebRequest -Uri $nsxUri -Headers $Header -Method Put -ContentType 'application/xml' -Body $edgeConfigAltered
# Status code 204 means the change was accepted

From Support:
– The issue you are seeing is a known issue 9.5.
– Like I mentioned in the previous email, this is due to missing elements from the xml.
– From the xml in the logs, I could see there are 52 NAT rules on that edge.Correct me if I am wrong. The following 2 rules had the elements missing


I have attached the file with the list of all the NAT rules seen from the logs if you need to cross-verify.

– To fix the issue,please follow

If you have any further questions,let me know.

Have a good evening,

Best regards,