NSX 4.0.1 > 4.1.0 upgrade problems

The precheck returned a warning for every edge node, stating the problem below.

Edge node 4006d386-a394-43a4-6b04b242f8b3 vmId is not found on NSX Manager. Please refer to https://kb.vmware.com/s/article/90072

The KB article states that the NSX Managers are missing the VM_ID for the edge nodes and gives an example of how to manually find the edge VM moref and post it to the NSX API.

Using PowerShell to update the VM_ID

Instead of the manual procedure from the KB, I made a small script.

### Login details
$nsxUsername = "admin"
$nsxPassword = "Yi....kes12!"
$nsxmanager = "nsxt.home.lab"

### Connect to vcenter so that we can fetch moref
Connect-VIServer vcsa1.home.lab

### NSX Manager auth header
$Type = "application/json;charset=UTF-8"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword)) }
$nsxUri = "https://$($nsxmanager)"

### Edge Vm moref update
$edgenodes =  (Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes" -Headers $Header -Method GET -ContentType $Type).results | Where-Object {$_.node_deployment_info.deployment_type -eq "VIRTUAL_MACHINE"}

### Loop through the edge nodes
foreach ($edgenode in $edgenodes) {
    Write-Host "Updating edge node - $($edgenode.display_name)"
    # Reset per node, otherwise one failure marks every later node as failed
    $ErrResp = $null
    $specEdge = (Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)" -Headers $Header -Method GET -ContentType $Type).node_deployment_info.deployment_config

    # The VM Id from PowerCLI looks like "VirtualMachine-vm-1234"; keep the trailing number
    $vmid = ((Get-VM $edgenode.display_name).Id).Split("-")[-1]
    Write-Host "Found edge node moref in vCenter - vm-$vmid"

    Write-Host "Removing form factor and adding vm_id to object"
    # -Force overwrites vm_id if the property already exists
    $specEdge | Add-Member -NotePropertyName vm_id -NotePropertyValue "vm-$vmid" -Force
    $specEdge = $specEdge | Select-Object -Property * -ExcludeProperty form_factor

    try {
        Write-Host "Updating against NSX API"
        Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)?action=addOrUpdatePlacementReferences" -Headers $Header -Method POST -ContentType "application/json" -Body ($specEdge | ConvertTo-Json -Depth 10)
    }
    catch {
        # Read the NSX error message from the response body.
        # This works in Windows PowerShell 5.1; in PowerShell 7 the body is in $_.ErrorDetails.Message instead.
        $streamReader = [System.IO.StreamReader]::new($_.Exception.Response.GetResponseStream())
        $ErrResp = $streamReader.ReadToEnd() | ConvertFrom-Json
        $streamReader.Close()
    }

    if ($ErrResp) {
        Write-Host "$($ErrResp.error_message) - $($edgenode.display_name) not updated with success" -ForegroundColor Red
    }
    else {
        Write-Host "$($edgenode.display_name) updated with success"
    }
}
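The moref handling in the script can be tried on its own, without vCenter. A minimal sketch, assuming PowerCLI returns VM Ids in the form "VirtualMachine-vm-1234"; the function name ConvertTo-VmMoref is made up for illustration and is not part of PowerCLI or the KB.

```powershell
# Hypothetical helper: turn a PowerCLI VM Id into the "vm-NNN" moref.
# Assumes the Id looks like "VirtualMachine-vm-1234"; anything else is rejected.
function ConvertTo-VmMoref {
    param([Parameter(Mandatory)][string]$VmId)

    if ($VmId -match '^VirtualMachine-(vm-\d+)$') {
        return $Matches[1]
    }
    throw "Unexpected VM Id format: $VmId"
}

ConvertTo-VmMoref -VmId 'VirtualMachine-vm-1234'   # vm-1234
```

With a live PowerCLI session, (Get-VM $edgenode.display_name).ExtensionData.MoRef.Value should give the same value directly. It is also worth checking that Get-VM returns exactly one VM for the display name before posting anything.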

Unfortunately, even after updating the vm_id, the precheck still failed with the same error. The NSX API accepted the POST with code 200, but nothing happened behind the scenes.

VMware support:

We opened a VMware SR and gave them logs and the info that we had already tried to update the vm_id. They asked us to try a couple of other things.

Refreshing the edge node config data

VMware support asked us to try an API call that refreshes the edge node config data (see the NSX API info on VMware code). With the variables from the previous script, we can do a POST against the NSX API. Here $edgenode is one of the nodes from $edgenodes.

Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)?action=refresh_node_configuration&resource_type=EdgeNode&read_only=true" -Headers $Header -Method POST

Reboot of edge nodes

We put an edge node into NSX maintenance mode and then rebooted it from vCenter with “Restart Guest OS”. The reboot went fine and the edge node was put back into production.
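Maintenance mode can also be toggled through the API instead of the UI. A minimal sketch, assuming the transport-node actions enter_maintenance_mode and exit_maintenance_mode from the NSX API guide apply to your version; Get-MaintenanceModeUri is a made-up helper name, and $nsxUri, $Header and $edgenode come from the script earlier in the post.

```powershell
# Hypothetical helper: build the maintenance-mode URI for a transport node.
# The two action names are assumptions; check them against the API guide for your NSX version.
function Get-MaintenanceModeUri {
    param(
        [Parameter(Mandatory)][string]$BaseUri,
        [Parameter(Mandatory)][string]$NodeId,
        [Parameter(Mandatory)]
        [ValidateSet('enter_maintenance_mode', 'exit_maintenance_mode')]
        [string]$Action
    )
    # ${NodeId} is braced so the "?" is not read as part of the variable name
    return "$BaseUri/api/v1/transport-nodes/${NodeId}?action=$Action"
}

# Local demo, no NSX Manager needed
Get-MaintenanceModeUri -BaseUri 'https://nsxt.home.lab' -NodeId 'example-node-id' -Action 'enter_maintenance_mode'

# Against a live manager (not run here):
# Invoke-RestMethod -Uri (Get-MaintenanceModeUri -BaseUri $nsxUri -NodeId $edgenode.id -Action 'enter_maintenance_mode') -Headers $Header -Method POST
```

The ValidateSet keeps a typo in the action name from ever reaching the manager.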

Reboot of NSX Managers

We rebooted the NSX Managers one at a time, of course waiting for the rebooted node to come back online with no errors before continuing with the next one.
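The waiting between reboots can be scripted. A minimal sketch, assuming GET /api/v1/cluster/status returns mgmt_cluster_status.status and control_cluster_status.status with the value STABLE when the cluster is healthy; Test-ClusterStable and Wait-ClusterStable are made-up names, and the status lookup is passed in as a script block so the loop can be tried without a live manager.

```powershell
# Hypothetical helper: $true when both cluster status fields report STABLE.
# The field names are an assumption about the /api/v1/cluster/status response.
function Test-ClusterStable {
    param([Parameter(Mandatory)]$Status)
    return (($Status.mgmt_cluster_status.status -eq 'STABLE') -and
            ($Status.control_cluster_status.status -eq 'STABLE'))
}

# Hypothetical helper: poll until the cluster is stable or the attempts run out.
function Wait-ClusterStable {
    param(
        [Parameter(Mandatory)][scriptblock]$GetStatus,
        [int]$MaxAttempts = 60,
        [int]$DelaySeconds = 30
    )
    for ($i = 1; $i -le $MaxAttempts; $i++) {
        try {
            if (Test-ClusterStable -Status (& $GetStatus)) { return $true }
        }
        catch {
            # A manager that is still booting may refuse connections; keep polling
        }
        if ($i -lt $MaxAttempts) { Start-Sleep -Seconds $DelaySeconds }
    }
    return $false
}

# Against a live manager (not run here):
# Wait-ClusterStable -GetStatus { Invoke-RestMethod -Uri "$nsxUri/api/v1/cluster/status" -Headers $Header -Method GET }
```

A $false result means the attempts ran out, which is the point to stop and look at the cluster before rebooting the next manager.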

Result

Unfortunately, none of the above helped; the precheck still gave the same error.

Further troubleshooting:

Looking at the API guide, I stumbled over an edge node redeployment call. I have redeployed edge nodes manually before, and I have to say, it’s a pain! It’s not hard, but it takes a lot of time. This call, however, redeploys the node in a way that doesn’t affect the data plane:

  • Edge is being put into NSX maintenance mode
  • Edge is then deleted and a new one is deployed, with the same naming as the old one.
  • After the edge node is deployed and registered in the manager it exits maintenance mode and goes into production again

Executing the API call

First, we get the config for the specific edge. Afterward, we post to the redeploy API with the body that we got from the config get request.

The try-catch helps you get a better error description if something goes wrong.

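# Fetch the current config of the edge node; it is used as the body of the redeploy call
$redeployBody = Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)" -Headers $Header -Method GET

try {
    # The object must be serialized to JSON before it is posted
    Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)?action=redeploy" -Headers $Header -Method POST -ContentType "application/json" -Body ($redeployBody | ConvertTo-Json -Depth 10)
}
catch {
    # Print the NSX error message from the response body (Windows PowerShell 5.1 pattern)
    $streamReader = [System.IO.StreamReader]::new($_.Exception.Response.GetResponseStream())
    $ErrResp = $streamReader.ReadToEnd() | ConvertFrom-Json
    $streamReader.Close()
    $ErrResp
}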

After redeployment, more problems….

After the redeployment of one edge node, the vm_id error was gone for the redeployed node. Great! But now the upgrade coordinator gave a new precheck error… I can’t remember it exactly, but it said something like “The Host Upgrade Unit Groups are not suitable for a T0.”

Google results pointed in the direction of a VMware KB about resetting the upgrade plan, but that was not successful. (I later found out that on the next page of the upgrade wizard, there is a “Reset plan” button.)

I continued to redeploy all the remaining edge nodes, which cleared all the errors from the upgrade precheck. The upgrade could then begin 😀

And another problem…

After the edge nodes that held the T0 gateway were upgraded, one of the edge nodes was not establishing BGP to the physical fabric. I tried a lot of things.

  • Redeployment of the edge node with the redeploy API action didn’t help.
  • Migrating the edge node to the same hosts as the working edge node was on, didn’t help.
  • Ping from working edge node to non-working on the VLAN uplink IPs worked.
  • Ping from non-working to physical fabric didn’t work.

Then I tried removing and re-adding the T0 interfaces on the non-working edge node. That made everything work again. A quite random bug?

Conclusion

NSX upgrades seem quite troublesome compared to NSX-V upgrades, or let’s say there is room for improvement. The good thing is that when the edge nodes are upgraded, all tenants are upgraded too, and the upgrade happens with zero downtime. The T0/T1 robustness is amazing: no drops, and no IPSec tunnels go down.

Cisco ASA cluster – upgrade

Cisco seems to have a good track record with their products, but I must say that their ASA firewalls have seen a lot of critical bugs in the last couple of years, both in hardware and software…

I was not informed about the last critical bug, so I didn’t catch it before the customer did. Always nice when a customer calls in with the problem of their primary ASA being down. It crashed in a way that meant it did not come up again. It needed a physical reboot.

Before anyone had the chance to go onsite, locate the firewall and reboot it, the secondary also died, and did not come up again! The customer needed to get online again, so there was no time to get a console cable and see what the heck was going on. So I told them to do a hard reboot on both firewalls. After the ASAs booted, they both became active again and could see each other. Great, customer online. But why and how?

Conscia Cisco support confirmed that the exact issue had been hitting multiple customers. Due to a bug, the firmware hit a memory buffer overflow when targeted by a specific udp/500 attack. Great, now we know the problem, and the fix is to upgrade the ASA firmware.

It’s not something I do often, and I always forget to write down the procedure, so here goes.

Upgrade procedure

  1. Have a look at the Cisco ASA upgrade guide to see what version you are on and what version you are supported to go up to. I was on 9.8.2 and could go up to 9.13.x, so I did. https://www.cisco.com/c/en/us/td/docs/security/asa/upgrade/asa-upgrade/planning.html#ID-2152-0000000a
  2. Download and upload firmware to BOTH members of the cluster
  3. Change the boot image to the newly uploaded image
  4. Update the secondary, make a failover
  5. Update the primary and make a failover
  6. Done

Uploading the images to both nodes with TFTP.

I used the portable version of Tftpd64 by Jounin, simple and works out of the box. Copied the freshly downloaded images to both nodes.
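Before uploading, it can be worth checking that the download is intact. A minimal sketch, assuming the download page lists a checksum for the image; Test-FileChecksum is a made-up helper name built on the Get-FileHash cmdlet, and the default algorithm is just a guess at what the page shows.

```powershell
# Hypothetical helper: compare a file's hash with a published checksum.
function Test-FileChecksum {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$Expected,
        [ValidateSet('MD5', 'SHA256', 'SHA512')][string]$Algorithm = 'SHA512'
    )
    $actual = (Get-FileHash -LiteralPath $Path -Algorithm $Algorithm).Hash
    # -eq on strings is case-insensitive, so upper or lower case hex both match
    return ($actual -eq $Expected.Trim())
}

# Example call (not run here); paste the checksum from the download page:
# Test-FileChecksum -Path .\asa9-13-1-lfbff-k8.SPA -Expected '<checksum from the download page>'
```

It returns $true only when the hashes match, so a truncated download is caught before it ends up on disk0: of both nodes.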

### Primary
DS-ESB-ASA5516x# copy /noconfirm tftp://10.0.2.14/asa9-13-1-lfbff-k8.SPA disk0:/asa9-13-1-lfbff-k8.SPA
DS-ESB-ASA5516x# copy /noconfirm tftp://10.0.2.14/asdm-7131.bin disk0:/asdm-7131.bin

### Secondary
DS-ESB-ASA5516x# failover exec mate copy /noconfirm tftp://10.0.2.14/asa9-13-1$
DS-ESB-ASA5516x# failover exec mate copy /noconfirm tftp://10.0.2.14/asdm-7131$

Change config to the new image

So now we will change over the config so that it will use the new boot images that we have uploaded. First, we remove the existing boot image, and afterwards, we set the new image together with the new ASDM image.

### Show current boot image
DS-ESB-ASA5516x# show running-config boot system
boot system disk0:/gf/asa982-20-lfbff-k8.SPA

### Remove existing boot image
DS-ESB-ASA5516x(config)# no boot system disk0:/gf/asa982-20-lfbff-k8.SPA

### Add new boot image that you just uploaded
DS-ESB-ASA5516x(config)# boot system disk0:/asa9-13-1-lfbff-k8.SPA

### Reload the standby node for the new firmware to take effect
DS-ESB-ASA5516x(config)# failover reload-standby

### Look at the output from show failover, check if the standby is up and verify the firmware version.

Failover and reload the second node

Now that the secondary node has booted with the new firmware, it’s time to fail over to it so we can reload the primary and have the new firmware running there too. When doing the failover you might lose the SSH connection; just connect again. This time you will be connected to the secondary node, which is now the active node. Reload the primary, which is now standby, and wait for it to come up. The console will show that it’s sending config to mate, just like when we did the first reload of the standby, secondary node.

### Controlled failover to the secondary (standby) node
DS-ESB-ASA5516x# no failover active
### Reload the primary (now standby) node for the firmware to take effect
DS-ESB-ASA5516x# failover reload-standby

You are done

Now all that’s left is to test and, if you want, fail back to the primary. But that’s up to you. I did not lose one ping through the upgrade process, so the cluster is indeed working as it should. While you are at it, why not also update the AnyConnect client, and remember to clean up the flash so the old versions and files won’t fill it up. Enjoy your newly updated cluster.