VMware CSE – Stuck cluster deployment

After upgrading to CSE 3.1.3 with VCD 10.3.1 I encountered a problem when creating clusters from the Ubuntu 20.04 native cluster template.

The mstr node would be deployed and started, VMware Tools would become ready, and the first script injection would run. Then the VM would suddenly reboot, and the cluster creation would fail because CSE could no longer see the guest process. Sometimes this left a cluster stuck in the “Creation in progress” status, where it could no longer be managed.

22-06-02 10:42:34 | cluster_service_2_x:2811 - _wait_for_tools_ready_callback | DEBUG :: waiting for guest tools, status: vm='vim.VirtualMachine:vm-835608', status=guestToolsNotRunning
22-06-02 10:42:39 | cluster_service_2_x:2811 - _wait_for_tools_ready_callback | DEBUG :: waiting for guest tools, status: vm='vim.VirtualMachine:vm-835608', status=guestToolsRunning
22-06-02 10:42:41 | cluster_service_2_x:2817 - _wait_for_guest_execution_callback | DEBUG :: waiting for process 1706 on vm 'vim.VirtualMachine:vm-835608' to finish (1)
22-06-02 10:42:46 | cluster_service_2_x:2817 - _wait_for_guest_execution_callback | DEBUG :: process [0, <Response [200]>, <Response [200]>] on vm 'vim.VirtualMachine:vm-835608' finished, exit code: 0
22-06-02 10:42:46 | cluster_service_2_x:2869 - _execute_script_in_nodes | DEBUG :: about to execute script on mstr-7e34 (vm='vim.VirtualMachine:vm-835608'), wait=True
22-06-02 10:42:48 | cluster_service_2_x:2817 - _wait_for_guest_execution_callback | DEBUG :: waiting for process 1729 on vm 'vim.VirtualMachine:vm-835608' to finish (1)
22-06-02 10:42:58 | cluster_service_2_x:2896 - _execute_script_in_nodes | ERROR :: Error executing script in node mstr-7e34: process not found (pid=1729) (vm='vim.VirtualMachine:vm-835608')
Traceback (most recent call last):
  File "/opt/vmware/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x.py", line 2879, in _execute_script_in_nodes

I opened an SR with Cloud Director GSS for both the failed deployments and the stuck clusters that could no longer be deleted. Multiple screen-sharing sessions later, there was still no result.

Then I found the GitHub repository for Container Service Extension, where one issue had a very tempting title: Failed deployments using TKGm on VCD. Many people seemed to have the same problem. There was no fix for the deployments, but one commenter had a fix for deleting the stuck clusters.

The workaround

You need to find the ID of the user that owns the cluster. In VCD, the More > Kubernetes Clusters menu shows who the owner is.

Once you know the owner, go to Administration > User > <User>. The URL will then contain the ID of the user.
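The ID in the address bar is a plain UUID, while the API body expects it wrapped as a user URN. As a rough sketch (in Python rather than PowerShell), the snippet below pulls the UUID out of a URL and builds the same JSON body used in this post. The URL shape and the helper names are assumptions for illustration only.

```python
import json
import re

# Matches a standard 8-4-4-4-12 hexadecimal UUID anywhere in a string.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def user_urn_from_url(url: str) -> str:
    """Pull the first UUID out of a user page URL and wrap it as a user URN."""
    match = UUID_RE.search(url)
    if match is None:
        raise ValueError(f"no UUID found in {url!r}")
    return f"urn:vcloud:user:{match.group(0).lower()}"


def access_body(member_urn: str) -> str:
    """Build the JSON body for the access control grant used in this post."""
    return json.dumps(
        {
            "grantType": "MembershipAccessControlGrant",
            "accessLevelId": "urn:vcloud:accessLevel:FullControl",
            "memberId": member_urn,
        },
        indent=2,
    )


# Hypothetical URL shape, only here to have something to parse.
example_url = (
    "https://vcd.example.com/provider/administration/users/"
    "urn:vcloud:user:e96cf9e8-535f-45d8-8a87-b9dac659f85f/general"
)
urn = user_urn_from_url(example_url)
print(access_body(urn))
```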


With the userId you can fill out the body for the next API call.

$vdchost = "vcd.ramsgaard.me"
$apiusername = "svc-cse@system"
$password = 'Ye.........iks12!'

$base64AuthInfo = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(("{0}:{1}" -f $apiusername,$password)))
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12
$auth =Invoke-WebRequest -Uri "https://$vdchost/api/sessions" -Headers @{Accept = "application/*;version=32.0";Authorization="Basic $base64AuthInfo"} -Method Post

$accessBody = '{
    "grantType": "MembershipAccessControlGrant",
    "accessLevelId": "urn:vcloud:accessLevel:FullControl",
    "memberId": "urn:vcloud:user:e96cf9e8-535f-45d8-8a87-b9dac659f85f"
  }' | ConvertFrom-Json

$status = Invoke-RestMethod -Uri "https://$vdchost/cloudapi/1.0.0/entities/urn:vcloud:type:cse:nativeCluster:2.1.0/accessControls" -Headers @{Accept = "application/json;version=36.1";Authorization="Bearer $($auth.Headers.'X-VMWARE-VCLOUD-ACCESS-TOKEN')"} -ContentType "application/json" -Method post -Body ($accessBody | ConvertTo-Json)

When the API call is done you should now be able to delete the stuck cluster.

If you are unlucky enough that the cluster is stuck in a “not resolved” state and deletion through the VCD GUI still fails, you need to use the vcd cse CLI.

### Login to VCD system or tenant organization
vcd login vcd.ramsgaard.me system jr
### Show clusters
vcd cse cluster list
### Force delete the cluster
vcd cse cluster delete tanzu1 --force


The problem occurred in the first place because of a bug in the MQTT bus in VCD 10.3.1, which made the cluster creation fail. VCD 10.3.2 or 10.3.3 fixed the bug. (Of course, the VMware Tanzu Kubernetes Grid version should be used in the future.)

It took some time to find the workaround. I hope CSE will become more fault tolerant in the future so these situations do not appear.

Until then there is a way to get out of the stuck cluster situation.

Disk mapping Windows <-> VMware – Part 2

A couple of years ago I did a post on how to map your Windows disk to the real disk in VMware. This post is an extension of it, with updated commands.

Why do I need to know the mapping? It matters when you stumble upon a VM with many disks attached. If the disks vary in size, you can normally look at the sizes and match them with the disks in VMware, but when all disks have the same size that approach becomes difficult.

Windows serial number:

In Windows, we can retrieve the serial number of the disk we need to expand and then map that serial number to the VMware disk. In newer Windows Server versions it’s fairly easy to find, but on versions older than 2012 you are missing PowerShell cmdlets like Get-Disk. Someone on StackOverflow found a way that works on Windows Server 2008 through 2022.

$DriveLetter = "C:"
Get-CimInstance -ClassName Win32_DiskDrive |
Get-CimAssociatedInstance -Association Win32_DiskDriveToDiskPartition |
Get-CimAssociatedInstance -Association Win32_LogicalDiskToPartition |
Where-Object DeviceId -eq $DriveLetter |
Get-CimAssociatedInstance -Association Win32_LogicalDiskToPartition |
Get-CimAssociatedInstance -Association Win32_DiskDriveToDiskPartition |
Select-Object -Property SerialNumber

VMware disk:

From VMware’s side, it’s straightforward to find the disk and its serial number. Below is a scripted way of finding the disk and then adding the extra capacity.

Connect-VIServer ""

$VMname = ""
$disksn = "6000c295ec128b3d14472bdbf8e65aee"
$vmDisk = (Get-VM $VMname | Get-HardDisk) | Where-Object {$_.ExtensionData.Backing.uuid.Replace("-","") -eq $disksn } 

$ExpandSizeGb = 50
$vmDisk | Set-HardDisk -CapacityGB ($vmDisk.CapacityGB + $ExpandSizeGb) -Confirm:$false 


Instead of having to guess which disk in Windows maps to which VMware disk, you now have a more automated way. The commands for retrieving the disk serial number are compatible with versions up to Windows Server 2022.
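The matching itself boils down to comparing the Windows serial number with the VMware backing UUID after stripping the dashes. Here is a minimal Python sketch of that comparison. The disk names and the first UUID are made-up sample data, and the second UUID is simply the serial number from above written in dashed form.

```python
from typing import Dict, Optional


def normalize_disk_id(value: str) -> str:
    """Strip dashes and whitespace and lower-case, so both sides compare equal."""
    return value.replace("-", "").replace(" ", "").strip().lower()


def find_matching_disk(windows_serial: str, vmware_disks: Dict[str, str]) -> Optional[str]:
    """Return the name of the VMware disk whose backing UUID matches the serial."""
    wanted = normalize_disk_id(windows_serial)
    for name, backing_uuid in vmware_disks.items():
        if normalize_disk_id(backing_uuid) == wanted:
            return name
    return None


# Sample data: disk name -> backing UUID as VMware would show it (hypothetical).
disks = {
    "Hard disk 1": "6000C29a-1111-2222-3333-444455556666",
    "Hard disk 2": "6000C295-ec12-8b3d-1447-2bdbf8e65aee",
}

match = find_matching_disk("6000c295ec128b3d14472bdbf8e65aee", disks)
print(match)
```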

Manual mount VMFS datastore

Have a datastore that shows Not Consumed? From time to time I stumble across them, and from what I have found there is really only one way around it: manually mounting the datastore from the shell of the ESXi host.

Not sure what the root cause for it is, but if you know, then please let me know 🙂


What we need to do is list the UUIDs of the VMFS volumes on the block devices and then mount the datastore with that UUID.

### List all available datastores that are not mounted
esxcfg-volume -l
### Mount a specific datastore persistently with the UUID found with -l
esxcfg-volume -M <UUID>


After mounting with esxcfg-volume -M, the datastore should stay mounted permanently. Hope it works for you too.

Change VM MoRef in VBR database

This information can be found in many other places on the big internet, but since I can never find it myself, I am writing down the procedure here.

When you switch ESXi host or vCenter, or remove a VM from inventory and add it again, the VM will get a new ID. In the world of VMware, it’s called the MoRef ID.

When this happens Veeam will lose its coupling to the VM and backup will fail with:
– Virtual Machine <> is unavailable and will be skipped from processing.
– Nothing to process. All machines were excluded from task list.

How to verify there is a MoRef mismatch:

From a VMware perspective it’s easy:

connect-viserver <vcenter> -Credential $cred
Get-VM | select name, id

This will give you something like:

PS C:\Windows\system32> Get-VM | select name, id
Name Id
---- --

From the Veeam perspective it’s a bit harder, since you need to query the MS SQL database that Veeam uses. So download SQL Server Management Studio (SSMS) from Microsoft.

Open SSMS as administrator on the server to gain access to the Veeam database. You can use the following query to find the MoRef that is in the Veeam database:

SELECT [dbo].[BObjects].id, [dbo].[BObjects].object_id, [dbo].[BObjects].host_id, [dbo].[BObjectsSensitiveInfo].object_name, [dbo].[BObjectsSensitiveInfo].path
FROM [dbo].[BObjects]
INNER JOIN [dbo].[BObjectsSensitiveInfo] ON [dbo].[BObjectsSensitiveInfo].bObject_id=[dbo].[BObjects].id  
WHERE object_name = '<vmname>'


So we can now see that the VM in VMware has MoRef “vm-71326”. But Veeam database has “vm-992”. From here on you know what’s wrong and you need to open a Veeam support case to get the supported procedure.
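If you have many VMs to check, the comparison can be scripted once both lists are exported. Below is a small Python sketch under a few assumptions: the VM names are made up, the two dictionaries stand in for the Get-VM output and the SQL query result, and the helper names are hypothetical. PowerCLI prints the ID with a "VirtualMachine-" prefix, so that is stripped before comparing.

```python
from typing import Dict, Tuple


def short_moref(powercli_id: str) -> str:
    """Turn a PowerCLI Id like 'VirtualMachine-vm-71326' into 'vm-71326'."""
    prefix = "VirtualMachine-"
    if powercli_id.startswith(prefix):
        return powercli_id[len(prefix):]
    return powercli_id


def find_moref_mismatches(
    vcenter: Dict[str, str], veeam: Dict[str, str]
) -> Dict[str, Tuple[str, str]]:
    """Return {vm_name: (veeam_moref, vcenter_moref)} for VMs whose IDs differ."""
    mismatches = {}
    for name, veeam_id in veeam.items():
        powercli_id = vcenter.get(name)
        if powercli_id is None:
            continue  # VM is not in vCenter at all, which is a different problem
        vcenter_id = short_moref(powercli_id)
        if vcenter_id != veeam_id:
            mismatches[name] = (veeam_id, vcenter_id)
    return mismatches


# Sample data: names are hypothetical, the IDs for web01 mirror the example above.
vcenter_vms = {"web01": "VirtualMachine-vm-71326", "db01": "VirtualMachine-vm-1044"}
veeam_objects = {"web01": "vm-992", "db01": "vm-1044"}

result = find_moref_mismatches(vcenter_vms, veeam_objects)
print(result)
```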

If you don’t care about supported procedures, you can update the database with the VM’s new MoRef ID, and your VBR job should run again. The SQL query would look like this:

UPDATE [dbo].[BObjects]
SET [object_id] = 'new-id'
WHERE [object_id] = 'old-id'


It’s not that hard to change the MoRef in the VBR database. But remember, if you care about having a supported installation, you need to create a Veeam support case and have them help you. Something could have changed in the VBR database schema since this post.

Shrink VMDK disk

I have always thought that a VMDK could only grow, so that has been my default response to colleagues when they expanded a disk too much. Sure, a storage vMotion could reclaim unused space in a thin disk, but the “down arrow” for storage capacity would never work. But then someone mentioned that he had shrunk disks a couple of times, so I decided to investigate.

The official VMware KB isn’t much help, and some people are discussing it on StackOverflow. But then I found an older post from 2016 that seems to have found the approach, so that’s what we are going to test out.


This is not supported in any way, so use it at your own risk. If you want a supported solution, then VMware Converter in a V2V manner is pretty much the only way. If you still want to try out the method, be sure to have a valid backup! And by backup, I don’t mean a VMware snapshot.

Not supported:

From the VMware documentation, it seems shrinking disk is not allowed under the following circumstances:

  • The virtual machine is hosted on an ESX/ESXi server. ESX/ESXi Server can shrink the size of a virtual disk only when a virtual machine is exported. The space occupied by the virtual disk on the ESX/ESXi server, however, does not change.
  • The virtual machine has a Mac guest operating system.
  • You preallocated all the disk space to the virtual disk when you created it.
  • The virtual machine contains a snapshot.
  • The virtual machine is a linked clone or the parent of a linked clone.
  • The virtual disk is an independent disk in nonpersistent mode.
  • The file system is a journaling file system, such as an ext4, xfs, or jfs file system.

The test scenario:

I have a Windows 2019 VM. Here is the process I want to try out:

  1. Expand the VMDK disk in vCenter
  2. Extend the disk in the VM guest using diskpart
  3. Shrink the disk in the VM guest using diskpart
  4. Calculate the new sector count
  5. Edit the VM’s *.vmdk with the newly calculated sector count
  6. Storage migrate to another datastore
  7. Check if the VM is still OK


We start off with the VM. It’s Windows 2019, and the original size is 40GB.
The disk is now extended by 5GB.
Viewed from the ESXi host, the disk also shows 45GB.
Inside “win2019.vmdk” we can see the “extent description”. This is the number we have to change after the guest OS filesystem has been shrunk.
Here we see the disk has been extended to 45GB and then shrunk by 10GB.

Calculating the “extent description”:

So there is now 10GB free space we can shrink the VMDK with.

A virtual disk described as monolithic and flat consists of two files. One file contains the descriptor. The other file is the extent used to store virtual machine data.

Consider our existing extent:
RW 94371840 VMFS "win2019-flat.vmdk"
This means that the file win2019-flat.vmdk is 94371840 sectors × 512 bytes/sector = 48318382080 bytes, or 45GB, in size.

Let’s calculate the new value, converting from GB to sectors.

36GB × 1024 (MB) × 1024 (KB) × 1024 (bytes) / 512 bytes per sector = 75,497,472 sectors
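The arithmetic is easy to get wrong by a factor of 1024, so here is a small Python sketch that does the conversion in both directions. It assumes 512-byte sectors and 1GB = 1024³ bytes, as in the calculation above; the function names are only for illustration.

```python
SECTOR_BYTES = 512      # sector size used in the VMDK extent description
GIB = 1024 ** 3         # bytes per GB as used in this post


def gib_to_sectors(size_gib: int) -> int:
    """Number of 512-byte sectors in a disk of the given size."""
    return size_gib * GIB // SECTOR_BYTES


def sectors_to_gib(sectors: int) -> float:
    """Disk size in GB for a given sector count from the descriptor."""
    return sectors * SECTOR_BYTES / GIB


print(sectors_to_gib(94371840))  # current extent: 45.0
print(gib_to_sectors(36))        # new extent value: 75497472
```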

Before proceeding, we need to power off the VM. The .vmdk descriptor is loaded into memory while the VM runs, so even if we edit it now and start a storage vMotion, our changed value will just be changed back.

Letting vMotion do its magic
After the VM boots, the disk is now shrunk, and we still have a working guest OS.


It worked. We were able to add more space to the VM, then extend and shrink the guest OS filesystem. We then calculated the number of sectors for the .vmdk file, and storage vMotion did its magic and made the VMDK smaller in physical size.

I have also used this in a couple of real-life scenarios where people had added 4TB too much. Then it’s sometimes easier to shrink than having to move files around.

VCD – Find free external IPs

Finding free public IPs in Cloud Director backed by NSX-V is not as easy as it should be. Some people will tell you to ping the scope and see what responds. But pinging is not a reliable way of finding free IPs, since not every device responds to ICMP messages.

Somewhere along the line, I found a guy on the VMware forum posting a script for finding available IPs in Cloud Director using the PowerCLI module for querying VCD and getting back IPs that are not allocated by an Edge. I have been using the script quite a bit since. His blog is not available today, but the code is still on the forum.

Now it’s also available here on the site with a bit more explanation on how to connect and use the function. I have been using it with NSX-V as the backend and haven’t tried it with NSX-T as the network backend yet.

 ### Install PowerCLI module
Install-Module -Name VMware-vCD-Module

### Import PowerCLI Cloud module
Import-Module -Name VMware.VimAutomation.Cloud

### Connect to Cloud Director instance with your credentials
Connect-CIServer -server <VCD_URL>

Function Get-FreeExtIPAddress([String]$extnetName){
    # Helper: dotted IPv4 string -> 64-bit integer
    function Convertto-IPINT64 () {
        param ($ip)
        $octets = $ip.split(".")
        return [int64]([int64]$octets[0]*16777216 +[int64]$octets[1]*65536 +[int64]$octets[2]*256 +[int64]$octets[3])
    }
    # Helper: 64-bit integer -> dotted IPv4 string
    function Convertto-INT64IP() {
        param ([int64]$int)
        return (([math]::truncate($int/16777216)).tostring()+"."+([math]::truncate(($int%16777216)/65536)).tostring()+"."+([math]::truncate(($int%65536)/256)).tostring()+"."+([math]::truncate($int%256)).tostring() )
    }
    $extnet = Get-ExternalNetwork -name $extnetName
    $ExtNetView = $Extnet | Get-CIView
    # Expand every edge gateway sub-allocation range into single IPs
    $allocatedGatewayIPs = $extnetView.Configuration.IpScopes.IpScope[0].SubAllocations.SubAllocation.IpRanges.IpRange | ForEach-Object {
        $startaddr = Convertto-IPINT64 -ip $_.StartAddress
        $endaddr = Convertto-IPINT64 -ip $_.EndAddress
        for ($i = $startaddr; $i -le $endaddr; $i++) {
            Convertto-INT64IP -int $i
        }
    }
    [int]$ThirdStartingIP = [System.Convert]::ToInt32($extnet.StaticIPPool[0].FirstAddress.IPAddressToString.Split(".")[2],10)
    [int]$ThirdEndingIP = [System.Convert]::ToInt32($extnet.StaticIPPool[0].LastAddress.IPAddressToString.Split(".")[2],10)
    [int]$FourthStartingIP = [System.Convert]::ToInt32($extnet.StaticIPPool[0].FirstAddress.IPAddressToString.Split(".")[3],10)
    [int]$FourthEndingIP = [System.Convert]::ToInt32($extnet.StaticIPPool[0].LastAddress.IPAddressToString.Split(".")[3],10)
    $octet = $extnet.StaticIPPool[0].FirstAddress.IPAddressToString.split(".")
    $3Octet = ($octet[0]+"."+$octet[1]+"."+$octet[2])
    $2Octet = ($octet[0]+"."+$octet[1])
    # Build the list of every IP in the first static pool
    $ips = @()
    if ($ThirdStartingIP -eq $ThirdEndingIP) {
        $ips = $FourthStartingIP..$FourthEndingIP | % {$3Octet+'.'+$_}
    } else {
        do {
            for ($i=$FourthStartingIP; $i -le 255; $i++) {
                $ips += ($2Octet + "." + $ThirdStartingIP + "." + $i)
            }
            $ThirdStartingIP=$ThirdStartingIP + 1
            # Only the first /24 starts at the pool's first address, later ones start at .0
            $FourthStartingIP = 0
        } while ($ThirdEndingIP -ne $ThirdStartingIP)
        for ($i=0;$i -le $FourthEndingIP; $i++) {
            $ips += ($2Octet + "." + $ThirdStartingIP + "." + $i)
        }
    }
    $allocatedIPs = $ExtNetView.Configuration.IpScopes.IpScope[0].AllocatedIpAddresses.IpAddress
    # Keep only pool IPs that are neither sub-allocated to an edge nor otherwise allocated
    $ips = @($ips | Where-Object { $_ -notin $allocatedGatewayIPs -and $_ -notin $allocatedIPs })
    return $Ips
}

### Find the names of external networks
Get-ExternalNetwork | Select-Object Name

### Find free IPs using function from above
Get-FreeExtIPAddress -extnetName <vcd-net>

You can help yourself by copying and pasting the code snippet into either PowerShell ISE or Visual Studio Code. Since you need to install a module, you need to run it with elevated rights. If you get a red error message when importing the module, it’s probably because of the execution policy, and you then need to run the command below. It allows remote signed scripts to be executed.

Set-ExecutionPolicy RemoteSigned

Getting the names of the external networks with “get-externalnetwork”
Using the function to find available IPs in the selected external network
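The idea behind the function is simple set arithmetic: expand the static pool into single addresses, expand the edge sub-allocations the same way, and subtract everything that is in use. Here is a rough Python sketch of that logic using the standard ipaddress module. The ranges are documentation addresses picked for illustration and are not taken from a real external network.

```python
import ipaddress
from typing import Iterable, List, Tuple


def expand_range(first: str, last: str) -> List[str]:
    """Expand an inclusive first..last IPv4 range into single addresses."""
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    return [str(ipaddress.IPv4Address(n)) for n in range(start, end + 1)]


def free_ips(
    pool: Tuple[str, str],
    sub_allocations: Iterable[Tuple[str, str]],
    allocated: Iterable[str],
) -> List[str]:
    """Pool addresses that are neither sub-allocated to an edge nor allocated."""
    used = set(allocated)
    for first, last in sub_allocations:
        used.update(expand_range(first, last))
    return [ip for ip in expand_range(*pool) if ip not in used]


# Illustrative data: a pool that crosses a /24 boundary.
pool = ("203.0.113.250", "203.0.114.3")
sub_allocations = [("203.0.113.252", "203.0.113.253")]
allocated = ["203.0.114.1"]

available = free_ips(pool, sub_allocations, allocated)
print(available)
```

Working on integers instead of octets avoids the special handling for pools that span more than one /24.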

NSX API – DLR L2 bridging

Here is a script for mass DLR L2 bridge creation. I had to bridge a couple of hundred VLANs to VXLANs, and while it might have been faster to create them by hand, I would not have learned anything.

The script reads from a CSV file where I have all my info. It then loops through the entries, creates a distributed port group for each, and initiates an L2 bridge. The VXLANs had been created prior to this operation.

$csv = Import-Csv "D:\temp\VLAN.csv" -Delimiter ";"
Import-Module PowerNSX
Get-Module -Name vmware* -ListAvailable | Import-Module

$cred = Get-Credential
Connect-VIServer -Server <vcenter> -Credential $cred

# Note: the create-nsxl2bridge function must be loaded before this loop runs
foreach ($net in $csv) {
    $vdportgroup = ("zitmit-$($net.acl)").ToLower()

    $exists = Get-VDSwitch -Name "DSMpls01-EX" | Get-VDPortgroup -Name $vdportgroup -ErrorAction SilentlyContinue
    if (!$exists) {
        Get-VDSwitch -Name "DSMpls01-EX" | New-VDPortgroup -Name $vdportgroup -VLanId $net.mitvlan -NumPorts 2
        # Read the new port group back to confirm it was created
        $created = Get-VDSwitch -Name "DSMpls01-EX" | Get-VDPortgroup -Name $vdportgroup
        if ($created) {
            Write-Host -ForegroundColor Green "Portgroup created: $vdportgroup"

            $vdportgroupId = ($created.Id).Replace("DistributedVirtualPortgroup-","")
            $vdportgroupName = $created.Name

            create-nsxl2bridge -aclname $($net.acl) -dvportGroup $($created.key)
        }
    }
    else {
        Write-Host -ForegroundColor Yellow "Portgroup has already been created: $vdportgroup"
        #Get-VDSwitch -Name "DSMpls01-EX" | New-VDPortgroup -Name $vdportgroup -VLanId $net.mitvlan -NumPorts 2
    }
}

Function create-nsxl2bridge {
    param (
        [string]$aclname,
        [string]$dvportGroup
    )

    # Login info (left blank in the original post, fill in your own)
    $nsxUsername = '<nsx-username>'
    $nsxPassword = '<nsx-password>'

    # Allow all SSL protocols
    $AllProtocols = [System.Net.SecurityProtocolType]'Ssl3,Tls,Tls11,Tls12' 
    [System.Net.ServicePointManager]::SecurityProtocol = $AllProtocols

    # Connect to NSX manager (the server parameter was omitted in the original post)
    $connection = Connect-NsxServer -Username $nsxUsername -Password $nsxPassword -WarningAction SilentlyContinue
    $virtualwire = Get-NsxLogicalSwitch | Where-Object { $_.name -match "$aclname" -and $_.name -notmatch "lan" }

    if ($virtualwire.count -gt 1) {
        # More than one logical switch matched, log it and use the first one
        $message = "Something could be wrong - $aclname"
        write-host $message -ForegroundColor yellow
        $message | Out-File C:\log\create-nsxl2bridge.txt -Append
        $virtualwire = $virtualwire[0]
    }
    elseif (!$virtualwire) {
        $message = "virtualwire was not found: $($virtualwire.objectId) - acl: $aclname"
        write-host $message -ForegroundColor yellow
        $message | Out-File C:\log\create-nsxl2bridge.txt -Append
        return
    }

    # Edge info
    $edgeId = "edge-1120"
    $Type = "application/xml"
    $Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword)) }
    # The NSX Manager base URL was left out of the original post, prepend your own
    $nsxUri = "$edgeId/bridging/config"

    # Getting edge config
    $currentL2Config = $null
    $currentL2Config = Invoke-RestMethod -Uri $nsxUri -Headers $Header -Method GET -ContentType $Type

    # Check if already there
    foreach ($z in $currentL2Config.SelectNodes("//name")) {
        if ($z.'#text' -match $aclname ) {
            write-host "Already exists: $aclname" -ForegroundColor yellow
            return
        }
    }

    # Add extra xml node to currentconfig
    $handler1 = $null
    $handler1 = $currentL2Config.CreateNode('element', "bridge", '')
    $attr = $currentL2Config.CreateNode('element', "bridgeId", '')
    $attr.InnerText = "$null";
    $handler1.AppendChild($attr) | Out-Null
    $attr = $currentL2Config.CreateNode('element', "name", '')
    $attr.InnerText = "$aclname";
    $handler1.AppendChild($attr) | Out-Null
    $attr = $currentL2Config.CreateNode('element', "virtualWire", '')
    $attr.InnerText = "$($virtualwire.objectId)";
    $handler1.AppendChild($attr) | Out-Null
    $attr = $currentL2Config.CreateNode('element', "dvportGroup", '')
    $attr.InnerText = "$dvportGroup";
    $handler1.AppendChild($attr) | Out-Null

    # Remove nodes from existing XML
    $currentL2Config.SelectNodes("//virtualWireName") | ForEach-Object { $_.ParentNode.RemoveChild($_) }
    $currentL2Config.SelectNodes("//isSharedNetwork") | ForEach-Object { $_.ParentNode.RemoveChild($_) }
    $currentL2Config.SelectNodes("//dvportGroupName") | ForEach-Object { $_.ParentNode.RemoveChild($_) }

    # Add the newly created node to existing XML
    $currentL2Config.DocumentElement.AppendChild($handler1) | Out-Null

    # PUT edge config
    $respons = Invoke-RestMethod -Uri $nsxUri -Headers $Header -Method PUT -ContentType 'application/xml' -Body $currentL2Config.OuterXml
    write-host "L2 Created: $($virtualwire.objectId) - acl: $aclname" -ForegroundColor Green
}
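The core of the function is the XML handling: fetch the current bridging config, check whether the bridge name is already there, append one more bridge element, and send the whole document back. The Python sketch below shows only that XML step, with no NSX calls. The sample document shape (a bridges root with bridge children) and the object IDs are assumptions for illustration, and the check here is an exact name match rather than the -match used in the script.

```python
import xml.etree.ElementTree as ET


def add_bridge(config_xml: str, name: str, virtual_wire: str, dvportgroup: str) -> str:
    """Return the config with one more <bridge>, or unchanged if the name exists."""
    root = ET.fromstring(config_xml)
    existing_names = [node.text for node in root.iter("name")]
    if name in existing_names:
        return config_xml  # bridge already present, nothing to add
    bridge = ET.SubElement(root, "bridge")
    ET.SubElement(bridge, "name").text = name
    ET.SubElement(bridge, "virtualWire").text = virtual_wire
    ET.SubElement(bridge, "dvportGroup").text = dvportgroup
    return ET.tostring(root, encoding="unicode")


# Assumed shape of an existing bridging config, with made-up IDs.
sample = (
    "<bridges><enabled>true</enabled>"
    "<bridge><bridgeId>1</bridgeId><name>acl-10001</name>"
    "<virtualWire>virtualwire-10</virtualWire>"
    "<dvportGroup>dvportgroup-100</dvportGroup></bridge>"
    "</bridges>"
)

updated = add_bridge(sample, "acl-10344", "virtualwire-42", "dvportgroup-4711")
print(updated)
```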

VCD – remove org and its items

Having a nice vRO job to create VCD tenants with their VDCs, edges, and networks is nice. But when cleaning up after testing, it’s a pain to click through the GUI to first remove networks, then edges, then disable the VDC, delete it, and finally delete the org. A bit of PowerShell fu can help with the task. This is a quick and dirty set of commands, but it works as intended.

get-module VMware.VimAutomation.Cloud | Import-Module

$ciCred = Get-Credential
Connect-CIServer -Server vcd.domain.tld -Credential $ciCred

$org = get-org deletemeorg
$orgvdc = $org | get-orgvdc
### Remove Org networks
$orgvdc | Get-OrgvdcNetwork | Remove-OrgVdcNetwork 
### Remove edges
($orgvdc | get-edgegateway | Get-CIView).delete() 
### Remove VDC
$orgvdc | Set-OrgVdc -Enabled $false | remove-orgvdc
### Remove Org
$org | set-org -Enabled $false | remove-org

Cloud Director 10.1 released

Been using vCloud since version 5.1. After a brief love affair with something called “Azure Pack” we put all our focus into vCloud.

8.20 was the first sign of heartbeats coming from VCD. We got confirmation that vCloud was for sure the platform that we were and had been looking for. Now we see 10.1 released, and from my point of view it’s a big one: many things change, in the GUI as well as in the infrastructure. This release is also the final farewell to the old Flex GUI.

First off we have to address the naming. I always liked the vCloud term; for me it was a strong brand. So it’s a bit sad to see that go, and now we have to get used to Cloud Director instead. Thankfully we can still use the acronym VCD for VMware Cloud Director. #LongLiveVCD.

In the next few points, I will address some of the major things within this release.


We use a lot of the functionality of the VCD APIs. As the development of VCD shifts into a higher gear, so does the deprecation of older API versions. For a small service provider, it’s always hard to revisit automation that already works with the existing APIs. When moving to 10.1 we have to go through a couple of workflows to update them to use the new 34.0 API. But on the other hand, it’s also a good chance to refactor and optimize.

  • VMware Cloud Director API version 29 and below are not supported.
  • VMware Cloud Director API version 30.0 is deprecated and will become unsupported after VMware Cloud Director 10.1
  • VMware Cloud Director API version 31.0 is deprecated.

NSX-T feature improvements

More of the core NSX-T features are now available through VCD.

  • IPSec VPN
  • Dedicated External Network
  • BGP and Route Advertisement

We have been watching from the sidelines for some time, waiting for NSX-T development to reach an acceptable level. NSX-V is still doing a good job. As someone who is right now standing up a new 16-node VMware cluster as a new provider VDC, I wish it had been 6 months later, so that all NSX-T functionality was ready and we could hopefully use NSX-T exclusively.

But we have to look into maybe having two 8-node clusters, one for NSX-V and one for NSX-T, so we can already start to transition to NSX-T…

But the good thing about being a VMware customer is that you are not left in the dust. A migration tool for NSX-V > NSX-T had already been created, NSX-T Data Center Migration Coordinator, but it had no integration with VCD. Which brings me to the next point!

NSX-V to NSX-T VCD Migration Tool

This is a way of helping us transition from NSX-V to NSX-T, as NSX-V is approaching its end of support in January 2021.

Before, we could create a new provider VDC backed by NSX-T and then move workloads over to the new cluster, switching to NSX-T functionality at that time, but it was all a manual process.

There is now an automated way to do it, which is VCD aware. The approach requires a new cluster, since NSX-V and NSX-T can’t coexist in the same cluster. The What’s New in 10.1 states that the workflow will help with the following:

  • Automates migration of vCD metadata and workloads from NSX-V to NSX-T
  • Migrate per Org VDC migration to reduce maintenance window to single tenant
  • Minimize network downtime with bridged networks during migration
  • Live migrate with vMotion to ensure non-disruption to user workloads
  • Keep source VDC configuration and environment as-is to allow rollback
Before live migration
After live migration

Tomas did a good discussion on this subject

SSL and Certificate Management

This seems like something to read up on carefully. In short, VCD does not trust endpoint certificates unless they have been imported to the trust store.

There is a tool that helps with the import, trust-infra-certs, which automatically connects to the endpoint, then grabs and imports the certificate. If this is not done successfully, you will not be able to talk to those endpoints after upgrading to VCD 10.1.

App Launchpad

A new feature to help introduce a marketplace with the help of the content from Bitnami. From there we can now offer customers to easily find, deploy and manage new workloads. Not just as VMs but also as containers.

Daniel did an excellent write up on this subject.


There is still a lot more in this release to talk about: CSE 2.6, OSE 1.5, the Terraform 2.7 provider, etc. Read more in the official release notes.

I should probably have written a disclaimer about the length of this post and the lack of interesting pictures. I will try to improve next time.

I love to see VCD take flight. We are looking forward to being part of the future journey, where things like Bitnami and App Launchpad, together with more NSX-T functionality and a whole lot of other features, help us Cloud Providers help other businesses with their digital transformation.

Big shout out to VMware and the VCD team!

vCloud SAML authentication

vCloud has LDAP, SAML, and local users as options for tenant authentication. In this post, we look into SAML integration with AzureAD.

The cool thing about AzureAD is that you will gain the MFA option out of the box, and when tenants want access we can also invite them from their own AzureAD tenant into the resource AzureAD tenant. This gives flexibility and overview of who has access.

ADFS is also an option, but there you need to keep your own infrastructure with a resource AD/ADFS, and you furthermore need a third-party MFA solution.


  • Setup Enterprise app in desired resource AzureAD
  • Setup claims
  • Set federation entity id for tenant
  • Import vCloud federation metadata to AzureAD
  • Import AzureAD enterprise app federation metadata to vCloud
  • Setup allowed users/groups in vCloud


Let’s get started with Azure AD configuration. Login to your AzureAD portal https://portal.azure.com. Navigate to “Azure Active Directory” > “Enterprise App” and press “New Application”. Choose “Non-gallery application”. Give it the name “vCloud SAML test” and press “Add”. This will take a couple of minutes.

Navigate back to “Enterprise Apps” > “All applications” and choose your newly created App.

For test purpose, add/assign a test user to the app. This is under “Users and groups”. This user will be able to login to the enterprise app with AzureAD.

Now go to “Single Sign-on”. This will ask for the sign-on method, and here we choose “SAML”. This takes us to the SAML setup. The first thing to do is import the metadata from vCloud.

You will find the metadata by logging in to vCloud and going to the tenant’s “administration” > “federation” tab. Enter the URL for the tenant as the entity ID, apply, and afterwards download the metadata from the link.

In AzureAD, press “Upload Metadata” and choose the downloaded file from vCloud. This tells AzureAD where to redirect to and which requests to accept.

vCloud can validate a couple of user/group parameters (see the VMware documentation), so we will add some claims to Azure AD.

Now we will need to download the AzureAD metadata and import into vCloud. Fetch the data by pressing “Download” to the “Federation Metadata XML”.

Head over to the vCloud tenant federation page again. Paste in the content from the downloaded metadata file, check “Use SAML identity”, and apply. Now we are almost ready to try it out. But first, head over to the “Users” tab in vCloud. We need to add the users, and their roles, who are allowed to access vCloud.

Here we put in the mail address and role of the user from Azure AD. When the SAML response returns to vCloud, vCloud can see that the user has been authenticated in Azure AD and that the user is an Org admin.

The next step would be to use groups and roles, so that we can put users into groups in Azure AD and manage access for the tenant that way. But for now, we can head to the tenant URL. We will be redirected to the Azure AD login page, where we log in and accept the MFA prompt, and are then redirected to our vCloud tenant.

And voila, we have logged into our vCloud tenant with Azure AD.


When I first started this project I was using a GUID as a vCloud entity id. That meant that I could get it to work with ADFS but not AzureAD. I went full mole on the troubleshooting.

In the end, I intercepted the SAML responses. These are encoded in base64, which is an easy task to decode. Afterwards, I had the XML that ADFS and AzureAD each send back. I could then compare them, and I saw some <ds> tags for the cert that weren’t in the response from AzureAD. Unfortunately, that was a red herring and meant nothing.

By tailing the log from vCloud, tail -f /opt/vmware/vcloud-director/logs/vcloud-container-debug.log, I could get some hints when the SAML auth failed.

org.opensaml.common.SAMLException: Local entity is not the intended audience of the assertion in at least one AudienceRestriction

After a bit more googling I found out that I should be looking at the <Audience> tag in the two SAML responses. And yes, that made some sense.

Azure AD sets the value of this element to the value of Issuer element of the AuthnRequest that initiated the sign-on. To evaluate the Audience value, use the value of the App ID URI that was specified during application registration.
Like the Issuer value, the Audience value must exactly match one of the service principal names that represents the cloud service in Azure AD. However, if the value of the Issuer element is not a URI value, the Audience value in the response is the Issuer value prefixed with spn:.


And that was the problem: Azure AD adds the spn: prefix when the entity ID is not a URL, so the audience no longer matched what vCloud expected. Changing the entity ID to the URL made it work.
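The rule quoted above can be written down as a small Python sketch. It is an illustration under assumptions, not Azure AD's actual implementation: "is a URI" is approximated as "has a scheme such as https:", and the function names, example URL and example GUID are made up.

```python
from urllib.parse import urlparse


def expected_audience(entity_id: str) -> str:
    """Audience value Azure AD is expected to return for a given Issuer."""
    # Assumption: a value counts as a URI when it has a scheme (https:, urn:, ...).
    if urlparse(entity_id).scheme:
        return entity_id
    return "spn:" + entity_id


def audience_matches(entity_id: str) -> bool:
    """True when the returned audience equals the entity ID the service provider expects."""
    return expected_audience(entity_id) == entity_id


if __name__ == "__main__":
    for entity_id in ("https://vcloud.example.com/tenant/acme",
                      "3f2504e0-4f89-11d3-9a0c-0305e82c3301"):
        print(entity_id, "->", expected_audience(entity_id), audience_matches(entity_id))
```

With a URL as entity ID the audience comes back unchanged and the check passes. With a GUID it comes back prefixed with spn:, which is the mismatch behind the AudienceRestriction error in the log.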

Maybe this is obvious to the world, but I didn’t know it. I’m glad my troubleshooting skills were sufficient 🙂

Install and use MegaCLI on VMware host

Over the last decade, I have had the fun of having to manage an LSI-based RAID controller. Never on Windows machines, where the GUI-based Storage Manager tools are simple to work with.

Even though I usually find the vib and get it installed I always struggle to remember how it’s installed and what the commands are. This time I will write it down for the future me, or you?


  • Find the MegaCLI vib file and download it…
  • Copy vib to ESXi host
  • Install vib
  • Use MegaCLI for whatever purpose you got

Finding the vib

This is where I struggle the most. LSI was bought by Avago, and soon after Avago acquired Broadcom and took over its name. So the support links for the downloads now return 404, and using Broadcom’s support site requires a degree that I do not have. This time the link was this, giving you a zip file containing the MegaCLI package for all platforms.

If the link no longer works next time, or a newer version is out, I also managed to find it on https://www.broadcom.com/support/download-search. Do a keyword search for MegaCLI, expand “management software and tools” in the results and choose the newest “MegaCLI x.x Px”. For now it’s MegaCLI 5.5 P1, version 8.07.07.

Install MegaCLI

We now have the zip. Extract it, and under the “VmwareMN” folder you will find the vib that we are going to need.

### SCP it to the host
jr@mbp:~ jr$ scp /Users/jr/Download/8-07-07_MegaCLI/VmwareMN/vmware-esx-MegaCLI-8-07-07.vib root@[ESXHOST]:/tmp/

### SSH to the ESXi host and install. Reboot afterwards
[root@esxhost:~] esxcli software vib install -v /tmp/vmware-esx-MegaCLI-8-07-07.vib

If you are lucky and get a “Could not find a trusted signer” error when trying to install the vib, the workaround is to add “--no-sig-check” at the end of the esxcli command, after the file path. Since I downloaded it from Broadcom’s own site, I trust it.

After the host reboot (which is very annoying, but necessary), we can now find the MegaCLI binary under /opt/lsi/MegaCLI/

Useful MegaCLI commands

### Enclosure information
/opt/lsi/MegaCLI/MegaCli -EncInfo -aALL

### Virtual drive information
/opt/lsi/MegaCLI/MegaCli -LDInfo -Lall -aALL

### Physical drive information
/opt/lsi/MegaCLI/MegaCli -PDList -aALL

### Silence active alarm
/opt/lsi/MegaCLI/MegaCli -AdpSetProp AlarmSilence -aALL

### Disable alarm
/opt/lsi/MegaCLI/MegaCli -AdpSetProp AlarmDsbl -aALL

### Enable alarm
/opt/lsi/MegaCLI/MegaCli -AdpSetProp AlarmEnbl -aALL

### Prepare for removal
/opt/lsi/MegaCLI/MegaCli -PdPrpRmv -PhysDrv [E:S] -aN

### Unconfigured Bad to good
/opt/lsi/MegaCLI/MegaCli -PDMakeGood -PhysDrv[E:S] -aN

I found a guy who did some more advanced MegaCLI scripting. It’s a bit old but still very useful. You can find the site here. I have done some copy-pasting from the script, but all credit goes to the guy behind the link.

### List disk status
/opt/lsi/MegaCLI/MegaCli -PDlist -aALL -NoLog | egrep 'Slot|state' | awk '/Slot/{if (x)print x;x="";}{x=(!x)?$0:x" -"$0;}END{print x;}' | sed 's/Firmware state
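The idea behind that one-liner, pairing each slot with its firmware state, can also be sketched in Python. This is a hedged sketch and not a port of the original script: the function name is made up, the sample text is a trimmed illustration of the "Slot Number" and "Firmware state" lines in the -PDList output, and it assumes a single enclosure so that slot numbers are unique.

```python
def disk_states(pdlist_output: str) -> dict:
    """Map each slot number to its firmware state."""
    states = {}
    slot = None
    for line in pdlist_output.splitlines():
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key == "slot number":
            slot = int(value.strip())
        elif key == "firmware state" and slot is not None:
            states[slot] = value.strip()
    return states


if __name__ == "__main__":
    # Illustrative sample, trimmed to the fields of interest.
    sample = """\
Enclosure Device ID: 252
Slot Number: 0
Firmware state: Online, Spun Up
Enclosure Device ID: 252
Slot Number: 1
Firmware state: Unconfigured(bad)
"""
    for slot, state in disk_states(sample).items():
        print(f"Slot {slot}: {state}")
```

The output of the MegaCli command would be saved to a file or piped in, and any slot whose state is not "Online" is a candidate for a closer look.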


CLI is awesome, so many possibilities and so flexible. In my opinion it’s a bit hard to find, but once you have it installed it’s easy. I have tested this on ESXi 6.7 and it worked as it should. I hope you can use some of it.

Disk mapping Windows/VMware

Since I’m working in a datacenter department at a service provider, automation is a big thing. We have lots of different automatic workflows already, everything from reading out power usage for co-location customers to creating a fully functional virtual datacenter with VMware vCloud Director.

The latest idea was to create an automatic disk expansion service. We monitor the customers’ environments with PRTG and call to help them with an expansion when more disk space is needed. But that’s only within business hours, and if our service desk is busy we don’t always make the expansion in a timely fashion. For an Exchange server this is bad: a full disk means no mail flow.

Our backend developer (super skilled guy) extended the service agent that we run on all customer servers with a new data collector that looks for free space and disk identifiers. If a disk is running full, the agent creates a RabbitMQ ticket that triggers a vRealize Orchestrator workflow, which finds the disk and expands it. The workflow then reports back to the agent so that it can expand the disk from within Windows.

Identifying Windows disk from VMware environment

Our Google-fu was giving the same result over and over again: we should look at the SCSI ID. From within Windows, you can get the LUN ID and which controller the disk is located on. That position should then be the same as seen from the VMware side.

While testing it on Windows 2016+ this worked OK. BUT we have customers that are still on Windows 2012, and here it didn’t work. *Sigh*. If the VM had multiple controllers, we could not see which UnitId belonged to which ControllerId. So back to the drawing board.

### The disk that needs expansion is found from the VM's Id, the ControllerId and the UnitId.
#$vmDisk = (Get-VM -Id $vmid | Get-HardDisk) | where { $_.ExtensionData.ControllerKey -eq ((Get-VM -id $vmid | Get-ScsiController ).ExtensionData | where { $_.BusNumber -eq $ControllerId }).Key } | where { $_.ExtensionData.UnitNumber -eq $SCSITargetId }

### Afterwards the disk can have the added capacity.
$vmDisk | Set-HardDisk -CapacityGB ($vmDisk.CapacityGB + $ExpandSizeGb) -Confirm:$false

We kept looking but could not find anything in particular. Thinking about how a physical disk has a serial number, we began to pursue that idea: the VM should see the UUID that VMware is presenting. And yes, this seems to work on Windows 2008 through Windows 2019.

VMware VM extension data – UUID

With the disk serial number approach, it was also easier to find the disk.

### UUID can be found in the VM extension data.
$vmDisk = (Get-VM -Id $vmid | Get-HardDisk) | Where-Object {$_.ExtensionData.Backing.uuid.Replace("-","") -eq $disksn } 
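The matching itself is just a string comparison after stripping the dashes from the UUID. Here is a minimal Python sketch of that comparison, under assumptions: the disk list, its "name" and "uuid" keys and the sample values are hypothetical, and lowercasing is added to mirror the case-insensitive -eq in PowerShell.

```python
def normalize(identifier: str) -> str:
    """Strip dashes and whitespace and lowercase, so both sides compare equal."""
    return identifier.replace("-", "").strip().lower()


def find_disk(disks: list, serial_number: str):
    """Return the first disk whose backing UUID matches the guest serial number."""
    wanted = normalize(serial_number)
    for disk in disks:
        if normalize(disk["uuid"]) == wanted:
            return disk
    return None


if __name__ == "__main__":
    # Hypothetical data: what the hypervisor side and the guest agent might report.
    disks = [
        {"name": "Hard disk 1", "uuid": "6000C29a-1b2c-3d4e-5f60-718293a4b5c6"},
        {"name": "Hard disk 2", "uuid": "6000C29f-0a1b-2c3d-4e5f-60718293a4b5"},
    ]
    print(find_disk(disks, "6000c29f0a1b2c3d4e5f60718293a4b5"))
```

Unlike the controller and unit position, the identifier does not depend on how many SCSI controllers the VM has.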


I don’t know why other people are not suggesting the disk serial number approach instead of the SCSI ID. My theory is that many look at what data they can get from the vCenter GUI, and there the SCSI ID based on controller ID and unit ID is the only thing really available.

But there is a lot of nice data when using PowerCLI to look at the data. Especially when doing automation.

ESXCLI host upgrade procedure

Most of the time you would want to use VMware Update Manager when doing upgrades. It’s part of vCenter and is a necessary tool for maintaining your environment. But for smaller deployments with standalone hosts and no vCenter, the following upgrade methods are handy and can cut the upgrade time, instead of having to upgrade with IPMI and an ISO.

Online mode:

This method is for getting the update online, no need to download ISO/offline bundles, etc. This will work for most of the upgrade use cases.

1: Connect to your ESXi host via the host client and enable SSH. Afterwards, SSH to the ESXi host and enable the ESXi firewall rule that allows the host to access the internet.

esxcli network firewall ruleset set -e true -r httpClient

2: The command below lists the ESXi profiles that are available in the VMware repos. We filter for only those relevant to our case, the upgrade to ESXi 6.7.

esxcli software sources profile list -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml | grep -i ESXi-6.7

3: Choose the desired profile and use the following commands to upgrade the ESXi version. Before the upgrade it’s a good idea to enter maintenance mode.

esxcli system maintenanceMode set --enable true
esxcli software profile update -p ESXi-6.7.0-20190402001-standard -d https://hostupdate.vm

4: After it’s done, you will need to restart the host. Once it has rebooted, you will be running the new ESXi version.
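Picking the profile from the list in step 2 can be scripted as well. The Python sketch below is only an illustration under assumptions: the function name is made up, the sample listing is a simplified stand-in for the esxcli output, and it assumes profile names follow the ESXi-6.7.0-20190402001-standard pattern, where a higher third field means a newer release.

```python
from typing import Optional


def newest_profile(listing: str, suffix: str = "-standard") -> Optional[str]:
    """Return the profile name with the highest release number, or None."""
    candidates = []
    for line in listing.splitlines():
        fields = line.split()
        if not fields:
            continue
        name = fields[0]
        parts = name.split("-")
        # Expect names like ESXi-6.7.0-20190402001-standard
        if name.endswith(suffix) and len(parts) >= 4 and parts[2].isdigit():
            candidates.append((int(parts[2]), name))
    return max(candidates)[1] if candidates else None


if __name__ == "__main__":
    # Simplified, illustrative listing (name, vendor, acceptance level).
    sample = """\
ESXi-6.7.0-20181104001-standard  VMware, Inc.  PartnerSupported
ESXi-6.7.0-20190402001-no-tools  VMware, Inc.  PartnerSupported
ESXi-6.7.0-20190402001-standard  VMware, Inc.  PartnerSupported
"""
    print(newest_profile(sample))
```

The returned name is what goes after -p in the esxcli software profile update command. It is still worth checking the release notes before trusting "newest" blindly.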

Custom, with Offline bundle:

This method is for when you want to install a custom update, or when your hosts don’t have access to the internet.

1: Download the offline bundle from the VMware webpage. In this upgrade I will use an HPE custom version, but if you run a generic version, that will also work.

2: After downloading the “VMware-ESXi-6.7.0-8169922-depot.zip” file, upload it to a datastore that is visible to your ESXi host. A local datastore is best if the host has one. If not, a shared datastore works too.

3: Find the profile name from the depot offline bundle

 esxcli software sources profile list -d /vmfs/volumes/prd.r60lun01/ISO/VMware-ESXi-6.7.0-Up

Put your host into maintenance mode, and enable SSH if you haven’t done so yet.

4: Execute one of these commands to upgrade your ESXi 6.x to 6.7. The first is for the generic image, the second for the HPE custom image.

esxcli software profile update -p ESXi-6.7.0-13006603-standard -d /vmfs/volumes/your_datastore/VMware-ESXi-6.7.0-13006603-depot.zip

esxcli software profile update -p HPE-ESXi-6.7.0-Update2-Gen9plus-670.U2. -d /vmfs/volumes/prd.r60lun01/ISO/VMware-ESXi-6.7.0-Update2-13006603-HPE-Gen9plus-670.U2.

After checking that your upgrade was successful, reboot your host. You should see a message saying that the upgrade completed successfully.


I have run into this error:

Failed updating the bootloader: Execution of command /usr/lib/vmware/bootloader-installer/install-bootloader failed: non-zero code returned…. return code: 1

The upgrade fails due to “insufficient space”.

This happens because the swap is placed on the ESXi installation device, which is not a good thing. So let’s change it.

Go to the UI of the ESXi host, https://IP/ui, log in and go to the following:

Manage > System > Swap > Edit Settings

Open the dropdown and select a datastore. Apply, and the swap space is now freed from the ESXi install device so that you can try the upgrade again.


After the upgrade, it’s a good idea to disable the ESXi firewall rule for “HTTP outside access” again. You can also stop and disable SSH again, but that’s optional 🙂

esxcli network firewall ruleset set -e false -r httpClient

Now you should have an upgraded host.

NSX 6.3.6 to 6.4.5 – Controller problem encountered

NSX upgrades can be a delicate thing, even when everything is in its finest shape.

After we had successfully upgraded the NSX Managers, we proceeded with upgrading the NSX Controllers. We did a pre-check by issuing the command “show control-cluster status” and it looked fine. The upgrade to 6.4.6 went well and we could vMotion VMs around after the controllers had booted. But the post-checks were not OK: “show control-cluster status” did not return what we expected, and we were not confident enough to proceed with the host upgrades.

After some troubleshooting we found that the /var/log partition on two of the three controllers was full. Without any other evidence we concluded that this was the problem. Our Google-fu didn’t turn up any KB or blog posts on how to purge the logs.

But we found out that we could get into an engineering mode that would give us shell access. Long story short, we did the following:

1. Followed https://kb.vmware.com/s/article/2149630 to gain shell access on the manager
1.1 The password is IAmOnThePhoneWithTechSupport
2. Extracted the root passwords for the controllers with /home/secureall/secureall/sem/WEB-INF/classes/GetNvpApiPassword.sh controller-nn
3. Logged into each controller and issued debug os-shell to gain root shell access.
4. Deleted /var/log/syslog.1 on each node.
5. Did a rolling restart of the controllers. After they booted, they all joined the cluster.


After this we got the status we wanted. In the meantime we had created a case with VMware support, and the supporter was on a remote session with us. We told him what we had done and verified together that the controllers were healthy, and they were.

Next step, VIB upgrade on the hosts.

Good commands to know:

show process monitor
show controller list all
show control-cluster status

Edit: This article from VMware describes the exact problem we encountered. We also contacted VMware Support, but before they were able to assist us we had the problem solved. 🙂

Process of getting the root password for controllers.

NSX Edge PowerShell manipulation

This is from a VMware support experience. A customer could not change the DNS server parameters of the NSX Edge IP pool. It actually turned out to be a bug in VCD 9.5, where an Edge XML config was missing some tags and therefore could not be validated when VCD posted the edited XML config back to the NSX Manager.

I have attached VMware support’s answer at the bottom of the post.

The script gets all edges from the NSX Manager, then you find the correct one and fill its ID into the next part of the script. Then you save the XML to a file on your local machine, edit the file to put in the missing tags, and lastly PUT the XML back to the NSX Manager. After this operation, it works from the GUI again.

# Import credential module and login information
$ReturnObj = import-credentials vmwareSSO
$nsxUsername = $ReturnObj.Username
$nsxPassword = $ReturnObj.Password

# Other variables
$tempFile = "C:\temp\edge-747_jvr.xml"

# Allow all SSL protocols
$AllProtocols = [System.Net.SecurityProtocolType]'Ssl3,Tls,Tls11,Tls12' 
[System.Net.ServicePointManager]::SecurityProtocol = $AllProtocols

# Accept any server certificate (the NSX manager typically uses a self-signed one)
Add-Type @"
    using System;
    using System.Net;
    using System.Net.Security;
    using System.Security.Cryptography.X509Certificates;
    public class ServerCertificateValidationCallback
    {
        public static void Ignore()
        {
            ServicePointManager.ServerCertificateValidationCallback +=
                delegate
                (
                    Object obj,
                    X509Certificate certificate,
                    X509Chain chain,
                    SslPolicyErrors errors
                )
                {
                    return true;
                };
        }
    }
"@
[ServerCertificateValidationCallback]::Ignore()


# Getting all edges
$Type = "Accept: application/xml"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword))}
$nsxUri = ""

[xml]$edges = (Invoke-WebRequest -Uri $nsxUri -Headers $Header -Method GET -ContentType $Type).Content
foreach ($edge in $edges.pagedEdgeList.edgePage.edgeSummary) {
    $edgeInfo = "name: {0} - ID: {1}" -f $edge.name, $edge.objectId
    # Print each edge so the correct ID can be picked out
    Write-Output $edgeInfo
}

# Getting specific edge config
$edgeId = "edge-747"
$Type = "Accept: application/xml"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword))}
$nsxUri = "$edgeId"

(Invoke-WebRequest -Uri $nsxUri -Headers $Header -Method GET -ContentType $Type).Content | out-file $tempFile

# PUT edge config after edit
$Type = 'application/xml'
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword))}
$nsxUri = "$edgeId"
$edgeConfigAltered = Get-Content $tempFile

$respons = Invoke-WebRequest -Uri $nsxUri -Headers $Header -Method Put -ContentType 'application/xml' -Body $edgeConfigAltered
# Statuscode 204 is accepted
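The Authorization header that the script builds three times is plain HTTP Basic authentication: the username and password joined by a colon and base64-encoded. For readers who want to see the mechanism outside of PowerShell, here is a minimal Python sketch. The function name and the credentials are made up for illustration.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the HTTP Basic Authorization header for the given credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": "Basic " + token}


if __name__ == "__main__":
    # Hypothetical credentials, for illustration only.
    print(basic_auth_header("admin", "secret"))
```

Since the header is identical for all three requests, it only needs to be built once and can be reused for the GET and the PUT.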

From Support:
– The issue you are seeing is a known issue 9.5.
– Like I mentioned in the previous email, this is due to missing elements from the xml.
– From the xml in the logs, I could see there are 52 NAT rules on that edge.Correct me if I am wrong. The following 2 rules had the elements missing


I have attached the file with the list of all the NAT rules seen from the logs if you need to cross-verify.

– To fix the issue,please follow https://kb.vmware.com/s/article/67193

If you have any further questions,let me know.

Have a good evening,

Best regards,


Make a clone of VMs to NAS – The PowerCLI way

Quick post. I had a customer that wants a yearly clone of their VMs, copied to a NAS and then shipped to the customer’s HQ. The owner of the company put this as a requirement. Fair enough. I have almost always cloned the VMs through the GUI. In the beginning this was easy because they only had 5 servers, but they now have more. So this time I wanted to try and script it instead. It took me some extra time, but in the end I think it’s worth it. My PowerShell skills are not great, I’m still learning, so bear with me.

# Variables
$vcenter = "<IP or hostname>"
$cluster = "<name of cluster that contains the servers>"
$nfsIP = ""
$nfsMount = "/nfs"

# Getting VMware PowerShell Modules
Get-Module -Name vmware* -ListAvailable | Import-Module

# Connect to vcenter
Connect-VIServer -Server $vcenter -User <username>

# Mount NAS 
get-cluster $cluster | get-vmhost | new-datastore -nfs -name NAS -path $nfsMount -nfshost $nfsIP

# More Variables
$ds = get-datastore NAS
$tempHost = "esx74.domain.tld"
$vms = Get-VM -Name customer*

# Copy all 
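foreach ($vm in $vms) {
    # Clone the VM onto the NAS datastore
    New-VM -Name "$($vm.name)-clone" -VM $vm -Datastore $ds -VMHost $tempHost
    # Remove the clone from the inventory but keep its files on the NAS
    Remove-VM -VM "$($vm.name)-clone" -DeleteFromDisk:$false -Confirm:$false -RunAsync
}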

# Remove datastore from hosts again
Get-Cluster $cluster | Get-VMHost | Remove-Datastore -Datastore $ds