Cloud Director PowerShell deploy

Having to deploy multiple Cloud Director environments can be tedious, especially when the OVF deployment in vCenter keeps failing with some obscure error. Deploying with PowerShell, on the other hand, is smooth sailing.

Here is the small piece of PowerShell to do so. It's very basic, but it works every time.

Connect-VIServer -server vcsa.home.lab

$OVA = "C:\teknik\VMware_Cloud_Director-10.4.1.9057-20912720_OVF10.ova"

# Cluster and datastore to deploy to
$VMCluster = "mgmt01"
$VMDatastore = "vSan"

# VCD Configuration
$VMName = "vcd1-03"

$VMNetwork1 = "vcd_vCloud_External_Perimeter"
$VMNetwork2 = "vcd_vCloud_Internal_Perimeter"

$VMNetworkIP1 = "10.66.66.31"
$VMNetworkIP2 = "10.55.55.31"
$VMNetmask = "255.255.255.0"
$VMGateway = "10.66.66.1"

$VMDNS = "10.44.44.10"
$VMNTP = "ntp.home.lab"
$VMSearchPath = "home.lab"

$eth0Routes = ""
$eth1Routes = "10.55.55.1 10.44.44.0/24, 10.55.55.1 10.1.94.0/24"

$DeploymentSize = "standby-medium"
$RootPassword = ''
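# Note: as an alternative to hard-coding the root password above (my own suggestion,
# not part of the original script), you could prompt for it at run time:
# $RootPassword = Read-Host -Prompt "VCD appliance root password"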

$cluster = Get-Cluster $VMCluster
$vmhost = $cluster | Get-VMHost | Get-Random
$datastore = $vmhost | Get-Datastore $VMDatastore

# Setup ovfconfig
$OvfConfig = Get-OvfConfiguration $OVA

$OvfConfig.DeploymentOption.Value = $DeploymentSize

$OvfConfig.NetworkMapping.eth0_Network.Value = $VMNetwork1
$OvfConfig.NetworkMapping.eth1_Network.Value = $VMNetwork2

$OvfConfig.vami.VMware_vCloud_Director.ip0.Value =  $VMNetworkIP1
$OvfConfig.vami.VMware_vCloud_Director.ip1.Value =  $VMNetworkIP2
$OvfConfig.vami.VMware_vCloud_Director.netmask0.Value = $VMNetmask
$OvfConfig.vami.VMware_vCloud_Director.netmask1.Value = $VMNetmask
$OvfConfig.vami.VMware_vCloud_Director.gateway.Value = $VMGateway
$OvfConfig.vami.VMware_vCloud_Director.DNS.Value = $VMDns
$OvfConfig.vami.VMware_vCloud_Director.domain.Value = $VMName
$OvfConfig.vami.VMware_vCloud_Director.searchpath.Value = $VMSearchPath

$OvfConfig.vcloudapp.VMware_vCloud_Director.enable_ssh.Value = $true
$OvfConfig.vcloudapp.VMware_vCloud_Director.expire_root_password.Value = $false
$OvfConfig.vcloudapp.VMware_vCloud_Director.ntp_server.Value = $VMNTP
$OvfConfig.vcloudapp.VMware_vCloud_Director.varoot_password.Value = $RootPassword

$OvfConfig.vcloudnet.VMware_vCloud_Director.routes0.Value = $eth0Routes
$OvfConfig.vcloudnet.VMware_vCloud_Director.routes1.Value = $eth1Routes

# Deploy
Import-VApp -Source $OVA -OvfConfiguration $OvfConfig -Name $VMName -Location $cluster -VMHost $vmhost -Datastore $datastore -DiskStorageFormat thin
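
If you want the appliance to boot right after the import, you can power it on with a one-liner (my own addition, not part of the original script):

Get-VM -Name $VMName | Start-VM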

ESXi VM PCAP collection

This is a small guide on how to collect PCAPs of the network traffic from/to a virtual machine running on VMware ESXi. The scripts have been provided by VMware GSS, can be found here, and are used at your own risk.

(I get it, who would want to download Python scripts from a random website and run them on their ESXi host? Feel free to look through the code after downloading, or request the scripts from VMware instead.)

The scripts can collect on multiple port IDs, split the PCAPs into smaller files, rotate the logs every x minutes, and help you monitor an ongoing capture.

If you don’t want to use the scripts, then there is also a more manual way to run the collection here.

Finding the VM port IDs

Connect to the ESXi host with SSH and list the running VMs.

[root@dc1esxedge1-02:~] esxcli network vm list
World ID  Name              Num Ports  Networks
--------  ----------------  ---------  --------
68018705  nsxedgenode1             3  dvportgroup-914135, dvportgroup-914135 

Find the VM that you want to collect from. Let's take World ID 68018705 and check the switch port IDs for it:

[root@esxedge1-02:~] esxcli network vm port list -w 68018092
   Port ID: 67108942
   vSwitch: DSVNSX
   Portgroup: dvportgroup-914135
   DVPort ID: 109
   MAC Address: 00:50:56:ab:82:59
   IP Address: 0.0.0.0
   Team Uplink: vmnic2
   Uplink Port ID: 2214592556
   Active Filters:

   Port ID: 67108943
   vSwitch: DSVNSX
   Portgroup: dvportgroup-914135
   DVPort ID: 108
   MAC Address: 00:50:56:ab:06:b0
   IP Address: 0.0.0.0
   Team Uplink: vmnic2
   Uplink Port ID: 2214592556
   Active Filters:

   Port ID: 67108944
   vSwitch: DSVNSX
   Portgroup: dvportgroup-914134
   DVPort ID: 98
   MAC Address: 00:50:56:ab:93:ab
   IP Address: 0.0.0.0
   Team Uplink: vmnic1
   Uplink Port ID: 2214592558
   Active Filters:

In this case, we note down the following port IDs and save them for later

67108942
67108943
67108945
67108946

Start the collection with scripts

Upload the Python scripts to /tmp/ on the ESXi host. You also need a folder on a local datastore of the ESXi host where the PCAP files can be stored.

Prepare and run the first script, where we provide the port IDs and where to store the PCAPs:

[root@esxedge1-02:~] python /tmp/prepare_pktcap.py -p 67108942 -p 67108943 -p 67108945 -p 67108946 -u -d /vmfs/volumes/prd.dc1esxedge1-02/capture/42434546 -o /tmp/rotating_cap.sh -G 15m -r 3600
The current vmci heap is 99% free and so far there have been 0 allocation failures
[root@esxedge1-02:~]

Next, we need to start the rotating log script. This starts the PCAP collection threads and rotates the logs every 15 minutes, as we defined in the prepare_pktcap script. The script should be left running in the SSH session; if you stop it, the collection stops as well.

[root@esxedge1-02:~] /tmp/rotating_cap.sh
Dump: 294272, broken : 0, drop: 0, file err: 0.
Dump: 175296, broken : 0, drop: 0, file err: 0.
Dump: 371584, broken : 0, drop: 0, file err: 0.
Dump: 317248, broken : 0, drop: 0, file err: 0.
Dump: 371648, broken : 0, drop: 0, file err: 0.
Dump: 317312, broken : 0, drop: 0, file err: 0.

Monitor the collection

The last script monitors the capture sessions. You will need to open a new SSH session to the host to run it.

[root@esxedge1-02:~] /tmp/pktcap_sessions.py -l
The vmci heap is 58% free
session    portID       devName
579        67108942
580        67108942
581        67108945
582        67108946
583        67108943
584        2214592556   vmnic2
585        67108946
586        2214592556   vmnic2
587        67108943
588        67108945
589        2214592558   vmnic1
590        2214592558   vmnic1

From the output, we can see that it collects on the port IDs that we defined, but it also captures PCAPs for the vmnics. This is valuable if we need to compare the traffic coming in on the host with the traffic reaching the VM, and vice versa.

Looking at the output that is stored on the datastore

[root@esxedge1-02:/vmfs/volumes/5fd89aca-1f6f344a-f65f-043f72c0064a/capture/42434546] ls
dc1esxedge1-02_2023-11-29T07_42_11_p67108942_d0_sna.pcap
          ....
dc1esxedge1-02_2023-11-29T07_42_11_p67108945_d0_sna.pcap.log      dc1esxedge1-02_2023-11-29T07_42_11_pvmnic2_d0_sna.pcap_rot.log
dc1esxedge1-02_2023-11-29T07_42_11_p67108945_d0_sna.pcap_rot.log  dc1esxedge1-02_2023-11-29T07_42_11_pvmnic2_d1_sna.pcap
dc1esxedge1-02_2023-11-29T07_42_11_p67108945_d1_sna.pcap          dc1esxedge1-02_2023-11-29T07_42_11_pvmnic2_d1_sna.pcap.log
dc1esxedge1-02_2023-11-29T07_42_11_p67108945_d1_sna.pcap.log      dc1esxedge1-02_2023-11-29T07_42_11_pvmnic2_d1_sna.pcap_rot.log
dc1esxedge1-02_2023-11-29T07_42_11_p67108945_d1_sna.pcap_rot.log  
killfile

We can see the PCAPs for each of the port IDs and vmnics, and they rotate every 15 minutes.

Stopping the collection

You might have noticed the "killfile" from above. Remove this file and the collection will stop.

[root@esxedge1-02:~] rm -rf /vmfs/volumes/5fd89aca-1f6f344a-f65f-043f72c0064a/capture/42434546/killfile

Conclusion

The scripts are handy because they rotate the logs and have a way to monitor and kill the collections. This way we don’t have to manually kill processes on the ESXi host.

When the collection is done, you can copy out the logs for further analysis. I found the FileZilla SFTP client to be the fastest way of copying out the data.

If you want to merge the PCAPs afterward, on macOS you can do it with

mergecap -w merged.pcap *.pcap
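
On Windows, mergecap ships with Wireshark, so the same merge can be done from PowerShell (assuming the default install path):

& "C:\Program Files\Wireshark\mergecap.exe" -w merged.pcap (Get-ChildItem *.pcap).FullName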

Troubleshooting

If you find that you can't start the rotating log script, it might be because you have tried to start it before and it has somehow stalled. You can find its process IDs and kill it manually.

[root@esxedge1-02:~] ps -Tcjstv | grep -i rotating_cap

[root@esxedge1-02:~] kill 


Just Enough vSphere rights for replication

Just enough rights for a user to see and manage VMs in order to set up migration replication with either Azure Migrate or VMware Availability.

We are seeing more and more VMs being moved around between providers. Just a few years back this was not something anybody wanted to go into, but the market is in a transition where offboarding is as important as onboarding. Good for the customer.

To ensure that the rights are just enough, so that Azure Migrate or the VMware Availability on-prem appliance can't see all VMs in your datacenter, you need to limit the rights that the appliance is given.

To help out, I have created a small script that handles the permission setup.

It will create two roles, one global and one for the tenant resource pool. Then it will set up the permissions so that they are just enough for the replication to work.

If you do not want the VCDA plugin in vCenter, you can remove the lines that define the "Extension" privileges.

$roleGlobal = "vcda_repl_global"
$roleTenant = "vcda_repl_tenant"
$viserver = "vcsa1.home.lab"
$tenantRespool = "tenant1.comp1.01 (6284cdf1-cb7f-43bb-8e0f-09f439e09555)"
$vsphereUser = "tenant1_vcda"

Connect-VIServer -server $viserver

$roleIds = @()
$roleIds += "System.Anonymous"
$roleIds += "System.View"
$roleIds += "System.Read"
### Cryptographic Operations
$roleIds += "Cryptographer.ManageKeys"
$roleIds += "Cryptographer.RegisterHost"
### Datastore Privileges
$roleIds += "Datastore.Browse"
$roleIds += "Datastore.Config"
$roleIds += "Datastore.FileManagement"
### Extension Privileges - Not needed if you dont want plugin to vcenter 
$roleIds += "Extension.Register"
$roleIds += "Extension.Unregister"
$roleIds += "Extension.Update"
### Host Configuration Privileges
$roleIds += "Host.Config.Connection"
### Profile-driven Storage Privileges
$roleIds += "StorageProfile.View"
### Storage Views Privileges
$roleIds += "StorageViews"
### Host.Hbr.HbrManagement
$roleIds += "Host.Hbr.HbrManagement"

$roleIdsTenant = @()
### Resource Privileges
$roleIdsTenant += "Resource.AssignVMToPool"
### Virtual Machine Configuration Privileges
$roleIdsTenant += "VirtualMachine.Config.AddExistingDisk"
$roleIdsTenant += "VirtualMachine.Config.Settings"
$roleIdsTenant += "VirtualMachine.Config.RemoveDisk"
### Virtual Machine Inventory Privileges
$roleIdsTenant += "VirtualMachine.Inventory.register"
$roleIdsTenant += "VirtualMachine.Inventory.Unregister"
### Virtual Machine Interaction
$roleIdsTenant += "VirtualMachine.Interact.PowerOn"
$roleIdsTenant += "VirtualMachine.Interact.PowerOff"
### Virtual Machine State Privileges
$roleIdsTenant += "VirtualMachine.State.CreateSnapshot"
$roleIdsTenant += "VirtualMachine.State.RemoveSnapshot"
### Host.Hbr.HbrManagement
$roleIdsTenant += "VirtualMachine.Hbr.ConfigureReplication"
$roleIdsTenant += "VirtualMachine.Hbr.ReplicaManagement"
$roleIdsTenant += "VirtualMachine.Hbr.MonitorReplication"


New-VIRole -name $roleGlobal -Privilege (Get-VIPrivilege -Server $viserver -id $roleIds) -Server $viserver
Set-VIRole -Role $roleGlobal -AddPrivilege (Get-VIPrivilege -Server $viserver -id $roleIds) -Server $viserver

New-VIRole -name $roleTenant -Privilege (Get-VIPrivilege -Server $viserver -id $roleIdsTenant) -Server $viserver
Set-VIRole -Role $roleTenant -AddPrivilege (Get-VIPrivilege -Server $viserver -id $roleIdsTenant) -Server $viserver

$globalPrivileges = Get-VIPrivilege -Role $roleGlobal

$rootFolder = Get-Folder -NoRecursion
$permission1 = New-VIPermission -Entity $rootFolder -Principal (Get-VIAccount -Domain vsphere.local -User $vsphereUser ) -Role $roleGlobal -Propagate:$false

$tenantRespool = Get-ResourcePool -Name $tenantRespool
$permission1 = New-VIPermission -Entity  $tenantRespool -Principal (Get-VIAccount -Domain vsphere.local -User $vsphereUser ) -Role $roleTenant -Propagate:$true
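
To sanity-check the result, you can list the permissions that were just created for the replication user. A small sketch using the variables from above:

Get-VIPermission -Principal (Get-VIAccount -Domain vsphere.local -User $vsphereUser) -Server $viserver | Format-Table Entity, Role, Propagate -AutoSize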

Cloud Director – Tenant SAML SSO with AzureAD

I have done this post before, but that was back in the day when we still called VCD vCloud. I miss the name a bit, but not the Flash GUI. In the future, Cloud Director will not support local users in organizations. Therefore, if a customer needs access to their organization, it has to be through an identity provider (IDP) using SAML or OAuth.

Many customers already have Microsoft 365 and therefore also AzureAD, which makes it a convenient choice for SAML integration. If you administer a non-AzureAD IDP, you should also be able to use this guide, since it's only claims and metadata.

This guide focuses on access assigned based on group membership, but you can also use roles. The main difference is that roles can be more granular than group memberships; for example, you can have a role that gives access to expand a disk or delete a VM, so roles translate better than groups.

Microsoft has changed the name of AzureAD to Entra. This post refers to AzureAD, but that also means Entra.

Process

  • Create an Enterprise app in the desired AzureAD tenant
  • Add entity ID and claims
  • Import AzureAD enterprise app federation metadata to Cloud Director
  • Assign allowed users/groups to roles in Cloud Director

Claims

Cloud Director is very picky about claims; the VMware documentation gives a list of how the claims should look.

  • email address = “EmailAddress”
  • user name = “UserName”
  • full name = “FullName”
  • user’s groups = “Groups”
  • user’s roles = “Roles”

If you want to keep the claim names with the full namespace, you can map them in the Cloud Director SAML attribute mapper.

AzureAD(Entra) setup

Enterprise Application creation

The basic setup in Entra is shown in the short walkthrough below. We will end up with an Enterprise App that is ready to be configured.

In Entra, find the Enterprise application and create a new one.

Enterprise app configuration

SAML Claims

Set up the claims according to the screenshot below. If you make the claim names as in the screenshot, they will match Cloud Director, so the SSO should work out of the box.

The only special setup is for the group claim where you should do the following

  • Choose "Groups assigned to the application". If you have many groups in Entra, the SAML setup only returns some of them, and if the group that Cloud Director is looking for is not returned, you will not be logged in.
  • Source attribute of “Group ID”. Haven’t gotten it
  • Under Advanced options, choose "Customize the name of the group claim" and set it to "Groups". This makes sure Cloud Director knows that these are groups.
If you are using a different IDP where you can't alter the claim names, you can use the attribute mapper in Cloud Director to translate the claim names into something Cloud Director understands.

Cloud Director configuration

Go to the tenant context and navigate to "Administration" > "SAML". You might be met with a prompt saying that the certificate has expired.

Hit "Regenerate certificate" and afterward the "Configure" button, and a popup shows.

Fill out the wanted entity ID. I usually just use the URL for the tenant VCD. The important thing is that the entity ID matches between the Entra Enterprise App and Cloud Director.

Download the federation metadata XML from the AzureAD Enterprise application.

Choose the "Federation Metadata XML" link.

Upload the metadata to Cloud Director SAML configuration, enable the SAML identity provider, and press save.

We are now ready to upload the metadata from Cloud Director.

This can be found under “Administration” > “SAML” and “Retrieve Metadata”.

Now upload the metadata to the Enterprise Application

After the upload the Entity ID and reply URL will be filled out.

Save the uploaded metadata

Cloud Director SAML groups

For Cloud Director to know what role the logged-in user should have, we need to tell it which group ID is associated with which role. We do that by fetching the ID of the group in AzureAD.

Copy the ID and go over to Cloud Director under "Administration" > "Groups" > "Import Groups".

Paste in the ID, select the role, and save. Using the group ID helps if someone decides to change the name of the group in Entra. To make it easier to see which group is behind an ID, I tend to put the group name into the description after the group has been created.

Troubleshooting

Problems can occur in multiple places. Most of the time I start by looking at the SAML response; if it contains the groups that I expect, then you need to go on and look in the VCD logs.

SAML response

In your favorite browser, open developer mode and have it capture the traffic. Then log on to VCD, let the error occur, and stop the capture again.

Grab the base64-encoded SAML response and decode it in your favorite tool.
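
If PowerShell happens to be your favorite tool, a quick decode can look like this ($samlResponse is just a placeholder for the captured base64 string):

$samlResponse = "<paste the base64 SAMLResponse value here>"
[System.Text.Encoding]::UTF8.GetString([System.Convert]::FromBase64String($samlResponse))

The decoded response will look something like this: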
<samlp:Response Destination="https://vcd.ramsgaard.dk/login/org/system/saml/SSO/alias/vcd" ID="_684262e1-e528-422f-8bf7-dd6d78f3ab49" InResponseTo="af3h19802aa811a5824j5b18aidee2" IssueInstant="2023-09-18T11:52:08.837Z" Version="2.0" xmlns:samlp="urn:oasis:names:tc:SAML:2.0:protocol">
	<Issuer xmlns="urn:oasis:names:tc:SAML:2.0:assertion">https://sts.windows.net/c5c123a4-e828-43bd-84d6-05cbfe1efae5/</Issuer>
	<samlp:Status>
		<samlp:StatusCode Value="urn:oasis:names:tc:SAML:2.0:status:Success"/>
	</samlp:Status>
	<Assertion ID="_e96104c4-681e-4624-111e-a6e9a2bf6900" IssueInstant="2023-09-18T11:52:08.833Z" Version="2.0" xmlns="urn:oasis:names:tc:SAML:2.0:assertion">

We want to see a success status for the SAML authentication. If we got that and the login still fails, the problem can be in the claims being sent along with it.

			<Attribute Name="name">
				<AttributeValue>user@domain.tld</AttributeValue>
			</Attribute>
			<Attribute Name="http://schemas.xmlsoap.org/ws/2005/05/identity/claims/UserName">
				<AttributeValue>user@domain.tld</AttributeValue>
			</Attribute>
			<Attribute Name="Groups">
				<AttributeValue>93547461-c174-4cab-961f-6f21c43051bd</AttributeValue>
				<AttributeValue>3c86569c-9e03-43f8-87fa-30b566ebb829</AttributeValue>
			</Attribute>
		</AttributeStatement>

In the snippet above you can see that I got two group IDs back in the claim. Check whether these are the group IDs VCD is looking for; if not, you need to investigate whether the user is a member of the correct groups.

Conclusion

Since VCD local users are deprecated and will be removed in the future, we need to convert over to using an IDP instead. AzureAD is a nice way to implement MFA and gives a single identity for accessing corporate resources.

VCD – Force delete network

In our V2T conversion, the NSX migration tool for Cloud Director has had some issues when doing cleanup. One of them is that it can't delete the old NSX-V backed network, even though nothing left in VCD is using it. The error message can be seen below.

2023-05-22 10:54:28,551 [connectionpool]:[_make_request]:452 [DEBUG] [tenant.01] | https://vcd.ramsgaard.me:443 "DELETE /cloudapi/1.0.0/orgVdcNetworks/urn:vcloud:network:ce108a33-fa5c-4cae-8c16-60edd536ad20 HTTP/1.1" 400 None
2023-05-22 10:54:28,556 [vcdOperations]:[deleteOrgVDCNetworks]:1090 [DEBUG] [tenant.01] | Failed to delete Organization VDC Network lan.[ 1ca6fd03-de82-4835-b12e-58c5c043b2bc ] Network lan cannot be deleted, because it is in use by the following vApp Networks: lan.
2023-05-22 10:54:28,556 [vcdNSXMigratorCleanup]:[run]:230 [ERROR] [tenant.01] | Failed to delete Org VDC networks ['lan'] - as it is in use
Traceback (most recent call last):
  File "src\vcdNSXMigratorCleanup.py", line 218, in run
  File "<string>", line 1, in <module>
  File "src\core\vcd\vcdValidations.py", line 53, in inner
  File "src\core\vcd\vcdOperations.py", line 1094, in deleteOrgVDCNetworks
Exception: Failed to delete Org VDC networks ['lan'] - as it is in use

I found someone else having this problem who discovered a forceful way to delete the network. I have used this approach but wrapped it in PowerShell instead. In my case, I could get the network URN from the log of the migration tool. Alternatively, you can easily see the URN in the GUI URL when in the context of the network.

### Variables
$vcdUrl = "https://vcd.ramsgaard.me"
$apiusername = "@system"
$password = ''
$networkUrn = "urn:vcloud:network:ce108a33-fa5c-4cae-8c16-60edd536ad20"

### Auth against API and enable TLS1.2 for PowerShell
$base64AuthInfo = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(("{0}:{1}" -f $apiusername,$password)))
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12
$auth =Invoke-WebRequest -Uri "$vcdUrl/api/sessions" -Headers @{Accept = "application/*;version=36.0";Authorization="Basic $base64AuthInfo"} -Method Post

### Get VirtualWire
$virtualWire = Invoke-RestMethod -Uri "$vcdUrl/cloudapi/1.0.0/orgVdcNetworks/$($networkUrn)" -Headers @{Accept = "application/json;version=36.0";Authorization="Bearer $($auth.Headers.'X-VMWARE-VCLOUD-ACCESS-TOKEN')"} -Method GET
$virtualWire

### Delete VirtualWire
$deleteStatus = Invoke-RestMethod -Uri "$vcdUrl/cloudapi/1.0.0/orgVdcNetworks/$($networkUrn)?force=true" -Headers @{Accept = "application/json;version=36.0";Authorization="Bearer $($auth.Headers.'X-VMWARE-VCLOUD-ACCESS-TOKEN')"} -Method DELETE
$deleteStatus

The above PowerShell is used at your own risk 🙂
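
If you want to confirm that the network is really gone once the delete task has finished, you can repeat the GET from above and expect it to fail. A rough sketch reusing the same variables:

### Optional check - the GET should now fail (404) since the network no longer exists
try {
    Invoke-RestMethod -Uri "$vcdUrl/cloudapi/1.0.0/orgVdcNetworks/$($networkUrn)" -Headers @{Accept = "application/json;version=36.0";Authorization="Bearer $($auth.Headers.'X-VMWARE-VCLOUD-ACCESS-TOKEN')"} -Method GET
    Write-Host "Network $networkUrn still exists"
}
catch {
    Write-Host "Network $networkUrn is gone"
}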

Basic VMware PhotonOS config

I'm still a bit new to PhotonOS, but I find myself using it more and more for deploying things, like CSE for VCD.

I always struggle with the basics, even though I know a bit about Linux and Unix. So here are some quick tips on

  • First login
  • Network
  • SSH
  • Passwords
  • NTP

First login

When the VM is deployed, you can log in to it with root and the password "changeme". It will then require you to change the password to something else.

Network

You can view the network interfaces with networkctl, and ip a shows what is already configured.

If you want to set a static IP, you have to create a new config file.

cat > /etc/systemd/network/10-static-eth0.network << "EOF"
 
[Match]
Name=eth0
 
[Network]
Address=172.16.4.225/24
Gateway=172.16.4.1
DNS=172.16.4.10
Domains=home.lab
EOF

After the file has been created, set the correct permissions and restart the network and resolver services. If you skip the chmod, the network restart will probably fail because the system cannot read the new file.

chmod 644 /etc/systemd/network/10-static-eth0.network
systemctl restart systemd-networkd
systemctl restart systemd-resolved

SSH config

When you want to connect to PhotonOS with a password but also have either Pageant or a native SSH agent running, the client will first try to authenticate with a public/private key. If your key is not added to the server, you will get "Too many failed authentications". To get around this, you can use a parameter.

ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no <hostname>

This will then authenticate with a password.

Forgot root password

If you forgot the root password and need to reset it you can do the following.

1. Reboot and press "e" to enter the GRUB boot loader. Append rw init=/bin/bash to the kernel line, as shown in the screenshot, and press F10 to boot.

2. You should now be in a shell on the system. From here you can run passwd to set a new root password.

3. Unmount / and reboot the system:

umount /
reboot -f

Disable password history

If you try to reset the password but the password you want to set gives you "Password has been already used. Choose another.", the password history is to blame.

You can disable this by editing /etc/pam.d/system-password.

By changing 'remember' from 5 to 0, we disable the password history count and can then reset the root password.

Paste into a PhotonOS PuTTY session

When using vi there is a "bug" that prevents the normal right-click paste. To work around this, type :set mouse= in vi command mode. After that, you can paste with right-click again.

Shift+Ins can also be used to paste in PuTTY.

NTP settings

Some apps running on PhotonOS are very time-sensitive; VCDA is one of them. Use watch -n 0,1 date on each of the appliances that need to communicate and verify that the time is not skewed.

If you need to set up NTP post-deployment you can do as follows. Edit the timesyncd.conf file with a text editor such as vi:

vi /etc/systemd/timesyncd.conf

In the [Time] section edit the NTP entry with the correct NTP server address:

[Time]
#FallbackNTP=time1.google.com time2.google.com time3.google.com time4.google.com
NTP=ntpAddress

After putting in the NTP server you want to use, restart the network and time sync services.

systemctl restart systemd-networkd
systemctl restart systemd-timesyncd

Verify that the time on the appliance is now synchronizing with the NTP server.

For further troubleshooting, check whether the time sync service is running:

systemctl status systemd-timesyncd

NSX 4.0.1 > 4.1.0 upgrade problems

The precheck gave warnings for all the edge nodes, stating the problem below.

Edge node 4006d386-a394-43a4-6b04b242f8b3 vmId is not found on NSX Manager. Please refer to https://kb.vmware.com/s/article/90072

The KB article states that the NSX Managers are missing the vm_id for the edge nodes, and it gives an example of how to manually find the edge VM MoRef and post it to the NSX API.

Using PowerShell to update the VM_ID

Instead of the manual procedure from the KB, I made a small script.

### Login details
$nsxUsername = "admin"
$nsxPassword = "Yi....kes12!"
$nsxmanager = "nsxt.home.lab"

### Connect to vcenter so that we can fetch moref
Connect-VIServer vcsa1.home.lab

### NSX Manager auth header
$Type = "application/json;charset=UTF-8"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword)) }
$nsxUri = "https://$($nsxmanager)"

### Edge Vm moref update
$edgenodes =  (Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes" -Headers $Header -Method GET -ContentType $Type).results | Where-Object {$_.node_deployment_info.deployment_type -eq "VIRTUAL_MACHINE"}

### Loop through the edge nodes
foreach($edgenode in $edgenodes){
# Clear any error from the previous loop iteration
$ErrResp = $null
write-host "Updating edge node - $($edgenode.display_name)"
$specEdge =  (Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)" -Headers $Header -Method GET -ContentType $Type).node_deployment_info.deployment_config

$vmid = ((get-vm $edgenode.display_name).Id).Split("-")[-1]
write-host "Found edge node moref in vCenter - vm-$vmid"

write-host "Removing form_factor and adding vm_id to the object"
$specEdge | Add-Member -NotePropertyName vm_id -NotePropertyValue "vm-$vmid"
$specEdge = $specEdge | Select-Object -Property * -ExcludeProperty form_factor

try {
    write-host "Updating against NSX API"
    Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)?action=addOrUpdatePlacementReferences" -Headers $Header -Method POST -ContentType "application/json" -Body $($specEdge|ConvertTo-Json -Depth 10)
}
catch {
   $streamReader = [System.IO.StreamReader]::new($_.Exception.Response.GetResponseStream())
   $ErrResp = $streamReader.ReadToEnd() | ConvertFrom-Json
   $streamReader.Close()
  }
if($ErrResp){
   write-host "$($ErrResp.error_message) - $($edgenode.display_name) not updated with success" -ForegroundColor Red
}
else{
   write-host "$($edgenode.display_name) updated with success"
}
}

Unfortunately, even after updating vm_id, the precheck still failed with the same error. The NSX API accepted the POST with code 200, but nothing happened behind the scenes.

VMware support:

We opened a VMware SR; they got the logs and the info that we had already tried to update the vm_id. They asked us to do a couple of other things.

Refreshing the edge node config data

VMware support asked us to try an API call that refreshes the edge node config data (see the NSX API documentation). With the info from the previous script, we can do a POST against the NSX API.

Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)?action=refresh_node_configuration&resource_type=EdgeNode&read_only=true" -Headers $Header -Method POST

Reboot of edge nodes

Put an Edge node into NSX maintenance mode and afterward do a reboot of the node, initiated from vCenter with a “Restart Guest OS”. The reboot went fine and the Edge node was put into production again.

Reboot of NSX Managers

Rebooting the NSX Managers, one at a time, and of course waiting for the rebooted node to come back online with no errors before continuing with the next one.

Result

Unfortunately, none of the above helped; the precheck still gave the same error.

Further troubleshooting:

Looking at the API guide, I stumbled over an edge node redeployment call. I have redeployed edge nodes manually before, and I have to say, it's a pain! It's not hard, but it takes a lot of time. This call helps redeploy in a way that doesn't affect the data plane.

  • Edge is being put into NSX maintenance mode
  • Edge is then deleted and a new one is deployed, with the same naming as the old one.
  • After the edge node is deployed and registered in the manager it exits maintenance mode and goes into production again

Executing the API call

First, we get the config for the specific edge. Afterward, we post to the redeploy API with the body that we got from the config get request.

The try/catch helps you get a better error description if something goes wrong.

$redeployBody = Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)" -Headers $Header -Method GET

try {
    Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)?action=redeploy" -Headers $Header -Method POST -ContentType "application/json" -Body ($redeployBody | ConvertTo-Json -Depth 10)
    }
catch {
    $streamReader = [System.IO.StreamReader]::new($_.Exception.Response.GetResponseStream())
    $ErrResp = $streamReader.ReadToEnd() | ConvertFrom-Json
    $streamReader.Close()
    $ErrResp
}

After redeployment, more problems….

After the redeployment of one edge node, the vm_id error was gone for that node. Great! But now the upgrade coordinator gave a new precheck error... I can't remember it exactly, but it said something like "The Host Upgrade Unit Groups are not suitable for a T0."

Google results pointed in the direction of a VMware KB about resetting the upgrade plan, but that was not successful. (I later found out that on the next page of the upgrade wizard, there is a "Reset plan" button.)

I continued to redeploy all the remaining edge nodes, which cleared all the errors from the upgrade precheck. The upgrade could then begin 😀

And another problem…

After upgrading the edge nodes that held the T0 gateway, one of them would not negotiate BGP with the physical fabric. I tried a lot of things.

  • Redeployment of the edge node with the redeploy API action didn’t help.
  • Migrating the edge node to the same hosts as the working edge node was on, didn’t help.
  • Ping from working edge node to non-working on the VLAN uplink IPs worked.
  • Ping from non-working to physical fabric didn’t work.

Then I tried to remove and re-add the T0 interfaces on the non-working edge node. That made everything work again. Quite a random bug?

Conclusion

NSX upgrades compared to NSX-V upgrades seem to be quite troublesome, or let's say there is room for improvement. The good thing is that when the edge nodes are upgraded, all tenants are as well, and the upgrade happens with zero downtime. The T0/T1 robustness is amazing: no drops, no IPsec tunnels going down.

VCD CPI/CCM – Load balancer

When a TKG cluster is deployed by Container Service Extension (CSE), it lives within VMware Cloud Director (VCD).

Inside this TKG cluster, you will find a Cloud Controller Manager (CCM) pod under kube-system called "vmware-cloud-director-ccm". The CCM pod is part of the Cloud Provider Interface (CPI), which gives you capabilities such as adding a Persistent Volume (PV) or doing load balancing with NSX Advanced Load Balancer (ALB). Basically, the CCM contacts the VCD API and from there orchestrates what you requested in your Kubernetes YAML files.

At the time of writing, L4 load balancing (LB) through ALB is the only option available. This is because the CPI does not yet have the features to create L7 LBs with ALB. It's on the roadmap, though.

One-arm vs. two-arm…

I found that there are two concepts worth knowing about: one-arm and two-arm load balancers.

The two LB methods are described here. Since it's an old article, AVI/ALB was not in the VMware portfolio back then, and NSX-T has since migrated away from the LB service that lived as HAProxy within the Tier-1 and over to AVI/ALB service engines.

The default load balancing setting with the VCD CPI is two-arm, meaning it tells VCD to create a DNAT rule towards a 192.168.8.x internal subnet that is used for the ALB VIPs.

WAN > T1 DNAT(185.139.232.x:80)> 192.168.8.x(LB internal subnet) > ALB SE > LB Pool members

Since L7 LB features in VCD are not yet available, and would also become very costly, most customers will probably choose to run an Nginx or Apache ingress controller inside their own Kubernetes cluster.

Since VCD 10.4, the two-arm config has been working, and it is more desirable since you can use multiple ports on a single public IP, whereas the one-arm config would allocate one public IP per Kubernetes service (correct me if I'm wrong).

If you are running your own ingress controller, some find the one-arm approach more desirable, since ALB will then hold the public IP address.

WAN > T1 Static Route to ALB (185.139.232.x) > ALB SE > LB Pool members

How to change to one arm LB?

Hugo Phan has done a good write-up on his blog.

Basically, you download the existing config and change the VCD CPI ConfigMap, removing the part where the 192.168.8.x subnet is defined. After this, you delete the existing CPI and add it back from the YAML file you edited.

Snip from Hugo Phan blog – vmwire.com

How can I use/test this with my VCD TKG cluster?

If you don't have a demo app that you prefer, I can recommend either yelb or retrogames. Here I will do it with yelb. William Lam has done a good write-up and also hosts the deployment files for yelb.

Step 1 – Deploy the application

kubectl create ns yelb
kubectl apply -f https://raw.githubusercontent.com/lamw/vmware-k8s-app-demo/master/yelb-lb.yaml

Step 2 – Check that all pods are running

jeram@QL4QJP2F4N ~ % kubectl -n yelb get pods
NAME                             READY   STATUS    RESTARTS   AGE
redis-server-74556bbcb7-f8c8f    1/1     Running   0          6s
yelb-appserver-d584bb889-6f2gr   1/1     Running   0          6s
yelb-db-694586cd78-27hl5         1/1     Running   0          6s
yelb-ui-8f54fd88c-cdvqq          1/1     Running   0          6s
jeram@QL4QJP2F4N ~ % 

The deployment file asks Kubernetes for a Service of type LoadBalancer; the CCM picks this up and asks the VCD CPI to have ALB create the L4 load balancer.

Task view from VCD

Step 3 – Get the IP and go check out the yelb site

jeram@QL4QJP2F4N ~ % kubectl -n yelb get svc/yelb-ui
NAME      TYPE           CLUSTER-IP     EXTERNAL-IP       PORT(S)        AGE
yelb-ui   LoadBalancer   100.64.53.23   185.177.x.x   80:32047/TCP   5m52s

Step 4 – Scale the UI

Let's see how many instances we have from the initial deployment.

jeram@QL4QJP2F4N ~ % kubectl get rs --namespace yelb
NAME                       DESIRED   CURRENT   READY   AGE
redis-server-74556bbcb7    1         1         1       8m11s
yelb-appserver-d584bb889   1         1         1       8m11s
yelb-db-694586cd78         1         1         1       8m11s
yelb-ui-8f54fd88c          1         1         1       8m11s
jeram@QL4QJP2F4N ~ % 

We can then scale the UI to 3 and check again to see if that happens.

jeram@QL4QJP2F4N ~ % kubectl scale deployment yelb-ui --replicas=3 --namespace yelb
deployment.apps/yelb-ui scaled

jeram@QL4QJP2F4N ~ % kubectl get rs --namespace yelb
NAME                       DESIRED   CURRENT   READY   AGE
redis-server-74556bbcb7    1         1         1       9m45s
yelb-appserver-d584bb889   1         1         1       9m45s
yelb-db-694586cd78         1         1         1       9m45s
yelb-ui-8f54fd88c          3         3         3       9m45s
jeram@QL4QJP2F4N ~ % 

The UI is now scaled to 3 replicas. Seen from the load balancer view in VCD, it will only show the worker nodes, since Kubernetes does its own load balancing across the pod instances.

Step 5 – Cleanup

jeram@QL4QJP2F4N ~ % kubectl -n yelb delete pod,svc --all && kubectl delete namespace yelb
pod "redis-server-74556bbcb7-f8c8f" deleted
pod "yelb-appserver-d584bb889-6f2gr" deleted
pod "yelb-db-694586cd78-27hl5" deleted
pod "yelb-ui-8f54fd88c-6llf7" deleted
pod "yelb-ui-8f54fd88c-9r6wf" deleted
pod "yelb-ui-8f54fd88c-cdvqq" deleted
service "redis-server" deleted
service "yelb-appserver" deleted
service "yelb-db" deleted
service "yelb-ui" deleted
namespace "yelb" deleted
jeram@QL4QJP2F4N ~ % 

Again, CCM will instruct VCD CPI to clean up. NICE!

Conclusion

We now have a good idea of how load balancing works with VCD/TKG-deployed K8s clusters. Of course, we are looking forward to the L7 features, but it's a good start, and VMware is working hard on making K8s deployment and day-2 operations easier.

NSX-T – Topology view troubleshooting

Ever seen the picture below when trying to open the cool topology view in NSX-T?

If so, there is an API call that can resync the topology view so you can see it again.

# Login details
$nsxUsername = "admin"
$nsxPassword = ""
$nsxmanager = "nsxt.home.lab"

# NSX Manager auth header
$Type = "application/json;charset=UTF-8"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword)) }
$nsxUri = "https://$($nsxmanager)"

# Request topology resync
Invoke-RestMethod -Uri "$nsxUri/policy/api/v1/ui/network-topology/resync" -Headers $Header -Method POST -ContentType $Type

You don't get any feedback from the API call, but after a few minutes you will be able to see the topology again.

NSX-V Edge – Disable ALG

In NSX-V 6.4.11 and 6.4.12, a bug was introduced in the Application Layer Gateway (ALG) that caused packet drops.

The release notes of NSX-V 6.4.13 state:

Fixed Issue 2886595: ALG-based services are not working in NSX Data Center for vSphere 6.4.11 and 6.4.12.:

Edge Firewall drops packets for ALG-based services FTP/SFTP/SNMP when using NSX Data Center for vSphere 6.4.11 and 6.4.12.

A temporary workaround is to disable ALG on the affected edge until the NSX-V installation can be upgraded to 6.4.13 or 6.4.14.

Procedure

  • Connect to the NSX Manager as admin and enter enable mode by typing: enable
    Enter engineering mode by typing: st en
  • Enter the NSX Manager root password: IAmOnThePhoneWithTechSupport
    Get the password for the Edge by typing:
    /home/secureall/secureall/sem/WEB-INF/classes/GetSpockEdgePassword.sh edge-ID
[root@nsxvmanager ~]# /home/secureall/secureall/sem/WEB-INF/classes/GetSpockEdgePassword.sh edge-1269
Edge root password:
        edge-1269       -> u8ORKdfFIM$hZ
[root@nsxvmanager ~]#
  • Access the Edge VM console, log in as the admin user and enter enable mode by typing: enable
  • Enable engineering mode by typing: debug engineeringmode enable
  • Enter the root shell on the Edge by typing the password obtained from the NSX manager: st en
  • Run the following commands to make the workaround persist across reboots and to disable the ALG in the kernel without a reboot:
    echo "net.netfilter.nf_conntrack_helper = 1" >> /etc/sysctl.conf
    sysctl net.netfilter.nf_conntrack_helper=1

Conclusion

Since the procedure above is done directly on the edge, it will not survive an edge redeploy. This is because a redeploy takes its configuration from the NSX Manager and does not look at what was done directly on the edge itself.

Of course, the correct solution would be to upgrade the NSX Manager to the latest version and afterward upgrade the edge version.

NSX-T Edge node password management

The default NSX-T password expiration time is set to 90 days. But in a lab environment, this is not required. So here is a bit on how to disable the timer but also how to recover from an expired or forgotten password.

Reset expired password

If the password for any of the users (audit, root, or admin) has expired, you will see it when you try to log in with SSH. It will prompt you for the current password followed by the new one twice. Since this is only a home lab and I would like to keep my usual password, I set a new, quick-to-remember temporary password: Fimmer_old_password1. The SSH session then disconnects and you start a new connection with the new password.

nsx-edge> set user admin password My-New_VMware1!_Password old-password Fimmer_old_password1

After the reset and re-reset, you have another 90 days of password validity. Or you could disable the password expiration...

If you find yourself with a forgotten admin password

You will most likely be able to log in with the root account; even if it has expired, using the console of the edge VM will always work. From there you can use the normal Linux password reset command to reset the admin account password.

passwd admin

And if you have tried a wrong password too many times, you can unlock the account with pam_tally.

pam_tally2 --user admin --reset

Another note: when you are logged in as root, you can still use nsxcli; just wrap your nsxcli commands in su admin -c '<command>', as in the example below.

su admin '-c clear user audit password-expiration'

If you find yourself completely locked out of NSX-T

VMware has some good documentation on this. Basically it is

  • Connect to the console of the appliance and reboot the system. When the GRUB boot menu appears, press the left SHIFT or ESC key quickly. Press e to edit the menu. Press e to edit the selected option.
  • Search for the line starting with linux and add systemd.wants=PasswordRecovery.service to the end of the line. Press Ctrl-X to boot.

Set password to never expire

SSH to the edge node with the admin account. Using nsxcli we can adjust the password expiration. The commands below set the password expiration to 9999 days and clear an expiration that has already happened. VMware has it in their documentation here.

nsx-edge> set user admin password-expiration 9999
nsx-edge> set user root password-expiration 9999
nsx-edge> set user audit password-expiration 9999
nsx-edge> clear user admin password-expiration
nsx-edge> clear user root password-expiration
nsx-edge> clear user audit password-expiration

SSH-keys

Something that is always better than passwords is SSH keys. You can add multiple SSH keys to the same user in NSX-T. The cool thing is that each key has a label, so multiple people can have access with their own SSH key. This way you avoid some of the hassle of having to use passwords with your SSH connections.

nsx-edge> set user admin ssh-keys label jr type ssh-rsa value AAAAB3NzaC1yc2EAAAADAQABAAACAQC/VPq30qzyJHr8v6qh1vF1CVY8R9U09iCkqnIs9H6d9hBOeDu/e52rPj2BOQUfHwBmGRPVqZUyuOO20hDgT/BzP0QxISv9l2OpFariz8AmHu9m4kUwAdrBDvplw8fFeafppUwQF/aFsIF+t1PtFluz0Bp3N/sp3NQGWfkez7myctGc9X3eMc6oUAYrPPJeDZz1x5JoGdwdH/w6wjr3uK03kRx6TX1kNqxSypIQQ/8lYg1TG7yAuF5DhX4fJrPjpiLau1H6z0vChVpqY1q8oMntzHHtYtByFMrNtWFfAvG94BT27h/Lkmz5JM5d41TbL0YdZT8zCTrXzUG87wdEaRiB5ZeKy9LENgfxKO66scSU2gjiXwpyJTrHKZYz9g5EERH/41w+qMT90HAM3ArSIvk7pROoKhZy0IeOwfWbmMlQvKQFjS7OtKnFEeVUYRqnLvi3XeUiFbLxmW3ID8IqQy3iDNuESiVNRcp/PoN7lxL9cfGJdXBuJ3PBcaQZx/vQpePRqW9eBSmhhS1beIUlLV0UOFdRGTMMMjOlp7m7jaw5EnvztbInfPOdMPoUuSL9iGut7M0SVMgEzo0MiJDHNdLQYK0EKO8qrWMz76UHhpdnhOQNdi3/wtVVzxVUR/D9zBa1q2oL8ml7jKVubVbBd6Vm0lEEquDEN3I9Dan/Ev0j9w==

Conclusion

Having an expired password will cause you all sorts of trouble.

If you don’t have a PAM solution that can help you to automatically change the password, then setting the expiration to 9999 days will for sure help your manageability.

Putting your SSH key onto the nodes and managers will help you in the long run, and is in my opinion also a more secure solution than having passwords.

ESXi network routes

I honestly don't know why this is still a problem. Support for routed vMotion traffic was added back in vSphere 6. Here we are at vSphere 7, and we still have to set the gateway/routes for the vMotion stack through esxcli.

Either way, here is how it works

vMotion stack

Each tcp/ip stack can only have one gateway, that makes sense. And if you want to keep your management and vMotion traffic separated you need two tcp/ip stacks.

It’s nicely done through vSphere vCenter GUI and there is a KB for it. And you even have the option to override the default gateway and specify the right one for your vMotion stack.

[root@dc1esxcompx-xx:~] esxcli network ip route ipv4 list -N vmotion
Network     Netmask        Gateway  Interface  Source
----------  -------------  -------  ---------  ------
10.1.115.0  255.255.255.0  0.0.0.0  vmk1       MANUAL

But when looking at the routing table from esxcli, it is not set. If you know why, feel free to give me a kick and enlighten me.

ESXCLI add a static route

So to actually set a default route, I have to do it as shown below.

[root@dc1esxcompx-xx:~] esxcli network ip route ipv4 add -g 10.1.115.1 -n 0.0.0.0/0 -N vmotion

[root@dc1esxcompx-xx:~] esxcli network ip route ipv4 list -N vmotion
Network     Netmask        Gateway     Interface  Source
----------  -------------  ----------  ---------  ------
default     0.0.0.0        10.1.115.1  vmk1       MANUAL
10.1.115.0  255.255.255.0  0.0.0.0     vmk1       MANUAL

PowerCLI

Need to do it on a cluster with multiple hosts? No problem, LucD from the VMware community has got you covered. I only did a little customization and it works for my needs.

connect-viserver -Server 

$stackName = 'vmotion'
$ipGateway = '10.1.115.1'
$ipDevice = 'vmk3'
$cluster = "computexx"
$vmhosts = get-cluster $cluster | get-vmhost

foreach($vmhost in $vmhosts)
{
$esx = Get-VMHost -Name $vmhost
$netSys = Get-View -Id $esx.ExtensionData.ConfigManager.NetworkSystem
$stack = $esx.ExtensionData.Config.Network.NetStackInstance | where{$_.Key -eq $stackName}
$config = New-Object VMware.Vim.HostNetworkConfig
$spec = New-Object VMware.Vim.HostNetworkConfigNetStackSpec
$spec.Operation = [VMware.Vim.ConfigSpecOperation]::edit
$spec.NetStackInstance = $stack
$spec.NetStackInstance.ipRouteConfig.defaultGateway = $ipGateway
$spec.NetStackInstance.ipRouteConfig.gatewayDevice = $ipDevice
$config.NetStackSpec += $spec
$netsys.UpdateNetworkConfig($config,[VMware.Vim.HostConfigChangeMode]::modify)
}
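
To verify the result without opening an SSH session to every host, you can read the routing table back through PowerCLI's Get-EsxCli interface. A small sketch, assuming the netstack name used above:

foreach($vmhost in $vmhosts)
{
    # Read the vMotion netstack routing table back through esxcli (V2 interface)
    $esxcli = Get-EsxCli -VMHost $vmhost -V2
    $esxcli.network.ip.route.ipv4.list.Invoke(@{netstack = $stackName}) | Format-Table Network, Netmask, Gateway, Interface
}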

Conclusion

Manipulating the vMotion stack routing table with either esxcli or PowerCLI works great.

Need to know more? There are plenty of good bloggers and KBs out there.

ssacli – CLI configuring SmartArray

When you install an HPE server with the VMware custom image for HPE servers you automatically get all the HPE tools for configuring the hardware. Neat.

Here is a small guide on how to clean and set up the array. The ssacli tool is located in /opt/smartstorageadmin/ssacli/bin on the ESXi server; start there and follow the next commands.

Show the existing config:

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 ld all show
Smart Array P440ar in Slot 0 (Embedded)
   Array A
      logicaldrive 1 (279.37 GB, RAID 1, Failed)
   Array B
      logicaldrive 2 (3.27 TB, RAID 1+0, OK)

Delete the existing logical drives:

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 ld 1 delete
Warning: Deleting an array can cause other array letters to become renamed.
         E.g. Deleting array A from arrays A,B,C will result in two remaining
         arrays A,B ... not B,C 
Warning: Deleting the specified device(s) will result in data being lost.
         Continue? (y/n)  y
[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 ld 2 delete
Warning: Deleting the specified device(s) will result in data being lost.
         Continue? (y/n)  y
[root@esx2:/opt/smartstorageadmin/ssacli/bin] 

Show all physical drives in the server:

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 pd all show
Smart Array P440ar in Slot 0 (Embedded)
   Array A
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:14 (port 1I:box 1:bay 14, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:15 (port 1I:box 1:bay 15, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:16 (port 1I:box 1:bay 16, SAS HDD, 900 GB, OK)

Create a new volume with available physical drives:

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 create type=ld drives=1I:1:9,1I:1:10,1I:1:11,1I:1:12,1I:1:13,1I:1:14,1I:1:15,1I:1:16 raid=6

Warning: Controller cache is disabled. Enabling logical drive cache will not take effect until this has been resolved.
[root@esx2:/opt/smartstorageadmin/ssacli/bin] 

Conclusion

Nice and easy. Give the HPE ssacli manual a read for more commands, starting at page 57, or use ssacli help.

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli help

CLI Syntax
   A typical SSACLI command line consists of three parts: a target device, 
   a command, and a parameter with values if necessary. Using angle brackets to
   denote a required variable and plain brackets to denote an optional 
   variable, the structure of a typical SSACLI command line is as follows:

      <target> <command> [parameter=value]

   <target> is of format:
      [controller all|slot=#|serialnumber=#]
      [array all|<id>]
...........

VMware CSE – Stuck cluster deployment

After upgrading to CSE 3.1.3 with VCD 10.3.1 I encountered a problem when creating clusters from the Ubuntu 20.04 native cluster template.

Basically, the mstr node would be deployed and started, VMware Tools would become ready, and the first script injection would happen. Then, all of a sudden, the VM would reboot and the cluster creation would fail because it could no longer see the process. This sometimes leaves a cluster in the "Creation in progress" status, but somehow it can no longer be managed.

22-06-02 10:42:34 | cluster_service_2_x:2811 - _wait_for_tools_ready_callback | DEBUG :: waiting for guest tools, status: vm='vim.VirtualMachine:vm-835608', status=guestToolsNotRunning
22-06-02 10:42:39 | cluster_service_2_x:2811 - _wait_for_tools_ready_callback | DEBUG :: waiting for guest tools, status: vm='vim.VirtualMachine:vm-835608', status=guestToolsRunning
22-06-02 10:42:41 | cluster_service_2_x:2817 - _wait_for_guest_execution_callback | DEBUG :: waiting for process 1706 on vm 'vim.VirtualMachine:vm-835608' to finish (1)
22-06-02 10:42:46 | cluster_service_2_x:2817 - _wait_for_guest_execution_callback | DEBUG :: process [0, <Response [200]>, <Response [200]>] on vm 'vim.VirtualMachine:vm-835608' finished, exit code: 0
22-06-02 10:42:46 | cluster_service_2_x:2869 - _execute_script_in_nodes | DEBUG :: about to execute script on mstr-7e34 (vm='vim.VirtualMachine:vm-835608'), wait=True
22-06-02 10:42:48 | cluster_service_2_x:2817 - _wait_for_guest_execution_callback | DEBUG :: waiting for process 1729 on vm 'vim.VirtualMachine:vm-835608' to finish (1)
22-06-02 10:42:58 | cluster_service_2_x:2896 - _execute_script_in_nodes | ERROR :: Error executing script in node mstr-7e34: process not found (pid=1729) (vm='vim.VirtualMachine:vm-835608')
Traceback (most recent call last):
  File "/opt/vmware/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x.py", line 2879, in _execute_script_in_nodes
    callback=_wait_for_guest_execution_callback)

I created an SR with Cloud Director GSS for both the failed deployment and the stuck clusters that now couldn't be deleted. Multiple screen sharing sessions later, still no result.

Then I found the GitHub repo for Container Service Extension; the issue page had a very tempting title, "Failed deployments using TKGm on VCD". Many seem to have the same problem. There was no fix for the deployments, but it seems one guy had the fix for deleting the stuck clusters.

The workaround

You need to find the ID of the user that owns the cluster. In the More > Kubernetes Clusters menu in VCD you can see who the owner is.

When you have the owner, you can go into Administration > Users > <User>. The URL will then contain the ID of the user.

vcd.ramsgaard.me/tenant/tenant1/administration/access-control/users/v9993018-ebf5-4ded-8134-27ddcc4ccbf0/general

With the userId you can fill out the body for the next API call.

$vdchost = "vcd.ramsgaard.me"
$apiusername = "svc-cse@system"
$password = 'Ye.........iks12!'

$base64AuthInfo = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(("{0}:{1}" -f $apiusername,$password)))
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12
$auth =Invoke-WebRequest -Uri "https://$vdchost/api/sessions" -Headers @{Accept = "application/*;version=32.0";Authorization="Basic $base64AuthInfo"} -Method Post

$accessBody = '{
    "grantType": "MembershipAccessControlGrant",
    "accessLevelId": "urn:vcloud:accessLevel:FullControl",
    "memberId": "urn:vcloud:user:e96cf9e8-535f-45d8-8a87-b9dac659f85f"
  }' | ConvertFrom-Json

$status = Invoke-RestMethod -Uri "https://$vdchost/cloudapi/1.0.0/entities/urn:vcloud:type:cse:nativeCluster:2.1.0/accessControls" -Headers @{Accept = "application/json;version=36.1";Authorization="Bearer $($auth.Headers.'X-VMWARE-VCLOUD-ACCESS-TOKEN')"} -ContentType "application/json" -Method post -Body ($accessBody | ConvertTo-Json)

When the API call is done you should now be able to delete the stuck cluster.

If you are so unfortunate that the cluster is stuck in a "not resolved" state and deletion through the VCD GUI still fails, you need to use the vcd cse CLI.

### Log in to the VCD system or tenant organisation
vcd login vcd.ramsgaard.me system jr
### Show clusters
vcd cse cluster list
### Force delete the cluster
vcd cse cluster delete tanzu1 --force

Conclusion:

The problem occurred in the first place due to a bug in VCD 10.3.1; the MQTT bus had an issue and therefore the cluster creation failed. VCD 10.3.2 or 10.3.3 fixes the bug. (Of course, the VMware Tanzu Kubernetes Grid templates should be used in the future.)

It took some time to find the workaround. I hope future versions of CSE will be more fault tolerant, so these situations do not appear.

Until then there is a way to get out of the stuck cluster situation.

Disk mapping Windows <-> VMware – Part 2

A couple of years ago I did a post on how to map your Windows disks to the actual disks in VMware. This post is an extension of it, but with updated commands.

Why do I need to know the mapping? It matters when you stumble upon a VM with many disks attached. If the disks vary in size, you can normally look at those numbers and match them with the disks in VMware, but when all the disks have the same size, that approach becomes difficult.

Windows serial number:

In Windows, we can retrieve the serial number of the disk we need to expand and then map that serial number to the VMware disk. In newer Windows Server versions it's fairly easy to find, but when dealing with versions older than 2012 you are missing PowerShell cmdlets like Get-Disk. Someone on Stack Overflow has a way that works on Windows Server 2008 through 2022.

$DriveLetter = "C:"
Get-CimInstance -ClassName Win32_DiskDrive |
Get-CimAssociatedInstance -Association Win32_DiskDriveToDiskPartition |
Get-CimAssociatedInstance -Association Win32_LogicalDiskToPartition |
Where-Object DeviceId -eq $DriveLetter |
Get-CimAssociatedInstance -Association Win32_LogicalDiskToPartition |
Get-CimAssociatedInstance -Association Win32_DiskDriveToDiskPartition |
Select-Object -Property SerialNumber

VMware disk:

From the VMware side, it's straightforward to find the disk and its serial number. Below is a scripted way of finding the disk and then adding the extra capacity.

Connect-VIServer ""

$VMname = ""
$disksn = "6000c295ec128b3d14472bdbf8e65aee"
$vmDisk = (Get-VM $VMname | Get-HardDisk) | Where-Object {$_.ExtensionData.Backing.uuid.Replace("-","") -eq $disksn } 

$ExpandSizeGb = 50
$vmDisk | Set-HardDisk -CapacityGB ($vmDisk.CapacityGB + $ExpandSizeGb) -Confirm:$false 
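
After the VMware-side expansion, the volume still has to be grown inside the guest. On Windows Server 2012 or newer, a minimal sketch with the built-in Storage module could look like this (drive letter C assumed):

# Rescan storage so Windows sees the new disk size
Update-HostStorageCache
# Grow the partition to the maximum supported size
Resize-Partition -DriveLetter C -Size (Get-PartitionSupportedSize -DriveLetter C).SizeMax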

Conclusion:

Instead of having to guess which disk in Windows maps to which VMware disk, you now have a more automated way. The serial number retrieval commands are compatible with Windows Server versions up to 2022.

Brocade – Fibre Channel zoning with CLI

When having to do zoning in the Brocade world you still have to use an old Java GUI client. It is not always easy to open the GUI, since not all Brocade switches have been updated, and therefore all kinds of Java security will prevent you from doing your job. With the CLI we can skip the GUI and get down to business instantly.

CLI config also has the great advantage of being sort of self-documenting 🙂 Here we will look a bit into doing what I would call basic zoning: a new device has been added to your Fibre Channel fabric and you need to create a zone for the new WWN and the storage system.

Adding ssh key

This is, in my opinion, a good thing to set up; it makes it way easier for you to log in the next time. It is of course optional, but a great advantage.

The public ssh key is gathered from a system that allows SSH-based logins. See below for info on how to set it up.
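
If that system does not already have a key pair, a minimal sketch like the one below creates one with OpenSSH and places the public key where the switch dialogue below expects it; the /home/username path and username.pub name are just the examples used in that dialogue.

# Generate an RSA key pair (accept the default ~/.ssh/id_rsa location)
ssh-keygen -t rsa -b 2048
# Copy the public key to the name and directory the switch will ask for
cp ~/.ssh/id_rsa.pub /home/username/username.pub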

SAN-SW-03:admin> sshutil importpubkey
Enter user name for whom key is imported:admin
Enter IP address:10.1.100.20
Enter remote directory:/home/username
Enter public key name(must have .pub suffix):username.pub
Enter login name:username
username@10.1.100.20's password: 
public key is imported successfully.

Create alias

Below you first see the syntax, and afterward the actual command used in production.

# Syntax
alicreate "ALIAS_NAME", "WWPN"

# Production command
alicreate "dc2esxmgmt1_11", "50:01:43:80:31:80:50:98"

Create zone

# Syntax
zonecreate "ZONE_NAME", "WWPN_alias_1;WWPN_alias_2"

# Production command
zonecreate "dc2storv5030_dc2esxmgmt1_11", "dc2esxmgmt1_11;dc2storv5030_01_01_p1v;dc2storv5030_01_01_p2v;dc2storv5030_01_02_p1v;dc2storv5030_01_02_p2v"

Add zone to config

cfgadd "fabric1", "dc2storv5030_dc2esxmgmt1_11"

SAN-SW-03:admin> cfgsave
WARNING!!!
The changes you are attempting to save will render the
Effective configuration and the Defined configuration
inconsistent. The inconsistency will result in different
Effective Zoning configurations for switches in the fabric if
a zone merge or HA failover happens. To avoid inconsistency
it is recommended to commit the configurations using the
'cfgenable' command.

Do you want to proceed with saving the Defined
zoning configuration only?  (yes, y, no, n): [no] yes
Updating flash ...

Enable new config

SAN-SW-03:admin> cfgenable fabric1 
You are about to enable a new zoning configuration.
This action will replace the old zoning configuration with the
current configuration selected. If the update includes changes 
to one or more traffic isolation zones, the update may result in  
localized disruption to traffic on ports associated with
the traffic isolation zone changes.
Do you want to enable 'fabric1' configuration  (yes, y, no, n): [no] yes
zone config "fabric1" is in effect
Updating flash ...

Conclusion

It’s easy to do basic Fibre Channel zoning. The only thing that I miss from the GUI, and haven’t found yet, is a view of all discovered WWNs from which to pick the WWN for the new alias you create. The server management system, in this case, shows the WWN of the FC adapter, and it’s easy to copy-paste into the CLI commands.

So if you don’t have HPE OneView or some other fancy FC provisioning tool, you now know a basic and low-key way of doing it.

And again, did I mention that when using the CLI it basically documents itself? 😉

Juniper upgrade process

Junos is, in my opinion, an awesome OS for your network. I enjoy the CLI, where commands are alike across all of Juniper’s products. Also, the many features and the fact that it’s not Cisco.

BUT it also has its drawbacks. Honestly, I have seen some weird bugs. And keeping track of all the PRs from Juniper is a full-time job. And last but not least, the software upgrades are kind of a pain, especially on Junos devices older than 18.x.

EX3400 – format/install

For this case, I had a new EX3400, but with older firmware, 15.1X53-D58.3. I needed to upgrade to the latest SR in the newest train, but from the CLI of the device only jumps of three firmware versions at a time are supported.

15.1 > 18.1 > 18.4 > 19.3 > 20.2 > 21.1

But you can also do a format/install where you interrupt the boot process and then load a new firmware image on the device from a TFTP server. This is all done outside of Junos. This way you can jump to whatever version you want.

Jumping many versions might make your config invalid, so be aware.

Juniper has a LOT of KB articles for this process and they all vary, so here is the process in my own words.

Process of format install

First, we need to get the right image from the Juniper support site. It needs to be the install image, and the extension is .tgz.

  • Download the image into your TFTP server.

In my case, the TFTP server is a Linux box. If you prefer Windows, then TFTPd3264 is the way to go; for macOS, look here.
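
For reference, a rough sketch of getting a TFTP server running on a Debian/Ubuntu box could look like the commands below; this assumes the tftpd-hpa package, which on Ubuntu serves files from /srv/tftp by default.

# Install and start a simple TFTP daemon (serves /srv/tftp by default on Ubuntu)
apt-get install tftpd-hpa
systemctl enable --now tftpd-hpa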

root@tftp:/srv/tftp# wget -O junos-install-media-net-ex-arm-32-21.4R1.12.tgz  'https://cdn.juniper.net/software/junos/21.4R1.12/junos-install-media-net-ex-arm-32-21.4R1.12.tgz?SM_USER=jv......5ce43fbdad2'
Resolving cdn.juniper.net (cdn.juniper.net)... 23.78.40.231
Connecting to cdn.juniper.net (cdn.juniper.net)|23.78.40.231|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 393745989 (376M) [application/octet-stream]
Saving to: ‘junos-install-media-net-ex-arm-32-21.4R1.12.tgz’

junos-install-media-net-ex-arm-32-21.4R1.12.tgz      100%[==================================================================>] 375.50M  3.48MB/s    in 2m 4s

2022-01-26 20:47:46 (3.03 MB/s) - ‘junos-install-media-net-ex-arm-32-21.4R1.12.tgz’ saved [393745989/393745989]

root@tftp:/srv/tftp# ls
junos-install-media-net-ex-arm-32-21.4R1.12.tgz
  • Now let’s reboot the switch and interrupt the “first” boot loader. Just keep hitting Ctrl+C after the reboot; when you see the “=>” prompt you are in the right place. Here we set the IP address on the me0 interface and boot into the next boot loader.
Board: EX3400-24T
Base MAC: C00380FAAD2E
arm_clk=1000MHz, axi_clk=500MHz, apb_clk=125MHz, arm_periph_clk=500MHz
Net:   Registering eth
Broadcom BCM IPROC Ethernet driver 0.1
Using GMAC0 (0x18022000)
et0: ethHw_chipAttach: Chip ID: 0xdc14; phyaddr: 0x1
et0: gmac_serdes_init read sdctl(0xf4141c)
et0: gmac_serdes_init() serdes_status0: 0xf100ff00; serdes_status1: 0xf00
et0: gmac_serdes_init() PLL ready brought up exit
serdes_reset_core pbyaddr(0x1) id2(0xf)
bcmiproc_eth-0
Last Reset Reason: 0
Hit ^C to stop autoboot:  0
=>setenv ipaddr 10.1.100.253
=>setenv gatewayip 10.1.100.1
=>setenv netmask 255.255.255.0
=>setenv serverip 10.1.101.130
=>save
=>boot
Saving Environment to SPI Flash...
SF: Detected MX25L6405D with page size 256 Bytes, erase size 64 KiB, total 8 MiB, mapped at 0001faa0
Erasing SPI flash...Writing to SPI flash...done
Erasing SPI flash...Writing to SPI flash...done
SF: Detected MX25L6405D with page size 256 Bytes, erase size 64 KiB, total 8 MiB
device 0 offset 0x3c0000, size 0x10000
SF: 65536 bytes @ 0x3c0000 Read: OK
  • Wait a few seconds for the next boot loader to appear and press Ctrl+C again. You will now see a menu; choose 5 and then 5 again, and you should see the “loader>” prompt.
Hit ^C to stop autoboot:  0 
Options Menu

1.  Recover [J]unos volume
2.  Recovery mode - [C]LI

3.  Check [F]ile system
4.  Enable [V]erbose boot
5.  [B]oot prompt
6.  [M]ain menu
Choice: 
Type 'menu' to go back to the menu
Type 'boot-junos' to boot into Junos
Type 'reboot' to reboot

5 5
  • We now run the install --format command with the TFTP location of the image we downloaded in the first step.
Type '?' for a list of commands, 'help' for more detailed help.
loader> install --format tftp://10.1.101.130/junos-install-media-net-ex-arm-32-21.4R1.12.tgz
/kernel text=0x105b888 data=0x640fc+0x1fbf04 syms=[0x4+0x914a0+0x4+0x9b821]
/ex3400.dtb size=0x1f76
/crypto.ko text=0x419e0 data=0xe58+0x2a0 syms=[0x4+0x4740+0x4+0x2ba5]
/iflib.ko text=0x11f10 data=0x910+0x58 syms=[0x4+0x2b10+0x4+0x2194]
/miibus.ko text=0x19f38 data=0x10c4+0x78 syms=[0x4+0x51f0+0x4+0x3491]
/if_gmac.ko text=0xbc3c data=0x688+0xc syms=[0x4+0x1cc0+0x4+0x15ad]
/contents.iso size=0x279b000
Using DTB from loaded file '/ex3400.dtb'.
Kernel entry at 0xc1000180...
Kernel args: (null)
---<<BOOT>>---
GDB: no debug ports present
K cache
Release APs
WARNING: WITNESS option enabled, expect reduced performance.
mwill now attempt to reach the remote host.
<====== LOADS OF OUTPUT TO CONSOLE ======>
<====== LOADS OF OUTPUT TO CONSOLE ======>
Downloading /junos-install-media-net-ex-arm-32-21.4R1.12.tgz from 10.1.101.130 ...
rmed on 1024 samples passed.t-up health tests perfo
  300.6MB  03:52random: unblocking device.
  393.7MB  05:04
Installing Junos OS release ...

After 15-20 minutes the switch will have finished the install and be ready for you to log into and start loading your config.

FreeBSD/arm (Amnesiac) (ttyu0)
login: 

Conclusion

This is a very helpful process and might come in handy when you have new switches with old firmware that needs upgrading. Skipping the smaller version jumps is a time saver.

This format install process can also be done with a USB key. This process is also quite simple but requires you to have physical access to the switch.

In my case, I have a console over ssh and can manage the switch out-of-band so TFTP is the easy way.

Veeam – retrieve saved passwords from VBR

Ever needed to retrieve a saved Veeam password? I did – found the process for it on the Veeam forum.

  • Open SQL Server Management Studio as administrator and connect to the Veeam DB instance
  • Run query from below on the VeeamBackup database
SELECT TOP (1000) [id]
,[user_name]
,[password]
,[usn]
,[description]
,[visible]
,[change_time_utc]
FROM [VeeamBackup].[dbo].[Credentials]
Query the Veeam DB for all stored credentials to backup infrastructure components

Get the encrypted password string from the results (match the description to the one you need). Then run the PowerShell below with the string you grabbed.

# Load the Veeam library that can decrypt locally protected strings (run this on the VBR server itself)
Add-Type -Path "C:\Program Files\Veeam\Backup and Replication\Backup\Veeam.Backup.Common.dll"
# Paste the encrypted password string from the SQL query result
$encoded = 'AQAAANCM....RhQ'
[Veeam.Backup.Common.ProtectedStorage]::GetLocalString($encoded)
Password revealed and ready to use

Conclusion:

Is this a security problem? It depends, but it is a reminder of how important it is to keep your Veeam VBR server safe. Never domain join it, and keep the firewall as closed as possible. If a malicious person gets onto your Veeam server they can grab the keys for the rest of your infrastructure, including your backups of course. In most cases that would mean game over.

Faster and more scripted way:

# Read the SQL server and instance name for the Veeam DB from the registry
$instance = (Get-ItemProperty -Path "HKLM:\SOFTWARE\Veeam\Veeam Backup and Replication" -name SqlInstanceName).SqlInstanceName
$server = (Get-ItemProperty -Path "HKLM:\SOFTWARE\Veeam\Veeam Backup and Replication" -name SqlServerName).SqlServerName
# Invoke-Sqlcmd requires the SqlServer (or older SQLPS) PowerShell module
$result = Invoke-Sqlcmd -Query "SELECT TOP (1000) [user_name],[password],[description] FROM [VeeamBackup].[dbo].[Credentials]" -ServerInstance "$server\$instance"
Add-Type -Path "C:\Program Files\Veeam\Backup and Replication\Backup\Veeam.Backup.Common.dll"
# Decrypt every stored credential
$result | ForEach-Object { [Veeam.Backup.Common.ProtectedStorage]::GetLocalString($($_.password))}
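
If you want the decrypted password shown next to the account it belongs to, a small variation of the last line (a calculated property on the same result set) could look like this:

# Show user name, description and the decrypted password side by side
$result | Select-Object -Property user_name, description, @{Name='PlainTextPassword'; Expression={ [Veeam.Backup.Common.ProtectedStorage]::GetLocalString($_.password) }}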

Cloud Director 10.3 – Update certificates

Since my last article on how to update Cloud Director SSL certificates, there has been a major change. No more binary java truststore – jaaay.

Cloud Director has changed over to what I think is a better and more normal way of storing the private and public keys, which is PEM format. According to the release notes, the change actually happened in 10.2, but the certificate path changed again in 10.3. If you are in doubt about where the certificate path is, then look inside global.properties.

/opt/vmware/vcloud-director/etc/global.properties

VMware’s own documentation states that we can now just swap the .pem files, use the cell-management-tool to import them, and restart the cell.

What we will do and what is needed

  • Get a new public signed certificate
    • Either in PEM format as .key and .pem (certificate including intermediate)
    • Or in PFX so it can be exported
  • Backup existing certificates
  • Replace existing certificates with your new certificate
  • Run VCD tool to import and define the private key encryption password
  • Restart cell(s)

Process

If you have a PFX you can use this article to extract the key and cert. If you already have the two files, .key and .pem, then you can proceed.
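
As a rough sketch, extracting the two files from a PFX with openssl could look like the commands below; the file names are just examples, and openssl will prompt for the PFX password and a passphrase for the exported key.

# Extract the private key from the PFX (kept encrypted with a passphrase you choose)
openssl pkcs12 -in certificate.pfx -nocerts -out user.http.key
# Extract the server certificate
openssl pkcs12 -in certificate.pfx -clcerts -nokeys -out user.http.pem
# Extract any CA/intermediate certificates so they can be appended to the .pem file
openssl pkcs12 -in certificate.pfx -cacerts -nokeys -out ca-chain.pem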

We will follow VMware documentation and create a backup of the existing files.

cp /opt/vmware/vcloud-director/etc/user.http.pem /opt/vmware/vcloud-director/etc/user.http.pem.original
cp /opt/vmware/vcloud-director/etc/user.http.key /opt/vmware/vcloud-director/etc/user.http.key.original
cp /opt/vmware/vcloud-director/etc/user.consoleproxy.pem /opt/vmware/vcloud-director/etc/user.consoleproxy.pem.original
cp /opt/vmware/vcloud-director/etc/user.consoleproxy.key /opt/vmware/vcloud-director/etc/user.consoleproxy.key.original

Now we can either SCP in our key and certificate, or edit and replace the content of the files on the server by copying and pasting in the content from the files you have. Whatever you find to be the easiest.

Forgot your root password for the Cloud Director appliance? Of course not. But anyway, here is a link to reset it....

After the “user.http.pem/key” and “user.consoleproxy.pem/key” files have been updated with the new certificate data, we can tell Cloud Director to update its config with the commands below. This is done to update the encryption password for the private key.

If you don’t care about security you can also update without --key-password, but then of course your private key will need to be in an unencrypted format in the .key files.

/opt/vmware/vcloud-director/bin/cell-management-tool certificates -j --cert /opt/vmware/vcloud-director/etc/user.consoleproxy.pem --key /opt/vmware/vcloud-director/etc/user.consoleproxy.key --key-password PASSWD
/opt/vmware/vcloud-director/bin/cell-management-tool certificates -p --cert /opt/vmware/vcloud-director/etc/user.http.pem --key /opt/vmware/vcloud-director/etc/user.http.key --key-password PASSWD

If everything works out it will tell you the certificates have been updated and you need to restart VCD for it to take effect.

SSL configuration has been updated. You will need to restart the cell for changes to take effect.

Now safely shut down your cell(s) with the command below. This will ensure that VCD is only shut down once all running tasks are done.

/opt/vmware/vcloud-director/bin/cell-management-tool cell -i $(service vmware-vcd pid cell) -s

Start again with the command below

systemctl start vmware-vcd

Conclusion

VMware has made it much easier to change a certificate in Cloud Director. The new way of storing certificates is a very welcome change.

I did see a few different locations for the .key and .pem files depending on the version and on whether the cells were created on raw Linux or as an appliance, but you can always check the config file placed in the same folder as the certificates.

Storage DRS recommendations – with PowerCLI

Many things can happen when you let Storage DRS run fully automated. If you have it on from the beginning it will probably only do good things. But enabling it on a large, space-imbalanced storage cluster might be a bit too risky.

There are many things that Storage DRS is not aware of, like the underlying storage running out of space on a pool/aggregate, or the operations being too IO-heavy to run within business hours.

Call me a wimp, but in this case it seems better to be in control and apply the recommendations little by little. But having to use the GUI is a pain: you need to go into Storage Cluster > Monitor > Storage DRS > Recommendations, and from there you need to override the selections and uncheck boxes so you can run smaller batches of Storage vMotions.

I will just use VMware PowerCLI cmdlets…

Well, unfortunately not all of the vSphere API is exposed through PowerCLI cmdlets, but after a bit of googling it seemed quite easy to call the SDK API directly from within PowerShell.

One post that came to my attention contained most of the code needed.

Solution:

I’m not that deep into what the ServiceInstance or StorageResourceManager is, but I expect them to be the API objects instantiated through PowerShell, each exposing the operations in which you can find the functionality you are looking for.

 # DSC you want to work with
$dscName = 'DatastoreCluster'

# Get DSC info
$dsc = Get-View -ViewType StoragePod -Filter @{'Name'=$dscName}

# Get Service Instance
$si = Get-View ServiceInstance

# Get the StorageResourceManager
$storMgr = Get-View -Id $si.Content.StorageResourceManager

# Refresh SDRS Recommendation on DSC
$storMgr.RefreshStorageDrsRecommendation($dsc.MoRef)

# Update dsc object with fresh recommendation data
$dsc.UpdateViewData()

# Filter on reason for storage balance. Select only 40 VMs.
$balance = $dsc.PodStorageDrsEntry.Recommendation | Where-Object {$_.Reason -eq "balanceDatastoreSpaceUsage"}  | Select-Object -First 40

# Do a run of each VM and start the storage vMotion process
foreach($vm in $balance){
   $message = "Moving VM: {0} to datastore: {1}" -f $(get-vm -id $("VirtualMachine-"+$($vm.Action[0].Target.Value))).name, $(get-datastore -id $vm.Action[0].Destination).name
   write-host $message -ForegroundColor Green
   $storMgr.ApplyStorageDrsRecommendationToPod($dsc.MoRef,$vm.Key)
} 

Conclusion:

I was expecting to use some PowerCLI cmdlets to do my granular balancing of the storage cluster. Unfortunately, those did not exist.

But from the great community, I found out how to use the vSphere API through PowerShell and in the end got the functionality I was looking for.

Maybe there is an easier way to do the same; if so, let me know. Until next time, I have a bit of vSphere SDK googling to do.