VCD – Force delete network

In our v2t conversion, the NSX for Cloud Director migration tool has had some issues when doing cleanup. One of them is that it cant delete the old NSX-V backed network even though there is nothing left in VCD using it. The error message can be seen below.

2023-05-22 10:54:28,551 [connectionpool]:[_make_request]:452 [DEBUG] [tenant.01] | https://vcd.ramsgaard.me:443 "DELETE /cloudapi/1.0.0/orgVdcNetworks/urn:vcloud:network:ce108a33-fa5c-4cae-8c16-60edd536ad20 HTTP/1.1" 400 None
2023-05-22 10:54:28,556 [vcdOperations]:[deleteOrgVDCNetworks]:1090 [DEBUG] [tenant.01] | Failed to delete Organization VDC Network lan.[ 1ca6fd03-de82-4835-b12e-58c5c043b2bc ] Network lan cannot be deleted, because it is in use by the following vApp Networks: lan.
2023-05-22 10:54:28,556 [vcdNSXMigratorCleanup]:[run]:230 [ERROR] [tenant.01] | Failed to delete Org VDC networks ['lan'] - as it is in use
Traceback (most recent call last):
  File "src\vcdNSXMigratorCleanup.py", line 218, in run
  File "<string>", line 1, in <module>
  File "src\core\vcd\vcdValidations.py", line 53, in inner
  File "src\core\vcd\vcdOperations.py", line 1094, in deleteOrgVDCNetworks
Exception: Failed to delete Org VDC networks ['lan'] - as it is in use

I found someone else having this problem, where they discovered a forceful way to delete the network. I have used this but wrapped it in Powershell instead. In my case, it can get the network URN from the log of the migration tools. Else you can also easily see the URN from the GUI URL when in the context of the network.

### Variables
$vcdUrl = "https://vcd.ramsgaard.me"
$apiusername = "@system"
$password = ''
$networkUrn = "urn:vcloud:network:ce108a33-fa5c-4cae-8c16-60edd536ad20"

### Auth against API and enable TLS1.2 for PowerShell
$base64AuthInfo = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(("{0}:{1}" -f $apiusername,$password)))
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12
$auth =Invoke-WebRequest -Uri "$vcdUrl/api/sessions" -Headers @{Accept = "application/*;version=36.0";Authorization="Basic $base64AuthInfo"} -Method Post

### Get VirtualWire
$virtualWire = Invoke-RestMethod -Uri "$vcdUrl/cloudapi/1.0.0/orgVdcNetworks/$($networkUrn)" -Headers @{Accept = "application/json;version=36.0";Authorization="Bearer $($auth.Headers.'X-VMWARE-VCLOUD-ACCESS-TOKEN')"} -Method GET
$virtualWire

### Delete VirtualWire
$deleteStatus = Invoke-RestMethod -Uri "https://vcd.hostcenter.dk/cloudapi/1.0.0/orgVdcNetworks/$($networkUrn)?force=true" -Headers @{Accept = "application/json;version=36.0";Authorization="Bearer $($auth.Headers.'X-VMWARE-VCLOUD-ACCESS-TOKEN')"} -Method DELETE
$deleteStatus

Above PowerShell is used at your own risk 🙂

Basic VMware PhotonOS config

Still a bit new to PhotonOS. But it’s getting more and more that I use PhotonOS for deploying things with, like CSE for VCD.

I always struggle with the basics, even though I know a bit about Linux and Unix. So here is come quick tips on

  • First login
  • Network
  • SSH
  • Passwords
  • NTP

First login

When the VM is deployed you can now login to it with root and password “changeme”. It will then require you to change the password to something else.

Network

You can view network interfaces with networkctrl and IP a to show what is already configured.

If you want to set it to a static ip you have to create a new config file.

cat > /etc/systemd/network/10-static-eth0.network << "EOF"
 
[Match]
Name=eth0
 
[Network]
Address=172.16.4.225/24
Gateway=172.16.4.1
DNS=172.16.4.10
Domains=home.lab
EOF

After the file has been created you can set the correct permissions and restart the network and resolver. If you skip the chmod you will probably see a fail in network reboot due to the system not being able to read the new file

chmod 644 /etc/systemd/network/10-static-eth0.network
systemctl restart systemd-networkd
systemdctl restart systemd-resolved

SSH config

When you want to connect to PhotonOS with a password but also have either your pageant or native ssh console running you will try to authenticate with a public/private key. If your key is not added to the server you will get a “Too many failed authentications”. To get around this you can use a parameter.

ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no <hostname>

This will then make authentication with a password.

Forgot root password

If you forgot the root password and need to reset it you can do the following.

  1. reboot and press “e” to enter the grub boot loader. add the line rw init=/bin/bash shown on the screenshot. Press F10 to boot

2. You should now be in a shell on the system, from here you can run passwd

3. Unmount / and reboot the system

umount /
reboot -f

Disable password history

If you encounter the problem of trying to reset the password but the password you want to set says: Password has been already used. Choose another.

You can disable this by editing the /etc/pam.d/system-password.

By changing ‘remember ‘ from 5 to 0 we can disable the remember password count and reset the root password.

Paste into PhotonOS PuTTy session

When using vi there is a “bug” that prevents you from normal paste with right-click. To workaround this you need to write :set mouse= in vi command mode. After done you can not use paste with right-click.

Shift+Ins might be also used to paste on PuTTY.

NTP settings

Some apps running in PhotonOS are very time-sensitive, vcda is one of them. use watch -n 0,1 date on each of the appliances that need to communicate and verify that time is not skewed.

If you need to set up NTP post-deployment you can do as follows. Edit the timesyncd.conf file with a text editor such as vi:

vi /etc/systemd/timesyncd.conf

In the [Time] section edit the NTP entry with the correct NTP server address:

[Time]
#FallbackNTP=time1.google.com time2.google.com time3.google.com time4.google.com
NTP=ntpAddress

After having put in the NTP server you want to use restart the network and time sync service

systemctl restart systemd-networkd
systemctl restart systemd-timesyncd

Verify that the time on the appliance is now synchronizing with the NTP server.

Further troubleshooting is to see if the NTP service is running

systemctl status systemd-timesyncd

NSX 4.0.1 > 4.1.0 upgrade problems

Precheck gave warnings back for all the edge nodes. Where it stated the problem below.

Edge node 4006d386-a394-43a4-6b04b242f8b3 vmId is not found on NSX Manager. Please refer to https://kb.vmware.com/s/article/90072

The KB article states that NSX managers are missing the VM_ID for the edge nodes and gave an example of how to manually find the Edge VM moref and post it to the NSX API.

Using PowerShell to update the VM_ID

Instead of the manual procedure from the KB, I made a small script.

### Login details
$nsxUsername = "admin"
$nsxPassword = "Yi....kes12!"
$nsxmanager = "nsxt.home.lab"

### Connect to vcenter so that we can fetch moref
Connect-VIServer vcsa1.home.lab

### NSX Manager auth header
$Type = "application/json;charset=UTF-8"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword)) }
$nsxUri = "https://$($nsxmanager)"

### Edge Vm moref update
$edgenodes =  (Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes" -Headers $Header -Method GET -ContentType $Type).results | Where-Object {$_.node_deployment_info.deployment_type -eq "VIRTUAL_MACHINE"}

### Loop through the edge nodes
foreach($edgenode in $edgenodes){
write-host "Updating edge node - $($edgenode.display_name)"
$specEdge =  (Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)" -Headers $Header -Method GET -ContentType $Type).node_deployment_info.deployment_config

$vmid = ((get-vm $edgenode.display_name).Id).Split("-")[-1]
write-host "Found edge node moref in vcenter - vm-$vmid)"

write-host "Removing form factore and adding vm_id to object)"
$specEdge | Add-Member -NotePropertyName vm_id -NotePropertyValue "vm-$vmid"
$specEdge = $specEdge | Select-Object -Property * -ExcludeProperty form_factor

try {
    write-host "Updating against NSX API"
    Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)?action=addOrUpdatePlacementReferences" -Headers $Header -Method POST -ContentType "application/json" -Body $($specEdge|ConvertTo-Json -Depth 10)
}
catch {
   $streamReader = [System.IO.StreamReader]::new($_.Exception.Response.GetResponseStream())
   $ErrResp = $streamReader.ReadToEnd() | ConvertFrom-Json
   $streamReader.Close()
  }
if($ErrResp){
   write-host "$($ErrResp.error_message) - $($edgenode.display_name) not updated with success)" -ForegroundColor Red }
   else{
   write-host "$($edgenode.display_name) updated with success)"
   }
}

Unfortunately, even after updating vm_id the precheck still failed, with the same error. NSX API accepted with code 200 the post nothing happened behind the scenes

VMware support:

Opening a VMware SR, they got logs and the info that we already tried to update the vm_id. They asked us to do a couple of other things.

Refreshing the edge node config data

VMware SR stated us to try and do an API call for refreshing the edge node config data. VMware code NSX API info. With the info from the previous script, we can do a POST against NSX API.

Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)?action=refresh_node_configuration&resource_type=EdgeNode&read_only=true" -Headers $Header -Method POST

Reboot of edge nodes

Put an Edge node into NSX maintenance mode and afterward do a reboot of the node, initiated from vCenter with a “Restart Guest OS”. The reboot went fine and the Edge node was put into production again.

Reboot of NSX Managers

Rebooting the NSX Managers, one at a time, and of cause waiting for the rebooted node to come back online with no errors before continuing with the next one.

Result

Unfortunately, non of the above helped, precheck still gives the same error.

Further troubleshooting:

Looking at the API guide I stumbled over an edge node redeployment call. I have redeployed the edge nodes manual before, and I have to say, it’s a pain! It’s not hard, but it takes a lot of time. But this call will help to redeploy in a way that doesn’t affect the data plane.

  • Edge is being put into NSX maintenance mode
  • Edge is then deleted and a new one is deployed, with the same naming as the old one.
  • After the edge node is deployed and registered in the manager it exits maintenance mode and goes into production again

Executing the API call

First, we get the config for the specific edge. Afterward, we post to the redeploy API with the body that we got from the config get request.

The try-catch is helping you get a better error description, if something goes wrong.

$redeployBody = (Invoke-RestMethod -Uri $nsxUri/api/v1/transport-nodes/$($edgenode.id) -Headers $Header -Method GET) 

try {
    Invoke-RestMethod -Uri "$nsxUri/api/v1/transport-nodes/$($edgenode.id)?action=redeploy" -Headers $Header -Method POST -ContentType "application/json" -body $redeployBody
    }
catch {
    $streamReader = [System.IO.StreamReader]::new($_.Exception.Response.GetResponseStream())
    $ErrResp = $streamReader.ReadToEnd() | ConvertFrom-Json
    $streamReader.Close()
    $ErrResp
}

After redeployment, more problems….

After the redeployment of one edge node, the error with vm_id was no more for the redeployed node. Great! but now the upgrade coordinator gave a new precheck error… I can’t exactly remember, but it said something like “The Host Upgrade Unit Groups are not suitable for a T0.”

Google results pointed in the direction of VMware KB to reset the upgrade plan. But was not successful. (I then later found out that on the next page of the upgrade wizard, there is a “Reset plan” button)

Continued to redeploy all the remaining edge nodes, this helped clear all the errors from the upgrade precheck. The upgrade could then begin 😀

And another problem…

After the edge nodes that held the T0 gateway, one of the edge nodes was not negotiating up BGP to physical fabric. Tried a lot of things.

  • Redeployment of the edge node with the redeploy API action didn’t help.
  • Migrating the edge node to the same hosts as the working edge node was on, didn’t help.
  • Ping from working edge node to non-working on the VLAN uplink IPs worked.
  • Ping from non-working to physical fabric didn’t work.

Then tried to remove and add the interfaces of the T0 on the non-working edge node. Made everything work again. A quite random bug?

Conclusion

NSX upgrades compared to NSX-V upgrades seem to be quite troublesome or let’s say that there is room for improvements. The good thing is that when edge nodes are upgraded then all tenants are also. And the upgrade happens with zero downtime. The T0/T1 robustness is amazing, no drops, no IPSec go down.

VCD CPI/CCM – Load balancer

When a TKG cluster is deployed by Container Service Extension(CSE) it means that it lives within VMware Cloud Director(VCD).

Inside this TKG cluster, you will find a Cloud Controller Manager(CCM) pod under kube-system called “vmware-cloud-director-ccm”. CCM pod is part of Cloud Provider Interface(CPI) that gives you some capabilities on how for example to add a Persistent Volume(PV) or do load balancing with NSX Advanced Loadbalancer(ALB). Basically, the CCM will contact the VCD CPI API and from there orchestrate what you requested in your Kubernetes YAML files.

At the time of writing, L4 load balancing(LB) features through ALB are the only option available. This is because the CPI is not yet completely featured to create L7 LB with ALB. It’s on the roadmap though.

One-arm vs. two-arm…

I found that there are two conspects that are with knowing of, one-arm and two-arm load balanceres.

The two LB methods are described here. Since it’s an old article, AVI/ALB was not in the VMware portfolio back then. And also NSX-T has migrated away from the LB service where it lived as haproxy within tier1 and over to AVI/ALB service engines.

The default setting of load balancing with VCD CPI is two arms, meaning that it will tell VCD to create a DNAT rule towards a 192.168.8.x internal subnet used to create ALB VIPs.

WAN > T1 DNAT(185.139.232.x:80)> 192.168.8.x(LB internal subnet) > ALB SE > LB Pool members

Since L7 LB features in VCD are not yet available AND also it will become very costly. Most customers will probably choose to have an Nginx or Apache ingress controller inside their own Kubernetes cluster.

Since VCD 10.4 the two-arm config has been working and therefore it’s more desirable since you can use multiple ports on a single public IP. Where one-arm config would allocate one public IP pr Kubernetes service(correct me if I’m wrong).

If you are running your own ingress controller then some find the one-arm approach more desirable since ALB will then hold the public IP address.

WAN > T1 Static Route to ALB (185.139.232.x) > ALB SE > LB Pool members

How to change to one arm LB?

Hugo Phan has done a good write-up on this blog.

Basically, it’s downloading the existing config and changing the config map of VCD CPI removing the part where it 192.168.8.x subnet is defined. After this, you delete the existing CPI and then add it from the yaml file that you edited.

Snip from Hugo Phan blog – vmwire.com

How can I use/test this with my VCD TKG cluster?

If you don’t have a demo app that you prefer, then I can recommend either yelp or retrogames. Here I will do it with yelp. William Lam has done a good write-up and also hosts deployment files for yelp.

Step 1 – Deploy the application

kubectl create ns yelb
kubectl apply -f https://raw.githubusercontent.com/lamw/vmware-k8s-app-demo/master/yelb-lb.yaml

Step 2 – Check that all pods are running

jeram@QL4QJP2F4N ~ % kubectl -n yelb get pods
NAME                             READY   STATUS    RESTARTS   AGE
redis-server-74556bbcb7-f8c8f    1/1     Running   0          6s
yelb-appserver-d584bb889-6f2gr   1/1     Running   0          6s
yelb-db-694586cd78-27hl5         1/1     Running   0          6s
yelb-ui-8f54fd88c-cdvqq          1/1     Running   0          6s
jeram@QL4QJP2F4N ~ % 

The deployment file is asking k8s for a service of a load balancer, CCM picks this up and asks VCD CPI to have ALB creating the L4 load balancing.

Task view from VCD

Step 3 – Get the IP and go check out the yelb site

jeram@QL4QJP2F4N ~ % kubectl -n yelb get svc/yelb-ui
NAME      TYPE           CLUSTER-IP     EXTERNAL-IP       PORT(S)        AGE
yelb-ui   LoadBalancer   100.64.53.23   185.177.x.x   80:32047/TCP   5m52s

Step 4 – Scale the UI

Let’s see how many instances have from the initial deployment.

jeram@QL4QJP2F4N ~ % kubectl get rs --namespace yelb
NAME                       DESIRED   CURRENT   READY   AGE
redis-server-74556bbcb7    1         1         1       8m11s
yelb-appserver-d584bb889   1         1         1       8m11s
yelb-db-694586cd78         1         1         1       8m11s
yelb-ui-8f54fd88c          1         1         1       8m11s
jeram@QL4QJP2F4N ~ % 

We can then scale the UI to 3 and check again to see if that happens.

jeram@QL4QJP2F4N ~ % kubectl scale deployment yelb-ui --replicas=3 --namespace yelb
deployment.apps/yelb-ui scaled

jeram@QL4QJP2F4N ~ % kubectl get rs --namespace yelb
NAME                       DESIRED   CURRENT   READY   AGE
redis-server-74556bbcb7    1         1         1       9m45s
yelb-appserver-d584bb889   1         1         1       9m45s
yelb-db-694586cd78         1         1         1       9m45s
yelb-ui-8f54fd88c          3         3         3       9m45s
jeram@QL4QJP2F4N ~ % 

UI is now scaled to replicates of 3. Seen from the Load Balancer view I VCD it will only show the worker nodes. Since k8s is doing its own loadbalancing arose the pod instances.

Step 5 – Cleanup

jeram@QL4QJP2F4N ~ % kubectl -n yelb delete pod,svc --all && kubectl delete namespace yelb
pod "redis-server-74556bbcb7-f8c8f" deleted
pod "yelb-appserver-d584bb889-6f2gr" deleted
pod "yelb-db-694586cd78-27hl5" deleted
pod "yelb-ui-8f54fd88c-6llf7" deleted
pod "yelb-ui-8f54fd88c-9r6wf" deleted
pod "yelb-ui-8f54fd88c-cdvqq" deleted
service "redis-server" deleted
service "yelb-appserver" deleted
service "yelb-db" deleted
service "yelb-ui" deleted
namespace "yelb" deleted
jeram@QL4QJP2F4N ~ % 

Again, CCM will instruct VCD CPI to clean up. NICE!

Conclusion

We now have a good idea of how load balancing works with VCD TKG deployd K8s clusters. Off cause we are looking forward to the L7 features. But it’s a good start and VMware is working hard to help in making k8s deployment and day2 operations easier.

NSX-T – Topology view troubleshooting

Ever seen the beneath picture when trying to see the cool topology view in NSX-T?

If so, there is an API call that can help you resync the topology view so you again can see it.

# Login details
$nsxUsername = "admin",
$nsxPassword = "",
$nsxmanager = "nsxt.home.lab"

# NSX Manager auth header
$Type = "application/json;charset=UTF-8"
$Header = @{"Authorization" = "Basic " + [System.Convert]::ToBase64String([System.Text.Encoding]::UTF8.GetBytes($nsxUsername + ":" + $nsxPassword)) }
$nsxUri = "https://$($nsxmanager)"

# Request topology resync
Invoke-RestMethod -Uri "$nsxUri/policy/api/v1/ui/network-topology/resync" -Headers $Header -Method POST -ContentType $Type

You don’t get any feedback from the API call, but after a few minutes, you will again be able to see the topology again.

NSX-V Edge – Disable ALG

In NSX-V 6.4.11 and 6.4.12 there was introduced a bug to the Application Layer Gateway where it would cause package drop.

From the release notes of NSX-V 6.4.13, it states:

Fixed Issue 2886595: ALG-based services are not working in NSX Data Center for vSphere 6.4.11 and 6.4.12.:

Edge Firewall drops packets for ALG-based services FTP/SFTP/SNMP when using NSX Data Center for vSphere 6.4.11 and 6.4.12.

A temporary workaround is to disable ALG on the affected edge until the NSX-V installation can be upgraded to 6.4.13 or 6.4.14.

Procedure

  • Connect to the NSX Manager as admin and enter enable mode by typing: enable
    Enter engineering mode by typing: st en
  • Enter the NSX Manager root password: IAmOnThePhoneWithTechSupport
    Get the password for the Edge by typing:
    /home/secureall/secureall/sem/WEB-INF/classes/GetSpockEdgePassword.sh edge-ID
[root@nsxvmanager ~]# /home/secureall/secureall/sem/WEB-INF/classes/GetSpockEdgePassword.sh edge-1269
Edge root password:
        edge-1269       -> u8ORKdfFIM$hZ
[root@nsxvmanager ~]#
  • Access the Edge VM console, log in as the admin user and enter enable mode by typing: enable
  • Enable engineering mode by typing: debug engineeringmode enable
  • Enter the root shell on the Edge by typing the password obtained from the NSX manager: st en
  • Run commands as followings to make the workaround reboot permanent and to disable the ALG in the kernel without a reboot.
    echo “net.netfilter.nf_conntrack_helper = 1” >> /etc/sysctl.conf
    sysctl net.netfilter.nf_conntrack_helper=1

Conclusion

Since the procedure above is done directly on the edge it will not survive an edge redeploy. This is because a redeploy will take its configuration from the NSX manager and not look at what is done directly on the edge itself.

Off cause, the correct solution would be to have the NSX manager upgraded to the latest version and afterward upgrade the edge version.

NSX-T Edge node password management

The default NSX-T password expiration time is set to 90 days. But in a lab environment, this is not required. So here is a bit on how to disable the timer but also how to recover from an expired or forgotten password.

Reset exipred password

If the password for either of the users, audit, root, or admin, has expired you will see it when you try to log in with SSH. It will then prompt you to enter the current password followed by the new two times. Since this is only for home lab, and would like the previous password, I set a new and quick-to-remember password. Fimmer_old_password1. The SSH session then disconnects and you start a new connection with the new password.

nsx-edge> set user admin password My-New_VMware1!_Password old-password Fimmer_old_password1

After the reset and re-reset you now have 90 days of password again. or you could disable the password expiration…

If you find yourself in a situation of a forgotten admin password.

You will most likely be able to log in with the root account. Even if expired using the console of the Edge VM will always work. From there you can use the normal Linux password reset command to reset the admin account password.

passwd admin

And if you have tried the wrong password too much you can unlock the account with pam tally.

pam_tally2 --user admin --reset

Another note when you are logged in with root, users can still use nsxcli, just wrap your nsxcli commands with su admin -c ”

su admin '-c clear user audit password-expiration'

If your find yourself completely locked out of NSX-T

VMware has some good documentation on this. Basically it is

  • Connect to the console of the appliance and reboot the system. When the GRUB boot menu appears, press the left SHIFT or ESC key quickly. Press e to edit the menu. Press e to edit the selected option.
  • Search for the line starting with linux and add systemd.wants=PasswordRecovery.service to the end of the line. Press Ctrl-X to boot.

Set password to never expire

SSH to the edge node with the admin account. Using the nsxcli we can adjust the expiration to a maximum of 90 days. The commands below will set the password expiration to 9999 days and clear the expiration if already happened. VMware has it in their documentation here

nsx-edge> set user admin password-expiration 9999
nsx-edge> set user root password-expiration 9999
nsx-edge> set user audit password-expiration 9999
nsx-edge> clear user admin password-expiration
nsx-edge> clear user root password-expiration
nsx-edge> clear user audit password-expiration

SSH-keys

Something that is always better than passwords is SSH Keys. You can add multiple ssh-keys to the same users in NSX-T. The cool thing is that you have a label for the key so multiple users can have access with their own SSH key, this way you avoid some of the hassles of having to use passwords in with your SSH connections

nsx-edge> set user admin ssh-keys label jr type ssh-rsa value AAAAB3NzaC1yc2EAAAADAQABAAACAQC/VPq30qzyJHr8v6qh1vF1CVY8R9U09iCkqnIs9H6d9hBOeDu/e52rPj2BOQUfHwBmGRPVqZUyuOO20hDgT/BzP0QxISv9l2OpFariz8AmHu9m4kUwAdrBDvplw8fFeafppUwQF/aFsIF+t1PtFluz0Bp3N/sp3NQGWfkez7myctGc9X3eMc6oUAYrPPJeDZz1x5JoGdwdH/w6wjr3uK03kRx6TX1kNqxSypIQQ/8lYg1TG7yAuF5DhX4fJrPjpiLau1H6z0vChVpqY1q8oMntzHHtYtByFMrNtWFfAvG94BT27h/Lkmz5JM5d41TbL0YdZT8zCTrXzUG87wdEaRiB5ZeKy9LENgfxKO66scSU2gjiXwpyJTrHKZYz9g5EERH/41w+qMT90HAM3ArSIvk7pROoKhZy0IeOwfWbmMlQvKQFjS7OtKnFEeVUYRqnLvi3XeUiFbLxmW3ID8IqQy3iDNuESiVNRcp/PoN7lxL9cfGJdXBuJ3PBcaQZx/vQpePRqW9eBSmhhS1beIUlLV0UOFdRGTMMMjOlp7m7jaw5EnvztbInfPOdMPoUuSL9iGut7M0SVMgEzo0MiJDHNdLQYK0EKO8qrWMz76UHhpdnhOQNdi3/wtVVzxVUR/D9zBa1q2oL8ml7jKVubVbBd6Vm0lEEquDEN3I9Dan/Ev0j9w==

Conclusion

Having an expired password will cause you all sorts of trouble.

If you don’t have a PAM solution that can help you to automatically change the password, then setting the expiration to 9999 days will for sure help your manageability.

Putting your SSH key onto the nodes and managers will help you in the long run, and is in my opinion also a more secure solution than having passwords.

ESXi network routes

I honestly don’t know why this is still a problem. Support for routed vmotion traffic was added back at vSphere6. Here we are vSphere7 and still have to set your gateway/routes for the vmotion stack through esxcli.

Either way, here is how it works

vMotion stack

Each tcp/ip stack can only have one gateway, that makes sense. And if you want to keep your management and vMotion traffic separated you need two tcp/ip stacks.

It’s nicely done through vSphere vCenter GUI and there is a KB for it. And you even have the option to override the default gateway and specify the right one for your vMotion stack.

[root@dc1esxcompx-xx:~] esxcli network ip route ipv4 list -N vmotion
Network     Netmask        Gateway  Interface  Source
----------  -------------  -------  ---------  ------
10.1.115.0  255.255.255.0  0.0.0.0  vmk1       MANUAL

But when looking at the routing table from esxcli it is not set. If you know, feel free to give me a kick and enlighten me.

ESXCLI add a static route

So for me to set an actually default route I have to do it as shown below

[root@dc1esxcompx-xx:~] esxcli network ip route ipv4 add -g 10.1.115.1 -n 0.0.0.0/0 -N vmotion

[root@dc1esxcompx-xx:~] esxcli network ip route ipv4 list -N vmotion
Network     Netmask        Gateway     Interface  Source
----------  -------------  ----------  ---------  ------
default     0.0.0.0        10.1.115.1  vmk1       MANUAL
10.1.115.0  255.255.255.0  0.0.0.0     vmk1       MANUAL

PowerCLI

Need to do it on a cluster with multiple hosts? No problem LucD from VMware community got you covered. I only did a little customization and it works for my needs.

connect-viserver -Server 

$stackName = 'vmotion'
$ipGateway = '10.1.115.1'
$ipDevice = 'vmk3'
$cluster = "computexx"
$vmhosts = get-cluster $cluster | get-vmhost

foreach($vmhost in $vmhosts)
{
$esx = Get-VMHost -Name $vmhost
$netSys = Get-View -Id $esx.ExtensionData.ConfigManager.NetworkSystem
$stack = $esx.ExtensionData.Config.Network.NetStackInstance | where{$_.Key -eq 'vmotion'}
$config = New-Object VMware.Vim.HostNetworkConfig
$spec = New-Object VMware.Vim.HostNetworkConfigNetStackSpec
$spec.Operation = [VMware.Vim.ConfigSpecOperation]::edit
$spec.NetStackInstance = $stack
$spec.NetStackInstance.ipRouteConfig.defaultGateway = $ipGateway
$spec.NetStackInstance.ipRouteConfig.gatewayDevice = $ipDevice
$config.NetStackSpec += $spec
$netsys.UpdateNetworkConfig($config,[VMware.Vim.HostConfigChangeMode]::modify)
}

Conclusion

Manipulating the vmotion stack route table with either esxcli or PowerCLI is working great.

Need to know more? there are plenty of good bloggers and KBs out here.

ssacli – CLI configuring SmartArray

When you install an HPE server with the VMware custom image for HPE servers you automatically get all the HPE tools for configuring the hardware. Neat.

Here is a small guide on how to clean and setup the array. The ssacli is located in /opt/smartstorageadmin/ssacli/bin on the ESXi serve – start here and follow the next commands

Show the existing config:

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 ld all show
Smart Array P440ar in Slot 0 (Embedded)
   Array A
      logicaldrive 1 (279.37 GB, RAID 1, Failed)
   Array B
      logicaldrive 2 (3.27 TB, RAID 1+0, OK)

Delete the existing logical drives:

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 ld 1 delete
Warning: Deleting an array can cause other array letters to become renamed.
         E.g. Deleting array A from arrays A,B,C will result in two remaining
         arrays A,B ... not B,C 
Warning: Deleting the specified device(s) will result in data being lost.
         Continue? (y/n)  y
[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 ld 2 delete
Warning: Deleting the specified device(s) will result in data being lost.
         Continue? (y/n)  y
[root@esx2:/opt/smartstorageadmin/ssacli/bin] 

Show all physical drives in the server:

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 pd all show
Smart Array P440ar in Slot 0 (Embedded)
   Array A
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:14 (port 1I:box 1:bay 14, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:15 (port 1I:box 1:bay 15, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:16 (port 1I:box 1:bay 16, SAS HDD, 900 GB, OK)

Create a new volume with available physical drives:

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 create type=ld drives=1I:1:9,1I:1:10,1I:1:11,1I:1:12,1I:1:13,1I:1:14,1I:1:15,1I:1:16 raid=6

Warning: Controller cache is disabled. Enabling logical drive cache will not take effect until this has been resolved.
[root@esx2:/opt/smartstorageadmin/ssacli/bin] 

Conclusion

Nice and easy – Give the HPE ssacli manual a read for more commands, starting page 57 and forward or use ssacli help.

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli help

CLI Syntax
   A typical SSACLI command line consists of three parts: a target device, 
   a command, and a parameter with values if necessary. Using angle brackets to
   denote a required variable and plain brackets to denote an optional 
   variable, the structure of a typical SSACLI command line is as follows:

      <target> <command> [parameter=value]

   <target> is of format:
      [controller all|slot=#|serialnumber=#]
      [array all|<id>]
...........

VMware CSE – Stuck cluster deployment

After upgrading to CSE 3.1.3 with VCD 10.3.1 I encountered a problem when creating clusters from the Ubuntu 20.04 native cluster template.

Basically, the mstr node would be deployed and started, VMTools will become ready and the first script injection would happen. Then all of a sudden the VM would reboot and the cluster creation will fail because it can’t see the process anymore. This will sometimes leave a cluster in the “Creation in progress” status but somehow it can not be managed anymore.

22-06-02 10:42:34 | cluster_service_2_x:2811 - _wait_for_tools_ready_callback | DEBUG :: waiting for guest tools, status: vm='vim.VirtualMachine:vm-835608', status=guestToolsNotRunning
22-06-02 10:42:39 | cluster_service_2_x:2811 - _wait_for_tools_ready_callback | DEBUG :: waiting for guest tools, status: vm='vim.VirtualMachine:vm-835608', status=guestToolsRunning
22-06-02 10:42:41 | cluster_service_2_x:2817 - _wait_for_guest_execution_callback | DEBUG :: waiting for process 1706 on vm 'vim.VirtualMachine:vm-835608' to finish (1)
22-06-02 10:42:46 | cluster_service_2_x:2817 - _wait_for_guest_execution_callback | DEBUG :: process [0, <Response [200]>, <Response [200]>] on vm 'vim.VirtualMachine:vm-835608' finished, exit code: 0
22-06-02 10:42:46 | cluster_service_2_x:2869 - _execute_script_in_nodes | DEBUG :: about to execute script on mstr-7e34 (vm='vim.VirtualMachine:vm-835608'), wait=True
22-06-02 10:42:48 | cluster_service_2_x:2817 - _wait_for_guest_execution_callback | DEBUG :: waiting for process 1729 on vm 'vim.VirtualMachine:vm-835608' to finish (1)
22-06-02 10:42:58 | cluster_service_2_x:2896 - _execute_script_in_nodes | ERROR :: Error executing script in node mstr-7e34: process not found (pid=1729) (vm='vim.VirtualMachine:vm-835608')
Traceback (most recent call last):
  File "/opt/vmware/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x.py", line 2879, in _execute_script_in_nodes
    callback=_wait_for_guest_execution_callback)

I created an SR request with Cloud Director GSS for both the failed deployment and for the stuck clusters that now couldn’t be deleted. Multiple screen sharing sessions later and no result.

Then I found the GitHub for Container Service Extension, the issue page had a very tempting title Failed deployments using TKGm on VCD. Many seem to have the same problem, no fix on the deployments but it seems that one guy had the fix for deletion of the stuck clusters.

The workaround

You need to find the ID of the user that owns the cluster. You can in the More>Kubernetes Clusters menu in VCD see who the owner is.

When you have the owner you can go into Administration > User > <User>. Then then the URL with contain the ID of the user.

vcd.ramsgaard.me/tenant/tenant1/administration/access-control/users/v9993018-ebf5-4ded-8134-27ddcc4ccbf0/general

With the userId you can fill out the body for the next API call.

$vdchost = "vcd.ramsgaard.me"
$apiusername = "svc-cse@system"
$password = 'Ye.........iks12!'

$base64AuthInfo = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(("{0}:{1}" -f $apiusername,$password)))
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12
$auth =Invoke-WebRequest -Uri "https://$vdchost/api/sessions" -Headers @{Accept = "application/*;version=32.0";Authorization="Basic $base64AuthInfo"} -Method Post

$accessBody = '{
    "grantType": "MembershipAccessControlGrant",
    "accessLevelId": "urn:vcloud:accessLevel:FullControl",
    "memberId": "urn:vcloud:user:e96cf9e8-535f-45d8-8a87-b9dac659f85f"
  }' | ConvertFrom-Json

$status = Invoke-RestMethod -Uri "https://$vdchost/cloudapi/1.0.0/entities/urn:vcloud:type:cse:nativeCluster:2.1.0/accessControls" -Headers @{Accept = "application/json;version=36.1";Authorization="Bearer $($auth.Headers.'X-VMWARE-VCLOUD-ACCESS-TOKEN')"} -ContentType "application/json" -Method post -Body ($accessBody | ConvertTo-Json)

When the API call is done you should now be able to delete the stuck cluster.

If you should be so unfortunate that the cluster is stuck in a “not resolved” state and the deletion through VCD GUI still fails you need to use the vcd cse cli.

### Login to VCD system or tenant organistaion
vcd login vcd.ramsgaard.me system jr
### Show clusters
vcd cse cluster list
### Force delete the cluster
vcd cse cluster delete tanzu1 --force

Conclusion:

The problem occurred in the first place due to a bug in VCD 10.3.1, the MQTT bus had some bug and therefore the cluster creation failed. 10.3.2 or 10.3.3 fixed the bug. (Off cause the VMware Tanzy Grid version should be used in the future)

It took some time to find the workaround, I hope the future of CSE will be more fault tolerant so these situations would not appear.

Until then there is a way to get out of the stuck cluster situation.

Disk mapping Windows <-> VMware – Part 2

A couple of years ago I did a post on how to map your windows disk with the real disk in VMware. The post will be an extension of it but with updated commands.

Why do I need to know the mapping? It happens when you stumble upon a VM disk with many disks attached. If the many disks vary in size you normally can look at those numbers and match them with the disks in VMware, but when all disks have the same size that approach become difficult.

Windows serial number:

In windows, we can retrieve the serial number on the disk we need to expand and then map the serial number to the VMware disk. In newer Windows Server versions it’s fairly easy to find but when dealing with older than 2012 you are missing the PowerShell cmdlets like get-disk. Someone on StackOverflow got a way that works on Windows Server 2008 > 2022.

$DriveLetter = "C:"
Get-CimInstance -ClassName Win32_DiskDrive |
Get-CimAssociatedInstance -Association Win32_DiskDriveToDiskPartition |
Get-CimAssociatedInstance -Association Win32_LogicalDiskToPartition |
Where-Object DeviceId -eq $DriveLetter |
Get-CimAssociatedInstance -Association Win32_LogicalDiskToPartition |
Get-CimAssociatedInstance -Association Win32_DiskDriveToDiskPartition |
Select-Object -Property SerialNumber

VMware disk:

From VMware’s side, it’s straightforward to find the disk and its serial number. Below is an scripted way of finding the disk and then adding the extra capacity.

Connect-VIServer ""

$VMname = ""
$disksn = "6000c295ec128b3d14472bdbf8e65aee"
$vmDisk = (Get-VM $VMname | Get-HardDisk) | Where-Object {$_.ExtensionData.Backing.uuid.Replace("-","") -eq $disksn } 

$ExpandSizeGb = 50
$vmDisk | Set-HardDisk -CapacityGB ($vmDisk.CapacityGB + $ExpandSizeGb) -Confirm:$false 

Conclusion:

Instead of having to guess what disk in windows is mapping to the VMware disk you here have a more automated way. The disk serial number retrieve commands are compatible with up to Windows Server 2022.

Brocade – FiberChannel zoning with CLI

When having todo zoning in Brocade world you still have to use an old java GUI client. Not always easy to open the GUI since not all Brocade switches have been updated and therefore all kinds of java security will prevent you from doing your job. With CLI we can skip the GUI and get to business instant.

CLI config does also have the great advantage to be sort of self-documentation 🙂 Here we will look a bit into doing what I would call basic zoning. A new device has been added to your fiber channel fabric and you need to create a zone for the new WWN and the storage system.

Adding ssh key

This is, in my opinion, a good thing to set up, makes it way easier for you to log in the next time. Off-cause is not optional, but a great advantage.

The public ssh key is gathered from a system that allows SSH-based logins. See below for info on how to set it up.

SAN-SW-03:admin> sshutil importpubkey
Enter user name for whom key is imported:admin
Enter IP address:10.1.100.20
Enter remote directory:/home/username
Enter public key name(must have .pub suffix):username.pub
Enter login name:username
username@10.1.100.20's password: 
public key is imported successfully.

Create alias

Below you first see the syntax, and afterward the actual command used in production.

# Syntax
alicreate “ALIAS_NAME”, “WWPN”

# Production command
alicreate "dc2esxmgmt1_11", "50:01:43:80:31:80:50:98"

Create zone

# Syntax
zonecreate "ZONE_NAME", "WWPN_alias_1;WWPN_alias_2"

# Production command
zonecreate "dc2storv5030_dc2esxmgmt1_11", "dc2esxmgmt1_11;dc2storv5030_01_01_p1v;dc2storv5030_01_01_p2v;dc2storv5030_01_02_p1v;dc2storv5030_01_02_p2v"

Add zone to config

cfgadd “fabric1”,”dc2storv5030_dc2esxmgmt1_11"

SAN-SW-03:admin> cfgsave
WARNING!!!
The changes you are attempting to save will render the
Effective configuration and the Defined configuration
inconsistent. The inconsistency will result in different
Effective Zoning configurations for switches in the fabric if
a zone merge or HA failover happens. To avoid inconsistency
it is recommended to commit the configurations using the
'cfgenable' command.

Do you want to proceed with saving the Defined
zoning configuration only?  (yes, y, no, n): [no] yes
Updating flash ...

Enable new config

SAN-SW-03:admin> cfgenable fabric1 
You are about to enable a new zoning configuration.
This action will replace the old zoning configuration with the
current configuration selected. If the update includes changes 
to one or more traffic isolation zones, the update may result in  
localized disruption to traffic on ports associated with
the traffic isolation zone changes.
Do you want to enable 'fabric1' configuration  (yes, y, no, n): [no] yes
zone config "fabric1" is in effect
Updating flash ...

Conclusion

It’s easy to do basic fiber channel zoning. The only thing that I miss from the GUI, and haven’t found yet, is a view to see all discovered WWN and then choose the WWN to the new alias you create. The server management system, in this case, shows the WWN name of the FC adapter and it’s easy to copy-paste into CLI commands.

So if you don’t have HPE OneView or some other fancy FC provisioning tool then you know how a basic and lowkey way of doing it.

And again, did I mention that when using CLI it basically document itself? 😉

Juniper upgrade process

Junos is in my opinion an awesome OS for your network. I enjoy the CLI, where commands are alike across all of Juniper’s products. Also, the many features and the fact that it’s not cisco.

BUT it also has its drawbacks. Honestly, I have seen some weird bugs. And keeping track of all the PRs from Juniper is a full-time job. And last but not least, the software upgrades are kind of a pain. especially on Junos devices older than 18.x.

EX3400 – format/install

For this case, I had a new EX3400, but with older firmware, 15.1X53-D58.3. I needed to upgrade to the latest SR in the newest train but from the CLI of the device only jumping 3 firmware versions are supported.

15.1> 18.1 > 18.4 > 19.3 > 20.2 > 21.1

But you can also do a format/install where you interrupt the boot process and then load a new firmware image on the device from a TFTP server. This is all done outside of Junos. This way you can jump to whatever version you want.

Jumping many version might make your config invalid, so beaware.

Juniper has a LOT of kb articles for this process and they all vary. So here is the process in my own writing

Process of format install

First, we need to get the right image from the juniper support side. It needs to the install image and the extension is .tgz

  • Download the image into your TFTP server.

In my case, the TFTP is a Linux box. If you prefer windows then TFTPd3264 is the way to go. Or MacOS then look here.

root@tftp:/srv/tftp# wget -O junos-install-media-net-ex-arm-32-21.4R1.12.tgz  'https://cdn.juniper.net/software/junos/21.4R1.12/junos-install-media-net-ex-arm-32-21.4R1.12.tgz?SM_USER=jv......5ce43fbdad2'
Resolving cdn.juniper.net (cdn.juniper.net)... 23.78.40.231
Connecting to cdn.juniper.net (cdn.juniper.net)|23.78.40.231|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 393745989 (376M) [application/octet-stream]
Saving to: ‘junos-install-media-net-ex-arm-32-21.4R1.12.tgz’

junos-install-media-net-ex-arm-32-21.4R1.12.tgz      100%[==================================================================>] 375.50M  3.48MB/s    in 2m 4s

2022-01-26 20:47:46 (3.03 MB/s) - ‘junos-install-media-net-ex-arm-32-21.4R1.12.tgz’ saved [393745989/393745989]

root@tftp:/srv/tftp# ls
junos-install-media-net-ex-arm-32-21.4R1.12.tgz
  • Now let’s reboot the switch and interrupt the “first” boot loader. just keep hitting ctrl+c after you powered rebooted when you see the “=>” you are in the right place. Here we set the IP address on the me0 interface and boot into the next boot loader.
Board: EX3400-24T
Base MAC: C00380FAAD2E
arm_clk=1000MHz, axi_clk=500MHz, apb_clk=125MHz, arm_periph_clk=500MHz
Net:   Registering eth
Broadcom BCM IPROC Ethernet driver 0.1
Using GMAC0 (0x18022000)
et0: ethHw_chipAttach: Chip ID: 0xdc14; phyaddr: 0x1
et0: gmac_serdes_init read sdctl(0xf4141c)
et0: gmac_serdes_init() serdes_status0: 0xf100ff00; serdes_status1: 0xf00
et0: gmac_serdes_init() PLL ready brought up exit
serdes_reset_core pbyaddr(0x1) id2(0xf)
bcmiproc_eth-0
Last Reset Reason: 0
Hit ^C to stop autoboot:  0
=>setenv ipaddr 10.1.100.253
=>setenv gatewayip 10.1.100.1
=>setenv netmask 255.255.255.0
=>setenv serverip 10.1.101.130
=>save
=>boot
Saving Environment to SPI Flash...
SF: Detected MX25L6405D with page size 256 Bytes, erase size 64 KiB, total 8 MiB, mapped at 0001faa0
Erasing SPI flash...Writing to SPI flash...done
Erasing SPI flash...Writing to SPI flash...done
SF: Detected MX25L6405D with page size 256 Bytes, erase size 64 KiB, total 8 MiB
device 0 offset 0x3c0000, size 0x10000
SF: 65536 bytes @ 0x3c0000 Read: OK
  • Wait for a few seconds for the next bootloader to appear and press ctrl+c again. Now you will see a menu, this menu you choose 5 and 5 and you should see “loader>”
Hit ^C to stop autoboot:  0 
Options Menu

1.  Recover [J]unos volume
2.  Recovery mode - [C]LI

3.  Check [F]ile system
4.  Enable [V]erbose boot
5.  [B]oot prompt
6.  [M]ain menu
Choice: 
Type 'menu' to go back to the menu
Type 'boot-junos' to boot into Junos
Type 'reboot' to reboot

5 5
  • We now set use the install format with the TFTP location of the image we downloaded in the first step.
Type '?' for a list of commands, 'help' for more detailed help.
loader> install --format tftp://10.1.101.130/junos-install-media-net-ex-arm-32-21.4R1.12.tgz
/kernel text=0x105b888 data=0x640fc+0x1fbf04 syms=[0x4+0x914a0+0x4+0x9b821]
/ex3400.dtb size=0x1f76
/crypto.ko text=0x419e0 data=0xe58+0x2a0 syms=[0x4+0x4740+0x4+0x2ba5]
/iflib.ko text=0x11f10 data=0x910+0x58 syms=[0x4+0x2b10+0x4+0x2194]
/miibus.ko text=0x19f38 data=0x10c4+0x78 syms=[0x4+0x51f0+0x4+0x3491]
/if_gmac.ko text=0xbc3c data=0x688+0xc syms=[0x4+0x1cc0+0x4+0x15ad]
/contents.iso size=0x279b000
Using DTB from loaded file '/ex3400.dtb'.
Kernel entry at 0xc1000180...
Kernel args: (null)
---<<BOOT>>---
GDB: no debug ports present
K cache
Release APs
WARNING: WITNESS option enabled, expect reduced performance.
mwill now attempt to reach the remote host.
<====== LOADS OF OUTPUT TO CONSOLE ======>
<====== LOADS OF OUTPUT TO CONSOLE ======>
Downloading /junos-install-media-net-ex-arm-32-21.4R1.12.tgz from 10.1.101.130 ...
rmed on 1024 samples passed.t-up health tests perfo
  300.6MB  03:52random: unblocking device.
  393.7MB  05:04
Installing Junos OS release ...

After 15-20 minutes the switch will have the install finished and ready for you to log into and start loading your config.

FreeBSD/arm (Amnesiac) (ttyu0)
login: 

Conclusion

This is a very helpful process and might come in handy when having new switches with old firmware that need to be applied. Skipping the smaller version jumps is a time saver.

This format install process can also be done with a USB key. This process is also quite simple but requires you to have physical access to the switch.

In my case, I have a console over ssh and can manage the switch out-of-band so TFTP is the easy way.

Veeam – retrive saved passwords from VBR

Ever needed to retrieve a saved Veeam password? I did – Found the process for it on the Veeam forum.

  • Open SQL Studio as administrator and connect to the Veeam DB instance
  • Run query from below on the VeeamBackup database
SELECT TOP (1000) [id]
,[user_name]
,[password]
,[usn]
,[description]
,[visible]
,[change_time_utc]
FROM [VeeamBackup].[dbo].[Credentials]
Query the Veeam DB for all stored credentials to backup infrastructure components

Get the password hash from the results (match the description to the one you need). Then run PowerShell below with the hash you grabbed.

Add-Type -Path "C:\Program Files\Veeam\Backup and Replication\Backup\Veeam.Backup.Common.dll"
$encoded = 'AQAAANCM....RhQ'
[Veeam.Backup.Common.ProtectedStorage]::GetLocalString($encoded)
Password revealed and ready to use

Conclusion:

Is this a security problem? Depends, but it will give you a reminder of how important it is to keep your Veeam VBR server safe. Never domain join and have the firewall closed as much as possible. If a malicious person comes by your Veeam server they can grab the keys for the rest of your infrastructure, including your backup of cause. In most cases that would mean game over.

Faster and more scripted way:

$instance = (Get-ItemProperty -Path "HKLM:\SOFTWARE\Veeam\Veeam Backup and Replication" -name SqlInstanceName).SqlInstanceName
$server = (Get-ItemProperty -Path "HKLM:\SOFTWARE\Veeam\Veeam Backup and Replication" -name SqlServerName).SqlServerName
$result = Invoke-Sqlcmd -Query "SELECT TOP (1000) [user_name],[password],[description] FROM [VeeamBackup].[dbo].[Credentials]" -ServerInstance "$server\$instance"
Add-Type -Path "C:\Program Files\Veeam\Backup and Replication\Backup\Veeam.Backup.Common.dll"
$result | ForEach-Object { [Veeam.Backup.Common.ProtectedStorage]::GetLocalString($($_.password))}

Cloud Director 10.3 – Update certificates

Since my last article on how to update Cloud Director SSL certificates, there has been a major change. No more binary java truststore – jaaay.

Cloud Director has changed over too, what I think, is a better and more normal way of storing the private and public keys, which is in PEM format. From release notes, the change actually happened in 10.2, but the certificate path changed again in 10.3. If you are in doubt of where the certificate path is then look inside global.properties

/opt/vmware/vcloud-director/etc/global.properties

VMware’s own documentation state that we can now just swap the .pem files, use the cell-management tool to import and restart the cell.

What we will do and what is needed

  • Get a new public signed certificate
    • Either in PEM format as .key and .pem(certificate including intermediate)
    • Or in PFX so it can be exported
  • Backup existing certificates
  • Replace existing certificates with your new certificate
  • Run VCD tool to import and define the private key encryption password
  • Restart cell(s)

Process

If you have a pfx you can use this article to extract the key and cert. If you already have the two files, .key end .pem then you can proceed.

We will follow VMware documentation and create a backup of the existing files.

cp /opt/vmware/vcloud-director/etc/user.http.pem /opt/vmware/vcloud-director/etc/user.http.pem.original
cp /opt/vmware/vcloud-director/etc/user.http.key /opt/vmware/vcloud-director/etc/user.http.key.original
cp /opt/vmware/vcloud-director/etc/user.consoleproxy.pem /opt/vmware/vcloud-director/etc/user.consoleproxy.pem.original
cp /opt/vmware/vcloud-director/etc/user.consoleproxy.key /opt/vmware/vcloud-director/etc/user.consoleproxy.key.original

Now we can wither SCP in our key and certificate or edit and replace the content of the files on the server by copying and pasting in content from the files you have. Whatever you find to be the easiest.

Forgot your root password for the Cloud Director appliance, off cause not. But anyway, here is a link to reset it....

After the “user.http.pem/key” and “user.consoleproxy.pem/key” files have been updated with the new certificate data we can tell Cloud Dictor to update its config with the commands below. This is done to update the encryption password for the private key.

If you don’t care about security you can also update without –key-password, then off cause your private key will need to be in an unencrypted format in the .key files.

/opt/vmware/vcloud-director/bin/cell-management-tool certificates -j --cert /opt/vmware/vcloud-director/etc/user.consoleproxy.pem --key /opt/vmware/vcloud-director/etc/user.consoleproxy.key --key-password PASSWD
/opt/vmware/vcloud-director/bin/cell-management-tool certificates -p --cert /opt/vmware/vcloud-director/etc/user.http.pem --key /opt/vmware/vcloud-director/etc/user.http.key --key-password PASSWD

If everything works out it will tell you the certificates have been updated and you need to restart VCD for it to take effect.

SSL configuration has been updated. You will need to restart the cell for changes to take effect.

Now safely shut down your cell(s) with the command below. this will ensure that VCD is the first shutdown when all tasks are done.

/opt/vmware/vcloud-director/bin/cell-management-tool cell -i $(service vmware-vcd pid cell) -s

Start again with the command below

systemctl start vmware-vcd

Conclusion

VMware has made it much easier to change a certificate in Cloud Director. The new way of storing certificates is a warm welcome change.

I did see a few different placements for the .key and .pem files depending on versions or if the cells have been created with raw Linux or an appliance, but you can always look in the conflig file placed in the same folder as the certificates.

Storage DRS recommendations – with PowerCLI

Many things can happen when you let Storage DRS run fully automated. If you have it on from the beginning it will probably only give you good things. But enabling it on a large storage space imbalanced cluster might be a bit too risky.

Many things that Storage DRS is not aware of. Like your storage underneath running out of space on pool/aggregate or the operations is too IO heavy to run within business hours.

Call me a wimp, but in this case, it seems better to be in control and apply the recommendations little by little. But having to use the GUI is a pain, you need to go into Storage Cluster > Monitor > Storage DRS > Recommendations. And from here you need to override the selections and uncheck the boxes so you can run smaller batches of Storage vMotions.

I will just use VMware PowerCLI cmdlets…

Well, unfortunately not all of vSphere API is exposed through PowerCLI cmdlets, but after a bit of googling it seemed quite easy to call the SDK API directly from within PowerShell

One post that came to my attention where containing most of the code needed.

Solution:

I’m not that much into what the ServiceInstance or StorageRessoruceManager is. But I expect it to be the API instantiated by PowerShell where you then have each operation from where you can find the functionality that you are looking for.

 # DSC you want to work with
$dscName = 'DatastoreCluster'

# Get DSC info
$dsc = Get-View -ViewType StoragePod -Filter @{'Name'=$dscName}

# Get Service Intance
$si = Get-View ServiceInstance

# Get the StorageResourceManager
$storMgr = Get-View -Id $si.Content.StorageResourceManager

# Refresh SDRS Recommendation on DSC
$storMgr.RefreshStorageDrsRecommendation($dsc.MoRef)

# Update dsc object with fresh recommendation data
$dsc.UpdateViewData()

# Filter on reason for storage balance. Select only 40 VMs.
$balance = $dsc.PodStorageDrsEntry.Recommendation | Where-Object {$_.Reason -eq "balanceDatastoreSpaceUsage"}  | Select-Object -First 40

# Do a run of each VM and start the storage vMotion process
foreach($vm in $balance){
   $message = "Moving VM: {0} to datastore: {1}" -f $(get-vm -id $("VirtualMachine-"+$($vm.Action[0].Target.Value))).name, $(get-datastore -id $vm.Action[0].Destination).name
   write-host $message -ForegroundColor Green
   $storMgr.ApplyStorageDrsRecommendationToPod($dsc.MoRef,$vm.Key)
} 

Conclusion:

I was expecting to use some PowerCLI cmdlets to make my granular balance of the storage cluster. Unfortunately, that did not exist.

But from the great community, I found how to use the vSphere API through PowerShell and in the end got the functionalty I was looking for.

Maybe there is an easier way to do the same, if so, let me know. Until next time I have a bit of vSphere SDK googling to do.

Veeam – Network Extention Appliance performance

This post will give a brief write-up on what to expect from a network perspective when using the Veeam Network Extention.

Since you found this post I don’t think an introduction is needed. But anyway. A quick write-up of the network so you can visualize how the test is performed.

  • Greenline indicated the L2 VPN made from both NEA to CloudGateway
  • The on-prem environment with 10gbit internet uplink
  • Service provider with multiple 10gbit internet uplinks
  • 4ms between on-prem and service provider

Tests:

So when a replica VM has been failover and the NEA L2 is running. What to expect? Veeam does not give you any info on the performance of the NEA. Veeam support is not either able to give out a performance chart. So here a the results from ping and iperf test.

Test 1 – ping of latency over the L2 tunnel:

Ping to 185.177.120.140 showing the latency over the internet. Ping to 192.168.12.151 showing the latency over the L2 VPN.

So from a latency perspective, it seems good. Only adding 1ms to the internet latency. Which is pretty good.

Test 2 – iperf over tunnel

iperf test from a VM in service provider side to an on-prem server.

About 110Mbit, not very good compared to the internet being able of doing 10Gbit.

iperf test from VM in service provider side to an on-prem server. This time with -P 8 for 8 parallel threads.

8 threads are not giving any further bandwidth.

Test 3 – Multiple VLAN with multiple NEA

It’s always interesting to find where the bottleneck could be. Since iperf over the internet is giving a completely different result. then it must be within NEA. When I tried to do multiple VLAN bridges to the cloud resources in Cloud Director I get the same results pr NEA. Meaning it could be something in NEA or its components making the bottleneck.

The good news is off cause that you will see the same result pr NEA even when doing iperf test to the same target in the other end. So NEA will scale linearly.

A look from the Veeam Cloud Connect gateway being the broker of the L2 VPN connections.
View from the iperf server – showing the connections from the servers in the other end of L2 VPN.

Conclusion

NEA is a very helpful solution, especially when it comes to large migrations where L2 between datacenters is required meanwhile migrating. Bandwidth using this solution is not great, but I would say is ok. L2 connection should only be used shortly when doing actually migrations.

In numbers, it seems NEA will add +1ms to the latency seen over the internet between the two environments. Bandwith is between 110 to 140mbit pr sec.

Manual mount VMFS datastore

Have a datastore that shows Not Consumed? From time to time I stumble across them and from what I have found there is really only one way to get around it, manually mount the datastores from the shell of the ESXi host.

Not sure what the root cause for it is, but if you know, then please let me know 🙂

Workaround

What we need to do is have the partitions UUID’s on the block device listed and afterwards mount the datastore with that UUID.

### Listing all available datastores that is not mounted
esxcfg-volume –l
### Mount a specefic datastore with the UUID found with -l 
esxcfg-volume –M <UUID>

Conclusion

After mounting with esxcfg-volume it should be mounted permanatly. Hope it works for you to.

Change VM MoRef in VBR database

This information can be found in many other places on the big internet, but since I can never find it myself, I will make a post more about the procedure.

When you switch ESXi host, vCenter, or remove and add from inventory your VM will get a new ID. In the world of VMware, it’s called MoRef ID.

When this happens Veeam will lose its coupling to the VM and backup will fail with:
– Virtual Machine <> is unavailable and will be skipped from processing.
– Nothing to process. All machines were excluded from task list.

How to verify there is a MoRef mismatch:

From a VMware perspective it’s easy:

connect-viserver <vcenter> -Credential $cred
Get-VM | select name, id

This will give you something like:

PS C:\Windows\system32> Get-VM | select name, id
Name Id
---- --
VirtualMachine-vm-71326

From Veeam perspective it’s a bit harder since you will need to query the MS SQL database that Veeam uses. So download the SQL Studio Manager from Microsoft.

Open the SQL Studio Manager as administrator on the server to gain access to the Veeam database. You can use the following query to find the MoRef that is in the Veeam database:

SELECT [dbo].[BObjects].id, [dbo].[BObjects].object_id, [dbo].[BObjects].host_id, [dbo].[BObjectsSensitiveInfo].object_name, [dbo].[BObjectsSensitiveInfo].path
FROM [dbo].[BObjects]
INNER JOIN [dbo].[BObjectsSensitiveInfo] ON [dbo].[BObjectsSensitiveInfo].bObject_id=[dbo].[BObjects].id  
WHERE object_name = '<vmname>'

Verify:

So we can now see that the VM in VMware has MoRef “vm-71326”. But Veeam database has “vm-992”. From here on you know what’s wrong and you need to open a Veeam support case to get the supported procedure.

If you don’t care about supported procedures you can update the database with VMware VM new MoRef ID and your VBR job should be running again. The SQL query would look like this:

UPDATE [dbo].[BOobjects]
SET [object_id] = 'new-id'
WHERE [object_id] = 'old-id'

Conclusion

It’s not that had to change the MoRef in the VBR database. But remember, if you care about having a supported installation. Then you need to create a Veeam support case and have them help you. Something could have changed in the VBR database schema since this post.

Massdelete files

Not sure what happened, but thousands of small files piled up in the path below.

C:\ProgramData\Microsoft\Crypto\RSA\S-1-5-8

My google-fu wasn’t sufficient enough to find out what files in that folder actually did. An article from Kofax said that just deleting files could be troublesome. But also found another article that stated that he deleted all files older than 30 days and hasn’t had problems yet. So I dare to do the same. The command was found on a blog, so all credit goes to this guy.

forfiles /D -30 /C "cmd /C attrib -s @file & echo @file & del @file"

forfiles is a very nice command that iterates through the files in a folder according to its parameters. /D -30 iterates through all files more than 10 days old. attrib -s takes off the System attribute, which is needed for DEL to work. The echo is there so you can see that it is doing its job.

The PowerShell equivalent would be:

Get-ChildItem -Path C:\programdata\Microsoft\Crypto\rsa\S-1-5-18 -Include *.* -Recurse | foreach { $_.Delete()}