October 2022 – ramsgaard.me

When you install an HPE server with the VMware custom image for HPE servers you automatically get all the HPE tools for configuring the hardware. Neat.

Here is a small guide on how to clean and setup the array. The ssacli is located in /opt/smartstorageadmin/ssacli/bin on the ESXi serve – start here and follow the next commands

Show the existing config:

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 ld all show
Smart Array P440ar in Slot 0 (Embedded)
   Array A
      logicaldrive 1 (279.37 GB, RAID 1, Failed)
   Array B
      logicaldrive 2 (3.27 TB, RAID 1+0, OK)

Delete the existing logical drives:

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 ld 1 delete
Warning: Deleting an array can cause other array letters to become renamed.
         E.g. Deleting array A from arrays A,B,C will result in two remaining
         arrays A,B ... not B,C 
Warning: Deleting the specified device(s) will result in data being lost.
         Continue? (y/n)  y
[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 ld 2 delete
Warning: Deleting the specified device(s) will result in data being lost.
         Continue? (y/n)  y
[root@esx2:/opt/smartstorageadmin/ssacli/bin]

Show all physical drives in the server:

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 pd all show
Smart Array P440ar in Slot 0 (Embedded)
   Array A
      physicaldrive 1I:1:9 (port 1I:box 1:bay 9, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:10 (port 1I:box 1:bay 10, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:11 (port 1I:box 1:bay 11, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:12 (port 1I:box 1:bay 12, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:13 (port 1I:box 1:bay 13, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:14 (port 1I:box 1:bay 14, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:15 (port 1I:box 1:bay 15, SAS HDD, 900 GB, OK)
      physicaldrive 1I:1:16 (port 1I:box 1:bay 16, SAS HDD, 900 GB, OK)

Create a new volume with available physical drives:

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli ctrl slot=0 create type=ld drives=1I:1:9,1I:1:10,1I:1:11,1I:1:12,1I:1:13,1I:1:14,1I:1:15,1I:1:16 raid=6

Warning: Controller cache is disabled. Enabling logical drive cache will not take effect until this has been resolved.
[root@esx2:/opt/smartstorageadmin/ssacli/bin]

Conclusion

Nice and easy – Give the HPE ssacli manual a read for more commands, starting page 57 and forward or use ssacli help.

[root@esx2:/opt/smartstorageadmin/ssacli/bin] ./ssacli help

CLI Syntax
   A typical SSACLI command line consists of three parts: a target device, 
   a command, and a parameter with values if necessary. Using angle brackets to
   denote a required variable and plain brackets to denote an optional 
   variable, the structure of a typical SSACLI command line is as follows:

      <target> <command> [parameter=value]

   <target> is of format:
      [controller all|slot=#|serialnumber=#]
      [array all|<id>]
...........

After upgrading to CSE 3.1.3 with VCD 10.3.1 I encountered a problem when creating clusters from the Ubuntu 20.04 native cluster template.

Basically, the mstr node would be deployed and started, VMTools will become ready and the first script injection would happen. Then all of a sudden the VM would reboot and the cluster creation will fail because it can’t see the process anymore. This will sometimes leave a cluster in the “Creation in progress” status but somehow it can not be managed anymore.

22-06-02 10:42:34 | cluster_service_2_x:2811 - _wait_for_tools_ready_callback | DEBUG :: waiting for guest tools, status: vm='vim.VirtualMachine:vm-835608', status=guestToolsNotRunning
22-06-02 10:42:39 | cluster_service_2_x:2811 - _wait_for_tools_ready_callback | DEBUG :: waiting for guest tools, status: vm='vim.VirtualMachine:vm-835608', status=guestToolsRunning
22-06-02 10:42:41 | cluster_service_2_x:2817 - _wait_for_guest_execution_callback | DEBUG :: waiting for process 1706 on vm 'vim.VirtualMachine:vm-835608' to finish (1)
22-06-02 10:42:46 | cluster_service_2_x:2817 - _wait_for_guest_execution_callback | DEBUG :: process [0, <Response [200]>, <Response [200]>] on vm 'vim.VirtualMachine:vm-835608' finished, exit code: 0
22-06-02 10:42:46 | cluster_service_2_x:2869 - _execute_script_in_nodes | DEBUG :: about to execute script on mstr-7e34 (vm='vim.VirtualMachine:vm-835608'), wait=True
22-06-02 10:42:48 | cluster_service_2_x:2817 - _wait_for_guest_execution_callback | DEBUG :: waiting for process 1729 on vm 'vim.VirtualMachine:vm-835608' to finish (1)
22-06-02 10:42:58 | cluster_service_2_x:2896 - _execute_script_in_nodes | ERROR :: Error executing script in node mstr-7e34: process not found (pid=1729) (vm='vim.VirtualMachine:vm-835608')
Traceback (most recent call last):
  File "/opt/vmware/cse/python/lib/python3.7/site-packages/container_service_extension/rde/backend/cluster_service_2_x.py", line 2879, in _execute_script_in_nodes
    callback=_wait_for_guest_execution_callback)

I created an SR request with Cloud Director GSS for both the failed deployment and for the stuck clusters that now couldn’t be deleted. Multiple screen sharing sessions later and no result.

Then I found the GitHub for Container Service Extension, the issue page had a very tempting title Failed deployments using TKGm on VCD. Many seem to have the same problem, no fix on the deployments but it seems that one guy had the fix for deletion of the stuck clusters.

The workaround

You need to find the ID of the user that owns the cluster. You can in the More>Kubernetes Clusters menu in VCD see who the owner is.

When you have the owner you can go into Administration > User > <User>. Then then the URL with contain the ID of the user.

vcd.ramsgaard.me/tenant/tenant1/administration/access-control/users/v9993018-ebf5-4ded-8134-27ddcc4ccbf0/general

With the userId you can fill out the body for the next API call.

$vdchost = "vcd.ramsgaard.me"
$apiusername = "svc-cse@system"
$password = 'Ye.........iks12!'

$base64AuthInfo = [Convert]::ToBase64String([Text.Encoding]::ASCII.GetBytes(("{0}:{1}" -f $apiusername,$password)))
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.SecurityProtocolType]::Tls12
$auth =Invoke-WebRequest -Uri "https://$vdchost/api/sessions" -Headers @{Accept = "application/*;version=32.0";Authorization="Basic $base64AuthInfo"} -Method Post

$accessBody = '{
    "grantType": "MembershipAccessControlGrant",
    "accessLevelId": "urn:vcloud:accessLevel:FullControl",
    "memberId": "urn:vcloud:user:e96cf9e8-535f-45d8-8a87-b9dac659f85f"
  }' | ConvertFrom-Json

$status = Invoke-RestMethod -Uri "https://$vdchost/cloudapi/1.0.0/entities/urn:vcloud:type:cse:nativeCluster:2.1.0/accessControls" -Headers @{Accept = "application/json;version=36.1";Authorization="Bearer $($auth.Headers.'X-VMWARE-VCLOUD-ACCESS-TOKEN')"} -ContentType "application/json" -Method post -Body ($accessBody | ConvertTo-Json)

When the API call is done you should now be able to delete the stuck cluster.

If you should be so unfortunate that the cluster is stuck in a “not resolved” state and the deletion through VCD GUI still fails you need to use the vcd cse cli.

### Login to VCD system or tenant organistaion
vcd login vcd.ramsgaard.me system jr
### Show clusters
vcd cse cluster list
### Force delete the cluster
vcd cse cluster delete tanzu1 --force

Conclusion:

The problem occurred in the first place due to a bug in VCD 10.3.1, the MQTT bus had some bug and therefore the cluster creation failed. 10.3.2 or 10.3.3 fixed the bug. (Off cause the VMware Tanzy Grid version should be used in the future)

It took some time to find the workaround, I hope the future of CSE will be more fault tolerant so these situations would not appear.

Until then there is a way to get out of the stuck cluster situation.

Month: October 2022

ssacli – CLI configuring SmartArray