Prevent Drive Failure at 32,768 Hours

This is one of the nasty bugs: some SSD models fail once they have been powered on for more than 32,768 hours. Imagine running vSAN with a batch of affected disks. If they all went into service together, they will all fail at about the same time, and you are left with only your backup (hopefully).
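To put the 32,768-hour figure in perspective, here is a small sketch of the arithmetic. The function names are made up for illustration, and the projection assumes the drive stays powered on around the clock from now on.

```python
from datetime import datetime, timedelta

# Power-on threshold named in the text; note that 32768 == 2**15.
FAILURE_HOURS = 32768


def hours_remaining(power_on_hours):
    """Hours left before the drive reaches the threshold (never negative)."""
    return max(FAILURE_HOURS - power_on_hours, 0)


def projected_failure(power_on_hours, now):
    """Projected failure time, assuming continuous uptime from `now` on."""
    return now + timedelta(hours=hours_remaining(power_on_hours))


days, hours = divmod(FAILURE_HOURS, 24)
print(f"{FAILURE_HOURS} hours = {days} days and {hours} hours of uptime")
# prints "32768 hours = 1365 days and 8 hours of uptime"
```

That is roughly 3 years and 9 months of continuous uptime, so disks bought in one order and racked in the same week will hit the limit within hours of each other.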

I have seen this once before, where Intel disks were the problem. Unfortunately, the Intel SSDs were the metadata disks in a Ceph storage cluster, and since they all failed at the same time, the cluster died!

This was of course because nobody had been informed of the bug. When buying hardware from HPE and other enterprise hardware vendors, we get an email letting us know about the problem before it becomes a disaster.

https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us

Update procedure – VMware

We had to update the firmware on the disks, and we are running VMware with vSAN. Thankfully, HPE has already released the patch, along with guidance for VMware.

https://support.hpe.com/hpsc/swd/public/detail?swItemId=MTX_6089c15599b647aca0c049ce24#tab2

  1. Download the patch and copy it to /tmp on the ESXi servers.
  2. Unzip it and make the .vmexe file executable: chmod +x CP****.vmexe
  3. Put one of the hosts into maintenance mode.
  4. Run the file: ./CP****.vmexe. It lists the disks it found, and you tell it the numbers of the disks whose firmware you want upgraded.
  5. After the upgrade I rebooted the host anyway.

Remember that rebooting a vSAN node can take a long time, 10-30 minutes. The server console shows “vSAN initialising SSD XXX”. Give it time, it will boot.

Fetching firmware version

If you use the HPE custom VMware image, all the HPE tools are already on the server, so you can query the hardware directly.

  1. cd /opt/smartstorageadmin/ssacli/bin
  2. Execute ./ssacli ctrl slot=0 pd all show detail

For a command cheat sheet, see https://wiki.phoenixlzx.com/page/ssacli/ or the official documentation: https://support.hpe.com/hpsc/doc/public/display?docId=c03909334

The ssacli command above gives you all the information on the disks behind the controller. The model number is what you look up on HPE's site to see whether the drive is affected.
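With many disks per host, reading that output by eye gets tedious. Below is a minimal sketch that pulls the model and firmware revision out of saved ssacli output. It assumes the output consists of "physicaldrive <id>" headings followed by indented "Key: Value" lines, and that the fields are called "Model" and "Firmware Revision"; check this against what your controller actually prints. The sample text, the model names and the function names are invented for illustration, and the set of affected models has to come from HPE's advisory.

```python
def parse_ssacli_detail(text):
    """Map each drive id to its 'Key: Value' fields (assumed output layout)."""
    drives = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("physicaldrive "):
            # Start a new drive section, keyed by the id after the keyword.
            current = drives.setdefault(line.split(None, 1)[1], {})
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key.strip()] = value.strip()
    return drives


def drives_to_check(drives, affected_models):
    """Return (drive id, model, firmware) for drives matching an affected model."""
    hits = []
    for drive_id, fields in drives.items():
        model = fields.get("Model", "")
        if any(name in model for name in affected_models):
            firmware = fields.get("Firmware Revision", "unknown")
            hits.append((drive_id, model, firmware))
    return hits


# Invented sample in the assumed layout, not real controller output.
SAMPLE = """\
Smart Array in Slot 0

   physicaldrive 1I:1:1
      Status: OK
      Firmware Revision: FW01
      Model: HP      EXAMPLE-MODEL-A
   physicaldrive 1I:1:2
      Status: OK
      Firmware Revision: FW02
      Model: HP      EXAMPLE-MODEL-B
"""

for hit in drives_to_check(parse_ssacli_detail(SAMPLE), {"EXAMPLE-MODEL-B"}):
    print(hit)
```

On a real host you would redirect the ssacli output to a file, read it in, and pass it to parse_ssacli_detail together with the model numbers from the advisory.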