NetApp – Service Processor stuck updating

After the update to ONTAP 9.7 the service processors were stuck in “updating”. They never came back online, not even after rebooting them.

Procedure:

  • Disable auto update
  • Reboot one SP and wait for it to show online.
  • Run the update manually
  • If it’s online and updated, enable auto update again.
Here we see that ctrl02 is online again, but with the wrong firmware.
After the manual update we now have the correct firmware and both SPs are online, ready to enable auto update again.
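For reference, the steps above map to the ONTAP CLI roughly like this (the cluster and node names are placeholders, so treat it as a sketch and verify the syntax against your ONTAP version):

### Disable SP firmware auto update on all nodes
cluster01::> system service-processor image modify -node * -autoupdate false

### Reboot the SP on one node and wait for it to show online
cluster01::> system service-processor reboot-sp -node cluster01-02
cluster01::> system service-processor show

### Run the firmware update manually and follow the progress
cluster01::> system service-processor image update -node cluster01-02
cluster01::> system service-processor image update-progress show

### When the SP is online with the correct firmware, re-enable auto update
cluster01::> system service-processor image modify -node * -autoupdate true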

https://kb.netapp.com/app/answers/answer_view/a_id/1028746/~/service-processor-firmware-update-fails-

Install and use MegaCLI on VMware host

Over the last decade I have had the fun of managing LSI based RAID controllers, though never on Windows machines, where the GUI based Storage Manager tools are simple to work with.

Even though I usually find the vib and get it installed, I always struggle to remember how it’s installed and what the commands are. This time I will write it down for future me. Or you?

Procedure

  • Find the MegaCLI vib file and download it…
  • Copy the vib to the ESXi host
  • Install the vib
  • Use MegaCLI for whatever purpose you’ve got

Finding the vib

This is where I struggle the most. LSI was bought by Avago, and soon after Avago was bought by Broadcom, so the old support links for the downloads have gone 404, and navigating Broadcom’s support site requires a degree that I do not have. This time the link was this one, giving you a zip file containing the MegaCLI package for all platforms.

If the link does not work next time, or a newer version is out, I also managed to find it on https://www.broadcom.com/support/download-search. Do a keyword search for megacli, expand “Management Software and Tools” in the results and choose the newest “MegaCLI x.x Px”. For now that is MegaCLI 5.5 P2, version 8.07.14.

Install MegaCLI

We now have the zip. Extract it, and under the “VmwareMN” folder there is the vib that we are going to need.

### SCP it to the host
jr@mbp:~ jr$ scp /Users/jr/Download/8-07-07_MegaCLI/VmwareMN/vmware-esx-MegaCLI-8-07-07.vib root@[ESXHOST]:/tmp/

### SSH to the ESXi host and install. Reboot afterwards
[root@esxhost:~] esxcli software vib install -v /tmp/vmware-esx-MegaCLI-8-07-07.vib

If you get a “Could not find a trusted signer” error when trying to install the vib, the workaround is to add “--no-sig-check” at the end of the esxcli command, after the file path. Since I downloaded it from Broadcom’s own site, I trust it.
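If you need it, the full command ends up looking something like this:

### Install the vib, skipping the signature check
[root@esxhost:~] esxcli software vib install -v /tmp/vmware-esx-MegaCLI-8-07-07.vib --no-sig-check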

After the host reboot (which is very annoying, but necessary), we can now find the MegaCLI binary under /opt/lsi/MegaCLI/.

Useful MegaCLI commands

### Enclosure information
/opt/lsi/MegaCLI/MegaCli -EncInfo -aALL

### Virtual drive information
/opt/lsi/MegaCLI/MegaCli -LDInfo -Lall -aALL

### Physical drive information
/opt/lsi/MegaCLI/MegaCli -PDList -aALL

### Silence active alarm
/opt/lsi/MegaCLI/MegaCli -AdpSetProp AlarmSilence -aALL

### Disable alarm
/opt/lsi/MegaCLI/MegaCli -AdpSetProp AlarmDsbl -aALL

### Enable alarm
/opt/lsi/MegaCLI/MegaCli -AdpSetProp AlarmEnbl -aALL

### Prepare for removal
/opt/lsi/MegaCLI/MegaCli -PdPrpRmv -PhysDrv [E:S] -aN

### Unconfigured Bad to good
/opt/lsi/MegaCLI/MegaCli -PDMakeGood -PhysDrv [E:S] -aN

I found a guy who did some more advanced MegaCLI scripting; it’s a bit old but still very useful. You can find the site here. I have done some copy-pasting from the script, but all credit goes to the guy behind the link.

### List disk status
/opt/lsi/MegaCLI/MegaCli -PDlist -aALL -NoLog | egrep 'Slot|state' | awk '/Slot/{if (x)print x;x="";}{x=(!x)?$0:x" -"$0;}END{print x;}' | sed 's/Firmware state://g'

Conclusion

MegaCLI is awesome: so many possibilities and so flexible. In my opinion it’s a bit hard to find, but once you have it installed it’s easy to use. I have tested this on ESXi 6.7 and it works as it should. I hope you can use some of it.

Ceph MDS stuck in ‘rejoin’

The CephFS filesystem suddenly dies, what do you do? Well, it relies on the MDS (Metadata Server) to keep the filesystem online. Looking at the Ceph status, it tells us that the MDS cache is oversized and the filesystem is degraded. This is only a health warning, but the filesystem is unavailable because of it. In a way that’s good, because it means there is nothing wrong with the Ceph cluster itself. Looking more into the problem, the MDS seems to be stuck in a recovery limbo state. Hmm…
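For reference, the state I was looking at comes from the standard status commands (nothing fancy, output omitted here):

### Overall cluster health, including the MDS cache warning
[root@dspp-mon-a-01 cephadm]# ceph -s
[root@dspp-mon-a-01 cephadm]# ceph health detail

### State of the MDS daemons and the filesystem
[root@dspp-mon-a-01 cephadm]# ceph fs status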

The MDS services are stuck in the “rejoin” limbo state, never coming back up.

Looking at the MDS state documentation, the rejoin state indicates that the MDS is loading the old cache back in before going into the “up” state: https://docs.ceph.com/docs/master/cephfs/mds-states/.

Looking closer at the logs, the process is starting over and over again but never finishes. This is due to a mechanism in the monitors kicking in and restarting the MDS service when it does not report back to the cluster as up in a timely fashion. So it never gets done with what it’s doing. Hmm, and CephFS is still unavailable.

Setting mds_beacon_grace to a wildly high number did not help either. I don’t know whether the grace period should make a difference here, but going from the default 15 to 1500 changed nothing. I was hoping to give the MDS time to load the old cache back in.
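For the record, bumping the grace period can be done along these lines (assuming a Nautilus-era cluster with the centralized config store; on older releases you would inject the setting instead):

### Raise mds_beacon_grace from the default 15 to 1500 seconds
[root@dspp-mon-a-01 cephadm]# ceph config set global mds_beacon_grace 1500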

From the logs, the MDS is respawning because it lost contact with the cluster.

Going through an endless number of Ceph threads, I read that others have encountered this exact problem. This thread gave me the info on how to remediate it and get the cluster back online: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028981.html

Setting the MDS to wipe all client sessions was the first step. I set this through the Ceph GUI, since I find it easier there than hunting for the CLI command.

The default is “false”; with it set globally to “true”, no client connections are made.
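The CLI equivalent should be along these lines, assuming the option name mds_wipe_sessions from the mailing list thread (remember to set it back to false afterwards):

### Tell the MDS daemons to wipe all client sessions
[root@dspp-mon-a-01 cephadm]# ceph config set mds mds_wipe_sessions true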

The second step was to delete the “mds*_openfiles.0” objects from the CephFS metadata pool. Looking into the pool I could see there were many objects referring to open files, but the post said only to delete the .0 objects. This needs to be done for every MDS service you have running. The “openfiles” objects are open file hints, so it is safe to delete them. Read more on rados commands at https://docs.ceph.com/docs/giant/man/8/rados/
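To get an overview of the openfiles objects before deleting anything, a simple listing of the metadata pool does the trick (pool name as used below):

### List the open file hint objects in the metadata pool
[root@dspp-mon-a-01 cephadm]# rados -p cephfs_metadata ls | grep openfiles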

### Delete for every MDS you have running: mds0, mds1, mds2 etc.
[root@dspp-mon-a-01 cephadm]# rados -p cephfs_metadata rm mds0_openfiles.0

After deleting the open file objects I stopped all MDS services on all nodes. Some of them did not stop, so I killed the processes. I probably should have stopped them before deleting the open file objects…

[root@dspp-osd-a-06 cephadm]# systemctl stop ceph-mds.target

After starting the MDS services again, they recovered in a couple of seconds. CephFS was available and “ceph -s” showed a healthy condition. I set the wipe_sessions option back to false, and CephFS could be mounted again.
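For completeness, bringing the daemons back and checking the result was just this on each MDS node, plus a status check from a monitor:

### Start the MDS services again
[root@dspp-osd-a-06 cephadm]# systemctl start ceph-mds.target

### Verify that CephFS and the cluster are healthy
[root@dspp-mon-a-01 cephadm]# ceph -s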

What to conclude? There is a fix in 14.2.5, so when it’s released it is time to update the Ceph cluster. The ticket for it should be this one: https://tracker.ceph.com/issues/41467. This thread was also a good help in resolving the problem: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AOYWQSONTFROPB4DXVYADWW7V25C3G6Z/.