Brocade – Fibre Channel zoning with CLI

When you have to do zoning in the Brocade world, you are often pointed to an old Java GUI client. It is not always easy to open the GUI, since not all Brocade switches have been updated, and all kinds of Java security restrictions will prevent you from doing your job. With the CLI we can skip the GUI and get straight down to business.

CLI configuration also has the great advantage of being more or less self-documenting 🙂 Here we will look at what I would call basic zoning: a new device has been added to your Fibre Channel fabric, and you need to create a zone for the new WWPN and the storage system.

Adding an SSH key

This is, in my opinion, a good thing to set up; it makes it much easier to log in the next time. Of course it is optional, but it is a great advantage.

The public SSH key is gathered from a system that allows SSH-based logins. See below for how to set it up.
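If there is no key pair on the system you import from yet, a minimal sketch with OpenSSH could look like this; the path and file name simply match the answers given to sshutil below, and logging in with the key assumes the switch name resolves:

# Generate a key pair; sshutil imports the .pub file
ssh-keygen -t rsa -f /home/username/username

# After the import, log in to the switch with the private key
ssh -i /home/username/username admin@SAN-SW-03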

SAN-SW-03:admin> sshutil importpubkey
Enter user name for whom key is imported:admin
Enter IP address:10.1.100.20
Enter remote directory:/home/username
Enter public key name(must have .pub suffix):username.pub
Enter login name:username
username@10.1.100.20's password: 
public key is imported successfully.

Create alias

Below you first see the syntax, and afterward the actual command used in production.

# Syntax
alicreate "ALIAS_NAME", "WWPN"

# Production command
alicreate "dc2esxmgmt1_11", "50:01:43:80:31:80:50:98"

Create zone

# Syntax
zonecreate "ZONE_NAME", "WWPN_alias_1;WWPN_alias_2"

# Production command
zonecreate "dc2storv5030_dc2esxmgmt1_11", "dc2esxmgmt1_11;dc2storv5030_01_01_p1v;dc2storv5030_01_01_p2v;dc2storv5030_01_02_p1v;dc2storv5030_01_02_p2v"

Add zone to config

# Syntax
cfgadd "CFG_NAME", "ZONE_NAME"

# Production command
cfgadd "fabric1", "dc2storv5030_dc2esxmgmt1_11"
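
Before saving, it is worth checking that the zone actually ended up in the defined configuration:

# Show the defined and effective zoning configuration
cfgshow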

SAN-SW-03:admin> cfgsave
WARNING!!!
The changes you are attempting to save will render the
Effective configuration and the Defined configuration
inconsistent. The inconsistency will result in different
Effective Zoning configurations for switches in the fabric if
a zone merge or HA failover happens. To avoid inconsistency
it is recommended to commit the configurations using the
'cfgenable' command.

Do you want to proceed with saving the Defined
zoning configuration only?  (yes, y, no, n): [no] yes
Updating flash ...

Enable new config

SAN-SW-03:admin> cfgenable fabric1 
You are about to enable a new zoning configuration.
This action will replace the old zoning configuration with the
current configuration selected. If the update includes changes 
to one or more traffic isolation zones, the update may result in  
localized disruption to traffic on ports associated with
the traffic isolation zone changes.
Do you want to enable 'fabric1' configuration  (yes, y, no, n): [no] yes
zone config "fabric1" is in effect
Updating flash ...
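
As a last check, the effective configuration can be listed to confirm that the new zone is active:

# Show the currently effective zoning configuration
cfgactvshow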

Conclusion

It’s easy to do basic Fibre Channel zoning. The only thing I miss from the GUI, and haven’t found yet, is a view of all discovered WWNs where you can pick the WWN for the new alias you are creating. In this case the server management system shows the WWN of the FC adapter, and it’s easy to copy-paste into the CLI commands.

So if you don’t have HPE OneView or some other fancy FC provisioning tool, you now know a basic and low-key way of doing it.

And again, did I mention that when you use the CLI, it basically documents itself? 😉

NetApp – Service Processor stuck updating

After an update to ONTAP 9.7 the service processors were stuck in “updating”. They never came up again, not even after rebooting them.

Procedure:

  • Disable auto update
  • Reboot one SP, wait for it to show online.
  • Run the update manually
  • If it is online and updated, enable auto update again.

### Disable autoupdate
system service-processor image modify -node <nodename> -autoupdate false

### Reboot the service processor
system service-processor reboot-sp -node <nodename>

### Initiate update
system service-processor image update -node <nodename>

### Verify version and SP status
system service-processor show

### Enable autoupdate
system service-processor image modify -node <nodename> -autoupdate true

Here we could see that ctrl02 was online again, but with the wrong firmware. After the manual update we had the correct firmware on both nodes and they were online, ready to enable autoupdate again.
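
To check a single node along the way, the show command from above can be scoped with -node:

### Check SP status and firmware for one node
system service-processor show -node <nodename>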

https://kb.netapp.com/app/answers/answer_view/a_id/1028746/~/service-processor-firmware-update-fails-

Install and use MegaCLI on a VMware host

Over the last decade I have had the fun of having to manage LSI-based RAID controllers from time to time. Never on Windows machines, where the GUI-based Storage Manager tools are simple to work with.

Even though I usually find the vib and get it installed, I always struggle to remember how it is installed and what the commands are. This time I will write it down for the future me, or maybe for you?

Procedure

  • Find the MegaCLI vib file and download it…
  • Copy vib to ESXi host
  • Install vib
  • Use MegaCLI for whatever purpose you got

Finding the vib

This is where I struggle the most. LSI was bought by Avago, and soon after Avago was bought by Broadcom, so the old support links for the downloads have gone 404, and navigating Broadcom’s support site requires an education degree that I do not own. This time the link was this, giving you a zip file containing the MegaCLI package for all platforms.

If the link does not work next time, or a newer version is out, I have also managed to find it on https://www.broadcom.com/support/download-search. Do a keyword search for MegaCLI, expand “Management Software and Tools” in the results, and choose the newest “MegaCLI x.x Px”. For now that is MegaCLI 5.5 P1, version 8.07.07.

Install MegaCLI

Now that we have the zip, extract it; under the “VmwareMN” folder you will find the vib that we are going to need.

### SCP it to the host
jr@mbp:~ jr$ scp /Users/jr/Download/8-07-07_MegaCLI/VmwareMN/vmware-esx-MegaCLI-8-07-07.vib root@[ESXHOST]:/tmp/

### SSH to the ESXi host and install. Reboot afterwards
[root@esxhost:~] esxcli software vib install -v /tmp/vmware-esx-MegaCLI-8-07-07.vib

If you are unlucky and get a “Could not find a trusted signer” error when trying to install the vib, the workaround is to add “--no-sig-check” at the end of the esxcli command, after the file path. Since I downloaded the vib from Broadcom’s own site, I trust it.
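
With the flag added, the command ends up like this:

[root@esxhost:~] esxcli software vib install -v /tmp/vmware-esx-MegaCLI-8-07-07.vib --no-sig-check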

After the host reboot (which is very annoying, but necessary), we can now find the MegaCLI binary under /opt/lsi/MegaCLI/.
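
A quick way to confirm that the vib and the binary are in place (paths as used above):

### Check that the vib is installed and the binary is present
[root@esxhost:~] esxcli software vib list | grep -i megacli
[root@esxhost:~] ls /opt/lsi/MegaCLI/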

Useful MegaCLI commands

### Enclosure information
/opt/lsi/MegaCLI/MegaCli -EncInfo -aALL

### Virtual drive information
/opt/lsi/MegaCLI/MegaCli -LDInfo -Lall -aALL

### Physical drive information
/opt/lsi/MegaCLI/MegaCli -PDList -aALL

### Silence active alarm
/opt/lsi/MegaCLI/MegaCli -AdpSetProp AlarmSilence -aALL

### Disable alarm
/opt/lsi/MegaCLI/MegaCli -AdpSetProp AlarmDsbl -aALL

### Enable alarm
/opt/lsi/MegaCLI/MegaCli -AdpSetProp AlarmEnbl -aALL

### Prepare for removal
/opt/lsi/MegaCLI/MegaCli -PdPrpRmv -PhysDrv [E:S] -aN

### Unconfigured Bad to good
/opt/lsi/MegaCLI/MegaCli -PDMakeGood -PhysDrv[E:S] -aN

I found a guy who has done a bit more advanced MegaCLI scripting; it is a bit old but still very useful. You can find the site here. I have done some copy-pasting from the script, but all credit goes to the guy behind the link.

### List disk status
/opt/lsi/MegaCLI/MegaCli -PDlist -aALL -NoLog | egrep 'Slot|state' | awk '/Slot/{if (x)print x;x="";}{x=(!x)?$0:x" -"$0;}END{print x;}' | sed 's/Firmware state://g'

Conclusion

The CLI is awesome, with so many possibilities and so much flexibility. In my opinion it is a bit hard to find, but once you have it installed it is easy to use. I have tested this on ESXi 6.7 and it works as it should. I hope you can use some of it.

Ceph MDS stuck in ‘rejoin’

The CephFS filesystem suddenly dies, what do you do? Well, it relies on the MDS (Metadata Server) to keep the filesystem online. Looking at the Ceph status, it tells us that the MDS cache is oversized and the filesystem is degraded. It is only a health warning, yet the filesystem is unavailable because of it; in a way that is good, because it means there is nothing wrong with the Ceph cluster itself. Looking more into the problem, the MDS seems to be stuck in a recovery limbo state. Hmm…

The MDS services are stuck in a “rejoin” limbo state, never coming back up.
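
For reference, the state of the MDS daemons can be checked with the usual status commands (the exact output varies a bit between releases, and ceph fs status needs a reasonably recent one):

### Overall health and MDS map
ceph -s
ceph mds stat

### Per-filesystem view of the MDS ranks and their states
ceph fs status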

Looking at the MDS state documentation, the rejoin state indicates that the MDS is trying to load the old cache back in before going into the “up” state: https://docs.ceph.com/docs/master/cephfs/mds-states/

Looking closer at the logs, the process is starting over and over again but never finishes. This is because a mechanism on the monitors kicks in and restarts the MDS service when it does not report back to the cluster as up in a timely fashion, so it never gets done with what it is doing. Hmm, and CephFS is still unavailable.

Trying to set mds_beacon_grace to a wild number did not help either; I do not know whether the grace period should make a difference here, but going from the default 15 to 1500 changed nothing. I was hoping to give the MDS time to load the old cache back in.
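
For reference, on a Nautilus-era cluster this is roughly how the value can be raised; treat the exact section the option is set in as an assumption, and remember to revert it afterwards:

### Raise the MDS beacon grace period from the default 15 seconds
ceph config set global mds_beacon_grace 1500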

From the logs we can see that the MDS is respawning because it lost contact with the cluster.

Going through an endless number of Ceph threads, I could read that others have encountered this exact problem. This thread gave me the information on how to remediate it and get the cluster back online: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028981.html

Setting the MDS to wipe all client sessions was the first step. I set this through the Ceph GUI, since I find it easier to locate there compared to the CLI command.

The default is “false”; with it set globally to “true”, no client connections are made.
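
The CLI equivalent should be something along these lines; the option name comes from the mailing-list thread above, so treat the exact invocation as an assumption:

### Wipe client sessions when the MDS starts (set it back to false when done)
ceph config set mds mds_wipe_sessions true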

The second step was to delete the “mds*_openfiles.0” objects from the CephFS metadata pool. Looking into the pool I could see that there were many objects referring to open files, but the post stated only to delete the .0 objects. This needs to be done for every MDS service you have running. The “openfiles” objects are open file hints, so it is safe to delete them. Read more about the rados commands at https://docs.ceph.com/docs/giant/man/8/rados/

### Delete for every MDS rank you have running: mds0, mds1, mds2, etc.
[root@dspp-mon-a-01 cephadm]# rados -p cephfs_metadata rm mds0_openfiles.0
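
If you want to see what is in the pool before removing anything, listing the objects is harmless:

### List the open file hint objects in the metadata pool
rados -p cephfs_metadata ls | grep openfiles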

After deleting the open file objects I stopped all MDS services on all nodes. Some of them did not stop, so I killed the processes. I probably should have stopped them before deleting the open file objects…

[root@dspp-osd-a-06 cephadm]# systemctl stop ceph-mds.target

After starting the MDS services again, they recovered in a couple of seconds. CephFS was available and “ceph -s” showed a healthy condition. I set wipe_sessions back to false, and CephFS could be mounted again.
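
For completeness, starting the daemons again uses the same systemd target as the stop command above, and the recovery can be followed with the status command:

[root@dspp-osd-a-06 cephadm]# systemctl start ceph-mds.target
[root@dspp-mon-a-01 cephadm]# ceph -s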

What to conclude? There is a fix in 14.2.5, so when it is released it will be time to update the Ceph cluster. The ticket for it should be this one: https://tracker.ceph.com/issues/41467. This thread was also a great help in resolving the problem: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AOYWQSONTFROPB4DXVYADWW7V25C3G6Z/