Ceph MDS stuck in ‘rejoin’

The CephFS filesystem suddenly dies; what do you do? Well, CephFS relies on the MDS (Metadata Server) to keep the filesystem online. The Ceph status told us that the MDS cache was oversized and the filesystem was degraded. This is only a health warning, but it was enough to make the filesystem unavailable. That's good in a way, because it means nothing is wrong with the Ceph cluster itself. Looking further into the problem, the MDS seemed to be stuck in a recovery limbo state. Hmm…

The MDS services are stuck in the “rejoin” state and never come back up.

According to the MDS states documentation, the rejoin state indicates that the MDS is trying to load the old cache back in before going into the “up” state. https://docs.ceph.com/docs/master/cephfs/mds-states/.

Looking closer at the logs, the process starts over and over again but never finishes. This is due to a mechanism in the monitors that kicks in and restarts the MDS service when it does not report back to the cluster as up in a timely fashion. So it never gets done with what it's doing. Hmm, and CephFS is still unavailable.

Setting mds_beacon_grace to a very high value did not help either; I don't know whether the grace period should have any effect here. Going from the default of 15 to 1500 made no difference. I was hoping to give the MDS time to load the old cache.
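For reference, this is roughly how the grace period can be raised from the CLI. It is a minimal sketch, assuming a release with the centralized config database (`ceph config set`, Mimic or later). The `run` helper and the `DRY_RUN` variable are hypothetical names for illustration, not part of Ceph. The option is set at the global level here because the monitors also read it when deciding whether an MDS is laggy.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the command; set DRY_RUN=0 to run it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Raise the beacon grace period (in seconds; the default is 15).
set_beacon_grace() {
    run ceph config set global mds_beacon_grace "$1"
}

set_beacon_grace 1500
```

Run it once with the default dry run to check the command, then again with `DRY_RUN=0` to apply it.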

The logs show the MDS being respawned because it lost contact with the cluster.

Going through an endless number of Ceph threads, I read that others had encountered this exact problem. This thread gave me the information on how to remediate it and get the cluster back online: http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028981.html

The first step was to set the MDS to wipe all client sessions. I set this through the Ceph GUI, since I find the option easier to locate there than the CLI command.

The default is “false”; with it set to “true” at the global level, no client connections are made.
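If you prefer the CLI, the option behind that GUI switch should be `mds_wipe_sessions`. The sketch below only prints the command so it can be reviewed before running. It assumes the centralized config database (`ceph config set`) is available, and `wipe_sessions_cmd` is a made-up helper name.

```shell
#!/bin/sh
# Print the CLI equivalent of toggling the option in the GUI.
# Nothing is executed; review the line, then paste it or pipe it to sh.
wipe_sessions_cmd() {
    case "$1" in
        true|false)
            echo "ceph config set global mds_wipe_sessions $1" ;;
        *)
            echo "usage: wipe_sessions_cmd true|false" >&2
            return 1 ;;
    esac
}

wipe_sessions_cmd true    # before restarting the MDS daemons
```

Call it with `false` once the filesystem is back to get the command that reverts the setting.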

The second step was to delete the MDS “mds*_openfiles.0” objects from the CephFS metadata pool. Looking into the pool, I could see that there were many objects referring to open files, but the post said to delete only the .0 objects. This needs to be done for all the MDS services that you have running. The “openfiles” objects are open file hints, so it's safe to delete them. Read more on the rados commands at https://docs.ceph.com/docs/giant/man/8/rados/

### Repeat the delete for every MDS you have running: mds1, mds2, mds3, etc.
[root@dspp-mon-a-01 cephadm]# rados -p cephfs_metadata rm mds0_openfiles.0
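With several MDS daemons this gets repetitive, so the delete commands can be generated in a loop. This is a sketch that assumes the objects are named after the MDS rank, counting from 0 as in `mds0_openfiles.0` above. It is worth confirming the names first with `rados -p cephfs_metadata ls | grep openfiles`. The function name `openfiles_rm_cmds` is made up for this example, and the commands are only printed, not run.

```shell
#!/bin/sh
# Print one "rados rm" command per MDS rank, starting at rank 0.
# $1 = metadata pool name, $2 = number of MDS ranks
openfiles_rm_cmds() {
    pool="$1"
    ranks="$2"
    rank=0
    while [ "$rank" -lt "$ranks" ]; do
        echo "rados -p $pool rm mds${rank}_openfiles.0"
        rank=$((rank + 1))
    done
}

# Example: three ranks, pool name as used above.
openfiles_rm_cmds cephfs_metadata 3
```

Pipe the output to `sh` once the list looks right.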

After deleting the open file objects, I stopped all MDS services on all nodes. Some of them did not stop, so I killed the process. I probably should have stopped them before deleting the open file objects.

[root@dspp-osd-a-06 cephadm]# systemctl stop ceph-mds.target

After I started the MDS services again, it recovered in a couple of seconds. CephFS was available and “ceph -s” showed a healthy condition. I set wipe_sessions back to false, and now CephFS could be mounted again.
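To avoid staring at the status output while waiting, a small check like the following can tell when the MDS has left rejoin. It is only a sketch: it looks for the state names `up:rejoin` and `up:active` in whatever text it is given, and `mds_recovered` is a hypothetical helper name. The exact layout of the `ceph mds stat` output differs between releases, so only the state names are matched.

```shell
#!/bin/sh
# Return 0 when the given status text shows an active MDS and no rank
# still in up:rejoin; return 1 otherwise.
mds_recovered() {
    case "$1" in
        *up:rejoin*) return 1 ;;
        *up:active*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Self-contained demo with a literal state string:
mds_recovered "up:rejoin" || echo "still rejoining"

# Against a live cluster (not executed here):
#   until mds_recovered "$(ceph mds stat)"; do sleep 2; done
```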

What to conclude? There is a fix in 14.2.5, so when it's ready it is time to update the Ceph cluster. The ticket for it should be this one: https://tracker.ceph.com/issues/41467. These guys were also a great help in resolving the problem: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/AOYWQSONTFROPB4DXVYADWW7V25C3G6Z/.

Ceph – slow recovery speed

Onsite at a customer, a 36-bay OSD node went down in their 500TB cluster built with 4TB HDDs. When it came back online, the Ceph cluster started to recover and rebalance.

The problem was that it was dead slow. 78Mb/s is not much when you have a 500TB cluster. So what to do?

There are several settings within Ceph you can adjust. Here are the two settings that worked for me.

osd max backfills:
Description: The maximum number of backfills allowed to or from a single OSD.
Default value: 1

I set it to 8, and the recovery went up to 350Mb/s. Setting it to 16 gave 700Mb/s, but clients were also affected, so 8 was the more moderate setting.
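To put those rates in perspective, here is a back-of-the-envelope estimate of how long it would take to move one node's worth of data. The assumptions are loud ones: the rates are read as megabytes per second, all 36 disks of 4TB are taken to be full (144TB), and the rate is treated as constant. That makes the result a rough upper bound, not a measurement. `recovery_hours` is a made-up helper name.

```shell
#!/bin/sh
# Rough hours needed to move a given amount of data at a given rate.
# $1 = data in TB (decimal), $2 = recovery rate in MB/s
recovery_hours() {
    # 1 TB = 1,000,000 MB; integer division rounds down
    echo $(( $1 * 1000000 / $2 / 3600 ))
}

node_tb=$((36 * 4))   # 36 bays x 4 TB, assuming every disk is full
for rate in 78 350 700; do
    echo "${rate} MB/s: about $(recovery_hours "$node_tb" "$rate") hours"
done
```

Under these assumptions the estimate drops from about 512 hours (roughly three weeks) at 78 to about 114 hours at 350 and about 57 hours at 700.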

osd recovery max active:
Description: The number of active recovery requests per OSD at one time. More requests will accelerate recovery, but the requests place an increased load on the cluster.
Default value: 3

I bumped it up a notch to 4.
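For completeness, the two settings can be applied from the CLI along these lines. This sketch assumes a release with the centralized config database (`ceph config set`); on older releases, something like `ceph tell 'osd.*' injectargs '--osd-max-backfills 8'` is the usual alternative. The function name `recovery_tuning_cmds` is made up, and the commands are only printed so they can be reviewed first.

```shell
#!/bin/sh
# Print the commands for the two settings; nothing is executed here.
# $1 = osd_max_backfills, $2 = osd_recovery_max_active
recovery_tuning_cmds() {
    echo "ceph config set osd osd_max_backfills $1"
    echo "ceph config set osd osd_recovery_max_active $2"
}

recovery_tuning_cmds 8 4
```

Calling it with `1 3` prints the commands that put both settings back to the defaults listed above once the recovery is done.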