Ceph Outage on /cwork

As noted in the earlier section on adding new hardware to the cluster, we expanded the number of metadata servers (MDS) supporting the /cwork file system. As part of that process we enabled directory fragmentation to better support large numbers of files in a single directory. Unfortunately, we did so on the root of the /cwork file system, as we had not created a subvolume for that mount point. Enabling directory fragmentation on the root should not have been possible, but a bug in Ceph allowed it. Fragmenting the root directory has led to catastrophic corruption of the /cwork metadata pool. The /cwork directory will be unavailable until a full rebuild of the metadata is completed, which could take more than a week. The Ceph development team will be releasing a patch to prevent this issue, and we have informed our users. Fortunately, this only affects the /cwork shared volume; the volume designed to store long-term researcher data was unaffected, as it is a separate storage volume.
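
In hindsight, carving out a subvolume for the mount point would have kept the directory-level change off the file system root. A minimal sketch of creating one with the standard ceph CLI follows; the volume name "cwork" and the subvolume/group names are illustrative, not our actual layout:

    # Create a subvolume group and subvolume below the file system root (names illustrative).
    ceph fs subvolumegroup create cwork scratch_group
    ceph fs subvolume create cwork scratch --group_name scratch_group
    # The returned path is what would be mounted/exported instead of the root:
    ceph fs subvolume getpath cwork scratch --group_name scratch_group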

Work is underway with many steps completed; an update will be provided when services are fully restored.

Ceph Cluster Expansion

OSD Rebalancing

After receiving a batch of 8 new OSD hosts, we brought them online, configured them, and prepared to add them to our cluster. We began by adding a single host with storage for 12 OSDs. However, after adding the host, we noticed our cluster became sluggish, and many OSD daemons were reporting “slow ops.”
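
For reference, the host and OSD additions follow the usual orchestrator pattern; a sketch assuming cephadm, with an illustrative host name, address, and device path:

    # Register the new host with the orchestrator (name and address illustrative).
    ceph orch host add osd-host-09 10.0.0.109
    # Create an OSD on one of the host's devices (device path illustrative).
    ceph orch daemon add osd osd-host-09:/dev/sdb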

It would seem that as soon as the first of the twelve new OSDs came online (and since our failure domain is defined at the “host” level), the CRUSH algorithm decided that the single OSD on that host must hold all PGs designated for that host, a 12x oversubscription for that one OSD. Too many PGs were therefore mapped to that OSD, and the system became very sluggish. This was evident from the new OSD showing over 600 PGs allocated to it (healthy OSDs in our system have around 180 PGs).
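
The skew is easy to spot in the per-OSD placement group counts, e.g.:

    # The PGS column reports placement groups per OSD; the new OSD stood out at 600+.
    ceph osd df tree
    # Overall cluster health and slow-ops warnings:
    ceph -s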

To recover, we had to fail the newly added OSD and allow the system to rebalance. Due to the inadvertent oversubscription on the new OSD, this rebalancing process is still ongoing several weeks later (we estimate it to take 3-4 weeks in total!).
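
For reference, failing an OSD and letting the cluster rebalance amounts to marking it out and watching the backfill; a sketch (the OSD id 120 is illustrative):

    # Mark the oversubscribed OSD out so its PGs remap to other OSDs (id illustrative).
    ceph osd out 120
    # Watch recovery/backfill progress:
    ceph -w
    ceph pg stat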

To prevent this: when adding new hosts, first set the “norebalance” flag (ceph osd set norebalance) BEFORE adding them. Once the new host(s) are added and their OSDs defined, the “norebalance” flag can be unset, triggering the expansion and rebalance onto the new OSDs. Once the current rebalance completes, we will test this procedure by adding the remaining new hosts; a sketch of the sequence follows.
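
In command form, the intended sequence looks roughly like this (host name and device path are illustrative; the orchestrator commands assume cephadm):

    # Pause data rebalancing before changing the topology:
    ceph osd set norebalance
    # Add the new host(s) and define their OSDs (names illustrative):
    ceph orch host add osd-host-10 10.0.0.110
    ceph orch daemon add osd osd-host-10:/dev/sdb
    # Once all new OSDs are up and in, allow rebalancing to begin:
    ceph osd unset norebalance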

Ceph Monitors will eat your SSD

The Ceph Monitor daemon maintains the primary copy of the cluster map, a critical and dynamic piece of the Ceph architecture. This data is stored locally on the host running the Monitor daemon, in a RocksDB database within /var/lib/ceph/, and sustains very frequent writes via fsync calls. In our initial configuration, /var/lib/ceph/ shared the root filesystem on a standard commodity SSD. The write endurance of the backing hardware was quickly (within months) exhausted to the point of needing replacement.
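
One way to keep an eye on this kind of wear is the drive's SMART data; for example, using smartctl from smartmontools (the device path is illustrative):

    # Inspect endurance indicators such as "Percentage Used" (NVMe) or
    # "Media_Wearout_Indicator" (SATA). The device path is illustrative.
    smartctl -a /dev/sda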

To better accommodate this write-heavy workload, we purchased additional systems with higher-endurance SSDs, created a dedicated filesystem on those drives, and mounted it at /var/lib/ceph so that the RocksDB database is stored there. Once these new hosts were added to the Ceph cluster, we labeled them and updated the Ceph Orchestrator deployment strategy for the Monitor daemons to place them only on these hosts. Our hope is that this higher-endurance hardware will last substantially longer, improving performance and reliability.
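
The labeling and placement change follow the standard orchestrator workflow; a sketch, assuming cephadm and illustrative host and label names:

    # Label the new high-endurance hosts (names illustrative):
    ceph orch host label add mon-host-01 mon
    ceph orch host label add mon-host-02 mon
    ceph orch host label add mon-host-03 mon
    # Constrain Monitor placement to hosts carrying that label:
    ceph orch apply mon --placement="label:mon"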