Outage
On Jul 31, 2024, multiple MDS daemons were brought online to serve the /cwork filesystem in an attempt to improve performance and possibly alleviate MDS deadlocks. Because this filesystem was originally created without a subvolume, we could not enable the desired “distributed ephemeral pinning” balancing method (which dynamically pins each immediate subdirectory to an MDS based on a consistent hash of the subdirectory’s inode number across the pool of active MDSs). Absent a specified balancing strategy, it’s unclear how the MDSs were sharing load at this stage.
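For context, distributed ephemeral pinning is normally enabled by setting a virtual extended attribute on the directory whose immediate children should be spread across the active MDS ranks. A minimal sketch, assuming a hypothetical mount point of /mnt/cwork (the path and layout here are illustrative, not the actual cluster configuration):

```python
# Sketch only: enabling distributed ephemeral pinning on a CephFS directory
# (e.g. a subvolume root). Assumes the filesystem is mounted at /mnt/cwork
# and that multiple MDS ranks are active.
import os

SUBVOLUME_ROOT = "/mnt/cwork"  # hypothetical mount point

# This vxattr tells the MDS cluster to pin each immediate child directory
# to an MDS rank chosen by a consistent hash of the child's inode number,
# spreading metadata load across all active ranks.
os.setxattr(SUBVOLUME_ROOT, "ceph.dir.pin.distributed", b"1")
```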
On Aug 6, 2024, with support and direction from 42on, a stopgap method was attempted to make use of the multiple MDSs by fragmenting the root inode (/cwork/) so that its subdirectories (the user directories, i.e. /cwork/<user>/) would be distributed across the MDS daemons.
What was not known at the time was that fragmentation of the root inode is not supported in CephFS and causes metadata corruption when an MDS daemon is initialized. This happened (likely as part of a daemon failover) on Aug 14, 2024, and caused the filesystem to fail. The behavior was later replicated by 42on in a lab environment and confirmed by a Ceph developer, and is therefore likely the definitive root cause of this corruption and outage.
Recovery
Over the nearly two weeks that followed, a recovery procedure was carried out, with some individual steps taking up to 43 hours to complete. This process essentially created a new filesystem (“cwork2”) that re-used the previous data pool while reconstructing a new metadata pool.
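For a sense of what such a rebuild involves, the upstream cephfs-data-scan tooling reconstructs metadata by scanning every object in the data pool, which is why individual passes can run for tens of hours at this scale. The outline below is a rough sketch of that general approach only; it is not the exact procedure used here (flags, ordering, pool names, and the additional version-specific steps such as journal and table resets are assumptions, and the actual recovery was driven by 42on):

```python
# Rough outline of a CephFS metadata rebuild using cephfs-data-scan.
# NOT the exact cwork2 procedure: pool/filesystem names are assumed, and
# several version-specific steps (journal/table resets, enabling multiple
# filesystems, etc.) are omitted for brevity.
import subprocess

DATA_POOL = "cwork_data"           # assumed name of the surviving data pool
NEW_METADATA_POOL = "cwork2_meta"  # assumed name of the rebuilt metadata pool
NEW_FS = "cwork2"

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Create a fresh metadata pool and a new filesystem that re-uses the old
#    data pool (the overlay flag acknowledges the data pool is not empty).
run("ceph", "osd", "pool", "create", NEW_METADATA_POOL)
run("ceph", "fs", "new", NEW_FS, NEW_METADATA_POOL, DATA_POOL,
    "--allow-dangerous-metadata-overlay")

# 2. Rebuild metadata by scanning the data pool. scan_extents and
#    scan_inodes are the long-running, per-object passes.
run("cephfs-data-scan", "init", "--filesystem", NEW_FS)
run("cephfs-data-scan", "scan_extents", "--filesystem", NEW_FS, DATA_POOL)
run("cephfs-data-scan", "scan_inodes", "--filesystem", NEW_FS, DATA_POOL)
run("cephfs-data-scan", "scan_links", "--filesystem", NEW_FS)
```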
On Aug 25, 2024, the recovery was complete and the filesystem mounted for internal validation. At this point we learned that we had lost all empty directories, user/group ownership, permissions and ACLs, and directory timestamps. File timestamps appeared intact, and file contents are presumed to be intact.
We then set basic file ownership across the filesystem to allow users access to their files again, and mounted the filesystem read-only on Globus for users to retrieve their data at /cwork_orig/.
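Because each top-level directory is named after its owner, reapplying baseline ownership can be scripted directly from the directory names. A minimal sketch, assuming the usual /cwork_orig/<user>/ layout and that usernames resolve through the system password database (the actual remediation script may have differed):

```python
# Minimal sketch: reapply per-user ownership under a recovered CephFS
# tree laid out as /cwork_orig/<user>/. Assumes directory names are
# usernames that resolve via the password database.
import os
import pwd
from pathlib import Path

ROOT = Path("/cwork_orig")  # recovered filesystem, one directory per user

for userdir in ROOT.iterdir():
    if not userdir.is_dir():
        continue
    try:
        pw = pwd.getpwnam(userdir.name)  # map directory name -> uid/gid
    except KeyError:
        print(f"skipping {userdir}: no such user")
        continue
    # Recursively chown everything under the user's directory.
    for dirpath, dirnames, filenames in os.walk(userdir):
        os.chown(dirpath, pw.pw_uid, pw.pw_gid)
        for name in filenames:
            os.chown(os.path.join(dirpath, name), pw.pw_uid, pw.pw_gid)
```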
A new filesystem (internally called “cwork3”) was created and configured with a subvolume, multiple active MDS daemons, and distributed ephemeral pinning enabled. Subdirectories were created for all users, and the filesystem was mounted at /cwork across the DCC, Globus, and the Bartesaghi cluster.
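The cluster-side configuration this implies amounts to a couple of commands; the sketch below uses an assumed subvolume name and MDS count, with the distributed pinning attribute applied as shown earlier:

```python
# Sketch of the kind of configuration a setup like cwork3 implies
# (the subvolume name and MDS count are assumptions, not the real values).
import subprocess

FS_NAME = "cwork3"       # internal filesystem name from the report
SUBVOLUME = "cwork"      # hypothetical subvolume name
ACTIVE_MDS_COUNT = "4"   # hypothetical number of active MDS ranks

def ceph(*args: str) -> None:
    subprocess.run(("ceph", *args), check=True)

# Create a subvolume so there is a proper root to hold user directories
# (and to apply distributed ephemeral pinning to, as sketched earlier).
ceph("fs", "subvolume", "create", FS_NAME, SUBVOLUME)

# Allow multiple active MDS ranks so metadata load can actually be spread.
ceph("fs", "set", FS_NAME, "max_mds", ACTIVE_MDS_COUNT)
```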
Current Status – Aug 29, 2024
- The new-and-improved /cwork is available to users.
- Users are able to retrieve their previous data and transfer it using Globus; the old read-only copy at /cwork_orig/ will be disconnected on Nov 1 and destroyed thereafter.
- Upon request, we can perform data migration on behalf of users. Note that this is merely a convenience for the user; there is no technical benefit or improvement in speed.
- Six new OSD hosts were recently added as well, and the system is currently rebalancing (3.7 PiB raw, 25% used). One more OSD host remains to be added.