Outage
On Jul 31, 2024, multiple MDS daemons were brought online to serve the /cwork filesystem in an attempt to improve performance and possibly alleviate MDS deadlocks. Because this filesystem was originally created without a subvolume, we could not enable the desired “distributed ephemeral pinning” balancing method (which dynamically pins each immediate subdirectory to an MDS based on a consistent hash of the subdirectory’s inode number across the pool of active MDSs). Absent a specified balancing strategy, it’s unclear how the MDSs were sharing load at this stage.
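For context, distributed ephemeral pinning is normally enabled by setting a virtual extended attribute on the directory whose immediate children should be spread across the active MDS ranks. A minimal sketch, assuming a hypothetical mount point of /mnt/cwork (the path and layout here are illustrative, not the actual cluster configuration):

```python
# Sketch only: enabling distributed ephemeral pinning on a CephFS directory
# (e.g. a subvolume root). Assumes the filesystem is mounted at /mnt/cwork
# and that multiple MDS ranks are active.
import os

SUBVOLUME_ROOT = "/mnt/cwork"  # hypothetical mount point

# This vxattr tells the MDS cluster to pin each immediate child directory
# to an MDS rank chosen by a consistent hash of the child's inode number,
# spreading metadata load across all active ranks.
os.setxattr(SUBVOLUME_ROOT, "ceph.dir.pin.distributed", b"1")
```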
On Aug 6, 2024, with support and direction from 42on, a stopgap method was attempted to make use of the multiple MDSs by fragmenting the root inode (/cwork/) so that its subdirectories (the user directories, i.e. /cwork/<user>/) would be distributed across the MDS daemons.
What was not known at the time was that fragmentation of the root inode is not supported in CephFS and causes metadata corruption when an MDS daemon is initialized. This happened (likely as part of a daemon failover) on Aug 14, 2024, and caused the filesystem to fail. The behavior was later replicated by 42on in a lab environment and confirmed by a Ceph developer, and is therefore likely the definitive root cause of this corruption and outage.
Recovery
Over the nearly two weeks that followed, a recovery procedure was carried out, with some individual steps taking up to 43 hours to complete. This process essentially created a new filesystem (“cwork2”) that re-used the previous data pool while reconstructing a new metadata pool.
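For a sense of what such a rebuild involves, the upstream cephfs-data-scan tooling reconstructs metadata by scanning every object in the data pool, which is why individual passes can run for tens of hours at this scale. The outline below is a rough sketch of that general approach only; it is not the exact procedure used here (flags, ordering, pool names, and the additional version-specific steps such as journal and table resets are assumptions, and the actual recovery was driven by 42on):

```python
# Rough outline of a CephFS metadata rebuild using cephfs-data-scan.
# NOT the exact cwork2 procedure: pool/filesystem names are assumed, and
# several version-specific steps (journal/table resets, enabling multiple
# filesystems, etc.) are omitted for brevity.
import subprocess

DATA_POOL = "cwork_data"           # assumed name of the surviving data pool
NEW_METADATA_POOL = "cwork2_meta"  # assumed name of the rebuilt metadata pool
NEW_FS = "cwork2"

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Create a fresh metadata pool and a new filesystem that re-uses the old
#    data pool (the overlay flag acknowledges the data pool is not empty).
run("ceph", "osd", "pool", "create", NEW_METADATA_POOL)
run("ceph", "fs", "new", NEW_FS, NEW_METADATA_POOL, DATA_POOL,
    "--allow-dangerous-metadata-overlay")

# 2. Rebuild metadata by scanning the data pool. scan_extents and
#    scan_inodes are the long-running, per-object passes.
run("cephfs-data-scan", "init", "--filesystem", NEW_FS)
run("cephfs-data-scan", "scan_extents", "--filesystem", NEW_FS, DATA_POOL)
run("cephfs-data-scan", "scan_inodes", "--filesystem", NEW_FS, DATA_POOL)
run("cephfs-data-scan", "scan_links", "--filesystem", NEW_FS)
```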
On Aug 25, 2024, the recovery was complete and the filesystem mounted for internal validation. At this point we learned that we had lost all empty directories, user/group ownership, permissions and ACLs, and directory timestamps. File timestamps appeared intact, and file contents are presumed to be intact.
We then set basic file ownership across the filesystem to allow users access to their files again, and mounted the filesystem read-only on Globus for users to retrieve their data at /cwork_orig/.
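Because each top-level directory is named after its owner, reapplying baseline ownership can be scripted directly from the directory names. A minimal sketch, assuming the usual /cwork_orig/<user>/ layout and that usernames resolve through the system password database (the actual remediation script may have differed):

```python
# Minimal sketch: reapply per-user ownership under a recovered CephFS
# tree laid out as /cwork_orig/<user>/. Assumes directory names are
# usernames that resolve via the password database.
import os
import pwd
from pathlib import Path

ROOT = Path("/cwork_orig")  # recovered filesystem, one directory per user

for userdir in ROOT.iterdir():
    if not userdir.is_dir():
        continue
    try:
        pw = pwd.getpwnam(userdir.name)  # map directory name -> uid/gid
    except KeyError:
        print(f"skipping {userdir}: no such user")
        continue
    # Recursively chown everything under the user's directory.
    for dirpath, dirnames, filenames in os.walk(userdir):
        os.chown(dirpath, pw.pw_uid, pw.pw_gid)
        for name in filenames:
            os.chown(os.path.join(dirpath, name), pw.pw_uid, pw.pw_gid)
```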
A new filesystem (internally called “cwork3”) was created and configured with a subvolume, multiple active MDS daemons, and distributed ephemeral pinning enabled. Subdirectories were created for all users, and the filesystem was mounted at /cwork across the DCC, Globus, and the Bartesaghi cluster.
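The cluster-side configuration this implies amounts to a couple of commands; the sketch below uses an assumed subvolume name and MDS count, with the distributed pinning attribute applied as shown earlier:

```python
# Sketch of the kind of configuration a setup like cwork3 implies
# (the subvolume name and MDS count are assumptions, not the real values).
import subprocess

FS_NAME = "cwork3"       # internal filesystem name from the report
SUBVOLUME = "cwork"      # hypothetical subvolume name
ACTIVE_MDS_COUNT = "4"   # hypothetical number of active MDS ranks

def ceph(*args: str) -> None:
    subprocess.run(("ceph", *args), check=True)

# Create a subvolume so there is a proper root to hold user directories
# (and to apply distributed ephemeral pinning to, as sketched earlier).
ceph("fs", "subvolume", "create", FS_NAME, SUBVOLUME)

# Allow multiple active MDS ranks so metadata load can actually be spread.
ceph("fs", "set", FS_NAME, "max_mds", ACTIVE_MDS_COUNT)
```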
Current Status – Aug 29, 2024
- The new-and-improved /cwork is available to users.
- Users are able to retrieve their previous data and transfer it using Globus; the old read-only copy at /cwork_orig/ will be disconnected on Nov 1 and destroyed thereafter.
- Upon request, we can perform data migration on behalf of users. Note that this is merely a convenience for the user; there is no technical benefit or improvement in speed.
- Six new OSD hosts were recently added as well, and the system is currently rebalancing (3.7 PiB raw, 25% used). One more OSD host remains to be added.