Attempting to Safely Remove a Ceph Storage Node

Goal

Our objective was to safely remove an active Ceph node hosting 12 Object Storage Daemons (OSDs) from our cluster, without degrading data redundancy and with minimal impact on the performance of other concurrent workloads. This is a documented procedure that was expected to take on the order of 1-2 weeks to migrate the active data off of the target OSDs.

Process

We issued the ceph orch host rm command to remove a single host with 12 active OSDs. The removal process began by draining 10 of the OSDs on the host, and the cluster started reporting recovery speeds above 5 GiB/s along with a large number of “misplaced” Placement Groups (PGs). The count of misplaced PGs initially decreased, but after several days the recovery speed slowed toward zero and the misplaced PG count stopped dropping. On further investigation, the data migration appeared to be stuck. The workaround was to set each affected OSD to the “down” state (but not “out”); the system would quickly bring the OSD back “up”, which restarted the data migration. We used ceph pg dump | grep recovering to identify the affected OSDs, then issued ceph osd down (or restarted the OSD daemon) to get the migration moving again. This manual “whack-a-mole” procedure had to be repeated continually to shepherd the migration along, as individual OSDs kept getting stuck.
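As a rough sketch of the loop we ran (the OSD ID shown is a placeholder; in practice the stuck OSDs were read off the recovering PG list, and the daemon-restart form assumes a cephadm-managed cluster):

# list PGs still in a recovering/backfilling state and note which OSDs they map to
ceph pg dump | grep recovering

# mark a stuck OSD down; it is brought back up automatically, which restarts migration
ceph osd down 42

# alternatively, restart that OSD's daemon via the orchestrator
ceph orch daemon restart osd.42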

We also made some configuration changes, starting with max_osd_draining_count, so that all 12 OSDs on the host being removed could drain simultaneously:

ceph config set mgr mgr/cephadm/max_osd_draining_count 12

With this change the system began draining all 12 OSDs; however, the OSDs still continued to get stuck during the drain process.
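Drain progress can be monitored through the orchestrator (a general check, not a command the original run depended on):

# show which OSDs are currently being drained and how many PGs remain on each
ceph orch osd rm status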

Additional configuration changes were made to hopefully improve the drain process:

ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 2
ceph config set osd osd_mclock_profile balanced

Eventually, 8 of the 12 OSDs were fully drained; the last 4 remained stuck no matter how many times they were “whack-a-mole’d” by being set to “down” or having their associated OSD daemons restarted. By this point, 6 weeks had passed since the initial host removal command was issued. We opted to forcibly remove the host and its 12 OSDs using ceph orch host rm --force --offline, accepting a partially degraded state for the PGs that still resided on the final 4 draining OSDs. The cluster then rebuilt for less than a day, after which it returned to a healthy status and the node was fully removed.
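For reference, the forced removal and the subsequent health checks looked roughly like this (the hostname is a placeholder):

# forcibly remove the offline host and its remaining OSDs from the cluster
ceph orch host rm storage-node-07 --force --offline

# watch the cluster rebuild and return to HEALTH_OK
ceph -s
ceph health detail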

Lessons Learned

Before removing OSDs, it is recommended to

set mgr/cephadm/max_osd_draining_count

to the maximum number of OSDs that reside on a given host that will be removed. This value defaults to 10 in the Reef release of Ceph, and needed to be increased to 12 in our case as each of our storage nodes has 12 OSDs.
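Assuming a cephadm-managed cluster, the per-host OSD count and the current draining limit can be confirmed before issuing the removal:

# list OSDs grouped by host to confirm how many reside on the node being removed
ceph osd tree

# check the current draining limit (defaults to 10 on Reef)
ceph config get mgr mgr/cephadm/max_osd_draining_count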

Additional tuning of osd_max_backfills and osd_mclock_profile may also be advisable depending on the deployment (this requires osd_mclock_override_recovery_settings to be set to true). These options apply to the mclock scheduler, which became the default in Ceph as of the Quincy release.
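If the overrides are only needed for the drain, they can be cleared afterwards so the OSDs fall back to their defaults (a cleanup sketch; the procedure above does not cover this step):

# drop the temporary recovery overrides once the drain has completed
ceph config rm osd osd_max_backfills
ceph config rm osd osd_mclock_profile
ceph config rm osd osd_mclock_override_recovery_settings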

Future Testing
One hypothesis is that the draining issues we observed were caused by bugs in the “mclock” scheduler. In the future, we would like to test this by switching our cluster to the older Weighted Priority Queue (WPQ) scheduler instead of mclock, to see whether the host/OSD removal process progresses more smoothly and consistently.
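If we run that experiment, the switch would look something like the following (the osd_op_queue setting only takes effect after the OSDs are restarted; the daemon name is illustrative):

# select the WPQ scheduler instead of mclock
ceph config set osd osd_op_queue wpq

# restart each OSD for the new scheduler to take effect, e.g. via the orchestrator
ceph orch daemon restart osd.0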

