OSD Rebalancing
After receiving a batch of 8 new OSD hosts, we brought them online, configured them, and prepared to add them to our cluster. We began by adding a single host with storage for 12 OSDs. However, soon after adding the host, the cluster became sluggish and many OSD daemons began reporting “slow ops.”
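For reference, in an orchestrator-managed cluster this step looks roughly like the following; the host name and device path below are placeholders, not our actual values:

    ceph orch host add <new-host>                  # register the new host with the orchestrator
    ceph orch daemon add osd <new-host>:/dev/sdX   # create an OSD on one data device (repeat per device)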
It would seem that as soon as the first of the twelve new OSDs came online, CRUSH (with our failure domain defined at the “host” level) decided that this single OSD had to hold every PG designated for the entire host, a 12x oversubscription of that one OSD. With far too many PGs mapped to it, the OSD was overwhelmed and the cluster slowed down. This was evident in the new OSD showing over 600 PGs allocated to it, while healthy OSDs in our cluster carry around 180 PGs.
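One quick way to spot this kind of skew is the per-OSD PG count reported by the CLI; the PGS column makes an oversubscribed OSD stand out immediately:

    ceph osd df tree    # per-OSD utilization, including a PGS column with the PG count on each OSD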
To recover, we had to fail the newly added OSD and allow the system to rebalance. Due to the inadvertent oversubscription on the new OSD, this rebalancing process is still ongoing several weeks later (we estimate it will take 3-4 weeks in total!).
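A hedged sketch of that recovery step, assuming the OSD is removed from data placement by marking it out (the OSD id is a placeholder):

    ceph osd out <osd-id>   # take the oversubscribed OSD out of data placement
    ceph -s                 # watch recovery/backfill progress in the cluster status output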
To prevent this: when adding new hosts, first set the “norebalance” flag (ceph osd set norebalance) before the hosts are added. Once the new host(s) are added and their OSDs defined, the “norebalance” flag can be unset, triggering the expansion and rebalance onto all of the new OSDs at once. Once the current rebalance completes, we will test this approach by adding the remaining new hosts.
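As a sketch, the intended sequence for the next expansion looks like this (host and OSD creation as in the earlier example):

    ceph osd set norebalance     # pause rebalancing while the cluster topology changes
    # ... add the new host(s) and define their OSDs ...
    ceph osd unset norebalance   # resume; data now rebalances across all of the new OSDs at once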
Ceph Monitors will eat your SSD
The Ceph Monitor daemon maintains the primary copy of the cluster map, a critical and dynamic piece of the Ceph architecture. This data is stored locally on the host running the Monitor daemon, in a RocksDB database within /var/lib/ceph/, and sustains very frequent writes via fsync calls. In our initial configuration, /var/lib/ceph/ shared the root filesystem on a standard commodity SSD, and the write endurance of that hardware was exhausted quickly (within months) to the point of needing replacement.
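If your Monitors share commodity SSDs, the drives' wear indicators are worth watching; for example, with smartmontools (device paths are placeholders, and attribute names vary by vendor):

    smartctl -a /dev/sda     # SATA SSD: check wear/endurance attributes such as Media_Wearout_Indicator
    smartctl -a /dev/nvme0   # NVMe SSD: check the "Percentage Used" field in the SMART/Health section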
To better accommodate this write-heavy workload, we purchased additional systems with higher-endurance SSDs, created a dedicated filesystem on those drives, and mounted it at /var/lib/ceph so that the RocksDB database lives there. Once these new hosts were added to the Ceph cluster, we labeled them and updated the Ceph Orchestrator deployment strategy to place Monitor daemons only on the labeled hosts. Our hope is that this more durable hardware will last substantially longer, improving both performance and reliability.
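A minimal sketch of that placement change with the Ceph Orchestrator, assuming a host label such as mon-ssd (the label and host name are placeholders):

    ceph orch host label add <new-mon-host> mon-ssd   # tag each of the new Monitor hosts
    ceph orch apply mon --placement="label:mon-ssd"   # restrict Monitor daemons to the labeled hosts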