Initial Ceph Cluster Build Challenges

Our first Ceph cluster built on production hardware was a learning experience. We built the cluster in a largely “vanilla”, out-of-the-box fashion with very little tweaking or tuning. While the cluster functioned, we experienced stability problems that were not immediately obvious and only presented themselves when the system was under heavy, continuous load, such as during IO500 benchmarking.

IO scheduling

We saw notices such as these in our Ceph logs, indicating some object storage daemons (OSDs) were experiencing slow performance:

    health: HEALTH_WARN
        1 slow ops, oldest one blocked for 60 sec, osd.138 has slow ops (SLOW_OPS)
        One pg (4.e) in active+clean+scrubbing+deep status for 5 days
        "pg not deep-scrubbed in time"

These log entries reporting slow operations would persist indefinitely. Some PGs would remain in a scrubbing state for days and never complete. SMART reported all disks as healthy despite these messages, and overall performance was not grossly impacted, possibly a testament to the robustness and redundancy of Ceph.

This turned out to be related to the internal Ceph ‘mclock’ scheduling system performing internal benchmarks of the backend storage devices and obtaining far too optimistic results. Our spinning disks were reporting much higher than expected IOPS (triple or more), causing the mclock scheduler to overload the disks with I/O during routine scrub operations. To resolve this, we set the configuration option osd_mclock_skip_benchmark to disable the internal benchmark, and set osd_mclock_max_capacity_iops_hdd to 500, a reasonable IOPS figure for our Seagate Exos X18 disks. After these changes, the mclock scheduler scheduled an amount of I/O that the disks could actually handle, and the persistent SLOW_OPS messages disappeared from our logs.
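
For reference, the fix amounted to two OSD configuration settings, applied roughly along these lines (the IOPS value should of course be matched to your own drives):

    # Skip the internal OSD benchmark and cap HDD capacity at a realistic IOPS figure
    ceph config set osd osd_mclock_skip_benchmark true
    ceph config set osd osd_mclock_max_capacity_iops_hdd 500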

CephFS client lockups & Placement Groups

During IO500 benchmarking, some clients with CephFS mounted would cease all I/O to the CephFS filesystem, and /sys/kernel/debug/ceph/*/osdc would show numerous pending in-flight I/O operations. These operations would never complete, essentially locking up the client’s ability to access the CephFS filesystem, and preventing the benchmarks from completing.

By default, Ceph attempts to automatically size the placement groups (PGs) for each pool. In our experience, the autoscaler did not produce appropriately sized pools for our configuration, so we enabled the ‘noautoscale’ option and manually sized the PGs for each pool based on recommendations from consultants and online PG calculators. This fixed the CephFS client lockups and allowed more of our spinning disks to be utilized in parallel to serve data.
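
As a rough sketch, turning off the autoscaler and sizing a pool by hand looks something like the following (the pool name and PG count here are illustrative, not our actual values):

    # Disable PG autoscaling, then size pools manually
    ceph osd pool set noautoscale
    ceph osd pool set cephfs_data pg_autoscale_mode off
    ceph osd pool set cephfs_data pg_num 2048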

Disabling HDD write caching

Ceph attempts to ensure write-through cache behavior when writing data to its backing storage devices. Since most modern HDDs have onboard hardware caches, this actually degrades performance: by default the drive first writes the data into its cache, Ceph then issues a cache flush, the data is written out to the platters, and only then does the drive report back to Ceph that the write has been committed.

A better configuration is to explicitly instruct the HDD to disable its write cache, so that writes go straight to the platters and skip the onboard cache entirely. This resulted in an improvement of around 3-4x for certain small-block, random-write, write-through workloads.
To disable the onboard disk caches persistently, we employ a udev rule that calls the ‘hdparm’ utility to disable the write cache on every rotational disk in the servers hosting OSDs.
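
The rule looks roughly like this (the filename is our own choice and illustrative); it matches whole rotational block devices as they appear and runs hdparm against them:

    # /etc/udev/rules.d/99-hdd-write-cache.rules
    # Disable the onboard write cache (-W 0) on every rotational disk
    ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*[!0-9]", ATTR{queue/rotational}=="1", RUN+="/usr/sbin/hdparm -W 0 /dev/%k"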

Operating system and package repositories

Our initial Ceph builds were performed on Alma Linux 9, a clone of Red Hat Enterprise Linux, which worked well in our testing.
During development of this project, Red Hat announced changes to how it distributes source code, casting uncertainty over the future feasibility of Alma Linux and other RHEL clones. Since Ceph uses the ‘cephadm’ orchestrator and runs all of its processes in Podman containers, the choice of host OS is fairly flexible, so we decided to switch our Ceph servers to Ubuntu Server 22.04 LTS for the production build.

Some minimal but crucial changes were required in our system deployment Ansible playbooks to support the change of OS distribution, primarily around network configuration, package names, and Ceph package repositories. Additionally, scripts that use the ‘ceph’ command had to be changed to invoke “cephadm shell -- ceph” so they run inside a container, since the packages providing the ceph command are no longer installed on the base OS and are instead deployed only into containers by cephadm.
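
In practice this is a mechanical change; for example, a health check in a script goes from calling ceph directly to something like:

    # Old: ceph binary installed on the host
    ceph -s
    # New: run the same command inside the cephadm-managed container
    cephadm shell -- ceph -s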

With the switch to Ubuntu, we are now only using the ‘cephadm’ package from the Universe repository for bootstrapping and managing our cluster in Podman containers, instead of installing all Ceph packages on the OS itself.

CephFS client repository

Initially, the CentOS Storage SIG repository was used to install the Ceph packages onto client systems mounting the CephFS filesystem; however, there were version conflicts between packages in the base repository and the Storage SIG repository. This led us to instead use the official repositories from ceph.com for our client systems running CentOS Stream 8. These repositories do not have the aforementioned package conflicts and have provided excellent CephFS client functionality.

ARP cache tuning

When mounting the CephFS filesystem on over 1,300 nodes, the Ceph monitor node kernel panicked. This turned out to be caused by an overflow of the ARP table due to the low default limits configured in the OS.
This was resolved by increasing the net.ipv4.neigh.default.gc_thresh* sysctl parameters so that enough ARP entries can be cached to support the number of clients we expect to connect.
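
We persist these via a sysctl drop-in similar to the following; the exact thresholds shown here are illustrative and should be sized to the expected client count:

    # /etc/sysctl.d/90-arp-cache.conf
    net.ipv4.neigh.default.gc_thresh1 = 8192
    net.ipv4.neigh.default.gc_thresh2 = 32768
    net.ipv4.neigh.default.gc_thresh3 = 65536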

NVMe namespacing

Our architecture employs a hybrid OSD design, with 12 spinning HDDs and 2 NVMe drives per server. Each OSD consists of one HDD plus 1/6th of one NVMe. Our initial build used LVM to split the NVMe disks into logical volumes for use by each OSD. For the production rebuild, we are using NVMe namespaces instead, creating 6 namespaces per NVMe and allocating one namespace to each OSD for its BlueStore database. This makes better use of the I/O capability of the NVMe hardware by addressing specific areas of NAND directly through namespaces, rather than through the abstraction layer inherent in LVM partitioning.
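
Namespace creation is done with the nvme-cli utility; a sketch of creating and attaching one of the six namespaces looks something like this (the size and controller ID are placeholders, not our actual values):

    # Create a namespace of <SIZE_IN_BLOCKS> blocks and attach it to controller <CNTLID>
    nvme create-ns /dev/nvme0 --nsze=<SIZE_IN_BLOCKS> --ncap=<SIZE_IN_BLOCKS> --flbas=0
    nvme attach-ns /dev/nvme0 --namespace-id=1 --controllers=<CNTLID>
    nvme ns-rescan /dev/nvme0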

Building Object Storage Daemons (OSDs) with spec files

Initially, we used a basic shell script to add our storage devices and create OSDs. For the production deployment, we switched to an OSD spec .yaml definition of our storage devices and let Ceph build the OSDs automatically. The resulting allocation of block devices is the same, but the declarative definition allows automatic rebuilds when a failed disk is replaced (assuming the replacement matches the vendor/model definitions in our OSD spec).
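
Our spec is conceptually similar to the sketch below (the service ID and host pattern are illustrative); it tells the orchestrator to pair rotational data devices with NVMe DB devices on matching hosts:

    service_type: osd
    service_id: hdd-osds-with-nvme-db
    placement:
      host_pattern: 'ceph-osd-*'
    spec:
      data_devices:
        rotational: 1
      db_devices:
        rotational: 0

The spec is applied with ‘ceph orch apply -i <file>’, after which the orchestrator builds (and rebuilds) matching OSDs on its own.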

Prometheus monitoring

In order to integrate metrics into our external Prometheus instance, we federated the Prometheus instance deployed by Ceph with our standard external instance. This allows us to graph Ceph's metrics alongside our other monitoring systems.
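
The federation itself is a standard Prometheus scrape job against the /federate endpoint of the Ceph-deployed Prometheus; a minimal sketch on the external instance looks something like this (the target hostname, port, and match expression are assumptions to adapt):

    scrape_configs:
      - job_name: 'ceph-federation'
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job=~"ceph.*"}'
        static_configs:
          - targets: ['ceph-mgr-host.example.com:9095']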

