/cwork Outage Summary Report

Outage

On Jul 31, 2024, multiple MDS daemons were brought online to serve the /cwork filesystem in an attempt to improve performance and possibly alleviate MDS deadlocks. Because this filesystem was originally created without a subvolume, we could not enable the desired “distributed ephemeral pinning” balancing method (which dynamically pins immediate subdirectories based on a consistent hash of each subdirectory’s inode number across the pool of active MDSs). Absent a specified balancing strategy, it is unclear how the MDSs were sharing load at this stage.

On Aug 6, 2024, with support and direction from 42on, a stopgap method was attempted to make use of multiple MDSs by fragmenting the root inode (/cwork/) so that its subdirectories (user directories, i.e. /cwork/<user>/) would be distributed across the MDS daemons.

What was not known at the time is that fragmentation of the root inode is not supported in CephFS, and causes metadata corruption when an MDS daemon is initialized. This happened (likely as part of a daemon failover) on Aug 14, 2024, and caused the filesystem to fail. This was later replicated by 42on in a lab environment and confirmed by a Ceph developer, and therefore is likely to be the definitive root cause of this corruption and outage.

Recovery

Over the nearly two weeks that followed, a recovery procedure was carried out, with some individual steps taking up to 43 hours to complete. This recovery essentially created a new filesystem (“cwork2”) that re-used the previous data pool while reconstructing a new metadata pool.
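
For reference, the upstream CephFS disaster-recovery documentation describes an “alternate metadata pool” procedure along these lines. The sketch below is illustrative only: the pool and filesystem names are hypothetical, exact flags vary by Ceph release, and this is not the verbatim sequence we ran with 42on.

ceph osd pool create cwork2_metadata
ceph fs new cwork2 cwork2_metadata cwork_data --recover --allow-dangerous-metadata-overlay
cephfs-data-scan init --force-init --filesystem cwork2
cephfs-data-scan scan_extents --filesystem cwork2 cwork_data   # each scan pass can take many hours
cephfs-data-scan scan_inodes --filesystem cwork2 cwork_data
cephfs-data-scan scan_links --filesystem cwork2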

On Aug 25, 2024, the recovery was complete and the filesystem mounted for internal validation. At this point we learned that we had lost all empty directories, user/group ownership, permissions and ACLs, and directory timestamps. File timestamps appeared intact, and file contents are presumed to be intact.

We then set basic file ownership across the filesystem to allow users access to their files again, and mounted the filesystem read-only on Globus for users to retrieve their data at /cwork_orig/.

A new filesystem (internally called “cwork3”) was created and configured with a subvolume, multi-MDS, and distributed ephemeral pinning enabled. Subdirectories were created for all users and mounted at /cwork across the DCC, Globus, and Bartesaghi cluster.
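
For illustration, distributed ephemeral pinning on a multi-MDS filesystem is typically enabled along the following lines; the MDS count and mount path here are examples, not necessarily our exact settings.

ceph fs set cwork3 max_mds 4                        # run multiple active MDS daemons (example count)
setfattr -n ceph.dir.pin.distributed -v 1 /cwork    # hash /cwork's immediate subdirectories (the per-user directories) across the active MDSs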

Current Status – Aug 29, 2024
  • The new-and-improved /cwork is available to users.
  • Users are able to retrieve their previous data and transfer it using Globus; the old read-only filesystem will be disconnected on Nov 1 and destroyed thereafter.
  • Upon request, we can perform data migration on behalf of users. Note that this is merely a convenience to the user; there is no technical benefit or improvement in speed.
  • 6 new OSD hosts were recently added as well, and the system is currently rebalancing (3.7PiB raw, 25% used); a quick way to watch this progress is shown below. 1 new OSD host still remains to be added.
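
A minimal way to watch the rebalance, using standard status commands (shown only as an example of what we monitor):

ceph -s     # overall health, including recovery/rebalance progress and misplaced object counts
ceph df     # raw and per-pool capacity and utilization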

Attempting to Safely Remove a Ceph Storage Node

Goal

Our objective was to remove an active Ceph node with 12 active Object Storage Daemons (OSDs) from our cluster safely, without degrading data redundancy and with minimal impact on the performance of other concurrent workloads. This is a documented procedure that was expected to take on the order of 1-2 weeks to migrate active data off of the target OSDs.

Process

We issued the ceph orch host rm command to remove a single host with 12 active OSDs, and the removal process began by draining 10 of the OSDs on the host. The cluster began reporting over 5 GiB/sec recovery speed and a number of “misplaced” Placement Groups (PGs). The count of misplaced PGs began decreasing, but after several days the recovery speed slowed toward zero and the misplaced PG count stopped decreasing. Upon further investigation, the data migration process appeared to be stuck. The workaround was to mark each affected OSD “down” (but not “out”); the system would then quickly bring the OSD back “up”, which restarted the data migration. We used ceph pg dump | grep recovering to identify affected OSDs, then issued ceph osd down (or restarted the OSD daemon) to get migration moving again. This manual “whack-a-mole” procedure continued in order to shepherd the migration along as OSDs repeatedly got stuck.
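
A condensed sketch of that manual loop (the OSD id is a placeholder):

ceph pg dump | grep recovering      # list PGs stuck in recovery and note the OSDs they map to
ceph osd down 123                   # mark a stuck OSD down (not out); it rejoins and backfill resumes
ceph orch daemon restart osd.123    # alternatively, restart the OSD daemon via the orchestrator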

We also made some configuration changes, starting with max_osd_draining_count, so that all 12 OSDs on the host being removed could drain simultaneously:

ceph config set mgr mgr/cephadm/max_osd_draining_count 12

With this change the system began draining all 12 OSDs; however, the OSDs still continued to get stuck during the drain process.
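
Two standard commands are useful here for confirming the setting and watching the drain (shown for illustration):

ceph config get mgr mgr/cephadm/max_osd_draining_count   # confirm the new value took effect
ceph orch osd rm status                                  # per-OSD drain/removal progress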

Additional configuration changes were made to hopefully improve the drain process:

ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 2
ceph config set osd osd_mclock_profile balanced

Eventually, 8 of the 12 OSDs were fully drained, with the last 4 remaining stuck regardless of being “whack-a-mole’d” by marking them “down” or restarting their associated OSD daemons. This was after 6 weeks of draining since the initial host-removal command was issued. At this point, we opted to forcibly remove the host and its 12 OSDs, accepting a partially degraded state for the remaining PGs stored on the final 4 draining OSDs. This was done with ceph orch host rm --force --offline. The cluster then entered a partially degraded state and rebuilt for less than 1 day, at which point it returned to a healthy status and the node was fully removed.

Lessons Learned

Before removing OSDs, it is recommended to set mgr/cephadm/max_osd_draining_count to the maximum number of OSDs that reside on any host that will be removed. This value defaults to 10 in the Reef release of Ceph, and needed to be increased to 12 in our case since each of our storage nodes has 12 OSDs.

Additional tuning of osd_max_backfills and osd_mclock_profile may also be advisable depending on the deployment (this requires osd_mclock_override_recovery_settings to be set to true). These settings apply to the mclock scheduler, which became the default in Ceph as of the Quincy release.

Future Testing
One hypothesis is that the draining issues we observed were caused by bugs in the “mclock” scheduler. In the future, we would like to test this by switching our cluster to the older Weighted Priority Queue (WPQ) scheduler instead of mclock and seeing whether the host/OSD removal process progresses more smoothly and consistently.
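
A hedged sketch of that experiment: the scheduler is selected by osd_op_queue, and changing it only takes effect after the OSD daemons are restarted.

ceph config get osd osd_op_queue        # confirm the current scheduler (mclock_scheduler by default)
ceph config set osd osd_op_queue wpq    # switch to the older WPQ scheduler
# then restart each OSD daemon, e.g. ceph orch daemon restart osd.<id>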

Ceph Outage on /cwork

As noted in the earlier section describing the process of adding new hardware to the cluster, we expanded the number of metadata servers (MDSs) supporting the /cwork filesystem. As part of that process we enabled directory fragmentation in order to better support large numbers of files in a directory. Unfortunately, we did so on the root directory of /cwork, as we had not created a subvolume for that mount point. We should not have been able to enable directory fragmentation on the root of /cwork, but a bug in Ceph allowed it. Fragmenting the root directory led to catastrophic corruption of the /cwork metadata pool. The /cwork directory will be unavailable until a full rebuild of all of the metadata is completed, which could take more than a week. The Ceph development team will be releasing a patch to prevent this issue, and we have informed our users. Fortunately this only affects the /cwork shared volume; the volume designed to store long-term researcher data was unaffected, as it is a separate storage volume.

Work is underway with many steps completed; an update will be provided when services are fully restored.

Monitoring Ceph with Prometheus and Grafana

Ceph leverages Prometheus for monitoring of all of the nodes and services. We feed data from the Prometheus collectors into our Grafana instance, which supports both dashboards and alerts. Below are screenshots of the data available.
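
Much of this plumbing is built into the Ceph manager, which exposes a Prometheus exporter that Grafana can consume. The commands below are shown for illustration; a cephadm deployment may configure these endpoints automatically.

ceph mgr module enable prometheus   # expose cluster metrics on the mgr's Prometheus endpoint
ceph mgr services                   # list active mgr service URLs, including the prometheus exporter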

Below is the main screen for alerts and slow OSDs (Object Storage Daemon) on our production cluster from earlier this year.

Ceph Monitor Grafana

This screenshot shows Ceph-specific reporting details for capacity, IOPS, throughput, and object-related data.

Ceph Grafana Dashboard #2

This screenshot shows Ceph-specific reporting around object counts and the status of PGs (Placement Groups).

Bill of Materials – Phase 2 Expansion

Phase 2 expansion of the Ceph cluster consisted of:

3 additional 1U servers (ThinkSystem SR630_v2) dedicated to various Ceph management tasks (ceph-mon, NFS gateway).

  • These servers were purchased with higher-endurance NVMe storage specifically for the monitors’ DB (see Ceph Monitors will eat your SSD).

8 additional 2U servers (ThinkSystem SR650_v2), each with 12x 18TB HDDs, to provide additional storage.

  • Each node will add roughly 120TB of usable capacity to the cluster (85% of usable after encoding).

Ceph BOM Storage Servers

Ceph BOM Management Servers

Storage Server

Part number Product Description Qty
7Z73CTO1WW Ceph Storage Server-5YR-18TB : ThinkSystem SR650 V2-3yr Warranty 1
BMJV ThinkSystem 2U 3.5″ Chassis with 8 or 12 Bays v2 1
BB3M Intel Xeon Gold 5315Y 8C 140W 3.2GHz Processor 2
B964 ThinkSystem 32GB TruDDR4 3200 MHz (2Rx4 1.2V) RDIMM 8
BJHJ ThinkSystem 4350-16i SAS/SATA 12Gb HBA 1
BCFH ThinkSystem 3.5″ 18TB 7.2K SATA 6Gb Hot Swap 512e HDD 12
BNEG ThinkSystem 2.5″ U.2 P5620 1.6TB Mixed Use NVMe PCIe 4.0 x4 HS SSD 2
B8LT ThinkSystem 2U 12×3.5″ SAS/SATA Backplane 1
BDY7 ThinkSystem 2U 4×2.5″ Middle NVMe Backplane 2
B5XH ThinkSystem M.2 SATA 2-Bay RAID Adapter 1
AUUV ThinkSystem M.2 128GB SATA 6Gbps Non-Hot Swap SSD 2
BE4T ThinkSystem Mellanox ConnectX-6 Lx 10/25GbE SFP28 2-Port OCP Ethernet Adapter 1
AUZX ThinkSystem Broadcom 5720 1GbE RJ45 2-Port PCIe Ethernet Adapter 1
BFH2 Lenovo 25Gb SR SFP28 Ethernet Transceiver 2
B8LQ ThinkSystem 2U PCIe Gen4 x16/x16 Slot 1&2 Riser 1 or 2 1
B8LJ ThinkSystem 2U PCIe Gen4 x16/x8/x8 Riser 1 or 2 1
BQ0W ThinkSystem V2 1100W (230Vac/115Vac) Platinum Hot Swap Power Supply 2

 

Management Server

Part number Product Description Qty
7Z71CTO1WW Ceph Management Server-5YR : ThinkSystem SR630 V2-3yr Warranty 1
BH9Q ThinkSystem 1U 2.5″ Chassis with 8 or 10 Bays 1
BB2W Intel Xeon Gold  6346 16C 205W 3.1GHz Processor 2
B965 ThinkSystem 32GB TruDDR4 3200 MHz (2Rx8 1.2V) RDIMM 8
5977 Select Storage devices – no configured RAID required 1
BGM1 ThinkSystem RAID 940-8i 4GB Flash PCIe Gen4 12Gb Adapter for U.3 1
BNF1 ThinkSystem 2.5″ U.3 7450 MAX 800GB Mixed Use NVMe PCIe 4.0 x4 HS SSD 2
BB3T ThinkSystem 1U 10×2.5″ AnyBay Backplane 1
B8P9 ThinkSystem M.2 NVMe 2-Bay RAID Adapter 1
BS2P ThinkSystem M.2 7450 PRO 480GB Read Intensive NVMe PCIe 4.0 x4 NHS SSD 2
B8N2 ThinkSystem 1U PCIe Gen4 x16/x16 Riser 1 1
B8NC ThinkSystem 1U LP+LP BF Riser Cage Riser 1 1
B8MV ThinkSystem 1U PCIe Gen4 x16 Riser 2 1
BE4T ThinkSystem Mellanox ConnectX-6 Lx 10/25GbE SFP28 2-Port OCP Ethernet Adapter 1
BFH2 Lenovo 25Gb SR SFP28 Ethernet Transceiver 2
BHTU ThinkSystem 750W 230V/115V Platinum Hot-Swap Gen2 Power Supply v2 2

Open Science Data Federation (OSDF) and FAST Research Storage

The Open Science Data Federation (OSDF) is both a method for sharing data and a way to provide the Open Science Grid (OSG) with tools to cache data close to where the compute is occurring.

FAST Research Storage is deploying OSDF’s Pelican software suite on a Kubernetes (k8s) cluster deployed in Duke’s High Performance Computing (HPC) network. The HPC network is in private address space, but Duke’s Science DMZ allows bypass of latency-inducing network security tools (both firewalls and Intrusion Prevention Systems (IPS)) using SDN bypass networks. See Duke’s Science DMZ and FAST Research Storage.

Rancher was used to deploy a k8s cluster on which Pelican was installed and configured. Access to FAST Research Storage via CephFS is facilitated by Ceph-CSI. Duke can both cache data for OSG users in the region and share Duke-sourced data with the larger research world.

Duke Data Attic – Built on FAST Research Storage

The Compute and Data Services Alliance for Research (CDSA) will be piloting a new data storage service, the Duke Data Attic, on the established FAST Research Storage Services this Fall. The Duke Data Attic is a low-cost, easy-to-use storage archive system for electronic data generated from research activities, intended to improve support and streamline access to data archiving services. It is an un-curated, flexible place to store and share infrequently used data or data that may be reused.

Highlights of this service include:
• Free entitlement – for all faculty and PhD students, with affordable options to add storage
• Rapid support response – OIT staff respond quickly to support users needing assistance
• Flexible and easy to use – web-based interface for easy data uploads and downloads, easily extendable for large/complex data transfers, while allowing external sharing with collaborators
• Submitting data is a breeze – easy self-service submission process for long-term storage with built-in reporting of centralized resources
• Easy migration process – to the Research Data Repository at the conclusion of research projects

The Duke Data Attic is directly supported by the FAST-Research Storage award (NSF OAC: 2232810, PI Charley Kneifel, $500,000), which developed a shared research storage system based on open-source software (Ceph) and commodity data center hardware to provide a flexible, high-performance, cost-effective, tiered data storage system. Initial start-up storage for the pilot is provided from this system.

Using this storage along with Globus, a researcher-focused data transfer tool from the University of Chicago, and object storage on Ceph, we expect to have:
• Easy transfers in and out of storage
• Extensibility for large / complex data transfers (CLI, creation of automated flows for data packaging using Suitcase.ctl, and sharing); a brief CLI sketch follows this list
• An opportunity to explore auto-tiering to cloud to further reduce costs for cold data
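
As an illustration of the CLI-driven transfers mentioned above, a Globus CLI transfer looks roughly like the following; the endpoint UUIDs and paths are placeholders, not real identifiers.

globus transfer <source-endpoint-uuid>:/path/to/dataset \
  <data-attic-endpoint-uuid>:/archive/dataset \
  --recursive --label "Data Attic upload"

The transfer runs asynchronously under Globus management, so large or complex transfers can be scripted and left to complete on their own.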

The CDSA is excited to launch the pilot of the Duke Data Attic in the Fall of 2024 with a select group of use-case studies.

The CDSA is a dynamic faculty-driven collaboration among the Office of Information Technology, the Duke University Libraries, and the Office for Research & Innovation. The CDSA supports the increasingly complex computing and data service needs of faculty, staff and students. For more information, please visit: Compute and Data Services Alliance

A video demonstration of the Duke Data Attic is available here: Duke Data Attic Demo

Project MISTRAL’s use of FAST Research Storage

The MISTRAL project utilizes FAST storage for flow and protocol analysis data from research lab network captures. This data is stored in FAST storage for archival purposes. The MISTRAL team developed a pipeline that collects raw network data from several sensors, separates the data into both raw and normalized files, and then uploads these files to FAST storage via rclone. The data is segmented into several buckets based on both data type and research lab. These buckets not only provide logical separation between the various datasets, but also allow more granular access control for the types of data stored.
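
A minimal sketch of the rclone portion of such a pipeline, assuming a hypothetical S3-compatible remote named “fast” and illustrative bucket and path names (not the team’s actual configuration):

# one-time remote definition pointing at the FAST S3 gateway (endpoint URL is a placeholder)
rclone config create fast s3 provider Ceph endpoint https://fast.example.edu access_key_id "$AK" secret_access_key "$SK"

# upload raw and normalized captures into per-lab, per-type buckets
rclone copy /data/raw/lab1        fast:mistral-lab1-raw
rclone copy /data/normalized/lab1 fast:mistral-lab1-normalized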

In addition to using FAST storage for data archives, the MISTRAL team also developed a method for retrieving and analyzing datasets in FAST storage using Python and the Boto3 library. This method allows researchers to download data directly from FAST storage into memory, both improving analysis performance and allowing data to be analyzed without saving a local copy. It was used by both the MISTRAL Code+ and Data+ teams to rapidly analyze and query the MISTRAL datasets stored in FAST storage.

MISTRAL – Massive Internal System Traffic Research Analysis and Logging – NSF Award #2232819

More information about MISTRAL can be found here: MISTRAL Project

Duke’s Science DMZ

Duke Science DMZ Overview

1. Background

ESnet, within the US Department of Energy (DoE), developed the Science DMZ architecture [1] to help facilitate the transfer of research data by minimizing unnecessary friction. Although the ESnet architecture [2] shows a parallel network in support of the Science DMZ within their documentation, a parallel network is not required, and Duke is currently transitioning from a more traditional architecture to a Software-Defined Science DMZ architecture.

2. Current Science DMZ Architecture at Duke

Duke has resilient 100G connections to the North Carolina Research and Education Network (NCREN) operated by MCNC. These connections provide connectivity to Internet2 as well as commodity ISPs. Moreover, these circuits are virtualized and support 802.1Q VLAN tags for connectivity to other research institutions via MCNC/NCREN and Internet2 AL2S. The internet-edge routers provide external connectivity to general resources at Duke via multiple tiers of security appliances including a perimeter firewall supporting Intrusion Prevention System (IPS) services and an interior firewall cluster as illustrated in Fig 2.1:


Figure 2.1 – Duke Single-Line Network Architecture

Although the multi-tier cybersecurity architecture provides flexibility, the appliances within the green and blue layers can introduce friction, latency, jitter and impose other constraints that can limit the transfer of large volumes of research data. To address this, Duke has deployed a traditional Science DMZ architecture as shown in Fig 2.2:


Figure 2.2 – Current Science DMZ Architecture at Duke

Within the current Duke Science DMZ architecture, two hypervisor hosts are directly attached to an edge router with Mellanox ConnectX-5 NICs supporting 40/100G links. Today, these two hypervisors are each directly attached to the edge with 40G transceivers and can transition to 100G with a change in optics if needed in the future. This approach places the hypervisor hosts outside of the perimeter IPS and interior firewalls. Moreover, these hypervisor hosts are directly attached to internal data center switches at 40G with direct access to storage located within the Duke Compute Cluster (DCC) Virtual Routing and Forwarding (VRF) instance. Today, the hypervisors host Globus VMs to facilitate efficient transfer of large research data sets with friction-free access to the Internet/Internet2 and storage within DCC. Other VMs can be provisioned on these hypervisors as needed to take advantage of the Science DMZ architecture.

In addition to the hypervisors, we also have a bare-metal Linux host attached directly to an edge router at 10G hosting a containerized version of PerfSONAR [3]. The 10G NIC on the host leverages PCIe SR-IOV virtual functions and presents the PerfSONAR container with a hardware-driven virtual NIC thereby eliminating any bottlenecks associated with software-based virtual NICs and virtual switches typically used within container hosts.
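
A minimal sketch of that SR-IOV approach, with hypothetical interface and container names (the actual host uses its own device names and container tooling):

# create virtual functions on the physical 10G NIC (PF name is a placeholder)
echo 2 > /sys/class/net/enp65s0f0/device/sriov_numvfs

# move one VF's network device into the running PerfSONAR container's network namespace
pid=$(docker inspect -f '{{.State.Pid}}' perfsonar)
ip link set enp65s0f0v0 netns "$pid"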

3. Next-Generation Science DMZ Architecture at Duke

In 2019, Duke developed an NSF grant proposal called “Archipelago” [4] that was awarded $1M in funding to develop and deploy a hybrid multi-node campus Software-Defined Network (SDN) architecture in support of agility and enhanced cybersecurity [5]. One aspect of Archipelago involved exploring state-of-the-art SDN switches as well as Smart Network Interface Cards (SmartNICs) [6]. As part of developing the next-generation production network architecture at Duke, we realized that some of the products evaluated in Archipelago could provide significant flexibility and cost savings in the production network shown in Fig 2.1. We use the salmon-colored “Rapid Response SDN” layer to block unwanted traffic, without introducing friction to permitted traffic, before it arrives at the perimeter IPS. Duke developed software to control this layer via a web interface (as shown in Fig. 2.3) as well as via a REST API that the IT Security Office (ITSO) makes heavy use of.


Figure 2.3 – Perimeter Rapid Response (Black Hole++) Block Interface

Referring to Fig. 2.1, within the green perimeter IPS and blue firewall functional blocks, we leverage dedicated SDN switches to provide horizontal scalability, bypass, and service chaining as shown in Fig. 2.4.


Figure 2.4 – IPS and Firewall Bypass and Service Chaining

Duke also developed software to program the SDN switches supporting the IPS functional block with bypass capabilities. The IPS bypass functionality permits ITSO to selectively bypass the IPS appliances as needed as shown in Fig. 2.4. When the request is submitted, the SDN policies are automatically deployed to bypass the IPS.

In the blue-colored firewall functional blocks shown in Fig. 2.1, we also leverage the architecture outlined in Fig. 2.4, but have deployed two pairs of active/standby interior firewalls. The SDN switches support delivering a subset of traffic to each pair of firewalls, allowing us to reduce the cost of the firewall layer and scale horizontally. Although we have not yet implemented SDN-driven firewall bypass capabilities, as we have done for the IPS, these features are on our development roadmap and will be a key part of the future Science DMZ architecture described in §3. We also leverage the SDN switches within the firewall functional block for service chaining, as shown in Fig. 2.5:


Figure 2.5 – SDN-Driven Service Chaining

As shown in Fig 2.5, we can steer different types of traffic through different tiers of appliances as needed. For example, for “outlands”, associated with student instructional computing, we can force traffic to flow through the perimeter IPS, even though it originates within Duke, before it then hits the interior firewall functional block. The service-chaining functionality is defined via policy rather than the physical cabling topologies used in legacy environments.

By combining SDN-driven bypass and service chaining, our future Science DMZ architecture at Duke will be virtualized, permitting Science DMZ applications to be hosted in a variety of locations without requiring special physical connectivity to a parallel network or direct connectivity to the edge router, as shown in Fig. 2.6:


Figure 2.6 – SDN-Driven Friction Bypass

The architecture shown in Fig. 2.6 will provide enhanced protection to Virtual Science DMZ applications via the Rapid Response SDN layer which will permit undesirable traffic to be blocked via API without introducing friction to desirable traffic. This is a significant advantage over our current architecture in Fig. 2.2 where static ACLs or host-based policies are required. Moreover, we are currently developing an SDN-driven monitoring fabric and horizontally scalable sensor array as a part of our NSF-funded MISTRAL [7] project at Duke. We envisage developing a closed-loop friction-free feedback system in support of dynamic control of research data flows with our SDN-driven Science DMZ.

4. References

[1] Dart, Eli, et al. “The Science DMZ: A network design pattern for data-intensive science.” Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 2013.

[2] ESnet Science DMZ Architecture: https://fasterdata.es.net/science-dmz/science-dmz-architecture/

[3] PerfSONAR: https://www.perfsonar.net/

[4] Archipelago: https://sites.duke.edu/archipelago (NSF OAC 1925550)

[5] Brockelsby, William, and Rudra Dutta. “Archipelago: A Hybrid Multi-Node Campus SDN Architecture.” 2023 26th Conference on Innovation in Clouds, Internet and Networks and Workshops (ICIN). IEEE, 2023.

[6] Brockelsby, William, and Rudra Dutta. “Performance implications of problem decomposition approaches for SDN pipelines.” GLOBECOM 2020-2020 IEEE Global Communications Conference. IEEE, 2020.

[7] MISTRAL: https://sites.duke.edu/mistral/ (NSF OAC 2319864)

Ceph Cluster Expansion

OSD Rebalancing

After receiving a batch of 8 new OSD hosts, we brought them online, configured them, and prepared to add them to our cluster. We began by adding a single host with storage for 12 OSDs. However, after adding the host, we noticed the cluster had become sluggish and many OSD daemons were reporting “slow ops.”

It appears that as soon as the first of the twelve new OSDs came online, and because our failure domain is defined at the “host” level, CRUSH decided that this single OSD must hold all PGs designated for that host in order to satisfy the host-level failure domain (a 12x oversubscription for that single OSD). Far too many PGs were then mapped to that OSD and the system became very sluggish. This was evident from the new OSD showing over 600 PGs allocated to it (healthy OSDs in our system hold around 180 PGs).

To recover, we had to fail the newly added OSD and allow the system to rebalance. Due to the inadvertent oversubscription on the new OSD, this rebalancing process is still ongoing several weeks later (we estimate it to take 3-4 weeks in total!).

To prevent this when adding a new host: set the “norebalance” flag (ceph osd set norebalance) BEFORE adding the new host(s). Once the new host(s) are added and the OSDs defined, the “norebalance” flag can be unset, triggering the expansion and rebalance onto all of the new OSDs at once. A condensed command sequence is sketched below. Once the current rebalance completes, we will test this procedure when adding the remaining pending hosts.
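
A condensed sketch of that sequence (the hostname is a placeholder):

ceph osd set norebalance            # pause PG rebalancing before any new OSDs appear
ceph orch host add ceph-osd-09      # add the new host; the orchestrator creates its OSDs
ceph osd tree | grep ceph-osd-09    # wait until all 12 new OSDs are up
ceph osd unset norebalance          # let data rebalance onto all of the new OSDs at once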

Ceph Monitors will eat your SSD

The Ceph Monitor daemon maintains the primary copy of the cluster map, a critical and dynamic piece of the Ceph architecture. This data is stored locally on the host running the Monitor daemon in a RocksDB database under /var/lib/ceph/, and it sustains very frequent writes via fsync calls. Our initial configuration had /var/lib/ceph/ sharing the root filesystem on a standard commodity SSD, and the write endurance of the backing hardware was quickly (within months) exhausted to the point of needing replacement.

To better accommodate this write-heavy workload, we purchased additional systems with higher-endurance SSDs, created a dedicated filesystem on them, and mounted it at /var/lib/ceph so that the RocksDB database is stored there. Once these new hosts were added to the Ceph cluster, we labeled them and updated the Ceph Orchestrator placement for the Monitor daemons so that monitors are only placed on these hosts. Our hope is that this more endurant hardware will last substantially longer, improving performance and reliability.
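
For illustration, constraining monitor placement to labeled hosts with cephadm looks roughly like the following; the label name and hostname are examples, not our exact values.

ceph orch host label add ceph-mon-01 mon-nvme      # label each new monitor host
ceph orch apply mon --placement="label:mon-nvme"   # deploy monitors only on hosts carrying the label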