Ceph Outage on /cwork

As noted in the earlier section on adding new hardware to the cluster, we expanded the number of metadata servers supporting the /cwork file system. As part of that process we enabled directory fragmentation to better support directories containing large numbers of files. Unfortunately, we did so on the root file system for /cwork, as we had not created a subvolume for that mount point. Ceph should not have allowed directory fragmentation to be enabled on the root file system, but a bug made it possible. Fragmenting the root directory led to catastrophic corruption of the metadata pool for the /cwork volume. The /cwork directory will be unavailable until a full rebuild of all of the metadata is completed – which could take more than a week. The Ceph development team will be releasing a patch to prevent this issue, and we have informed our users. Fortunately this affects only the /cwork shared volume; the volume designed to store long-term researcher data was unaffected, as it is a separate storage volume.
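Going forward, mount points like this will be backed by a dedicated CephFS subvolume rather than the file-system root. Below is a minimal sketch of creating one by driving the Ceph CLI from Python; the volume and subvolume names are illustrative, not the exact names used on our cluster.

```python
import subprocess

def ceph(*args: str) -> str:
    """Run a ceph CLI command and return its stdout."""
    result = subprocess.run(["ceph", *args], check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()

# Illustrative names: a "cwork" file system with a "scratch" subvolume.
ceph("fs", "subvolume", "create", "cwork", "scratch")

# Clients should mount the subvolume path returned here, not "/".
path = ceph("fs", "subvolume", "getpath", "cwork", "scratch")
print(path)
```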

Work is underway and many steps have been completed; an update will be provided when services are fully restored.

Monitoring Ceph with Prometheus and Grafana

Ceph leverages Prometheus to monitor all of the nodes and services. We feed data from the Prometheus collectors into our Grafana instance, which supports both dashboards and alerting. Below are screenshots of the data available.

Below is the main screen for alerts and slow OSDs (Object Storage Daemons) on our production cluster from earlier this year.

Ceph Monitor Grafana

This screenshot shows Ceph-specific reporting details for capacity, IOPS, throughput, and object-related data.

Ceph Grafana Dashboard #2

This screenshot shows Ceph-specific reporting on object counts and the status of PGs (Placement Groups).
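The same metrics that drive these dashboards can also be queried directly from the Prometheus HTTP API. Below is a minimal sketch; the Prometheus URL is a placeholder, and ceph_health_status is the cluster health gauge exported by the ceph-mgr prometheus module (0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR).

```python
import requests

PROMETHEUS = "http://prometheus.example.edu:9090"  # placeholder URL

# Query the current cluster health gauge exported by ceph-mgr.
resp = requests.get(
    f"{PROMETHEUS}/api/v1/query",
    params={"query": "ceph_health_status"},
    timeout=10,
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    timestamp, value = result["value"]
    print(f"ceph_health_status = {value} at {timestamp}")
```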

Bill of Materials – Phase 2 Expansion

Phase 2 expansion of the Ceph cluster consisted of:

3 additional 1U servers (ThinkSystem SR630 V2) dedicated to various Ceph management tasks, including ceph-mon and the NFS gateway.

  • These servers were purchased with higher-endurance NVMe storage specifically for the monitors’ database (see Ceph Monitors will eat your SSD).

8 additional 2U servers (ThinkSystem SR650 V2), each with 12x 18TB HDDs, to provide additional storage.

  • Each node adds roughly 120TB of usable capacity to the cluster (about 85% of the capacity that remains after erasure coding); see the sketch below.
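As a rough illustration of where that figure comes from, the arithmetic below assumes a 6+3 erasure-coding profile; the profile is an assumption for the sake of the example, not a statement of the cluster’s actual configuration.

```python
# Rough per-node capacity arithmetic (illustrative erasure-coding profile).
drives_per_node = 12
drive_tb = 18
raw_tb = drives_per_node * drive_tb       # 216 TB raw per node

k, m = 6, 3                               # assumed EC profile (k data, m coding)
after_ec_tb = raw_tb * k / (k + m)        # ~144 TB after encoding overhead

usable_tb = after_ec_tb * 0.85            # ~85% of post-encoding capacity
print(f"~{usable_tb:.0f} TB usable per node")   # roughly 120 TB
```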

Ceph BOM Storage Servers

Ceph BOM Management Servers

Storage Server

Part number Product Description Qty
7Z73CTO1WW Ceph Storage Server-5YR-18TB : ThinkSystem SR650 V2-3yr Warranty 1
BMJV ThinkSystem 2U 3.5″ Chassis with 8 or 12 Bays v2 1
BB3M Intel Xeon Gold 5315Y 8C 140W 3.2GHz Processor 2
B964 ThinkSystem 32GB TruDDR4 3200 MHz (2Rx4 1.2V) RDIMM 8
BJHJ ThinkSystem 4350-16i SAS/SATA 12Gb HBA 1
BCFH ThinkSystem 3.5″ 18TB 7.2K SATA 6Gb Hot Swap 512e HDD 12
BNEG ThinkSystem 2.5″ U.2 P5620 1.6TB Mixed Use NVMe PCIe 4.0 x4 HS SSD 2
B8LT ThinkSystem 2U 12×3.5″ SAS/SATA Backplane 1
BDY7 ThinkSystem 2U 4×2.5″ Middle NVMe Backplane 2
B5XH ThinkSystem M.2 SATA 2-Bay RAID Adapter 1
AUUV ThinkSystem M.2 128GB SATA 6Gbps Non-Hot Swap SSD 2
BE4T ThinkSystem Mellanox ConnectX-6 Lx 10/25GbE SFP28 2-Port OCP Ethernet Adapter 1
AUZX ThinkSystem Broadcom 5720 1GbE RJ45 2-Port PCIe Ethernet Adapter 1
BFH2 Lenovo 25Gb SR SFP28 Ethernet Transceiver 2
B8LQ ThinkSystem 2U PCIe Gen4 x16/x16 Slot 1&2 Riser 1 or 2 1
B8LJ ThinkSystem 2U PCIe Gen4 x16/x8/x8 Riser 1 or 2 1
BQ0W ThinkSystem V2 1100W (230Vac/115Vac) Platinum Hot Swap Power Supply 2

 

Management Server

Part number Product Description Qty
7Z71CTO1WW Ceph Management Server-5YR : ThinkSystem SR630 V2-3yr Warranty 1
BH9Q ThinkSystem 1U 2.5″ Chassis with 8 or 10 Bays 1
BB2W Intel Xeon Gold 6346 16C 205W 3.1GHz Processor 2
B965 ThinkSystem 32GB TruDDR4 3200 MHz (2Rx8 1.2V) RDIMM 8
5977 Select Storage devices – no configured RAID required 1
BGM1 ThinkSystem RAID 940-8i 4GB Flash PCIe Gen4 12Gb Adapter for U.3 1
BNF1 ThinkSystem 2.5″ U.3 7450 MAX 800GB Mixed Use NVMe PCIe 4.0 x4 HS SSD 2
BB3T ThinkSystem 1U 10×2.5″ AnyBay Backplane 1
B8P9 ThinkSystem M.2 NVMe 2-Bay RAID Adapter 1
BS2P ThinkSystem M.2 7450 PRO 480GB Read Intensive NVMe PCIe 4.0 x4 NHS SSD 2
B8N2 ThinkSystem 1U PCIe Gen4 x16/x16 Riser 1 1
B8NC ThinkSystem 1U LP+LP BF Riser Cage Riser 1 1
B8MV ThinkSystem 1U PCIe Gen4 x16 Riser 2 1
BE4T ThinkSystem Mellanox ConnectX-6 Lx 10/25GbE SFP28 2-Port OCP Ethernet Adapter 1
BFH2 Lenovo 25Gb SR SFP28 Ethernet Transceiver 2
BHTU ThinkSystem 750W 230V/115V Platinum Hot-Swap Gen2 Power Supply v2 2

Open Science Data Federation (OSDF) and FAST Research Storage

The Open Science Data Federation (OSDF) is both a method for sharing data and a way to provide the Open Science Grid (OSG) with tools to cache data close to where the compute is occurring.

FAST Research Storage is deploying OSDF’s Pelican software suite on a Kubernetes (k8s) cluster running in Duke’s High Performance Computing (HPC) network. The HPC network is in private address space, but Duke’s Science DMZ allows latency-inducing network security tools (both firewalls and Intrusion Prevention Systems (IPS)) to be bypassed using SDN bypass networks. See Duke’s Science DMZ and FAST Research Storage.

Rancher was used to deploy the k8s cluster on which Pelican was installed and configured. Access to FAST Research Storage via CephFS is facilitated by Ceph-CSI. Duke can both cache data for OSG users in the region and share Duke-sourced data with the larger research world.

Duke Data Attic – Built on FAST Research Storage

The Compute and Data Services Alliance for Research (CDSA) will be piloting a new data storage service, the Duke Data Attic, on the established FAST Research Storage services this Fall. The Duke Data Attic is a low-cost, easy-to-use storage archive system for electronic data generated by research activities, intended to improve support and streamline access to data archiving services. It is an uncurated, flexible place to store and share infrequently used data or data that may be reused.

Highlights of this service include:
• Free entitlement – for all faculty and PhD students, with affordable options to add storage
• Rapid support response – OIT staff are available quickly to support users needing assistance
• Flexible and easy to use – a web-based interface for easy data uploads and downloads, easily extendable for large/complex data transfers, while allowing external sharing with collaborators
• Submitting data is a breeze – an easy self-service submission process for long-term storage with built-in reporting of centralized resources
• Easy migration process – to the Research Data Repository at the conclusion of research projects

The Duke Data Attic is directly supported by the FAST Research Storage award (NSF OAC 2232810, PI Charley Kneifel, $500,000), which developed a shared research storage system based on open-source software (Ceph) and commodity data center hardware to provide a flexible, high-performance, cost-effective, tiered data storage system. Initial start-up capacity for the pilot is provided from this storage.

Using this storage along with Globus, a researcher-focused data transfer tool from the University of Chicago, and object storage on Ceph, we expect to have:
• Easy transfers in and out of storage
• Extensibility for large/complex data transfers (CLI access, creation of automated flows for data packaging using Suitcase.ctl, and sharing)
• An opportunity to explore auto-tiering to the cloud to further reduce costs for cold data

The CDSA is excited to launch the pilot of the Duke Data Attic in the Fall of 2024 with a select group of use-case studies.

The CDSA is a dynamic faculty-driven collaboration among the Office of Information Technology, the Duke University Libraries, and the Office for Research & Innovation. The CDSA supports the increasingly complex computing and data service needs of faculty, staff and students. For more information, please visit: Compute and Data Services Alliance

A video demonstration of the Duke Data Attic is available here: Duke Data Attic Demo

Project MISTRAL’s use of FAST Research Storage

The MISTRAL project utilizes FAST storage to archive flow and protocol analysis data from research lab network captures. The MISTRAL team developed a pipeline which collects raw network data from several sensors, separates the data into raw and normalized files, and then uploads these files to FAST storage via rclone. The data is segmented into several buckets based on both data type and research lab. These buckets not only provide logical separation between the various datasets, but also provide more granular access control over the types of data stored.
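The production pipeline uploads with rclone; purely to illustrate the bucket segmentation, here is a hedged boto3 equivalent in which the endpoint, credentials, bucket naming scheme, and object key are all hypothetical.

```python
import boto3

# Hypothetical endpoint and credentials; the real pipeline uploads via rclone.
s3 = boto3.client(
    "s3",
    endpoint_url="https://fast-storage.example.edu",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

lab = "lab-a"              # research lab that owns the capture
data_type = "normalized"   # "raw" or "normalized"
bucket = f"mistral-{lab}-{data_type}"   # hypothetical per-lab, per-type bucket

# Upload one file into the bucket for this lab and data type.
with open("capture-2024-06-01.csv", "rb") as fh:
    s3.upload_fileobj(fh, bucket, "2024/06/01/capture-2024-06-01.csv")
```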

In addition to using FAST storage for data archives, the MISTRAL team also developed a method for retrieving and analyzing datasets in FAST storage using Python and the Boto3 library. This method allows researchers to download data from FAST storage directly into memory, both improving analysis performance and allowing data to be analyzed without saving a local copy. It was used by both the MISTRAL Code+ and Data+ teams to rapidly analyze and query the MISTRAL datasets stored in FAST storage.
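A minimal sketch of that in-memory retrieval pattern with boto3 is shown below; the endpoint, bucket, and object key are placeholders, and pandas simply stands in for whatever analysis follows.

```python
import io

import boto3
import pandas as pd

s3 = boto3.client(
    "s3",
    endpoint_url="https://fast-storage.example.edu",  # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Fetch the object straight into memory -- no local copy is written to disk.
obj = s3.get_object(Bucket="mistral-lab-a-normalized",
                    Key="2024/06/01/capture-2024-06-01.csv")
buffer = io.BytesIO(obj["Body"].read())

# Analyze directly from the in-memory buffer.
df = pd.read_csv(buffer)
print(df.describe())
```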

MISTRAL – Massive Internal System Traffic Research Analysis and Logging – NSF Award #2232819

More information about MISTRAL can be found here: MISTRAL Project

Duke’s Science DMZ

Duke Science DMZ Overview

1. Background

ESnet, within the US Department of Energy (DOE), developed the Science DMZ architecture [1] to facilitate the transfer of research data by minimizing unnecessary friction. Although the ESnet documentation [2] shows a parallel network in support of the Science DMZ, that is not required, and Duke is currently transitioning from a more traditional architecture to a Software-Defined Science DMZ architecture.

2. Current Science DMZ Architecture at Duke

Duke has resilient 100G connections to the North Carolina Research and Education Network (NCREN) operated by MCNC. These connections provide connectivity to Internet2 as well as commodity ISPs. Moreover, these circuits are virtualized and support 802.1Q VLAN tags for connectivity to other research institutions via MCNC/NCREN and Internet2 AL2S. The internet-edge routers provide external connectivity to general resources at Duke via multiple tiers of security appliances including a perimeter firewall supporting Intrusion Prevention System (IPS) services and an interior firewall cluster as illustrated in Fig 2.1:


Figure 2.1 – Duke Single-Line Network Architecture

Although the multi-tier cybersecurity architecture provides flexibility, the appliances within the green and blue layers can introduce friction, latency, and jitter, and impose other constraints that can limit the transfer of large volumes of research data. To address this, Duke has deployed a traditional Science DMZ architecture as shown in Fig 2.2:


Figure 2.2 – Current Science DMZ Architecture at Duke

Within the current Duke Science DMZ architecture, two hypervisor hosts are directly attached to an edge router with Mellanox ConnectX-5 NICs supporting 40/100G links. Today, these two hypervisors are each directly attached to the edge with 40G transceivers and can transition to 100G with a change in optics if needed in the future. This approach places the hypervisor hosts outside of the perimeter IPS and interior firewalls. Moreover, these hypervisor hosts are directly attached to internal data center switches at 40G with direct access to storage located within the Duke Compute Cluster (DCC) Virtual Routing and Forwarding (VRF) instance. Today, the hypervisors host Globus VMs to facilitate efficient transfer of large research data sets with friction-free access to the Internet/Internet2 and storage within DCC. Other VMs can be provisioned on these hypervisors as needed to take advantage of the Science DMZ architecture.

In addition to the hypervisors, we also have a bare-metal Linux host attached directly to an edge router at 10G hosting a containerized version of PerfSONAR [3]. The 10G NIC on the host leverages PCIe SR-IOV virtual functions and presents the PerfSONAR container with a hardware-driven virtual NIC thereby eliminating any bottlenecks associated with software-based virtual NICs and virtual switches typically used within container hosts.

3. Next-Generation Science DMZ Architecture at Duke

In 2019, Duke developed an NSF grant proposal called “Archipelago” [4] that was awarded $1M in funding to develop and deploy a hybrid multi-node campus Software-Defined Network (SDN) architecture in support of agility and enhanced cybersecurity [5]. One aspect of Archipelago involved exploring state-of-the-art SDN switches as well as Smart Network Interface Cards (SmartNICs) [6]. As part of developing the next-generation production network architecture at Duke, we realized that some of the products evaluated in Archipelago could provide significant flexibility and cost savings in the production network shown in Fig 2.1. We use the salmon-colored “Rapid Response SDN” layer to block unwanted traffic, without introducing friction to permitted traffic, before it arrives at the perimeter IPS. Duke developed software to control this layer via a web interface (as shown in Fig. 2.3) as well as via a REST API that the IT Security Office (ITSO) makes heavy use of.


Figure 2.3 – Perimeter Rapid Response (Black Hole++) Block Interface
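Purely as an illustration of what ITSO automation against that REST API might look like, here is a hedged sketch; the endpoint, authentication scheme, and payload fields are hypothetical, as the real Rapid Response API is internal to Duke.

```python
import requests

# Hypothetical endpoint and token; the real Rapid Response API is internal.
RAPID_RESPONSE_API = "https://rapid-response.example.duke.edu/api/v1/blocks"
API_TOKEN = "REPLACE_ME"

# Ask the Rapid Response SDN layer to drop traffic from one source address.
resp = requests.post(
    RAPID_RESPONSE_API,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "address": "203.0.113.7",       # unwanted source to block at the edge
        "reason": "scanning activity",
        "expires_in_hours": 24,
    },
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```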

Referring to Fig. 2.1, within the green perimeter IPS and blue firewall functional blocks, we leverage dedicated SDN switches to provide horizontal scalability, bypass, and service chaining as shown in Fig. 2.4.


Figure 2.4 – IPS and Firewall Bypass and Service Chaining

Duke also developed software to program the SDN switches supporting the IPS functional block with bypass capabilities. The IPS bypass functionality permits ITSO to selectively bypass the IPS appliances as needed as shown in Fig. 2.4. When the request is submitted, the SDN policies are automatically deployed to bypass the IPS.

In the blue-colored firewall functional blocks shown in Fig. 2.1, we also leverage the architecture outlined in Fig. 2.4, but have deployed two pairs of active/standby interior firewalls. The SDN switches support delivering a subset of traffic to each pair of firewalls, allowing us to reduce the cost of the firewall layer and scale horizontally. Although we have not yet implemented SDN-driven firewall bypass capabilities, as we have done with the IPS, these features are on our development roadmap and will be a key part of the future Science DMZ architecture described later in this section. We also leverage the SDN switches within the firewall functional block for service chaining as shown in Fig. 2.5:


Figure 2.5 – SDN-Driven Service Chaining

As shown in Fig 2.5, we can steer different types of traffic through different tiers of appliances as needed. As an example, traffic for “outlands”, associated with student instructional computing, can be forced to flow through the perimeter IPS, even though it originates within Duke, before it hits the interior firewall functional block. The service-chaining functionality is defined via policy rather than the physical cabling topologies used in legacy environments.

By combining SDN-driven bypass and service chaining, our future Science DMZ architecture at Duke will be virtualized, permitting Science DMZ applications to be hosted in a variety of locations without requiring special physical connectivity to a parallel network or direct connectivity to the edge router, as shown in Fig. 2.6:


Figure 2.6 – SDN-Driven Friction Bypass

The architecture shown in Fig. 2.6 will provide enhanced protection to Virtual Science DMZ applications via the Rapid Response SDN layer which will permit undesirable traffic to be blocked via API without introducing friction to desirable traffic. This is a significant advantage over our current architecture in Fig. 2.2 where static ACLs or host-based policies are required. Moreover, we are currently developing an SDN-driven monitoring fabric and horizontally scalable sensor array as a part of our NSF-funded MISTRAL [7] project at Duke. We envisage developing a closed-loop friction-free feedback system in support of dynamic control of research data flows with our SDN-driven Science DMZ.

4. References

[1] Dart, Eli, et al. “The Science DMZ: A Network Design Pattern for Data-Intensive Science.” Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. 2013.

[2] ESnet Science DMZ Architecture: https://fasterdata.es.net/science-dmz/science-dmz-architecture/

[3] PerfSONAR: https://www.perfsonar.net/

[4] Archipelago: https://sites.duke.edu/archipelago (NSF OAC 1925550)

[5] Brockelsby, William, and Rudra Dutta. “Archipelago: A Hybrid Multi-Node Campus SDN Architecture.” 2023 26th Conference on Innovation in Clouds, Internet and Networks and Workshops (ICIN). IEEE, 2023.

[6] Brockelsby, William, and Rudra Dutta. “Performance Implications of Problem Decomposition Approaches for SDN Pipelines.” GLOBECOM 2020 – 2020 IEEE Global Communications Conference. IEEE, 2020.

[7] MISTRAL: https://sites.duke.edu/mistral/ (NSF OAC 2319864)

Ceph Cluster Expansion

OSD Rebalancing

After receiving a batch of 8 new OSD hosts, we brought them online, configured them, and prepared to add them to our cluster. We began by adding a single host with storage for 12 OSDs. However, after adding the host, we noticed our cluster became sluggish, and many OSD daemons were reporting “slow ops.”

It would seem that as soon as the first of the twelve new OSDs came online, and because our failure domain is defined at the “host” level, the placement algorithm decided that the single OSD on that host must hold all of the PGs designated for that host (a 12x oversubscription of that single OSD). Too many PGs were therefore mapped to that OSD, and the system became very sluggish. This was evident from the new OSD showing over 600 PGs allocated to it (healthy OSDs in our system have around 180 PGs).

To recover, we had to fail the newly added OSD and allow the system to rebalance. Due to the inadvertent oversubscription on the new OSD, this rebalancing process is still ongoing several weeks later (we estimate it to take 3-4 weeks in total!).

To prevent this: when adding new hosts, set the “norebalance” flag (ceph osd set norebalance) BEFORE adding them. Once the new host(s) are added and their OSDs defined, the “norebalance” flag can be unset, triggering the expansion and rebalance onto all of the new OSDs at once. Once the current rebalance completes, we will test this procedure by adding the remaining new hosts.
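A minimal sketch of that ordering, driving the Ceph CLI from Python, is shown below; the host names are placeholders, and the ceph orch host add step assumes a cephadm-managed cluster.

```python
import subprocess

def ceph(*args: str) -> None:
    """Run a ceph CLI command and raise if it fails."""
    subprocess.run(["ceph", *args], check=True)

new_hosts = ["osd-host-09", "osd-host-10"]  # placeholder host names

# 1. Pause rebalancing before the cluster sees any new OSDs.
ceph("osd", "set", "norebalance")

# 2. Add the new hosts (cephadm creates their OSDs per the service spec).
for host in new_hosts:
    ceph("orch", "host", "add", host)

# 3. Once all new OSDs are up and in, clear the flag so data rebalances
#    onto every new OSD at the same time.
ceph("osd", "unset", "norebalance")
```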

Ceph Monitors will eat your SSD

The Ceph Monitor daemon maintains the primary copy of the cluster map, a critical and dynamic piece of the Ceph architecture. This data is stored locally on the host running the Monitor daemon, in a RocksDB database within /var/lib/ceph/, and sustains very frequent writes via fsync calls. Our initial configuration had /var/lib/ceph/ sharing the root filesystem on a standard commodity SSD, and the write endurance of the backing hardware was quickly (within months) exhausted to the point of needing replacement.

To better accommodate this write-heavy workload, we purchased additional systems with higher-endurance SSDs, created a dedicated filesystem on those drives, and mounted it at /var/lib/ceph so that the RocksDB database is stored there. Once these new hosts were added to the Ceph cluster, we labeled them and updated the Ceph Orchestrator deployment strategy for the Monitor daemons so that they are placed only on these new hosts. Our hope is that this higher-endurance hardware will last substantially longer, improving performance and reliability.
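A minimal sketch of that label-based placement, assuming a cephadm-managed cluster, is shown below; the label and host names are illustrative.

```python
import subprocess

def ceph(*args: str) -> None:
    """Run a ceph CLI command and raise if it fails."""
    subprocess.run(["ceph", *args], check=True)

# Illustrative names for the new, higher-endurance monitor hosts.
mon_hosts = ["mon-host-01", "mon-host-02", "mon-host-03"]

# Tag each new host with a label the orchestrator can match on.
for host in mon_hosts:
    ceph("orch", "host", "label", "add", host, "mon")

# Redeploy the Monitor daemons so they run only on hosts carrying the label.
ceph("orch", "apply", "mon", "--placement=label:mon")
```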