Amanda Randles, Ph.D., Associate Professor of Biomedical Engineering at Duke University, began working with Duke’s Office of Information Technology to develop a mobile application that lets users submit health data (heart rate and step count measurements) collected from wearable devices, such as smart watches and fitness trackers, for biomedical simulation research. A device like an Apple Watch typically measures a user’s heart rate about every 5 minutes (roughly 288 measurements per day), with a similar number of measurements for step count. If the mobile application reached 1,000 simultaneous users, the total could reach 576,000 measurements per day: a sizeable record count for any application. We knew we needed a system that would allow Dr. Randles and her team to store large amounts of data at a low, predictable cost, while leveraging available on-premises resources.
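The back-of-envelope arithmetic looks like this (a quick sketch; the 5-minute sampling interval is an approximation and varies by device):

```python
# Rough estimate of daily measurement volume (numbers from the paragraph above).
SAMPLES_PER_DAY = 24 * 60 // 5   # ~288 readings at one reading every 5 minutes
METRICS_PER_USER = 2             # heart rate + step count
USERS = 1_000                    # simultaneous users

daily_measurements = SAMPLES_PER_DAY * METRICS_PER_USER * USERS
print(f"{daily_measurements:,} measurements/day")  # 576,000 measurements/day
```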
Traditional relational database systems couple compute and storage: the user can scale vertically (adding compute and storage capacity to a single machine) but not horizontally (adding capacity by adding machines) without a significant amount of overhead. Enter: Apache Iceberg.
Apache Iceberg began at Netflix in 2017 and was donated to the Apache Software Foundation in 2018. It is an open table format for working with large analytic data sets. Data is typically stored on disk in Parquet files, a columnar file format that allows for better compression and more efficient encoding than a row-based format like CSV. To show how a file format like Parquet helped with our use case, let’s start with a brief overview of row-based and column-based formatting.
If you have ever opened a Microsoft Excel spreadsheet, you are familiar with row-based formatting. A simple visual example:
| User_ID | Heart_Rate | Device |
|---|---|---|
| 1 | 67 | oura.ring |
| 1 | 69 | oura.ring |
| 1 | 64 | oura.ring |
| 1 | 65 | oura.ring |
In this example, each row represents an individual heart rate reading from a wearable device. If we take the entire row into account, every row appears unique due to the fluctuating (read: non-repeating) heart rate measurements. This uniqueness has disadvantages when we want to compress the data to decrease storage utilization.
A simple visual example of a column-based file format:
| User_ID | 1 | 1 | 1 | 1 |
|---|---|---|---|---|
| Heart_Rate | 67 | 69 | 64 | 65 |
| Device | oura.ring | oura.ring | oura.ring | oura.ring |
In this example, each row contains the values of a single column from the original data set. One of the many advantages that column-based formats like Parquet provide is compression through run-length encoding (RLE). In the column-based example, the user_id and device values repeat. RLE replaces a run of repeated values with the value and a count of how many times it repeats, so the original run of values in the device column:
| oura.ring | oura.ring | oura.ring | oura.ring |
|---|---|---|---|
Becomes…
| 4 × oura.ring |
|---|
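To make this concrete, here is a minimal run-length encoder sketched in Python (illustrative only; Parquet’s actual implementation combines RLE with bit-packing on encoded streams and is more sophisticated):

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of repeated values into (count, value) pairs."""
    return [(sum(1 for _ in run), value) for value, run in groupby(values)]

device_column = ["oura.ring", "oura.ring", "oura.ring", "oura.ring"]
print(rle_encode(device_column))  # [(4, 'oura.ring')]
```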
In Dr. Randles’s data set, attributes like device descriptions and user IDs do not change frequently, so a column-based format delivered significant storage savings out of the box while avoiding the need for more complex data modeling and normalization.
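As a rough illustration (a sketch using synthetic data, not Dr. Randles’s actual measurements; the file names are arbitrary), you can compare the on-disk sizes of the same records written as CSV and as Parquet:

```python
import os
import random

import pandas as pd  # pandas with pyarrow installed handles Parquet I/O

# Synthetic data shaped like the example above: one user, one device,
# fluctuating heart rates.
df = pd.DataFrame({
    "user_id": [1] * 100_000,
    "heart_rate": [random.randint(55, 110) for _ in range(100_000)],
    "device": ["oura.ring"] * 100_000,
})

df.to_csv("readings.csv", index=False)
df.to_parquet("readings.parquet")  # columnar, compressed by default

print("CSV:    ", os.path.getsize("readings.csv"), "bytes")
print("Parquet:", os.path.getsize("readings.parquet"), "bytes")
```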
Aside from compression, column-based formats can also speed up computational analysis. In the column-based example, the values in the heart_rate row are stored next to each other on disk, in sequence. This means that when a researcher writes code to “calculate average heart rate”, the compute layer can read sequential blocks of data quickly. By contrast, with row-based formats the compute layer must read each row and then locate the column within that row that holds the data the program needs.
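With a columnar file, the compute layer can read just the one column it needs. A minimal pyarrow sketch (reusing the illustrative readings.parquet file from the previous example):

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Read only the heart_rate column; the other columns are never touched on disk.
table = pq.read_table("readings.parquet", columns=["heart_rate"])
avg = pc.mean(table.column("heart_rate"))
print(f"Average heart rate: {avg.as_py():.1f}")
```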
Using an open table format like Apache Iceberg within the Duke Data Attic has other advantages:
1. Storage and compute are decoupled, so a researcher can better predict data storage costs: cost scales linearly with data volume.
2. Data is kept on-premises.
3. Data can be accessed using compute resources already available to the researcher, like the Duke Computer Cluster (see the sketch after this list).
4. Iceberg’s support for ACID transactions ensures that inserts, updates, and deletes happen reliably, even in distributed systems, eliminating the risk of partial writes or inconsistent states.
5. Iceberg’s other features, like schema evolution and partitioning support, make it a good choice for analytics workflows that scale over time.
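For example, a researcher can read an Iceberg table directly from a Jupyter notebook with PyIceberg. A sketch (the catalog name, namespace, and table name here are placeholders, not the Data Attic’s actual configuration):

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import EqualTo

# Load a catalog configured in ~/.pyiceberg.yaml ("data_attic" is a placeholder).
catalog = load_catalog("data_attic")
table = catalog.load_table("research.heart_rate")

# Push the filter and column selection down to the underlying Parquet files.
df = (
    table.scan(
        row_filter=EqualTo("user_id", 1),
        selected_fields=("heart_rate", "device"),
    )
    .to_pandas()
)
print(df["heart_rate"].mean())
```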
Screenshot of the data files in Globus for the “heart_rate” Apache Iceberg table.

Screenshot of the Apache Iceberg table being queried in a Jupyter notebook.

