We are sharing a dataset of digital breast tomosynthesis (DBT) volumes for 5,060 patients. The dataset contains:
- images in the DICOM format
- a spreadsheet indicating an assignment of each case to one of four groups
- annotation boxes, and
- a spreadsheet that provides an additional organization of patients/studies/views.
WHERE IS THE DATA LOCATED?
This data can be found on The Cancer Imaging Archive at this location: https://doi.org/10.7937/e4wt-cd02
Digital Breast Tomosynthesis (DBT) is an advanced breast cancer screening technology approved by the FDA in 2011. DBT is often referred to as 3D mammography because it produces quasi-three-dimensional (3D) images of the breast. In DBT, an X-ray machine is rotated to capture images of the breast tissue from different angles. These images are reconstructed to produce thin sections of the breast, which show more detail than traditional 2D mammography. Specifically, higher out-of-plane resolution in DBT allows for better visualization of masses and architectural distortions (https://pubs.rsna.org/doi/pdf/10.1148/rg.2019180046). The benefits of DBT as a screening tool have been demonstrated in several prospective trials (https://pubs.rsna.org/doi/full/10.1148/radiol.2015141303) over the last decade (2011-2020).
Recent years have seen remarkable improvement in applications of artificial intelligence (AI) to medical imaging. However, AI applications to DBT are not yet prevalent in the scientific literature, given that DBT is a more recent technique and publicly available DBT datasets are rare. Public release of medical imaging datasets for AI is time-consuming and challenging. Medical imaging data, in its raw form, contains patient information. Therefore, the data must be de-identified before public release and must meet the compliance standards of the institution that releases it for research purposes. Additionally, radiologists’ annotations on images are typically necessary for algorithm development, which poses an additional challenge in creating large datasets. Finally, curation of data for public use is highly time-consuming and often involves coordinating large teams.
DETAILS OF THE IMAGING STUDIES AND THE DATA COLLECTION
Our DBT dataset consists of a collection of patients who had a DBT exam at the Duke Health system between Jan. 1, 2014 and Jan. 30, 2018. To prepare the cohort, we cross-matched the radiology reports, pathology reports, and DBT data from the Picture Archiving and Communication System (PACS) at Duke. Duke Medicine’s DEDUCE (Duke Enterprise Data Unified Content Explorer) system was queried to obtain all radiology reports containing the word ‘tomosynthesis’ and all pathology reports containing the word ‘breast’ within the period mentioned above. The medical record numbers (MRNs) and dates identified in the radiology reports were used to download the encoded and compressed DICOM files (proprietary format of the imaging equipment manufacturer Hologic, Bedford, Mass.) from Duke’s PACS using a computer program. A decoding algorithm was run on the compressed DICOM files to obtain uncompressed DICOM files.
Our image download from PACS resulted in an initial collection of 16,802 studies (a radiology breast exam with a DBT on a particular date with a unique StudyInstanceUID DICOM tag) from 13,954 patients (unique MRNs) with at least one of the craniocaudal (CC) and mediolateral oblique (MLO) views available for the left or right breast. Each of the left or right CC and MLO views will be referred to as a volume. Each individual two-dimensional slice in a volume will be referred to as an image. The dates of the studies in our initial collection lie between August 26, 2014 and January 29, 2018.
Two radiologists (18 years and 25 years of experience) at our institution annotated the studies. A subset of the available studies was selected for annotation: studies with a mass or architectural distortion that resulted in a biopsy with a benign or cancer finding. Each volume was annotated by one radiologist. During annotation, radiologists were provided with the corresponding radiology report and applicable pathology reports. Specifically, the radiologists drew a rectangular box enclosing a biopsied tumor in the central slice. All annotations were performed using in-house software.
HOW TO REFERENCE THIS DATA?
A detailed description of the data is provided in the following paper. Please cite this paper if you use this data:
M. Buda, A. Saha, R. Walsh, S. Ghate, N. Li, A. Święcicki, J. Y. Lo, M. A. Mazurowski, Detection of masses and architectural distortions in digital breast tomosynthesis: a publicly available dataset of 5,060 patients and a deep learning model. arXiv preprint arXiv:2011.07995 (https://arxiv.org/abs/2011.07995).
A machine learning challenge associated with this dataset is here: spie-aapm-nci-dair.westus2.cloudapp.azure.com/competitions/4
Additional code related to this dataset can be found here: https://github.com/MaciejMazurowski/duke-dbt-data
Additional discussion related to this dataset can be found here: https://www.reddit.com/r/DukeDBTData/
This dataset, shared by the Mazurowski lab, and this description were the effort of many researchers and clinicians. Among others, I would like to particularly acknowledge Ashirbani Saha, Mateusz Buda, Ruth Walsh, Sujata Ghate, Nianyi Li, Albert Święcicki, Joseph Lo, Jichen Yang, Nicholas Konz, and Longfei Zhou. I would like to thank the administration at Duke University, who reviewed this data and allowed the public sharing. I would also like to acknowledge the NIH for providing funding for our research – NIH: 1 R01 EB021360 (PI: Mazurowski).
FREQUENTLY ASKED QUESTIONS
Below are some important curated questions about this dataset, along with answers. For more discussion, visit https://www.reddit.com/r/DukeDBTData/
Question: Is there a code repository for reading images, drawing bounding boxes, and helper functions related to this database?
Answer: Yes. Please use the following link: https://github.com/MaciejMazurowski/duke-dbt-data
Question: Could you explain the image files/format present for each study?
Answer: Each DICOM file contains an entire 3D volume (one view). The files are stored in compressed DICOM format.
Question: Which software/tools can I use to read the images?
Answer: You may use a variety of software packages to read the images. We successfully opened the images with the following software: 3D Slicer, ITK-SNAP, RadiAnt, MicroDicom, MATLAB, and GDCM.
Question: Do I need to know the pre-processing steps, provided in the code repository, for the images?
Answer: Yes, it is important to look at the pre-processing steps we provided in the code repository. Please see the Python functions for reading image data from a DICOM file into a 3D array of pixel values in the proper orientation and for displaying “truth” boxes (if present). Please also see the readme file there for instructions. This is crucial because some of the image headers contain incorrect laterality or orientation. For these images, the reference standard “truth” boxes are provided with respect to the corrected image orientation.
Question: Are 4 views available for every study?
Answer: Though 4 views (2 per breast, craniocaudal and mediolateral oblique) are present for most of the studies, some exams have fewer than 4 views.
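To check which studies have fewer than 4 views, one option is to group rows by study and compare against the four standard views. The following is a minimal sketch, not code from the repository: the rows are made up, the column names "StudyUID" and "View" are assumed to match the spreadsheet, and the form of a numerical suffix (for example "rmlo2") is a guess.

```python
import re
from collections import defaultdict

# The four standard views: left/right craniocaudal and mediolateral oblique.
STANDARD_VIEWS = ("lcc", "lmlo", "rcc", "rmlo")


def base_view(view):
    """Lower-case a view name and drop trailing digits (e.g. 'RCC2' -> 'rcc')."""
    return re.sub(r"\d+$", "", view.strip().lower())


def missing_views(rows):
    """Map each StudyUID to the sorted list of standard views it lacks."""
    seen = defaultdict(set)
    for row in rows:
        seen[row["StudyUID"]].add(base_view(row["View"]))
    return {
        study: sorted(set(STANDARD_VIEWS) - views)
        for study, views in seen.items()
    }


# Made-up rows for illustration only; identifiers are hypothetical.
rows = [
    {"StudyUID": "S-0001", "View": "lcc"},
    {"StudyUID": "S-0001", "View": "lmlo"},
    {"StudyUID": "S-0001", "View": "rcc"},
    {"StudyUID": "S-0001", "View": "rmlo"},
    {"StudyUID": "S-0002", "View": "rcc"},
    {"StudyUID": "S-0002", "View": "rmlo2"},  # assumed suffix form
]

print(missing_views(rows))
# {'S-0001': [], 'S-0002': ['lcc', 'lmlo']}
```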
Question: What kind of encoding is used in the columns of the file ‘BCS-DBT labels-train-v2.csv’?
Answer: The columns “Cancer”, “Benign”, “Actionable”, and “Normal” represent a one-hot encoded assignment to a category: exactly one of them is 1 for each row. Details pertaining to these categories can be found in Section 2.1 of the preprint available at arXiv: https://arxiv.org/pdf/2011.07995.pdf.
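As an illustration of the one-hot encoding, the sketch below turns the four category columns back into a single label per row. It uses only the Python standard library. The sample rows are made up, and the identifier columns (PatientID, StudyUID, View) are assumed to be present in the labels file.

```python
import csv
import io

# The four one-hot category columns named in the answer above.
CATEGORIES = ("Normal", "Actionable", "Benign", "Cancer")


def decode_category(row):
    """Return the single category whose one-hot column equals 1."""
    active = [name for name in CATEGORIES if int(row[name]) == 1]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active category, got {active}")
    return active[0]


def decode_all(csv_text):
    """Decode every row of a labels CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [decode_category(row) for row in reader]


# Made-up rows for illustration only; not taken from the real file.
SAMPLE = """PatientID,StudyUID,View,Normal,Actionable,Benign,Cancer
P-0001,S-0001,lcc,1,0,0,0
P-0002,S-0002,rmlo,0,0,0,1
"""

print(decode_all(SAMPLE))
# ['Normal', 'Cancer']
```

For the real file, the same function could be applied to rows read with csv.DictReader from ‘BCS-DBT labels-train-v2.csv’.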
Question: How should I interpret the “Slice” column in the file ‘BCS-DBT boxes-train-v2.csv’?
Answer: The “Slice” column corresponds to the central slice of a biopsied lesion. More details on the image annotation are provided in the paper https://arxiv.org/pdf/2011.07995.pdf (Section 2.1.2). For evaluation, we assume that lesions span 25% of volume slices in each direction. This assumption is reflected in the evaluation functions available on GitHub: https://github.com/MaciejMazurowski/duke-dbt-data/blob/master/duke_dbt_data.py
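The 25% assumption can be turned into a slice range as in the sketch below. This is an illustration and not the repository's evaluation code: the function names are hypothetical, the slice index is assumed to be 0-based, and the rounding and clipping at the volume boundaries are choices made here for simplicity.

```python
import math


def lesion_slice_range(central_slice, volume_slices, fraction=0.25):
    """Return (first, last) slice indices assumed to contain the lesion.

    The lesion is taken to extend `fraction` of the volume's slices on
    each side of the central slice, clipped to the volume boundaries.
    """
    if volume_slices <= 0:
        raise ValueError("volume_slices must be positive")
    if not 0 <= central_slice < volume_slices:
        raise ValueError("central_slice must lie inside the volume")
    offset = fraction * volume_slices
    first = max(0, math.ceil(central_slice - offset))
    last = min(volume_slices - 1, math.floor(central_slice + offset))
    return first, last


def slice_matches(predicted_slice, central_slice, volume_slices, fraction=0.25):
    """True if a predicted slice falls within the assumed lesion extent."""
    return abs(predicted_slice - central_slice) <= fraction * volume_slices


print(lesion_slice_range(30, 60))  # (15, 45)
print(lesion_slice_range(5, 60))   # (0, 20)
```

Here the values 30 and 60 play the roles of the “Slice” and “VolumeSlices” columns of one row.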
Question: How should I interpret the columns of ‘BCS-DBT boxes-train-v2.csv’?
Answer: The columns are:
- PatientID: string – patient identifier
- StudyUID: string – study identifier
- View: string – view name, one of: RCC, LCC, RMLO, LMLO (you might see a numerical suffix after these if multiple images under one view are present)
- Subject: integer – identifies the radiologist who performed the annotation
- Slice: integer – the central slice of a biopsied lesion
- X: integer – X coordinate (on the horizontal axis) of the left edge of the bounding box in 0-based indexing (for the left-most column of the image x=0)
- Y: integer – Y coordinate (on the vertical axis) of the top edge of the bounding box in 0-based indexing (for the top-most row of the image y=0)
- Width: integer – bounding box width (along the horizontal axis)
- Height: integer – bounding box height (along the vertical axis)
- Class: string – either benign or cancer
- AD: integer – 1 if architectural distortion is present, else 0
- VolumeSlices: integer – the total number of slices in the volume containing the bounding box (used in the evaluation function)
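The column definitions above can be sketched in Python as follows. The Box class and its helper names are hypothetical and not part of the repository, and the sample row is made up. The sketch assumes that the box covers columns X to X + Width - 1 and rows Y to Y + Height - 1, and that a 2D image is indexed as image[row][column].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    slice_index: int
    x: int       # left edge, 0-based column
    y: int       # top edge, 0-based row
    width: int
    height: int

    @property
    def x_end(self):
        """Column just past the right edge (exclusive), by assumption."""
        return self.x + self.width

    @property
    def y_end(self):
        """Row just past the bottom edge (exclusive), by assumption."""
        return self.y + self.height

    @property
    def center(self):
        """Box center as (x, y)."""
        return (self.x + self.width / 2, self.y + self.height / 2)


def box_from_row(row):
    """Build a Box from one CSV row given as a dict of strings."""
    return Box(
        slice_index=int(row["Slice"]),
        x=int(row["X"]),
        y=int(row["Y"]),
        width=int(row["Width"]),
        height=int(row["Height"]),
    )


def crop(image, box):
    """Cut the box region out of a 2D image stored as a list of rows."""
    return [r[box.x:box.x_end] for r in image[box.y:box.y_end]]


# Made-up row for illustration only.
row = {"Slice": "30", "X": "100", "Y": "200", "Width": "50", "Height": "80"}
box = box_from_row(row)
print(box.x_end, box.y_end, box.center)
# 150 280 (125.0, 240.0)
```

The crop helper would be applied to the 2D image at index box.slice_index of a volume that has been read in the corrected orientation.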
Question: Why is the number of bounding boxes much smaller than the number of training samples?
Answer: The bounding boxes are applicable only to cases with biopsy-proven benign and cancer findings. The training set consists of cases with normal, actionable, benign, and cancer findings. For details on these categories, see Section 2.1 of the preprint available at arXiv: https://arxiv.org/pdf/2011.07995.pdf.
Question: Why can’t I find the path of a downloaded image in the csv file “Training set – Image paths for patients/studies/views (csv)”?
Answer: At times, you may need to replace “\” with “/” in the path of an image file to find the path in the csv file “Training set – Image paths for patients/studies/views (csv)”.
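A minimal sketch of this replacement in Python is shown below. The function name and the example paths are made up for illustration.

```python
def to_csv_style(path):
    """Replace backslash separators with forward slashes."""
    return path.replace("\\", "/")


# Hypothetical paths: one as it might appear in the csv file, and the same
# file as it might appear on a Windows machine after download.
paths_in_csv = {"collection/patient-0001/study-1/series-1/1-1.dcm"}
local_path = "collection\\patient-0001\\study-1\\series-1\\1-1.dcm"

print(local_path in paths_in_csv)                # False
print(to_csv_style(local_path) in paths_in_csv)  # True
```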
Question: Why are there both “descriptive_path” and “classic_path” in the .csv file “Training set – Image paths for patients/studies/views (.csv)”?
Answer: When you download our data using the NBIA Data Retriever, there are two options (“Descriptive Directory Name” and “Classic Directory Name”) for selecting the Directory Type, that correspond to those two paths in the .csv file.
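One way to use the two columns is to pick the one that matches the Directory Type chosen at download time and index the rows by it. The sketch below assumes that the file also has PatientID, StudyUID, and View columns. The sample paths and the function name are made up.

```python
import csv
import io

# Directory Type chosen in the NBIA Data Retriever -> matching csv column.
PATH_COLUMNS = {
    "descriptive": "descriptive_path",
    "classic": "classic_path",
}


def index_by_path(csv_text, directory_type):
    """Map each image path (for the chosen directory type) to its csv row."""
    column = PATH_COLUMNS[directory_type]
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column]: row for row in reader}


# Made-up rows for illustration only; real paths look different.
SAMPLE = """PatientID,StudyUID,View,descriptive_path,classic_path
P-0001,S-0001,lcc,root/P-0001/study-a/series-a/1-1.dcm,root/P-0001/1.2.3/4.5.6/1-1.dcm
P-0002,S-0002,rmlo,root/P-0002/study-b/series-b/1-1.dcm,root/P-0002/1.2.7/4.5.8/1-1.dcm
"""

index = index_by_path(SAMPLE, "classic")
row = index["root/P-0001/1.2.3/4.5.6/1-1.dcm"]
print(row["PatientID"], row["View"])
# P-0001 lcc
```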
Question: Does the dataset contain microcalcifications?
Answer: Yes, the dataset contains microcalcifications. However, they were not annotated and were not the cause for actionability or biopsy.
Question: Could you provide some examples of markers in the images?
Answer: Yes. Some examples are listed below. The images or their parts are taken from this collection under the CC BY-NC 4.0 licence (https://creativecommons.org/licenses/by-nc/4.0/).
1. Circle for a raised area on the skin such as a mole (image 13345.000000-24122 from this collection)
2. Line for a previous surgery (image 14338.000000-72252 from this collection)
3. Solid pellet for the nipple (image 8764.000000-63613 from this collection)
4. Markers for locations of pain (one or more spots, image 20566.000000-32081 from this collection)