Breast Cancer MRI Dataset



Breast MRI is a common imaging modality for assessing the extent of disease in breast cancer patients. Recent studies show that MRI has potential for prognosis of patients’ short- and long-term outcomes as well as for predicting pathological and genomic features of tumors. However, large, well-annotated datasets are needed to make further progress in the field. We share such a dataset here.

In terms of design, the dataset is a single-institutional, retrospective collection of 922 patients with biopsy-confirmed invasive breast cancer, collected over more than a decade, with the following data components:

  1. Demographic, clinical, pathology, treatment, outcomes, and genomic data: Collected from a variety of sources, including clinical notes, radiology reports, and pathology reports. These data have served as a source for multiple published papers on radiogenomics, outcomes prediction, and other areas.
  2. Pre-operative dynamic contrast-enhanced (DCE)-MRI: Downloaded from PACS systems and de-identified for The Cancer Imaging Archive (TCIA) release. These are axial breast MRI images acquired on 1.5T or 3T scanners with the patient in the prone position. The following MRI sequences are shared in DICOM format: a non-fat-saturated T1-weighted sequence, a fat-saturated gradient echo T1-weighted pre-contrast sequence, and, in most cases, three to four post-contrast sequences.
  3. Locations of lesions in DCE-MRI: Annotations on the DCE-MRI images by radiologists.
  4. Imaging features from DCE-MRI: A set of 529 imaging features extracted by in-house software. These features represent a variety of imaging characteristics, including size, shape, texture, and enhancement of both the tumor and the surrounding tissue. The set combines features commonly published in the literature with features developed in our lab.




The data is shared in collaboration with The Cancer Imaging Archive (TCIA) and can be found under the following link:


Under the same link, locations of individual parts of the data are presented under “Data Access”.



The main publication describing this data is:

Saha, A., Harowicz, M.R., Grimm, L.J., Kim, C.E., Ghate, S.V., Walsh, R. and Mazurowski, M.A., 2018. A machine learning approach to radiogenomics of breast cancer: a study of 922 subjects and 529 DCE-MRI features. British Journal of Cancer, 119(4), pp.508-516.


Additionally, we published multiple manuscripts using various components of this dataset. Those publications are listed in the section PUBLICATIONS below.



The breast MRI dataset contains 922 patients with invasive breast cancer and an available pre-operative MRI, seen at Duke Hospital from 1 January 2000 to 23 March 2014. The shared set comprises the patients included in the main publication (in section HOW TO REFERENCE THIS DATA) after the necessary exclusions mentioned there.



The images (DICOMs) are located here: https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=70226903.

Under “Data Access”, use “Data Type” value as “Images” to download. The respective paths to the image slices are indicated in “File Path mapping tables.xlsx”.



The image annotations are located here: https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=70226903.

Under “Data Access”, use “Data Type” value as “Annotation Boxes” to download the Annotation_Box.xlsx spreadsheet. Apart from the header, each row corresponds to a unique patient denoted by the “Patient ID” in the first column. The next six columns describe the annotation box as follows: the first two values are the start and end row of the box, the third and fourth values are the start and end column of the box, and the fifth and sixth values are the start and end slice number of the box. If the original box coordinates were not integers, they were rounded for this spreadsheet.
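As a rough illustration of how one row of this spreadsheet could be applied to an image volume, the sketch below crops a box from a volume stored as nested lists indexed [slice][row][column]. The column order follows the description above. The names `AnnotationBox`, `parse_box_row`, and `crop_volume`, the example patient ID, and the assumption that coordinates are 1-based and inclusive (the boxes were drawn in MATLAB) are ours and are not stated on this page; they should be checked against the actual data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationBox:
    """The six box values that follow the Patient ID column."""
    start_row: int
    end_row: int
    start_col: int
    end_col: int
    start_slice: int
    end_slice: int


def parse_box_row(row):
    """Split one spreadsheet row into (patient_id, AnnotationBox)."""
    patient_id, *values = row
    return patient_id, AnnotationBox(*(int(v) for v in values))


def crop_volume(volume, box):
    """Crop a volume indexed as volume[slice][row][column].

    ASSUMPTION: coordinates are 1-based and inclusive; verify on real data.
    """
    return [
        [r[box.start_col - 1:box.end_col]
         for r in sl[box.start_row - 1:box.end_row]]
        for sl in volume[box.start_slice - 1:box.end_slice]
    ]


# Toy volume: 4 slices, 5 rows, 6 columns; each value encodes its own position.
volume = [[[100 * s + 10 * r + c for c in range(6)]
           for r in range(5)] for s in range(4)]

# Hypothetical row: rows 2-3, columns 4-6, slices 1-2.
patient_id, box = parse_box_row(["Example_001", 2, 3, 4, 6, 1, 2])
crop = crop_volume(volume, box)
print(patient_id, len(crop), len(crop[0]), len(crop[0][0]))  # Example_001 2 2 3
```

If the coordinates turn out to be 0-based or end-exclusive, only the slicing offsets in `crop_volume` need to change.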

Other details pertaining to annotation: The boxes were drawn by 8 radiologists using an in-house graphical user interface developed in MATLAB. The MRI sequences involved in annotation were: (a) pre-contrast, (b) first post-contrast, and (c) subtracted (obtained by subtracting the pre-contrast from the first post-contrast).
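The subtracted sequence in (c) amounts to a voxel-wise difference. A minimal sketch follows, assuming the two series are already aligned and have identical dimensions; the function name and toy values are illustrative. Whether the annotation tool clipped negative differences is not stated here, so this sketch keeps them.

```python
def subtract_series(pre_contrast, post_contrast):
    """Voxel-wise (post - pre) for volumes indexed [slice][row][column].

    ASSUMPTION: both volumes are aligned and have the same shape.
    """
    if len(pre_contrast) != len(post_contrast):
        raise ValueError("series have different numbers of slices")
    return [
        [[post - pre for pre, post in zip(pre_row, post_row)]
         for pre_row, post_row in zip(pre_slice, post_slice)]
        for pre_slice, post_slice in zip(pre_contrast, post_contrast)
    ]


# One 2x2 slice with invented intensities.
pre = [[[10, 20], [30, 40]]]
post = [[[15, 20], [90, 35]]]
subtracted = subtract_series(pre, post)
print(subtracted)  # [[[5, 0], [60, -5]]]
```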

The data was annotated in two parts with some differences in the procedures followed. The first procedure resulted in annotation of a subset of 271 patients, and the second one resulted in the annotation of the remaining 651 patients.

For 271 of the patients, a panel of 6 fellowship-trained radiologists was formed. Each study was randomly assigned to one of the 6 radiologists, so each radiologist annotated only a subset of the studies. The radiologists used a graphical user interface to draw a three-dimensional box around any areas of mass and non-mass enhancement for up to five lesions. If multiple lesions were annotated, the biopsied tumor was selected after further review of relevant radiology and pathology reports. If there were multiple biopsies, the largest biopsied tumor was selected for feature extraction.

For the remaining 651 patients, a panel of 4 fellowship-trained radiologists was formed and the annotation procedure was slightly modified. The radiologists were provided with the location(s) of the biopsies and were told to annotate the largest biopsied lesion. Each study was randomly assigned to one of the 4 radiologists. In contrast to the first phase, the radiologists had access to the PACS system, should they need it.



The tabular data are located here: https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=70226903.

Under “Data Access”, use the “Data Type” value as “Clinical and Other Features” to download. Following are specific elements of this file:

MRI technical information. These fields were collected from the pre-contrast sequence and, if they were not available in the pre-contrast sequence, they were collected from the post-contrast sequences. To calculate the ‘Days to MRI from Diagnosis’, the date of diagnosis obtained in the clinical report was subtracted from the date of MRI acquisition.

  • Days to MRI from diagnosis
  • Manufacturer
  • Manufacturer model name
  • Scan option
  • Field strength (Tesla)
  • Patient position during MRI
  • Image position of patient
  • Contrast bolus volume (mL)
  • Repetition time
  • Echo time
  • Acquisition matrix
  • Slice thickness
  • Rows
  • Columns
  • Reconstruction diameter
  • Flip angle
  • Field of view (cm)
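The subtraction behind 'Days to MRI from Diagnosis' can be sketched as below. The function name and the dates are invented for illustration; a negative result would mean the MRI was acquired before the recorded date of diagnosis.

```python
from datetime import date


def days_to_mri_from_diagnosis(diagnosis_date, mri_date):
    """Date of MRI acquisition minus date of diagnosis, in whole days."""
    return (mri_date - diagnosis_date).days


# Invented example dates.
print(days_to_mri_from_diagnosis(date(2010, 3, 1), date(2010, 3, 15)))  # 14
print(days_to_mri_from_diagnosis(date(2010, 3, 1), date(2010, 2, 24)))  # -5
```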


Demographics. These data were obtained from the Oncology clinic note in the electronic medical record.

  • Age in days at diagnosis
  • Menopausal status at diagnosis (based on clinical notes in the electronic medical record)
  • Race/ethnicity (White, Black, Asian, Native American, Hispanic, Multiethnic, Hawaiian, American Indian)
  • Metastatic disease at presentation (no,yes)


Tumor Characteristics. These data were obtained from the pathology biopsy report.

  • Estrogen receptor status (negative, positive)
  • Progesterone receptor status (negative, positive)
  • Human epidermal growth factor 2 receptor status (negative, positive)
  • Molecular subtype (luminal-like, ER/PR positive and HER2 positive, HER2, triple negative)
  • Oncotype score
  • TNM staging (based on combined pathologic and clinical staging)
  • Tumor grade (tubule, nuclear, and mitotic)
  • Nottingham grade (low, intermediate, high)
  • Histologic type (ductal carcinoma in situ, invasive ductal carcinoma, invasive lobular carcinoma, metaplastic, lobular carcinoma in situ, mixed type, micropapillary, colloid)
  • Tumor location (left,right)
  • Tumor position (clock face position, i.e. L 12 means left breast 12 o’clock)
  • Bilateral breast cancer (yes,no), if bilateral breast cancer different receptor status (yes,no)
  • Side annotated on the imaging (left,right)
  • For other side if bilateral: side of cancer, oncotype score, Nottingham grade, ER status, PR status, HER2 status, molecular subtype
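The four molecular subtype categories listed above line up with the three receptor statuses. This page does not state the exact rule used to assign them, so the mapping below is only an assumption based on a commonly used receptor-based surrogate, and the function name is hypothetical. The spreadsheet's own subtype column should be treated as authoritative.

```python
def molecular_subtype(er_positive, pr_positive, her2_positive):
    """Map receptor statuses to one of the four listed subtype labels.

    ASSUMPTION: hormone-receptor positive means ER or PR positive;
    the dataset's actual assignment rule is not given on this page.
    """
    hormone_positive = er_positive or pr_positive
    if hormone_positive and her2_positive:
        return "ER/PR positive and HER2 positive"
    if hormone_positive:
        return "luminal-like"
    if her2_positive:
        return "HER2"
    return "triple negative"


print(molecular_subtype(True, False, False))   # luminal-like
print(molecular_subtype(False, False, False))  # triple negative
```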


MRI Findings. These data were obtained from the radiologist MRI report.

  • Multicentric/multifocal (no,yes)
  • Contralateral breast involvement (no,yes)
  • Lymphadenopathy or suspicious lymph nodes (no,yes)
  • Skin/nipple involvement (no,yes)
  • Pectoral muscle/chest involvement (no,yes)


Surgery. These data were obtained from the Oncology clinic note in the electronic medical record.

  • Surgery status (no,yes)
  • Days to surgery from diagnosis
  • Definitive surgery type (breast conservation therapy, mastectomy)


Radiation Therapy. These data were obtained from the Oncology clinic note in the electronic medical record.

  • Neoadjuvant radiation (no,yes)
  • Adjuvant radiation (no,yes)


Tumor response. These data were obtained from the Oncology clinic note in the electronic medical record. Please note that these data were obtained from the initial evaluation of the electronic medical record. Columns further to the right in the spreadsheet, labeled Pathological Response to Neo-Adjuvant Therapy and Near-Complete Response, were obtained on a second review of the electronic medical record, with a few updates made to the data.

  • Clinical response (obtained from radiologist imaging report)
  • Pathologic response to neoadjuvant therapy (complete response, not complete response, DCIS only remaining, LCIS only remaining, treatment response assessment unavailable, not applicable)


Recurrence. These data were obtained from the Oncology clinic note in the electronic medical record.

  • No, yes
  • If yes: days to local recurrence and/or days to distant recurrence from date of diagnosis


Follow-up. These data were obtained from all clinical notes in the electronic medical record.

  • Days to death from diagnosis
  • Days to last local recurrence free assessment (based on clinical notes in the electronic medical record)
  • Days to last distant recurrence free assessment (based on clinical notes in the electronic medical record)
  • Days to last contact in electronic medical record (last time patient known to be alive, unless age of death is reported)
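For outcome analyses, the recurrence and follow-up fields above can be combined into a time-to-event pair with right censoring. The sketch below does this for distant recurrence. It is an assumption about how the fields might be used, not a procedure described on this page, and the function and parameter names are hypothetical. `None` stands for an empty spreadsheet cell.

```python
from typing import NamedTuple, Optional


class TimeToEvent(NamedTuple):
    days: int             # days from diagnosis
    event_observed: bool  # True = recurrence, False = censored


def distant_recurrence_free_time(
    days_to_distant_recurrence: Optional[int],
    days_to_last_distant_free_assessment: Optional[int],
) -> Optional[TimeToEvent]:
    """Build a (time, event) pair for distant recurrence.

    ASSUMPTION: a patient without a recorded distant recurrence is
    censored at the last distant-recurrence-free assessment.
    """
    if days_to_distant_recurrence is not None:
        return TimeToEvent(days_to_distant_recurrence, True)
    if days_to_last_distant_free_assessment is not None:
        return TimeToEvent(days_to_last_distant_free_assessment, False)
    return None  # no usable follow-up information


print(distant_recurrence_free_time(400, None))
print(distant_recurrence_free_time(None, 1500))
```

The same pattern would apply to local recurrence using the corresponding local fields.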


Mammography Characteristics. These data were obtained from the radiologist preoperative mammogram report.

  • Age at mammogram
  • Breast density (heterogeneous, scattered, minimal, moderate, extremely, predominantly fatty)
  • Lesion shape (oval, irregular, lobular, reniform, stellate)
  • Lesion margin (obscured, spiculated, indistinct/ill-defined, circumscribed)
  • Architectural distortion (no,yes)
  • Lesion density
  • Calcifications (yes, pleomorphic, heterogeneous, microcalcification, linear, clustered, amorphous, branching)
  • Lesion size (cm)


Ultrasound (US) features. These data were obtained from the radiologist preoperative ultrasound report.

  • Lesion shape (oval, irregular, lobular)
  • Lesion margin (obscured, ill-defined, spiculated, indistinct, circumscribed, microlobulated, angular, irregular)
  • Lesion size (cm)
  • Lesion echogenicity (hypoechoic, hyperechoic, isoechoic, anechoic, irregular, mixed, boundary)
  • Solid
  • Posterior acoustic shadowing


Therapy data. Please note this data was obtained on second review of the electronic medical record with a few updates made to the data.

  • Chemotherapy:
  • Endocrine Therapy:
  • Anti-Her2/Neu Therapy:
  • Neo-Adjuvant Therapy:
  • Pathologic Response to Neo-Adjuvant Therapy:
  • Near-Complete Response:



  1. Saha A, Yu X, Sahoo D, Mazurowski MA. Effects of MRI scanner parameters on breast cancer radiomics. Expert Syst Appl. 2017;87: 384–391. doi:10.1016/j.eswa.2017.06.029
  2. Saha A, Harowicz MR, Mazurowski MA. Breast cancer MRI radiomics: An overview of algorithmic features and impact of inter-reader variability in annotating tumors. Med Phys. 2018;45: 3076–3085. doi:10.1002/mp.12925
  3. Saha A, Harowicz MR, Grimm LJ, Kim CE, Ghate SV, Walsh R, et al. A machine learning approach to radiogenomics of breast cancer: A study of 922 subjects and 529 DCE-MRI features. Br J Cancer. 2018;119: 508–516. doi:10.1038/s41416-018-0185-8
  4. Saha A, Harowicz MR, Wang W, Mazurowski MA. A study of association of Oncotype DX recurrence score with DCE-MRI characteristics using multivariate machine learning models. J Cancer Res Clin Oncol. 2018;144: 799–807. doi:10.1007/s00432-018-2595-7
  5. Cain EH, Saha A, Harowicz MR, Marks JR, Marcom PK, Mazurowski MA. Multivariate machine learning models for prediction of pathologic response to neoadjuvant therapy in breast cancer using MRI features: a study using an independent validation set. Breast Cancer Res Treat. 2019;173: 455–463. doi:10.1007/s10549-018-4990-9
  6. Mazurowski MA, Saha A, Harowicz MR, Cain EH, Marks JR, Marcom PK. Association of distant recurrence-free survival with algorithmically extracted MRI characteristics in breast cancer. J Magn Reson Imaging. 2019;49: e231–e240. doi:10.1002/jmri.26648


We would like to acknowledge all those who contributed to this dataset.