Electronic Health Records-Based Phenotyping


Primary Contributors
Contributing Editor
  • Gina Uhlenbrauck

See Acknowledgments for full list of contributors

In the context of electronic health records (EHRs), a computable phenotype or simply phenotype refers to a clinical condition or characteristic that can be ascertained via a computerized query to an EHR system or clinical data repository using a defined set of data elements and logical expressions. These queries can identify patients with a particular condition, such as diabetes mellitus, obesity, or heart failure, and can be used to support a variety of purposes and data needs for observational and interventional research. Standardized computable phenotypes can enable large-scale pragmatic clinical trials across multiple health systems while ensuring reliability and reproducibility. We describe mechanisms for identifying and evaluating phenotype definitions, with a particular focus on standardization efforts from the NIH Health Care Systems Research Collaboratory (“Collaboratory”).

Introduction and Definitions

What is a phenotype?

A phenotype is the observable physical or biochemical expression of a specific trait in an organism, such as a disease, stature, or blood type, based on genetic information and environmental influences. The phenotype of an organism includes factors such as physical appearance, biochemical processes, and behavior. In short, the phenotype of an organism is the appearance it presents to observers.

A more contemporary interpretation of the term phenotype refers to measurable biological (physiological, biochemical, and anatomical features), behavioral (psychometric pattern), or cognitive markers that are found more often in individuals with a disease or condition than in the general population.

What is a computable phenotype?

A computable phenotype is a clinical condition, characteristic, or set of clinical features that can be determined solely from the data in EHRs and ancillary data sources and does not require chart review or interpretation by a clinician. These can also be referred to as EHR condition definitions, EHR-based phenotype definitions, or simply phenotypes.

We use the term EHR broadly to reference data that are generated through healthcare delivery and reimbursement practices; in practice, these functions may be covered in multiple systems and can contain both practice management data and data that are strictly limited to the clinical domain. We use ancillary data sources to refer to sources such as disease registries, claims data, or supplemental data collection that are related to health care delivery but may not be directly integrated into the EHR system.

What are computable phenotype definitions?

Computable phenotype definitions are specifications for identifying patients or populations with a given characteristic or condition of interest from EHRs using data that are routinely collected in EHRs or ancillary data sources. Computable phenotype definitions can support reproducible queries of EHR data from multiple organizations. These queries can then be replicated at multiple sites in a consistent fashion, enabling efficiencies and also ensuring that populations identified from different healthcare organizations have similar features, or at least were identified in the same way.

Phenotype definitions are composed of data elements and logic expressions (AND, OR, NOT) that can be interpreted and executed by a computer. In other words, the syntax defining a computable phenotype is designed to be interpreted and executed programmatically without human intervention. Computable phenotype definitions rely on value sets derived from standardized coding systems and may employ hierarchies and weighting factors for data elements. Data elements and the difference between data elements and phenotypes will be described further in this chapter.
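The idea of combining data elements with AND, OR, and NOT can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patient record layout, the element names, and the codes (DX_A, MED_A, DX_EXCLUDE) are hypothetical placeholders, not part of any published phenotype definition.

```python
from typing import Callable, Dict, Set

# Hypothetical record layout: data element name -> set of coded values
# recorded for one patient.
Patient = Dict[str, Set[str]]
Criterion = Callable[[Patient], bool]


def has_any(element: str, value_set: Set[str]) -> Criterion:
    """Criterion that holds when the patient has any value from the value set."""
    return lambda patient: bool(patient.get(element, set()) & value_set)


def all_of(*criteria: Criterion) -> Criterion:
    """Logical AND of several criteria."""
    return lambda patient: all(c(patient) for c in criteria)


def any_of(*criteria: Criterion) -> Criterion:
    """Logical OR of several criteria."""
    return lambda patient: any(c(patient) for c in criteria)


def not_(criterion: Criterion) -> Criterion:
    """Logical NOT of a criterion."""
    return lambda patient: not criterion(patient)


# Illustrative definition: (diagnosis OR medication) AND NOT exclusion code.
definition = all_of(
    any_of(
        has_any("diagnosis", {"DX_A", "DX_B"}),
        has_any("medication", {"MED_A"}),
    ),
    not_(has_any("diagnosis", {"DX_EXCLUDE"})),
)

print(definition({"diagnosis": {"DX_A"}}))                # True
print(definition({"diagnosis": {"DX_A", "DX_EXCLUDE"}}))  # False
```

Because each criterion is an ordinary function, the same definition can be run unchanged against any data source that is first mapped into the assumed record layout.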

Why are computable phenotype definitions important?

The ability to identify people with particular conditions across healthcare organizations by using common definitions has value for clinical quality measurement, health improvement, and research. Standard phenotype definitions can enable direct identification of cohorts based on population characteristics, risk factors, and complications, allowing decision-makers to identify and target patients for screening tests and interventions that have been demonstrated to be effective in similar populations. This identification process can be integrated with the EHR for real-time clinical decision support.

Standard phenotype definitions can also streamline the development of registries and applications using healthcare data and can enable consistent inclusion criteria to support regional surveillance in the identification of infectious diseases and rare disease complications.

Finally, computable phenotype definitions are essential to the conduct of pragmatic clinical trials and comparative effectiveness research. These studies, which may involve multiple hospitals or health systems, rely on standard phenotype definitions for EHR-based inclusion/exclusion of participants and consistent data analysis and reporting across data sources. Computable phenotype definitions have applications in interventional, observational, prospective, and retrospective studies [1].

How do computable phenotypes relate to the true presence of a condition?

As shown in the figure below, phenotype definitions are composed of data constructs and coding systems available for providers to record patient data in EHR systems. These data from EHRs may reflect a patient’s state or disease status, but the data are generated from the perception, interpretation, and recording by the clinical staff that are observing the patient. The data in EHRs, therefore, represent a limited view of a patient’s condition, and are by definition incomplete and often biased.

EHR phenotyping. Source: Hripcsak G, Albers DJ. J Am Med Inform Assoc 2013;20:117-121. (Used under Creative Commons license.)

EHR data are available only for those patients who are motivated (often by disease or illness) and able to see a provider. Other attributes related to the healthcare provider and providing organization influence the nature of the data in EHRs, including the experience of the provider, availability and use of diagnostic equipment and therapeutic procedures, interactions with clinical specialists, insurance coverage and limitations, and coding and reimbursement practices of the organization [2]. The quantitative impact of each of these features on the performance of phenotype definitions is largely unknown. The measurement and estimation of these factors, and the development of strategies to mitigate their impact on data quality, are active methodological research areas in health services research and informatics.

What are the benefits of “standard” phenotypes or condition definitions?

The explicit documentation of computable phenotype definitions can support their use in many different organizations or settings for the consistent identification of patient populations for various purposes. It is important to identify appropriate phenotype definitions for health policy and research. Differences across phenotype definitions can potentially affect their application in healthcare organizations and subsequent interpretation of data.

It is not proposed that a single phenotype definition—of type 2 diabetes mellitus or heart failure, for example—will be sufficient for all intended uses. Rather, the Collaboratory intends to research existing phenotype definitions and document a set of common, well-defined phenotype definitions appropriate for a given characteristic or condition and intended use. This work will support future standardization efforts, including realizing the vision of a standardized Table 1 for reporting baseline patient characteristics in research studies.

Standardization—the process of reconciling differences—can be applied in many different ways within the arena of phenotypes. We distinguish between data capture standardization, phenotype definition standardization, and phenotype representation standardization. These distinctions are important because researchers using secondary data for research purposes do not normally have the ability to enforce data capture standardization for the originating system.

The standardization of one or more phenotype definitions is a complex process that will necessarily engage many stakeholders, representing clinical, research (industry and academia), and patient perspectives. Future work of the Collaboratory will be to identify and promote standards in this area by supporting broader vetting and promotion of scientifically and clinically validated phenotype definitions.

What data sources are used?

Unfortunately, there are still only a limited number of data fields that are routinely collected across different EHR systems. Most phenotype definitions, therefore, use some combination of International Classification of Diseases, Ninth Revision (ICD-9) codes, medication names, and/or laboratory tests. ICD-9-CM diagnosis codes can be found in technical billing, professional billing, and/or problem lists. In the future, EHRs will use ICD-10 codes for diagnoses and potentially Systematized Nomenclature of Medicine–Clinical Terms (SNOMED CT) codes for problem lists and other aspects of EHRs. EHRs also contain narrative (unstructured) data. The use of natural language processing techniques within the biomedical domain is evolving and may offer opportunities for leveraging clinically rich, narrative data within EHRs [3]. There are many opportunities to validate and improve these algorithms [4].

The U.S. Department of Health and Human Services’ (HHS) Office of the National Coordinator for Health Information Technology (ONC) maintains standards and implementation specifications for EHR systems to ensure that certified systems support the achievement of Meaningful Use criteria [5]. Accordingly, data elements required by ONC are able to be collected within all certified EHR systems in the United States in a manner consistent with ONC specifications.

Because EHR data may be available from different types of encounters, including inpatient, outpatient, and emergency department visits, phenotype definitions should take into consideration which sources are relevant to answering the question at hand. In some cases, multiple sources will be needed for complete data capture. For example, medication data can be obtained from reconciliation of various contexts, such as inpatient administration, provider ordering, or outpatient dispensing.

What terms are related to phenotype definitions?

Informatics and data standards groups use the following terms related to phenotype definitions:

  • Data element: the unit of data being queried, exchanged, or analyzed, which includes a descriptive name that represents the concept being described plus a specified value set and other descriptive metadata, such as a definition. As illustrated in the next section, phenotype definitions can be represented using one or more data elements.
  • Value set: the set of possible values, categories, or responses (and their codes) that are associated with a particular data element, often derived from established vocabularies or data standards [6].
  • Metadata: descriptive data about objects, including data objects. Metadata are data about data [7], such as version, author, concept, identifier, data type, definition, and preferred label for a particular data element in a data collection system or form.
  • Operationalization: a process by which a researcher defines how a concept is measured, observed, or manipulated within a particular study and available data sources; this process translates a theoretical, conceptual variable of interest into a set of specific operations or procedures that define the variable’s meaning in a specific study, allowing for examination of a hypothesis [8]. A phenotype definition can be considered an operationalization of a disease concept in electronic health data systems or clinical data repositories.
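The relationship among these terms can be made concrete with a small data structure. The sketch below is one possible representation, not a standard schema: the class name `DataElement`, its fields, and the metadata keys are assumptions chosen for illustration, and the values mirror the sex example used in this chapter.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet


@dataclass
class DataElement:
    """A data element: a named concept plus its value set and metadata."""
    name: str
    definition: str
    value_set: FrozenSet[str]
    metadata: Dict[str, str] = field(default_factory=dict)

    def allows(self, value: str) -> bool:
        """Check whether a recorded value belongs to the value set."""
        return value in self.value_set


sex = DataElement(
    name="Sex",
    definition="Sex of the patient as recorded in the EHR (illustrative wording)",
    value_set=frozenset({"Male", "Female", "Unknown/Not reported"}),
    metadata={"version": "0.1", "data_type": "categorical"},  # hypothetical metadata
)

print(sex.allows("Female"))  # True
print(sex.allows("F"))       # False: "F" is not in this value set
```

A local value such as "F" failing the check is the kind of mismatch that standardized value sets are meant to prevent.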

The standardization of data elements and their associated value sets will support consistent phenotype definitions across healthcare providers and organizations using different EHR systems. This is the goal of the ONC, using the Meaningful Use incentive program, and is supported in part by the NIH Common Data Element initiatives and the Value Set Authority Center of the National Library of Medicine.

How are data elements and phenotypes different?

Every data element has a value set, and value sets can vary in size and complexity. A value set might include a limited set of categorical values, or a more extensive list of codes from standardized coding systems such as ICD-9-CM or RxNorm. For example, the data element for “sex” includes a single variable with that name, along with a set of discrete values, and perhaps with a definition and associated descriptive metadata. To query the sex of a person, a single data element is assessed.

Example Data Elements with Associated Value Sets (Categorical Value Types)

Data Element | Value Set (Categorical Values)
Sex | Male, Female, Unknown/Not reported
Race | American Indian or Alaska Native, Asian, Black or African American, Native Hawaiian or Other Pacific Islander, White, Unknown/Not reported

Demographic characteristics are generally data elements in and of themselves, not combinations of data elements. They might be considered phenotypes themselves, but more often are used as component data elements for phenotype definitions of particular medical conditions.

Many data elements include long lists of values, called nominal value sets. Data elements with nominal value sets can reference entire coding systems or enumerated lists from standardized coding systems or controlled vocabularies.

Example Data Elements with Associated Value Sets (Nominal Value Types, Using Coding Systems)

Data Element | Value Sets (Nominal)
Final diagnosis | ICD-9-CM codes (all)
Final diagnosis of diabetes | 249.xx, 250.xx, 357.2, 362.01-06, 366.41 (from ICD-9-CM)
Medications ordered | Local medication list; clinical drugs coded in RxNorm
Diabetes-related medications ordered | Acarbose, Precose, Acetohexamide, Dymelor, etc.
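
Value-set entries such as "250.xx" or "362.01-06" are shorthand that must be expanded before a computer can apply them. The following sketch shows one way to test a recorded code against such entries. It assumes that each "x" stands for exactly one digit and that a range like "362.01-06" covers 362.01 through 362.06; a real implementation should follow the conventions documented by the value set's author.

```python
import re
from typing import Iterable

# Diabetes diagnosis value set as listed in the table above.
DIABETES_DX = ["249.xx", "250.xx", "357.2", "362.01-06", "366.41"]


def matches(code: str, entry: str) -> bool:
    """Return True when a recorded code satisfies one value-set entry."""
    entry = entry.strip()
    if "-" in entry:
        # Range shorthand, e.g. "362.01-06" means 362.01 through 362.06.
        start, end = entry.split("-", 1)
        prefix = start[: len(start) - len(end)]
        if not code.startswith(prefix) or len(code) != len(start):
            return False
        suffix = code[len(prefix):]
        return suffix.isdigit() and int(start[len(prefix):]) <= int(suffix) <= int(end)
    if "x" in entry:
        # Wildcard shorthand: each "x" is assumed to stand for one digit.
        pattern = re.escape(entry).replace("x", r"\d")
        return re.fullmatch(pattern, code) is not None
    return code == entry


def in_value_set(code: str, value_set: Iterable[str]) -> bool:
    """Return True when the code satisfies any entry in the value set."""
    return any(matches(code, entry) for entry in value_set)


print(in_value_set("250.00", DIABETES_DX))  # True
print(in_value_set("362.07", DIABETES_DX))  # False
```

Under the one-digit-per-x assumption, a three-digit code recorded as "250" would not match "250.xx"; whether it should is a decision the definition's specification needs to state explicitly.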

Phenotype definitions are represented as logical query criteria using one or more data elements with a defined value set. For example, to infer that a patient has a clinical characteristic such as diabetes, evidence can come from one or many data elements:

Possible Data Elements to Identify the Presence of Diabetes

Data Element | Value Sets (Nominal)
ICD-9-CM codes for diabetes | 249.xx, 250.xx, 357.2, 362.01-06, 366.41
Diabetes-related medications | Acarbose, Precose, Acetohexamide, Dymelor, etc.
Hemoglobin A1c values suggestive of uncontrolled diabetes | ≥6.5%

Any one of the elements in the table above, or all of the elements collectively, could be used to define a phenotype definition for diabetes. Such a definition would specify that any or all of the data elements (and associated value criteria) must be present to classify a patient as having diabetes on the basis of the data recorded in the EHR.
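
The difference between requiring any element and requiring all elements can be shown with a short sketch. It uses the three data elements from the table above; the record layout (`PatientRecord`) and function names are hypothetical, and the medication list contains only the four names given in the table, so this is not a complete or validated diabetes definition.

```python
from dataclasses import dataclass, field
from typing import Dict, List

DX_PREFIXES = ("249.", "250.")  # stands in for 249.xx and 250.xx
DX_EXACT = {"357.2", "366.41"} | {f"362.0{i}" for i in range(1, 7)}  # 362.01-06
DIABETES_MEDS = {"acarbose", "precose", "acetohexamide", "dymelor"}  # partial list
HBA1C_THRESHOLD = 6.5  # percent, from the table above


@dataclass
class PatientRecord:
    """Hypothetical, simplified view of one patient's EHR data."""
    diagnosis_codes: List[str] = field(default_factory=list)
    medications: List[str] = field(default_factory=list)
    hba1c_percent: List[float] = field(default_factory=list)


def evidence(p: PatientRecord) -> Dict[str, bool]:
    """Evaluate each data element separately."""
    return {
        "diagnosis": any(c.startswith(DX_PREFIXES) or c in DX_EXACT
                         for c in p.diagnosis_codes),
        "medication": any(m.lower() in DIABETES_MEDS for m in p.medications),
        "hba1c": any(v >= HBA1C_THRESHOLD for v in p.hba1c_percent),
    }


def has_diabetes(p: PatientRecord, require_all: bool = False) -> bool:
    """Classify using ANY element (default) or ALL elements."""
    results = evidence(p).values()
    return all(results) if require_all else any(results)


coded_only = PatientRecord(diagnosis_codes=["250.00"])
print(has_diabetes(coded_only))                    # True: one element suffices
print(has_diabetes(coded_only, require_all=True))  # False: no medication or lab evidence
```

The "any" form will generally identify more patients and the "all" form fewer; which trade-off is appropriate depends on the intended use of the phenotype.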

Demographic characteristics such as sex are not really phenotype definitions, but rather data elements whose value sets are relatively short lists of category variables. However, they are included in this discussion because they are important person characteristics, frequently reported in research, and need to be standardized across pragmatic clinical trials. Data elements for such patient characteristics can also be part of a phenotype. For example, male sex could be a component of a prostate cancer phenotype definition.

Finding Existing Phenotype Definitions

Who has developed phenotype definitions and where can they be found?

There are several key groups involved with establishing phenotype definitions, and some authoritative sources are described in this section. Phenotype definitions may be developed by government entities, universities, health systems, professional societies, or clinical trial consortia. The Collaboratory is aware of the many related efforts and the dynamic nature of this field and is continually surveying for phenotype-related efforts in an attempt to keep this work in context while preventing duplication of any previous efforts.

Phenotypes Environmental Scan (survey of phenotype-related efforts)

Chronic Conditions Data Warehouse

The Centers for Medicare and Medicaid Services has developed the Chronic Conditions Data Warehouse, intended to enable research on 27 chronic conditions that were determined to be of particular importance to Medicare beneficiaries. This resource includes the algorithms that define the 27 chronic conditions, as well as links to the references that were used in the creation of the categories.

Clinical Classifications Software

The Healthcare Cost and Utilization Project is a well-established collection of databases and tools sponsored by the Agency for Healthcare Research and Quality. This project has produced Clinical Classifications Software that groups ICD-9-CM codes into clinically meaningful categories.

eMERGE: Electronic Medical Records and Genomics Network

The Electronic Medical Records and Genomics (eMERGE) Network was organized by the National Human Genome Research Institute to connect EHR data with specimens from biorepositories to enable genetic research. The ultimate goal is to provide genetic data for clinical care or personalized medicine. Equipped with genotyping data made available by the advent of the genome-wide association studies era, researchers are now turning to the expanding volume of clinical data in EHRs to identify genotype-phenotype associations. The Phenotype KnowledgeBase is a collaborative environment, organized via a website and facilitated by the eMERGE consortium, that enables access to validated phenotype definitions (“algorithms”), validation of existing phenotype algorithms on EHRs, collaboration on existing and new phenotype algorithms, and interaction with potential phenotype algorithm collaborators.


Mini-Sentinel

Mini-Sentinel is a project sponsored by the U.S. Food and Drug Administration (FDA) with the goal of creating a system of safety surveillance for drugs and medical devices after they have been approved for marketing (“postmarket surveillance”). The phenotyping efforts within this project include the accurate identification and characterization of clinical outcomes experienced by people using a specific FDA-regulated device or drug. Additional background information as well as methods and protocols can be accessed on the Mini-Sentinel website.


QualityNet

QualityNet is an effort sponsored by the Centers for Medicare and Medicaid Services aimed at improving the quality of healthcare for Medicare patients. QualityNet provides a secure environment for the exchange of healthcare information as well as tools and quality improvement news and information. QualityNet provides specifications for reporting quality measures that include definitions of clinical populations using standardized coding systems used in healthcare claims data.

Strategic Health IT Advanced Research Projects

The Strategic Health IT Advanced Research Projects (SHARP) program was established by the ONC to facilitate research that would enable increased adoption of health information technology. Area 4 of the SHARP project, known as SHARPn, is focused on enabling secondary use of EHR data. The SHARPn group has developed a resource called the Phenotype Portal for “…generating and executing Meaningful Use standards-based phenotyping algorithms that can be shared across multiple institutions and investigators.”

The Value Set Authority Center

The Value Set Authority Center (VSAC) is a repository hosted by the National Library of Medicine in collaboration with ONC and the Centers for Medicare and Medicaid Services. The VSAC provides access to the official versions of all value sets contained in the Meaningful Use 2014 Clinical Quality Measures (CQMs). Each value set consists of the numerical values (codes) and their respective human-readable names (terms). The value sets are derived from standard vocabularies such as SNOMED CT, RxNorm, Logical Observation Identifiers Names and Codes (LOINC), and ICD-10-CM, which are used to define clinical concepts for quality assessment purposes. The VSAC is expected to expand to incorporate value sets for other use cases, as well as for new measures and updated existing measures. The VSAC Data Element Catalog provides a listing of 2014 CQMs and value set names. Value sets are available for viewing or download after obtaining a free Unified Medical Language System Metathesaurus License (required due to usage restrictions on some of the codes included in the value sets). Detailed information on searching and downloading the value sets can be found on the VSAC website.

How can researchers find existing phenotype definitions in the literature?

The Collaboratory assembled information to provide suggestions for searching for phenotype definitions in the peer-reviewed literature. This guidance recommends first exploring authoritative sources for existing phenotype definitions, and then the published literature describing multisite studies or patient registries for particular conditions.

Evaluating Phenotype Definitions

What makes a “good” phenotype definition?

Computable phenotype definitions should be explicit, reproducible, reliable, and valid. Specific details of the components of a definition (e.g., data elements; value sets) should be provided and should be sufficient to reproduce the query in another system or by another data operator. For a phenotype definition to be reliable, it must be able to produce a similar result with the same data set every time it is applied. For a phenotype definition to be valid, it must identify the condition for which it was developed, and it must meet the desired degree of sensitivity and specificity.

Various performance metrics are used to measure the performance of a phenotype definition in different data sources or populations, analogous to measuring the performance of a case definition or diagnostic technique. These metrics include sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
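
These four metrics follow directly from a two-by-two comparison of the phenotype definition against a reference classification. The sketch below computes them from counts of true positives, false positives, false negatives, and true negatives; the function name and the example counts are illustrative only and do not come from any study.

```python
from typing import Dict


def performance_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute standard metrics from a 2x2 table of definition vs. reference.

    tp: flagged by the definition and truly has the condition
    fp: flagged by the definition but does not have the condition
    fn: not flagged but truly has the condition
    tn: not flagged and does not have the condition
    """
    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN rather than raising when a margin of the table is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Made-up counts for illustration.
metrics = performance_metrics(tp=80, fp=10, fn=20, tn=890)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```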

In addition, to become consistently used, computable phenotype definitions must leverage data that are routinely collected in most, if not all, EHRs and/or ancillary systems.

How can the validity of a phenotype definition be determined?

The validity of a phenotype definition refers to its ability to correctly measure or detect people with and without the intended condition; i.e., its ability to correctly identify which individuals exhibit the true phenotype and which do not.

The estimation of validity requires a gold standard, defined as the best classification available for assessing the true or actual phenotype status. Assessment of a gold standard is a resource-intensive process requiring careful manual review of current and historic individual patient data. Due to logistical and efficiency considerations, multiple clinical reviewers are usually involved in the process. However, to ensure consistency between conclusions drawn from the patient records, an initial training of the reviewers is crucial. Most studies utilize expert clinicians to review identified cases, but do not specify the training of the individuals or the details of their assessment of true disease/case status.

Many phenotype developers have conducted validation studies [9–11], but none appear to have used a controlled approach. Some investigators attempt to characterize the validity of a phenotype definition using agreement rates between the definition and a known standard, whereas others report the sensitivity or specificity of the definition compared with a known or gold standard. In this context, sensitivity is the ability to correctly identify individuals who have the phenotype, and specificity is the ability to correctly identify those who do not have the phenotype. The true phenotype status must be known to assess validity. Positive predictive value (PPV) is the proportion of individuals identified by the phenotype definition who truly have the condition, and negative predictive value (NPV) is the proportion of individuals not identified by the definition who truly do not have the condition. PPV and NPV give an indication of the success rate of the phenotype definitions when they are to be used in practice. Similar to sensitivity and specificity, PPV and NPV require knowledge about the true phenotype. They can also be estimated from the sensitivity, specificity, and prevalence of the condition in the population being examined.
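
The estimation of PPV and NPV from sensitivity, specificity, and prevalence is an application of Bayes' theorem. A minimal sketch follows; the function name and the input values are illustrative, not results from any validation study.

```python
from typing import Tuple


def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Estimate (PPV, NPV) from sensitivity, specificity, and prevalence."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    # Expected fractions of the population in each cell of the 2x2 table.
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    tn = specificity * (1.0 - prevalence)
    fp = (1.0 - specificity) * (1.0 - prevalence)

    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return ppv, npv


ppv, npv = predictive_values(sensitivity=0.8, specificity=0.9, prevalence=0.10)
ppv_rare, _ = predictive_values(sensitivity=0.8, specificity=0.9, prevalence=0.01)
print(f"PPV at 10% prevalence: {ppv:.3f}")
print(f"PPV at 1% prevalence:  {ppv_rare:.3f}")
```

With sensitivity and specificity held fixed, PPV falls as prevalence falls, which is one reason a definition validated in one population may perform differently in another.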

Researchers at Duke University’s Center for Predictive Medicine are developing and testing methods to quantify the validity and reliability of certain computable phenotype definitions (see presentation).

Determination of a gold standard is a critical complicating factor related to questions about data quality in EHRs and ultimately the “source of truth.” For conditions in which laboratory values are diagnostic, a laboratory value can be the gold standard, although the clinical context is critical in many cases. For behavioral or mental health conditions, the gold standard or best source of data to approximate the “truth” is often from the patient or from an observation by an expert clinician. For many diseases with complex etiology, subjective diagnosis, or a broad range of clinical presentations, the best source of data (or “truth”) is not clear. Likely, a variety of data sources must be used to determine a patient’s true state of disease or identify the condition.

How can the reliability and reproducibility of a phenotype definition be determined?

Reliability refers to the extent to which an experiment, test, or measuring procedure (or phenotype definition) yields the same results on repeated trials [12]. Reliability is an attribute of any computer-related component (software, hardware, or a network, for example) that consistently performs according to its specifications. One method for assessing reliability is to implement the phenotype definition algorithm multiple times and see if the results on the same patients are the same over repeated implementations.

In contrast, reproducibility refers to the consistency of results when the algorithm is implemented multiple times under similar conditions (perhaps with a different person implementing it). For reliability, one would repeatedly implement the algorithm on the same set of patients and check whether the phenotype results for the same patients match. For reproducibility, the algorithm can be implemented on either different or the same patient populations by different “coders.”
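
Both checks can be sketched as simple comparisons of classification results. In the example below, all names and data are hypothetical: two "coders" implement the same written criterion slightly differently (one treats the threshold as inclusive, the other as exclusive), and the comparison surfaces the patient on whom they disagree.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Algorithm = Callable[[Record], bool]


def classify_all(algorithm: Algorithm, patients: Dict[str, Record]) -> Dict[str, bool]:
    """Apply one implementation of a phenotype definition to every patient."""
    return {pid: algorithm(rec) for pid, rec in patients.items()}


def is_reliable(algorithm: Algorithm, patients: Dict[str, Record], runs: int = 3) -> bool:
    """Reliability check: repeated runs on the same patients give the same results."""
    first = classify_all(algorithm, patients)
    return all(classify_all(algorithm, patients) == first for _ in range(runs - 1))


def disagreements(impl_a: Algorithm, impl_b: Algorithm,
                  patients: Dict[str, Record]) -> List[str]:
    """Reproducibility check: patients classified differently by two implementations."""
    a = classify_all(impl_a, patients)
    b = classify_all(impl_b, patients)
    return sorted(pid for pid in patients if a[pid] != b[pid])


def coder_a(rec: Record) -> bool:
    return rec["hba1c_percent"] >= 6.5  # threshold read as inclusive


def coder_b(rec: Record) -> bool:
    return rec["hba1c_percent"] > 6.5   # threshold read as exclusive


patients = {
    "p1": {"hba1c_percent": 6.5},
    "p2": {"hba1c_percent": 7.2},
    "p3": {"hba1c_percent": 5.4},
}

print(is_reliable(coder_a, patients))             # True
print(disagreements(coder_a, coder_b, patients))  # ['p1']
```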

Ultimately what is required is an unequivocal algorithm that is implemented without any room for confusion. For most diseases (especially those with a subjective diagnosis or broad range of clinical presentations), a variety of data sources must be included in a phenotype definition. Unfortunately, the more complex the phenotype definition, the more difficult it can be to reproduce and the more likely errors can influence the reliability of the algorithm [13].

Several well-known issues can affect reliability, including coding terminology changes over time and coding practice variations at the provider, healthcare system, and regional levels. An active and future area of research involves studying data quality and testing various phenotypes in different settings or time periods to represent variations in data quality.

How can the reproducibility of a phenotype definition be optimized?

Two features of phenotype definitions can enhance the likelihood that they will be applied consistently: clearly articulated specifications for the definition and guidance for implementers. However, the development of meaningful specifications and documentation is complicated by the variation in healthcare information systems and lack of data standards for EHR data.

Ideally, a phenotype definition should be reproducible across institutions, but many factors can affect reproducibility, including regional differences in patient populations, differences in EHR systems, variations in the work flows that generate data, and variations in coding practices.

What are potential limitations of EHR data and computable phenotypes?

The data contained in EHRs and ancillary systems are generated through the provision of clinical care. As such, the data are not optimized for secondary uses and are associated with multiple limitations when applied for research purposes [14].

Missing Data

Because EHR data are derived from patient encounters with a provider or healthcare system, data are only recorded during healthcare episodes. This can result in bias due to healthier individuals being missing from the dataset. “Missingness” is a frequent problem and is often nonrandom—a concept known as informative censoring [15,16]. Patients are also lost to follow-up if they move out of the area or obtain care from a provider in a different healthcare system. In pragmatic clinical trials, it is therefore important to distinguish between “not present” in the dataset versus “did not assess.”
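
One way to keep this distinction visible in analysis code is to use three states instead of a true/false flag. The sketch below reads "did not assess" as "no result on record" and "not present" as "assessed, with no result meeting the criterion"; that reading, the names, and the threshold are assumptions made for illustration.

```python
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PRESENT = "present"            # assessed, criterion met
    NOT_PRESENT = "not present"    # assessed, criterion not met
    NOT_ASSESSED = "not assessed"  # no measurement on record


def elevated_hba1c(results: Optional[List[float]], threshold: float = 6.5) -> Status:
    """Classify a patient's HbA1c evidence without conflating missing and negative."""
    if not results:  # None or an empty list: the test was never recorded
        return Status.NOT_ASSESSED
    return Status.PRESENT if max(results) >= threshold else Status.NOT_PRESENT


print(elevated_hba1c([5.4, 5.9]))  # Status.NOT_PRESENT
print(elevated_hba1c(None))        # Status.NOT_ASSESSED
```

Collapsing NOT_ASSESSED into NOT_PRESENT would silently treat unmeasured patients as unaffected.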

Inaccurate or Uninterpretable Data

Errors are common in data from EHRs or ancillary sources, because most data are entered by busy healthcare providers during a patient visit or afterwards from recall. Phenotype definitions based on coding that is influenced by billing are susceptible to systematic biases. In addition, data may be uninterpretable if, for example, units of measurement are missing or analyzable information cannot be gleaned from qualitative assessments.

Complex and Inconsistent Data

In healthcare, clinical definitions, coding rules, and data collection systems vary over time, creating challenges in the analysis of these data. Data collection practices can also vary by providers at different locations. Finally, much information is still captured as unstructured data and stored in narrative notes. Though many challenges exist in extracting unstructured data, these data are increasingly being used to support various types of clinical decision-making and research using an evolving set of tools [17].

Data Quality

The quality of the data in health information systems has the potential to affect the results of phenotype-based queries in such a way that the resulting data may not be useful. Secondary use of healthcare data is defined as use of the data for a purpose other than that for which the data were originally collected [18]. This means that the secondary user should in no way expect that the data will meet his or her needs. For these reasons, data quality assessment should accompany phenotype validation. Using healthcare data in the absence of an understanding of their accuracy, consistency, missingness, and possible biases can lead to misleading answers. The capacity of the data to support research conclusions is so important that requests for applications for demonstration projects for the Collaboratory require that data validation be addressed, and there is ongoing work through a methodology contract from the Patient-Centered Outcomes Research Institute (PCORI) (contract #ME-1303-5581; PI, Michael Kahn, MD, PhD) to recommend reporting requirements for data quality (i.e., reporting of data quality along with research results).

The Collaboratory is developing a data quality assessment framework to help investigators and research teams with identifying and implementing necessary assessments. Unfortunately, today there are few validated electronic methods for data quality assessment that can be executed on a dataset. Instead, current methodology for data quality assessment is comparison based, involving comparison of chart review to data returned from a phenotype-based query, or comparison of two different datasets to quantify the number and type of discrepancies and understand how they might be distributed in a dataset.
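
A comparison-based assessment of this kind can be sketched as a tally of discrepancies between two datasets, for example the output of a phenotype-based query and the results of chart review. The discrepancy categories, field names, and sample records below are hypothetical and are not drawn from the Collaboratory framework.

```python
from collections import Counter
from typing import Dict, Optional

Record = Dict[str, Optional[str]]


def compare_datasets(query: Dict[str, Record],
                     reference: Dict[str, Record]) -> Counter:
    """Count matches and discrepancies, by type, between two datasets."""
    counts: Counter = Counter()
    for pid in sorted(set(query) | set(reference)):
        q, r = query.get(pid), reference.get(pid)
        if q is None:
            counts["patient_missing_from_query"] += 1
            continue
        if r is None:
            counts["patient_missing_from_reference"] += 1
            continue
        for name in sorted(set(q) | set(r)):
            qv, rv = q.get(name), r.get(name)
            if qv == rv:
                counts["match"] += 1
            elif qv is None:
                counts["value_missing_in_query"] += 1
            elif rv is None:
                counts["value_missing_in_reference"] += 1
            else:
                counts["value_mismatch"] += 1
    return counts


query_result = {
    "p1": {"diagnosis": "250.00", "sex": "Female"},
    "p2": {"diagnosis": None, "sex": "Male"},
    "p3": {"diagnosis": "250.00", "sex": "Male"},
}
chart_review = {
    "p1": {"diagnosis": "250.00", "sex": "Female"},
    "p2": {"diagnosis": "250.00", "sex": "Male"},
    "p4": {"diagnosis": "250.00", "sex": "Female"},
}

summary = compare_datasets(query_result, chart_review)
print(dict(summary))
```

Breaking the tally down further, for instance by site or by data element, would help show how discrepancies are distributed in a dataset.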

Identification and Development of Phenotype Definitions

A phenotype definition can be considered an operationalization of a disease concept in electronic health data systems or repositories. To operationalize a disease concept against EHR data, researchers must explicitly define how a concept should be measured, observed, or manipulated within a particular study and available data sources. A theoretical or conceptual variable of interest (disease) must be translated into a set of specific diagnoses or procedures paired with implementation specifications that define the variable’s meaning in a specific study. In the context of healthcare data, this means explicitly defining diagnoses, treatments, and clinical and patient characteristics that are indicative or suggestive of the condition. Researchers must specify the clinical condition that they are looking for and how that would be represented in various EHRs.

For example, to identify obesity, researchers would first identify diagnostic and procedure codes for this condition and investigate whether these codes are reliable and applied consistently. If they cannot reasonably assume that all patients with obesity would be coded with a given diagnosis or procedure code, they must use other data sources. The next step is to review the data sources that are available (e.g., EHR, claims, registry, and patient-reported outcomes data), noting that if a phenotype definition is to be applied in multiple organizations, the researchers must consider the data sources that are available in other organizations. Possible data sources for obesity might include patient height/weight, the ordering or dispensing of medications associated with weight management, or patient-reported data on weight or previous diagnosis of obesity. Within each data type, researchers should identify which data are actually available to them (e.g., an EHR extract may contain medication orders but not administration data, or billing diagnoses but not problem lists). Knowing the types of data available can support an early feasibility assessment of existing phenotype definitions.
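The obesity example can be made concrete with a short sketch that looks for evidence in more than one data source. Everything specific here is an assumption made for illustration and not a validated or recommended definition: the two ICD-9-CM codes, the body mass index cutoff of 30 kg/m², and the names `PatientRecord` and `obesity_evidence` are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Illustrative assumptions only; a real definition would need validation.
OBESITY_DX_CODES = frozenset({"278.00", "278.01"})  # example ICD-9-CM codes
BMI_THRESHOLD = 30.0  # kg/m^2, a commonly used adult cutoff


@dataclass
class PatientRecord:
    patient_id: str
    diagnosis_codes: Set[str] = field(default_factory=set)
    height_m: Optional[float] = None
    weight_kg: Optional[float] = None


def body_mass_index(record: PatientRecord) -> Optional[float]:
    """Return BMI, or None when height or weight is missing or zero."""
    if not record.height_m or not record.weight_kg:
        return None
    return record.weight_kg / record.height_m ** 2


def obesity_evidence(record: PatientRecord) -> List[str]:
    """List which data sources support obesity for this patient."""
    evidence = []
    if record.diagnosis_codes & OBESITY_DX_CODES:
        evidence.append("diagnosis_code")
    bmi = body_mass_index(record)
    if bmi is not None and bmi >= BMI_THRESHOLD:
        evidence.append("bmi")
    return evidence
```

Returning the sources of evidence, instead of a single yes/no flag, makes it easier to see how many patients would be missed if only diagnosis codes were used.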

Before any phenotype development begins, researchers should search for existing phenotype definitions and consider their performance in validation testing. They should then assess the candidate phenotype definitions for feasibility in a particular setting (e.g., do my available domains match the authoritative source phenotype definition?). If a suitable phenotype definition cannot be found from authoritative sources, then one must be developed and validated. Regardless, once one or more candidate phenotype definitions are identified, they must be validated against a gold standard in clinical populations, as shown in the figure below.

Phenotype evaluation process. AHRQ, Agency for Healthcare Research and Quality; CMS, Centers for Medicare & Medicaid Services. Adapted with permission from Shelley Rusincovitch, Center for Predictive Medicine, Duke Clinical Research Institute.
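The feasibility step in this process, checking whether the data domains available at a site match what a candidate phenotype definition requires, can be sketched as a set comparison. The function name `screen_candidates` and the domain labels in the usage note are hypothetical and serve only as an illustration.

```python
from typing import Dict, Set


def screen_candidates(
    candidates: Dict[str, Set[str]], available_domains: Set[str]
) -> Dict[str, Set[str]]:
    """Map each candidate definition to the data domains the site lacks.

    An empty set means every domain the definition needs is available.
    """
    return {
        name: set(required) - set(available_domains)
        for name, required in candidates.items()
    }
```

For instance, a site with diagnoses and medication orders but no medication administration data would see an empty gap for a definition that needs only `{"diagnoses", "medication_orders"}` and a gap of `{"medication_administration"}` for one that also needs administration records. A definition that passes this screen still has to be validated against a gold standard.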

Implementation of Phenotype Definitions

Phenotype definitions must be curated and maintained over time, as diagnosis and procedure codes can change. For example, the majority of existing phenotype definitions developed to date include ICD-9-CM diagnosis and/or procedure codes. In the near future, health systems and providers will be implementing ICD-10-CM, necessitating updates to existing phenotype definitions that utilize these codes.
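One way to support this kind of curation is to store each definition's codes by coding system, so that definitions lacking codes for a newly required system can be listed. The sketch below assumes such a storage layout; the function name `definitions_needing_update` and the definition names in the usage note are hypothetical.

```python
from typing import Dict, List, Set


def definitions_needing_update(
    definitions: Dict[str, Dict[str, Set[str]]],
    required_system: str = "ICD-10-CM",
) -> List[str]:
    """Return names of definitions with no codes for the required system.

    Assumed layout: definition name -> coding system -> set of codes.
    """
    return sorted(
        name
        for name, codes_by_system in definitions.items()
        if not codes_by_system.get(required_system)
    )
```

For example, a definition stored as `{"ICD-9-CM": {"250.00"}}` would be flagged, whereas one that also carries `{"ICD-10-CM": {"E11.9"}}` would not. The check only reports that codes are missing; choosing the replacement codes remains a curation task that calls for clinical review and revalidation.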

Recommended Phenotype Definitions

The Collaboratory Phenotypes, Data Standards, and Data Quality Core is developing recommendations for the use of data from EHRs and/or ancillary sources. These recommendations apply to investigators planning to leverage data originally collected in the course of healthcare delivery for the secondary purposes of observational and interventional studies, prospective recruitment into clinical trials, health services research, public health surveillance, and comparative effectiveness research.

The developers of EHR and ancillary systems will ultimately determine the data collected in their own systems. Our effort is intended to share knowledge about conditions estimated to be important to a broad set of researchers. We will also share recommendations related to conditions studied by researchers in the Collaboratory.

The material presented here has not been fully vetted or endorsed by the NIH, the Collaboratory Steering Committee, or all Collaboratory members. The information presented is continually evaluated and updated as new use cases, phenotype definitions, and phenotype validation results become known.


Common Conditions

What opportunities does the Collaboratory have to contribute?

Researchers can define best practices in data collection and use. The Collaboratory represents a high-visibility effort that is ideally positioned to build (and endorse) a case for standards. Our members can be a conduit to healthcare organizations. The Collaboratory is uniquely focused on using data from EHRs, in contrast to de novo data collection standards that other research networks have focused on in the past.

Phenotypes, the Collaboratory, and PCORnet

What is the relationship between the PCORnet Common Data Model and phenotype definitions?

PCORnet, the National Patient-Centered Clinical Research Network, is a “network of networks” that brings together both Clinical Data Research Networks (CDRNs) and Patient-Powered Research Networks (PPRNs) to conduct large-scale, patient-centered pragmatic clinical trials. Creating a robust analytic dataset for use across PCORnet is the focus for development of the PCORnet Common Data Model (CDM), a document that will specify data elements required for all network participants [19]. Based on principles of PCORnet and readiness of various CDRNs, the CDM needs to encompass data elements that most providers already collect and can share, query, and report easily.

The CDM for PCORnet is based on the Mini-Sentinel Common Data Model. The PCORnet model will extend the Mini-Sentinel model to include new elements derived from EHRs, including patient-reported outcomes and medication orders.

The CDM is organized into broad data domains such as encounters, diagnoses, and procedures. Disease-specific research requires detailed definitions to define conditions, comorbidities, patient outcomes, and patient characteristics of interest. Most organizations use a standard coding system, but these data may be organized into different data models. The source systems may also have slightly different data-generating activities in certain contexts. A phenotype definition provides a consistent approach to querying data for disease-specific research.

In short, the CDM is the “building block data” (e.g., encounters, diagnoses, and procedures). The phenotype definitions recommended by the Collaboratory are the details (i.e., logic) of how to apply criteria across one or more of the CDM data domains to identify patients with certain conditions consistently and reliably. The Collaboratory is developing definitions in selected areas and is also defining standards, guidelines, and tools for the specification, representation, validation, and sharing of these definitions. These activities will ultimately support the informed, appropriate, and consistent use of these phenotype definitions, thereby promulgating and promoting “standard” data collection and reporting in pragmatic clinical trials (and likely other contexts, such as quality improvement, population and safety surveillance, and observational research). The end result will be high-quality data that can be used and shared to support learning healthcare.

The phenotype definitions developed in the Collaboratory will likely be of value to PCORnet researchers, and it is possible that many of these phenotype definitions will someday be required in PCORnet or PCORI-funded studies. Therefore, the phenotype definitions developed by the Collaboratory will leverage and reference the PCORnet CDM variables, and the specifications for these phenotype definitions will provide evidence, access to information, and guidance to make it easier for researchers and health data operators to identify and use standard phenotype definitions.

This preliminary work and vision of the Collaboratory might also inspire future efforts aimed at developing strategies for standardizing other types of clinical and patient-reported data as they are collected.

The phenotype definitions for various conditions will use the PCORnet CDM data elements, but they will identify specific subsets of codes for identifying specific conditions (e.g., a list of approximately 30 ICD-9-CM codes for type 2 diabetes mellitus, the use of diagnosis plus medication codes). Ideally, the phenotype definitions recommended by the Collaboratory will be promoted and/or required in PCORnet research activities.
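The idea of applying logic across data domains can be illustrated with a small sketch that requires both a diagnosis code and a medication for type 2 diabetes mellitus. The two codes and the single medication below are an illustrative subset chosen for this example, not the Collaboratory's recommended list, and the (patient, value) row layout is a hypothetical simplification and not the PCORnet CDM table structure.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Illustrative subsets only; not a recommended or validated code list.
T2DM_DX_CODES = frozenset({"250.00", "250.02"})  # example ICD-9-CM codes
T2DM_MEDICATIONS = frozenset({"metformin"})  # example medication name

Row = Tuple[str, str]  # (patient_id, code or medication name)


def patients_with(rows: Iterable[Row], values: FrozenSet[str]) -> Set[str]:
    """Return patients having at least one row whose value is in `values`."""
    return {patient_id for patient_id, value in rows if value in values}


def t2dm_cohort(
    diagnosis_rows: Iterable[Row], medication_rows: Iterable[Row]
) -> Set[str]:
    """Patients with a qualifying diagnosis AND a qualifying medication."""
    with_dx = patients_with(diagnosis_rows, T2DM_DX_CODES)
    with_med = patients_with(medication_rows, T2DM_MEDICATIONS)
    return with_dx & with_med
```

Because the code lists are kept separate from the combining logic, the same logic could be applied at different sites, provided each site maps its data to the same domains.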

In time, PCORnet might further standardize the data collected or reported by CDRNs, and the Collaboratory efforts can provide strategies and priorities for standardizing important data elements and collecting data (i.e., use of coding systems) prospectively. The data standardization efforts of PCORnet and the Collaboratory should be synergistic and complementary.


The Phenotypes, Data Standards, and Data Quality Core of the NIH Collaboratory has influenced much of this content through monthly meetings throughout 2013-2014. Members of the Core include Monique Anderson, Nick Anderson, Alan Bauck, Denise Cifelli, Lesley Curtis, John Dickerson, Beverly Green, W. Ed Hammond, Chris Helker, Michael Kahn, Cindy Kluchar, Reesa Laws, Melissa Leventhal, Rosemary Madigan, Renee Pridgen, Jon Puro, Rachel Richesson, Jennifer Robinson, Shelley Rusincovitch, Jerry Sheehan, Greg Simon, Michelle Smerek, Kari Stephens, and Meredith Nahm Zozus.

Editorial support was provided by Gina Uhlenbrauck and Jonathan McCall of Duke Clinical Research Institute. The phenotyping methods figure was developed by Shelley Rusincovitch.

We are also grateful to the Duke Center for Predictive Medicine for development and clarification of the scientific validity and evaluation of phenotype definitions. Members include Maria Grau-Sepulveda, Benjamin Neely, Charlotte Nelson, Michael Pencina, Paramita Saha Chaudhuri, and Anne Wolfley.


1. Richesson RL, Hammond WE, Nahm M, et al. Electronic health records based phenotyping in next-generation clinical trials: a perspective from the NIH Health Care Systems Collaboratory. J Am Med Inform Assoc 2013;20:e226-e231. PMID: 23956018. doi: 10.1136/amiajnl-2013-001926.

2. Hsia DC, Krushat WM, Fagan AB, et al. Accuracy of diagnostic coding for Medicare patients under the prospective-payment system. N Engl J Med 1988;318:352-355. PMID: 3123929. doi: 10.1056/NEJM198802113180604.

3. Ludvigsson JF, Pathak J, Murphy S, et al. Use of computerized algorithm to identify individuals in need of testing for celiac disease. J Am Med Inform Assoc 2013;20:e306-e310. PMID: 23956016. doi: 10.1136/amiajnl-2013-001924.

4. PheKB. Collaborative Groups. Available at: http://phenotype.mc.vanderbilt.edu/groups. Accessed June 13, 2014.

5. U.S. Department of Health and Human Services. Health Information Technology: Standards, Implementation Specifications, and Certification Criteria for Electronic Health Record Technology, 2014 Edition; Revisions to the Permanent Certification Program for Health Information Technology. In: Federal Register. 2012. 54163-54292. Available at: https://federalregister.gov/a/2012-20982. Accessed May 19, 2014.

6. National Library of Medicine, National Institutes of Health. Glossary. Common Data Element Resource Portal. Available at: http://www.nlm.nih.gov/cde/glossary.html#cdedefinition. Accessed May 20, 2014.

7. ISO/IEC JTC1 SC32 WG2. ISO/IEC 11179 Information Technology — Metadata Registries. Available at: http://metadata-standards.org/11179/#A1. Accessed June 13, 2014.

8. Burnette JL. Operationalization. In: Baumeister RF, Vohs KD, eds. Encyclopedia of Social Psychology. Thousand Oaks, CA: SAGE Publications; 2007. 636-637. Available at: http://knowledge.sagepub.com/view/socialpsychology/n379.xml. Accessed May 19, 2014.

9. Newton KM, Peissig PL, Kho AN, et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inform Assoc 2013;20:e147-e154. PMID: 23531748. doi: 10.1136/amiajnl-2012-000896.

10. Peissig PL, Rasmussen LV, Berg RL, et al. Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J Am Med Inform Assoc 2012;19:225-234. PMID: 22319176. doi: 10.1136/amiajnl-2011-000456.

11. Rosenman M, He J, Martin J, et al. Database queries for hospitalizations for acute congestive heart failure: flexible methods and validation based on set theory. J Am Med Inform Assoc 2014;21:345-352. PMID: 24113802. doi: 10.1136/amiajnl-2013-001942.

12. Reliability. Merriam-Webster Dictionary. Available at: http://www.merriam-webster.com/dictionary/reliability. Accessed May 20, 2014.

13. Richesson RL, Rusincovitch SA, Wixted D, et al. A comparison of phenotype definitions for diabetes mellitus. J Am Med Inform Assoc 2013;20:e319-e326. PMID: 24026307. doi: 10.1136/amiajnl-2013-001952.

14. Bayley KB, Belnap T, Savitz L, et al. Challenges in using electronic health record data for CER: experience of 4 learning organizations and solutions applied. Med Care 2013;51:S80-S86. PMID: 23774512. doi: 10.1097/MLR.0b013e31829b1d48.

15. National Research Council. The Prevention and Treatment of Missing Data in Clinical Trials. Washington, DC: National Academies Press; 2010. Available at: http://www.nap.edu/catalog.php?record_id=12955. Accessed May 20, 2014.

16. Shih W. Problems in dealing with missing data and informative censoring in clinical trials. Curr Control Trials Cardiovasc Med 2002;3:4. PMID: 11985778.

17. Nadkarni PM, Ohno-Machado L, Chapman WW. Natural language processing: an introduction. J Am Med Inform Assoc 2011;18:544-551. PMID: 21846786. doi: 10.1136/amiajnl-2011-000464.

18. Safran C, Bloomrosen M, Hammond WE, et al. Toward a national framework for the secondary use of health data: an American Medical Informatics Association white paper. J Am Med Inform Assoc 2007;14:1-9. PMID: 17077452. doi: 10.1197/jamia.M2273.

19. Institute of Medicine. Workshop in Brief: Data Harmonization for Patient-Centered Clinical Research— A Workshop. Available at: http://www.iom.edu/Activities/Quality/VSRT/~/media/Files/Activity%20Files/Quality/VSRT/Data-Harmonization/VSRT-WIB-DataHarmonization.pdf. Accessed May 21, 2014.

Topic chapter originally published on June 27, 2014.