Section 1. Electronic Health Records#
Electronic health records (EHRs) are commonly used in health analytics. EHR data have advantages over sources such as claims data, in particular the rich detail of clinical care, including laboratory results and unstructured data such as provider notes, radiology reports, operative notes, and other free-text documents. EHR data also have notable disadvantages, such as smaller scale relative to claims data and a lack of comprehensiveness when a patient is seen at multiple health systems. Nonetheless, EHRs form an important resource for public health researchers, one amplified by evolving standards for EHR interoperability [Oemig and Snelick, 2017] such as Fast Healthcare Interoperability Resources (FHIR) and the United States Core Data for Interoperability (USCDI).
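To make the interoperability standards above concrete, the sketch below shows how a single FHIR R4 Patient resource might be retrieved over HTTP using Python's requests library. The server URL and patient ID are hypothetical placeholders, and real servers typically require authentication.

```python
import requests

# Hypothetical FHIR server base URL and patient ID -- replace with a real
# endpoint and identifier; most servers also require authentication headers.
FHIR_BASE = "https://fhir.example-hospital.org/R4"
PATIENT_ID = "12345"

# Request a single Patient resource in JSON form (FHIR R4).
resp = requests.get(
    f"{FHIR_BASE}/Patient/{PATIENT_ID}",
    headers={"Accept": "application/fhir+json"},
    timeout=30,
)
resp.raise_for_status()
patient = resp.json()

# Pull a few demographic fields defined by the FHIR Patient resource.
print(patient.get("gender"))
print(patient.get("birthDate"))
```

Authorization flows (e.g., SMART on FHIR) and bulk export are beyond this sketch but are typically required in practice.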
Health Equity and Electronic Health Records
What is an Electronic Health Record?#
An Electronic Health Record (EHR) is a digital version of a patient’s medical record that is maintained by a given provider or provider organization. EHR content varies by provider type and setting, but below are some key categories of data that are stored in EHRs.
- Demographics
- Diagnoses
- Medications
- Allergies
- Lab results
- Physical measurements
- Clinical notes
- Procedure reports
- Radiology reports
- Pathology reports
- Immunizations
Many EHRs also include computerized provider order entry (CPOE) systems, which add data such as the following (a simplified record structure covering several of these categories is sketched after this list):
- Prescriptions for medications
- Orders for lab testing
- Referrals to radiology or other procedures
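As a rough illustration of how these categories fit together, the sketch below models a single patient record with Python dataclasses. The field names and codes are illustrative only; production EHR analytics typically rely on formal models such as the OMOP common data model or FHIR resources.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative (non-standard) containers for several of the EHR data
# categories listed above; real systems use formal models such as OMOP or FHIR.

@dataclass
class LabResult:
    loinc_code: str      # e.g., "4548-4" for hemoglobin A1c
    value: float
    unit: str

@dataclass
class PatientRecord:
    patient_id: str
    birth_year: int
    sex: str
    diagnoses: List[str] = field(default_factory=list)      # ICD-10-CM codes
    medications: List[str] = field(default_factory=list)    # RxNorm codes
    allergies: List[str] = field(default_factory=list)
    lab_results: List[LabResult] = field(default_factory=list)
    immunizations: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)           # free-text documents

# Build a toy record for one patient.
record = PatientRecord(patient_id="p-001", birth_year=1960, sex="F")
record.diagnoses.append("E11.9")                 # type 2 diabetes, no complications
record.lab_results.append(LabResult("4548-4", 7.2, "%"))
```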
How are EHR Data Generated?#
EHR data are generated when patients have encounters with the healthcare system, either in person or virtually. An encounter may be as simple as a routine laboratory draw or as complex as a hospitalization spanning weeks of care, hundreds of clinical notes, and thousands of data elements. EHR data can also be generated from encounters that occur via telephone calls, messaging portals, or other virtual visits. Note that while an encounter may formally end when a patient is discharged (or “checks out” in the outpatient context), providers commonly follow up on test results and document findings outside of the actual visit, adding further elements to the EHR.
What is Structured vs Unstructured EHR Data?#
Electronic health records may contain a wide variety of data types, but a key distinction is between “structured” and “unstructured” data.
Structured data refers to data stored in coded form using standardized terminologies (e.g., ICD-10, LOINC), such as diagnoses and lab results. There are over 150 medical terminologies listed in the National Library of Medicine’s Unified Medical Language System (UMLS). Commonly encountered code systems in the United States include ICD-10-CM for diagnoses, CPT and HCPCS for procedures, LOINC for laboratory tests, RxNorm for medications, and SNOMED CT for general clinical concepts.
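As an illustration of how structured, coded data can be queried, the sketch below flags patients with a diabetes diagnosis using a small, hypothetical set of ICD-10-CM codes in pandas; a real phenotype definition would use a validated, much larger value set.

```python
import pandas as pd

# A tiny, illustrative value set: ICD-10-CM codes treated here as "diabetes".
# Real phenotype definitions are far larger and should be validated.
DIABETES_CODES = {"E10.9", "E11.9", "E11.65"}

# Structured diagnosis rows as they might be extracted from an EHR (synthetic).
diagnoses = pd.DataFrame({
    "patient_id": ["p-001", "p-001", "p-002", "p-003"],
    "icd10_code": ["E11.9", "I10", "E10.9", "I10"],
})

# Flag patients with at least one diabetes code.
diagnoses["is_diabetes"] = diagnoses["icd10_code"].isin(DIABETES_CODES)
diabetes_patients = diagnoses.loc[diagnoses["is_diabetes"], "patient_id"].unique()
print(diabetes_patients)   # ['p-001' 'p-002']
```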
Unstructured data refers to free-text documents, medical images, videos, and other media that are present in the EHR but not represented using standard terminologies. Note that metadata for this content is often standardized (e.g., a LOINC code may specify the note type), but the content itself is unstructured and thus requires a different analytic approach.
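By contrast, a minimal sketch of working with unstructured note text is shown below, using a simple keyword search over synthetic notes. It deliberately ignores negation and context, which is why production analyses of free text rely on clinical NLP tooling rather than raw pattern matching.

```python
import re
import pandas as pd

# Free-text notes as they might appear in an EHR extract (synthetic examples).
notes = pd.DataFrame({
    "patient_id": ["p-001", "p-002"],
    "note_text": [
        "Patient reports a family history of colon cancer in her father.",
        "No family history of malignancy. Denies tobacco use.",
    ],
})

# A crude keyword search; negation ("No family history ...") is deliberately
# NOT handled here, which is exactly why free text needs NLP tooling.
pattern = re.compile(r"family history of colon cancer", re.IGNORECASE)
notes["fh_colon_cancer_mention"] = notes["note_text"].str.contains(pattern)
print(notes[["patient_id", "fh_colon_cancer_mention"]])
```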
Health Equity Considerations#
When conducting an analysis using EHR data, researchers may introduce many potential sources of bias that specifically affect health equity. Below we review these common types of bias, along with health equity examples and recommended best practices.
| Challenge | Description | Health Equity Example | Recommended Best Practice |
|---|---|---|---|
| Selection Bias | Selection bias refers to the fact that the individuals whose information is included in EHRs may not be representative of the broader population, because they are more likely to have certain characteristics or to have sought out healthcare. For example, older people with chronic health conditions or higher socioeconomic status may be more likely to have their information recorded in EHRs. This can lead to overestimation or underestimation of certain health outcomes or trends. | A researcher is studying the effect of lifestyle changes versus medication for the treatment of elevated blood pressure. EHR data may be skewed toward those more readily able to exercise (due to environment or financial flexibility) or to eat lower-salt foods (due to availability and ability to afford healthier foods). This socioeconomic and environmental selection bias may skew the results toward showing a greater impact of lifestyle changes on hypertension compared with a more representative dataset. | Compare the EHR population against the general population, e.g., via American Community Survey (ACS) Census data, to ascertain representativeness. At minimum incorporate age, gender, and race; ideally add ethnicity, education, income, insurance status, urban/rural status, and other demographic variables as present in the EHR. Techniques to address potential selection bias in observational studies, such as propensity score matching, should also be considered (a minimal sketch follows this table). |
| Measurement Bias | Measurement bias can occur when the data collected in EHRs are not accurately or consistently recorded. For example, if healthcare providers use different methods or standards for measuring and recording patient information, this can introduce errors or inconsistencies in the data. Such variations can make it difficult to accurately generalize information across different data sites. | A researcher is exploring the impact of education level on the control of diabetes. Healthcare providers may randomly or non-randomly fail to inquire about the education level of certain groups of patients; in non-random cases, inquiry may vary based on, for example, age, race, or primary language. This failure to assess education level consistently may be exacerbated by variations in how questions about education are asked and documented. As a result, researchers may have education level data available only for a particular subset of patients, misrepresenting the underlying evidence. | Estimate the distribution of patients for whom the EHR has the particular data points of interest and compare against the overall EHR population, at minimum incorporating age, gender, and race. As a further step, compare the population for whom the data are available with the broader ACS Census data, as described above. |
| Reporting Bias | Reporting bias refers to the idea that the information recorded in EHRs may be incomplete or inaccurate due to intentional or unintentional failure to report information by patients or by providers. For example, patients may not disclose all of their symptoms or medical history due to concerns that they will be given a particular treatment or procedure they do not wish to receive. Similarly, a provider may not document information elicited during the patient’s history because they do not deem it relevant or important to the patient’s care. In either case, information is missing in a likely non-random fashion. | A researcher wishes to investigate the benefits of colon cancer screening in patients with and without a family history of colon cancer. While family history is routinely recorded by the health system, some patients may be hesitant to mention a family history due to fear of undergoing a colonoscopy; extensive research has shown higher rates of colonoscopy-specific fears in Black and Hispanic adults [Miller et al., 2015]. This lack of reporting may lead to inaccurate data regarding patients with a family history of the disease, ultimately skewing the findings of the intended research. | If other data sources are available, investigate underreporting of key variables; variations within groups can also be examined for differences in reporting rates. Reporting bias may nonetheless be difficult to detect, so the limitation should be highlighted when reporting the results of an analysis. |
| Coding Bias | When documenting a healthcare visit in the EHR, providers select diagnosis codes for billing. Frequently used codes, or codes that most effectively justify particular billing, are often part of a “favorites” or similar list. This may lead to a certain homogenization of diagnosis codes even when a more nuanced code is available in the code system. | A researcher wishes to investigate Maturity-Onset Diabetes of the Young (MODY) in urban and rural populations using a primary care-based electronic health record. The researcher finds that very few patients, even young patients with diabetes, are coded with a specific diagnosis code for MODY; rather, almost all patients have been diagnosed with a more generic type 2 diabetes code. The researcher is thus limited in the ability to use diagnosis codes to identify this condition. | When studying particular conditions or outcomes in EHR data, it is important to conduct sensitivity testing of different phenotype definitions, as they may perform quite differently based on the nature of the condition. For example, performing a manual review of multiple different definitions of diabetes incorporating laboratory values, patient weight, medications, etc., may be necessary to arrive at a reliable definition for the cohort of interest. |
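The propensity score matching mentioned under Selection Bias above can be sketched as follows. This is a minimal, illustrative example on synthetic data using pandas and scikit-learn; the covariates are placeholders, and a real analysis would add balance diagnostics, a matching caliper, and sensitivity checks.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Synthetic cohort: one row per patient with a binary exposure flag and
# covariates; the column names here are hypothetical.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "exposed": rng.integers(0, 2, n),
    "age": rng.normal(60, 12, n),
    "female": rng.integers(0, 2, n),
    "rural": rng.integers(0, 2, n),
})
covariates = ["age", "female", "rural"]

# 1. Estimate propensity scores: P(exposed = 1 | covariates).
ps_model = LogisticRegression(max_iter=1000).fit(df[covariates], df["exposed"])
df["pscore"] = ps_model.predict_proba(df[covariates])[:, 1]

# 2. 1:1 nearest-neighbor matching on the propensity score (no caliper).
treated = df[df["exposed"] == 1]
control = df[df["exposed"] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
_, idx = nn.kneighbors(treated[["pscore"]])
matched_control = control.iloc[idx.ravel()]

# 3. The matched sample (treated + matched controls) is then used for the
#    outcome analysis; covariate balance should be checked before that step.
matched = pd.concat([treated, matched_control])
print(matched.groupby("exposed")[covariates].mean())
```

Nearest-neighbor matching is only one option; weighting or stratification on the propensity score are common alternatives.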
Mitigating Bias in Electronic Health Record Data#
The approach to mitigating bias in EHR data is dependent on the analytic objective and methods of the study (see Unit 3, Unit 4, and Unit 5 for additional details). Broadly speaking, however, the primary method to reduce bias in EHR data is to ensure a representative sample that includes a diverse patient population or at minimum a population that aligns with the group being studied. Ways to examine a population to ensure representativeness include:
- Characterize population demographics and compare with ACS data or other baselines as appropriate (see the sketch below this list).
- Characterize exposure and outcome patterns by subgroup.
- Characterize data density patterns (e.g., number of conditions, medications, procedures, etc. recorded) by subgroup.
- Perform sensitivity testing to determine the impact of different phenotype definitions on cohort composition.
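A minimal sketch of the first and third checks in this list is shown below, using a synthetic cohort and placeholder ACS proportions; in practice the baseline proportions would come from published American Community Survey tables for the relevant catchment area.

```python
import pandas as pd

# Synthetic EHR cohort with one row per patient; column names are hypothetical.
cohort = pd.DataFrame({
    "patient_id": range(6),
    "age_group": ["18-44", "45-64", "65+", "45-64", "65+", "18-44"],
    "n_conditions": [2, 5, 9, 4, 11, 1],
    "n_medications": [1, 6, 12, 3, 14, 0],
})

# Placeholder ACS baseline proportions for the catchment area; in practice
# these come from published American Community Survey tables.
acs_age_distribution = pd.Series({"18-44": 0.45, "45-64": 0.33, "65+": 0.22})

# 1. Characterize demographics and compare with the ACS baseline.
ehr_age_distribution = cohort["age_group"].value_counts(normalize=True)
comparison = pd.DataFrame({
    "ehr_proportion": ehr_age_distribution,
    "acs_proportion": acs_age_distribution,
})
comparison["difference"] = comparison["ehr_proportion"] - comparison["acs_proportion"]
print(comparison)

# 2. Characterize data density by subgroup (median counts of recorded
#    conditions and medications by age group).
print(cohort.groupby("age_group")[["n_conditions", "n_medications"]].median())
```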
While significant differences across groups may be appropriate to the region of study or condition of interest, these differences should be noted if not well-established. Analysis-specific mitigation strategies are provided in subsequent sections.
Case Study Example#
This case study is for illustrative purposes and does not represent a specific study from the literature.
Scenario: SR is a state public health researcher interested in improving medication adherence for chronic diseases, in particular diabetes and hypertension. Through a partnership with a local academic medical center, she gains access to the center’s electronic health record system containing de-identified data on over 2.3M patients.
Specific Research Question: What factors contribute to lower levels of medication adherence for oral hypoglycemic agents in people with diabetes?
Analytic Method: Multivariate analysis of demographic and clinical factors, with the continuous dependent variable being adherence rate, defined as the percentage of days with gaps in medication renewal prescriptions over two years. To address potential selection bias, SR used propensity score matching [Franchetti, 2022] incorporating age, gender, race, ethnicity, 3-digit ZIP code, and clinical factors including conditions and procedures.
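One way the gap-based adherence outcome might be computed from prescription records is sketched below. The table layout, the handling of the two-year window, and the decision to count only gaps between consecutive prescriptions are simplifying assumptions, not the study’s actual specification.

```python
import pandas as pd

# Prescription (not dispensing) records for one patient's oral hypoglycemic
# agent; columns and values are synthetic.
rx = pd.DataFrame({
    "patient_id": ["p-001"] * 3,
    "start_date": pd.to_datetime(["2021-01-01", "2021-04-15", "2021-08-01"]),
    "days_supply": [90, 90, 90],
})

observation_days = 730  # two-year observation window

rx = rx.sort_values("start_date")
rx["end_date"] = rx["start_date"] + pd.to_timedelta(rx["days_supply"], unit="D")

# Gap days = days between the end of one prescription and the start of the
# next (overlaps are clipped to zero). Days after the final prescription are
# ignored here; a real analysis would need an explicit rule for that tail.
gaps = (rx["start_date"].shift(-1) - rx["end_date"]).dt.days.clip(lower=0)
gap_days = gaps.dropna().sum()

percent_gap_days = 100 * gap_days / observation_days
print(f"Percent gap days: {percent_gap_days:.1f}%")
```

A related, commonly used adherence measure is the proportion of days covered (PDC), which counts covered rather than gap days.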
Results: SR found differences in adherence between patients living in urban and rural areas, with rural patients showing a lower rate of adherence to oral hypoglycemic agents. SR conducted a subgroup analysis stratifying by age, gender, race, and ethnicity and confirmed that the finding persisted.
Health Equity Considerations: SR’s study took selection bias into account through propensity matching and sensitivity analysis. An important consideration that was not specifically addressed, however, was the variability in available medication data between the urban and rural populations. EHR data typically do not contain dispensing data, meaning the study relies on EHR prescription data; prescriptions reflect provider intent rather than actual patient use of the drug. Further characterizing the patterns of medication data for all patients by urban/rural subgroup would help support the validity of the results.
Considerations for Project Planning
Resources#
All of Us Data Browser - Useful example of relative rates of conditions across a disparate EHR dataset
Tools for EHR-based Analytics - A set of R tools for EHR-based analyses using the OMOP common data model