Section 2. Claims Data

Medical claims are commonly used in health analytics and have several significant advantages over Electronic Health Record (EHR) sources:

Claims often capture data across diverse healthcare systems and pharmacies, and thus may offer a more comprehensive picture of care received at multiple sites than any individual EHR
Claims contain information, albeit limited largely to demographics, on individuals who did not receive any medical care, thus providing more representative denominators than EHRs
Claims datasets tend to be very large, with aggregate data sources offering tens of millions or even 100+ million records, far exceeding what is available through a given EHR

The disadvantages of claims data are also notable:

They lack depth of clinical information, typically providing only procedure codes, diagnosis codes, and pharmacy dispensing
They do not include data such as notes, radiology reports, or laboratory results
The population for a given claims dataset will often be constrained to certain strata of employment status, age, or income, depending on the insurer

Despite these limitations, claims data are widely used and thus it is essential that researchers have a good understanding of how to identify and mitigate biases in claims-based analyses.

Health Equity and Claims Data

Medical claims may introduce a wide range of biases into analytic results, including selection bias, reporting bias, coding bias, and attrition bias.
Claims data are also typically missing key social determinants of health, with demographic information often limited to age, gender, ZIP code, and sometimes race.
However, the volume of data in claims is usually quite large, allowing the subsetting of populations or propensity matching while still retaining a sufficient sample size to produce precise estimates.

What Are Claims Data?

Health insurance claims data are a collection of information on medical services provided to patients, and the associated costs, that is submitted to health insurers for reimbursement.

Claims content and format may vary across insurers, but generally include the following data types:

Patient information
Provider information
Encounter / Service information
Procedure information
Diagnosis information
Cost information
Medication dispensing information

Below we provide some additional information on each category.

Patient Information is typically found in a member or member eligibility file. In the source file, this will include protected health information such as the patient name, address, and date of birth. In the more common de-identified or limited data sets used for analytics, this will include age or year of birth, 3-digit ZIP code or state, as well as gender and potentially race and ethnicity. Additionally, the member file will include the eligibility dates for the individual, including the dates of enrollment and termination (if applicable). This information is extremely helpful in determining whether the absence of claims for a patient during a given period reflects no utilization of care or simply a lack of coverage during that period.
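
The distinction between no utilization and no coverage can be sketched in code. The following Python example is a minimal illustration, not a standard implementation: the function names and the convention that an open-ended enrollment has no termination date (`None`) are assumptions made for this sketch, and real member eligibility files vary by insurer.

```python
from datetime import date
from typing import Optional


def covers_period(enroll_start: date, enroll_end: Optional[date],
                  period_start: date, period_end: date) -> bool:
    """Return True if one enrollment span covers the whole study period.

    enroll_end is None when the member has no termination date
    (an assumed convention for this sketch).
    """
    starts_in_time = enroll_start <= period_start
    ends_late_enough = enroll_end is None or enroll_end >= period_end
    return starts_in_time and ends_late_enough


def interpret_absence(claim_count: int, enrolled_whole_period: bool) -> str:
    """Label what a member's claim count during the period can tell us."""
    if claim_count > 0:
        return "utilization observed"
    if enrolled_whole_period:
        return "enrolled, no utilization"
    return "not fully enrolled, utilization unknown"


# Hypothetical member: enrolled since 2018 with no termination date,
# and no claims during calendar year 2019.
enrolled = covers_period(date(2018, 1, 1), None,
                         date(2019, 1, 1), date(2019, 12, 31))
print(interpret_absence(0, enrolled))  # prints "enrolled, no utilization"
```

Only members in the "enrolled, no utilization" group can safely be counted as having received no care; the last group should usually be excluded or handled separately.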

Provider Information includes details such as name, practice location address, specialty information, and national provider identifier (NPI) number for the healthcare provider delivering services.

Encounter / Service Information identifies the setting in which the service was delivered, such as inpatient, outpatient, or laboratory. Admission and discharge dates are also provided for inpatient claims.

Procedure Information identifies the specific medical procedure provided to the patient, typically coded using the Current Procedural Terminology (CPT) coding system.

Diagnosis Information typically includes at least one, but often multiple, diagnosis codes associated with the procedure / visit. Most commonly these will be represented using the International Classification of Diseases (ICD) coding system. For claims with service dates on or after October 1, 2015, these codes should be ICD-10, while claims prior to October 1, 2015 may include ICD-9 codes.
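
Because the expected code system depends on the date of service, a simple consistency check can be sketched as follows. This is an illustrative Python example that assumes the U.S. transition date of October 1, 2015; the record fields `service_date` and `icd_version` are hypothetical names, and real claims layouts differ.

```python
from datetime import date
from typing import Dict, List

# U.S. claims switched from ICD-9 to ICD-10 for services on or after this date.
ICD10_START = date(2015, 10, 1)


def expected_icd_version(service_date: date) -> str:
    """Return the ICD version a claim is expected to use for its service date."""
    return "ICD-10" if service_date >= ICD10_START else "ICD-9"


def version_mismatches(claims: List[Dict]) -> List[Dict]:
    """Return claims whose recorded ICD version disagrees with the expectation."""
    return [
        claim for claim in claims
        if claim["icd_version"] != expected_icd_version(claim["service_date"])
    ]


# Hypothetical records for illustration only.
claims = [
    {"claim_id": "A", "service_date": date(2015, 9, 30), "icd_version": "ICD-9"},
    {"claim_id": "B", "service_date": date(2015, 10, 1), "icd_version": "ICD-10"},
    {"claim_id": "C", "service_date": date(2016, 2, 14), "icd_version": "ICD-9"},
]
flagged = version_mismatches(claims)  # only claim "C" is flagged
```

Studies that span the transition need to map both code systems to the same condition definitions, so flagged records are worth reviewing before analysis.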

Cost Information includes per-procedure charges, the total amount paid, and the amounts paid separately by the patient (deductible, co-pay, etc.) and by the insurer. Certain proprietary information, such as negotiated rates, may not be included in all claims datasets.

Medication Dispensing Information is usually included in a separate pharmacy claims file that includes details such as medication code (National Drug Code, or NDC), dosage, number of doses dispensed, date of dispensing, pharmacy information, prescriber information, and charges and payment information, including a breakdown of patient payment and insurer payment.

Claims data can also include many other fields that are relevant to the processing of claims but less relevant to analyses for public health or other research purposes. Note that depending on the recency of the dataset, claims data can be updated in subsequent versions as a result of the claims adjudication process.
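
Because adjudication can produce several versions of the same claim, an analysis should generally count each claim once. Here is a minimal Python sketch of keeping only the latest version, assuming hypothetical `claim_id` and `version` fields where a higher `version` means a later adjudication. Actual datasets may instead use adjustment indicators or processing dates, so the data dictionary should be checked first.

```python
from typing import Dict, List


def latest_versions(claim_records: List[Dict]) -> List[Dict]:
    """Keep one record per claim_id: the one with the highest version number."""
    latest: Dict[str, Dict] = {}
    for record in claim_records:
        claim_id = record["claim_id"]
        current = latest.get(claim_id)
        if current is None or record["version"] > current["version"]:
            latest[claim_id] = record
    return list(latest.values())


# Hypothetical records: claim "X1" was adjusted once after initial submission.
records = [
    {"claim_id": "X1", "version": 1, "paid_amount": 120.0},
    {"claim_id": "X1", "version": 2, "paid_amount": 95.0},
    {"claim_id": "Y7", "version": 1, "paid_amount": 40.0},
]
deduplicated = latest_versions(records)  # two records remain, one per claim
```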

Health Equity Considerations

When conducting an analysis using claims data, many potential sources of bias may be introduced that specifically affect health equity. Below we review these common types of bias as well as health equity examples and best practices.

Challenge	Challenge Description	Health Equity Example	Recommended Best Practice
Selection Bias	Selection bias refers to the fact that the individuals whose information is included in claims data may not be representative of the broader population. Common examples are claims data based on Medicare (skews older), Medicaid (skews toward lower socioeconomic status and sicker patients), and commercial sources (skews toward employed and healthier patients). Not accounting for the difference between the claims population and the general population can lead to an overestimation or underestimation of certain health outcomes or trends.	A researcher is studying the effect of steroid inhalers on reducing hospital admissions for asthma. A commercial claims dataset may include fewer sick patients and be less demographically diverse than the general population. A healthier baseline population may lead to inaccurate estimation of the benefits of inhalers in avoiding hospitalization.	Compare the claims population against the general population as reflected in the American Community Survey (ACS) to ascertain representativeness. Ideally, incorporate age, gender, race, ethnicity, and urban/rural status. (Practically, claims data may be inconsistent beyond age and gender, thus limiting the variables that can be incorporated into this assessment.) Depending on the findings, additional methods such as weighting may need to be applied, as covered in Unit 4.
Reporting Bias	Reporting bias refers to the idea that the information recorded in claims data may not be complete or accurate due to intentional or unintentional failure to report information by patients or by providers. For example, patients may not disclose all of their symptoms or medical history due to concerns they will be given a particular treatment or procedure they may not wish to receive. Similarly, a provider may not document information that is elicited during the patient’s history because they do not deem it relevant or important to the patient’s care. In either case, information is missing in a likely non-random fashion.	A researcher is using claims data to study obesity rates. Unlike EHR data, which typically include vital signs and measurements such as weight, a claims dataset does not have such quantitative information. As a result, the researcher is using coded diagnoses of obesity to assess prevalence. The diagnosis of obesity may often be billed only in morbidly obese patients or in patients for whom a test needs to be ordered with an acceptable code for billing. Additionally, for patients in whom an obesity-related condition is already present (e.g., diabetes), the provider may choose not to add the code for obesity as well. In such cases, the provider does not report the obesity code despite recognizing that the patient meets the technical criteria for obesity. The researcher will underestimate rates of obesity if using claims alone.	Comparing rates of obesity in the claims dataset with standard population-level surveys such as the Behavioral Risk Factor Surveillance System (BRFSS) can help ascertain how different reporting rates in the claims data are from baseline population levels. This must include, however, adjusting for age, gender, race, and other demographic factors if available, to ensure that the claims dataset population is weighted to the standard (e.g., BRFSS) population. Details on applying weighting methods are provided in Unit 4.
Coding Bias	Coding bias refers to errors in the coding or classification of medical data. One source of coding bias is benign variation in how providers represent a particular condition (e.g., Diabetes vs. Type 2 Diabetes). A more concerning cause of coding bias is the variation in how providers diagnose and treat particular conditions based on demographic factors. Such implicit biases may lead to a disconnect between an individual’s true medical condition and its representation in claims data. Finally, coding may be influenced by providers’ intentional or unintentional biases in coding the complexity of a visit. For example, documenting more medical conditions will enable a higher complexity (and thus higher reimbursement) encounter. Such coding styles may differ based on provider, institution, or region.	A researcher is using a commercial claims dataset to explore the prevalence of heart disease in male vs. female patients. Extensive literature has shown that female patients are consistently underdiagnosed and under-treated for heart disease compared with male patients presenting with similar symptoms. Thus, using diagnosis or procedural codes alone may not reveal the true prevalence of heart disease in this population. The researcher may need to broaden outcome definitions or compare the frequency of heart disease overall in the claims dataset with prior research in this area to assess baseline rates. At a minimum, the researcher will need to highlight this limitation in the analysis.	Look at the distribution of patients in the claims dataset that have the events (conditions, medications, procedures, etc.) of interest. How does this vary by age, gender, and race? Is the variation consistent with existing patterns in the literature? If not, this may represent a new finding or reflect a bias in the data, similar to the problem of measurement bias. Where feasible, performing calibration weighting or incorporating additional data sources may be helpful. At a minimum, calling out the bias as a potential source of error will make the consumer aware of this limitation.
Attrition Bias	Attrition bias refers to the loss of patients from the dataset in a non-random fashion. For example, in a commercial claims dataset, patients who are likely to change or lose their jobs may be more likely to move out of the dataset. Similarly, with a state-based Medicaid dataset, patients who are likely to move out of the state may have other non-random differences in health patterns compared with those who remain stably within the same state over a long period of time. Studies looking at outcomes over time may be affected by such attrition bias.	A researcher is using a state Medicaid dataset to look at rates of stroke among patients with diabetes in urban versus rural settings. Rural populations may be more likely to remain in the state over time compared with those living in an urban setting. Thus the researcher may miss more stroke outcomes among urban patients with diabetes due to attrition.	First, assess whether your analysis would likely be affected by attrition. If the analysis has no longitudinal component, attrition may not have a meaningful impact. If attrition is a potential concern, assess whether a patient would be likely to exit the claims dataset upon changing jobs, losing a job, or moving to a new locale. If yes, this should be noted as a potential source of bias. Additionally, look at patients whose membership file indicates exit from the dataset mid-year. Do their demographic or clinical characteristics differ from the population as a whole? These patterns may provide information regarding the non-random impact of attrition on the dataset.
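
One common way to carry out the comparison and weighting recommended in the table is post-stratification: each stratum in the claims data is weighted by the ratio of its share in a reference population to its share in the claims sample. The Python sketch below is a simplified illustration and is not necessarily the method presented in Unit 4. The function names are hypothetical, and the strata and proportions are invented for the example; they are not real ACS or BRFSS figures.

```python
from collections import Counter
from typing import Dict, Sequence


def poststratification_weights(sample_strata: Sequence[str],
                               reference_props: Dict[str, float]) -> Dict[str, float]:
    """Weight per stratum = reference proportion / sample proportion."""
    counts = Counter(sample_strata)
    total = len(sample_strata)
    return {
        stratum: reference_props[stratum] / (count / total)
        for stratum, count in counts.items()
    }


def weighted_prevalence(strata: Sequence[str],
                        has_condition: Sequence[bool],
                        weights: Dict[str, float]) -> float:
    """Prevalence of a condition after applying stratum weights."""
    numerator = sum(weights[s] for s, flag in zip(strata, has_condition) if flag)
    denominator = sum(weights[s] for s in strata)
    return numerator / denominator


# Invented example: the claims sample over-represents younger members.
strata = ["18-44"] * 6 + ["45-64"] * 3 + ["65+"] * 1
has_condition = [False] * 6 + [True, False, False] + [True]
reference = {"18-44": 0.5, "45-64": 0.3, "65+": 0.2}

weights = poststratification_weights(strata, reference)
crude = sum(has_condition) / len(has_condition)
adjusted = weighted_prevalence(strata, has_condition, weights)
```

In this toy example the crude prevalence is 2 of 10 members, while the weighted estimate is higher because the under-represented oldest stratum, where the condition is more common, counts for more. Weighting of this kind corrects only for the variables used to define the strata.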

Case Study Example

This case study is for illustrative purposes and does not represent a specific study from the literature.

Scenario: JS is a public health researcher who is interested in studying health disparities in cardiovascular disease (CVD) outcomes among Medicare beneficiaries. She wants to explore whether there are differences in CVD outcomes between Black and White Medicare beneficiaries and examine the factors contributing to these differences.

Specific Research Question: Are there racial disparities in CVD outcomes among Medicare beneficiaries, and if so, what are the contributing factors?

Data Source: Medicare 100% Standard Analytical Files (SAF) data as well as the Master Beneficiary Summary File with National Death Index Segment, from 2016 to 2019.

Analytic Method: Retrospective cohort design. Cohorts are defined as beneficiaries aged 65 or older who had a CVD-related hospitalization or outpatient visit during 2016-2019. CVD outcomes are defined using ICD-10 and CPT codes as a composite measure of all-cause mortality, recurrent CVD hospitalization, or myocardial infarction. Multivariable logistic regression models are used to estimate adjusted odds ratios and 95% confidence intervals for CVD outcomes by race, controlling for demographics, comorbidities, medications, and healthcare utilization.
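
To make the odds ratio and its confidence interval concrete, the Python sketch below computes an unadjusted odds ratio with a Wald 95% confidence interval from a 2x2 table. This is a deliberate simplification: the case study uses multivariable logistic regression to obtain adjusted estimates, which this sketch does not do. The function name is hypothetical and the counts are invented for illustration; they are not taken from the case study.

```python
import math
from typing import Tuple


def odds_ratio_with_ci(a: int, b: int, c: int, d: int,
                       z: float = 1.96) -> Tuple[float, float, float]:
    """Unadjusted odds ratio and Wald confidence interval from a 2x2 table.

    a: group 1 with the outcome      b: group 1 without the outcome
    c: group 2 with the outcome      d: group 2 without the outcome
    z: normal quantile for the interval (1.96 gives roughly 95%)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Invented counts for illustration only.
estimate, lower, upper = odds_ratio_with_ci(20, 80, 10, 90)
print(f"OR = {estimate:.2f}, 95% CI {lower:.2f}-{upper:.2f}")
```

An interval that excludes 1 indicates a difference in odds between the two groups at the chosen confidence level. An adjusted analysis would additionally account for covariates such as comorbidities and utilization.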

Results: Black Medicare beneficiaries had higher rates of CVD outcomes than White beneficiaries, with an adjusted odds ratio (OR) of 1.24 (95% CI, 1.20-1.28). The disparity persisted after adjusting for all covariates, suggesting that factors beyond demographics and clinical characteristics contribute to it. Further analyses revealed that Black beneficiaries had lower rates of appropriate medication use and higher rates of healthcare utilization than White beneficiaries, which may partially explain the disparity. Sensitivity analyses using different outcome definitions, along with subgroup analyses by age, gender, and comorbidity burden, found consistent disparities.

Discussion: This case study illustrates how large-scale longitudinal claims datasets can be used to assess health disparities. However, several limitations may affect the results. Several types of selection bias may be present, such as variation in access to care, which may lead to over- or underestimation of the actual rate of CVD events. Approaches to address such bias include sensitivity analyses by geographic location or healthcare system. Another limitation is the accuracy and completeness of race reporting in claims data. JS should report these limitations and, if possible, incorporate additional data sources such as EHR data to validate the study findings.

Considerations for Project Planning

What characteristics of your claims data might affect its representativeness compared with the general population?
What is the availability and completeness of demographic factors such as race and ethnicity?
Does your research require any social determinants of health data not typically found in claims data?
How do you plan to mitigate the limitations in the claims population or available data elements?

Resources

Health Equity Research Assessment Tool - An insightful claims-based analysis of patient demographic and clinical characteristics across multiple claims datasets.

Tools for Claims-based Analytics - A set of R tools for EHR-based analyses using the OMOP common data model