Lesson 1. Addressing Missing Values

Missing values in data can affect health equity in several ways. They can lead to incomplete or inaccurate results when analyzing health disparities across populations, which in turn limits understanding of the root causes of health inequities and of potential solutions. Certain demographic groups may also be more likely than others to have missing data, which can widen disparities in health outcomes between those groups. Finally, missing data can impede accurate measurement of progress toward health equity goals: without complete datasets, it is difficult to assess the effectiveness of health equity interventions or to identify their unintended consequences.

Health Equity and Missing Values

Missing values can lead to incorrect or distorted results, which could lead to inaccurate public health recommendations or interventions. In addition, missing data can lead to bias in the analysis and an inability to detect meaningful trends or associations.
Imputing missing values in a dataset can help researchers better understand public health trends and associations. By filling in missing data points, researchers can get a clearer picture of how different variables are associated with each other, allowing them to more accurately assess the impact of certain public health initiatives, policies, or programs.
Take care when deciding whether to delete incomplete records from your data. By deleting records with missing values, researchers may discard important information that could provide insights into public health issues.

Types of Missing Values

One of the biggest impacts of missing values is that they can bias a model and reduce the accuracy of an analysis. Values can be missing for many reasons, from data corruption to an equipment or human failure to record data. Below are descriptions of three types of missing values to keep in mind.

Type	Description	Example
Missing Completely at Random (MCAR)	Missing values are distributed across the variable and unrelated to other variables. The probability of any particular value being missing is unrelated to anything else. This condition is rare in real datasets since “true randomness” is rare. Still, things like equipment malfunction or lost files are considered MCAR because the reason the values are missing is unrelated to other variables in the data.	You have data from a survey about an individual’s medical expenses. Some people started filling out the survey but stopped midway or skipped a question, and you have values from a wide distribution, ranging from low to high values. Therefore, you conclude the missing values are not related to any specific variable or spending range.
Missing at Random (MAR)	Missing values are not uniformly distributed, but they are accounted for by other observed variables. The missing values systematically differ from the data you’ve collected, but other observed variables can fully account for them. The likelihood of a data point being missing is related to another observed variable but not to the specific value of that data point itself.	You notice that you have more missing values for adults aged 18-25 than for other age groups. However, the values for that age group are still widely spread, so it is unlikely that the values are missing because of the specific values themselves. It could be that young adults are less inclined to reveal their information for unrelated reasons like privacy concerns.
Missing not at Random (MNAR)	Values are missing for reasons related to the values themselves. This may indicate you lack data from key subgroups within your sample, and your sample may not be representative of your population.	In long-term medical studies, some participants may drop out because they become more and more unwell. This may cause your dataset to contain only healthy individuals, and you miss out on important data.
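
The three mechanisms in the table can be illustrated with a small simulation. The sketch below is a minimal example on synthetic data, not part of the original lesson: the column names, missingness probabilities, and spending figures are all assumptions chosen for illustration. It removes values from the same variable under each mechanism and compares the mean of what remains with the mean of the full data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 10_000

# Synthetic survey (all numbers are illustrative assumptions):
# young adults (ages 18-25) tend to have lower medical expenses.
young = rng.random(n) < 0.3
df = pd.DataFrame({
    "young_adult": young,
    "expenses": rng.normal(loc=np.where(young, 3_000, 6_000), scale=1_000),
})

# MCAR: every value has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance of being missing depends on an observed variable
# (age group), not on the expense value itself.
mar_mask = rng.random(n) < np.where(df["young_adult"], 0.50, 0.10)

# MNAR: the chance of being missing depends on the value itself
# (high spenders are more likely to skip the question).
high = df["expenses"] > df["expenses"].quantile(0.80)
mnar_mask = rng.random(n) < np.where(high, 0.80, 0.05)

observed_means = {
    "full": df["expenses"].mean(),
    "MCAR": df.loc[~mcar_mask, "expenses"].mean(),
    "MAR": df.loc[~mar_mask, "expenses"].mean(),
    "MNAR": df.loc[~mnar_mask, "expenses"].mean(),
}

# Under MAR, conditioning on the observed variable recovers the group means.
mar_group_means = df.loc[~mar_mask].groupby("young_adult")["expenses"].mean()

for name, value in observed_means.items():
    print(f"{name:>4}: {value:,.0f}")
```

In this setup the MCAR mean stays close to the full-data mean, the MAR mean drifts because one age group is underrepresented (although the within-group means remain close to the truth), and the MNAR mean is pulled down because high values are the ones that go missing.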

How to Address Missing Values

It is important to understand the various types of missing values and what their context may mean for the dataset as a whole. It is generally best to avoid replacing missing values when the data is MCAR or MAR: under MCAR the missingness is unrelated to any variable in the dataset, and under MAR it is explained by other observed variables rather than by the missing values themselves. In these cases, replacing the data without accounting for that context may introduce bias and distort the results of any analysis. When data values are MNAR, this could mean your dataset is biased. Potential techniques to treat missing values are listed below.

Deletion

Deleting rows or columns is a simple approach to addressing missing values. However, deleting records that contain missing values can mean losing a large amount of valuable data. Imputation prevents this data loss, preserving valuable information.

List-wise deletion refers to deleting the whole record (participant, row, case, etc.) from your dataset when any of its values is missing.
Pairwise deletion refers to excluding only the missing data point and retaining the record's other data points, so each analysis uses every record that is complete for the variables it involves. Alternatively, the variables with missing data may be removed, such that only complete variables are processed.
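
The difference between the two approaches can be sketched with pandas. This is a minimal illustration on made-up records (the column names and values are assumptions): `dropna()` performs list-wise deletion, while `DataFrame.corr()` excludes missing values pair by pair, which is a form of pairwise deletion.

```python
import numpy as np
import pandas as pd

# Hypothetical survey records; NaN marks a missing answer.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 29, 62],
    "bmi": [22.5, np.nan, 27.1, 24.3, 30.2],
    "systolic_bp": [118, 135, 128, np.nan, 142],
})

# List-wise deletion: drop every record that has at least one missing value.
listwise = df.dropna()

# Pairwise deletion: each statistic uses all records that are complete
# for the variables involved in that statistic.
pairwise_corr = df.corr()  # pandas excludes NaN separately for each pair of columns

# Number of records that contribute to the age/bmi correlation.
n_used_age_bmi = df[["age", "bmi"]].dropna().shape[0]

print(f"records kept by list-wise deletion: {len(listwise)} of {len(df)}")
print(f"records used for the age/bmi pair:  {n_used_age_bmi} of {len(df)}")
```

Here list-wise deletion keeps only two of the five records, while each pairwise correlation still draws on three.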

Public Health Example: It may be best to choose to delete records from a sample when there are more than a certain number of missing values. An example of this could be in survey data, where a survey response has an insufficient amount of information filled in for the purpose of the research. However, relying too heavily on deletion can introduce bias, especially if several variables have missing data.
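
A threshold rule like the one in this example might be sketched as follows. The survey columns and the cutoff of three answered questions are hypothetical; an appropriate cutoff depends on the purpose of the research.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses; NaN marks an unanswered question.
responses = pd.DataFrame({
    "q1": [1, np.nan, 3, np.nan],
    "q2": [2, np.nan, np.nan, 4],
    "q3": [5, np.nan, 1, 2],
    "q4": [4, 3, 2, np.nan],
})

MIN_ANSWERED = 3  # assumed cutoff: keep responses with at least 3 answers

# dropna(thresh=k) keeps rows that have at least k non-missing values.
kept = responses.dropna(thresh=MIN_ANSWERED)
dropped = responses.index.difference(kept.index)

print("kept responses:   ", list(kept.index))
print("dropped responses:", list(dropped))
```

Before deleting, it is worth checking whether the dropped responses come disproportionately from one group, since that is how deletion can introduce bias.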

Imputation

Imputation is the process of replacing missing values with substituted values. It is a common data preprocessing step used in machine learning and data analysis. Imputation methods differ for categorical and continuous values.

Categorical imputation is used for values that represent a certain group or category, such as gender, hair color, or eye color.
Continuous imputation is used for values that represent numerical data, such as height, weight, or age.

Imputation can provide a more complete picture of the data, which in turn can better inform public health policies and interventions targeted at health equity issues. When using imputation, note which variables have missing values, and keep in mind that each imputed value is an estimate rather than an observation.

Single imputation: In single imputation, a single estimate is made for a missing value.
Multiple imputation [Gelman et al., 2005, Li et al., 2015]: In multiple imputation, multiple estimates are made for each missing value. A single estimate gives no way to assess the uncertainty added by the imputation itself. Having multiple estimates introduces variation, allows the standard error to be evaluated across the resulting parameter estimates, and reduces bias.
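
One common way to combine the estimates from multiple imputation is the set of pooling formulas known as Rubin's rules. The lesson does not name a pooling method, so the sketch below is an assumption about how the standard error "across the resulting parameter estimates" could be computed. The function name `pool_rubin` and the example numbers are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their squared standard errors using Rubin's rules."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations, with one variance per estimate")
    pooled_estimate = mean(estimates)
    within = mean(variances)       # average sampling variance within each imputed dataset
    between = variance(estimates)  # sample variance of the estimates across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)


# Hypothetical results from analyzing three imputed datasets:
# the same parameter estimated three times, each with squared standard error 1.0.
estimate, std_error = pool_rubin([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
print(f"pooled estimate = {estimate:.2f}, pooled standard error = {std_error:.2f}")
```

The pooled standard error is larger than the standard error from any single imputed dataset because it also reflects how much the estimates disagree across imputations, which is the uncertainty a single imputation hides.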

Public Health Example: Laboratory data from Electronic Health Records (EHR) used in prediction modeling often have missing values that can be addressed through imputation. This, in turn, can reduce the model’s estimation bias and improve model performance [Li et al., 2021].

Imputation Methods

The table below lists several common methods for imputing missing categorical and numeric values [Azur et al., 2011].

Method	Common Usage	Description
Mode Imputation	Categorical, Continuous	Replacing missing values with the most frequent value in the variable.
Random Sample Imputation	Categorical, Continuous	Replacing missing values with a randomly selected value from the variable
Similar Category Imputation	Categorical	When specific variables are missing data, similar variables may be selected to replace the missing values.
Logical Rules	Categorical	Sometimes the researcher has knowledge about the missing values. Logical rules may be imposed to set values under certain conditions. For instance, you may be studying a disease that varies with distance from the city and specifically drops off at the city limits. You may choose a logical rule stating that each missing disease rate is assigned a value based on distance from the city center: missing values within the heart of the city, outside the heart of the city but within the city limits, and outside the city limits are each assigned a different value.
Adding a Missing Category	Categorical	Adding a new category to the variable to represent the missing values. For instance, you could replace all the missing data with the word missing. This method helps to understand the importance of the missing data.
End of Distribution Imputation	Continuous	Replacing missing values with a value at the far end of the variable's distribution. In a normal distribution these would be the values three standard deviations above or below the mean.
Frequent Category Imputation (or Mode Imputation)	Categorical	Replacing missing values with the most frequent category in the variable.
Mean Imputation	Continuous	Replaces missing values with the mean of the available values.
Median Imputation	Continuous	Replaces missing values with the median of the available values.
K-Nearest Neighbor (KNN) Imputation	Continuous	Identifies the k nearest neighbors that are similar, or close in feature space, to a sample with missing values. The missing value for the sample is then replaced with the mean value of the neighbors. The choice of the parameter "k" is a trade-off between imputation error and preserving the structure of the data.
Multiple Imputation	Continuous	Generates several estimates for each missing value, producing multiple completed datasets whose results are then combined.
Multivariate Imputation by Chained Equations (MICE)	Continuous	MICE is well suited to imputing large datasets. Each variable with missing values is predicted from the other variables through a chain of regression equations, and the procedure is run as part of a multiple imputation analysis to account for statistical uncertainty in the imputation. The steps include an initial replacement with the mean followed by regression analysis to predict the values.
Imputation by Deep Learning	Continuous	Using deep learning methods for imputing missing values can be seen as a pattern classification task, and has been shown to reduce bias when imputing large datasets. Imputation using deep learning methods has been shown to outperform baseline imputation methods based on the mean, such as KNN.
Deleting Records	Categorical, Continuous	The variables with missing data may be removed, such that only complete datasets are processed. Deleting records that are missing values to deal with this issue could mean losing a chunk of valuable data. Imputation prevents data loss, preserving valuable information.
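
Several of the simpler methods in the table can be sketched in a few lines of pandas. The example below is a minimal illustration on made-up data (the column names and values are assumptions) covering frequent category imputation, adding a missing category, mean and median imputation, and random sample imputation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Hypothetical records with one categorical and one continuous variable.
df = pd.DataFrame({
    "insurance": ["private", "public", None, "public", None, "public"],
    "weight_kg": [70.0, np.nan, 82.0, 64.0, np.nan, 90.0],
})

imputed = pd.DataFrame(index=df.index)

# Frequent category (mode) imputation for a categorical variable.
most_frequent = df["insurance"].mode().iloc[0]
imputed["insurance_mode"] = df["insurance"].fillna(most_frequent)

# Adding a missing category: keep the fact that the value was missing.
imputed["insurance_flagged"] = df["insurance"].fillna("missing")

# Mean and median imputation for a continuous variable.
imputed["weight_mean"] = df["weight_kg"].fillna(df["weight_kg"].mean())
imputed["weight_median"] = df["weight_kg"].fillna(df["weight_kg"].median())

# Random sample imputation: draw replacements from the observed values.
observed = df["weight_kg"].dropna().to_numpy()
is_missing = df["weight_kg"].isna()
random_fill = df["weight_kg"].copy()
random_fill[is_missing] = rng.choice(observed, size=is_missing.sum())
imputed["weight_random"] = random_fill

print(imputed)
```

Keeping each result in its own column makes it easy to compare methods side by side before committing to one.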

Health Equity Considerations

Deletion can be a convenient way to address missing values, especially when missing values make up no more than about 5% of the data. However, deletion does lose potentially useful information. Imputation can help prevent data loss and preserve potentially valuable information.

Imputation should not be used when there is measurement error or when data is missing at random from a dataset. When data is missing because a group of people is underrepresented, for example, imputation might inject bias into the dataset. Also, if the variables used for imputation are strongly linked with the missing data, imputation may not be a good idea, since it may artificially increase the variance of the dataset. Moreover, imputation may not be prudent if the data is missing completely at random, because the missing values cannot be approximated accurately.

Recommended Best Practices for Missing Data

Investigate and understand the context of your missing values. Are they due to human error, mechanical error, or chance?

MCAR and MAR values are acceptable to ignore, but when data values are MNAR, this could mean your dataset is biased.
Limit the amount of deletion/imputation you perform. It is generally best to perform deletion/imputation only if no more than 5% of the values in question are missing. If more than that are missing, it is better to add a missing category to your data to indicate where the data is missing and where it is present.
If you choose imputation, be mindful of which imputation method may be best for your data and analysis [Li et al., 2021], as each method has its own pros and cons, or consider doing a sensitivity analysis across multiple methods.
After performing imputation, compare the imputed data to the original data to verify that the imputed values align adequately with the observed ones.
Make sure to document all decisions made during the imputation process, including the methods used, the parameters used, and the results.
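
Two of these checks, understanding where values are missing and comparing imputed data to the original, can be sketched with pandas. The example below uses made-up data with hypothetical group labels "A" and "B". It reports the missingness rate per group and then compares group means before and after two mean-imputation variants.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for two demographic groups; NaN marks a missing value.
df = pd.DataFrame({
    "group": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "score": [10.0, 12.0, 11.0, np.nan, 20.0, np.nan, np.nan, 22.0],
})

# 1. Share of missing values in each group.
missing_rate = df["score"].isna().groupby(df["group"]).mean()

# 2. Two candidate imputations: the overall mean vs. each group's own mean.
overall = df["score"].fillna(df["score"].mean())
by_group = df["score"].fillna(df.groupby("group")["score"].transform("mean"))

# 3. Compare group means in the observed data with each imputed version.
comparison = pd.DataFrame({
    "observed": df.groupby("group")["score"].mean(),
    "overall_imputed": overall.groupby(df["group"]).mean(),
    "group_imputed": by_group.groupby(df["group"]).mean(),
})

print(missing_rate)
print(comparison)
```

In this toy data, group B has twice the missingness of group A, and imputing with the overall mean pulls group B's average from 21 down to 18, while imputing within each group leaves the group means unchanged. A table like `comparison` is also a convenient artifact to keep when documenting imputation decisions.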

Case Study Example

This case study is for illustrative purposes and does not represent a specific study from the literature.

Scenario: A researcher wants to explore the effect of living in a city on maternal health and newborn health.

Specific Model Objective: Does living in a city increase rates of depression in pregnant women, and does this affect newborn health?

Data Source: Clinical data was obtained for 5000 pregnant women from each state. For one month, the women submitted daily answers to a survey used to assess depression.

Analytic Method: Hospital sites were categorized as either urban or suburban. Women were tracked over the course of their pregnancies, and birth outcomes were recorded. Clinical and self-reported data were assumed to be valid.

Results: Rates of depression among pregnant women are lower in suburban regions, and newborn health is better.

Health Equity Considerations: Clinical datasets often have many variables, and some will have missing values. We want to preserve data for each individual and decide to impute the missing data.

Data obtained from hospitals should be fair with respect to race and gender. When more data is missing for one ethnic group than another, this may skew results.
- For example, let’s say that in our study, only 5% of the data is missing, but 60% of the missing data is data about Latino women. If the imputation algorithm is based on data from other ethnic groups, then it may not capture trends within the Latino population. We may choose to weight the data or to impute solely based on the data from the Latino women population.
- When imputing, we must check our results and make sure we are not systematically skewing results for one population.
We must also consider the issue of imputation for combined, geographically and demographically disparate datasets. For example, in this study we need to combine data obtained from various hospitals and assess individual representativeness as well as the collective representative sample. When combining data from multiple sources, we need to impute into each, and the imputation should be fair in relation to each dataset. A common approach is to rake the data so that values match common standards, such as the Census.
In the case of larger datasets with many variables, multiple imputation is suggested. In this case, using the mean to impute could shift our results and introduce bias. Multiple imputation provides a measure of the quality of the imputation and avoids shifting the results.
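
As a rough illustration of the last point, the sketch below generates several imputed datasets with scikit-learn's `IterativeImputer`, a MICE-style imputer, and contrasts them with mean imputation. It is not the method used in the case study: the data is synthetic, and the number of imputations, missingness rate, and variable relationships are all assumptions.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required before the next import)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(seed=0)

# Synthetic data: three correlated measurements (illustrative only).
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=n)
x3 = 0.5 * x1 - 0.3 * x2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2, x3])

# Remove about 20% of the third column completely at random.
X[rng.random(n) < 0.2, 2] = np.nan

# Multiple imputation: draw m completed datasets, each with a different seed.
# sample_posterior=True makes each run sample imputed values instead of
# returning a single best prediction.
m = 5
completed = []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed)
    completed.append(imputer.fit_transform(X))

# The spread of an estimate across imputations reflects imputation uncertainty.
col3_means = np.array([c[:, 2].mean() for c in completed])

# Mean imputation for comparison: one fixed value for every missing entry.
mean_imputed = np.where(np.isnan(X[:, 2]), np.nanmean(X[:, 2]), X[:, 2])

print("means across imputations:", np.round(col3_means, 3))
print("variance of observed values:     ", round(float(np.nanvar(X[:, 2])), 3))
print("variance after mean imputation:  ", round(float(mean_imputed.var()), 3))
```

Mean imputation always shrinks the variance of the imputed variable because every filled-in value sits exactly at the mean, whereas the multiple imputed datasets differ from one another and so expose how uncertain the filled-in values are.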

Considerations for Project Planning

How have you addressed missing values in the past when trying to create a more holistic dataset? What was your reasoning for choosing those methods?
Can you think of some examples where deleting incomplete records from your dataset may be acceptable? What are some reasons why deleting may be a better option than imputing?