Section 1. Data Cleaning

Data cleaning is a data preparation process that attempts to correct data records that may be improperly formatted, corrupted, incorrect, or incomplete. Messy data that has disparate formats and/or representations can introduce problems for models and methods. This in turn could result in inaccurate or biased outcomes. Data cleaning is integral to developing a more accurate and reliable dataset for applying analytic methods.

Health Equity and Data Cleaning

  • Cleaning data helps to reduce bias and health inequity by removing errors or incomplete data that could lead to incorrect conclusions.
  • Addressing outliers in data can allow researchers to focus on the data that accurately reflects the population being studied. This improved accuracy can lead to better understanding of the health issues being studied and can provide more reliable information for healthcare providers and policymakers.
  • Data transformation techniques can also help researchers compare different populations by standardizing data across multiple variables. This can lead to more meaningful and accurate comparisons, which can help inform public health researchers and administrators as they make decisions about how to best serve their communities.
  • Datasets with missing values can lead to inaccurate interpretations of health outcomes and trends, which can contribute to unequal health outcomes. By using imputation to fill in missing data, researchers can gain a better understanding of the health disparities that exist within different populations and make more informed decisions about how to address them.

Types of Data Cleaning

Data cleaning involves identifying and correcting inaccurate records, filling in missing values, removing outliers, and transforming data into a consistent format. Data cleaning content will be broken down into the following lessons:

Imputation is the selection of values to replace missing values in a dataset. Missing values may be Missing Completely at Random (MCAR), meaning the probability that a value is missing is unrelated to any variable in the dataset, including the missing value itself. Missingness may instead be related to other observed variables, in which case those variables can be used to estimate the missing values. Missing at Random (MAR) denotes missing values whose missingness is related to other observed variables but not to the missing value itself or to any unobserved values [Pedersen et al., 2017].
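
To make this concrete, the sketch below shows simple mean imputation using pandas and scikit-learn's `SimpleImputer`; the dataset, column names, and values are hypothetical, and mean imputation is only one of many possible strategies.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical records with missing values encoded as np.nan
df = pd.DataFrame({
    "age": [34, 52, np.nan, 61, 47],
    "systolic_bp": [120, np.nan, 135, 142, np.nan],
})

# Mean imputation replaces each missing value with its column mean.
# This simple strategy is most defensible under MCAR; under MAR,
# model-based or multiple imputation is usually preferred.
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)
```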

Data transformation attempts to add cohesion to datasets by ensuring that records and fields are represented consistently. Data normalization scales numeric values into specific ranges such as [0, 1], [-1, 1], or [-0.5, 0.5]. Normalization can also involve mean centering and scaling so that the standard deviation is 1. Normalizing data leads to a more standardized representation of records across datasets and assists with removing duplicate records and with aggregation [Brownlee, 2020].
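
As an illustration of the scaling described above, the sketch below applies min-max normalization to the [0, 1] range and z-score standardization (mean centering with unit standard deviation) using scikit-learn; the feature values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric feature with widely varying magnitudes
x = np.array([[2.0], [15.0], [7.5], [40.0], [11.0]])

# Min-max normalization: rescale values into the [0, 1] range
x_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(x)

# Standardization: mean center and scale to unit standard deviation
x_standard = StandardScaler().fit_transform(x)

print(x_minmax.ravel())
print(x_standard.ravel())
```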

Outliers are values that lie an abnormally far distance from other values in a dataset. To identify outliers, we can compute upper and lower bounds appropriate to the distribution of the data. Outliers can be kept, removed, or recoded. Handling outliers improves the accuracy and representativeness of the data, since many algorithms and techniques are sensitive to outliers.
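
One common way to compute the bounds mentioned above is the interquartile range (IQR) rule, which flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; the sketch below applies it with pandas to hypothetical values, and the 1.5 multiplier is a conventional choice rather than a requirement.

```python
import pandas as pd

# Hypothetical measurements containing one extreme value
values = pd.Series([12, 14, 15, 13, 16, 14, 95, 15, 13])

# Interquartile range (IQR) rule: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
cleaned = values[(values >= lower) & (values <= upper)]

print("Bounds:", lower, upper)
print("Outliers:", outliers.tolist())
```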

Considerations for Project Planning

  • What data cleaning methods are you employing to help clean and create a more accurate representation of your data?
  • While data cleaning is crucial to ensure the accuracy and reliability of the data, it can also have negative impacts on health-related datasets if not done carefully. Are there any areas in your data cleaning process where you could be introducing bias unintentionally?
    • Are you relying too heavily on data imputation?
    • Are there outliers in your data that exist for a reason that would help better inform your analysis?
    • Are you relying too heavily on sample weights for a specific group when, in reality, there is inadequate data to be making estimates at all?
  • Have you considered, or do you need to consider, feedback from specific stakeholders in the data cleaning process?