Unit 4: Data Preparation

Data preparation is one of the critical initial steps in any analytic workflow. Different analytic models may require different data preparation processes to put inputs into the proper formats and data configurations [Brownlee, 2020].

Health Equity and Data Preparation

  • Cleaning data helps to reduce bias and health inequity by removing errors or incomplete data that could lead to incorrect conclusions.
  • Feature engineering can allow researchers to identify features that are predictive of health outcomes but do not exhibit bias towards certain groups.
  • Sample weighting can be used to address disparities between population subgroups, allowing researchers to better understand and reduce health inequity.
  • Data validation works to ensure that all groups within the data are recorded accurately, so that any disparities in the data and health outcomes can be identified and addressed.

What is Data Preparation?

Data preparation is the process of gathering, cleaning, transforming, organizing, and formatting data to make it ready for downstream analysis and machine learning models. Below are the subject areas and data preparation methods that will be covered. Different sources group these subject areas in different ways; the grouping below is the one used for the purpose of this learning unit:

  • Data Cleaning: This section introduces data cleaning and how it is used in the public health space. It also briefly introduces various types of data cleaning processes and their impact on health equity.

    • Missing Values: Missing values can lead to incomplete or inaccurate results when analyzing a dataset. This in turn can lead to an incomplete understanding of the root causes of health inequities, as well as of potential solutions. This lesson introduces imputation, the selection of values to replace missing values in a dataset, and several recommended best practices.

    • Data Transformation: Data transformation processes attempt to add cohesion to datasets by ensuring they are consistent across all records and fields. This lesson covers various data transformation methods and how data transformation impacts public health research.

    • Handling Outliers: An outlier is a value that lies abnormally far from other values in a dataset. Outliers can be kept, removed, or recoded. This decision requires careful consideration of how keeping, removing, or recoding an outlier may improve or distort the representation of the population in question.

  • Feature Engineering and Selection: Feature engineering refers to the preprocessing and data handling of features, variables, or dimensions before model training. Feature engineering includes feature extraction, selection, and reduction to optimize a model’s accuracy and performance. This section covers health equity considerations related to feature engineering and selection, and how focusing solely on accuracy and performance can introduce bias.

  • Sample Weighting: Sample weighting is the process of adjusting how much each record in a collected study sample counts so that results can be generalized to a wider population. This section covers ways to help enhance data representativeness through sample weighting.

  • Data Validation: This section covers the process of data validation, which is used to identify and correct data that are incomplete, missing, inaccurate, or inconsistent. Data validation also helps ensure the data adheres to the desired project structure or format, in accordance with the organization’s standards, prior to using it for analysis or to train a model.
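
To make the subject areas above more concrete, the following is a minimal Python sketch, not a recommended pipeline. It uses only the standard library and an invented six-row dataset. The field names (`group`, `age`, `systolic_bp`), the population shares, the helper function names, the choice of median imputation, the 1.5 × IQR outlier rule, and post-stratification weighting are all assumptions made for illustration; the lessons in this unit discuss when such choices are appropriate. Note that outliers are flagged rather than deleted, and imputed values are marked, so that later analysis can check how these decisions affect each group.

```python
from statistics import median, quantiles

# Hypothetical survey records; None marks a missing value.
records = [
    {"group": "A", "age": 34, "systolic_bp": 118},
    {"group": "A", "age": 51, "systolic_bp": None},
    {"group": "A", "age": 45, "systolic_bp": 131},
    {"group": "B", "age": 29, "systolic_bp": 124},
    {"group": "B", "age": 62, "systolic_bp": 260},
    {"group": "A", "age": 58, "systolic_bp": 127},
]

# Assumed population shares per group (illustrative, not real figures).
population_share = {"A": 0.5, "B": 0.5}


def validate(rows):
    """Data validation: return (row_index, problem) pairs for basic checks."""
    problems = []
    for i, row in enumerate(rows):
        if row["group"] not in population_share:
            problems.append((i, "unknown group"))
        if row["systolic_bp"] is None:
            problems.append((i, "missing systolic_bp"))
        if not 0 <= row["age"] <= 120:
            problems.append((i, "age out of range"))
    return problems


def impute_median(rows, field):
    """Missing values: fill gaps with the median of the observed values
    and record which rows were imputed."""
    fill = median(r[field] for r in rows if r[field] is not None)
    return [
        {**r, field: fill if r[field] is None else r[field],
         f"{field}_imputed": r[field] is None}
        for r in rows
    ]


def flag_outliers(rows, field, k=1.5):
    """Outliers: flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    Expects the field to have no missing values (run imputation first)."""
    q1, _, q3 = quantiles([r[field] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [{**r, f"{field}_outlier": not low <= r[field] <= high} for r in rows]


def poststratify(rows, shares):
    """Sample weighting: weight = population share / sample share of the group."""
    counts = {}
    for r in rows:
        counts[r["group"]] = counts.get(r["group"], 0) + 1
    n = len(rows)
    return [{**r, "weight": shares[r["group"]] / (counts[r["group"]] / n)} for r in rows]


issues = validate(records)
cleaned = flag_outliers(impute_median(records, "systolic_bp"), "systolic_bp")
weighted = poststratify(cleaned, population_share)
```

In this toy sample, group B makes up one third of the records but is assumed to be half of the population, so its records receive larger weights than those in group A. A single overall median is used for imputation only to keep the sketch short; imputing within subgroups is one alternative that may better preserve differences between groups.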

Lessons

This Unit includes the following lessons exploring data preparation and health equity: