Section 1. Data Cleaning
Data cleaning is a data preparation process that attempts to correct data records that may be improperly formatted, corrupted, incorrect, or incomplete. Messy data that has disparate formats and/or representations can introduce problems for models and methods. This in turn could result in inaccurate or biased outcomes. Data cleaning is integral to developing a more accurate and reliable dataset for applying analytic methods.
Health Equity and Data Cleaning
Types of Data Cleaning
Data cleaning involves identifying and correcting inaccurate records, filling in missing values, removing outliers, and transforming data into a consistent format. Data cleaning content will be broken down into the following lessons:
Imputation is the selection of values to replace missing values in a dataset. Missing values may be Missing Completely at Random (MCAR), meaning the probability that a value is missing is unrelated both to the other variables in the dataset and to the missing variable itself. Missing at Random (MAR) denotes missing values whose probability of being missing is related to other observed variables, but not to the missing variable itself or to any unobserved values; in this case, the related variables may be used to estimate the missing values [Pedersen et al., 2017].
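As a minimal sketch of imputation, the example below replaces missing entries with the mean of the observed entries. The function name `impute_mean`, the use of `None` to mark a missing value, and the sample `ages` list are illustrative assumptions, not part of the lesson material. Mean imputation is only one simple choice and is most defensible when values are MCAR; under MAR, a method that uses the related variables would usually be preferred.

```python
from statistics import mean


def impute_mean(values):
    """Return a copy of `values` with None entries replaced by the observed mean.

    Hypothetical helper for illustration; None marks a missing value.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Illustrative data: two of the five ages are missing.
ages = [34, None, 29, 41, None]
imputed = impute_mean(ages)
print(imputed)
```

Observed values are left unchanged; only the missing positions receive the fill value, so the length and order of the records are preserved.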
Data transformation attempts to add cohesion to datasets by ensuring that formats and representations are consistent across all records and fields. Data normalization, one form of transformation, scales numeric values to a specific range such as [0, 1], [-1, 1], or [-0.5, 0.5]. It can also involve mean centering and rescaling so that the standard deviation is 1. Normalization leads to a more standardized representation of data records across datasets and assists in removing duplicate records and in aggregation [Brownlee, 2020].
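The two kinds of normalization described above can be sketched as follows. The function names `min_max_scale` and `standardize` and the sample `heights_cm` values are assumptions made for illustration. The standardization here uses the population standard deviation; using the sample standard deviation instead is an equally common convention.

```python
from statistics import mean, pstdev


def min_max_scale(values, low=0.0, high=1.0):
    """Linearly rescale `values` so the minimum maps to `low` and the maximum to `high`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale: all values are identical")
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


def standardize(values):
    """Mean-center `values` and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize: standard deviation is zero")
    return [(v - mu) / sigma for v in values]


# Illustrative data.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

unit_range = min_max_scale(heights_cm)             # scaled to [0, 1]
symmetric = min_max_scale(heights_cm, -1.0, 1.0)   # scaled to [-1, 1]
z_scores = standardize(heights_cm)                 # mean 0, standard deviation 1

print(unit_range)
print(symmetric)
print(z_scores)
```

Min-max scaling preserves the relative spacing of values within a fixed range, while standardization expresses each value as a number of standard deviations from the mean.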
Outliers are values that lie an abnormally far distance from other values in a dataset. To identify outliers, we can compute upper and lower bounds that depend on the distribution of the data, then flag values that fall outside them. Outliers can be kept, removed, or recoded. Handling outliers can improve the accuracy and representativeness of the data, since many algorithms and techniques are sensitive to outliers.
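One common way to set such bounds is the interquartile range (IQR) rule, sketched below. This is an assumed choice for illustration; the text does not prescribe a specific rule, and other approaches (for example, bounds based on standard deviations) may suit other distributions better. The function name `iqr_bounds`, the multiplier `k=1.5`, and the sample `readings` are likewise illustrative.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds using the IQR rule.

    Quartiles come from statistics.quantiles with its default method;
    other quartile conventions can give slightly different bounds.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative data with one value far from the rest.
readings = [10, 12, 11, 13, 12, 11, 95]
lower, upper = iqr_bounds(readings)

# Three ways to handle flagged values: keep (do nothing), remove, or recode.
outliers = [v for v in readings if v < lower or v > upper]
removed = [v for v in readings if lower <= v <= upper]
recoded = [min(max(v, lower), upper) for v in readings]  # clip to the bounds

print(lower, upper)
print(outliers)
print(removed)
print(recoded)
```

Removing drops the flagged records entirely, whereas recoding keeps every record but replaces extreme values with the nearest bound. Which option is appropriate depends on whether the extreme values are errors or genuine observations.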
Considerations for Project Planning