Unit 3. Data Characterization

This unit examines how data characterization methods can affect health equity. It covers mitigation strategies and examples for two topics: Descriptive Statistics and Data Visualization.

Health Equity and Data Characterization

Data characterization methods can help mitigate bias by surfacing issues and trends before analysis begins. These include limits on precision and sample size, data heterogeneity (differences in variance among groups defined by attributes such as race, ethnicity, and gender), skewed distributions, correlations, temporal trends, and biased or unfair representation in data visualizations.
Exploring the data set using descriptive statistics and data visualization enables a better understanding of the strengths and limitations within the data, and therefore helps inform necessary data preparation processes, ensure appropriate method selection and model development, and provide context and framing around valid conclusions during post-analysis and dissemination.
Public health data sets are unlikely to include every member of a population, which is why we must rely on statistics in order to make inferences about the population. Understanding the data sample is key to producing reliable statistics.

Prior to these lessons, we give a brief introduction to the field of data characterization as well as common motivators for pursuing data characterization prior to modeling and analysis.

What is Data Characterization?

Data characterization is the practice of summarizing the characteristics of a data set or sample under study. It includes extracting and visualizing information such as associations, trends, and correlations, as well as analyzing outliers and distributions. This step is usually fundamental to identifying and addressing quality issues in the data set (e.g., missing values, outliers, heterogeneity, and variability), which in turn aids data preparation tasks such as feature engineering and supports the development of predictive models.
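As a rough illustration of the quality checks mentioned above, the following minimal sketch counts missing values and flags potential outliers in a single numeric variable. The function name `summarize_quality`, the sample readings, and the choice of the 1.5 × IQR rule are assumptions made for this example. They are not prescribed by this unit, and other outlier rules may suit a given data set better.

```python
import statistics

def summarize_quality(values):
    """Count missing entries and flag outliers with the 1.5 * IQR rule."""
    # Treat None as the missing-value marker (an assumption for this sketch).
    observed = [v for v in values if v is not None]
    n_missing = len(values) - len(observed)

    # Quartiles of the observed values; quantiles() returns [Q1, Q2, Q3].
    q1, _, q3 = statistics.quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = [v for v in observed if v < low or v > high]
    return {"n": len(values), "n_missing": n_missing, "outliers": outliers}

# Hypothetical systolic blood pressure readings; None marks a missing value.
readings = [118, 122, None, 125, 130, 121, 119, 240, None, 127]
report = summarize_quality(readings)
print(report)
```

A flagged value is a prompt for review rather than an automatic exclusion: an extreme reading may be a data entry error, or it may be a genuine observation from an under-represented group.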

Motivation for Data Characterization

Advances in AI methods have proven useful in many public health initiatives. To mitigate potential health equity issues within analyses, the data sample should ideally be representative of the demographics, viewpoints, and other important aspects of the population at large [Ibrahim et al., 2020]. Although it is important that data samples, and the statistics they generate, are representative of the population of interest, a collected sample is unlikely to include every member of that population. This is why we must rely on statistics to make inferences about the population, and why understanding the data sample is key to producing reliable statistics. Characterizing the data set is also critical to informing method selection, model development and training, and accurate reporting. It further clarifies the scope of valid conclusions that the data can support, which helps avoid perpetuating existing bias or creating new health disparities.

Performing descriptive statistics and visual exploration of the data helps to:

understand sample differences along attributes such as race, ethnicity, and gender
inform model development and analysis
provide context and framing around valid conclusions during post-analysis and dissemination

Specifically, data characterization can help illuminate the following:

precision and sample size estimation
data heterogeneity (difference in variances among groups)
skewness present in distributions and distribution shape
correlations
temporal trends
potential issues of bias or unfair representation in data visualizations
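Several of the items listed above can be checked with a few group-wise descriptive statistics. The sketch below is one possible way to do this, using only the Python standard library. It reports sample size, mean, sample variance, standard error of the mean (as a simple indicator of precision), and skewness for each group. The function names, the record layout, the group labels `"A"` and `"B"`, and the `wait_days` values are all hypothetical and chosen only for illustration.

```python
import math
import statistics
from collections import defaultdict

def skewness(xs):
    """Population skewness: third central moment divided by sd cubed."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    if sd == 0:
        return 0.0
    return sum((x - mean) ** 3 for x in xs) / (len(xs) * sd ** 3)

def characterize_by_group(records, group_key, value_key):
    """Summarize one numeric variable separately for each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])

    summary = {}
    for name, xs in groups.items():
        n = len(xs)
        # Sample variance needs at least two observations.
        var = statistics.variance(xs) if n > 1 else float("nan")
        summary[name] = {
            "n": n,
            "mean": statistics.fmean(xs),
            "variance": var,
            "std_error": math.sqrt(var / n) if n > 1 else float("nan"),
            "skewness": skewness(xs),
        }
    return summary

# Hypothetical records: days waited for an appointment, by group.
records = (
    [{"group": "A", "wait_days": d} for d in [1, 2, 2, 3, 3, 13]]
    + [{"group": "B", "wait_days": d} for d in [2, 4, 6]]
)
summary = characterize_by_group(records, "group", "wait_days")
for name, stats in summary.items():
    print(name, stats)
```

In this made-up example both groups have the same mean, yet group B has half as many observations and group A is strongly right-skewed by a single long wait. Differences like these in sample size, spread, and shape are what a pooled summary would hide.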

Lessons

This unit includes the following two lessons exploring data characterization and health equity: