Unit 3. Data Characterization
This unit examines how data characterization methods can affect health equity. In particular, it covers mitigation strategies and examples for two topics: Descriptive Statistics and Data Visualization.
Health Equity and Data Characterization
Before the lessons, we briefly introduce the field of data characterization and the common motivations for performing it before modeling and analysis.
What is Data Characterization?
Data characterization is the practice of summarizing the characteristics of a data set or sample under study. It includes extracting and visualizing information such as associations, trends, and correlations, as well as analyzing outliers and distributions. This step is usually fundamental to identifying and addressing quality issues in the data set (e.g., missing values, outliers, heterogeneity, and variability), which in turn aids data preparation tasks such as feature engineering and supports the development of predictive models.
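As a rough illustration of the quality checks mentioned above, the following sketch counts missing values and flags outliers in a single numeric variable. It is only one possible approach: the 1.5 × IQR rule is a common convention rather than a method prescribed by this unit, and the function name `summarize_quality` and the blood pressure readings are hypothetical.

```python
import statistics

def summarize_quality(values):
    """Count missing entries and flag outliers using the 1.5 * IQR rule."""
    present = [v for v in values if v is not None]
    n_missing = len(values) - len(present)

    # Quartiles of the observed (non-missing) values.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = [v for v in present if v < low or v > high]
    return {"n": len(values), "n_missing": n_missing, "outliers": outliers}

# Hypothetical systolic blood pressure readings; None marks a missing value.
readings = [118, 122, None, 130, 125, 119, 240, None, 127, 121]
report = summarize_quality(readings)
print(report)
```

A flagged value is a prompt for review, not an automatic deletion: an extreme reading may be a data entry error or a genuine observation from an under-represented group.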
Motivation for Data Characterization
Advances in AI methods have proven useful in many public health initiatives. To mitigate potential health equity issues within analyses, the data sample should ideally be representative of the demographics, viewpoints, and other important aspects of the population at large [Ibrahim et al., 2020]. Because it is unlikely that every member of a population will be included in a collected sample, we must rely on statistics to make inferences about the population, and understanding the data sample is key to producing reliable statistics. Characterizing the data set is also critical for informing method selection, model development and training, and accurate reporting, and for clarifying the scope of valid conclusions that may be drawn from the data, so that analyses do not perpetuate existing bias or create new health disparities.
Computing descriptive statistics and visually exploring the data helps to:
understand sample differences along attributes such as race, ethnicity, and gender
inform model development and analysis
provide context and framing around valid conclusions during post-analysis and dissemination
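One simple way to examine sample differences along a demographic attribute is to compare each group's share of the sample with its share of a reference population. The sketch below assumes such reference shares are available (for example, from census figures); the function name `representation_gaps`, the group labels, and all numbers are hypothetical.

```python
from collections import Counter

def representation_gaps(sample_groups, reference_shares):
    """Return sample share minus reference share for each group."""
    counts = Counter(sample_groups)
    total = sum(counts.values())
    gaps = {}
    for group, ref_share in reference_shares.items():
        sample_share = counts.get(group, 0) / total
        # Positive gap: over-represented; negative gap: under-represented.
        gaps[group] = round(sample_share - ref_share, 3)
    return gaps

# Hypothetical sample of 100 records and hypothetical population shares.
sample = ["A"] * 70 + ["B"] * 20 + ["C"] * 10
reference = {"A": 0.60, "B": 0.25, "C": 0.15}

gaps = representation_gaps(sample, reference)
print(gaps)
```

In this made-up example, group A is over-represented by about ten percentage points, while groups B and C are each under-represented by about five.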
Specifically, data characterization can help illuminate the following:
precision and sample size estimation
data heterogeneity (difference in variances among groups)
skewness present in distributions and distribution shape
correlations
temporal trends
potential issues of bias or unfair representation in data visualizations
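Several of the items above (sample size, differences in variance among groups, and skewness) can be inspected with a per-group summary. The following is a minimal sketch using only the Python standard library; the helper names `skewness` and `describe_by_group` and the length-of-stay values are hypothetical, and skewness is computed here as the moment coefficient m3 / m2^1.5, which is one of several definitions in use.

```python
import statistics
from collections import defaultdict

def skewness(xs):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    mean = statistics.fmean(xs)
    m2 = sum((x - mean) ** 2 for x in xs) / len(xs)
    m3 = sum((x - mean) ** 3 for x in xs) / len(xs)
    return m3 / m2 ** 1.5

def describe_by_group(records):
    """Summarize (group, value) pairs: size, mean, sample variance, skewness."""
    groups = defaultdict(list)
    for group, value in records:
        groups[group].append(value)
    return {
        group: {
            "n": len(values),
            "mean": statistics.fmean(values),
            "variance": statistics.variance(values),  # sample variance (n - 1)
            "skewness": skewness(values),
        }
        for group, values in groups.items()
    }

# Hypothetical length-of-stay values (days) for two groups.
records = [("A", 1), ("A", 2), ("A", 3), ("A", 4), ("A", 5),
           ("B", 1), ("B", 1), ("B", 1), ("B", 2), ("B", 10)]

summary = describe_by_group(records)
for group, stats in summary.items():
    print(group, stats)
```

Here the two groups have equal sample sizes and the same mean, but group B has a much larger variance and a strong right skew, which a pooled summary would hide.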
Lessons
This unit includes the following two lessons exploring data characterization and health equity: