Lesson 1. Supervised Learning#
Supervised learning uses labeled data to train a model to make predictions and judgments. In the area of health equity, this kind of learning can be applied to pinpoint health inequities, estimate the likelihood of health-related outcomes, and create focused interventions. For instance, supervised learning algorithms can be applied to patient data to identify groups of people who are at risk for particular health disorders or who are more likely to benefit from preventative treatments. Moreover, supervised learning can be used to pinpoint subgroups of people with differing health outcomes and to create individualized therapies that address those inequalities [Ghassemi et al., 2020].
Figure: Health Equity and Supervised Learning
What is Supervised Learning?#
Supervised learning models are trained on labeled datasets to learn a mapping between input features and a target variable; the trained model is then used to make predictions on new instances. A labeled dataset refers to data that comes with known attributes (features) and known targets, such as discrete categorical labels (class labels). There are two main categories of supervised learning problems:
Classification algorithms are used to identify which discrete category or class a given data point belongs to given its features.
Regression algorithms differ from classification algorithms in that they are used to predict continuous numeric quantities. Regression problems predict the value of a dependent variable based on given independent variables.
This lesson focuses on using classification and regression models for prediction. However, regression is also commonly used to perform statistical inference, which estimates the associations between two or more variables of interest. See the lesson on Biostatistics for more details regarding statistical inference.
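To make the distinction concrete, the minimal sketch below (in Python with scikit-learn) fits a regression model to a continuous outcome and a classification model to a discrete label derived from it. The data here are synthetic values generated for illustration, not drawn from any real cohort.

```python
# A minimal sketch contrasting regression and classification in scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic feature: age in years (illustrative assumption).
age = rng.uniform(20, 80, size=200).reshape(-1, 1)

# Regression target: a continuous quantity (e.g., systolic blood pressure).
sbp = 100 + 0.5 * age.ravel() + rng.normal(0, 5, size=200)
reg = LinearRegression().fit(age, sbp)
print("Predicted SBP at age 50:", reg.predict([[50]])[0])

# Classification target: a discrete label (hypertensive yes/no).
hypertensive = (sbp > 130).astype(int)
clf = LogisticRegression().fit(age, hypertensive)
print("P(hypertensive | age 50):", clf.predict_proba([[50]])[0, 1])
```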
How Supervised Learning is Used#
Below are examples of common uses of supervised models within public health:
Forecasting: Regression models can be used to forecast future rates of infection based on historical data.
Risk Assessment: Regression analysis can be used to identify cardiovascular risk factors from EHR data and to generate risk scores for individual prognostication.
Decision Making: Regression analysis can help aid decision-making under uncertainty. For example, risk scores generated from regression analysis of patient history, diagnosis, and prognosis can help patients and care providers make more informed treatment decisions.
Categorical Classification: Classification can be used to discriminate among a set of features and group samples according to a predicted target. An example would be a classification model that predicts mortality risk for a disease based on demographic, symptom, and epidemiologic data. A second example could be a binary classification model used to predict malignant vs. non-malignant skin lesions based on dermatoscopic image data. A minimal sketch of the first example follows this list.
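As an illustration of the risk-assessment and classification uses above, the hedged sketch below trains a decision-tree ensemble on a synthetic, imbalanced dataset standing in for demographic, symptom, and epidemiologic features. The sample size, feature count, and class balance are illustrative assumptions, not values from any real study.

```python
# A hedged sketch of a mortality-risk classifier on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic dataset: 1,000 patients, 10 features, imbalanced binary outcome.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Random forest: an ensemble of decision trees (see the methods table below).
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```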
Methods in Supervised Learning#
Below is a table that lists common supervised learning methods. This table builds upon a Machine Learning Quick Reference guide provided by SAS. For more in depth information regarding regression analyses and statistical inference, see the lesson on Biostatistics.
If you are already familiar with methods for supervised learning, please continue to the next section.
| Model | Common Usage | Suggested Usage | Suggested Scale | Interpretability | Common Concerns |
|---|---|---|---|---|---|
| Linear Regression | Supervised regression | Multiple linear regression, simple linear regression | Small to large datasets | High | Missing values, outliers, standardization, parameter tuning |
| Polynomial Regression | Supervised regression | Modeling non-linear data using a linear model, analyze the curve towards the end for signs of overfitting, often used when linear models are unclear | Small to large datasets | High | Missing values, outliers, overfitting, standardization, parameter tuning |
| Logistic Regression | Supervised classification | Most commonly used in classification but can be used in regression modeling, dependent variable (target) is categorical | Small to large datasets | High | Missing values, outliers, standardization, parameter tuning |
| Penalized Regression | Supervised regression, supervised classification | Modeling linear or linearly separable phenomena, manually specifying nonlinear and explicit interaction terms, well suited for N << p, where the number of predictors exceeds the number of samples {cite:p}`brownlee2022bigp`, specific types of penalized regression (Bayesian Linear Regression, Ridge, Lasso, Elastic Net) | Small to large datasets | High | Missing values, outliers, standardization, parameter tuning |
| Naïve Bayes | Supervised classification | Modeling linearly separable phenomena in large datasets, well suited for extremely large datasets where complex methods are intractable | Small to extremely large datasets | Moderate | Strong feature (conditional) independence assumption, infrequent categorical levels |
| Decision Trees | Supervised regression, supervised classification | Modeling nonlinear and nonlinearly separable phenomena in large, dirty data, interactions are considered automatically but implicitly, missing values and outliers in input variables handled automatically in many implementations, decision tree ensembles (e.g., random forests and gradient boosting) can increase prediction accuracy and decrease overfitting | Medium to large datasets | Moderate | Instability with small training datasets, gradient boosting can be unstable with noise or outliers, overfitting, parameter tuning |
| Support Vector Machines (SVM) | Supervised regression, supervised classification | Modeling linear or linearly separable phenomena using linear kernels, modeling nonlinear or nonlinearly separable phenomena using nonlinear kernels, anomaly detection with one-class SVM (OSVM) | Small to large datasets for linear kernels, small to medium datasets for nonlinear kernels | Low | Missing values, overfitting, outliers, standardization, parameter tuning, accuracy versus deep neural networks depends on the choice of the nonlinear kernel; Gaussian and polynomial kernels are often less accurate {cite:p}`singh2019svm` |
| k-Nearest Neighbors (kNN) | Supervised regression, supervised classification | Modeling nonlinearly separable phenomena, can match the accuracy of more sophisticated techniques with fewer tuning parameters | Small to medium datasets | Low | Missing values, overfitting, outliers, standardization, curse of dimensionality |
| Neural Networks (NN) | Supervised regression, supervised classification | Modeling nonlinear and nonlinearly separable phenomena, deep neural networks (e.g., deep learning) are well suited for state-of-the-art pattern recognition in images, videos, and sound, all interactions considered in fully connected multilayer topologies, nonlinear feature extraction with auto-encoder and restricted Boltzmann machine (RBM) networks | Medium to large datasets | Low | Missing values, overfitting, outliers, standardization, hyperparameter tuning |
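To see how several of the methods in the table compare in practice, the sketch below evaluates a few of them with cross-validated AUC on a single synthetic dataset. The scale-sensitive methods (kNN, SVM) are standardized first, reflecting the standardization concern noted in the table; the dataset and settings are illustrative assumptions.

```python
# A minimal sketch comparing several methods from the table via cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    # LogisticRegression applies an L2 penalty by default, a form of
    # penalized regression.
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    # kNN and SVM are scale-sensitive, so standardize features first.
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```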
Health Equity Considerations#
Supervised learning methods in the context of health equity require careful consideration in data preparation, model training, and interpretation/dissemination of results. Addressing biases, ensuring diversity in the evaluation dataset, and quantifying algorithmic fairness using appropriate metrics are essential steps to promote equitable outcomes in health care decision-making. Additionally, incorporating qualitative evaluation through user research can provide valuable insights into the model’s impact on diverse communities.
Model training and tuning is a process where training data is introduced to an algorithm and model performance is optimized through methods such as iterative hyperparameter tuning and cross-validation. Below are several challenges you may encounter while training and tuning a supervised learning model. For more information on data preparation and dissemination, please visit their respective units.
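Below is a minimal sketch of the training-and-tuning loop just described: a grid search over a decision tree's hyperparameters with stratified cross-validation. The parameter grid values are illustrative assumptions, not recommended defaults.

```python
# A sketch of iterative hyperparameter tuning with cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    # Illustrative grid; real choices depend on the data and the model.
    param_grid={"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]},
    cv=StratifiedKFold(n_splits=5),
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```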
| Challenge | Challenge Description | Health Equity Example | Recommended Best Practice |
|---|---|---|---|
| Overfitting | High-variance, low-bias models that fail to generalize well. | When trying to make a policy decision that impacts public health, an initial thought may be to include all socio-demographic variables as well as data on the population's health/co-morbidities in your model. Including too many variables in smaller datasets can cause a model to fit to values that are not significant indicators of the target, leading to biased and inaccurate predictions. | |
| Discriminatory Classification | Different prediction error rates for different subgroups suggest that the model discriminates against particular subgroups. The choice of the target variable can introduce bias, and underrepresented populations in the training data can lead to discriminatory predictions. Further, "existing approaches for reducing discrimination induced by prediction errors may be unethical or impractical to apply in settings where predictive accuracy is critical, such as in healthcare" [Chen et al., 2018]. | A healthcare model predicts which patients might be discharged earliest to efficiently direct limited case management resources in order to prevent delays and open up more beds. If the model discovers that residence in certain zip codes predicts longer stays, but those zip codes are socioeconomically depressed or predominantly African American, then the model might disproportionately allocate case management resources to patients from richer, predominantly white neighborhoods [Rajkomar et al., 2018]. | |
| Different costs of misclassification | Carefully consider when a false positive error can cause harm to an individual in a protected class. The use of ROC AUC to measure diagnostic accuracy does not account for different costs of misclassification; it lacks clinical interpretability, and the confidence scales used to construct ROC curves can be inconsistent and unreliable. | For a colonography or mammogram, using ROC AUC to determine the diagnostic accuracy of radiological tests could be problematic, as ROC AUC does not consider misclassification costs (important to assessing classification fairness). | |
[Chen et al., 2018, d'Alessandro et al., 2017, Halligan et al., 2015, Har-Peled and Mahabadi, 2019, Kamiran et al., 2010, Kumar et al., 2020, Rajkomar et al., 2018, Yeom et al., 2018]
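One concrete way to probe for discriminatory classification, as described in the table above, is to compare prediction error rates across subgroups. The sketch below defines a hypothetical helper, `subgroup_error_rates`, that reports false positive and false negative rates per group; the labels, predictions, and group column are toy values for illustration, and a real audit would use domain-appropriate fairness metrics.

```python
# A hedged sketch: compare error rates across subgroups to flag
# potentially discriminatory classification.
import numpy as np
from sklearn.metrics import confusion_matrix

def subgroup_error_rates(y_true, y_pred, group):
    """Print false positive and false negative rates for each subgroup."""
    for g in np.unique(group):
        mask = group == g
        tn, fp, fn, tp = confusion_matrix(
            y_true[mask], y_pred[mask], labels=[0, 1]).ravel()
        fpr = fp / (fp + tn) if (fp + tn) else float("nan")
        fnr = fn / (fn + tp) if (fn + tp) else float("nan")
        print(f"group={g}: FPR={fpr:.2f}, FNR={fnr:.2f}")

# Toy illustration with made-up labels, predictions, and group membership.
y_true = np.array([0, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
subgroup_error_rates(y_true, y_pred, group)
```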
Case Study Example#
This case study is for illustrative purposes and does not represent a specific study from the literature.
Scenario: MG is an epidemiologist who is interested in creating a model to predict the effectiveness of a community weight loss intervention in young adults with Type 2 diabetes mellitus. MG hopes that identifying optimal individuals for intervention will enable a broader rollout of this program.
Specific Model Objective: Predict successful weight loss (>5% of initial body weight) after 6 months of program participation in adults aged 18-33 with an established diagnosis of Type 2 diabetes and at least one treatment with oral hypoglycemics. The program intervention consisted of bi-weekly 15-minute online counseling sessions covering the patient’s nutrition, exercise, and medication adherence.
Data Source: Clinical and demographic data from outpatient diabetes center EHR at an academic institution in Boston. Per EHR records: 75% of the participants were white, 15% were black, and 10% were Hispanic/Latino or another race.
Analytic Method: Decision Tree Classifier
Results: The model achieved an overall AUC of 0.892 and an F1 score of 0.41.
Health Equity Considerations:
While the model achieved high overall performance in terms of AUC, several considerations should be made:
A limitation of the data was the lower representation of Hispanic/Latino participants. Due to imbalanced data, the model may present discriminatory classification performance for Hispanic/Latino participants even with the observed high overall AUC. As discussed in a previous section, MG may want to consider techniques to mitigate imbalanced data such as over-sampling, under-sampling, or trying another model.
Overfitting may limit the model’s portability, particularly to settings in which the population demographics differ from those of the Boston sample. Specific to decision tree-based models, certain choices when training the model can lead to overfitting (a sketch addressing these concerns follows this list):
Using all the features to train the model, while perhaps a good idea at first, can lead to overfitting. This means the model may not perform as well when new datasets are introduced.
If the decision tree is allowed to split in an unlimited manner (e.g., no maximum depth), it may become too closely aligned to the specific features of the training data, and make errors when new data are introduced.
It is important to consider that the model results may be biased due to the design of a virtual intervention, which is likely limited to individuals with internet access.
As noted in other lessons, the quality of health data from EHRs can be highly variable and data may be missing, which could affect the outcome of this study (e.g., medication adherence data).
The model may be incorrectly interpreted in its application for deciding where and for whom the weight loss intervention should be deployed; for example, applying a classification threshold for predicted weight loss that depends on race could introduce inequities.
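Below is a hedged sketch of how MG might address the imbalance and overfitting concerns above: a depth-limited decision tree with a minimum leaf size and balanced class weights. The features, sample size, and class balance are synthetic placeholders, not the actual EHR cohort.

```python
# A sketch of a depth-limited, class-weighted decision tree for the case study.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Imbalanced synthetic outcome: ~20% achieve >5% weight loss (assumption).
X, y = make_classification(n_samples=600, n_features=12,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=4,              # cap depth so the tree cannot split without limit
    min_samples_leaf=20,      # avoid leaves fit to a handful of patients
    class_weight="balanced",  # reweight the minority (successful loss) class
    random_state=0,
).fit(X_tr, y_tr)

print("Test AUC:", round(roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]), 3))
```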
Considerations for Project Planning#