Lesson 1. Supervised Learning#
Supervised learning uses labeled data to train a model to make predictions and judgments. In the area of health equity, this kind of learning can be applied to pinpoint health inequities, estimate the likelihood of health-related outcomes, and create focused interventions. For instance, supervised learning algorithms can be applied to patient data to identify groups of people who are at risk for particular health disorders or who are more likely to benefit from preventative treatments. Moreover, supervised learning can be used to pinpoint subgroups of people with differing health outcomes and to create individualized therapies that address those inequalities [Ghassemi et al., 2020].
Figure: Health Equity and Supervised Learning
What is Supervised Learning?#
Supervised learning models are trained on labeled datasets to learn a mapping between input features and a target variable; the trained model is then used to make predictions on new instances. A labeled dataset refers to data that comes with known attributes (features) and known targets, such as discrete categorical labels (class labels). There are two main categories of supervised learning problems:
Classification algorithms are used to identify which discrete category or class a given data point belongs to given its features.
Regression algorithms differ from classification algorithms in that they are used to predict continuous numeric quantities. Regression problems predict the value of a dependent variable based on given independent variables.
This lesson focuses on using classification and regression models for prediction. However, regression is also commonly used to perform statistical inference, which estimates the associations between two or more variables of interest. See the lesson on Biostatistics for more details regarding statistical inference.
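To make the distinction concrete, the minimal sketch below (in Python with scikit-learn) fits a regression model to a continuous outcome and a classification model to a discrete label derived from it. The data here are synthetic values generated for illustration, not drawn from any real cohort.

```python
# A minimal sketch contrasting regression and classification in scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic feature: age in years (illustrative assumption).
age = rng.uniform(20, 80, size=200).reshape(-1, 1)

# Regression target: a continuous quantity (e.g., systolic blood pressure).
sbp = 100 + 0.5 * age.ravel() + rng.normal(0, 5, size=200)
reg = LinearRegression().fit(age, sbp)
print("Predicted SBP at age 50:", reg.predict([[50]])[0])

# Classification target: a discrete label (hypertensive yes/no).
hypertensive = (sbp > 130).astype(int)
clf = LogisticRegression().fit(age, hypertensive)
print("P(hypertensive | age 50):", clf.predict_proba([[50]])[0, 1])
```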
How Supervised Learning is Used#
Below are examples of common uses of supervised models within public health:
Forecasting: Regression models can be used to forecast future rates of infection based on historical data.
Risk Assessment: Regression analysis can be used to identify cardiovascular risk factors from EHR data and to generate risk scores for individual prognostication.
Decision Making: Regression analysis can help aid decision-making under uncertainty. For example, risk scores generated from regression analysis of patient history, diagnosis, and prognosis can help patients and care providers make more informed treatment decisions.
Categorical Classification: Classification can be used to discriminate among a set of features and group samples according to a predicted target. An example would be a classification model that predicts mortality risk for a disease based on demographic, symptom, and epidemiologic data. A second example could be a binary classification model used to predict malignant vs. non-malignant skin lesions based on dermatoscopic image data. A minimal sketch of the first example follows this list.
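As an illustration of the risk-assessment and classification uses above, the hedged sketch below trains a decision-tree ensemble on a synthetic, imbalanced dataset standing in for demographic, symptom, and epidemiologic features. The sample size, feature count, and class balance are illustrative assumptions, not values from any real study.

```python
# A hedged sketch of a mortality-risk classifier on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic dataset: 1,000 patients, 10 features, imbalanced binary outcome.
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Random forest: an ensemble of decision trees (see the methods table below).
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```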
Methods in Supervised Learning#
Below is a table that lists common supervised learning methods. This table builds upon a Machine Learning Quick Reference guide provided by SAS. For more in depth information regarding regression analyses and statistical inference, see the lesson on Biostatistics.
If you are already familiar with methods for supervised learning, please continue to the next section.
| Model | Common Usage | Suggested Usage | Suggested Scale | Interpretability | Common Concerns |
|---|---|---|---|---|---|
| Linear Regression | Supervised regression | Multiple linear regression, simple linear regression | Small to large datasets | High | Missing values, outliers, standardization, parameter tuning |
| Polynomial Regression | Supervised regression | Modeling non-linear data using a linear model, analyze the curve towards the end for signs of overfitting, often used when linear models are unclear | Small to large datasets | High | Missing values, outliers, overfitting, standardization, parameter tuning |
| Logistic Regression | Supervised classification | Most commonly used in classification but can be used in regression modeling, dependent variable (target) is categorical | Small to large datasets | High | Missing values, outliers, standardization, parameter tuning |
| Penalized Regression | Supervised regression, supervised classification | Modeling linear or linearly separable phenomena, manually specifying nonlinear and explicit interaction terms, well suited for N << p, where the number of predictors exceeds the number of samples {cite:p}`brownlee2022bigp`, specific types of penalized regression (Bayesian Linear Regression, Ridge, Lasso, Elastic Net) | Small to large datasets | High | Missing values, outliers, standardization, parameter tuning |
| Naïve Bayes | Supervised classification | Modeling linearly separable phenomena in large datasets, well suited for extremely large datasets where complex methods are intractable | Small to extremely large datasets | Moderate | Strong feature (conditional) independence assumption, infrequent categorical levels |
| Decision Trees | Supervised regression, supervised classification | Modeling nonlinear and nonlinearly separable phenomena in large, dirty data, interactions are considered automatically but implicitly, missing values and outliers in input variables handled automatically in many implementations, decision tree ensembles (e.g., random forests and gradient boosting) can increase prediction accuracy and decrease overfitting | Medium to large datasets | Moderate | Instability with small training datasets, gradient boosting can be unstable with noise or outliers, overfitting, parameter tuning |
| Support Vector Machines (SVM) | Supervised regression, supervised classification | Modeling linear or linearly separable phenomena using linear kernels, modeling nonlinear or nonlinearly separable phenomena using nonlinear kernels, anomaly detection with one-class SVM (OSVM) | Small to large datasets for linear kernels, small to medium datasets for nonlinear kernels | Low | Missing values, overfitting, outliers, standardization, parameter tuning, accuracy versus deep neural networks depends on the choice of the nonlinear kernel; Gaussian and polynomial kernels are often less accurate {cite:p}`singh2019svm` |
| k-Nearest Neighbors (kNN) | Supervised regression, supervised classification | Modeling nonlinearly separable phenomena, can match the accuracy of more sophisticated techniques with fewer tuning parameters | Small to medium datasets | Low | Missing values, overfitting, outliers, standardization, curse of dimensionality |
| Neural Networks (NN) | Supervised regression, supervised classification | Modeling nonlinear and nonlinearly separable phenomena, deep neural networks (e.g., deep learning) are well suited for state-of-the-art pattern recognition in images, videos, and sound, all interactions considered in fully connected multilayer topologies, nonlinear feature extraction with auto-encoder and restricted Boltzmann machine (RBM) networks | Medium to large datasets | Low | Missing values, overfitting, outliers, standardization, hyperparameter tuning |
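To see how several of the methods in the table compare in practice, the sketch below evaluates a few of them with cross-validated AUC on a single synthetic dataset. The scale-sensitive methods (kNN, SVM) are standardized first, reflecting the standardization concern noted in the table; the dataset and settings are illustrative assumptions.

```python
# A minimal sketch comparing several methods from the table via cross-validation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

models = {
    # LogisticRegression applies an L2 penalty by default, a form of
    # penalized regression.
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    # kNN and SVM are scale-sensitive, so standardize features first.
    "kNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```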
Health Equity Considerations#
Supervised learning methods in the context of health equity require careful consideration in data preparation, model training, and interpretation/dissemination of results. Addressing biases, ensuring diversity in the evaluation dataset, and quantifying algorithmic fairness using appropriate metrics are essential steps to promote equitable outcomes in health care decision-making. Additionally, incorporating qualitative evaluation through user research can provide valuable insights into the model’s impact on diverse communities.
Model training and tuning is a process where training data is introduced to an algorithm and model performance is optimized through methods such as iterative hyperparameter tuning and cross-validation. Below are several challenges you may encounter while training and tuning a supervised learning model. For more information on data preparation and dissemination, please visit their respective units.
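Below is a minimal sketch of the training-and-tuning loop just described: a grid search over a decision tree's hyperparameters with stratified cross-validation. The parameter grid values are illustrative assumptions, not recommended defaults.

```python
# A sketch of iterative hyperparameter tuning with cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    # Illustrative grid; real choices depend on the data and the model.
    param_grid={"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 20]},
    cv=StratifiedKFold(n_splits=5),
    scoring="roc_auc",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```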
| Challenge | Challenge Description | Health Equity Example | Recommended Best Practice |
|---|---|---|---|
| Overfitting | High-variance, low-bias models that fail to generalize well. | When trying to make a policy decision that impacts public health, an initial thought may be to include all socio-demographic variables as well as data on the population's health/co-morbidities in your model. Including too many variables in smaller datasets can cause a model to fit to values that are not significant indicators of the target, leading to biased and inaccurate predictions. | |
| Discriminatory Classification | Different prediction error rates for different subgroups suggest that the model discriminates against particular subgroups. The choice of the target variable can introduce bias, and underrepresented populations in the training data can lead to discriminatory predictions. Further, "existing approaches for reducing discrimination induced by prediction errors may be unethical or impractical to apply in settings where predictive accuracy is critical, such as in healthcare" [Chen et al., 2018]. | A healthcare model predicts which patients might be discharged earliest to efficiently direct limited case management resources in order to prevent delays and open up more beds. If the model discovers that residence in certain zip codes predicts longer stays, but those zip codes are socioeconomically depressed or predominantly African American, then the model might disproportionately allocate case management resources to patients from richer, predominantly white neighborhoods [Rajkomar et al., 2018]. | |
| Different costs of misclassification | Carefully consider when a false positive error can cause harm to an individual in a protected class. The use of ROC AUC to measure diagnostic accuracy does not account for different costs of misclassification; it lacks clinical interpretability, and the confidence scales used to construct ROC curves can be inconsistent and unreliable. | For a colonography or mammogram, using ROC AUC to determine the diagnostic accuracy of radiological tests could be problematic, as ROC AUC does not consider misclassification costs (important to assessing classification fairness). | |
[Chen et al., 2018, d'Alessandro et al., 2017, Halligan et al., 2015, Har-Peled and Mahabadi, 2019, Kamiran et al., 2010, Kumar et al., 2020, Rajkomar et al., 2018, Yeom et al., 2018]
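One concrete way to probe for discriminatory classification, as described in the table above, is to compare prediction error rates across subgroups. The sketch below defines a hypothetical helper, `subgroup_error_rates`, that reports false positive and false negative rates per group; the labels, predictions, and group column are toy values for illustration, and a real audit would use domain-appropriate fairness metrics.

```python
# A hedged sketch: compare error rates across subgroups to flag
# potentially discriminatory classification.
import numpy as np
from sklearn.metrics import confusion_matrix

def subgroup_error_rates(y_true, y_pred, group):
    """Print false positive and false negative rates for each subgroup."""
    for g in np.unique(group):
        mask = group == g
        tn, fp, fn, tp = confusion_matrix(
            y_true[mask], y_pred[mask], labels=[0, 1]).ravel()
        fpr = fp / (fp + tn) if (fp + tn) else float("nan")
        fnr = fn / (fn + tp) if (fn + tp) else float("nan")
        print(f"group={g}: FPR={fpr:.2f}, FNR={fnr:.2f}")

# Toy illustration with made-up labels, predictions, and group membership.
y_true = np.array([0, 1, 0, 1, 0, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0, 1, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
subgroup_error_rates(y_true, y_pred, group)
```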
Case Study Example#
This case study is for illustrative purposes and does not represent a specific study from the literature.
Scenario: MG is an epidemiologist who is interested in creating a model to predict the effectiveness of a community weight loss intervention in young adults with Type 2 diabetes mellitus. MG hopes that identifying optimal individuals for intervention will enable a broader rollout of this program.
Specific Model Objective: Predict successful weight loss (>5% of initial body weight) after 6 months of program participation in adults aged 18-33 with an established diagnosis of Type 2 diabetes and at least one treatment with oral hypoglycemics. The program intervention consisted of bi-weekly 15-minute online counseling sessions covering the patient’s nutrition, exercise, and medication adherence.
Data Source: Clinical and demographic data from outpatient diabetes center EHR at an academic institution in Boston. Per EHR records: 75% of the participants were white, 15% were black, and 10% were Hispanic/Latino or another race.
Analytic Method: Decision Tree Classifier
Results: The model achieved an overall AUC of 0.892 and an F1 score of 0.41.
Health Equity Considerations:
While the model achieved high overall performance in terms of AUC, several considerations should be made:
A limitation of the data was the lower representation of Hispanic/Latino participants. Due to imbalanced data, the model may present discriminatory classification performance for Hispanic/Latino participants even with the observed high overall AUC. As discussed in a previous section, MG may want to consider techniques to mitigate imbalanced data such as over-sampling, under-sampling, or trying another model.
Overfitting may limit the model’s portability, particularly to settings in which the population demographics differ from those of the Boston sample. Specific to decision tree-based models, certain choices when training the model can lead to overfitting (a sketch addressing these concerns follows this list):
Using all the features to train the model, while perhaps a good idea at first, can lead to overfitting. This means the model may not perform as well when new datasets are introduced.
If the decision tree is allowed to split in an unlimited manner (e.g., no maximum depth), it may become too closely aligned to the specific features of the training data, and make errors when new data are introduced.
It is important to consider that the model results may be biased due to the design of a virtual intervention, which is likely limited to individuals with internet access.
As noted in other lessons, the quality of health data from EHRs can be highly variable and data may be missing, which could affect the outcome of this study (e.g., medication adherence data).
The model may be incorrectly interpreted in its application for deciding where and for whom the weight loss intervention should be deployed; for example, applying a classification threshold for predicted weight loss that depends on race could introduce inequities.
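Below is a hedged sketch of how MG might address the imbalance and overfitting concerns above: a depth-limited decision tree with a minimum leaf size and balanced class weights. The features, sample size, and class balance are synthetic placeholders, not the actual EHR cohort.

```python
# A sketch of a depth-limited, class-weighted decision tree for the case study.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# Imbalanced synthetic outcome: ~20% achieve >5% weight loss (assumption).
X, y = make_classification(n_samples=600, n_features=12,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=4,              # cap depth so the tree cannot split without limit
    min_samples_leaf=20,      # avoid leaves fit to a handful of patients
    class_weight="balanced",  # reweight the minority (successful loss) class
    random_state=0,
).fit(X_tr, y_tr)

print("Test AUC:", round(roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1]), 3))
```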
Considerations for Project Planning#