Lesson 1. Supervised Learning#

Supervised learning uses labeled data to train a model to make predictions and judgments. In the area of health equity, this kind of learning can be applied to pinpoint health inequities, estimate the likelihood of health-related outcomes, and create focused interventions. For instance, supervised learning algorithms can be used to examine patient data to identify groups of people who are at risk for particular health conditions or who are more likely to benefit from preventive treatments. Moreover, supervised learning can be used to identify subgroups of people with differing health outcomes and to create individualized therapies that address inequalities in those outcomes [Ghassemi et al., 2020].

Health Equity and Supervised Learning

  • Identify patterns in public health data to help inform policy decisions and interventions.
  • Use supervised learning to predict and diagnose public health issues and to detect potential outbreaks.
  • Help to identify risk factors for public health issues and develop interventions to address them.
  • Assist in automating and streamlining data collection and analysis, resulting in increased efficiency and better decisions.

What is Supervised Learning?#

Supervised learning models are trained on labeled datasets to learn a mapping between input features and a target variable, and are ultimately used to make predictions on new instances. A labeled dataset is data that comes with known characteristics, such as discrete categorical labels (class labels, or targets) and attributes (features). There are two main categories of supervised learning problems:

  • Classification algorithms are used to identify which discrete category or class a given data point belongs to given its features.

  • Regression algorithms differ from classification algorithms in that they are used to predict continuous numeric quantities. Regression models predict the value of a dependent variable based on given independent variables.
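As an illustrative sketch (not part of this lesson's case study), the two problem types can be contrasted with scikit-learn on synthetic data; all variables and values below are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)

# Hypothetical features for 100 individuals (e.g., two standardized measurements).
X = rng.normal(size=(100, 2))

# Classification: predict a discrete class label (0 = low risk, 1 = high risk).
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)
print(clf.predict(X[:5]))   # discrete class labels

# Regression: predict a continuous quantity (e.g., a lab value).
y_reg = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_reg)
print(reg.predict(X[:5]))   # continuous numeric predictions
```

The only structural difference here is the target: a discrete label for classification versus a continuous quantity for regression.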

This lesson focuses on using classification and regression models for prediction. However, regression is also commonly used to perform statistical inference, which estimates the associations between two or more variables of interest. See the lesson on Biostatistics for more details regarding statistical inference.

How Supervised Learning is Used#

Below are examples of common uses of supervised models within public health:

  • Forecasting: Regression models can be used to forecast future rates of infection based on historical data.

  • Risk Assessment: Regression analysis can be used in identifying cardiovascular risk factors from EHR data and generating risk scores for individual prognostication.

  • Decision Making: Regression analysis can aid decision-making under uncertainty. For example, risk scores generated from regression analysis of patient history, diagnosis, and prognosis can help patients and care providers make more informed treatment decisions.

  • Categorical Classification: Classification can be used to discriminate among a set of features and group samples according to a predicted target. An example would be a classification model that predicts mortality risk for a disease based on demographic, symptom, and epidemiologic features. A second example could be a binary classification model used to predict malignant vs. non-malignant skin lesions based on dermatoscopic image data.
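As a minimal sketch of the forecasting use above, a simple linear trend can be fit to weekly case counts with NumPy; the numbers below are made up for illustration:

```python
import numpy as np

weeks = np.arange(10)  # weeks 0 through 9
# Hypothetical weekly case counts showing a rising trend (made-up numbers).
cases = np.array([52, 60, 75, 88, 97, 112, 121, 133, 148, 159])

# Fit a first-degree polynomial (simple linear regression) to the trend.
slope, intercept = np.polyfit(weeks, cases, deg=1)

# Extrapolate the fitted line one week ahead.
forecast_week_10 = slope * 10 + intercept
```

Real forecasting models would account for seasonality, reporting lags, and uncertainty intervals; this sketch only shows the basic regression-as-forecast idea.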

Methods in Supervised Learning#

Below is a table that lists common supervised learning methods. This table builds upon a Machine Learning Quick Reference guide provided by SAS. For more in-depth information regarding regression analyses and statistical inference, see the lesson on Biostatistics.

Health Equity Considerations#

Supervised learning methods in the context of health equity require careful consideration in data preparation, model training, and interpretation/dissemination of results. Addressing biases, ensuring diversity in the evaluation dataset, and quantifying algorithmic fairness using appropriate metrics are essential steps to promote equitable outcomes in health care decision-making. Additionally, incorporating qualitative evaluation through user research can provide valuable insights into the model’s impact on diverse communities.

Model training and tuning is a process where training data is introduced to an algorithm and model performance is optimized through methods such as iterative hyperparameter tuning and cross-validation. Below are several challenges you may encounter while training and tuning a supervised learning model. For more information on data preparation and dissemination, please visit their respective units.
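A hedged sketch of this training-and-tuning loop, combining hyperparameter search with cross-validation via scikit-learn's GridSearchCV; the data and parameter grid are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))                 # synthetic features
y = (X[:, 0] - X[:, 1] > 0).astype(int)       # synthetic binary target

# Each candidate regularization strength C is scored by 5-fold cross-validation.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)   # hyperparameters with the best mean CV score
```

The same pattern applies to any estimator: define a parameter grid, score each candidate by cross-validation, and keep the best-performing configuration.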

Each challenge below is listed with a description, a health equity example, and recommended best practices.

Challenge: Overfitting

Description: High-variance, low-bias models that fail to generalize well.

Health Equity Example: When making a policy decision that impacts public health, an initial thought may be to include all socio-demographic variables, as well as data on the population's health and comorbidities, in your model. In smaller datasets, including too many variables can cause the model to fit to values that are not significant indicators of the target, which can lead to biased and inaccurate predictions.

Recommended Best Practices:

  • Regularization
  • Noise injection
  • Partitioning or cross-validation
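Two of these practices, regularization and cross-validation, can be sketched with scikit-learn on synthetic data that is deliberately overfitting-prone (many features, few samples); the dataset is hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 30))                  # 40 samples, 30 features
y = X[:, 0] + rng.normal(scale=0.5, size=40)   # only one feature is informative

# Mean cross-validated R^2 for an unregularized vs. an L2-regularized model.
ols_score = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
ridge_score = cross_val_score(Ridge(alpha=10.0), X, y, cv=5, scoring="r2").mean()
# With many irrelevant features, the regularized model tends to score higher.
print(ols_score, ridge_score)
```

Cross-validation surfaces the overfitting (poor held-out scores for the unregularized fit), and regularization mitigates it by shrinking coefficients on uninformative features.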

Challenge: Discriminatory Classification

Description: Different prediction error rates for different subgroups suggest that the model discriminates against particular subgroups. The choice of the target variable can introduce bias, and underrepresentation of some populations in the training data can lead to discriminatory predictions. Further, due to protected attributes, “existing approaches for reducing discrimination induced by prediction errors may be unethical or impractical to apply in settings where predictive accuracy is critical, such as in healthcare.” [Chen et al., 2018]

Health Equity Example: A healthcare model predicts which patients might be discharged earliest in order to direct limited case management resources efficiently, prevent delays, and open up more beds. If the model discovers that residence in certain zip codes predicts longer stays, and those zip codes are socioeconomically depressed or predominantly African American, then the model might disproportionately allocate case management resources to patients from wealthier, predominantly white neighborhoods [Rajkomar et al., 2018].

Recommended Best Practices:

  • Logistic regression can use fairness-aware regularizers to penalize differences in classification between protected and non-protected classes.
  • Naive Bayes can split the dataset based on the value of the protected attribute and train separate classifiers for each group.
  • An ensemble of independent ML models can compensate for disparities in predictions among models and apply penalties to ensure fairness.
  • K-nearest neighbors can mitigate bias with larger k values or by implementing a fair near-neighbor algorithm.
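One simple diagnostic for discriminatory classification is to compare error rates across subgroups directly. The sketch below computes per-group false-negative rates with NumPy; the labels, predictions, and group memberships are hypothetical:

```python
import numpy as np

# Hypothetical true labels, model predictions, and a protected attribute.
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1])
group = np.array(["A"] * 6 + ["B"] * 6)

def false_negative_rate(y_t, y_p):
    """Fraction of true positives the model missed."""
    positives = y_t == 1
    return float(np.mean(y_p[positives] == 0))

for g in ["A", "B"]:
    mask = group == g
    print(g, false_negative_rate(y_true[mask], y_pred[mask]))
```

A large gap between the subgroup rates is one signal that the model's errors fall disproportionately on a particular group, even when overall accuracy looks acceptable.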

Challenge: Different Costs of Misclassification

Description: Carefully consider when a false positive error can cause harm to an individual in a protected class. Using ROC AUC to measure diagnostic accuracy does not account for the different costs of misclassification; it lacks clinical interpretability, and the confidence scales used to construct ROC curves can be inconsistent and unreliable.

Health Equity Example: For a colonography or mammogram, using ROC AUC to determine the diagnostic accuracy of radiological tests could be problematic, because ROC AUC does not consider misclassification costs, which are important for assessing classification fairness.

Recommended Best Practices:

  • Alternatives to ROC AUC: the net benefit method, net reclassification improvement, net reclassification index, relative utility, and weighted comparison, with application to radiology.
  • Dependency-aware tree construction: evaluate the accuracy and the level of unfairness caused by a splitting criterion for a tree node; in addition to optimizing overall accuracy using information gain (IGC), consider the gain in sensitivity to the protected attribute (IGS).
  • Leaf relabeling: an alternative to classifying using the majority class of the leaves.
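As an illustration of one of the listed alternatives, the net benefit metric from decision curve analysis weights false positives by the odds of the chosen decision threshold, making misclassification costs explicit. The counts below are hypothetical:

```python
def net_benefit(tp, fp, n, p_t):
    """Net benefit at decision threshold p_t.

    tp, fp: true/false positive counts; n: total patients.
    False positives are discounted by the threshold odds p_t / (1 - p_t),
    so a low threshold (tolerating false positives) penalizes them less.
    """
    return tp / n - (fp / n) * (p_t / (1 - p_t))

# Hypothetical screening results for 1000 patients at a 10% risk threshold.
nb = net_benefit(tp=80, fp=200, n=1000, p_t=0.10)
print(nb)
```

Unlike ROC AUC, this quantity changes with the threshold, so it reflects the clinical trade-off a given deployment actually makes between missed cases and unnecessary follow-up.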

[Chen et al., 2018, d'Alessandro et al., 2017, Halligan et al., 2015, Har-Peled and Mahabadi, 2019, Kamiran et al., 2010, Kumar et al., 2020, Rajkomar et al., 2018, Yeom et al., 2018]

Case Study Example#

Case study is for illustrative purposes and does not represent a specific study from the literature.

Scenario: MG is an epidemiologist who is interested in creating a model to predict the effectiveness of a community weight loss intervention in young adults with Type 2 diabetes mellitus. MG hopes that identifying optimal individuals for intervention will enable a broader rollout of this program.

Specific Model Objective: Predict successful weight loss (> 5% initial body weight) after 6 months of program participation in adults aged 18-33 with established diagnosis of Type 2 diabetes and at least one treatment with oral hypoglycemics. Program intervention consisted of bi-weekly online counseling for 15 minutes covering the patient’s nutrition, exercise, and medication adherence.

Data Source: Clinical and demographic data from outpatient diabetes center EHR at an academic institution in Boston. Per EHR records: 75% of the participants were white, 15% were black, and 10% were Hispanic/Latino or another race.

Analytic Method: Decision Tree Classifier

Results: The model achieved an overall AUC of 89.2 and an F1 score of 0.41.

Health Equity Considerations:

While the model achieved an overall high performance in terms of AUC, there are several considerations that should be made:

  • A limitation of the data was the lower representation of Hispanic/Latinos. Due to imbalanced data, the model may present discriminatory classification performance for Hispanic/Latino participants even with the observed high AUC overall. As discussed in a previous section, MG may want to consider techniques to mitigate imbalanced data such as over-sampling, under-sampling or trying another model.

  • Overfitting may limit the model’s portability, particularly to settings in which the population demographics differ from those of the Boston sample set. Specific to decision tree-based models, certain choices when training the model can lead to overfitting:

    • Using all the features to train the model, while perhaps a good idea at first, can lead to overfitting. This means the model may not perform as well when new datasets are introduced.

    • If the decision tree is allowed to split in an unlimited manner (e.g., no maximum depth), it may become too closely aligned to the specific features of the training data, and make errors when new data are introduced.

  • It is important to consider that the model results may be biased due to the design of a virtual intervention, which is likely limited to individuals with internet access.

  • As noted in other lessons, the quality of health data from EHRs can be highly variable, and data may be missing, which could affect the outcome of this study (e.g., medication adherence data).

  • The model may be incorrectly interpreted when applied to decide where and for whom the weight loss intervention should be deployed, for example, by using a classification threshold for predicted weight loss that depends on race.
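The decision tree overfitting controls discussed above can be sketched with scikit-learn; the data here are synthetic stand-ins, not the case study's EHR data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))                 # 10 features; only one is informative
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # noisy labels

# An unconstrained tree can split until every training leaf is pure,
# memorizing noise; capping max_depth limits how finely it can split.
deep_score = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
shallow_score = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=5).mean()
print(deep_score, shallow_score)
```

In settings like this, with noisy labels and mostly irrelevant features, the depth-limited tree often achieves the better cross-validated accuracy, which is the generalization concern raised above.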

Considerations for Project Planning

  • Are you currently using or considering supervised learning methods, with a focus on health equity, in your project?
    • If yes, what specific supervised learning methods have you been using or plan to use to address health disparities, promote equitable health outcomes, or support public health interventions?
    • If not, are there areas in your work where you see the potential benefits of employing supervised learning methods to analyze health-related data, identify social determinants of health, and develop strategies for achieving health equity and improving overall well-being in diverse populations?