Lesson 2. Unsupervised Learning#

Unsupervised learning methods are powerful tools that help detect patterns in a dataset and latent relationships between variables. They are especially useful in supporting supervised learning problems in downstream model development and analysis by performing crucial tasks such as data compression, feature selection, and feature extraction. Unlabeled datasets pose a challenge, however, because they have no predefined labels against which to validate model results; interpreting and validating those results therefore often requires human intervention.

Health Equity and Unsupervised Learning

  • Unsupervised learning can be used to identify any changes in health behaviors or in the environment that could be contributing to the spread of a disease. This can help public health officials design and implement effective intervention programs to reduce the risk of disease spread.
  • The potential bias that can emerge from unsupervised learning can arise from the data points selected for the algorithm, the algorithm selected, and the assumptions made while developing the algorithm. Additionally, if the assumptions made during the development of the algorithm are not reflective of the population in question, the algorithm can produce biased results.
  • Common unsupervised learning applications include analyzing health behavior with clustering, advancing health informatics and surveillance through association rule mining, and improving performance in health risk prediction with dimensionality reduction.

What is Unsupervised Learning?#

Unsupervised learning models are defined by their use of datasets that are not labeled and therefore are a class of algorithms that only act on features within the data they are processing without additional guidance. Unsupervised learning is often used for representation learning tasks. Representation learning focuses on learning the data representations needed for prediction tasks. It is used in public health to develop predictive models that can identify health outcomes, detect risk factors, and analyze trends in public health data. This lesson will cover techniques that are used to learn the meaningful representations of data, which are then used for tasks such as classification, clustering, or other machine learning tasks.

Clustering algorithms are used to group data based on their feature similarities or differences. These are useful in practice to help categorize unlabeled data into similar groups (clusters) with many different algorithms available for linear and non-linear mappings.
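A minimal sketch of clustering in practice, using scikit-learn's KMeans on synthetic data (the dataset and parameter values here are illustrative, not drawn from a specific study):

```python
# Group unlabeled points into clusters by distance to learned centroids.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # synthetic, unlabeled points
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])  # cluster assignment (0, 1, or 2) for the first ten points
```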

Association Rule algorithms are used as a rule-based approach to find relationships (associations) between variables within a dataset. The associations are based on if/then statements, where the “if” value is termed the antecedent and the “then” value the consequent. Three metrics are applied to the rules: 1) support, the frequency with which the variable values occur together; 2) confidence, how often the rule holds true; and 3) lift, the strength of the rule relative to chance co-occurrence.
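As an illustration, the sketch below mines rules with the apriori algorithm, assuming the third-party mlxtend library is available; the condition names are hypothetical:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Hypothetical co-occurring conditions recorded per patient
transactions = [
    ["hypertension", "diabetes", "obesity"],
    ["hypertension", "obesity"],
    ["diabetes", "obesity"],
    ["hypertension", "diabetes"],
    ["hypertension", "diabetes", "obesity"],
]

# One-hot encode the transactions, then mine frequent itemsets and rules
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)

# Each rule reports its antecedents, consequents, support, confidence, and lift
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```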

Dimensionality Reduction algorithms are used to reduce the number of features (dimensions) of a dataset in order to help interpret the most relevant/significant features within a dataset. It naturally serves as a form of data compression, helping to reduce computational complexity and improve scalability. It also can be used in feature engineering prior to supervised methods, helping to reduce noise and redundancy in the data and therefore improve model accuracy in downstream analysis [Sorzano et al., 2014]. Moreover, dimensionality reduction helps address several common challenges:

  • Curse of Dimensionality occurs when there is a high number of dimensions (features) within the data. Higher dimensionality tends to make the data sparser, which implies that the amount of training data needed in order to achieve a reliable result grows exponentially with dimensionality. This phenomenon also implies that there is a point at which adding more features actually diminishes predictive performance for a supervised classifier.

  • Overfitting is where a model may fit too closely to the training set and therefore does not generalize well to new data. In other words, overfitting is the condition where the model learns to describe the random error (noise) within the data rather than the true relationships between variables. Overfitting can occur when the model is too complex, meaning there are too many features relative to the number of observations, or too flexible, meaning it’s not regularized enough.

  • Multicollinearity occurs when features are correlated and therefore redundant. Multicollinearity can be prevalent in survey data that often has high dimensionality (many features).
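To make the dimensionality reduction step concrete, here is a brief sketch using principal component analysis (PCA) in scikit-learn; the data are a synthetic stand-in for a wide survey dataset:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))  # stand-in for a high-dimensional survey dataset

# Standardize, then keep only enough components to explain 90% of the variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```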

Unlabeled Data and Model Interpretability#

Unlabeled data lacks predefined labels that would otherwise determine natural groupings within the dataset and can come in different modalities, including photos, audio, video, or text content such as a tweet or article. Unlike supervised learning models, which train on input/output pairs, unsupervised learning models learn to categorize data based on similarities, differences, and patterns that exist within the data. Further, while supervised learning models are often tested for accuracy, unsupervised learning model results cannot be validated directly because the data are unlabeled. This can contribute to varying degrees of interpretability of unsupervised learning model results. However, while they are harder to evaluate, they are often useful in practice.

How Unsupervised Learning is Used#

Below are a few examples of unsupervised learning algorithms and their application to public health:

  • Analyzing Health Behavior: Clustering is used to process data into similar groups based on trends found within the data. Clustering algorithms can be used to help classify health behaviors into groups that differ from manual groupings based on some demographic feature. Identified clusters can help discover associations of health behaviors according to sex, age, race, and other socio-economic factors, resulting in improved strategies for effective intervention.

  • Advancing Health Informatics: Association rule mining is used to generate associations between variables of an unlabeled dataset. Association rule mining could be used to facilitate knowledge discovery by performing knowledge extraction from health data. For example, it could be used to discover correlations among diseases from a health dataset.

  • Public Health Surveillance: Association rule mining could be utilized in public health surveillance in order to identify which events could lead to the spread of infection [Brossette et al., 1998].

  • Improving Performance in Health Risk Prediction: Dimensionality reduction is a method used to reduce the number of features (dimensions) within a dataset while minimizing information loss. Dimensionality reduction may be used to improve existing models, such as a logistic regression model trained to predict risk of chest pain in patients.

Unsupervised Learning Algorithms#

When choosing any machine learning model for analysis, it is important to keep in mind the implicit assumptions that may arise from the individual conducting the analysis, the types of datasets used, and how the results are interpreted. It is generally good practice to perform visual exploratory analysis on your dataset to help explain your model inputs and outputs.

Below is a table that lists common unsupervised learning models. This table builds upon a Machine Learning Quick Reference guide provided by SAS [Sassoftware, n.d.].

Health Equity Considerations#

Unsupervised learning algorithms work with unlabeled data, which means they have no prior labels against which to validate their outputs. This poses challenges, especially when preparing your data and model parameters. For more information about data preparation techniques, visit the Data Preparation Unit. Model training and tuning is a process where training data is introduced to an algorithm and model performance is improved through iterative hyperparameter tuning against a given objective function. Below are several challenges you may encounter while training and tuning an unsupervised learning model.

Challenge: Correct number of clusters unknown

Challenge Description: For clustering algorithms, you will not automatically know the correct number of clusters to choose for your data. There are several techniques you can employ to estimate a good number of clusters based on your data.

Health Equity Example: When clustering groups of data based on demographic information and seroprevalence rates, too many or too few clusters may cause you to lose potentially useful insights and lead to an unequal spread of variance. This in turn could lead to biased and inaccurate policy analysis.

Recommended Best Practice: There are several metrics and graphs you can run to find an optimal number of clusters for your dataset (a sketch of the first two follows this list). For details on these metrics, visit this article.
  • Elbow curve
  • The Silhouette Method
  • Gap Statistic
  • Sum of Squares Method
  • Clustree
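As referenced above, a sketch of the first two techniques (elbow curve and silhouette method) on synthetic data, assuming scikit-learn and matplotlib are available:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(2, 10)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                         # within-cluster sum of squares
    silhouettes.append(silhouette_score(X, km.labels_))

# Elbow curve: look for the bend where inertia stops dropping sharply
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (WSS)")
plt.show()

# Silhouette method: pick the k with the highest average silhouette score
print(max(zip(silhouettes, ks)))
```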

Challenge: Standardization

Challenge Description: Standardization is a scaling technique that centers values around their mean with a standard deviation of 1. It is especially important when clustering data, as it prevents features measured on larger scales from arbitrarily dominating the other features.

Health Equity Example: Clustering can be used to form groupings that help identify patterns in complicated data such as clinical data. However, clinical data often have varying scales. If you are trying to group records based on similar patterns, features with larger scales could skew the group representations. By standardizing the features, the clustering model weighs features on smaller scales comparably, garnering more insight from your dataset.

Recommended Best Practice:
  • Use the formula for standardization, \[ z = \frac{X - \mu}{\sigma} \], where \( \mu \) is the feature mean and \( \sigma \) its standard deviation (see the sketch below).
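A short sketch of standardization with scikit-learn's StandardScaler, which applies the formula above to each feature; the clinical values are hypothetical:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical clinical features on very different scales:
# age (years), systolic blood pressure (mmHg), cholesterol (mg/dL)
X = np.array([[34.0, 118.0, 180.0],
              [61.0, 145.0, 240.0],
              [47.0, 130.0, 210.0]])

X_scaled = StandardScaler().fit_transform(X)  # z = (X - mu) / sigma per feature
print(X_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_scaled.std(axis=0))   # approximately 1 for each feature
```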

Challenge: Curse of Dimensionality

Challenge Description: Various phenomena can occur when performing machine learning methods on high-dimensional datasets (higher error, overfitting to irrelevant features, irregular patterns identified).

Health Equity Example: Survey data often has high dimensionality (many features). Dimensionality reduction can be used both to help interpret the relevant features within survey data and to aid the performance of downstream model analysis.

Recommended Best Practice:
  • Perform dimensionality reduction techniques such as principal component analysis.

Challenge: Sensitive to initialization

Challenge Description: K-means clustering chooses n centroid values within the dataset, where n is the number of clusters you choose. The initial centroid values are chosen at random, and the algorithm then begins to form clusters based upon the distance all the values have from the centroids. The problem lies in the fact that the randomly chosen centroids may not be the optimal initial points to build the k-means clusters on.

Health Equity Example: K-means could be used to identify which region's population has a higher risk of developing type 2 diabetes based on glucose, blood pressure, insulin, BMI, age, and similar values. K-means iteratively improves on the initial centroids chosen, which improves the clustering process, but if the model is not adequately iterated, your results could be skewed. In this case, public messaging on type 2 diabetes could be directed to the wrong populations, leading to an ineffective public health campaign.

Recommended Best Practice:
  • Run the k-means algorithm many times, until you notice your loss function shows minimal decrease in value; this means your model is converging.
  • Several advanced k-means variants are being developed to help address the initial centroid issue. K-means++ is one such algorithm that helps improve initialization (see the sketch after this list).
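A minimal sketch of both recommendations with scikit-learn, where init="k-means++" selects spread-out initial centroids and n_init restarts the algorithm with different seeds, keeping the run with the lowest loss (inertia):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

# k-means++ initialization plus 10 restarts; the best (lowest-inertia) run is kept
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)  # final within-cluster sum of squares of the best run
```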

Challenge: Overfitting

Challenge Description: High-variance, low-bias models that fail to generalize well to new data.

Health Equity Example: When trying to make a policy decision that impacts public health, an initial thought may be to include all socio-demographic variables, as well as data on the population's health and co-morbidities, in your model. Including too many variables in smaller datasets could lead to a model fitting to values that are not significant indicators of the target value.

Recommended Best Practice:
  • Perform dimensionality reduction methods
  • Neural networks: add a dropout layer or use early stopping
  • Clustering: perform the methods for identifying the optimal cluster count
Challenge: Hyperparameter tuning

Challenge Description: Choosing the optimal model specifications (hyperparameters) that a model may require before it begins training.

Health Equity Example: Without proper hyperparameter tuning, your model could produce suboptimal results with higher errors.

Recommended Best Practice:
  • Test different parameter values in your model
  • Local search optimization, including genetic algorithms
  • Grid search and random search (see the sketch below)
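A small grid-search sketch for an unsupervised model: because there are no labels, the candidate hyperparameters below are scored with the silhouette score (an illustrative choice; any internal validity metric could stand in):

```python
from itertools import product

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

best = None
for k, init in product(range(2, 8), ["k-means++", "random"]):
    labels = KMeans(n_clusters=k, init=init, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if best is None or score > best[0]:
        best = (score, k, init)
print(best)  # (best silhouette score, k, init strategy)
```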

Case Study Example#

This case study is for illustrative purposes and does not represent a specific study from the literature.

Scenario: AR is a public health researcher investigating opioid death rates in counties across the U.S. The goal is to determine if certain county-level features are valuable in predicting higher risk of opioid overdose deaths. These features may provide hypotheses for public policy interventions.

Specific Model Objective: Use a clustering algorithm to generate geospatial clusters of counties across the U.S. with similar socio-demographic attributes. After successful clusters are determined, cluster assignments will be used as features in a regression model to predict counties at risk for higher opioid overdose death rates, denoting potential ideas for public policy interventions.

Data Source: AR has a dataset with variables for yearly opioid prescriptions and opioid deaths. Additionally, they are leveraging the AHRQ Social Determinants of Health (SDOH) Database which contains county-level variables on social contexts, economic contexts, healthcare contexts, education, and physical infrastructure.

Analytic Method: K-Means Clustering

Results: AR initially chose a value of k=3 and ran k-means on the data. After assigning each county a value from 1-3 for its cluster assignment, the regression model had an AUC of 52.5. AR tried several values of k, settling on a value of 8, which resulted in an AUC of 79.2.
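A hypothetical sketch of AR's two-stage pipeline, with synthetic stand-ins for the county-level SDOH features and the high/low death-rate label (so the numbers will not reproduce the AUCs above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 12))    # stand-in for county SDOH features
y = rng.integers(0, 2, size=3000)  # stand-in label: high vs. low overdose death rate

# Stage 1: cluster counties on standardized features
X_scaled = StandardScaler().fit_transform(X)
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X_scaled)

# Stage 2: one-hot encode cluster assignments and use them as regression features
cluster_features = np.eye(8)[clusters]
X_tr, X_te, y_tr, y_te = train_test_split(cluster_features, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```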

Health Equity Considerations:

A number of technical and data factors in this cluster analysis may result in clusters that overlook or incorrectly characterize certain community features:

  • k-Means clustering is sensitive to initialization and cluster size. Thus, if the initial clusters chosen by this unsupervised approach are off, their predictive power could be skewed in one direction or the other.

  • The value of k can have significant impact and may require iterative experimentation. Newer clustering algorithms such as HDBSCAN can determine an appropriate number of clusters automatically.

  • k-Means is sensitive to outliers. Thus, if AR’s data contained outliers that were not managed, they could shift the clusters away from their optimal positions, affecting overall performance.

  • There are a variety of methods used to determine k-means performance such as evaluating the elbow curve. Although it is also common to visually plot clusters, AR should avoid using only visual confirmation as clusters in the 2D space can be difficult to confirm.

  • The k-means algorithm is based around centroids, and not all data falls into naturally circular clusters. Thus, AR may want to try other approaches such as spectral clustering (see the sketch after this list).

  • Since AR is using geospatial data centered on counties, the analysis is likely to miss important geospatial clusters in populations at the sub-county level. This could lead to overlooking important group and community factors, so a more granular geospatial area may be considered if data are accessible.

  • Finally, as noted, AR’s dataset only contains counts of opioid prescriptions and opioid deaths. Data are not available on illicit opioid use or on opioid types (e.g., synthetic opioids). Therefore, there are likely factors that could be linked to geospatial attributes, and thereby help the model performance, that are not included in this dataset.
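As referenced above, a minimal sketch contrasting k-means with spectral clustering on synthetic, non-circular data (the classic two-moons shape), assuming scikit-learn:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons

# Two interleaving crescents: natural clusters that are not circular
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", random_state=0
).fit_predict(X)

# Spectral clustering separates the crescents, while centroid-based k-means
# splits them with a roughly straight boundary instead.
print(kmeans_labels[:10], spectral_labels[:10])
```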

Considerations for Project Planning

  • Are you currently using or considering unsupervised learning methods, particularly in the context of health equity, in your project?
    • If yes, what specific methods are you using to address health disparities and improve health equity?
    • If not, consider the potential benefits of employing unsupervised learning to identify hidden patterns related to health disparities and promote health equity.

Resources#

  1. Applications of association rule mining in health informatics: a survey

  2. Unsupervised pattern discovery in electronic health care data using probabilistic clustering models

  3. Best Practices for Machine Learning Applications