{ "cells": [ { "cell_type": "markdown", "id": "ea6c73de", "metadata": {}, "source": [ "# Lesson 2. Unsupervised Learning\n", "Unsupervised learning methods are powerful tools that help detect patterns in a dataset and latent relationships between variables. They are especially useful in supporting supervised learning problems in downstream model development and analysis by performing crucial tasks such as data compression, feature selection, and feature extraction. Unlabeled datasets pose a challenge because they have no predefined labels against which to validate model results, so human intervention is required to evaluate the output." ] }, { "cell_type": "markdown", "id": "3dc21c35", "metadata": {}, "source": [ "
\n",

| Model | Common Usage | Suggested Usage | Suggested Scale | Interpretability | Common Concerns |
|---|---|---|---|---|---|
| Artificial Neural Network (ANN) | Clustering |  | Usually small to medium datasets (stochastic gradient descent optimization drastically increases scalability) | Low |  |
| k-Means Clustering | Clustering |  | Small datasets | Moderate |  |
| Hierarchical Clustering | Clustering |  | Small datasets | Moderate |  |
| Spectral Clustering | Clustering |  | Small datasets | Moderate |  |
| Association Rules | Association Rule Learning |  | Medium to large transactional datasets | Moderate |  |
| Principal Components Analysis (PCA) | Dimensionality reduction |  | Small to large datasets | Low |  |
| Challenge | Challenge Description | Health Equity Example | Recommended Best Practice |
|---|---|---|---|
| Correct number of clusters unknown | For clustering algorithms, you will not automatically know the correct number of clusters to choose for your data. There are several techniques you can employ to estimate a good number of clusters based on your data. | When clustering groups of data based on demographic information and seroprevalence rates, too many or too few clusters may cause you to lose potentially useful insights and lead to an unequal spread of variance. This in turn could lead to biased and inaccurate policy analysis. | There are several metrics and graphs you can run to find an optimal number of clusters for your dataset. For details on these metrics, visit this article. |
| Standardization | Standardization is especially important when clustering data, as it prevents features that use larger scales from arbitrarily dominating the other features. Standardization is a scaling technique that centers values around their mean with a standard deviation of 1. | Clustering can be used to form groupings that help identify patterns in complicated data such as clinical data. However, clinical data often has varying scales. If you are trying to group records based on similar patterns, features with larger scales could skew the group representations. By standardizing the features, the clustering model will also consider features with smaller scales, garnering more insights from your dataset. |  |
| Curse of Dimensionality | Various phenomena that can occur when performing machine learning methods on high-dimensional datasets (higher error, overfitting to irrelevant features, irregular patterns identified) | Survey data often has high dimensionality (many features). Dimensionality reduction can be used both to help interpret the relevant features within survey data and to aid the performance of downstream model analysis. | Perform dimensionality reduction techniques such as principal component analysis. |
| Sensitive to initialization | k-means clustering chooses n centroid values within the dataset, where n is the number of clusters you choose. The initial centroid values are chosen at random, and the algorithm then forms clusters based on the distance of each value from the centroids. The problem is that the randomly chosen centroids may not be the optimal initial points on which to build the k-means clusters. | k-means could be used to identify which region's population has a higher risk of developing diabetes based on glucose, blood pressure, insulin, BMI, age, and similar values. k-means iteratively improves the initial centroids, which improves the clustering process, but if the model is not adequately iterated, your results could be skewed. In this case, public messaging on type 2 diabetes could be directed to the wrong populations, leading to an ineffective public health campaign. |  |
| Overfitting | High-variance and low-bias models that fail to generalize well | When trying to make a policy decision that impacts public health, an initial thought may be to include all socio-demographic variables as well as data on the population's health and co-morbidities in your model. Including too many variables in smaller datasets could lead to a model fitting to values that are not significant indicators of the target value. |  |
| Hyperparameter tuning | Choosing the optimal model specifications (hyperparameters) that a model may require before beginning its processes | Without proper hyperparameter tuning, your model could provide suboptimal results with higher errors. |  |
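For the "correct number of clusters unknown" challenge, one standard technique is the elbow method: run the clustering for several values of k, record the within-cluster sum of squared distances (the inertia), and look for the k where improvements flatten out. Below is a self-contained sketch in plain Python; the toy blob data and every name in it are illustrative assumptions, not from the lesson.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_inertia(points, k, iters=20, seed=0):
    """Run a minimal k-means, then return the within-cluster sum of
    squared distances to the final centroids (the 'inertia')."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sum(squared_distance(p, centroids[i])
               for i, c in enumerate(clusters) for p in c)

# Three tight toy blobs centered at (0, 0), (8, 0), and (4, 7)
offsets = [(0.0, 0.0), (0.3, 0.1), (0.1, 0.4), (0.4, 0.3)]
blobs = [(cx + dx, cy + dy)
         for cx, cy in [(0, 0), (8, 0), (4, 7)]
         for dx, dy in offsets]

# Inertia for k = 1..5; it drops steeply up to the true blob count,
# then the curve flattens -- that bend is the "elbow"
inertias = {k: kmeans_inertia(blobs, k) for k in range(1, 6)}
```

In practice you would plot the inertia curve and pick the k at the bend; silhouette scores are a common complementary metric, and running several random seeds and keeping the best result (what scikit-learn's `n_init` parameter does) also mitigates the "sensitive to initialization" row above.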
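The standardization row can be made concrete with a z-score sketch in plain Python. The toy clinical-style columns (ages in years, charges in dollars) are assumptions chosen to show two very different scales.

```python
def standardize(column):
    """Scale a list of numbers to mean 0 and (population) standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / std for x in column]

ages = [30, 40, 50, 60, 70]               # years: small scale
charges = [1000, 2000, 3000, 4000, 5000]  # dollars: much larger scale

z_ages = standardize(ages)
z_charges = standardize(charges)
```

After scaling, both columns sit on the same unit scale, so `charges` no longer dominates Euclidean-distance computations in a clustering model just because its raw numbers are bigger. scikit-learn's `StandardScaler` performs the same transformation on real datasets.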
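For the curse-of-dimensionality row, principal component analysis can be sketched directly from the covariance matrix with NumPy: center the data, take the eigenvectors of the covariance matrix, and project onto the top components. The toy "survey" data below (three questions, one nearly redundant) and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy survey matrix: 100 respondents, 3 questions, where q3 is nearly
# a copy of q1 -- so most variance lies in fewer than 3 directions.
q1 = rng.normal(size=100)
q2 = rng.normal(size=100)
q3 = q1 + 0.05 * rng.normal(size=100)
X = np.column_stack([q1, q2, q3])

# PCA by hand: center, eigendecompose the covariance matrix,
# sort components by variance explained, and project.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]        # re-sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()      # fraction of variance per component
X2 = Xc @ eigvecs[:, :2]                 # keep the top 2 components
```

Because q3 nearly duplicates q1, the first two components capture almost all the variance, so the 3-feature dataset compresses to 2 dimensions with little loss. On real survey data you would typically use scikit-learn's `PCA` and keep enough components to cover a chosen share of variance.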