Lesson 3. Handling Outliers#

An outlier is a value that lies abnormally far from the other values in a dataset. To identify outliers, we can compute upper and lower bounds that depend on the distribution of our data. Outliers can then be kept, removed, or recoded. Handling outliers improves the accuracy and representativeness of the data, since many algorithms and techniques are sensitive to them.

Health Equity and Handling Outliers

  • Outliers can be caused by errors in data collection, processing, or analysis, and may also be due to extreme cases or rare events. Taking the time to properly handle outliers can help ensure that public health research is as accurate and reliable as possible.
  • If the outliers are not identified and handled properly, the results of the study may be inaccurate or misleading, which could lead to incorrect recommendations for public health policy.
  • It is important for researchers to consider outliers and to identify and address them appropriately. This may involve removing them from the dataset or adjusting the analysis to account for the outlier's influence.

Handling Outliers#

Handling outliers improves the accuracy and representativeness of the data, since many algorithms and techniques are sensitive to outliers. To identify outliers, we can find upper and lower bounds that depend on the distribution of the data. If the data are approximately normally distributed, outliers are the values outside the 3-sigma limits. If the data are left- or right-skewed, outliers can be visualized with box-and-whisker plots and identified by their distance, in multiples of the interquartile range, from the first and third quartiles. Other techniques measure the spread of the data points from the mean or center [Ferketich and Verran, 1994].
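As a minimal sketch, both rules can be applied in Python roughly as follows; the bmi column and the simulated values are hypothetical, not from a real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a roughly normal "bmi" column with two injected extreme values.
rng = np.random.default_rng(42)
df = pd.DataFrame({"bmi": np.append(rng.normal(27, 4, 500), [62.0, 3.5])})

# 3-sigma rule (appropriate when the data are approximately normal)
mean, std = df["bmi"].mean(), df["bmi"].std()
sigma_outliers = df[(df["bmi"] < mean - 3 * std) | (df["bmi"] > mean + 3 * std)]

# IQR rule (more robust when the data are skewed)
q1, q3 = df["bmi"].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = df[(df["bmi"] < q1 - 1.5 * iqr) | (df["bmi"] > q3 + 1.5 * iqr)]

print(len(sigma_outliers), len(iqr_outliers))
```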

Outliers and Public Health#

Outliers can have a substantial impact on health equity research because they can skew the data and create an inaccurate picture of health disparities. For instance, if an outlier population is overrepresented in a set of health data, some health inequalities may appear worse than they actually are. Conversely, if an outlier community is underrepresented, some health inequalities may appear less severe than they actually are. Outliers can also lead to inaccurate conclusions about the causes of health inequalities, because they may not accurately represent the spectrum of experiences among varied populations. As a result, it is critical for researchers to spot and account for outliers when examining health equity data [Meghani et al., 2014].

Methods for Identifying Outliers#

Several methods for identifying outliers are described below. Univariate outlier detection methods identify outliers using values from a single attribute, while multivariate outlier detection methods identify outliers across multiple dimensions or variables [Meghani et al., 2014]. A short code sketch applying a few of these methods follows the list.

If you are already familiar with methods for handling outliers, please continue to the next section.

  • Box and Whisker Plots: To create a box and whisker plot, first calculate the median and the quartiles of the dataset. The plot is drawn as a box between the first quartile (Q1) and the third quartile (Q3), with a line marking the median and whiskers extending toward the minimum and maximum values. Any values that fall beyond the whiskers (typically more than 1.5 times the interquartile range below Q1 or above Q3) are considered outliers.

  • Z-Score Method: The Z-score method is a commonly used statistical method for identifying outliers. The Z-score, or standard score, measures how many standard deviations an observation is from the mean. To use the method, first calculate the mean and standard deviation of the data set, then calculate the Z-score for each data point. Any data point with a Z-score greater than 3 or less than -3 is considered an outlier.

  • Interquartile Range (IQR) Method: The interquartile range (IQR) is calculated by subtracting the first quartile (Q1) from the third quartile (Q3). Once the IQR is determined, any data points that fall below Q1 - 1.5 times the IQR or above Q3 + 1.5 times the IQR are considered outliers. These values deviate significantly from the majority of the data points and may require special attention or investigation.

  • Standard Deviation Method: This method involves calculating the standard deviation of a dataset and then identifying any data points that are more than a certain number of standard deviations away from the mean. For example, if the standard deviation of a dataset is 1 and the mean is 10, any data points more than 3 standard deviations away (i.e., 13 or higher, or 7 or lower) could be considered outliers. Alternatively, the 1.5-times-IQR method described above is similar but slightly more robust.

  • DBSCAN Algorithm (Density-Based Spatial Clustering of Applications with Noise): DBSCAN is a popular unsupervised machine learning algorithm that can be used to identify outliers. It clusters data points based on their density and assigns each data point a label depending on its location within the clusters. Outliers are identified as data points that do not belong to any cluster or that lie far from the other data points in the same cluster.

  • Isolation Forest Algorithm: The Isolation Forest algorithm is an unsupervised machine learning algorithm for anomaly detection. It isolates individual data points by repeatedly and randomly selecting a feature and a split value between the minimum and maximum of that feature, building a tree until every point is isolated. Each point is then scored by how many splits were required to isolate it; points that are isolated with fewer splits are easier to separate from the rest of the data and are more likely to be outliers.
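As a rough sketch, a few of these methods can be applied in Python as follows. The synthetic sample and the thresholds (a contamination rate of 0.01 for the Isolation Forest and eps=10 for DBSCAN) are illustrative assumptions, not recommended defaults:

```python
import numpy as np
from scipy import stats
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

# Hypothetical one-dimensional measurements with two injected extreme values.
rng = np.random.default_rng(0)
x = np.append(rng.normal(100, 15, 300), [420.0, 510.0]).reshape(-1, 1)

# Z-score method: flag points more than 3 standard deviations from the mean.
z = np.abs(stats.zscore(x.ravel()))
z_outliers = np.where(z > 3)[0]

# Isolation Forest: -1 marks points that are easy to isolate (likely outliers).
iso = IsolationForest(contamination=0.01, random_state=0)
iso_outliers = np.where(iso.fit_predict(x) == -1)[0]

# DBSCAN: points assigned to no cluster (label -1) are treated as noise/outliers.
db = DBSCAN(eps=10, min_samples=5).fit(x)
db_outliers = np.where(db.labels_ == -1)[0]

print(z_outliers, iso_outliers, db_outliers)
```

In practice, parameters such as eps, min_samples, and the contamination rate should be tuned to the scale and density of your own data rather than copied from this sketch.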

When should outliers be removed or changed?#

Outliers should be removed when they are believed to be the result of errors or when they skew the data in an undesirable way. Outliers can arise from several sources [Boukerche et al., 2020]:

  • Natural Variability: Outliers can be caused by natural variability in the data, such as a rare event or an extreme value. When outliers represent natural variability in the data, they may contain valuable information that should be taken into account.

  • Data Entry: Outliers can occur due to mistakes in data entry, such as typing in incorrect numbers or transposing digits.

  • Measurement Error: Outliers can be caused by errors in measurement. An example of a measurement error can be when a sensor malfunctions or a device reads incorrect values.

  • Data Processing: Outliers can be caused by errors in data processing and cleaning steps; aggregation steps, for example, can introduce outliers during preprocessing.

  • Sampling: Outliers can be caused by errors in data sampling, such as when data points are incorrectly selected or weighted.

It is important to understand your data, your research, and the possible context your outliers may have in your specific analysis. There are several methods to address outliers; each is illustrated in the code sketch after this list:

  • Removing values: One way to handle outliers is to remove them.

  • Replacing values: Instead of removing the outliers, the outliers can be replaced by imputation methods, handled as missing, or replaced by the mean, median, or mode to avoid losing data.

  • Capping: In general, capping approaches replace the outliers with the upper or lower bounds from the variable distribution. The most common approach to capping is Winsorization, where the outliers are recoded to a value at a certain percentile. The outliers are typically capped at the 5th and 95th percentiles for the lower and upper bounds, respectively.

  • Grouping outliers: One way to flag outliers is to use binning, which groups the outliers into separate bins and treats them differently. You can also treat the outliers as a separate category instead of discarding them.
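A minimal sketch of these four approaches, assuming a hypothetical intake_g column of daily red meat intake in grams (the column name and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical daily red meat intake in grams, with two injected extreme values.
rng = np.random.default_rng(1)
df = pd.DataFrame({"intake_g": np.append(rng.normal(80, 25, 200), [600.0, 750.0])})

# Flag outliers with the IQR rule.
q1, q3 = df["intake_g"].quantile([0.25, 0.75])
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["intake_g"] < low) | (df["intake_g"] > high)

# 1. Removing values: drop the flagged rows entirely.
removed = df[~is_outlier]

# 2. Replacing values: recode outliers to the median of the non-outlying data.
replaced = df["intake_g"].where(~is_outlier, df.loc[~is_outlier, "intake_g"].median())

# 3. Capping (Winsorization): clip values to the 5th and 95th percentiles.
p05, p95 = df["intake_g"].quantile([0.05, 0.95])
capped = df["intake_g"].clip(lower=p05, upper=p95)

# 4. Grouping: keep the outliers but place them in their own bins.
df["intake_group"] = pd.cut(
    df["intake_g"],
    bins=[-np.inf, low, high, np.inf],
    labels=["low outlier", "typical", "high outlier"],
)
```

Which approach is appropriate depends on whether the outliers reflect errors or genuine, informative variability, as discussed above.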

Case Study Example

This case study is for illustrative purposes and does not represent a specific study from the literature.

Data source: Researchers collected data from over 500,000 people to study the correlation between red meat consumption and cancer.

Scenario: The research results indicated that those who ate the most red meat were more than twice as likely to get cancer as those who ate the least. However, when researchers looked more closely, they found that the data was skewed by a few extreme outliers who were eating large amounts of red meat, including more than a pound a day.

Analytic method: The analytic method used in this scenario involves studying the correlations between red meat consumption and cancer risk. The initial results showed a significant association, but upon closer inspection, the data was skewed by extreme outliers with exceptionally high red meat consumption, leading to misleading conclusions. To address this, data cleaning steps were performed, including outlier removal, to ensure the accuracy and reliability of the results.
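As a purely illustrative sketch of this effect (the numbers below are synthetic and do not reproduce the case study's actual data or results), the following shows how a handful of extreme values can inflate a correlation and how an IQR-based removal step changes the estimate; the variable names intake and risk are hypothetical:

```python
import numpy as np
from scipy import stats

# Synthetic illustration only: 'intake' and 'risk' are hypothetical variables.
rng = np.random.default_rng(7)
intake = rng.normal(80, 25, 500)   # daily red meat intake (grams)
risk = rng.normal(50, 10, 500)     # unrelated outcome score

# A handful of extreme consumers who also happen to have high outcome scores.
intake = np.append(intake, [500.0, 520.0, 540.0])
risk = np.append(risk, [95.0, 97.0, 99.0])

r_all, _ = stats.pearsonr(intake, risk)

# Remove IQR-based outliers in intake and recompute the correlation.
q1, q3 = np.percentile(intake, [25, 75])
keep = (intake >= q1 - 1.5 * (q3 - q1)) & (intake <= q3 + 1.5 * (q3 - q1))
r_clean, _ = stats.pearsonr(intake[keep], risk[keep])

print(f"with outliers: r = {r_all:.2f}; without outliers: r = {r_clean:.2f}")
```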

Health Equity Considerations: To promote health equity, additional considerations need to be incorporated into the data cleaning process, specifically during outlier removal. By carefully evaluating outliers’ impact on different ethnic groups, potential biases can be mitigated to ensure fair representation and avoid disproportionate effects on certain populations. Furthermore, handling missing data with sensitivity to underrepresented groups can enhance the accuracy and inclusivity of the analysis, thus fostering equitable and reliable research outcomes.

Outliers:

To address the extreme values identified in this scenario, several data cleaning steps need to be performed.

Mitigation Approach:

  • Be sure to thoroughly understand your data.

  • Eliminate duplicate data and redundant information.

  • Use standard naming conventions.

  • Use a consistent data format for each field.

  • Perform imputation and outlier removal where necessary.

  • Once the data cleaning process is completed, apply the necessary data transformation step.

  • Several data cleaning steps may be necessary after the data transformation, as it may unify even more records.

Case Discussion

After removing the outliers, the association between red meat and cancer was no longer statistically significant. This example shows how outliers in data can lead to false conclusions.

Considerations for Project Planning

  • Why is it important to first understand the possible reasons behind outliers in your analysis before choosing a method to address them?
  • Are there outliers in your data that exist for a reason that would help better inform your analysis?