Section 1. Descriptive Statistics#
Descriptive statistics summarize and describe the variables in a data set, a valuable step before conducting more advanced analyses. Exploring the data before model development helps inform appropriate method selection and mitigates misinterpretation of the analysis. For example, computing descriptive statistics to understand the distribution of the data and the sample sizes for variables of interest, such as race, sex, age, or socioeconomic factors, is valuable when drawing conclusions involving those variables. Note that descriptive statistics, as referred to here, are not inferential statistics: researchers use descriptive statistics only to better understand the attributes of a given data set. Researchers interested in how variables interact with each other and in testing potential relationships between variables should see the lesson on Biostatistics.
Common Types of Descriptive Statistics#
Researchers use descriptive statistics to better understand the underlying distribution of the data, gaining intuition for the aggregated sample as well as for stratifications along variables of interest. Moreover, this understanding often provides context for what to expect from a statistical analysis, such as regression, and can aid in mitigating biased outcomes.
The most common descriptive statistics can be categorized as follows:
Measures of Frequency describe how often something occurs and include metrics such as counts and percentages of subgroups within the data set. Rates can result from combining two different data sets to answer a question of interest (e.g., How do cancer rates vary across community demographics?), which creates more opportunity for biases and data quality issues. When calculating and analyzing rates, researchers should avoid the base rate fallacy (neglecting base rates in favor of initial impressions), which invites jumping to conclusions and can perpetuate harmful stereotypes.
Measures of Central Tendency describe a “typical” value within a data set, such as the average (mean), middle (median), or most frequent (mode) value. These statistics help researchers better understand the central point of the underlying distribution of a data set or variable, as well as what is most likely to occur.
Measures of Dispersion and Variation indicate the dispersion (spread) between data points in a data set and include metrics such as range, variance, and standard deviation. These statistics can be helpful for determining outliers and understanding the underlying distribution (pattern) of your data.
Measures of Position help researchers understand how values in a data set occur in relation to each other. The most common measures of position are percentiles, quartiles, standard scores (z-scores), and correlations.
All of these measures work together to help characterize a data distribution. Visualizing the shape of the distribution is also good practice, though the result can be misleading depending on the method chosen (e.g., a boxplot versus a histogram). See the lesson on data visualization to learn more about health equity considerations and graphical representations.
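To make these categories concrete, here is a minimal sketch using Python's standard library; the blood pressure readings are invented purely for illustration.

```python
from collections import Counter
from statistics import mean, median, mode, quantiles, stdev, variance

# Hypothetical systolic blood pressure readings (mmHg), for illustration only
readings = [118, 122, 122, 130, 135, 128, 141, 122, 119, 150]

# Measures of frequency: counts and percentages of each distinct value
counts = Counter(readings)
percentages = {value: 100 * n / len(readings) for value, n in counts.items()}

# Measures of central tendency
print("mean:", mean(readings))      # average value
print("median:", median(readings))  # middle value of the ordered data
print("mode:", mode(readings))      # most frequent value (122)

# Measures of dispersion and variation
print("range:", max(readings) - min(readings))
print("variance:", variance(readings))  # sample variance (n - 1 denominator)
print("std dev:", stdev(readings))      # square root of the variance

# Measures of position: quartiles split the ordered data into 4 equal parts
print("quartiles:", quantiles(readings, n=4))
```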
The following table contains a review of specific methods in descriptive statistics. More experienced readers already familiar with these methods can continue to the next section on Health Equity Considerations, which details health equity impacts and a more in-depth case study.
| Statistic | Type of Measure | Calculation | Description |
|---|---|---|---|
| Counts and Percentages | Frequency | Aggregated counts and fractional occurrences of an event or attribute of interest within the raw data set. | Counts and percentages are often used to produce base rates, establish the incidence of an event, understand differences across group stratifications and distributions, and begin to characterize the data. |
| Mean | Central Tendency | The average value of the collected sample: x̄ = (Σxᵢ)/n. | The sample mean varies from study to study; ideally, the sample mean and the true population mean should be close. The mean is the most sensitive to outliers, which can translate to overestimation or underestimation of the most common values. |
| Median | Central Tendency | The middle value in the collected sample, found by ordering all data points. | The median is less influenced by outliers and more resilient to skewed distributions; in such cases it is preferred over the mean. |
| Mode | Central Tendency | The value that occurs most frequently in a data set. | The mode is the least influenced by outliers. A data set may have no mode, a single mode, or more than one mode (e.g., a bimodal distribution). The mode is particularly useful for nominal data. |
| Range | Dispersion and Variation | The total distance between the smallest and largest data points in a sample: range = max - min. | Range measures the total spread in a data set and has the advantage of direct interpretation because it is expressed in the same units as the data. It is perhaps most useful when comparing data collected at different points in time or from different samples; in practice, however, it is not often used. |
| Variance | Dispersion and Variation | The average squared distance of data points from the sample mean: s² = Σ(xᵢ - x̄)²/(n - 1). | Ideally, the population and sample variance should be close. The variance is a building block for other important statistics such as the standard deviation, standard errors, and confidence intervals. |
| Standard Deviation | Dispersion and Variation | A measure of how dispersed the data points are around the sample mean: s = √s². | A low standard deviation means that the data points in the sample are tightly clustered around the mean. The standard deviation equals the square root of the variance. |
| Percentile Rank | Position | Given a rank-ordered data set, percentiles segment the ordered values into 100 equal parts. A value with a percentile rank of n is greater than n% of the other values in the data set. | A value at the 50th percentile rank corresponds to the median of the data set. |
| Quartile Rank | Position | Given a rank-ordered data set, quartiles segment the ordered values into 4 equal parts. | Quartile ranks map directly onto percentile ranks: the first quartile corresponds to the 25th percentile, the second to the 50th percentile (the median), and the third to the 75th percentile. |
| Standard Scores (z-scores) | Position | z = (x - μ)/σ, where x is the data point, μ is the population mean, and σ is the population standard deviation. | A z-score indicates the number of standard deviations a single data point lies from the population mean. When comparing sample means rather than individual points, replace σ with the standard error σ/√n in the denominator; the z-score then describes the number of standard errors between the sample mean (x̄) and the population mean (μ). |
| Correlation | Position | A measure of the strength of the statistical relationship between two random variables. | Common types are the Pearson and Spearman rank coefficients. Because the data set is only a random (ideally representative) sample of the underlying population, inference about an association between two variables must be made carefully; t-tests are typically used to assess whether a measured correlation is statistically significant. Observed correlation is not evidence of causality. |
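As a quick illustration of the z-score row above, the sketch below standardizes a single data point and then a sample mean (using the standard error in the denominator); the population parameters and sample values are assumed for illustration only.

```python
from math import sqrt

# Assumed (hypothetical) population parameters, for illustration only
mu, sigma = 100.0, 15.0

# z-score for a single data point: standard deviations from the population mean
x = 130.0
z_point = (x - mu) / sigma                    # (130 - 100) / 15 = 2.0

# z-score for a sample mean: replace sigma with the standard error sigma/sqrt(n)
sample_mean, n = 104.0, 36
standard_error = sigma / sqrt(n)              # 15 / 6 = 2.5
z_mean = (sample_mean - mu) / standard_error  # (104 - 100) / 2.5 = 1.6

print(z_point, z_mean)
```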
How Descriptive Statistics are Used#
Common applications of descriptive statistics within public health are:
Data Characterization: Measuring typical values, outliers, skewness, and dispersion properties of a data set helps to characterize it. This can serve as an exploration of the underlying distribution of the data, identify information about a sample to support downstream analysis and interpretability, or support feature engineering during data preparation prior to modeling. This is helpful within public health research but also applies across many other technical disciplines.
Health Monitoring: Descriptive statistics about an individual are used to test for the presence of disease or infection, track disease progression and observe efficacy of treatments, and monitor a person’s overall health.
Disease Prevalence and Incidence Rate Calculations: Descriptive statistics are used to characterize disease prevalence by computing base rates and creating frequency tables. Incidence rate ratios can aid hypothesis testing for effects of certain lifestyle choices, for example, between groups of smokers and non-smokers, as sketched after this list.
Data Visualization Support: After using descriptive statistics to summarize the data set and aggregate groups, this information can be displayed as a final product in plots or tables to aid in understanding group trends. This is an effective way of organizing information and can often be seen in displaying health trends on maps (e.g. charting the spread of infectious diseases) and in final reports.
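As a sketch of the rate calculations described above, the snippet below computes point prevalence per 1,000 people and an incidence rate ratio between smokers and non-smokers; all counts are invented for illustration.

```python
# Hypothetical counts, for illustration only
cases_smokers, person_years_smokers = 90, 12_000
cases_nonsmokers, person_years_nonsmokers = 45, 18_000

# Incidence rates per 1,000 person-years for each group
rate_smokers = 1_000 * cases_smokers / person_years_smokers           # 7.5
rate_nonsmokers = 1_000 * cases_nonsmokers / person_years_nonsmokers  # 2.5

# Incidence rate ratio: how many times higher the rate is among smokers
rate_ratio = rate_smokers / rate_nonsmokers                           # 3.0

# Point prevalence per 1,000: existing cases over the population at one time
existing_cases, population = 240, 30_000
prevalence = 1_000 * existing_cases / population                      # 8.0

print(rate_smokers, rate_nonsmokers, rate_ratio, prevalence)
```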
Health Equity Considerations#
Many sources of bias can arise throughout the research process, from the data curation phase through interpretation of results. When analyzing a data set, it is important to consider how it was collected and note the potential biases present within it. Once the data set is in hand, however, the researcher is often unable to acquire additional data to correct for over- or under-representation of certain groups. It is therefore critical to understand the distribution of the data, group representation, and sample sizes, and how these impact the analysis, in order to mitigate additional bias. For example, many hypothesis tests assume the data are normally distributed and may be inappropriate for non-normally distributed data, especially with small sample sizes.
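Because normality assumptions matter here, one way to check distribution shape during exploration is sketched below, using SciPy's Shapiro-Wilk test and a comparison of mean and median; the right-skewed sample is simulated for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed sample (e.g., hospital lengths of stay),
# for illustration only
sample = rng.lognormal(mean=1.0, sigma=0.6, size=40)

# A large gap between mean and median suggests skew
print("mean:", sample.mean(), "median:", np.median(sample))
print("skewness:", stats.skew(sample))

# Shapiro-Wilk test: a small p-value suggests the data are not normal,
# so normal-theory tests may be inappropriate at this sample size
stat, p_value = stats.shapiro(sample)
print(f"Shapiro-Wilk W={stat:.3f}, p={p_value:.4f}")
```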
| Challenge | Challenge Description | Health Equity Example | Recommended Best Practice |
|---|---|---|---|
| Statistical Confounding | Univariate analyses that rely on aggregated observational data can contain confounding factors that promote inaccurate conclusions, even for simple descriptive statistics. Confounding can also produce spurious correlations. | Simpson’s Paradox has emerged in COVID-19 vaccination data sets [von Kügelgen et al., 2021, Wang and Rousseau, 2021] (confounder = age) and cancer studies [Fu et al., 2015] (confounder = race), which could create health disparities in preventing the spread of infectious diseases and in targeted cancer therapies. | Question how a univariate analysis can lead to a loss of information, and examine the data stratified by potential confounders (see the sketch below). |
| Sampling Error | Sampling error can produce the appearance of effects in the sample that are not actually present in the population, or vice versa; spurious correlations are one such effect when computing descriptive statistics on a sample. | Suppose a study evaluating a new dementia treatment finds that it improves health outcomes in the treatment group by 15% relative to the control group. Due to sampling error and high variability in the true population, the true effect is actually lower and varies greatly among ethnic groups. | More rigorous biostatistical methods, such as hypothesis testing, can help sort out whether associations are spurious or whether an effect observed in the sample is likely to exist in the larger population. |
| Small Sample Sizes and Data Heterogeneity | Base rates and risk calculations computed from raw counts can change drastically when adjusted for skewed group samples. | When analyzing the effectiveness of a cancer treatment in a sample, examining the distributions along race, ethnicity, and gender, and adjusting for these attributes, can make a critical impact on base rates for health outcomes. | Use methods in descriptive statistics, such as measures of dispersion and exploratory data visualization, to understand and help quantify the variability within the sample. Statistical tests for homogeneity can be performed with a chi-square test for categorical data, and with hypothesis tests for group means (t-test, ANOVA) and variability between groups (variance test) for continuous data. |
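To see how aggregation can hide a confounder, the toy example below uses counts patterned after the classic kidney-stone illustration of Simpson’s Paradox: treatment A has the higher success rate within each severity stratum, yet treatment B looks better overall because the strata are unevenly distributed across treatments. The numbers are illustrative, not drawn from the studies cited above.

```python
# (successes, total) for two treatments within two severity strata
data = {
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

# Within each stratum, treatment A has the higher success rate
for stratum, groups in data.items():
    for treatment, (successes, total) in groups.items():
        print(f"{stratum:>6} {treatment}: {successes / total:.1%}")

# Aggregated over strata, treatment B appears better: the confounder
# (severity) is unevenly distributed across treatments, reversing the result
for treatment in ("A", "B"):
    successes = sum(data[s][treatment][0] for s in data)
    total = sum(data[s][treatment][1] for s in data)
    print(f"overall {treatment}: {successes / total:.1%}")
```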
Case Study Example#
This case study is for illustrative purposes and does not represent a specific study from the literature.
Scenario: A public health fellow is quantifying diabetes control statistics across 153 hospitals in the US.
Specific Model Objective: Amongst patients who have a diagnosis, laboratory, or medication indicator of diabetes (using SUPREME-DM definition [Nichols et al., 2012]), what percentage have a hemoglobin A1c value < 7% for the two most recent values?
Data Source: EHR data
Analytic Method: Univariate analysis to quantify diabetes control
Results: The fellow decided to estimate raw prevalence for this cross-sectional study [Capili, 2021] and found that individuals in rural areas had 1.2x higher rates of uncontrolled diabetes than those in urban areas.
Health Equity Considerations: Initial findings showed only a slight difference in prevalence between rural and urban geographies. After reviewing the data with a senior epidemiologist in the chronic disease section, several new findings were noted:
There is an imbalance of patient characteristics between those residing in rural and urban areas, which should be taken into account in the interpretation of the difference in risk:
Normalizing counts between those in rural and urban areas with largely different sample sizes can help estimate true effects.
After adjusting for these sample size differences across geographic areas, individuals residing in rural areas exhibited 2.4x higher rates than those in urban areas.
Univariate analyses based on aggregated data may lead to contradictory or misleading results compared to methodology that controls for statistical confounding:
When splitting the data out further by race, rates of uncontrolled diabetes were found to be higher in black populations than in white populations regardless of the geographic area in which they resided. In this case, race is a confounding variable.
The fellow may also consider performing a multivariable analysis to control for statistical confounding, which would yield better estimates of the effects of race and rural versus urban residence on diabetes control, as sketched below.
Data characterization and univariate analyses are necessary first steps in exploring these associations but yield limited conclusions.
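One hedged sketch of such a multivariable analysis, assuming a patient-level DataFrame with hypothetical columns `uncontrolled`, `rural`, and `race` (names invented here, not taken from the case study), could use logistic regression via statsmodels:

```python
import pandas as pd
import statsmodels.formula.api as smf

def adjusted_model(df: pd.DataFrame):
    """Regress uncontrolled diabetes on rurality, adjusting for race."""
    # C() treats race as a categorical predictor; the coefficient on `rural`
    # is then adjusted for the race confounder identified above
    model = smf.logit("uncontrolled ~ rural + C(race)", data=df).fit()
    print(model.summary())
    return model
```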