Section 1. Descriptive Statistics#

Descriptive statistics can be used to summarize and describe variables in a data set, which is a valuable step prior to conducting more advanced analyses. Data exploration is good practice prior to model development in order to help inform appropriate method selection and mitigate misinterpretation of the analysis. For example, computing descriptive statistics to understand the distribution of the data and sample sizes relative to variables of interest, such as race, sex, age, or socioeconomic factors, is valuable before drawing the associated conclusions. Note that descriptive statistics as referred to here are not inferential statistics: researchers use descriptive statistics only to better understand the attributes of a given data set. Researchers interested in better understanding how variables interact with each other and testing the potential relationships between variables should see the lesson on Biostatistics.

Health Equity and Descriptive Statistics

  • Descriptive statistics provide a summarization of features within the data set and can reveal data quality issues that can lead to bias, such as missing, inconsistent, inaccurate, or ambiguous values.
  • Descriptive statistics combined with data visualization are used to show relationships between variables and can help select the type of analysis and research question appropriate for the data. This can lead to more accurate analyses, with any limitations and potential biases clearly communicated.

Common Types of Descriptive Statistics#

Researchers use descriptive statistics to better understand the underlying distribution of the data, gaining intuition for the aggregated sample as well as for stratifications along variables of interest. Moreover, this understanding often provides context for what to expect from a statistical analysis, such as regression, and can aid in mitigating biased outcomes.

The most common descriptive statistics can be categorized as follows:

  • Measures of Frequency describe how often something occurs and include metrics such as counts and percentages of subgroups within the data set. Rates can result from combining two different data sets in order to answer a question of interest (e.g., How do cancer rates vary across community demographics?), which can create more opportunity for biases and data quality issues. When calculating and analyzing rates, researchers should avoid the perils of the base rate fallacy, which invites jumping to conclusions based on initial impressions and can thereby perpetuate harmful stereotypes (see the worked example after this list).

  • Measures of Central Tendency help describe what a “typical” value is within a data set and describe attributes such as the average (mean), middle (median), or most frequent values within a data set (mode). These statistics help researchers better understand the central point of the distribution underlying a data set or variable as well as what is most likely to occur in a given data set.

  • Measures of Dispersion and Variation indicate the dispersion (spread) between data points in a data set and include metrics such as range, variance, and standard deviation. These statistics can be helpful for determining outliers and understanding the underlying distribution (pattern) of your data.

  • Measures of Position help researchers understand how values in a data set occur in relation to each other. The most common measures of position are percentiles, quartiles, standard scores (z-scores), and correlations.
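To make the base rate fallacy concrete, the following worked example uses entirely hypothetical numbers for a screening scenario: even a test that is correct 95% of the time produces mostly false positives when the condition is rare, so raw counts of positives can badly mislead.

```python
# Hypothetical numbers only: a screening test with 95% sensitivity and
# 95% specificity applied to a population where prevalence is 1%.
population = 100_000
prevalence = 0.01
sensitivity = 0.95
specificity = 0.95

diseased = population * prevalence              # 1,000 people with the condition
healthy = population - diseased                 # 99,000 people without it

true_positives = diseased * sensitivity         # 950
false_positives = healthy * (1 - specificity)   # 4,950

# Probability that a positive result reflects true disease (positive predictive value)
ppv = true_positives / (true_positives + false_positives)
print(f"Positive predictive value: {ppv:.1%}")  # ~16%, despite a "95% accurate" test
```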

All of these measures work together to help characterize a data distribution. Additionally, visualizing the shape of the distribution is good practice, but it can be misleading depending on the method chosen, such as a boxplot versus a histogram. See the lesson on data visualization to learn more about health equity considerations and graphical representations.
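As a minimal sketch of how these measures might be computed in practice, the snippet below uses pandas on a small, entirely hypothetical sample of systolic blood pressure readings (the column names and values are illustrative only):

```python
import pandas as pd

# Hypothetical sample: systolic blood pressure (mmHg) for two subgroups
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B", "A"],
    "sbp":   [118, 132, 121, 145, 150, 138, 132, 125],
})

# Measures of frequency: counts and percentages by subgroup
print(df["group"].value_counts())
print(df["group"].value_counts(normalize=True) * 100)

# Measures of central tendency: mean, median, mode
print(df["sbp"].mean(), df["sbp"].median(), df["sbp"].mode().iloc[0])

# Measures of dispersion and variation: range, variance, standard deviation
print(df["sbp"].max() - df["sbp"].min())
print(df["sbp"].var(), df["sbp"].std())

# Measures of position: quartiles and z-scores
print(df["sbp"].quantile([0.25, 0.5, 0.75]))
print((df["sbp"] - df["sbp"].mean()) / df["sbp"].std())
```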

The following table contains a review of specific methods in descriptive statistics. More experienced readers already familiar with different types of descriptive statistics can continue to the next section on Health Equity Considerations, which details health equity impacts and a more in-depth case study.

How Descriptive Statistics are Used#

Common applications of descriptive statistics within public health are:

  • Data Characterization: Measuring typical values, outliers, skewness, and dispersion properties of a data set helps to characterize it. This can serve as an exploration of the underlying distribution of the data, as a way of identifying information about a sample to support downstream analysis and interpretability, or as part of data preparation to support feature engineering prior to modeling. This is helpful within public health research but also applies across many other technical disciplines.

  • Health Monitoring: Descriptive statistics about an individual are used to test for the presence of disease or infection, track disease progression and observe efficacy of treatments, and monitor a person’s overall health.

  • Disease Prevalence and Incidence Rate Calculations: Descriptive statistics are used to characterize disease prevalence by computing base rates and creating frequency tables. Incidence rate ratios can aid hypothesis testing for the effects of certain lifestyle choices, for example, between groups of smokers and non-smokers (see the sketch after this list).

  • Data Visualization Support: After using descriptive statistics to summarize the data set and aggregate groups, this information can be displayed as a final product in plots or tables to aid in understanding group trends. This is an effective way of organizing information and can often be seen in displaying health trends on maps (e.g. charting the spread of infectious diseases) and in final reports.
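As an illustration of the prevalence and incidence rate calculations mentioned above, the arithmetic below uses made-up counts; the exposure labels (smokers versus non-smokers) are only an example:

```python
# Point prevalence: existing cases divided by the population at a point in time
cases_exposed, population_exposed = 120, 8_000        # e.g., smokers
cases_unexposed, population_unexposed = 90, 12_000    # e.g., non-smokers

prevalence_exposed = cases_exposed / population_exposed
prevalence_unexposed = cases_unexposed / population_unexposed
print(f"Prevalence (exposed):   {prevalence_exposed:.4f}")
print(f"Prevalence (unexposed): {prevalence_unexposed:.4f}")

# Incidence rate: new cases per unit of person-time at risk
new_cases_exposed, person_years_exposed = 45, 20_000
new_cases_unexposed, person_years_unexposed = 30, 40_000

rate_exposed = new_cases_exposed / person_years_exposed
rate_unexposed = new_cases_unexposed / person_years_unexposed

# Incidence rate ratio: how much higher the rate is in the exposed group
incidence_rate_ratio = rate_exposed / rate_unexposed
print(f"Incidence rate ratio: {incidence_rate_ratio:.2f}")  # 3.00 with these numbers
```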

Health Equity Considerations#

Many sources of bias can occur within various steps of the research process, from the data curation phase through interpretation of results. When analyzing a data set it is important to consider how it was collected and note the potential biases present within it. However, once the data set is in hand, the researcher may not be able to acquire additional data to correct for over- or under-representation of certain groups. Therefore, it is critical for the researcher to understand the distribution of their data, group representation, and sample sizes, and how these impact analysis, in order to mitigate additional bias. For example, many hypothesis tests assume the data are normally distributed and may be inappropriate to use with a non-normal distribution and a small sample size.
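For example, before applying a test that assumes normality, a quick look at the sample's shape can flag problems early. The sketch below uses scipy on a hypothetical, right-skewed sample (lengths of hospital stay are only an illustrative choice of variable):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical right-skewed sample, e.g., lengths of hospital stay in days
length_of_stay = rng.lognormal(mean=1.5, sigma=0.8, size=40)

# Describe the shape of the distribution
print(f"n = {length_of_stay.size}")
print(f"mean = {length_of_stay.mean():.2f}, median = {np.median(length_of_stay):.2f}")
print(f"skewness = {stats.skew(length_of_stay):.2f}")

# Shapiro-Wilk test of normality: a small p-value suggests that tests
# relying on a normality assumption may be inappropriate for this sample
w_stat, p_value = stats.shapiro(length_of_stay)
print(f"Shapiro-Wilk: W = {w_stat:.3f}, p = {p_value:.4f}")
```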

Challenge: Statistical Confounding

  • Challenge Description: A confounder (or confounding variable) may introduce bias into a study and preclude finding a true effect by producing an underestimation or overestimation of the true association between an exposure and an outcome.

  • Health Equity Example: Univariate analyses that rely on aggregated observational data can contain confounding factors that promote inaccurate conclusions even from simple descriptive statistics. One example is Simpson’s Paradox, which has been known to emerge in cases such as COVID-19 vaccination data sets [von Kügelgen et al., 2021, Wang and Rousseau, 2021] (confounder = age) and cancer studies [Fu et al., 2015] (confounder = race), and which could contribute to health disparities in preventing the spread of infectious diseases and in targeting cancer therapies. Another byproduct of confounding may be the detection of spurious correlations.

  • Recommended Best Practice: Question how a univariate analysis can lead to a loss of information and consider:
      • Stratifying the raw data set based on the potential confounders
      • Including and adjusting for other potential confounding variables
      • Using more rigorous multivariate methods, such as regression and ANOVA, to further improve the estimation

Challenge: Sampling Error

  • Challenge Description: Occurs when the sample population within the data set is not an accurate representation of the true population simply due to chance.

  • Health Equity Example: Sampling error can produce the appearance of effects in the sample that are not actually present in the population, or vice versa; spurious correlations are one such effect when computing descriptive statistics on a sample. For example, suppose a study evaluating the effectiveness of a new dementia treatment finds that it improves health outcomes in the treatment group by 15% relative to the control group. Due to sampling error and high variability in the true population, however, the true effect is actually lower and varies greatly among different ethnic groups.

  • Recommended Best Practice: More rigorous biostatistical methods, such as hypothesis testing, can help sort out whether associations are spurious or whether an effect observed in the sample is likely also to exist in the larger population.

Challenge: Small Sample Sizes and Data Heterogeneity

  • Challenge Description: Small data sets present a higher risk of biased estimates and random sampling error. These risks are compounded when the data are also heterogeneous.

  • Health Equity Example: When analyzing the effectiveness of a cancer treatment in a sample, base rates or risk calculations computed from raw counts can change drastically after adjusting for skewed group samples. That is, examining the distributions along race, ethnicity, and gender, and adjusting for these attributes, can have a critical impact on estimated base rates for health outcomes under a new cancer treatment.

  • Recommended Best Practice: Use methods in descriptive statistics, such as measures of dispersion and exploratory data visualization, to understand and help quantify the variability within the sample. Statistical tests for homogeneity can be performed using a chi-square test for categorical data, and using hypothesis tests for group means (t-test, ANOVA) and for variability between groups (variance test) for continuous data.
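As a small sketch of the homogeneity checks named in the last challenge above (the counts and measurements are hypothetical), a chi-square test can compare outcome proportions across groups, while t-tests and variance tests compare continuous measures:

```python
import numpy as np
from scipy import stats

# Hypothetical counts of an outcome (rows) across three demographic groups (columns)
observed = np.array([
    [30, 45, 12],     # outcome present
    [170, 255, 188],  # outcome absent
])

# Chi-square test of homogeneity for categorical data: a small p-value
# suggests outcome proportions differ across the groups
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")

# Hypothetical continuous measure (e.g., systolic blood pressure) in two groups
rng = np.random.default_rng(1)
group_a = rng.normal(loc=120, scale=15, size=25)
group_b = rng.normal(loc=128, scale=22, size=12)  # smaller, more variable group

# Welch's t-test compares group means without assuming equal variances
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {t_p:.4f}")

# Levene's test compares variability (spread) between the groups
w_stat, w_p = stats.levene(group_a, group_b)
print(f"Levene W = {w_stat:.2f}, p = {w_p:.4f}")
```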

Case Study Example#

This case study is for illustrative purposes and does not represent a specific study from the literature.

Scenario: A public health fellow is quantifying diabetic control statistics in the US covering 153 hospitals.

Specific Model Objective: Amongst patients who have a diagnosis, laboratory, or medication indicator of diabetes (using the SUPREME-DM definition [Nichols et al., 2012]), what percentage have a hemoglobin A1c value < 7% for the two most recent values?

Data Source: EHR data

Analytic Method: Univariate analysis to quantify diabetes control

Results: The fellow decided to estimate raw prevalence for this cross-sectional study [Capili, 2021] and found that individuals in rural areas had 1.2x higher rates than those in urban areas.

Health Equity Considerations: Initial findings showed only a slight difference in prevalence between rural and urban geographies. After reviewing the data with a senior epidemiologist in the chronic disease section, several new findings are noted:

  • There is an imbalance of patient characteristics between those residing in rural and urban areas, which should be taken into account in the interpretation of the difference in risk:

      • Normalizing counts between rural and urban areas with largely different sample sizes can help estimate true effects.

      • When adjusting for geographic area, individuals who reside in rural areas exhibited 2.4x higher rates than those in urban areas.

  • Univariate analyses based on aggregated data may lead to contradictory or misleading results compared to methodology that controls for statistical confounding:

      • When splitting out the data further based on race, it was found that rates of uncontrolled diabetes are higher in black populations than in white populations regardless of the geographic area in which they reside. In this case, race is a confounding variable.

  • The fellow may also consider performing a multivariable analysis to control for statistical confounding, which will yield a better estimate of the effects of race and rural versus urban residence on diabetes control risk.

      • Data characterization and univariate analyses are necessary first steps in exploring these associations but yield limited conclusions (a stratified computation is sketched below).
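A minimal sketch of how the crude and stratified comparisons above could be computed is shown below. The patient-level records are entirely synthetic and are not intended to reproduce the figures in this scenario; they only demonstrate flagging control from the two most recent A1c values and then comparing crude versus race-stratified rates:

```python
import pandas as pd

# Synthetic lab results: one row per hemoglobin A1c measurement
labs = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "a1c": [7.9, 6.8, 6.5, 8.4, 8.1, 6.2, 6.4, 9.0, 7.5, 6.9, 6.6, 8.8, 8.2],
    "date": pd.to_datetime([
        "2023-01-05", "2023-06-02", "2023-11-20",  # patient 1
        "2023-02-10", "2023-09-15",                # patient 2
        "2023-03-01", "2023-10-12",                # patient 3
        "2023-04-22", "2023-12-01",                # patient 4
        "2023-05-30", "2023-11-11",                # patient 5
        "2023-01-17", "2023-08-09",                # patient 6
    ]),
})

# Synthetic patient characteristics
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5, 6],
    "residence":  ["rural", "rural", "urban", "rural", "urban", "urban"],
    "race":       ["white", "black", "white", "black", "white", "black"],
})

# A patient is "controlled" if both of their two most recent A1c values are < 7%
last_two = labs.sort_values("date").groupby("patient_id").tail(2)
controlled = (
    last_two.groupby("patient_id")["a1c"].max().lt(7).rename("controlled").reset_index()
)
df = patients.merge(controlled, on="patient_id")
df["uncontrolled"] = ~df["controlled"]

# Crude (univariate) comparison of uncontrolled diabetes by residence
print(df.groupby("residence")["uncontrolled"].mean())

# Stratifying by race can change, or even reverse, the crude comparison
print(df.groupby(["race", "residence"])["uncontrolled"].mean())
```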

Considerations for Project Planning

  • How have you characterized the distribution of your data, such as skewness and outliers? How does this compare to the marginal distributions for the target population of interest stated in your research question?
  • Data quality concerns may place limitations on the type of analysis you can do or create bias, such as missing variables you wish you had or smaller sample sizes. How has data characterization influenced your method selection?
  • After looking at the data, do you find you need to refine your original research question? Why or why not?