Section 1. Biostatistics#

In this lesson, users will be given insight into how different kinds of bias can impact the quality of data and biostatistical analysis and ultimately lead to health disparities. This section includes an overview of common methodologies, health equity examples, and a more detailed case study.

Health Equity and Biostatistics

  • Biostatistics is a highly influential and widely practiced field with applications in epidemiology, genetics, biology, medicine, and many other areas. As such, statistical analyses are often used to drive nationwide conversations and new policies regarding many public health issues. If conducted improperly, these analyses can give rise to health inequities that adversely affect marginalized populations.
  • Public health research relies heavily on observational data, which may contain inherent selection bias leading to group underrepresentation, skewed distributions, and small sample sizes. This bias can be exacerbated by algorithmic biases introduced during analysis, such as violations of model assumptions or misinterpretation of results.
  • Since bias is unlikely to be eliminated entirely, we must take a vigilant approach to minimizing it through careful method selection and analysis when practicing biostatistics.

What is Biostatistics?#

Biostatistics is the application of statistical techniques to the collection, analysis, and interpretation of biological data, including data from health, medicine, and human biology [Biostatistics, n.d.]. These methods play an important role in developing public health initiatives and analyzing the response within communities. For example, biostatistical techniques are pivotal in creating population-level interventions, identifying barriers to care, analyzing public health trends and risks, and identifying health disparities and special risk groups.

Biostatistics falls into a category known as inferential statistics, which infers properties about an underlying population and distribution based on analysis and estimates obtained from a sample. Below are common tasks in biostatistics:

  • Hypothesis Tests: These tests represent group comparisons that quantify whether there are statistically significant differences in the means of two or more populations based on sampled data. Two common parametric techniques are t-tests and Analysis of Variance (ANOVA) [Mishra et al., 2019]; a brief sketch of both appears after this list. The many varieties of these techniques are beyond the scope of this lesson; however, we give a refresher on the more basic implementations. For information regarding single-group summaries such as mean, median, correlation, and variance, see the lesson on descriptive statistics.

  • Regression Analyses: Regression can be used for prediction and forecasting problems, but it is also useful for evaluating the effects of variables on specific outcomes and for quantifying the significance of associations between variables (see the second sketch after this list). When choosing a methodology, ample consideration and checks should be given to the assumptions inherent in the method. If assumptions are violated, model misspecification is likely, so the estimated coefficients and standard errors will likely be biased and can create health inequities.

  • Structural Equation Modeling (SEM): SEM is used to explore the associations between variables by modeling the causal effects of directly observable variables alongside postulated latent constructs that cannot be directly measured. In other words, SEM is a statistical approach for testing hypotheses about the causal nature of relationships between observed and theoretical phenomena. Structural equation modeling follows these steps: model specification, model identification, model estimation, and model assessment. If the model is a good fit for the data, these techniques can be leveraged to better understand the effects of observed and latent constructs as applied to your use case. Unlike regression, SEM is not geared toward prediction performance; rather, it is a confirmatory analysis of the researcher's proposed model.
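
To make the hypothesis tests above concrete, here is a minimal sketch in Python using SciPy to run a two-sample t-test and a one-way ANOVA. The group arrays are simulated placeholders, not data from any real study.

```python
# Minimal sketch: two-sample t-test and one-way ANOVA with SciPy.
# The arrays below are hypothetical placeholders for sampled outcome values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=120, scale=15, size=40)   # e.g., systolic BP, group A
group_b = rng.normal(loc=126, scale=15, size=40)   # e.g., systolic BP, group B
group_c = rng.normal(loc=131, scale=15, size=40)   # e.g., systolic BP, group C

# Two-sample t-test (Welch's version, which does not assume equal variances)
t_stat, t_pvalue = stats.ttest_ind(group_a, group_b, equal_var=False)

# One-way ANOVA across three or more groups
f_stat, f_pvalue = stats.f_oneway(group_a, group_b, group_c)

print(f"t-test: t={t_stat:.2f}, p={t_pvalue:.3f}")
print(f"ANOVA:  F={f_stat:.2f}, p={f_pvalue:.3f}")
```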

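Similarly, a regression analysis of the kind described above might be fit and summarized with statsmodels as sketched below; the data frame, variable names, and simulated relationships are hypothetical.

```python
# Minimal sketch: fitting a linear regression and inspecting coefficient
# estimates, confidence intervals, and p-values with statsmodels.
# The data frame and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=n),
    "income": rng.normal(50_000, 12_000, size=n),
})
# Simulated outcome with a known relationship plus noise
df["outcome"] = 0.3 * df["age"] + 0.0002 * df["income"] + rng.normal(0, 5, size=n)

model = smf.ols("outcome ~ age + income", data=df).fit()
print(model.summary())        # coefficients, standard errors, p-values
print(model.conf_int())       # 95% confidence intervals for each coefficient
```

Before trusting the coefficient estimates, the model's assumptions (linearity, independence, constant variance, normal residuals) should be checked, as discussed in the health equity considerations later in this section.
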
How Biostatistics are Used#

Some examples of applying biostatistics in a public health context are:

  • Hypothesis Testing: Methods in hypothesis testing can be used to evaluate whether there are significant differences in cancer mortality rates for smokers vs. non-smokers, or differences in COVID-19 hospitalizations for vaccinated vs. unvaccinated individuals.

  • Risk Assessment: Regression analysis can be used to identify significant risk factors for a disease or health outcome. For example, it could provide insight into, and/or support theoretical considerations for, how variables such as wealth or other socio-economic factors relate to suicidal behavior.

  • Causal Modeling: Structural Equation Modeling (SEM) can be used to explore the associations between socioeconomic status and smoking through latent social and psychological factors such as financial strain and psychological distress in order to target effective interventions [Martinez et al., 2018].

It should be noted that the health equity considerations in this lesson mainly focus on biostatistical techniques for cross-sectional studies rather than longitudinal studies. Longitudinal studies take repeated measures of the same participants or variables in order to detect trends and associations over a period of time. These studies are a type of correlational research and are applied across a variety of disciplines such as medicine, economics, and epidemiology. A number of techniques are used for longitudinal data analysis, such as Generalized Estimating Equations (GEE), repeated measures ANOVA, and Mixed Effects Regression (MER). The Mixed Effects Regression model can handle multiple challenges of longitudinal data, and it is therefore the method recommended by the FDA for analyses conducted using observational studies and clinical trials [Suhr, 2006]. Within a public health context, Mixed Effects Regression can be used, for example, to model group-level trends in a study on mental health in adolescents (a brief sketch follows).
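
As a rough illustration of the mixed effects approach mentioned above, the sketch below fits a random-intercept model with statsmodels; the repeated-measures data, column names, and grouping variable are all hypothetical.

```python
# Minimal sketch: a mixed effects (random intercept) regression with statsmodels.
# The repeated-measures data, column names, and grouping variable are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_subjects, n_visits = 50, 4
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subjects), n_visits),
    "visit": np.tile(np.arange(n_visits), n_subjects),
})
# Simulated outcome: a subject-specific baseline plus a time trend and noise
subject_effect = rng.normal(0, 2, size=n_subjects)
df["anxiety_score"] = (
    20
    - 0.8 * df["visit"]                  # average improvement over repeated visits
    + subject_effect[df["subject"]]      # subject-level random variation
    + rng.normal(0, 1, size=len(df))
)

# Random intercept per subject; fixed effect for visit (time)
model = smf.mixedlm("anxiety_score ~ visit", data=df, groups=df["subject"]).fit()
print(model.summary())
```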

Methods in Biostatistics#

This section serves as an overview of common types of biostatistical methods. More experienced readers who are already familiar with biostatistical methods may wish to jump to the next section on health equity considerations for more discussion regarding health equity issues when using biostatistics.

Health Equity Considerations#

Bias may persist throughout data collection, data preparation, or the application of an analytic technique, leading to results that do not accurately represent the population being studied. Some amount of bias will likely always remain and can create health disparities that often affect marginalized groups the most. Therefore, proactive steps must be taken to minimize harm and promote more equitable outcomes. The following overviews describe potential biases that can impact health equity when using biostatistics; each includes a description of the challenge, a health equity example, and recommended best practices.

Challenge: Multicollinearity

Challenge Description: Exists when two or more of the independent variables in a regression model are moderately or highly correlated. Severe multicollinearity among predictor variables reduces the precision of estimated coefficients and undermines trust in significance testing; however, it does not prevent good, precise predictions of the response variable within the scope of the model. Multicollinearity is therefore more of a concern for inference than for prediction.

Health Equity Example:

  • Severe multicollinearity can lead to faulty conclusions because estimated effects may be artificially exaggerated or diminished, through several undesirable consequences of highly correlated predictors.
  • Estimated regression coefficients, as well as hypothesis tests for any predictor variable, can yield different conclusions depending on which combination of predictors is in the model. For example, in assessing diabetes risk, the predictor variable obesity may be found significant for both Black and Asian-American populations, while another predictor that more strongly predicts diabetes for the Asian-American population (e.g., social factors limiting access to healthier food choices) is dropped because it is highly correlated with obesity. This also alludes to the challenge presented by omitted variable bias.

Recommended Best Practice: Test for the presence of multicollinearity among the independent variables using Variance Inflation Factors (VIF), as sketched below:

  • Consider removing predictor variables with moderate (VIF > 4) to high (VIF > 10) scores from the model.
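
A minimal sketch of the VIF check described above, using statsmodels; the example predictors and DataFrame columns are hypothetical.

```python
# Minimal sketch: computing a VIF for each predictor with statsmodels.
# The column names and example DataFrame are hypothetical.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(predictors: pd.DataFrame) -> pd.DataFrame:
    """Return the VIF for every predictor column."""
    X = sm.add_constant(predictors)  # include an intercept so VIFs are not artificially inflated
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    out = pd.DataFrame({"predictor": X.columns, "VIF": vifs})
    return out[out["predictor"] != "const"]  # drop the intercept row

# Example usage (hypothetical survey_df with candidate predictors):
# predictors = survey_df[["education", "employment_status", "income"]]
# print(vif_table(predictors).query("VIF > 4"))   # flag moderate (>4) or high (>10) VIFs
```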

Challenge: Omitted Variable Bias

Challenge Description: Occurs when confounding variables are not specified in the model, thereby reducing the validity of the statistical analysis. Note that when more variables are added to the model, the number of samples must also increase in order to avoid overfitting (see the power and sample size challenge below).

Health Equity Example: When important predictors are omitted from the model, their effects are forcibly attributed to the variables that are included, which biases the estimated effects and confounds the actual statistical relationships between variables. This ultimately leads to distrust in the model results and can produce spurious correlations, where the effects of someone's gender or socio-economic status, for example, may be over- or under-stated, leading to health disparities.

Recommended Best Practice (a sketch follows this list):

  • Omitted variable bias can be detected when the error term (residuals) of the model is correlated with the predictor variables.
  • Consider any missing variables and include them in the model if they are 1) correlated with the dependent variable and 2) correlated with at least one independent variable.
  • If there is not enough data to include an important predictor variable, then a proxy variable may be a good alternative in order to avoid omitted variable bias.
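
The sketch below illustrates two of the checks above on simulated, hypothetical data: correlating the residuals of a short model with a candidate omitted variable, and comparing coefficient estimates with and without that variable included.

```python
# Minimal sketch: probing for omitted variable bias by (1) checking whether
# residuals correlate with a candidate variable left out of the model and
# (2) comparing coefficients with and without that variable included.
# The data frame and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 500
access_to_care = rng.normal(0, 1, size=n)                 # candidate confounder
income = 0.6 * access_to_care + rng.normal(0, 1, size=n)  # correlated with the confounder
outcome = 1.0 * income + 2.0 * access_to_care + rng.normal(0, 1, size=n)
df = pd.DataFrame({"outcome": outcome, "income": income,
                   "access_to_care": access_to_care})

short = smf.ols("outcome ~ income", data=df).fit()
full = smf.ols("outcome ~ income + access_to_care", data=df).fit()

# Residuals of the short model correlate with the omitted variable -> red flag
print("corr(resid, access_to_care):",
      round(np.corrcoef(short.resid, df["access_to_care"])[0, 1], 2))
# The income coefficient moves toward its true value once the confounder is added
print("income effect, short vs. full:",
      round(short.params["income"], 2), round(full.params["income"], 2))
```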

Challenge: Extrapolation

Challenge Description: Occurs when the estimated regression equation is used to estimate a mean value or to predict a new response for independent variable values outside the range of the sample data.

Health Equity Example: Trends modeled using the sample data may not necessarily hold outside the scope of the model. For example, if a regression model for predicting maternal health outcomes is fit using data composed only of White, Black, and Hispanic participants, it would be inappropriate, or even dangerous, to use that model to predict a new response for an Asian person.

Recommended Best Practice: Extrapolation may be attempted when modeling time-dependent covariates, known as forecasting, with the help of strategies such as weighting the samples to give less importance to older values via exponential weighting or smoothing. One simple guard is to check whether new inputs fall within the range of the data used to fit the model (see the sketch that follows).
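
A minimal sketch of that range-check guard; the training and new data frames and their column names are hypothetical.

```python
# Minimal sketch: flagging predictions that would extrapolate beyond the
# observed range of each predictor in the training data.
# The data frames and column names are hypothetical.
import pandas as pd

def extrapolation_flags(train_X: pd.DataFrame, new_X: pd.DataFrame) -> pd.DataFrame:
    """Return a boolean frame marking values outside the training range."""
    lower, upper = train_X.min(), train_X.max()
    return (new_X < lower) | (new_X > upper)

train_X = pd.DataFrame({"age": [21, 35, 44, 58], "bmi": [19.5, 24.0, 30.2, 27.8]})
new_X = pd.DataFrame({"age": [40, 72], "bmi": [26.0, 18.0]})

flags = extrapolation_flags(train_X, new_X)
print(flags)                 # True marks a value outside the modeled range
print(flags.any(axis=1))     # rows where any predictor would be extrapolated
```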

Challenge: Data Heteroscedasticity

Challenge Description: Occurs when the variance of the residuals is not constant, often increasing proportionally with the fitted values. It may occur more naturally in datasets that have a large range between the largest and smallest observed values.

Health Equity Example: The existence of heteroscedasticity is a major concern in regression analysis and the analysis of variance, as it invalidates statistical tests of significance. Regression analysis using heteroscedastic data will still provide an unbiased estimate of the relationship between the predictor variables and the outcome, but the standard errors, and therefore the inferences obtained from the analysis, are suspect. Because standard errors are used to calculate the confidence intervals for regression coefficients, this can lead to incorrect conclusions about the significance of effects between some measure and a sensitive attribute like race, gender, or other socioeconomic factors.

Recommended Best Practice (a sketch follows this list):

  • Use a generalized least squares (GLS) model, such as the weighted least squares (WLS) estimator, instead of ordinary least squares (OLS).
  • Transform the dependent variable, such as converting a raw count to a rate or modeling log(raw counts).
  • Use a different model specification (i.e., different independent variables).
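
The sketch below shows one way to detect heteroscedasticity with the Breusch-Pagan test and refit with weighted least squares; the simulated data and the assumed variance structure are hypothetical choices.

```python
# Minimal sketch: testing for heteroscedasticity with the Breusch-Pagan test
# and refitting with weighted least squares (WLS) if it is detected.
# The simulated data and variable names are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(4)
n = 300
x = rng.uniform(1, 10, size=n)
# Error variance grows with x, so the data are heteroscedastic by construction
y = 2.0 + 0.5 * x + rng.normal(0, 0.4 * x, size=n)

X = sm.add_constant(pd.DataFrame({"x": x}))
ols_fit = sm.OLS(y, X).fit()

# Breusch-Pagan: a small p-value suggests heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")

if lm_pvalue < 0.05:
    # One common choice: weight each observation by the inverse of its
    # (assumed) error variance; here variance is assumed proportional to x**2.
    wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()
    print(wls_fit.summary())
```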

Challenge: Power and Sample Size

Challenge Description: Models estimated using small datasets (i.e., a lack of observations) can lead to poor estimation with large standard errors.

Health Equity Example: Such models lack statistical power because there is insufficient data to detect significant associations between the response and the predictors. The consequence is very similar to that of the data heteroscedasticity challenge, in that it may lead to incorrect conclusions about the significance of effects between some measure and a sensitive attribute (e.g., race or gender).

Recommended Best Practice: The practical lower bound on sample size varies depending on the model, study design, and population size, and it should be carefully considered given the specifics of the analysis being conducted; a power calculation such as the one sketched below can help. For example, the distribution of the predictor variables has been shown to affect statistical power and required sample sizes [Olvera Astivia et al., 2019]. In addition, it is always good practice to confirm that the estimated model satisfies the required conditions (i.e., the assumptions imposed by the regression methods described above).
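
A minimal sketch of a power and sample-size calculation using statsmodels; the effect size, significance level, and power targets are hypothetical choices for illustration.

```python
# Minimal sketch: estimating the sample size needed per group to detect a
# given effect size with a two-sample t-test, using statsmodels' power tools.
# The effect size, alpha, and power targets are hypothetical choices.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Solve for the number of observations per group needed to detect a
# "medium" standardized effect (Cohen's d = 0.5) at 80% power, alpha = 0.05.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                    ratio=1.0, alternative="two-sided")
print(f"Required sample size per group: {n_per_group:.0f}")

# Conversely, estimate the power achieved with only 30 participants per group.
achieved_power = analysis.power(effect_size=0.5, nobs1=30, alpha=0.05, ratio=1.0)
print(f"Power with n=30 per group: {achieved_power:.2f}")
```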

Case Study Example#

This case study is for illustrative purposes and does not represent a specific study from the literature.

Scenario: A researcher seeks to explore patient factors contributing to successful smoking cessation following a community-based smoking cessation program.

Specific Model Objective: What is the association of socio-economic factors (e.g. income, employment status, education) and successful smoking cessation?

Data Source: Participant questionnaires, Social Vulnerability Index

Analytic Method: Multivariable Logistic Regression controlling for age, sex, race/ethnicity, rural/urban status, and socio-economic factors.

Results: Higher education, higher income, and employment status all had a slightly positive yet statistically insignificant effect on smoking cessation at the p < 0.05 significance level.

Health Equity Considerations:

  • In order to prevent omitted variable bias, additional socio-economic variables, community safety (i.e., a measure of violent crime in the area) and family/social support (i.e., single-parent household status and number of nearby relatives), were considered for inclusion, with family/social support found to be relevant.

  • Prior to adding it, the sample sizes for the new predictor, family/social support, were checked and found to meet the minimum standards needed to sustain the model's statistical power.

  • After running a VIF test across all predictor variables, the only moderately high to high multicollinearity score (VIF = 9.6) was detected between the variables education and employment status. Employment status was removed from the model specification and the results were re-analyzed at the same p < 0.05 significance level, as follows:

    • Higher education level now has a significant positive effect on smoking cessation

    • Lower income now has a significant negative effect on smoking cessation

    • The variable family/social support has a slightly positive, but insignificant effect on smoking cessation

  • Interpreting and reporting all model results should be done with care and attention to detail, including special consideration to avoid inadvertent extrapolation beyond the scope of the model:

    • Sample sizes among different demographics in the dataset should be noted. In this case, more than 85% of the samples were from participants aged 20-34 years old. Therefore, the associated effects should not be extrapolated to younger persons, such as teenagers, who may produce very different inference results.

    • One should also explicitly state that the main significance tests reported above were averaged over the entire dataset. Therefore, no claims should be made regarding these same associations for a specific demographic (e.g., Hispanic men) unless the necessary hypothesis testing was conducted.

  • The steps above demonstrate that adjusting the model to avoid potential biases and fine-tuning the covariates can change the effects analysis. In this case, the updated analysis can enable efforts to reach smokers of lower socioeconomic status with proven tobacco control strategies, which helps reduce disparities in smoking prevalence and, therefore, in consequential disease and death. A sketch of the modeling workflow described above follows.
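
The sketch below suggests, under heavy simplifying assumptions, how the workflow in this case study might look in code: a multivariable logistic regression, a VIF check, and a refit after dropping a collinear predictor. All column names and simulated data are hypothetical and do not reproduce the results reported above.

```python
# Minimal sketch of the case-study workflow: fit a multivariable logistic
# regression, check VIFs, drop a collinear predictor, and refit.
# All column names and simulated data are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 800
education = rng.integers(1, 5, size=n)                                # ordinal education level
employment = (education + rng.normal(0, 0.5, size=n) > 2.5).astype(int)  # correlated with education
income = rng.normal(45_000, 15_000, size=n)
age = rng.integers(20, 65, size=n)
logit_p = -2 + 0.4 * education + 0.00002 * income + rng.normal(0, 0.5, size=n)
quit_smoking = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit_p))).astype(int)
df = pd.DataFrame({"quit_smoking": quit_smoking, "education": education,
                   "employment": employment, "income": income, "age": age})

predictors = ["education", "employment", "income", "age"]
X = sm.add_constant(df[predictors])
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.drop("const"))  # inspect for moderate (>4) or high (>10) VIFs

# Suppose, as in the case study, employment is flagged as collinear with
# education: drop it from the specification and refit the logistic model.
fit = smf.logit("quit_smoking ~ education + income + age", data=df).fit(disp=False)
print(fit.summary())
```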

Considerations for Project Planning#

  • How have you checked that your data adheres to the assumptions of your chosen method in order to avoid inaccurate or misleading conclusions?
  • If you perform your analysis using a different technique, how do the analysis or outcomes change and why? Is a disparity, or inconsistency, created when comparing outcomes using different methods?
  • Who else can you ask to evaluate the analysis to help diminish bias that can creep into the process? Is your analysis reproducible?