Section 1. Biostatistics

This lesson gives insight into how different kinds of bias can affect the quality of data and biostatistical analysis and eventually lead to health disparities. This section includes an overview of common methodologies, health equity examples, and a more detailed case study.

Health Equity and Biostatistics

Biostatistics is a highly influential and widely practiced field with applications in epidemiology, genetics, biology, medicine, and many other areas. As such, statistical analyses are often used to drive nationwide conversations and new policies on many public health issues. If these analyses are conducted improperly, health inequities may arise that adversely affect marginalized populations.
Public health research relies heavily on observational data, which may contain inherent selection bias leading to group underrepresentation, skewed distributions, and small sample sizes. This bias can be exacerbated by algorithmic biases introduced during analysis, such as violations of model assumptions or misinterpretation of results.
Since bias is unlikely to be eliminated entirely, we must take a vigilant approach to minimizing it through careful method selection and analysis when practicing biostatistics.

What is Biostatistics

Biostatistics is the application of statistical techniques to the collection, analysis, and interpretation of biological data, including data on health, medicine, and human biology [Biostatistics, n.d.]. These methods play an important role in developing and evaluating public health initiatives and in analyzing the response within communities. For example, biostatistical techniques are pivotal in creating population-level interventions, identifying barriers to care, analyzing public health trends and risks, and identifying health disparities and special risk groups.

Biostatistics falls into a category known as inferential statistics, which infers properties about an underlying population and distribution based on analysis and estimates obtained from a sample. Below are common tasks in biostatistics:

Hypothesis Tests: These tests are group comparisons that quantify whether there are statistically significant differences in the means of two or more populations based on sampled data. Two common parametric techniques are t-tests and Analysis of Variance (ANOVA) [Mishra et al., 2019]. The many varieties of these techniques are beyond the scope of this lesson; however, we give a refresher on the more basic implementations. For information regarding single-group summaries such as mean, median, correlation, and variance, see the lesson on descriptive statistics.
Regression Analyses: Regression can be used for prediction and forecasting problems, but it is also useful for evaluating the effects of variables on specific outcomes and for quantifying the significance of associations between variables. When choosing a methodology, the assumptions inherent in the method should be carefully considered and checked. If assumptions are violated, the model is likely misspecified, so estimated coefficients and standard errors will likely be biased and can create health inequities.
Structural Equation Modeling (SEM): SEM is used to explore the associations between variables by modeling the causal effects of directly observable variables together with postulated latent, abstract constructs that cannot be directly measured. In other words, SEM is a statistical approach to testing hypotheses about the causal nature of relationships between observed and theoretical phenomena. Structural Equation Modeling follows these steps: model specification, model identification, model estimation, and model assessment. If the model is a good fit for the data, these techniques can be leveraged to better understand the effects of observed and latent constructs as applied to your use case. Unlike regression, SEM is not intended for prediction; it is a confirmatory analysis of the researcher's proposed model.
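To make the hypothesis-testing task concrete, the following is a minimal sketch of the test statistics behind a pooled two-sample t-test and a one-way ANOVA, written with the Python standard library only. The clinic names and blood pressure readings are hypothetical values invented for illustration, and the helper function names are not from any particular library.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal population variances."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(a) - mean(b)) / se, df


def one_way_anova_f(*groups):
    """F statistic for a one-way ANOVA across two or more groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical systolic blood pressure readings (mmHg) from three clinics.
clinic_a = [118, 122, 125, 130, 121, 127]
clinic_b = [131, 128, 135, 140, 129, 133]
clinic_c = [124, 126, 129, 132, 123, 128]

t_stat, t_df = pooled_t_statistic(clinic_a, clinic_b)
f_stat, f_df1, f_df2 = one_way_anova_f(clinic_a, clinic_b, clinic_c)
print(f"t = {t_stat:.2f} on {t_df} df")
print(f"F = {f_stat:.2f} on ({f_df1}, {f_df2}) df")
```

The p-values come from the t and F distributions with the degrees of freedom shown. In practice, a statistics library such as SciPy (`scipy.stats.ttest_ind`, `scipy.stats.f_oneway`) would typically be used to compute both the statistic and the p-value.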

How Biostatistics are Used

Some examples of applying biostatistics in a public health context are:

Hypothesis Testing: Methods in hypothesis testing can be used to evaluate if there are significant differences in cancer mortality rates for smokers vs non-smokers, or differences in COVID-19 hospitalizations for vaccinated vs. unvaccinated individuals.
Risk Assessment: Regression analysis can be used to identify significant risk factors for a disease or health outcome. For example, it could provide insight into, or support theoretical considerations for, how wealth or other socio-economic factors are associated with suicidal behavior.
Causal Modeling: Structural Equation Modeling (SEM) can be used to explore the associations between socioeconomic status and smoking through latent social and psychological factors such as financial strain and psychological distress in order to target effective interventions [Martinez et al., 2018].
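As an illustration of the hypothesis-testing use case above, the sketch below compares event rates between two groups with a two-proportion z-test. This is one common choice for such comparisons, not necessarily the method used in any cited study, and the hospitalization counts are hypothetical.

```python
import math


def two_proportion_z_test(events_1, n_1, events_2, n_2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = events_1 / n_1, events_2 / n_2
    # Under H0 both groups share one rate, estimated by pooling.
    pooled = (events_1 + events_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: hospitalizations among vaccinated vs. unvaccinated.
z, p_value = two_proportion_z_test(30, 1000, 60, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

The normal approximation used here assumes reasonably large counts in each group; with small counts, an exact test would be a safer choice.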

It should be noted that the health equity considerations in this lesson mainly focus on biostatistical techniques for cross-sectional studies rather than longitudinal studies. Longitudinal studies take repeated measures of the same participants or variables in order to detect trends and associations over a period of time. They are a type of correlational research and are applied across a variety of disciplines such as medicine, economics, and epidemiology. A number of techniques are used to analyze longitudinal data, such as Generalized Estimating Equations (GEE), repeated measures ANOVA, and Mixed Effects Regression (MER). The MER model can handle multiple challenges of longitudinal data, and it is therefore the method recommended by the FDA for analyses of observational studies and clinical trials [Suhr, 2006]. Within a public health context, Mixed Effects Regression can be used, for example, to model group-level trends in a study on mental health in adolescents.

Methods in Biostatistics

This section serves as an overview of common types of biostatistical methods. More experienced readers who are already familiar with biostatistical methods may wish to jump to the next section on health equity considerations for more discussion regarding health equity issues when using biostatistics.

Method	Description	Assumptions	Hypotheses
Two-Sample T-Test	A test to determine whether the difference in means between two groups is statistically significant; can be performed one-sided or two-sided; analyzes whether two samples come from the same population	Each sample is drawn from a normally distributed population; populations have equal variances (homogeneous variance); sampled observations are independent of one another; data are randomly sampled; the t-distribution is used when the population standard deviation is unknown and the group sample sizes are small (fewer than 30)	Null hypothesis H0: population means are equal; alternative hypothesis H1: population means are not equal
One-Way ANOVA	A test to determine whether the differences in means among three or more groups are statistically significant; the ANOVA F-test is one-sided (variance cannot be negative); compares the variance among group means with the variance within groups	Each sample is drawn from a normally distributed population; populations have equal variances (homogeneous variance); sampled observations are independent of one another (for dependent observations, one-way repeated measures ANOVA should be used); data are randomly sampled; predictor variable is categorical; to determine which pairs of means are statistically different, post-hoc tests (pairwise multiple comparisons) are needed	Null hypothesis H0: all population means are the same; alternative hypothesis H1: at least one population mean is different
Simple Linear Regression	An estimation of the change in the response variable (Y) given the predictor variables (X); a t-test applied to each regression coefficient assesses the significance of the linear association between the response variable and the predictor variable(s)	Continuous response variable; continuous predictor variable(s), although categorical variables can be encoded (e.g. using dummy coding or other coding methods); single predictor variable (simple linear regression) or multiple predictor variables (multiple linear regression); a linear relationship exists between the response Y and the predictor variables X; observations are independent of one another; the residuals have constant variance ("constant error") for any value of X, known as homoscedasticity (a weighted least squares model can be used when heteroscedasticity is observed); for any fixed value of X, the residuals are normally distributed; no multicollinearity between predictor variables	Testing whether the intercept and any other coefficients are significantly non-zero
Logistic Regression	Models the log-odds of the response variable as a linear combination of the predictor variables; is an extension of linear regression in which the response variable is binary; the Wald statistic (rather than the t-test used in linear regression) is used to assess the significance of the regression coefficients of the independent variables; exponentiated coefficients represent the odds ratios for the response variable associated with a unit change in each predictor variable (the odds being the probability of an "event" occurring relative to it not occurring)	Response variable is binary; observations are independent; no multicollinearity between predictor variables; a linear relationship exists between the predictor variables and the logit of the response variable (can be checked with the Box-Tidwell test); sufficiently large sample size: one practice is the rule of 50 events per variable (EPV) and the formula n = 100 + 50i, where i is the number of independent variables in the final model.**	Testing whether the intercept and any other coefficients are significantly non-zero
Confirmatory Factor Analysis (CFA)	A structural equation modeling technique used to measure cause-effect type relationships in path models with latent variables; yields estimated variable coefficients representing the estimated change in the dependent variable for a unit change in the independent variables	The researcher must rely heavily on theory, expert domain and empirical knowledge of the intended use case, and prior factor analysis in order to postulate the relationships a priori (i.e. the "factor structure"); be alert to the Heywood case [Cooperman and Waller, 2022], which produces negative computed variances and may indicate high correlation between included variables; confirm the factor structure by referencing the chi-square goodness-of-fit test and resulting p-value before interpreting model results, and if the fit is not good, proceeding with a technique such as Exploratory Factor Analysis (EFA) is advised	Tests whether there is a statistically significant relationship between observed variables and latent concepts and between the latent variables themselves

** [Bujang et al., 2018]
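The logistic regression row in the table above can be sketched as follows. This is a minimal illustration that fits the model by Newton-Raphson using NumPy only; it is not a substitute for a vetted statistics package. The variable names (`age_std`, `employed`, `quit_smoking`), the simulated data, and the "true" coefficients are all hypothetical assumptions made for the example. Only the sample-size rule n = 100 + 50i is taken from the table.

```python
import numpy as np


def min_sample_size(n_predictors):
    """Rule of thumb from the table: n = 100 + 50i."""
    return 100 + 50 * n_predictors


def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson.

    X: (n, p) predictors without an intercept column; y: (n,) of 0/1.
    Returns coefficients (intercept first) and their standard errors.
    """
    X1 = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
        weights = p * (1.0 - p)
        hessian = X1.T @ (X1 * weights[:, None])
        beta = beta + np.linalg.solve(hessian, X1.T @ (y - p))
    # Standard errors from the inverse information matrix at the solution.
    p = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
    weights = p * (1.0 - p)
    cov = np.linalg.inv(X1.T @ (X1 * weights[:, None]))
    return beta, np.sqrt(np.diag(cov))


# Simulated (hypothetical) data sized by the rule for two predictors.
rng = np.random.default_rng(0)
n = min_sample_size(2)
age_std = rng.normal(size=n)             # standardized age
employed = rng.integers(0, 2, size=n)    # 0/1 indicator
true_logit = -0.5 + 1.0 * age_std + 0.8 * employed
quit_smoking = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

X = np.column_stack([age_std, employed])
beta, se = fit_logistic(X, quit_smoking)
wald_z = beta / se                # Wald statistic for each coefficient
odds_ratios = np.exp(beta[1:])    # odds ratio per unit change in each predictor
print("coefficients:", np.round(beta, 3))
print("Wald z:", np.round(wald_z, 2))
print("odds ratios:", np.round(odds_ratios, 2))
```

Exponentiating a coefficient converts it from the log-odds scale to an odds ratio, which is usually the more interpretable quantity to report.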

Health Equity Considerations

Bias may arise during data collection, data preparation, or the application of an analytic technique, leading to results that do not accurately represent the population being studied. Some amount of bias will likely always remain, and it can create health disparities that often affect marginalized groups the most. Therefore, proactive steps must be taken to minimize harm from health disparities and to promote more equitable outcomes. The following table provides an overview of potential biases that can impact health equity when using biostatistics.

Challenge	Challenge Description	Health Equity Example	Recommended Best Practice
Multicollinearity	Exists when two or more of the independent variables in a regression model are moderately or highly correlated. Severe multicollinearity among predictor variables can reduce the precision of coefficient estimates and undermine trust in significance testing; however, it does not prevent good, precise predictions of the response variable within the scope of the model. It is therefore more of a concern for inference than for prediction.	Severe multicollinearity can lead to faulty conclusions based on effects that are artificially exaggerated or diminished. Estimated regression coefficients and hypothesis tests for any predictor variable can yield different conclusions depending on which combination of predictors is in the model. For example, when assessing the risk of diabetes, the predictor variable obesity is found to be significant for both Black and Asian-American populations. However, another predictor variable (e.g. social factors limiting access to healthier food choices) may be a stronger predictor of diabetes for the Asian-American population but was dropped because it was highly correlated with obesity. This also relates to the challenge of omitted variable bias.	Test for multicollinearity between independent variables using Variance Inflation Factors (VIF). Consider removing some of the predictor variables with moderate (>4) to high (>10) VIF scores from the model.
Omitted Variable Bias	Occurs when confounding variables are not specified in the model, thereby reducing the validity of the statistical analysis. When more variables are added to the model, the number of samples must also increase in order to avoid overfitting. See the power and sample size challenge below.	When important predictors are omitted from the model, their effects are attributed to other variables included in the model, which biases the estimated effects and confounds the actual statistical relationship between variables. This ultimately undermines trust in the model results and can lead to spurious correlations, where the effects of someone's gender or socio-economic status, for example, may be overstated or understated, leading to health disparities.	Omitted variable bias arises when the error term of the model is correlated with the predictor variables. Consider any missing variables and include them in the model if they are 1) correlated with the dependent variable and 2) correlated with at least one independent variable. If there is not enough data to include an important predictor variable, a proxy may be a good option for avoiding omitted variable bias.
Extrapolation	Occurs when the estimated regression equation is used to estimate a mean value or to predict a new response for independent variable values outside the range of the sample data.	Trends modeled using the sample data may not hold outside the scope of the model. For example, if a regression model predicting maternal health outcomes is fitted to data composed only of White, Black, and Hispanic populations, it would be inappropriate or even dangerous to use this model to predict a new response for an Asian person.	Extrapolation may be attempted when modeling time-dependent covariates, known as forecasting, with the help of strategies such as exponential weighting or smoothing, which give less importance to older values.
Data Heteroscedasticity	Occurs when the variance of the residuals is not constant, for example when the spread of the residuals increases with the fitted values. It may occur more naturally in datasets that have a large range between the largest and smallest observed values.	Heteroscedasticity is a major concern in regression analysis and the analysis of variance, as it invalidates statistical tests of significance. Regression analysis using heteroscedastic data will still provide an unbiased estimate of the relationship between the predictor variable and the outcome, but the standard errors, and therefore the inferences drawn from the analysis, are suspect. Because standard errors are used to calculate the confidence intervals for regression coefficients, this can lead to incorrect conclusions about the significance of effects between some measure and a sensitive attribute like race, gender, or other socioeconomic factors.	Use a generalized least squares (GLS) model, such as the weighted least squares (WLS) estimator, instead of ordinary least squares (OLS); transform the dependent variable, such as converting it to a rate rather than a raw count or modeling log(raw counts); use a different model specification (i.e. different independent variables).
Power and Sample Size	Models estimated from small datasets (i.e. too few observations) can yield poor estimates with large standard errors.	Models lack statistical power because there is insufficient data to detect significant associations between the response and the predictors. The consequence is very similar to that of data heteroscedasticity, in that it may lead to incorrect conclusions about the significance of effects between some measure and a sensitive attribute (e.g. race, gender).	The practical lower bound on sample size varies with the model, study design, and population size and should be carefully considered given the specifics of the analysis being conducted. For example, the distribution of predictor variables has been shown to affect statistical power and required sample sizes [Olvera Astivia et al., 2019]. In addition, it is always good practice to confirm that the estimated model satisfies its conditions (i.e. see the assumptions imposed by the regression methods in the tables above).
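The VIF check recommended for multicollinearity in the table above can be sketched as follows. This is a minimal NumPy version, assuming the standard definition VIF = 1 / (1 - R²), where R² comes from regressing each predictor on the remaining predictors. The simulated predictors (`education`, `employment`, `income`) are hypothetical and constructed so that two of them are highly correlated.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF for each column of X (n, p).

    Each column is regressed on the remaining columns plus an intercept;
    VIF = 1 / (1 - R^2) of that auxiliary regression.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


# Hypothetical predictors: employment is built to track education closely.
rng = np.random.default_rng(0)
n = 500
education = rng.normal(size=n)
employment = education + 0.2 * rng.normal(size=n)
income = rng.normal(size=n)

vifs = variance_inflation_factors(np.column_stack([education, employment, income]))
for name, vif in zip(["education", "employment", "income"], vifs):
    flag = "high" if vif > 10 else "moderate" if vif > 4 else "ok"
    print(f"{name}: VIF = {vif:.1f} ({flag})")
```

The thresholds of 4 and 10 follow the table. For routine work, statsmodels offers an equivalent `variance_inflation_factor` function in `statsmodels.stats.outliers_influence`.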

Case Study Example

This case study is for illustrative purposes and does not represent a specific study from the literature.

Scenario: A researcher seeks to explore patient factors contributing to successful smoking cessation following a community-based smoking cessation program.

Specific Model Objective: What is the association of socio-economic factors (e.g. income, employment status, education) and successful smoking cessation?

Data Source: Participant questionnaires, Social Vulnerability Index

Analytic Method: Multivariable Logistic Regression controlling for age, sex, race/ethnicity, rural/urban status, and socio-economic factors.

Results: Higher education, higher income, and employment status all had a slightly positive, yet statistically insignificant, effect on smoking cessation at the 0.05 significance level.

Health Equity Considerations:

In order to prevent omitted variable bias, two additional socio-economic variables, community safety (i.e. a measure of violent crime in the area) and family/social support (i.e. single-parent household and number of nearby relatives), were considered for inclusion; family/social support was found to be relevant.
Prior to adding it, the sample size for the new predictor family/social support was checked and found to meet minimum standards for sustaining the model's statistical power.
After running a VIF test on all predictor variables, the only moderately high to high multicollinearity score (VIF=9.6) was detected for the variables education and employment status. Employment status was removed from the model specification and the results were re-analyzed at the same 0.05 significance level, as follows:
- Higher education level now has a significant positive effect on smoking cessation
- Lower income now has a significant negative effect on smoking cessation
- The variable family/social support has a slightly positive, but insignificant effect on smoking cessation
Interpreting and reporting all model results should be done with care and attention to detail including special consideration so as to avoid inadvertent extrapolation beyond the scope of the model:
- Sample sizes among different demographics in the dataset should be noted. In this case, >85% of the samples were from participants aged 20-34 years old. Therefore, the associated effects should not be extrapolated to younger persons, such as teenagers, who may produce very different inference results.
- One should also explicitly state that the main significance tests reported above were averaged over the entire dataset. Therefore, no claims should be made regarding these same associations within a specific demographic (e.g. Hispanic men) unless the necessary hypothesis testing was conducted.
The steps above demonstrate that adjusting the model to avoid potential biases and fine-tune the covariates can change the effects analysis. In this case, the updated analysis can support efforts to reach smokers of lower socioeconomic status with proven tobacco control strategies that help reduce disparities in smoking prevalence and, consequently, in disease and death.
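The reporting checks in the case study (noting subgroup sample sizes and guarding against extrapolation) can be sketched with a few lines of standard-library Python. The participant records, the age bands, and the helper names below are hypothetical and are only meant to show the kind of check one might run before interpreting results.

```python
from collections import Counter


def age_band(age):
    """Hypothetical banding that mirrors the 20-34 group in the case study."""
    return "20-34" if 20 <= age <= 34 else "other"


def outside_observed_range(records, key, value):
    """True if `value` lies outside the range of `key` seen in the sample."""
    observed = [r[key] for r in records]
    return value < min(observed) or value > max(observed)


# Hypothetical participants: most fall in the 20-34 age band.
participants = [{"age": a} for a in [22, 25, 29, 31, 34, 27, 24, 33, 30, 52]]

bands = Counter(age_band(p["age"]) for p in participants)
share_20_34 = bands["20-34"] / len(participants)
print(f"Share aged 20-34: {share_20_34:.0%}")

# A 16-year-old falls outside the sampled ages, so model-based effects
# should not be extrapolated to that person.
print("Age 16 outside sampled range:", outside_observed_range(participants, "age", 16))
```

A range check like this only catches values beyond the observed minimum and maximum. Sparse coverage inside the range, such as the single participant over 34 here, still needs to be reported alongside the results.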

Considerations for Project Planning

How have you checked that your data adheres to the assumptions of your chosen method in order to avoid inaccurate or misleading conclusions?
If you perform your analysis using a different technique, how do the analysis or outcomes change and why? Is a disparity, or inconsistency, created when comparing outcomes using different methods?
Who else can you ask to evaluate the analysis to help diminish bias that can creep into the process? Is your analysis reproducible?