Section 1. Biostatistics
This lesson shows how different kinds of bias can affect the quality of data and biostatistical analysis and eventually lead to health disparities. It includes an overview of common methodologies, health equity examples, and a more detailed case study.
Health Equity and Biostatistics
What is Biostatistics
Biostatistics is the application of statistical techniques to the collection, analysis, and interpretation of biological data, including data on health, medicine, and human biology [Biostatistics, n.d.]. These methods play an important role in developing and evaluating public health initiatives and in assessing the response within communities. For example, biostatistical techniques are pivotal in creating population-level interventions, identifying barriers to care, analyzing public health trends and risks, and identifying health disparities and special risk groups.
Biostatistics falls into a category known as inferential statistics, which infers properties about an underlying population and distribution based on analysis and estimates obtained from a sample. Below are common tasks in biostatistics:
Hypothesis Tests: These tests are group comparisons that quantify whether there are statistically significant differences in the means of two or more populations based on sampled data. Two common parametric statistical techniques are T-Tests and Analysis of Variance (ANOVA) [Mishra et al., 2019]. The many varieties of these techniques are beyond the scope of this lesson; however, we give a refresher on the more basic implementations. For information on single-group summaries such as mean, median, correlation, and variance, see the lesson on descriptive statistics.
Regression Analyses: Regression can be used for prediction and forecasting problems, but it is also useful for evaluating the effects of variables on specific outcomes and for quantifying the significance of associations between variables. When choosing a methodology, carefully check the assumptions inherent in the method. If assumptions are violated, the model is likely misspecified, so the estimated coefficients and errors will likely be biased, which can create health inequities.
Structural Equation Modeling (SEM): SEM is used to explore the associations between variables by modeling the causal effects of directly observable variables together with postulated latent, abstract constructs that cannot be directly measured. In other words, SEM is a statistical approach for testing hypotheses about the causal nature of relationships between observed and theoretical phenomena. SEM follows four steps: model specification, model identification, model estimation, and model assessment. If the model is a good fit for the data, these techniques can be leveraged to better understand the effects of observed and latent constructs in your use case. Unlike regression, SEM is not geared toward prediction; it is a confirmatory analysis of the researcher’s proposed model.
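As a rough illustration of the group comparisons described under Hypothesis Tests, the sketch below runs a two-sample t-test and a one-way ANOVA with SciPy. The data are synthetic and the variable names (`group_a`, `group_b`, `group_c`) are hypothetical; Welch's variant of the t-test is used here as one reasonable choice, not as the lesson's prescribed method.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic (hypothetical) systolic blood pressure samples, in mmHg, for three groups.
group_a = rng.normal(loc=120, scale=10, size=100)
group_b = rng.normal(loc=130, scale=10, size=100)
group_c = rng.normal(loc=125, scale=10, size=100)

# Two-sample t-test. H0: the two population means are equal.
# equal_var=False selects Welch's t-test, which does not assume equal variances.
t_stat, t_p = stats.ttest_ind(group_a, group_b, equal_var=False)

# One-way ANOVA. H0: all three population means are equal.
f_stat, f_p = stats.f_oneway(group_a, group_b, group_c)

print(f"t-test: t = {t_stat:.2f}, p = {t_p:.4g}")
print(f"ANOVA:  F = {f_stat:.2f}, p = {f_p:.4g}")
```

A small p-value from the ANOVA only indicates that at least one group mean differs; identifying which groups differ would require follow-up comparisons.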
How Biostatistics are Used
Some examples of applying biostatistics in a public health context are:
Hypothesis Testing: Methods in hypothesis testing can be used to evaluate whether there are significant differences in cancer mortality rates for smokers vs. non-smokers, or in COVID-19 hospitalizations for vaccinated vs. unvaccinated individuals.
Risk Assessment: Regression analysis can be used to identify significant risk factors for a disease or health outcome. For example, it could provide insight into, or support theoretical considerations for, how variables such as wealth or other socio-economic factors relate to suicidal behavior.
Causal Modeling: Structural Equation Modeling (SEM) can be used to explore the associations between socioeconomic status and smoking through latent social and psychological factors such as financial strain and psychological distress in order to target effective interventions [Martinez et al., 2018].
Note that the health equity considerations in this lesson focus mainly on biostatistical techniques for cross-sectional studies rather than longitudinal studies. Longitudinal studies take repeated measures of the same participants or variables in order to detect trends and associations over a period of time. They are a type of correlational research and are applied across a variety of disciplines such as medicine, economics, and epidemiology. Several techniques are used to analyze longitudinal data, such as Generalized Estimating Equations (GEE), repeated measures ANOVA, and Mixed Effects Regression (MER). The MER model can handle multiple challenges of longitudinal data, and it is therefore the method recommended by the FDA for analyses of observational studies and clinical trials [Suhr, 2006]. Within a public health context, Mixed Effects Regression can be used, for example, to model group-level trends in a study on mental health in adolescents.
Methods in Biostatistics
This section gives an overview of common biostatistical methods. Readers who are already familiar with these methods may wish to jump to the next section, which discusses health equity considerations when using biostatistics.
Method | Description | Assumptions | Hypotheses |
---|---|---|---|
One Sample T-Test | | | |
One-Way ANOVA | | | |
Simple Linear Regression | | | |
Logistic Regression | | | |
Confirmatory Factor Analysis (CFA) | | | |
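Two of the methods listed above, the one-sample t-test and simple linear regression, can be sketched with SciPy as follows. This is a minimal sketch on synthetic data: the variable names, the reference value of 100, and the simulated effect sizes are assumptions made purely for illustration. The null hypotheses stated in the comments are the standard ones for these tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# One-sample t-test. H0: the population mean equals a reference value.
# Synthetic (hypothetical) fasting glucose readings in mg/dL, compared with 100.
glucose = rng.normal(loc=105, scale=12, size=60)
t_res = stats.ttest_1samp(glucose, popmean=100)

# Simple linear regression. H0: the slope is zero (no linear association).
# Synthetic data: weekly exercise hours against a made-up outcome score.
exercise_hours = rng.uniform(0, 10, size=200)
score = 50 + 2.0 * exercise_hours + rng.normal(scale=1.0, size=200)
fit = stats.linregress(exercise_hours, score)

print(f"one-sample t-test: t = {t_res.statistic:.2f}, p = {t_res.pvalue:.4g}")
print(f"regression: slope = {fit.slope:.2f} (SE {fit.stderr:.3f}), p = {fit.pvalue:.4g}")
```

Both tests rely on assumptions such as independent observations and approximately normal errors, so residual checks would normally accompany output like this.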
Health Equity Considerations
Bias may arise during data collection, data preparation, or the application of an analytic technique, leading to results that do not accurately represent the population being studied. Some amount of bias will likely always remain, and it can create health disparities that often affect marginalized groups the most. Therefore, proactive steps must be taken to minimize harm from health disparities and to promote more equitable outcomes. The following table gives an overview of potential biases that can affect health equity when using biostatistics.
Challenge | Challenge Description | Health Equity Example | Recommended Best Practice |
---|---|---|---|
Multicollinearity | | | Test for the presence of multicollinearity between independent variables using Variance Inflation Factors (VIF). |
Omitted Variable Bias | | | |
Extrapolation | | | |
Data Heteroscedasticity | | Heteroscedasticity is a major concern in regression analysis and the analysis of variance because it invalidates statistical tests of significance. Regression analysis using heteroscedastic data still provides an unbiased estimate of the relationship between the predictor variable and the outcome, but the standard errors, and therefore the inferences drawn from the analysis, are suspect. Because standard errors are used to calculate the confidence intervals for regression coefficients, this can lead to incorrect conclusions about the significance of effects between some measure and a sensitive attribute such as race, gender, or other socioeconomic factors. | |
Power and Sample Size | Models estimated on small datasets (i.e., too few observations) can produce poor estimates with large standard errors. | Models lack statistical power when there is insufficient data to detect significant associations between the response and the predictors. The consequence is similar to that of data heteroscedasticity: it may lead to incorrect conclusions about the significance of effects between some measure and a sensitive attribute (e.g., race or gender). | The practical lower bound on sample size varies with the model, study design, and population size, and should be considered carefully given the specifics of the analysis being conducted. For example, the distribution of predictor variables has been shown to affect statistical power and required sample sizes [Olvera Astivia et al., 2019]. In addition, it is always good practice to confirm that the estimated model satisfies its assumptions (see the assumptions listed for each regression method in the table above). |
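Two of the challenges in the table, heteroscedasticity and sample size, lend themselves to quick numerical checks. The sketch below is one possible approach, not a prescribed one: it implements the Breusch-Pagan test in its studentized (n times R-squared) form and a normal-approximation sample size formula for a two-sided comparison of two means. The function names `breusch_pagan` and `n_per_group` and the synthetic data are hypothetical.

```python
import math

import numpy as np
from scipy import stats


def breusch_pagan(x, y):
    """Return the Breusch-Pagan LM statistic and p-value for the model y ~ 1 + x."""
    X = np.column_stack([np.ones(len(y)), x])   # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid_sq = (y - X @ beta) ** 2              # squared OLS residuals
    # Auxiliary regression of the squared residuals on the same predictors.
    gamma, *_ = np.linalg.lstsq(X, resid_sq, rcond=None)
    ss_res = np.sum((resid_sq - X @ gamma) ** 2)
    ss_tot = np.sum((resid_sq - resid_sq.mean()) ** 2)
    lm = len(y) * (1 - ss_res / ss_tot)         # LM = n * R^2 of the auxiliary fit
    p_value = stats.chi2.sf(lm, df=X.shape[1] - 1)
    return lm, p_value


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)


rng = np.random.default_rng(2)
x = rng.uniform(1, 10, size=500)
# Synthetic outcome whose noise grows with x, i.e. heteroscedastic by construction.
y = 3 + 2 * x + x * rng.normal(size=500)

lm_stat, bp_p = breusch_pagan(x, y)
print(f"Breusch-Pagan: LM = {lm_stat:.1f}, p = {bp_p:.3g}")
print(f"n per group for a standardized effect of 0.5: {n_per_group(0.5)}")
```

A small Breusch-Pagan p-value suggests non-constant error variance, in which case heteroscedasticity-robust standard errors are a common remedy. The sample size formula uses a normal approximation, so it slightly understates what an exact t-based calculation would give, and it should be run separately for any subgroup whose effects will be reported.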
Case Study Example
This case study is for illustrative purposes and does not represent a specific study from the literature.
Scenario: A researcher seeks to explore patient factors contributing to successful smoking cessation following a community-based smoking cessation program.
Specific Model Objective: What is the association of socio-economic factors (e.g. income, employment status, education) and successful smoking cessation?
Data Source: Participant questionnaires, Social Vulnerability Index
Analytic Method: Multivariable Logistic Regression controlling for age, sex, race/ethnicity, rural/urban status, and socio-economic factors.
Results: Higher education, higher income, and employment status each had a slightly positive but statistically insignificant effect on smoking cessation at the p<0.05 significance level.
Health Equity Considerations:
To prevent omitted variable bias, two additional socio-economic variables were considered for inclusion: community safety (i.e., a measure of violent crime in the area) and family/social support (i.e., single-parent household and number of nearby relatives). Family/social support was found to be relevant.
Before the new predictor, family/social support, was added, its sample size was checked and found to meet the minimum standards needed to sustain the model’s statistical power.
A VIF test across all predictor variables detected only one moderately high to high multicollinearity score (VIF=9.6), between education and employment status. Employment status was removed from the model specification, and the results were re-analyzed at the same p<0.05 significance level as follows:
Higher education level now has a significant positive effect on smoking cessation
Lower income now has a significant negative effect on smoking cessation
The variable family/social support has a slightly positive, but insignificant effect on smoking cessation
All model results should be interpreted and reported with care, with particular attention to avoiding inadvertent extrapolation beyond the scope of the model:
Sample sizes among the different demographic groups in the dataset should be noted. In this case, >85% of the samples came from participants aged 20-34 years. Therefore, the associated effects should not be extrapolated to younger persons, such as teenagers, for whom the analysis may produce very different results.
One should also state explicitly that the main significance tests reported above were averaged over the entire dataset. Therefore, no claims should be made regarding these same associations within a specific demographic (e.g., Hispanic men) unless the necessary hypothesis testing was conducted.
The steps above demonstrate that adjusting the model to avoid potential biases and to fine-tune the covariates can change the results of the effects analysis. In this case, the updated analysis can support efforts to reach smokers of lower socioeconomic status with proven tobacco control strategies, which helps reduce disparities in smoking prevalence and, in turn, the resulting disease and death.
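The VIF screening step in the case study can be sketched with NumPy alone. The data below are synthetic and are not the case study's data: the predictor names are borrowed only for illustration, and `education` and `employment` are deliberately built from a shared component so that they are strongly correlated. The helper `variance_inflation_factors` is a hypothetical name.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus all remaining predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r_squared = 1 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


rng = np.random.default_rng(3)
n = 1000
shared = rng.normal(size=n)                      # component common to two predictors
education = shared + 0.25 * rng.normal(size=n)   # synthetic, correlated with employment
employment = shared + 0.25 * rng.normal(size=n)
income = rng.normal(size=n)                      # synthetic, independent of the others

names = ["education", "employment", "income"]
vifs = variance_inflation_factors(np.column_stack([education, employment, income]))
for name, vif in zip(names, vifs):
    print(f"{name}: VIF = {vif:.2f}")
```

Values above roughly 5 to 10 are commonly treated as a warning sign, although the exact threshold is a judgment call. When two predictors are flagged together, as here, dropping or combining one of them and re-fitting the model mirrors the adjustment described in the case study.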
Considerations for Project Planning