Section 3. Social Media Data
Social media data refers to information generated by users on social media sites such as Facebook and Twitter. It can include posts, comments, likes, shares, and other types of activity. These data are often used to understand the patterns, trends, behaviors, and preferences of a chosen audience.
Common Types of Social Media Data
The table below describes three basic types of social media data: user-generated data, metadata, and network data.
Type | Description |
---|---|
User-generated data | The core data most commonly associated with social media: the content that users share and post, including text, images, videos, and more. |
Metadata | Data that provides additional information about users, such as device type, IP address, location, and demographic information. It can also include platform-specific information, such as number of followers. Individual social media posts also carry metadata, such as the date of posting and the number of views. |
Network data | Data that captures relationships or connectedness between users, organizations, businesses, and other entities on social media. Relationships such as friends, followers, and fans are common examples. Network data can also include interactions, such as the network of users reacting to a particular post, and membership of established social media groups, e.g. England soccer fans. |
[Abebe et al., 2020, Lentzen et al., 2022, Rajput et al., 2020]
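To make the three data types concrete, here is a minimal Python sketch of how a single post might be represented. The `Post` class and all of its field names are hypothetical and chosen only for illustration; real platform exports use their own schemas.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Post:
    """One social media post, grouped by the three data types (hypothetical schema)."""
    # User-generated data: the content the user shared.
    text: str
    # Metadata: information about the post and the account behind it.
    author_id: str
    posted_at: datetime
    view_count: int = 0
    # Network data: accounts that reacted to this post.
    reacted_by: List[str] = field(default_factory=list)


post = Post(
    text="Feeling feverish today",
    author_id="user_a",
    posted_at=datetime(2023, 1, 5),
    reacted_by=["user_b", "user_c"],
)

# Edges of the reaction network around this post: (reactor, author).
edges = [(reactor, post.author_id) for reactor in post.reacted_by]
```

Keeping the three groups distinct makes it easier to strip metadata or network fields later if privacy requirements call for it.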
How Social Media Data is Used
Social media data is valuable for public and population health research because it can provide real-time insights into human patterns and behaviors. The following are some examples of how social media data is used by public health and population health researchers:
Syndromic Surveillance: Monitoring the Spread of Infectious Diseases. Social media data can be used to identify trends in mentions of certain symptoms (e.g. fever, chills, coughing).
Disease Forecasting: Disease Outbreak and Incidence. Social media network analysis has been used to understand how diseases are spreading, such as tracking mentions of influenza over time based on the number of messages, location, and social connectedness.
Understanding Attitudes and Sharing Information Around Public Health Topics. Social media has been used not only to inform the public about key public health topics, but also to understand public attitudes toward those topics.
[Daughton et al., 2020, Fung et al., 2015, Kanchan and Gaidhane, 2023, Samaras et al., 2020]
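As a rough illustration of the syndromic surveillance idea above, the following sketch counts posts per ISO week that mention a symptom keyword. The keyword list, the `weekly_symptom_counts` function, and the sample posts are all assumptions made for this example. Plain substring matching is crude (it cannot tell a real symptom report from a figure of speech), so treat this as a starting point rather than a validated method.

```python
from collections import Counter
from datetime import date

# Assumed keyword list; a real study would curate and validate its own.
SYMPTOM_KEYWORDS = ("fever", "chills", "cough")


def weekly_symptom_counts(posts):
    """Count posts per (ISO year, ISO week) that mention any symptom keyword.

    `posts` is an iterable of (date, text) pairs.
    """
    counts = Counter()
    for posted_on, text in posts:
        lowered = text.lower()
        if any(keyword in lowered for keyword in SYMPTOM_KEYWORDS):
            iso = posted_on.isocalendar()
            counts[(iso[0], iso[1])] += 1
    return dict(counts)


# Hypothetical posts for illustration only.
posts = [
    (date(2023, 1, 2), "Fever and chills all night"),
    (date(2023, 1, 4), "Great game last night"),
    (date(2023, 1, 5), "Can't stop coughing"),
    (date(2023, 1, 10), "My fever finally broke"),
]
weekly = weekly_symptom_counts(posts)
```

The resulting weekly series can then be compared against official surveillance counts for the same weeks.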
Health Equity Considerations
Social media data has extensive potential to aid in disease surveillance and epidemiological research. However, there are several health equity concerns and considerations that should be taken into account when working with any social media data.
Challenge | Challenge Description | Health Equity Example | Recommended Best Practice |
---|---|---|---|
Ethical concerns of social media data research | The primary consideration here is using social media data for a purpose other than its original intent. Researchers should review any legal agreements to ensure that the research in question is permitted under third-party agreements and terms of use. Beyond legal constraints, the research hypothesis should be evaluated in a social media context to see whether the question can truly be answered using social media sources. | EK is a researcher studying the effects of a new strain of influenza. They would like to run a social media analysis to see how users from different geographic areas are talking about the new strain in order to inform health planning and vaccine distribution. However, EK must consider that the motivations for seeking and sharing information on social media are very different from the objective of this study. Moreover, the types of information shared may vary with the social media platform, and there may be latent associations between socioeconomic/demographic factors and the information that different groups share or seek out. Basing resource planning or public health interventions on these data without such considerations can lead to health disparities. | Make sure any social media data used is cleaned and validated with some human-in-the-loop process. Several tools and algorithms have also been developed to help identify bots and misinformation. Additional data sources may also help validate and add contextual depth to social media datasets. These may be desirable steps before using social media data in research. |
Sampling bias, or concerns around the representativeness of social media data | The challenges of representativeness in social media data are two-fold. First, the demographics of social media users do not represent the overall population, as they generally skew toward younger populations. Second, different platforms have different demographics, e.g. certain platforms may have more males than females, or more users of one race than another. | JN is a data scientist studying trends in drinking attitudes of adults over 40. They have difficulty finding enough social media posts from accounts that meet that criterion. Using additional data sources, JN is able to find more accounts and build a more representative sample of adults over 40. JN must also consider the demographic distribution of the sample; otherwise, public health policies and outcomes resulting from the study could end up biased against an underrepresented population. | Same as above. It may be a challenge to find a social media site that shares demographics, but researchers should aim to find as representative a sample as possible. |
Inaccuracy of social media data | Social media data can be inaccurate in several ways. Some sources of inaccurate information are malicious, such as bots aimed at spreading misinformation. Other sources include spam accounts, advertisements, duplicate user accounts, and business (non-individual) accounts, which can impede public health surveillance research because they can be difficult to distinguish from “real” user accounts. | SJ is a researcher studying opinions about masks and the prevalence of an infectious/communicable disease in various geographic areas. After downloading data from a popular social media platform to create semantic features for mask attitudes, she finds that over half of the information is a copied message from a popular influencer, which contains a made-up statistic about masks. She finds a positive correlation between disease prevalence and negative mask attitudes across areas, particularly for lower-income populations. Health misinformation in this case could have further exacerbated existing health disparities in those areas, as evidenced by more vulnerable communities being burdened by the disease. | Several tools and algorithms have been developed to detect misinformation and other inaccuracies. Additionally, you can check for inconsistencies in the other messages an account posts, or look for corroborating sources. Tackling inaccuracy and misinformation is generally one of the biggest concerns in working with social media data, and it is unclear how exposure to health misinformation intersects with existing health disparities. |
Functional bias | Any bias that derives from the social media platform itself and may skew the data in one direction or another. For example, a platform may be composed largely of users from a particular political group, interest group, or demographic. The unique features of a particular platform may then influence, directly or indirectly, how users interact with or create data on that platform. | GG is a researcher studying the impact of a popular smoking cessation campaign on the U.S. population stratified by age. They have access to a download of Reddit posts related to the campaign. Unfortunately, the sample is biased because Reddit users are overwhelmingly adults under 50. | Consider the demographics of the user base of the social media platform, and how that should be accounted for in your study. Gathering a representative sample is challenging but crucial in reducing these types of biases. Consider including additional sources, or expanding the original search, to limit functional biases. |
Bias introduced by social media data collection methods | Social media sites may limit which data is available for download, introducing further issues with gathering representative samples. How data is made available from social media sites could be random or opaque in methodology. Additionally, even if gathering a large sample were possible, it may require significant computing and/or storage resources, which should be taken into consideration. | KL is a data engineer downloading social media posts for a researcher studying Lyme disease. They download data for the current year only. Unfortunately, there were key differences in Lyme disease cases over the last three years, so the study is biased toward the trends of the current year. | The biases that exist in the data will depend on how the data is gathered. If you are looking at a temporal trend, try to gather multiple years if possible. If you are concerned about a particular location, look at surrounding geographic areas and see how they compare. In general, expanding your search window to include different types of data based on different filters can help reduce these kinds of biases. |
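
The inaccuracy example in the table, where over half of a dataset turns out to be one copied message, suggests a simple first check: measure how much of the sample is a near-verbatim copy of a single message. The sketch below is one assumed way to do that with light text normalization. The function names and sample messages are hypothetical, and this check does not replace dedicated bot or misinformation detection tools.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, drop URLs and punctuation, and collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def duplicate_share(messages):
    """Return (most common normalized message, fraction of messages matching it)."""
    if not messages:
        return "", 0.0
    counts = Counter(normalize(message) for message in messages)
    top_message, top_count = counts.most_common(1)[0]
    return top_message, top_count / len(messages)


# Hypothetical messages for illustration only.
messages = [
    "Did you see this mask statistic?! https://example.com/a",
    "did you see this mask statistic",
    "DID YOU SEE THIS MASK STATISTIC?",
    "I wear a mask on the bus.",
]
top_message, share = duplicate_share(messages)
```

A high share is a prompt for human review of the repeated message, not proof that it is misinformation.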
Mitigating Bias in Social Media Data
The primary method to reduce bias in social media data research is to ensure the sample is representative of the reference population, which includes a diverse user population, or at least aligns with the group you are trying to study. This becomes a more difficult challenge when needing to perform analysis
Use as diverse and representative a sample as possible. Some methods have been established to gather representative data samples from social media sources; however, it remains a complex problem that varies from platform to platform [Hino and Fahey, 2019].
Any natural language processing (NLP) techniques, such as using trained language models to perform semantic analysis, should be based on linguistic context and tone similar to those of the source data.
Be aware of legal requirements, and ensure privacy is maintained, keeping in mind HIPAA and other laws.
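When a fully representative sample cannot be collected, one common statistical adjustment is post-stratification weighting, where each group's weight is its population share divided by its sample share. This technique is not named in the text above; it is offered here as an assumed illustration of what aligning a sample with a reference population can look like. The age groups and all numbers are hypothetical.

```python
def poststratification_weights(sample_counts, population_shares):
    """Return a weight per group: population share divided by sample share.

    Groups with zero sampled accounts are skipped, because weighting
    cannot recover a group that is entirely absent from the sample.
    """
    total = sum(sample_counts.values())
    return {
        group: population_shares[group] / (count / total)
        for group, count in sample_counts.items()
        if count > 0
    }


# Hypothetical numbers for illustration only.
sample_counts = {"18-39": 70, "40-64": 25, "65+": 5}
population_shares = {"18-39": 0.40, "40-64": 0.40, "65+": 0.20}

weights = poststratification_weights(sample_counts, population_shares)
```

Very large weights, such as the one for the oldest group here, signal that a few accounts are standing in for many people, so the estimates for that group deserve extra caution.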
Case Study Example
Case study is for illustrative purposes and does not represent a specific study from the literature.
Scenario: JE is studying the burden of influenza in the state of Louisiana. They wish to understand how self-reported flu-like symptoms on social media relate to cases reported to the state health department.
Specific Model Objective: Gather social media data sources and state surveillance data and compare influenza rates over the past 5 years.
Data Source: Yearly reported influenza counts and Twitter data on flu-related keywords.
Analytic Method: Use the Twitter influenza surveillance system developed by Lamb et al. This system uses a logistic regression classifier to identify influenza symptoms in tweets. Additionally, JE uses geocoding tools to ensure all tweets are from their state. JE plans on using a linear model to compare tweets with reported cases.
Results: Twitter showed a strong association with the occurrence of flu outbreaks and state-reported cases.
Health Equity Considerations:
Consider that with the emergence of other flu-like diseases, Twitter may not have the same predictive power as it did previously.
Consider the demographics of Twitter users in the state. People living in certain regions may be less likely to have internet access and/or less participation on social media, and the accounts that exist may be less representative of the overall population.
After applying geocoding, consider removing additional metadata and potentially identifiable information from identified tweets.
[Lamb et al., 2013, Paul et al., 2014]
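The final step of the case study, comparing tweet counts with reported cases using a linear model, can be sketched with ordinary least squares and a Pearson correlation. This is not the Lamb et al. system, only a minimal assumed version of the comparison step. The `fit_line` function is hypothetical, and the five yearly values are invented and constructed to be exactly linear so the result is easy to check by hand.

```python
from math import sqrt


def fit_line(x, y):
    """Fit y = slope * x + intercept by least squares; also return Pearson r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical yearly counts for five years (not real surveillance data).
flu_tweets = [120, 340, 560, 410, 230]
reported_cases = [1300, 3500, 5700, 4200, 2400]

slope, intercept, r = fit_line(flu_tweets, reported_cases)
```

With only five yearly points, any real fit would be highly uncertain, and a strong association would still need to be checked against the equity considerations listed above.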