Lesson 2. Data Transformation#
Data transformation processes add cohesion to datasets by ensuring that data are consistent across all records and fields. This lesson covers various data transformation methods and how data transformation impacts public health research.
Health Equity and Data Transformation
What is Data Transformation?#
Data transformation is the process of converting data from one scale or format into another. This can be done in a variety of ways. For the purposes of this lesson, we will cover the higher-level categories of feature scaling and aggregation [Brownlee, 2020].
Feature scaling: Feature scaling is a method used to normalize the range of independent variables or features of data. Feature scaling is performed to bring all features to the same level of magnitudes so that no one feature dominates over the others. It is also used to reduce the time and computational cost of certain algorithms used in machine learning or visualization.
Normalization: Scaling that uses the minimum and maximum values of features as the scaling metric. Normalization is typically used when you are dealing with features of different scales, and it is more strongly impacted by outliers.
Standardization: Scaling that uses the mean and standard deviation as the scaling metric. Standardization is less impacted by outliers.
Data aggregation: This process involves combining similar data points into a single data point to reduce the complexity of the data. Examples of data aggregation include binning, clustering, and pivoting.
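As a minimal sketch (using hypothetical age values and only the Python standard library), the two feature scaling approaches above might look like:

```python
from statistics import mean, stdev

def min_max_scale(values):
    """Normalize values to the [0, 1] range using the min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardize values to mean 0 and standard deviation 1 (z-scores)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

ages = [20, 35, 50, 65, 80]      # hypothetical ages
print(min_max_scale(ages))       # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(ages))         # z-scores centered at 0
```

Note how the min-max result is squeezed into [0, 1] by the extremes of the data, while the z-scores depend only on the mean and spread, which is why standardization is less sensitive to outliers.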
Data Transformation and Public Health#
Data transformation is a crucial step in the research process because it enables researchers to compare data from various populations, sources, and time periods. It ensures that information is presented consistently and meaningfully, making comparison and analysis straightforward. Data transformations can improve data integrity, eliminate data redundancy, and lessen the need for intricate joins and queries when accessing or working with data. These procedures are therefore essential for guaranteeing the correctness of research findings and for detecting discrepancies in health outcomes. By using data transformation, researchers can better understand how certain populations are impacted by health disparities, such as those based on socioeconomic position, ethnicity, and gender. This ultimately helps to improve health equity by allowing researchers to identify and address the root causes of health disparities. Data scaling and aggregation can also help to ensure that data is collected and analyzed in a consistent manner, which is essential for conducting reliable research and for drawing valid conclusions about the health of different populations [Ferketich and Verran, 1994].
Types of Data Transformation#
The table below lists several common data transformation methods.
If you are already familiar with data transformation, please continue to the next section; otherwise, review the table below.
| Method | Description |
|---|---|
| Logarithmic Transformation | Transforms data by applying a logarithmic function. This can reduce the variance of a dataset, and logarithmic transformations are often used to better visualize data that spans large ranges. |
| Interquartile Range (IQR) | Scales values using the range between the first and third quartiles. This can help reduce the effect of outliers on a dataset. |
| Data Normalization (Min-Max Scaling) | Maps values from a given range to another range, usually between 0 and 1. For example, normalizing age from the range 0–100 to the range 0–1. This process is used to scale attributes of the data that are measured on different scales. |
| Data Standardization | Converts values to have a mean of 0 and a standard deviation of 1, by subtracting the mean from each value and then dividing by the standard deviation. The resulting z-scores express each value as a number of standard deviations away from the mean. |
| Data Aggregation | Combines similar data points into a single data point to reduce the complexity of the data. Examples of data aggregation include binning, clustering, and pivoting. |
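For instance, a logarithmic transformation of hypothetical case counts spanning several orders of magnitude might look like the following sketch:

```python
import math

# Hypothetical case counts spanning several orders of magnitude
case_counts = [12, 150, 2_300, 48_000, 910_000]

# A base-10 log compresses the range (here to roughly 1 through 6),
# so small and large values can be shown on the same axis
log_counts = [math.log10(c) for c in case_counts]
print(log_counts)
```

The transformation preserves the ordering of the values while shrinking the gap between the smallest and largest, which is what makes log-scaled axes readable.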
Health Equity Considerations#
How can data transformation processes aid in downstream analysis?#
Data standardization aids downstream analytical processes by providing a consistent collection of data items from many data sources that can be compared and analyzed. Because each data collection is uniformly formatted, errors are reduced and it is simpler to spot relationships and trends in the data. Standardized data is also easier to combine across sources, enabling more powerful analysis [Feldman et al., 2018].
Data aggregation facilitates downstream analytical procedures by offering a consolidated view of the data that can be used for additional investigation. Aggregation can reduce the amount of data that needs to be evaluated, leading to quicker and more precise conclusions. Aggregated data can be used to identify trends and correlations and to draw inferences, and the more comprehensive perspective it provides enables a better understanding of the data's underlying dynamics [Dhruva et al., 2020].
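Binning is one common form of aggregation. A minimal sketch, using hypothetical ages and an illustrative grouping scheme:

```python
from collections import Counter

ages = [3, 17, 25, 34, 41, 58, 62, 79]  # hypothetical individual ages

def age_group(age):
    """Bin an individual age into a coarse group (an illustrative scheme)."""
    if age < 18:
        return "0-17"
    if age < 45:
        return "18-44"
    if age < 65:
        return "45-64"
    return "65+"

counts = Counter(age_group(a) for a in ages)
print(dict(counts))  # {'0-17': 2, '18-44': 3, '45-64': 2, '65+': 1}
```

Eight individual records become four group-level counts, which are easier to compare and analyze downstream.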
What are common challenges when it comes to data transformation?#
Data Aggregation#
Data aggregation can introduce bias because it can distort the original data. For example, if data is aggregated by gender, it may lead to an oversimplified view of the population. This could lead to conclusions that are not representative of the true data. Additionally, data aggregation can lead to information being lost. For example, aggregating a dataset by age group can cause important individual differences to be lost. When performing data aggregation, it is important to thoroughly understand your data and the research question at hand before reducing the granularity of the data through aggregation [Feldman et al., 2018].
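As a small illustration of this information loss (using hypothetical blood pressure readings), two groups with very different spreads can look identical once aggregated to a mean:

```python
from statistics import mean

# Hypothetical systolic blood pressure readings for two subgroups
group_a = [110, 112, 114]   # a tight cluster
group_b = [90, 112, 134]    # a wide spread with the same average

# After aggregating to a mean, both groups look identical (112),
# even though group_b contains individuals at much higher risk
print(mean(group_a), mean(group_b))
```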
Feature scaling#
Data quality issues such as missing values, incorrect values, and inconsistent formats can make data transformation difficult. Large volumes of data can also slow down the transformation process and make it difficult to manage. Before performing feature scaling, it is best to clean the data: correct errors, resolve format variations and inconsistencies, remove duplicates, and handle missing data. This helps optimize the scaling process.
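A minimal sketch of such pre-scaling cleanup, using hypothetical mixed-format entries:

```python
def clean_numeric(raw):
    """Coerce mixed-format entries to floats, dropping unusable values.
    The input formats shown are hypothetical examples."""
    cleaned = []
    for v in raw:
        if v in (None, "", "NA"):       # missing data
            continue
        if isinstance(v, str):
            v = v.replace(",", "")      # format variation: "1,200" -> "1200"
        cleaned.append(float(v))
    return cleaned

raw = [120, "1,200", None, "NA", "85"]
print(clean_numeric(raw))  # [120.0, 1200.0, 85.0]
```

In practice, whether to drop or impute missing values is a judgment call that depends on the dataset and research question.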
Case Study Example#
Case study is for illustrative purposes and does not represent a specific study from the literature.
Scenario: A researcher wants to perform a data-driven analysis to improve access to health care services in rural areas.
Specific Model Objective: Develop an intervention strategy to improve access to health care services including 1) increasing the number of health care providers in rural areas, 2) providing transportation to health care facilities, and 3) expanding telehealth services.
Data Source: Researchers collected data from multiple sources, including surveys, interviews, and focus groups to better understand the needs of rural populations and the barriers they face in accessing health care services.
Analytic Method: The data was transformed into actionable insights that were used to develop strategies to improve access to health care services in rural areas.
Results: The strategies implemented as a result of this study have helped to reduce disparities in access to health care services in rural areas and improve health equity.
Health Equity Considerations:
In order to merge disparate datasets, such as this study uses, consider the following steps:
Eliminate duplicate data and redundant information
Use standard naming conventions
Use a consistent data format for each field
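The merging steps above can be sketched as follows (the field names, sources, and records are hypothetical):

```python
def standardize_record(rec):
    """Map a record onto standard field names with consistent formats.
    The field names and sources here are hypothetical."""
    return {
        "patient_id": str(rec.get("patient_id") or rec.get("id")).strip(),
        "county": str(rec.get("county") or rec.get("County")).strip().title(),
    }

source_1 = [{"id": "001", "County": "fairfax"}]
source_2 = [{"patient_id": "001", "county": "Fairfax"},
            {"patient_id": "002", "county": "loudoun"}]

merged, seen = [], set()
for rec in source_1 + source_2:
    std = standardize_record(rec)
    if std["patient_id"] not in seen:   # eliminate duplicate records
        seen.add(std["patient_id"])
        merged.append(std)

print(merged)  # two unique records with standard names and formats
```

Standardizing names and formats before deduplication matters: the same patient recorded as "fairfax" in one source and "Fairfax" in another would otherwise slip past the duplicate check.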
After merging the data, and prior to modeling and analysis, it will be important to thoroughly understand and characterize the data. Characterization helps provide context so that valuable information is not lost or misrepresented when aggregating or stratifying by a specific attribute such as race or gender.
It will also be important to identify and remove outliers where necessary so that results are not skewed or biased toward extreme values. Similarly, imputation should be considered where applicable to help reduce bias in survey estimates.
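One common approach to outlier removal is the 1.5×IQR rule; a minimal sketch using hypothetical clinic visit counts (`statistics.quantiles` requires Python 3.8+):

```python
from statistics import quantiles

def remove_iqr_outliers(values):
    """Drop values outside 1.5 x IQR of the quartiles (a common rule of thumb)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if lo <= v <= hi]

visits = [2, 3, 3, 4, 4, 5, 5, 6, 40]   # hypothetical clinic visit counts
print(remove_iqr_outliers(visits))      # the extreme value 40 is dropped
```

Whether an extreme value should be removed is context dependent; a value like 40 may be a data entry error or a genuine high utilizer, so outlier handling decisions should be documented.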
Feature scaling helps in combining multiple datasets by normalizing the data so that each feature is on the same scale. This makes it easier to compare and combine datasets, as the data points can be compared directly and the effect of one feature in the data does not overpower the other. This allows for better and more accurate analysis of the data, as well as more efficient machine learning algorithms.
Note that another iteration of data cleaning may be necessary after data transformation, as transformation may reveal additional records that can be unified.
Considerations for Project Planning