Section 3. Data Validation#
Data validation is an integral step in the data preparation workflow: it is the process of ensuring that the data being collected is accurate and complete.
Health Equity and Data Validation
What is Data Validation?#
Data validation is the process of verifying a dataset's accuracy, validity, and adherence to the desired structure or format. It also entails making sure the data is error-free and in accordance with the organization's standards. Data validation is crucial because it ensures the information is trustworthy and suitable for analysis and decision-making [Breck et al., 2019]. A typical data validation process involves four steps:
1. Establish the criteria for the data to be regarded as valid.
2. Evaluate the data's validity by comparing it against the predetermined criteria.
3. Report the data validation results to those involved in the process.
4. Modify the data as necessary to correct any problems.
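The four steps above can be sketched in code. This is a minimal illustration, not a prescribed implementation; the dataset, the `age` and `zip_code` fields, and the criteria are all hypothetical.

```python
# A minimal sketch of the four-step validation process, assuming a
# hypothetical patient-record dataset with "age" and "zip_code" fields.

def validate_records(records):
    # Step 1: establish the criteria for the data to be regarded as valid.
    criteria = {
        "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
        "zip_code": lambda v: isinstance(v, str) and len(v) == 5 and v.isdigit(),
    }

    # Step 2: evaluate each record against the predetermined criteria.
    problems = []
    for i, record in enumerate(records):
        for field, check in criteria.items():
            if field not in record or not check(record[field]):
                problems.append((i, field))

    # Step 3: report the results to those involved in the process.
    print(f"{len(problems)} problem(s) found in {len(records)} record(s)")

    # Step 4: return the problem locations so the data can be corrected.
    return problems

issues = validate_records([
    {"age": 34, "zip_code": "30329"},
    {"age": -2, "zip_code": "303"},  # both fields fail their criteria
])
```

In practice, step 4 (correcting the data) is usually a human decision informed by this report rather than an automatic fix.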
Data Validation and Public Health#
Data validation can improve the accuracy and population representativeness of public health research. This is crucial for ensuring that research appropriately reflects the health needs of diverse groups and communities, including those from marginalized backgrounds, which in turn promotes health equity. It can also help guide the development of public health policies that are better suited to meeting the needs of everyone, regardless of background.
Data Validation Methods and Testing#
There are countless ways of incorporating data validation checks automatically within your analytic work; unit testing and integration testing are key examples of automating these checks. Unit testing tests individual units or components of code, such as functions, to ensure that each performs as intended. Integration testing verifies that the interactions between those individual units work as intended.
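The distinction can be illustrated with two small, hypothetical helper functions; in practice these checks would live in a test file run by a framework such as pytest or unittest.

```python
# Hypothetical helpers for a validation pipeline: one cleans a raw value,
# the other applies a range check to the cleaned value.

def clean_age(value):
    """Coerce a raw age value to int, or None if it cannot be parsed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def in_valid_range(age):
    """Range check: plausible human ages only."""
    return age is not None and 0 <= age <= 120

# Unit tests: each function is exercised in isolation.
assert clean_age("42") == 42
assert clean_age("forty-two") is None
assert in_valid_range(42)
assert not in_valid_range(-5)

# Integration test: the two units are exercised together, the way
# they would actually run in the pipeline.
assert [in_valid_range(clean_age(v)) for v in ["42", "130", "n/a"]] == [True, False, False]
```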
The table below covers several commonly used data validation methods that can also be considered within your validation tests [Gao et al., 2016].
If you are already familiar with methods for data validation, please continue to the next section.
| Method | Description |
|---|---|
| Cross-validation | Cross-validation is a technique used to assess the accuracy of a predictive model by splitting the dataset into two parts, training the model on one part and testing it on the other. |
| Statistical analysis | Statistical analysis is the process of using mathematical models and techniques to analyze data. It can be used to identify patterns, trends, and relationships in data. |
| Data visualization | Data visualization is the process of creating visual representations of data. It can be used to identify patterns and trends in data, as well as to communicate complex information in an easy-to-understand format. |
| Range checks | Range checks involve verifying that a value falls within a specified range. |
| Type checks | Type checks involve verifying that a value is of the correct data type. |
| Format checks | Format checks involve verifying that a value is in the correct format. |
| Checksum | A checksum verifies that the data (typically from a download) matches, byte for byte, what the server says it is. |
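Four of the methods in the table lend themselves to short code illustrations. The sketch below uses only the standard library; the record and its field names are hypothetical, and in a real download the expected checksum would come from the server rather than be computed locally.

```python
import hashlib
import re

record = {"age": 57, "visit_date": "2023-04-01"}

# Range check: the value falls within a specified range.
assert 0 <= record["age"] <= 120

# Type check: the value is of the correct data type.
assert isinstance(record["age"], int)

# Format check: the value matches the expected pattern (YYYY-MM-DD).
assert re.fullmatch(r"\d{4}-\d{2}-\d{2}", record["visit_date"])

# Checksum: the downloaded bytes hash to the digest the server published.
# (Here we compute `expected` locally only for illustration.)
payload = b"example file contents"
expected = hashlib.sha256(payload).hexdigest()
assert hashlib.sha256(payload).hexdigest() == expected
```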
Health Equity Considerations#
When data passes through multiple hands, sources, cleaning steps, and analytic methods, it is important to verify that it has not undergone any unforeseen or unexpected changes [Breck et al., 2019]. Data validation helps make sure your research and analysis remain on the right track. Below are some points to consider when developing your own data validation plans.
Data validation calls for authenticating data to ensure that it is accurate, complete, and free of processing or input errors. With large volumes of data this can be challenging, because manually verifying each piece of data is time-consuming. In such cases, automating the data validation process can significantly expedite verification and improve efficiency.
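One common pattern for automating this is to stream through the records once, tallying failures per check instead of stopping at the first problem. This is a minimal sketch under assumed field names (`age`, `county`); a real project might instead use a validation library or a data pipeline framework.

```python
from collections import Counter

def audit(records, checks):
    """Run every named check against every record, tallying failures."""
    failures = Counter()
    for record in records:
        for name, check in checks.items():
            if not check(record):
                failures[name] += 1
    return failures

# Hypothetical checks over a large synthetic dataset.
checks = {
    "age_in_range": lambda r: 0 <= r.get("age", -1) <= 120,
    "has_county": lambda r: bool(r.get("county")),
}
big_dataset = [{"age": 30, "county": "Fulton"}] * 99_999 + [{"age": 200}]

failures = audit(big_dataset, checks)
print(dict(failures))  # each check fails for exactly one record
```

A summary like this scales to millions of rows and gives stakeholders a single report to act on, rather than a manual row-by-row review.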
Ensuring that data is consistent across all sources can be challenging, especially when the data you are using comes from multiple places.
Dealing with large amounts of data requires optimized ways to verify data efficacy. Data efficacy pertains to ensuring that the data is accurate, reliable, and fit for its intended purpose.
Considerations for Project Planning