Summer 2023
The Essence of Data Integrity
By David Navarro
For The Record
Vol. 35 No. 3 P. 10
How Can We Work Together to Manage It?
A study of 100 respondents across the health care C-suite found that while 85% of organizations view analytics priorities as fundamental to achieving their broader strategic objectives, only 20% of them fully trust their data. Though data make up the foundation of the health care ecosystem, many in the industry struggle with the collection, normalization, analysis, and application of data to inform the most important clinical and business decisions. In my 20+ years of working in the health data aggregation and interoperability space, I have experienced firsthand the apprehension that accompanies fully trusting large data sets. Validating the integrity of data is always the first topic of conversation.
While expert opinions on a single definition of data integrity vary, three crucial elements are nonnegotiable: accuracy, completeness, and consistency of the data being examined. Data integrity can also refer to data safety with respect to regulatory compliance. For example, in the United States, most health care industry professionals are familiar with HIPAA.
The main goal of HIPAA is to protect sensitive patient health information from disclosure without patients’ consent. This level of protection requires that data be maintained according to a set of standards during the design, collection, and storage phases. All data elements must be classified and organized properly to realize benefits such as interoperability and insights gleaned from data analytics. The health care industry leverages information models such as the HL7 Reference Information Model and specifications such as HL7 FHIR and the United States Core Data for Interoperability (USCDI) to ensure compatibility across organizations and systems. These models and specifications can serve as a solid basis to assess the vast amounts of data across the health care data landscape.
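To make the idea concrete, the sketch below shows what a standards-aligned record can look like in practice: a minimal HL7 FHIR R4 Patient resource expressed as a plain Python dictionary, with a deliberately simple required-element check. The required-element list is an assumption for illustration only; it is not a complete FHIR or USCDI validation.

```python
# A minimal HL7 FHIR R4 Patient resource expressed as a Python dict.
# The REQUIRED_ELEMENTS list below is an illustrative assumption for
# this sketch, not a complete FHIR or USCDI validation.
patient = {
    "resourceType": "Patient",
    "id": "example-001",
    "identifier": [{"system": "http://hospital.example.org/mrn", "value": "12345"}],
    "name": [{"family": "Rivera", "given": ["Ana"]}],
    "gender": "female",
    "birthDate": "1984-07-02",
}

REQUIRED_ELEMENTS = ["resourceType", "identifier", "name", "birthDate"]

def missing_elements(resource: dict) -> list[str]:
    """Return any of the sketch's required elements absent from a resource."""
    return [e for e in REQUIRED_ELEMENTS if not resource.get(e)]

print(missing_elements(patient))  # [] when the sketch's required fields are present
```

Representing records against a shared model in this way is what makes the same data element recognizable across organizations and systems.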
It’s important to consider the essence of data integrity: what it is, why it matters, what makes it challenging, and how industry professionals can work together to manage it.
Data Integrity vs Data Quality
Data with integrity can be quality data, but quality alone does not ensure data integrity. While data quality is certainly important for hospitals and health systems in decision making and logistics planning, quality data alone may lack the continuity and accuracy needed to be fully useful to their intended owner. Data integrity means that regardless of changes—updates, migrations, modifications—the data are still intact and convey a complete, comprehensive message, whereas quality data could be missing integral components and telling just one part of the story.
In a perfect world, data integrity and data quality would coexist with every data set at every hospital, giving patients and staff the right data at the right time and at the right location. Unfortunately, this type of complete interoperability is still a long way from the current reality. Regulations such as the Information Blocking Rule and the Office of the National Coordinator for Health Information Technology (ONC) Health IT Certification Program, both created under the umbrella of the 21st Century Cures Act, have incentivized software developers and health care organizations to adopt standardization in the storage and sharing of health data. Despite this progress, much remains to be done to ensure that the information being distributed is complete and paints a comprehensive picture of the patient journey.
What Makes Data Integrity Challenging?
The biggest challenge to having data you can trust is often the sheer volume of data that organizations must manage. We are living in the era of big data, where data can be characterized as containing greater variety and arriving in increasing volumes and with more velocity. Health care data are synonymous with big data. For example, the latest HL7 FHIR specification describes more than 145 FHIR health care resources (categories of health data) with well over 2,000 individual data elements. The following are the three dimensions of big data, commonly referred to as the three Vs:
• Volume—Big data is, above all, about scale. There are 2.5 quintillion bytes of data created every day, with an estimated 30% generated by the health care industry.
• Velocity—While volume refers to the amount of data, velocity refers to the speed with which data are received, analyzed, and interpreted—in other words, how quickly data are being created, moved, or assessed. Velocity is important when analyzing large data sets to detect patterns and when turning raw data into actionable steps in areas such as genomics and precision medicine.
• Variety—This refers to the many different types of data sources. One dimension of variety is data class, spanning administrative, clinical, financial, and social needs domains. Variety also refers to data formats, such as SQL backups, delimited files, and the various HL7 formats. Variety is an important consideration in your data strategy: it forces you to take an honest assessment of the data you have available and to ensure they are categorized and classified properly so that meaningful insights can be derived. A brief sketch of normalizing two such formats into one classification follows this list.
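Here is a minimal sketch of the variety problem, assuming two simplified source formats for the same patient: a pared-down HL7 v2 PID segment and a delimited row. The field positions and helper names are assumptions for illustration; production HL7 v2 parsing should rely on a purpose-built library.

```python
from datetime import datetime

# Two illustrative source formats for the same patient: a simplified
# HL7 v2 PID segment and a delimited (CSV-style) row.
hl7_pid = "PID|1||12345||RIVERA^ANA||19840702|F"
csv_row = "12345,Rivera,Ana,1984-07-02,F"

def from_hl7_pid(segment: str) -> dict:
    """Normalize a simplified PID segment into common fields."""
    f = segment.split("|")
    family, given = f[5].split("^")
    return {
        "mrn": f[3],
        "family": family.title(),
        "given": given.title(),
        "birth_date": datetime.strptime(f[7], "%Y%m%d").date().isoformat(),
    }

def from_csv(row: str) -> dict:
    """Normalize a delimited row into the same common fields."""
    mrn, family, given, dob, _sex = row.split(",")
    return {"mrn": mrn, "family": family, "given": given, "birth_date": dob}

# Both sources normalize to one classification, which is what makes
# cross-source analytics possible.
assert from_hl7_pid(hl7_pid) == from_csv(csv_row)
```

The point of the sketch is the last line: once both sources are normalized into one classification, they can be compared, merged, and analyzed together.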
Maintaining Data Integrity by Defining Data Lifecycle Management
To begin addressing these challenges, an organization must adopt an appropriate data lifecycle management (DLM) strategy. At a high level, the data lifecycle starts with the creation of data at their point of origin, carries them through the various business processes that rely upon them, and ends with their eventual retirement. The breakdown of these phases, per ONC guidance, is as follows (a brief code sketch of the phases follows the list):
• Business specification—data requirements, business terms, metadata;
• Organization—point of data creation or acquisition by the organization;
• Development—architecture and logical design;
• Implementation—physical design, initial population in data store(s);
• Deployment—rollout of physical data usage in operational environment;
• Operations—data modifications, data transformations, and integration performance monitoring and maintenance; and
• Retirement—retirement, archiving, and destruction.
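ONC guidance describes the phases but does not prescribe an implementation, so the following is only a minimal sketch, in Python, of one way to make a data asset’s position in the lifecycle explicit and auditable. The class and field names are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum, auto

class Phase(Enum):
    """The ONC-described lifecycle phases, in order."""
    BUSINESS_SPECIFICATION = auto()
    ORGANIZATION = auto()
    DEVELOPMENT = auto()
    IMPLEMENTATION = auto()
    DEPLOYMENT = auto()
    OPERATIONS = auto()
    RETIREMENT = auto()

@dataclass
class DataAsset:
    """A tracked data asset; the fields here are illustrative."""
    name: str
    phase: Phase
    history: list[tuple[Phase, date]] = field(default_factory=list)

    def advance(self, to: Phase, when: date) -> None:
        """Record a phase transition so the asset's place in the cycle stays known."""
        self.history.append((self.phase, when))
        self.phase = to

asset = DataAsset("lab_results_feed", Phase.BUSINESS_SPECIFICATION)
asset.advance(Phase.ORGANIZATION, date(2023, 6, 1))
print(asset.phase, len(asset.history))  # Phase.ORGANIZATION 1
```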
This data lifecycle matters for a few reasons. First, it captures each stage that a particular unit of data passes through from the point of origin to the end of its useful life. Knowing the location of the data at each point in the cycle is essential to data integrity. Second, it assigns a value to the data at each stage of the cycle. No person or organization wants unusable data in their domain; such data take space, time, and money to maintain. Anyone working in health care knows that space, time, and money are at a premium, so it’s critical to make sure the data are usable and working for you.
A comprehensive DLM strategy can help maintain the integrity and quality of data throughout their lifecycle—increasing efficiencies and improving processes within your organization. Strategies vary across organizations, but it’s important to remember that DLM is an approach, not a product; it describes how your team manages its data.
Best Practices When Formulating DLM
Following is a list of practices that are essential to maintaining data integrity within your organization:
• Maintaining data provenance. Provenance is the ability to trace data to the beginning of their existence, and it should be addressed through all phases of the DLM. In the health care world, this means the ability to track, store, and reference key information related to the data. Storing information that indicates when the data were created, the subject of the data, the author of the data, the organization that originated the data, and other elements can help to validate their authenticity and provide a level of trust when sharing data. This is especially helpful when storing data from various systems in a single architecture (see the first sketch following this list).
• Record integrity via patient identity. A patient identity management approach should be developed to maintain a comprehensive representation of a patient’s data record across a health care organization or across multiple systems that exchange data. No single specification prescribes a specific identity strategy, but the ONC does provide a framework for cross-organizational patient identity management, which can be leveraged to formulate your own customized strategy (the second sketch following this list illustrates the basic matching idea).
• Terminology. A successful DLM strategy includes preserving the terminology associated with data elements. When possible, data should be recorded and preserved using widely accepted industry terminology. Implementing common ontologies such as SNOMED CT, RxNorm, and CPT allows organizations to achieve a higher level of data intelligence and meaningful insights that support valuable decision making.
• Industry specifications and models. A successful DLM strategy should consider industry specifications such as HL7 V2, HL7 V3, and HL7 FHIR. These are great resources when developing a logical design. Ensuring that your architecture aligns with the latest interoperability standards allows for quicker data ingestion and comes with recommendations for binding terminology to common data elements. The Federated Health Information Model is a great example of the federal government harmonizing its information model to align with existing and emerging standards; it describes a vast amount of health-related data used by more than 20 federal agencies.
• Operational controls. Continual operational controls should be implemented to address common data integrity issues. This includes testing data for completeness (eg, only 50% of records are associated with visits), testing data for accuracy (eg, 10% of active patients have a birth date prior to 1900), and testing data for consistency (eg, phone numbers stored in the same format). Continual is the key word when performing operational controls: each time data are added to or modified within a data set, the controls should run again to ensure data integrity (the third sketch following this list shows such checks).
• Retirement. The growth and evolution of organizations serve as catalysts that drive the retirement of old systems and the implementation of new ones. The retirement of old systems should be approached as a cyclical pattern, not a one-time event. In most cases, not all data will be migrated to the new system, but the data should still be retained for regulatory compliance. It’s important to note that the medical record retention period varies across classes of data and varies greatly from state to state. A solid plan for the retirement of data should address the migration of key data to go-forward systems and the archival of remaining data into a legacy data platform, such as an active archive. A comprehensive legacy platform helps control costs while maintaining the integrity of the entire medical record.
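The three sketches below illustrate, in Python, how a few of these practices might look in code; all names and structures are assumptions for illustration. First, provenance: a minimal record of when, by whom, and where a data element originated, stored alongside the data themselves (loosely in the spirit of, though not conforming to, the FHIR Provenance resource).

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class Provenance:
    """Minimal provenance metadata; fields are illustrative assumptions."""
    recorded_at: datetime      # when the data were created
    author: str                # who authored the data
    organization: str          # originating organization
    source_system: str         # system of origin

@dataclass
class Observation:
    """An observation that carries its provenance wherever it goes."""
    subject_mrn: str
    code: str
    value: str
    provenance: Provenance

obs = Observation(
    subject_mrn="12345",
    code="8480-6",  # LOINC code for systolic blood pressure
    value="120 mmHg",
    provenance=Provenance(
        recorded_at=datetime(2023, 6, 1, 9, 30, tzinfo=timezone.utc),
        author="Dr. Lee",
        organization="Example Health",
        source_system="legacy-ehr-01",
    ),
)
# Data arriving from multiple systems into one architecture stay traceable.
print(obs.provenance.organization)
```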
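Second, patient identity: a deliberately simplified deterministic match on normalized demographics. Real cross-organizational identity management, as framed by the ONC guidance mentioned above, typically involves probabilistic matching and human review; this only sketches the core idea.

```python
import unicodedata

def normalize(value: str) -> str:
    """Strip accents, whitespace, and case so trivially different values compare equal."""
    stripped = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode()
    return stripped.strip().lower()

def match_key(record: dict) -> tuple:
    """An assumed deterministic key: normalized name plus birth date."""
    return (normalize(record["family"]), normalize(record["given"]), record["birth_date"])

a = {"family": "Rivera", "given": "Ana ", "birth_date": "1984-07-02"}
b = {"family": "RIVERA", "given": "Ana", "birth_date": "1984-07-02"}
print(match_key(a) == match_key(b))  # True: the two records link to one identity
```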
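Third, operational controls: the completeness, accuracy, and consistency tests described above, expressed as small checks that can run each time data are added or modified. The thresholds and formats are assumptions for the sketch.

```python
import re
from datetime import date

records = [
    {"mrn": "1", "birth_date": date(1984, 7, 2), "phone": "555-867-5309", "visit_id": "V1"},
    {"mrn": "2", "birth_date": date(1899, 1, 1), "phone": "(555) 1234", "visit_id": None},
]

def completeness(recs: list[dict]) -> float:
    """Share of records associated with a visit."""
    return sum(1 for r in recs if r["visit_id"]) / len(recs)

def accuracy_flags(recs: list[dict]) -> list[str]:
    """Flag implausible values, eg, birth dates prior to 1900."""
    return [r["mrn"] for r in recs if r["birth_date"] < date(1900, 1, 1)]

def consistency_flags(recs: list[dict]) -> list[str]:
    """Flag phone numbers not in the expected NNN-NNN-NNNN format."""
    pattern = re.compile(r"^\d{3}-\d{3}-\d{4}$")
    return [r["mrn"] for r in recs if not pattern.match(r["phone"])]

print(completeness(records))       # 0.5 -> only 50% of records have visits
print(accuracy_flags(records))     # ['2'] -> birth date prior to 1900
print(consistency_flags(records))  # ['2'] -> inconsistent phone format
```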
Artificial Intelligence, Machine Learning, and Reliance on Data Integrity
It’s equally important to mention the evolution of artificial intelligence (AI) and machine learning (ML) and their relation to data integrity. AI/ML is one of the most exciting developments to occur in the technology world within the past decade. AI promises to change the way decisions are made and to lead to new discoveries by identifying complex data patterns that historically would have taken years to uncover. The acceleration of AI/ML has placed a new emphasis on improving and maintaining integrity within data sets.
The effective use of AI/ML relies on the fundamentals of data integrity—accuracy, completeness, and consistency. ML use cases in health care are emerging but not yet widespread. To successfully implement ML, data models and algorithms must be trained using massive amounts of quality data. Most organizations are able to supply ML models with an adequate amount of data; however, a lack of data quality still limits the ability of ML to reach its full potential in the health care space. Using data that lack integrity in an AI/ML model may yield inaccurate predictions. AI/ML models also rely on constant learning and tuning, so all new data added to an environment should be governed by the policies, procedures, and specifications defined as part of your DLM.
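As a minimal sketch of that last point, assuming an integrity gate built from checks like the operational controls above, records that fail the gate can be kept out of a training set rather than silently degrading a model.

```python
from datetime import date

def passes_integrity(record: dict) -> bool:
    """A stand-in integrity gate: complete, plausible values only."""
    return (
        record.get("visit_id") is not None
        and record.get("birth_date", date.min) >= date(1900, 1, 1)
    )

raw = [
    {"visit_id": "V1", "birth_date": date(1984, 7, 2), "label": 1},
    {"visit_id": None, "birth_date": date(1984, 7, 2), "label": 0},
]

training_set = [r for r in raw if passes_integrity(r)]
# Only records that pass the gate feed the model; the rest are routed
# back for remediation under the DLM's policies rather than silently used.
print(len(training_set))  # 1
```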
Data Integrity: A Foundation for the Future
The volume of health care data is skyrocketing. RBC Capital Markets projects that “by 2025, the compound annual growth rate of data for health care will reach 36%.” With data increasing at such an astonishing rate, the most important question an organization can ask today is, “Can I trust my data?” It would be extremely rare, and I would be quite skeptical, if an organization answered, “All data within my organization are flawless.” The most credible answer would mention a DLM strategy. A successful strategy includes standardized processes for gathering business requirements, a logical design to ensure data are classified appropriately, operational controls that allow for data refinement, and a path to the retirement, archiving, and destruction of data. I would like to challenge readers to drive data quality within their organizations. We can all agree that data-driven insight leads to smarter strategic decisions. However, making decisions based on data with a low level of integrity can cost your organization time, effort, and money. Data-driven decisions are only as strong as the data on which they are based.
— David Navarro is the senior director of data science at Harmony Healthcare IT. With more than 22 years of health IT (HIT) experience in integration and health information exchange, he provides vision and leadership for the design, development, and execution of HIT initiatives. Navarro previously held technical leadership positions at both Indiana Health Information Network and Michiana Health Information Network as a solution architecture director and chief architect. He’s also been an integration support engineer and senior systems analyst at Cerner Corporation. In his professional career, he’s focused on data quality, data insights, and interoperability. Navarro implemented hundreds of interfaces between clinical and financial systems, utilizing a variety of integration platforms; custom extract, transform, and load processes; and nationally accepted standards. He’s driven interoperability initiatives throughout his career and continues to focus on the curation and accessibility of data in the health care ecosystem.