From Chaos to Clarity: Implementing OMOP for DHA Data


Why the Defense Health Agency Turned to OMOP to Unite 60 Billion Records

DHA data harmonization with OMOP is the process of transforming the Defense Health Agency’s massive, fragmented military health records (spanning more than 35 disparate sources and 1 petabyte of data) into the standardized Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). The goal is to enable systematic analysis, AI/ML adoption, and real-time evidence generation for 9.6 million beneficiaries worldwide.

Quick Answer: How DHA Harmonizes Data with OMOP

  1. Assess Source Systems – Profile 35+ legacy data sources (claims, EHR, trauma registries) using OHDSI tools like WhiteRabbit to understand table structures and data density.
  2. Map to OMOP CDM – Transform clinical codes (ICD-9/10, SNOMED CT, RxNorm) into standardized OMOP vocabularies using Rabbit-In-A-Hat for structural mapping and Usagi for semantic mapping.
  3. Execute ETL Processes – Build robust Extract-Transform-Load pipelines to convert source data into OMOP CDM tables (Person, Condition_Occurrence, Drug_Exposure, etc.) while maintaining referential integrity.
  4. Validate Data Quality – Run Data Quality Dashboard (DQD) checks across 1,500+ rules to ensure conformance, completeness, and plausibility across the entire dataset.
  5. Enable Analytics – Deploy ATLAS for cohort definition and HADES R packages for federated queries, population-level estimation, and AI-ready datasets.

The U.S. military’s healthcare system was historically drowning in data chaos. Analysts spent up to 80% of their time just finding, accessing, and merging records from 53 inpatient platforms, 140 outpatient systems, and hundreds of business intelligence tools. Meanwhile, the MHS Information Platform (MIP) processed 60 billion records annually—yet military readiness, patient safety, and research innovation were bottlenecked by incompatible formats, missing data, and fragmented systems dating back to the 1990s. This fragmentation meant that a service member’s health history might be split across the Armed Forces Health Longitudinal Technology Application (AHLTA), the Composite Health Care System (CHCS), and various theater-specific trauma registries, making a longitudinal view of care nearly impossible.

The solution? The OMOP CDM, the same open standard used for over 810 million patient records across 74 countries through the OHDSI community. By harmonizing DHA’s clinical data, claims data, and trauma registries into a single, standardized model, the agency unlocks real-time analytics, AI/ML capabilities, and seamless interoperability with the Department of Veterans Affairs (VA) for lifetime health records. This transition is not merely a technical migration; it is a strategic imperative to ensure that the “Golden Hour” of trauma care is supported by data-driven insights derived from decades of combat medical experience.

But the journey from chaos to clarity wasn’t simple. It required mapping legacy vocabularies, imputing missing dates, creating custom concepts for rare conditions, and validating data quality across decades of heterogeneous sources—all while maintaining HIPAA compliance and operational continuity for 164,000 daily patient encounters. The scale of this effort involves migrating over 30 years of health records, representing a petabyte of data that must be made “AI-ready” to support the next generation of military medical intelligence.

As Maria Chatzou Dunford, CEO of Lifebit, I’ve spent 15 years building federated data platforms for public sector institutions and pharma organizations tackling challenges identical to the DHA’s: transforming fragmented, siloed health data into AI-ready, OMOP-compliant datasets at scale. Our work on DHA data harmonization with OMOP demonstrates that even the world’s largest combat trauma registries can transition from 100% manual abstraction to automated, real-time evidence generation when the right architecture and tools are in place.

Infographic: the DHA data modernization journey, from 35+ disparate legacy sources (claims, EHR, trauma registries) through OMOP CDM transformation (mapping, ETL, validation) to unified analytics outputs (AI/ML, federated queries, real-time evidence), with key milestones including 1 petabyte migrated, 60 billion annual records processed, and 80% reduction in analyst data wrangling time.


Why the DHA Modernized Its 60-Billion-Record Information Platform

The Defense Health Agency (DHA) manages one of the most complex healthcare ecosystems on the planet. Serving 9.6 million beneficiaries—including active-duty service members, retirees, and their families—requires an infrastructure that can handle immense scale and geographic distribution. Historically, however, the DHA grappled with a “hodgepodge of older, distributed systems” from the 1990s. This fragmentation meant that critical health information was trapped in silos, making it nearly impossible to gain a comprehensive view of patient health or military medical readiness. For instance, a soldier injured in a theater of operations might have their initial trauma data recorded in a local registry that didn’t communicate with their permanent health record back in the United States.

To solve this, we’ve seen the DHA transition toward the MHS Information Platform (MIP), the largest secondary repository of health data in the Department of Defense (DoD). The scale of this modernization is staggering:

  • Consolidation of 35+ disparate sources into a unified platform, including legacy EHRs like AHLTA and CHCS, as well as newer systems like MHS GENESIS.
  • Migration of over 1 petabyte of data, representing 30 years of health records across inpatient, outpatient, and pharmacy domains.
  • Processing of 60 billion records annually through advanced ETL pipelines that must handle high-velocity data from thousands of global points of care.

Previously, analysts were forced to spend 80% of their time on “data wrangling”—the tedious process of finding, cleaning, and merging data from different systems that used different codes for the same conditions. This left only 20% for actual analysis. This inefficiency wasn’t just a matter of cost; it was a matter of patient safety and operational effectiveness. In a military context, delayed data means delayed improvements to clinical practice guidelines (CPGs) that save lives on the battlefield. By modernizing the platform, the DHA aims to enable “augmented leadership decision-making,” where AI and real-time data provide the insights needed to save lives on and off the battlefield.

According to scientific research on data harmonization, achieving syntactic and semantic interoperability is the only way to turn these heterogeneous datasets into a reliable asset for large-scale research. The DHA recognized that simply moving data to the cloud wasn’t enough; they needed a universal language. This led to the adoption of the OMOP Common Data Model, which provides the necessary structure to harmonize these diverse data streams into a single, queryable resource.

DHA Data Harmonization with OMOP: A Step-by-Step Implementation Guide

Standardizing such a massive volume of data requires a robust framework. The DHA chose the OMOP Common Data Model (CDM), maintained by the OHDSI community. The OMOP CDM is designed to transform disparate observational databases into a common format and representation (terminologies, vocabularies, and coding schemes). This allows for the application of standardized analytics across datasets that were originally captured in different formats.

Figure: OMOP architecture overview

The strategic choice of OMOP was driven by its ability to accommodate both administrative claims and Electronic Health Record (EHR) data. This allows the DHA to generate evidence from a wide variety of sources using a single set of standardized analytics tools. Furthermore, the OMOP model is patient-centric, meaning all data points—from drug exposures to procedure occurrences—are linked to a unique individual, enabling longitudinal studies that span a service member’s entire career.

For those looking to replicate this success, we recommend following a structured approach to DHA data harmonization with OMOP. While OHDSI suggests a 4-step process, recent research conceptualizes a more detailed 9-step iterative sequence that ensures high-fidelity transformation:

  1. Dataset Specification: Defining the scope of source data (e.g., specific patient cohorts, date ranges, or clinical domains like oncology or trauma).
  2. Data Profiling: Understanding the structure and unique values of the legacy data. This involves identifying “junk” values, null fields, and unexpected data formats.
  3. Vocabulary Identification: Cataloging all terminologies used in the source, such as ICD-9-CM, ICD-10-PCS, CPT-4, and local laboratory codes.
  4. Coverage Analysis: Checking how well OMOP vocabularies cover source codes. This step identifies where custom concept mappings might be required for military-specific injuries.
  5. Semantic Mapping: Mapping local codes to standard concepts (e.g., mapping a local code for “blast injury” to the appropriate SNOMED CT concept).
  6. Structural Mapping: Defining how source tables (which may be highly normalized or flat files) fit into OMOP tables like Condition_Occurrence or Measurement.
  7. ETL Implementation: Running the actual transformation pipelines. This often involves complex SQL or Spark jobs to transform billions of rows of data.
  8. Qualitative DQ Analysis: Checking for plausibility and conformity. For example, ensuring that a patient’s birth date precedes their first medical encounter.
  9. Quantitative DQ Analysis: Comparing record counts between source and target to ensure no data was lost during the transformation process.
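
The last two steps can be sketched in a few lines of Python. This is a minimal illustration rather than the DHA's actual pipeline: the record layout, the field names (`person_id`, `birth_date`, `visit_date`), the table names, and both function names are hypothetical.

```python
from datetime import date


def birth_after_first_encounter(persons, encounters):
    """Return person_ids whose birth date is later than their earliest encounter."""
    first_seen = {}
    for enc in encounters:
        pid = enc["person_id"]
        if pid not in first_seen or enc["visit_date"] < first_seen[pid]:
            first_seen[pid] = enc["visit_date"]
    return sorted(
        p["person_id"]
        for p in persons
        if p["person_id"] in first_seen
        and p["birth_date"] > first_seen[p["person_id"]]
    )


def reconcile_counts(source_counts, target_counts):
    """Return {table: (source, target)} for every table whose row counts differ."""
    tables = set(source_counts) | set(target_counts)
    return {
        t: (source_counts.get(t, 0), target_counts.get(t, 0))
        for t in sorted(tables)
        if source_counts.get(t, 0) != target_counts.get(t, 0)
    }


# Hypothetical sample data, for illustration only.
persons = [
    {"person_id": 1, "birth_date": date(1980, 5, 1)},
    {"person_id": 2, "birth_date": date(2001, 3, 9)},
]
encounters = [
    {"person_id": 1, "visit_date": date(1999, 7, 4)},
    {"person_id": 2, "visit_date": date(1998, 1, 15)},  # precedes birth: implausible
]

implausible = birth_after_first_encounter(persons, encounters)
mismatches = reconcile_counts(
    {"condition": 1000, "drug": 500},
    {"condition": 1000, "drug": 498},
)
print(implausible)  # person 2 is flagged
print(mismatches)   # only the "drug" table differs
```

A count mismatch is not always an error: an ETL may legitimately filter rows or split one source row into several target rows, so in practice each difference is reviewed against the mapping documentation rather than rejected automatically.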

By using the Common Data Model (CDM), organizations can ensure that their data is not just “standardized” but truly interoperable for global research.

Mapping Disparate Sources for DHA Data Harmonization with OMOP

The most challenging part of any DHA data harmonization project with OMOP is the mapping of legacy vocabularies. The DHA handles data from decades of different coding versions: ICD-9, ICD-10, and custom local codes used in specific military hospitals. To bridge this gap, we use the OHDSI toolset for Extract-Transform-Load (ETL) preparation.

  • WhiteRabbit: This tool scans the source data to identify the distribution of values and the structure of tables. Crucially, it helps prevent the display of personally identifiable information (PII) during the design phase by providing aggregate statistics rather than raw data.
  • Usagi: This tool assists in the manual mapping of source codes to OMOP standard concepts. It uses fuzzy matching logic to suggest the most likely SNOMED or RxNorm codes, which clinical experts then review and finalize.
  • Rabbit-In-A-Hat: Using the scan from WhiteRabbit, this tool provides a graphical interface where our teams can collaboratively map source columns to OMOP CDM tables (like the Person, Observation, or Drug_Exposure tables). It generates the documentation that serves as the blueprint for the ETL developers.
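
To give a feel for the "suggest, then review" workflow described for Usagi, here is a rough Python sketch. It is not Usagi's implementation: Usagi has its own term-similarity search over the OMOP vocabulary, whereas this example swaps in the standard library's `difflib.SequenceMatcher` as a stand-in. The candidate list and the function name `suggest_terms` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical candidate terms; a real run would search the OMOP vocabulary tables.
CANDIDATE_TERMS = [
    "Obstructive sleep apnea syndrome",
    "Central sleep apnea",
    "Traumatic brain injury",
]


def suggest_terms(source_term, candidates, top_n=2):
    """Rank candidate terms by string similarity to the source term."""
    scored = [
        (SequenceMatcher(None, source_term.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(term, round(score, 2)) for score, term in scored[:top_n]]


# A legacy spelling variant is matched to its closest candidates.
suggestions = suggest_terms("obstructive sleep apnoea", CANDIDATE_TERMS)
for term, score in suggestions:
    print(f"{score:.2f}  {term}")
```

The scores only rank candidates for a human reviewer. As with Usagi, a clinical expert still decides which suggestion, if any, is the correct mapping.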

According to the WhiteRabbit documentation, these tools are essential for creating the “blueprint” of the ETL process before any code is written. For the DHA, this meant mapping variations of a single concept—like “obstructive sleep apnea”—into a single, unique OMOP Concept ID, regardless of whether it was recorded in a 1995 legacy system or a 2024 modern EHR.
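
The many-to-one idea can be illustrated with a small lookup table. In the sketch below, the ICD-9-CM and ICD-10-CM codes for obstructive sleep apnea are real, but the local code `OSA-01`, the function name, and above all the concept ID `9000001` are placeholders we made up for illustration. A real ETL would resolve the standard concept from the OMOP vocabulary tables.

```python
# Placeholder standard concept ID: hypothetical, NOT a real OMOP concept_id.
OSA_CONCEPT_ID = 9_000_001

# (source vocabulary, source code) -> standard concept ID
SOURCE_TO_CONCEPT = {
    ("ICD9CM", "327.23"): OSA_CONCEPT_ID,   # obstructive sleep apnea (ICD-9-CM)
    ("ICD10CM", "G47.33"): OSA_CONCEPT_ID,  # obstructive sleep apnea (ICD-10-CM)
    ("LOCAL", "OSA-01"): OSA_CONCEPT_ID,    # hypothetical site-specific code
}


def to_standard_concept(vocabulary, code):
    """Return the standard concept ID, or 0 when no mapping exists."""
    return SOURCE_TO_CONCEPT.get((vocabulary, code.strip().upper()), 0)


records = [
    ("ICD9CM", "327.23"),   # e.g. a record from a 1995 legacy system
    ("ICD10CM", "g47.33"),  # e.g. a 2024 EHR record, lower-cased at the source
    ("LOCAL", "XYZ"),       # a code with no mapping yet
]
mapped = [to_standard_concept(vocab, code) for vocab, code in records]
print(mapped)  # [9000001, 9000001, 0]
```

Returning 0 for unmapped codes follows the OMOP convention of using concept ID 0 for "no matching concept", which keeps unmapped records visible for later review instead of silently dropping them.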

Validating Data Quality for DHA Data Harmonization with OMOP

Once the data is transformed, how do we know it’s accurate? In a system processing 60 billion records, manual checks are impossible. The DHA leverages the Data Quality Dashboard (DQD), an open-source tool that performs over 1,500 automated checks across the Kahn framework (Plausibility, Conformance, and Completeness). These checks ensure that the data adheres to the OMOP specification and that the clinical values make sense (e.g., a body temperature of 105 degrees Celsius would be flagged as an error).
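
To make the three Kahn categories concrete, here is a toy Python check on body-temperature rows. It does not use or reproduce the DQD, which is an R package with its own check definitions. The row layout, the function name, and the plausible range of 25 to 45 degrees Celsius are assumptions chosen only for illustration.

```python
# Assumed plausible range for human body temperature in Celsius (illustrative only).
PLAUSIBLE_MIN_C = 25.0
PLAUSIBLE_MAX_C = 45.0


def check_measurement(row):
    """Return the names of the checks that one body-temperature row fails."""
    failures = []
    # Completeness: required fields must be present.
    for field in ("person_id", "value_celsius"):
        if row.get(field) is None:
            failures.append(f"completeness:{field}")
    value = row.get("value_celsius")
    if value is not None:
        if not isinstance(value, (int, float)):
            # Conformance: the value must be numeric.
            failures.append("conformance:value_celsius")
        elif not (PLAUSIBLE_MIN_C <= value <= PLAUSIBLE_MAX_C):
            # Plausibility: the value must fall in a believable range.
            failures.append("plausibility:value_celsius")
    return failures


rows = [
    {"person_id": 1, "value_celsius": 37.0},
    {"person_id": 2, "value_celsius": 105.0},    # the implausible reading from the text
    {"person_id": None, "value_celsius": "n/a"},
]
results = [check_measurement(r) for r in rows]
for row, failed in zip(rows, results):
    print(row, "->", failed or "ok")
```

The real dashboard applies this kind of rule systematically across tables and fields and reports failure rates against thresholds, rather than inspecting rows one at a time.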

Complementing the DQD is Achilles, which provides characterization and visualization of the CDM database. Achilles allows us to see the “shape” of the data—identifying outliers or unexpected drops in record counts that might indicate an error in the ETL pipeline. For example, if the number of “Condition” records drops by 50% in a specific year, Achilles makes this immediately visible, prompting an investigation into the source data for that period.
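
The kind of signal described here, a sudden year-over-year drop, is simple to express in code. The sketch below is not Achilles itself, which is an R package that characterizes and visualizes a CDM database. The function name, the 50% threshold default, and the yearly counts are hypothetical.

```python
def flag_drops(counts_by_year, threshold=0.5):
    """Return (year, fractional_drop) where the count fell by at least `threshold`."""
    flagged = []
    years = sorted(counts_by_year)
    for prev, curr in zip(years, years[1:]):
        before, after = counts_by_year[prev], counts_by_year[curr]
        if before > 0:
            drop = (before - after) / before
            if drop >= threshold:
                flagged.append((curr, round(drop, 2)))
    return flagged


# Hypothetical yearly counts of condition records.
condition_counts = {2010: 1200, 2011: 1180, 2012: 560, 2013: 1150}
flagged = flag_drops(condition_counts)
print(flagged)  # [(2012, 0.53)]
```

A flagged year is a prompt for investigation, not proof of an ETL bug: the drop could also reflect a real change at the source, such as a system migration or a coding transition.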

Using the Data Quality Dashboard tool, the DHA ensures that the data used for clinical decisions or AI modeling is of the highest integrity. This rigorous validation is what builds trust among clinicians and researchers who rely on the platform to develop new treatments for traumatic brain injury (TBI) or post-traumatic stress disorder (PTSD).

From Manual Abstraction to SIMON: Modernizing Trauma Registries

One of the most impactful applications of DHA data harmonization with OMOP is within the Joint Trauma System (JTS). The DoD Trauma Registry (DoDTR) is the world’s most comprehensive combat trauma registry, containing over 93,000 cases from conflicts in Iraq, Afghanistan, and other global theaters. Historically, however, this registry relied on 100% manual abstraction, a process in which humans review medical records and enter data into the registry by hand. This process is slow and labor-intensive, and its findings can take years to influence policy or clinical practice.

The DHA is now moving toward the SIMON platform (System for Injury Monitoring and Outcomes Nexus). Expected to go into production in late 2025, SIMON fundamentally changes the backend data model to be consistent with OMOP CDM. This shift allows the JTS to move away from manual data entry and toward automated data ingestion from the MHS Information Platform.

The benefits of this transition include:

  • Automation: Using AI/ML and Natural Language Processing (NLP) to assist in medical record abstraction. NLP can scan physician notes to identify injury patterns and outcomes that were previously hidden in unstructured text.
  • Real-Time Evidence: Moving from “years” to “days” for data to influence clinical practice guidelines. This means that a new life-saving technique discovered in the field can be validated and disseminated across the entire military health system in near real-time.
  • Interoperability: Facilitating data sharing between military and civilian trauma centers. By using the OMOP standard, the DoD can compare combat trauma outcomes with civilian trauma data, leading to better care for both populations.
  • Longitudinal Tracking: SIMON will allow researchers to track the long-term outcomes of trauma patients as they transition from active duty to veteran status, providing a complete picture of the “continuum of care.”

As highlighted in “Standardizing registry data to OMOP,” transforming registry data into a CDM allows for more robust conclusions in rare disease and trauma research, where data is often limited. For the DHA, this means the ability to conduct high-powered studies on low-frequency, high-impact injuries like blast-related ocular trauma or complex limb salvage procedures.

Powering AI/ML and Advanced Analytics via Standardized Data

The ultimate goal of DHA data harmonization with OMOP is to move beyond retrospective reporting and into the era of predictive intelligence. Standardized data is the “fuel” for AI/ML. Without it, models are brittle and cannot be generalized across different hospital sites. By harmonizing data into the OMOP CDM, the DHA creates a “feature-ready” environment where data scientists can build models that predict patient outcomes with high accuracy.

With data in the OMOP format, the DHA can leverage the full suite of OHDSI analytics tools:

  • ATLAS: A web-based tool for designing and executing population-level characterization and cohort building. Researchers can define complex cohorts (e.g., “patients with TBI who received a specific neuroprotective agent within 4 hours of injury”) without writing a single line of SQL.
  • HADES: A library of R packages for advanced analytics, including patient-level prediction and population-level effect estimation. HADES allows for the application of sophisticated statistical methods to identify which treatments are most effective for specific patient subgroups.
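
ATLAS lets researchers express the example cohort above through its interface, with no code required. For readers who want to see the underlying logic, here is a rough Python sketch of the same "exposure within 4 hours of injury" rule. The data layout, the function name, and the timestamps are hypothetical, and this is not how ATLAS represents or executes cohort definitions.

```python
from datetime import datetime, timedelta


def build_cohort(injuries, exposures, window=timedelta(hours=4)):
    """Return sorted person_ids with an exposure from 0 to `window` after injury."""
    cohort = set()
    for pid, injured_at in injuries.items():
        for exposed_at in exposures.get(pid, []):
            if timedelta(0) <= exposed_at - injured_at <= window:
                cohort.add(pid)
                break
    return sorted(cohort)


# Hypothetical data: person_id -> time of TBI, and person_id -> drug exposure times.
injuries = {
    1: datetime(2024, 1, 1, 8, 0),
    2: datetime(2024, 1, 1, 9, 0),
    3: datetime(2024, 1, 2, 7, 0),
}
exposures = {
    1: [datetime(2024, 1, 1, 10, 30)],  # 2.5 hours after injury: included
    2: [datetime(2024, 1, 1, 14, 0)],   # 5 hours after injury: excluded
}

cohort = build_cohort(injuries, exposures)
print(cohort)  # [1]
```

Because every site stores injuries and exposures in the same OMOP tables with the same concepts, a rule like this can be defined once and run unchanged against each harmonized source.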

For example, using the ATLAS analytics tool, researchers can quickly identify a cohort of patients with a specific injury pattern and analyze the effectiveness of different treatment protocols. This supports the DHA’s goal of “military readiness,” ensuring that medical personnel are trained on the most effective, data-driven interventions before they ever deploy.

At Lifebit, we integrate these OHDSI tools into our Trusted Research Environment (TRE) and Trusted Data Lakehouse (TDL). This allows researchers to run complex AI models directly where the data resides, ensuring security and compliance while delivering real-time insights through our R.E.A.L. (Real-time Evidence & Analytics Layer). This federated approach is particularly important for the DHA, as it allows for collaboration with academic and industry partners without the need to move sensitive military health data outside of secure DoD environments. By bringing the analysis to the data, we maintain the highest levels of security while accelerating the pace of medical discovery.

Frequently Asked Questions about DHA Data Harmonization

What is the role of the OMOP CDM in military health?

The OMOP CDM acts as the “universal translator” for military health data. It allows the DHA to combine data from hundreds of legacy systems into a single format, enabling the use of standardized analytics tools and facilitating collaboration with the VA and international research partners. It ensures that a “diagnosis” in one system is treated the same as a “diagnosis” in another, regardless of the underlying source code.

How does data harmonization improve patient safety?

Harmonization ensures that a patient’s medical history is consistent and accessible, regardless of where they receive care. By eliminating data silos, clinicians have a more complete picture of patient health, including allergies, past procedures, and medication history. This reduces the risk of medical errors, prevents adverse drug interactions, and enables proactive, personalized care tailored to the individual service member’s needs.

What challenges did the DHA face during migration?

The primary challenges included handling 30 years of legacy data with varying quality, mapping complex combat trauma registries that used non-standard terminology, and managing the cultural shift toward a data-driven model of care. Additionally, ensuring data privacy and security in a petabyte-scale environment required the implementation of robust governance frameworks and federated data architectures.

How does OMOP differ from FHIR in the DHA context?

While FHIR (Fast Healthcare Interoperability Resources) is primarily designed for the exchange of individual patient records in real-time clinical settings, OMOP is designed for large-scale observational research and analytics. The DHA uses both: FHIR for operational interoperability between EHRs, and OMOP for the secondary use of data in the MHS Information Platform to drive research and population health management.

Can OMOP data be used for AI/ML training?

Yes, OMOP-standardized data is ideal for AI/ML because it provides a consistent structure and vocabulary. This allows models to be trained on large, diverse datasets and then validated across different sites within the military health system, ensuring that the models are robust and generalizable.

Conclusion: The Future of Military Health Intelligence

The transition from data chaos to clarity at the Defense Health Agency is more than just an IT upgrade; it is a fundamental shift in how military medicine is practiced. By embracing data harmonization with OMOP, the agency has built a foundation for a unified, secure, and intelligent health ecosystem. This foundation supports everything from individual patient care to global health surveillance and combat casualty research.

This journey is supported by the global OHDSI community, whose tools and collaborative spirit have made this scale of standardization possible. The ability to leverage a global network of researchers and open-source tools ensures that the DHA remains at the forefront of medical innovation. At Lifebit, we are proud to support these efforts by providing the federated AI platforms and governance frameworks necessary to unlock the full potential of biomedical and multi-omic data.

As we look to the future, the integration of AI/ML, real-time trauma registry monitoring, and seamless cross-agency collaboration will continue to improve military readiness and save the lives of those who serve. The blueprint is clear: through standardization and secure collaboration, we can turn 60 billion records into a powerful engine for health innovation. The ultimate success of this effort will be measured not in petabytes of data migrated, but in the lives saved on the battlefield and the improved health outcomes of our nation’s veterans.



© 2025 Lifebit Biotech Inc. DBA Lifebit. All rights reserved.
