How NIAID Keeps Patient Data Under Lock and Key While Linking Records

How NIAID Privacy Preserving Record Linkage Unlocks 23M Patient Records Securely

NIAID privacy preserving record linkage (PPRL) is the method used to securely connect patient records across separate data sources — such as electronic health records (EHR) and insurance claims — without ever exposing personally identifiable information (PII).

Here is a quick summary of how it works and why it matters:

Key Question	Quick Answer
What is PPRL?	A technique that links patient records using cryptographic tokens instead of real names or IDs
Who uses it?	NIH-supported initiatives like N3C, with alignment to NIAID-funded COVID-19 research
What data does it link?	EHR data from 240+ health systems + CMS Medicare and Medicaid claims
How big is the dataset?	Over 23 million individuals, 33 billion rows of data
Is it accurate?	Yes — age collision rates as low as 0.11%, gender collision rates as low as 0.08%
Who manages the linkage?	An independent Linkage Honest Broker (Regenstrief Institute) that never sees raw PII
Why does it matter?	It fills critical gaps in fragmented US healthcare data for research on Long COVID, disparities, and more

The challenge driving all of this is simple but serious. The US healthcare system is deeply fragmented. A single patient may receive care from a primary doctor, a specialist, and a hospital — each keeping separate records. No single dataset tells the full story.

For researchers studying COVID-19 outcomes, Long COVID, or treatment disparities, incomplete data means incomplete answers. PPRL addresses this by linking records across systems using secure, hashed tokens, so no PII has to leave the contributing site.

The result? A national research asset that supports over 4,100 investigators, 580+ clinical studies, and some of the most important post-COVID science happening today.

I’m Maria Chatzou Dunford, CEO and Co-founder of Lifebit, with over 15 years of experience in computational biology, federated data infrastructure, and biomedical data integration — areas that sit at the heart of NIAID privacy preserving record linkage implementation. In this guide, I’ll walk you through exactly how PPRL works inside N3C, how it was validated, and what it means for the future of privacy-safe research at scale.


How NIAID Privacy Preserving Record Linkage Works Without Exposing PII

At its core, NIAID privacy preserving record linkage is a technical solution to a legal and ethical puzzle: how do we connect data about the same person from two different places without knowing who that person is? In COVID-19 research, this is essential for following a patient’s journey from their first positive test in an EHR to their long-term recovery tracked in insurance claims. The US healthcare landscape is notoriously siloed, with clinical data often separated from administrative and claims data. PPRL acts as the bridge that spans these silos while maintaining a zero-trust security posture.

The process relies on a technique called tokenization. Instead of sending a patient’s name, Social Security number, or date of birth to a central database, each data-contributing site uses specialized software to “scramble” these identifiers into cryptographic hashes. This ensures that the raw PII never leaves the secure environment of the healthcare provider. This local-first approach is critical for compliance with HIPAA and the Common Rule, as it ensures that the central repository never actually handles identifiable information.

The Magic of Cryptographic Hashing and Salting

The specific algorithm often used is SHA-256 (Secure Hash Algorithm 256-bit). This is a one-way mathematical function. If you put “John Doe” into the function, it returns a 64-character hexadecimal string that looks random (a token). Because it is a one-way function, there is no practical way to compute “John Doe” back from that token. Even a single character change in the input, like changing “John” to “Jon”, results in a completely different hash, which is why data cleaning at the source is so critical. This sensitivity to input is known as the “avalanche effect,” where a minor change in the input leads to a drastically different output, so that no patterns can be easily discerned from the hashes themselves.
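To make the avalanche effect concrete, here is a minimal sketch using Python’s standard hashlib module. The helper names sha256_token and differing_bits are ours, chosen for illustration, and this unsalted hash is not how production tokens are generated.

```python
import hashlib


def sha256_token(value: str) -> str:
    """Return the SHA-256 digest of a string as 64 hexadecimal characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count how many of the 256 digest bits differ between two tokens."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


token_john = sha256_token("John Doe")
token_jon = sha256_token("Jon Doe")

print(token_john)
print(token_jon)
print(differing_bits(token_john, token_jon), "of 256 bits differ")
```

Running it shows two unrelated-looking tokens, even though the inputs differ by a single letter.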

To make this even more secure, a “salt” (a secret piece of extra data) is added to the hashing process. This defends against “dictionary attacks,” where someone tries to guess the original name by hashing every name in a phonebook until they find a match. By using a project-specific salt shared by all participating sites, the tokens generated for NIAID research are unique to that ecosystem; if each site used a different salt, the same patient would receive different tokens at each site and could not be matched. These de-identified tokens allow us to see that “Token A” in a hospital database is the same person as “Token A” in a pharmacy database, all while keeping the patient’s identity under lock and key. The management of these salts is a high-security operation, often involving hardware security modules (HSMs) to ensure that the “key” to the hashing process is never exposed.
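A salted token can be sketched as follows. This is an illustration under our own assumptions, not the tokenization software used in N3C: we use HMAC-SHA256 as the keyed hash, PROJECT_SALT is a placeholder value, and normalize and make_token are hypothetical helpers.

```python
import hashlib
import hmac
import unicodedata

# Assumption: a placeholder secret. A real deployment would keep this in an
# HSM or secrets manager, never in source code.
PROJECT_SALT = b"hypothetical-project-secret"


def normalize(value: str) -> str:
    """Strip accents, punctuation, and whitespace, then upper-case the value."""
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "".join(ch for ch in ascii_only.upper() if ch.isalnum())


def make_token(first: str, last: str, dob: str, salt: bytes = PROJECT_SALT) -> str:
    """Build a keyed SHA-256 token from cleaned identifiers."""
    message = "|".join([normalize(first), normalize(last), normalize(dob)])
    return hmac.new(salt, message.encode("utf-8"), hashlib.sha256).hexdigest()


# Two sites that share the salt produce the same token despite formatting noise.
hospital_token = make_token("John", "Doe", "1980-01-31")
pharmacy_token = make_token(" john ", "DOE", "1980-01-31")
print(hospital_token == pharmacy_token)

# A different salt yields an unrelated token for the same person.
other_token = make_token("John", "Doe", "1980-01-31", salt=b"another-secret")
print(hospital_token == other_token)
```

The normalization step is the “data cleaning at the source”: without it, trailing spaces or letter case alone would break the match.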

Advanced PPRL: Bloom Filters and Probabilistic Matching

In many NIAID privacy preserving record linkage implementations, researchers use Bloom filters. A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set. In PPRL, PII elements (like surname, first name, and DOB) are broken into small pieces called bigrams (e.g., “John” becomes “Jo”, “oh”, “hn”). These bigrams are hashed and mapped into a bit array. This allows for a more nuanced comparison than simple exact-string matching.

This approach allows for “fuzzy matching.” If a patient’s name is misspelled in one system but correct in another, the Bloom Filters will still show a high degree of similarity, allowing the system to link the records with a high degree of confidence without ever needing the actual name. This balances the need for high-quality data linkage with the absolute requirement for privacy. The mathematical beauty of Bloom filters lies in their ability to provide a similarity score (often using Jaccard similarity) between two records without ever revealing the underlying characters. This is particularly useful in the US, where data entry errors in EHRs are common due to the high volume of patient intake.
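The bigram-and-Bloom-filter idea can be sketched in a few lines. The filter length, the number of hash functions, and the secret key below are illustrative assumptions, and the filter is stored as the set of bit positions that are switched on rather than as a packed bit array.

```python
import hashlib

FILTER_BITS = 256   # assumption: length of the bit array
NUM_HASHES = 4      # assumption: hash functions applied to each bigram


def bigrams(text: str) -> set:
    """Split a string into overlapping two-character pieces."""
    cleaned = text.lower()
    return {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}


def bloom_encode(text: str, secret: bytes = b"hypothetical-key") -> set:
    """Return the set of bit positions set to 1 for this value."""
    positions = set()
    for gram in bigrams(text):
        for k in range(NUM_HASHES):
            digest = hashlib.sha256(secret + bytes([k]) + gram.encode("utf-8")).digest()
            positions.add(int.from_bytes(digest[:4], "big") % FILTER_BITS)
    return positions


def jaccard(bits_a: set, bits_b: set) -> float:
    """Jaccard similarity: shared set bits divided by all set bits."""
    if not bits_a and not bits_b:
        return 1.0
    return len(bits_a & bits_b) / len(bits_a | bits_b)


print(sorted(bigrams("John")))
print(jaccard(bloom_encode("jonathan"), bloom_encode("jonathon")))
print(jaccard(bloom_encode("jonathan"), bloom_encode("maria")))
```

The misspelled pair shares most of its bigrams and therefore most of its set bits, while the unrelated pair overlaps only through chance hash collisions.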

For a deeper dive into how these connections are made across public health, see this study on Privacy preserving record linkage for public health action: opportunities and challenges. We also have a comprehensive guide on the broader concept of Data Linkage for those looking to understand the foundational principles.

The Technical Components of NIAID Privacy Preserving Record Linkage

The success of this system isn’t just about the math; it’s about the architecture. In the National COVID Cohort Collaborative (N3C) ecosystem, which aligns with NIAID’s research goals, three key players ensure the system remains neutral and secure:

The Tokenization Contractor: Provides the specialized software used by health systems to generate the hashes locally. This software runs behind the hospital’s firewall, ensuring that the raw PII is processed in a trusted environment. The contractor often provides “data quality” tools that help sites clean their PII before hashing, which significantly improves the match rate.
The Linkage Honest Broker (LHB): This is a neutral third party — specifically the Regenstrief Institute. The LHB receives the tokens but never sees the clinical data or the original PII. Their only job is to match tokens from different sources and create a unique, non-identifying “Match ID.” The LHB acts as a “blinded” intermediary, ensuring that the data-contributing site and the data-analyzing site are never the same entity.
The Data Analytics Contractor: Manages the secure enclave (like N3C) where researchers analyze the clinical data. This data is stripped of PII and instead uses the Match ID to connect records across the dataset. The analytics contractor is responsible for the “Safe Harbor” or “Expert Determination” processes that ensure the final dataset is truly de-identified according to HIPAA standards.

By separating these roles, no single organization ever has enough information to re-identify a patient. The hospital has the PII but no outside data; the LHB has the tokens but no clinical data; and the researchers have the clinical data but no PII. This “separation of powers” is a cornerstone of modern Data Matching Technology. The architecture is sometimes described as a lightweight analogue of multi-party computation: rather than relying on a cryptographic protocol, it distributes the work across organizations so that no single party becomes a point of failure or privacy breach.
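The division of labour can be sketched as a toy simulation. The class name, the Match ID format, and the short token strings are made up for illustration; the point is only that the broker’s code path never touches PII or clinical values.

```python
class LinkageHonestBroker:
    """Sees tokens only; hands back non-identifying Match IDs."""

    def __init__(self):
        self._token_to_match_id = {}

    def match_id_for(self, token: str) -> str:
        # The same token always maps to the same Match ID.
        if token not in self._token_to_match_id:
            self._token_to_match_id[token] = f"M{len(self._token_to_match_id) + 1:06d}"
        return self._token_to_match_id[token]


# Each source submits (local row id -> token). No names, no clinical data.
ehr_tokens = {"ehr-row-1": "a3f1", "ehr-row-2": "9bc4"}
cms_tokens = {"cms-row-7": "a3f1", "cms-row-8": "77d0"}

broker = LinkageHonestBroker()
ehr_crosswalk = {row: broker.match_id_for(tok) for row, tok in ehr_tokens.items()}
cms_crosswalk = {row: broker.match_id_for(tok) for row, tok in cms_tokens.items()}

# The enclave receives clinical rows keyed by Match ID and joins on it.
linked_ids = sorted(set(ehr_crosswalk.values()) & set(cms_crosswalk.values()))
print(ehr_crosswalk)
print(cms_crosswalk)
print(linked_ids)
```

Here ehr-row-1 and cms-row-7 carry the same token, so they receive the same Match ID and can be joined in the enclave without anyone there seeing the token or the name behind it.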

Overcoming Data Silos with NIAID Privacy Preserving Record Linkage

Why go to all this trouble? Because fragmented care is the enemy of good science. In the US, a patient’s medical history is often a series of “silos.” One hospital has their lab results, a different clinic has their vaccination record, and a state database might have their mortality information. This fragmentation is particularly acute for marginalized populations who may rely on multiple safety-net providers, making their data even harder to track without robust linkage tools.

For example, a patient might be treated for acute COVID-19 at a major academic medical center, but receive follow-up care for Long COVID at a community clinic that uses a different EHR system. Without NIAID privacy preserving record linkage, these two encounters look like they belong to two different people. By using PPRL, we can build a longitudinal history that follows the patient across all these touchpoints. This is vital for “patient-centered outcomes,” where we want to know how a patient is doing months or years after their initial infection. It allows researchers to see the “full arc” of the disease, from the initial viral insult to the long-term sequelae that may manifest as cardiovascular or neurological issues.

To learn more about how we analyze this data without moving it from its secure home, explore our insights on Privacy Preserving Statistical Data Analysis on Federated Databases.

Connect N3C Clinical Data with CMS Claims for a Complete Patient View

One of the greatest achievements of the N3C and NIAID-supported efforts is the linkage of EHR data with CMS Medicare and Medicaid claims. While EHRs are great for clinical details (like blood pressure, oxygen levels, or specific lab titers), claims data is the “gold standard” for seeing the big picture of healthcare utilization. Claims capture every doctor visit, every prescription filled, and every hospital stay, regardless of where it happened, as long as it was billed to the insurance provider. This provides a “safety net” for data, capturing events that occur outside of the primary health system’s network.

This linkage addresses the problem of “missingness.” For example, if a patient gets a COVID-19 vaccine at a local pharmacy or a mass-vaccination site instead of their usual hospital, that record might be missing from the hospital’s EHR. However, it will show up in their insurance claims. By merging these sets, we get a much clearer view of vaccine effectiveness and long-term health utilization. This is particularly important for studying populations that may have high mobility or receive care across multiple state lines, such as seasonal workers or students.

The N3C infrastructure has successfully harmonized data from over 240 participating organizations, creating a repository of over 23 million individuals. You can read more about the design of this massive effort in The national COVID cohort collaborative (N3C): rationale, design, infrastructure, and deployment. This type of large-scale work is a prime example of Federated Data Analysis in action, where the data is brought together virtually to solve problems that no single institution could tackle alone.

The Role of the OMOP Common Data Model

To make this linked data usable, it must be “harmonized.” Every hospital uses different codes for the same thing—one might use a local code for a “dry cough,” while another uses an ICD-10 code. NIAID-supported initiatives use the OMOP (Observational Medical Outcomes Partnership) Common Data Model. This model is maintained by the OHDSI (Observational Health Data Sciences and Informatics) community and provides a standardized structure for observational data.

In the OMOP model, all disparate data types are mapped to a standard set of tables and vocabularies. This means a researcher can write one piece of code to analyze “Type 2 Diabetes” and it will work across data from 240 different hospitals. The process of moving data from a local EHR into OMOP is called ETL (Extract, Transform, Load). This is a massive undertaking that requires clinical experts to ensure that the meaning of the data is preserved during the transformation. For instance, mapping a local lab result for “SARS-CoV-2 PCR” to the standard LOINC code requires careful validation to ensure that negative and positive results are interpreted correctly across different testing platforms. When combined with PPRL, OMOP allows for high-speed, high-fidelity research across tens of millions of records, enabling “real-world evidence” generation at an unprecedented scale.
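As a rough sketch of what harmonization buys you, the example below maps site-specific codes to one standard concept and then runs a single query across sites. The concept ID 9000001, the local code, and the site names are placeholders we invented, not real OMOP vocabulary entries; the column names follow the OMOP condition_occurrence table, and 0 is the OMOP convention for “no matching concept”.

```python
NO_MATCHING_CONCEPT = 0

# Hypothetical source-to-standard map produced during ETL.
# 9000001 stands in for a standard "cough" concept; it is not a real concept_id.
SOURCE_TO_STANDARD = {
    ("site_a", "LOCAL-COUGH-01"): 9000001,   # site-specific local code
    ("site_b", "R05"): 9000001,              # ICD-10-style code for cough
}


def harmonize(records: list) -> list:
    """Translate raw site records into OMOP-style condition rows."""
    rows = []
    for rec in records:
        key = (rec["site"], rec["source_code"])
        rows.append({
            "person_id": rec["person_id"],
            "condition_concept_id": SOURCE_TO_STANDARD.get(key, NO_MATCHING_CONCEPT),
            "condition_source_value": rec["source_code"],
        })
    return rows


def count_persons(rows: list, concept_id: int) -> int:
    """One query that works regardless of which site the row came from."""
    return len({r["person_id"] for r in rows if r["condition_concept_id"] == concept_id})


raw = [
    {"site": "site_a", "person_id": 1, "source_code": "LOCAL-COUGH-01"},
    {"site": "site_b", "person_id": 2, "source_code": "R05"},
    {"site": "site_b", "person_id": 3, "source_code": "UNMAPPED-XYZ"},
]
omop_rows = harmonize(raw)
print(count_persons(omop_rows, 9000001))
print(count_persons(omop_rows, NO_MATCHING_CONCEPT))
```

Unmapped codes are kept rather than dropped, so the ETL team can review them later.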

Validating Accuracy in NIAID Privacy Preserving Record Linkage

You might wonder: if we are scrambling the names, how do we know the matches are actually correct? We use “collision rates” to measure accuracy. A “collision” happens when the system thinks two different people are the same person because their tokens are too similar. This is the “false positive” of the linkage world. Conversely, we also measure “missed matches,” where the system fails to link two records that actually belong to the same person.

The results for NIAID privacy preserving record linkage are impressively precise, as validated by the Regenstrief Institute using a “gold standard” dataset where the true identities were known for validation purposes:

Age Collisions: Only 0.11% for EHR-to-EHR links and 0.15% for EHR-to-CMS links. This means the system almost never confuses people of different ages, which is critical for pediatric and geriatric research.
Gender Collisions: Only 0.42% for EHR-to-EHR links and a tiny 0.08% for EHR-to-CMS links. These low rates ensure that sex-disaggregated analyses remain highly accurate.
Within-Site Duplication: The median rate is a mere 0.0034%, showing that the system is excellent at identifying when the same person appears multiple times in the same hospital’s records, often due to different registration events.

These results indicate that the tokenization process is highly reliable. We aren’t just guessing; we are building a high-fidelity map of the patient population that rivals the accuracy of traditional PII-based matching while keeping raw PII out of the research environment. The use of multiple tokens (e.g., one for Name+DOB, another for SSN, another for Address) allows for a “voting” mechanism that further increases the precision of the match.
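A collision rate of this kind can be computed along the following lines. This is a simplified sketch: we assume a collision means any disagreement on the attribute between two linked records, whereas the published validation may have used tolerances or other rules not described here. The sample pairs are invented.

```python
def collision_rate(linked_pairs: list, attribute: str) -> float:
    """Share of linked record pairs whose values for `attribute` disagree."""
    comparable = [
        (a, b) for a, b in linked_pairs
        if a.get(attribute) is not None and b.get(attribute) is not None
    ]
    if not comparable:
        return 0.0
    collisions = sum(1 for a, b in comparable if a[attribute] != b[attribute])
    return collisions / len(comparable)


# Invented example: each tuple is (EHR record, CMS record) sharing a Match ID.
pairs = [
    ({"birth_year": 1950, "gender": "F"}, {"birth_year": 1950, "gender": "F"}),
    ({"birth_year": 1962, "gender": "M"}, {"birth_year": 1962, "gender": "M"}),
    ({"birth_year": 1988, "gender": "F"}, {"birth_year": 1978, "gender": "F"}),
    ({"birth_year": 2001, "gender": "M"}, {"birth_year": 2001, "gender": "M"}),
]

print(f"age collision rate: {collision_rate(pairs, 'birth_year'):.2%}")
print(f"gender collision rate: {collision_rate(pairs, 'gender'):.2%}")
```

In this toy sample one of four pairs disagrees on birth year, so the age collision rate is 25%. The real figures come from millions of pairs, which is what makes rates near 0.1% meaningful.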

For a deeper look at how software handles these complex identity challenges, check out our Deep Dive into Entity Resolution Software.

Governance and Access Requirements for Linked Datasets

With great data comes great responsibility. Accessing these linked CMS-N3C datasets isn’t as simple as clicking a download button. Investigators must navigate a strict governance process to ensure the data is used ethically and only for approved public health purposes. This governance is overseen by the N3C Data Access Committee (DAC), which includes representatives from NIH, clinical sites, and patient advocacy groups.

Data Use Request (DUR): Researchers must submit a formal proposal to the N3C DAC explaining exactly what they want to study and why they need the linked data. This proposal is reviewed for scientific merit, ethical compliance, and potential for re-identification risk. The DUR must also specify the “Level” of data requested (e.g., Level 2 De-identified or Level 3 Limited Data Set).
Linkage Honest Broker Agreement (LHBA): Institutions must sign this legal agreement to participate in the PPRL process, acknowledging their responsibilities in protecting the de-identified data and agreeing to the “rules of the road” for data sharing.
Site Permissions: Individual health systems must “opt-in” to allow their data to be linked with CMS records. This respects the autonomy of the data-contributing institutions and ensures that they are comfortable with how their patient data is being utilized in the broader collaborative.
Secure Enclave Environment: Researchers never download the data to their own laptops. Instead, they log into a secure, monitored cloud environment (the N3C Data Enclave) where they can run their analyses using tools like R, Python, and SQL. Only the aggregate results (like a graph or a table) can be exported, and only after a “disclosure review” by a human data steward to ensure no individual can be identified from the results (e.g., ensuring no cell sizes are smaller than 10).

This layered approach ensures that while the data is “open” for science, it remains “closed” to misuse. It’s a perfect illustration of Federated Data Governance, where trust is built through transparency, technical controls, and legal accountability.
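The cell-size rule mentioned in the disclosure review can be sketched as a simple filter applied before export. The threshold of 10 comes from the text above; masking every count below it, including zeros, is our assumption, and actual enclave policy may differ.

```python
MIN_CELL_SIZE = 10


def suppress_small_cells(table: dict, threshold: int = MIN_CELL_SIZE) -> dict:
    """Replace any count below the threshold with a masked label."""
    return {
        group: (count if count >= threshold else f"<{threshold}")
        for group, count in table.items()
    }


counts_by_group = {"age 18-39": 412, "age 40-64": 1038, "age 65+": 7, "unknown": 10}
print(suppress_small_cells(counts_by_group))
```

An automated check like this would complement the human data steward’s review, not replace it.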

Solve Long COVID and Health Disparities with Securely Linked Data

The real-world impact of NIAID privacy preserving record linkage is found in the papers being published and the lives being improved. By having a complete view of the patient, researchers can tackle questions that were previously impossible to answer because the data was too fragmented. The N3C enclave has already supported hundreds of peer-reviewed publications that have directly influenced clinical guidelines and public health policy.

Key Research Areas Enabled by PPRL:

1. Long COVID (PASC) Identification

Identifying the “fingerprints” of post-acute sequelae of SARS-CoV-2 (PASC) is incredibly difficult. Symptoms like fatigue, brain fog, and joint pain are often non-specific and may not be coded consistently in EHRs. By linking EHRs (which contain clinical notes and lab tests) with insurance claims (which show long-term healthcare utilization and new diagnoses months after the initial infection), researchers can identify patterns that define Long COVID. For instance, a researcher might see a spike in cardiology referrals in the claims data that corresponds to a specific inflammatory marker found in the EHR during the acute phase. This helps in developing diagnostic criteria and potential treatments for millions of sufferers.

2. Pregnancy and Neonatal Outcomes

Understanding how COVID-19 affects birth outcomes requires linking maternal health records with neonatal data. Often, a mother and her newborn may have separate medical records that aren’t easily connected in a de-identified dataset, especially if they are discharged at different times. PPRL allows researchers to link the mother’s COVID-19 severity and vaccination status with the baby’s health outcomes, such as preterm birth or NICU admission. This has provided critical insights for prenatal care guidelines, confirming that COVID-19 vaccination during pregnancy is both safe and protective for the infant.

3. Evusheld and Therapeutic Utilization

Tracking how life-saving antibody treatments, like Evusheld, reached high-risk immunocompromised populations was a major priority during the Omicron wave. By linking pharmacy data with clinical records, researchers could see which patients were eligible for the treatment based on their underlying conditions (like organ transplants or cancer) versus who actually received it. This highlighted significant gaps in the supply chain and clinical implementation, showing that many of the most vulnerable patients were not receiving the therapies they needed. These insights allow for better preparation for future variants and more equitable distribution of limited resources.

4. Health Disparities and Paxlovid

Analyzing Paxlovid treatment rates by race, ethnicity, and zip code is essential for identifying and closing gaps in care. PPRL allows for the integration of Social Determinants of Health (SDOH) data—such as the Social Vulnerability Index (SVI)—with clinical outcomes. Researchers found that even when accounting for clinical risk, certain populations were less likely to receive antiviral treatments, leading to targeted public health interventions. For example, by linking data, researchers could identify “pharmacy deserts” where Paxlovid was unavailable, prompting the deployment of mobile clinics to those areas.

One notable study utilized these methods to look at the relative effectiveness of COVID-19 vaccination and booster dose combinations, a study that required the massive scale of the N3C enclave to achieve statistical significance across different age groups and comorbidities. To understand how we help researchers navigate these complex data landscapes, see our NIAID Data Access Program guide.

Future Expansions for NIAID Privacy Preserving Record Linkage

The work doesn’t stop with EHRs and CMS claims. The “linkage map” is constantly expanding to include more diverse data types, moving toward a truly holistic view of human health. Future plans for NIAID privacy preserving record linkage include:

Mortality Data: Integrating more comprehensive death records from the National Death Index (NDI) and state-level registries to better understand the true case-fatality rates of different variants. This is crucial for capturing deaths that occur at home or in hospice, which are often missed by hospital-based EHRs.
Medical Imaging: Linking de-identified CT scans and X-rays to clinical histories. This would allow AI models to learn how lung damage visible on an image correlates with long-term respiratory symptoms recorded in the EHR. The challenge here is ensuring that the metadata of the images is also properly de-identified.
Social Determinants of Health (SDOH): Bringing in neighborhood-level data (like air quality, food access, or transportation availability) to see how environment affects COVID-19 recovery. This is done by hashing the patient’s address into a “census tract token” or a “geohash” that provides location context without revealing the exact home address, protecting the privacy of individuals in small communities.
Scalability and Real-Time Linkage: Improving the speed of tokenization and matching to handle even larger datasets as more health systems join the collaborative. The goal is to move from monthly data refreshes to near real-time insights, which would be invaluable for tracking new variants as they emerge.

As we move toward more advanced analytics, the role of Privacy Preserving AI will become even more critical in extracting insights from these massive, linked datasets without ever compromising the “lock and key” protection of patient privacy. This ensures that the US research infrastructure remains at the global forefront of both innovation and ethics, providing a model for international data sharing efforts.

NIAID PPRL: Your Top Questions Answered

How does PPRL protect patient identity?

PPRL uses cryptographic hashing to turn sensitive identifiers (like names, Social Security numbers, and exact dates of birth) into “de-identified tokens.” These tokens are unique but cannot be reversed to reveal the original name. Furthermore, the system employs a “Linkage Honest Broker” (a neutral third party like the Regenstrief Institute) to handle the matching. This ensures that no one person or agency ever sees both the patient’s identity and their clinical data simultaneously. This multi-layered architecture is designed to be fully HIPAA compliant and meets the highest standards of the Common Rule for human subjects research. The use of “salting” further ensures that tokens cannot be cracked using brute-force or dictionary attacks.

What data types are currently linked in N3C?

As of today, the N3C enclave successfully links EHR data with CMS Medicare and Medicaid claims, mortality data from various sources, and viral variant information from genomic sequencing labs. This creates a multi-dimensional view of the pandemic that includes clinical observations, insurance records, and genomic data. There are also ongoing efforts to integrate data from the Department of Veterans Affairs (VA) and other federal health systems to ensure the dataset is representative of the entire US population, including veterans and active-duty military personnel. This diversity of data is what allows for the study of rare conditions and specific sub-populations.

How can investigators access linked CMS-N3C data?

Investigators must be part of an institution that has a signed Data Use Agreement (DUA) with the NIH. They must then submit a detailed Data Use Request (DUR) proposal to the N3C Data Access Committee. This proposal must outline the specific research questions, the variables needed, and the planned statistical methods. Once approved, researchers can access the data within a secure, cloud-based enclave (the N3C Data Enclave). N3C also provides extensive training modules, community forums, and educational resources to help researchers use these complex, harmonized datasets effectively. The process is rigorous but designed to facilitate high-impact science while maintaining public trust.

Is there a risk of re-identification in these large datasets?

While no system is 100% risk-free, PPRL is designed to make re-identification mathematically and procedurally improbable. By using salted hashes, Bloom filters, and a Linkage Honest Broker, the system prevents “linkage attacks” where an adversary tries to join the research data with an external identifiable dataset. Additionally, the data available to researchers is further de-identified (e.g., dates are shifted, and rare conditions may be masked) to comply with “Expert Determination” or “Safe Harbor” standards under HIPAA. The secure enclave also monitors all user activity, including the code being run and the data being viewed, to prevent any attempts at re-identification. Any violation of these terms can lead to the loss of data access for the entire institution.
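Date shifting can be illustrated with a short sketch. The text does not describe how the enclave shifts dates, so everything here is an assumption: a per-patient offset derived from a keyed hash of the Match ID, a maximum shift of 180 days, and a placeholder secret.

```python
import hashlib
import hmac
from datetime import date, timedelta

MAX_SHIFT_DAYS = 180                      # assumption: maximum shift either way
SHIFT_SECRET = b"hypothetical-shift-key"  # assumption: placeholder secret


def patient_offset(match_id: str, secret: bytes = SHIFT_SECRET) -> int:
    """Derive a stable offset in days, between -180 and +180, for a patient."""
    digest = hmac.new(secret, match_id.encode("utf-8"), hashlib.sha256).digest()
    span = 2 * MAX_SHIFT_DAYS + 1
    return int.from_bytes(digest[:4], "big") % span - MAX_SHIFT_DAYS


def shift_date(original: date, match_id: str) -> date:
    """Shift a date by the patient's offset; intervals within a patient are kept."""
    return original + timedelta(days=patient_offset(match_id))


positive_test = date(2021, 1, 10)
follow_up = date(2021, 4, 10)
shifted_test = shift_date(positive_test, "M000001")
shifted_follow_up = shift_date(follow_up, "M000001")

print(shifted_test, shifted_follow_up)
print((shifted_follow_up - shifted_test).days)
```

Because every date for one patient moves by the same amount, the 90 days between the positive test and the follow-up visit survive the shift, which is what longitudinal analyses depend on.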

How does PPRL handle data quality issues like misspellings?

Modern NIAID privacy preserving record linkage uses probabilistic matching techniques. Instead of requiring an exact match of a long hash, the system can compare Bloom filters, which represent pieces of the patient’s information. This allows the system to recognize that “Jonathon Smith” and “Jonathan Smith” are likely the same person if other factors (like year of birth and zip code) align, even if the tokens aren’t an identical match. This significantly increases the “recall” or completeness of the linked data. The system also uses “weighting” where more unique identifiers (like a rare last name) carry more weight in the matching process than common ones (like the name “Smith”).
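The weighting idea can be sketched with a simple frequency-based score, in the spirit of classical probabilistic linkage. The formula below (log base 2 of the inverse relative frequency) is our illustrative choice, not the documented N3C method. The surnames appear in clear text only for readability; in a PPRL setting the same counts would be taken over tokens.

```python
import math
from collections import Counter


def agreement_weights(values: list) -> dict:
    """Weight each value by how surprising an agreement on it would be."""
    counts = Counter(values)
    total = len(values)
    return {value: math.log2(total / count) for value, count in counts.items()}


# Invented population of eight surnames.
surnames = ["SMITH"] * 4 + ["JONES"] * 3 + ["ZELENKA"]
weights = agreement_weights(surnames)

for name, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{name}: {weight:.2f}")
```

Agreeing on the rare surname contributes three times the evidence of agreeing on “Smith” in this toy population, which mirrors the intuition described above.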

What is the latency for linked data availability?

Because the linkage process involves multiple parties (the clinical sites, the tokenization contractor, the Honest Broker, and the Data Enclave), there is a natural latency in the data. Typically, data is refreshed on a monthly basis. This includes the time needed for sites to perform their ETL into OMOP, generate tokens, and for the Honest Broker to perform the matching and update the Match IDs in the enclave. While not “real-time,” this monthly cadence is sufficient for most longitudinal research and public health surveillance needs. Efforts are underway to streamline these pipelines to reduce the turnaround time even further.

Secure Your Research Future with Federated AI

The success of NIAID privacy preserving record linkage within the N3C ecosystem shows that we don’t have to choose between patient privacy and high-impact science. By using smart technology like cryptographic hashing and the Honest Broker model, we’ve built one of the largest centralized COVID-19 data resources in US history.

At Lifebit, we are proud to be part of this movement toward a more connected, secure biomedical future. Our federated AI platform is designed for exactly this kind of work — enabling researchers to access and analyze global data without ever moving it from its secure home. Whether it’s through our Trusted Research Environment or our advanced AI-driven safety surveillance, we are committed to powering the next generation of life-saving discoveries.

Ready to see how federated solutions can transform your research? Learn more about our federal health solutions.
