What is Real World Data (RWD)?
What is the Official Real World Data Definition?
When people talk about Real-World Data (RWD), they mean information created as a by-product of routine care – not in a lab, but in everyday clinics, pharmacies, hospitals, and even a patient’s home. This distinguishes it fundamentally from data collected in highly controlled, often artificial, research settings. The U.S. Food and Drug Administration (FDA) has provided a foundational definition, describing RWD as “data relating to patient health status and/or the delivery of health care routinely collected from a variety of sources.” This definition is crucial because it underpins the FDA’s increasing reliance on RWD for regulatory decision-making. This shift was significantly propelled by the 21st Century Cures Act (2016), which required the FDA to develop a framework for evaluating the use of RWD and Real-World Evidence (RWE) in regulatory processes, particularly to support new indications for approved drugs and to satisfy post-approval study requirements. The FDA’s publication of its RWE Program Framework in 2018 marked a pivotal moment, signaling a clear pathway for RWD to influence drug development and approval.
A Closer Look at the Regulatory View
Regulatory bodies worldwide are aligning on a common understanding of RWD, focusing on several key characteristics:
- What is captured? RWD encompasses a vast array of information reflecting patient health or how care is delivered. This includes clinical diagnoses (e.g., ICD codes), details of treatments administered (medications, procedures), observed patient outcomes (e.g., lab results, disease progression), and even healthcare costs. The breadth of this data allows for a holistic view of a patient’s journey and the healthcare system’s performance.
- How is it captured? A defining feature of RWD is its collection as part of routine workflows, not through bespoke, pre-defined research studies. This means data is generated organically during patient visits, hospital stays, pharmacy transactions, or through continuous monitoring devices. This naturalistic collection minimizes observer bias and captures real-world variability that might be absent in controlled trials.
- Where does it come from? RWD originates from numerous sources, reflecting the fragmented yet interconnected nature of healthcare. These include Electronic Health Records (EHRs), administrative claims and billing data, disease and product registries, and increasingly, patient-generated health data (PGHD) from wearables and mobile health applications (each is covered in detail under “Key Sources Explained” below). The ability to link and integrate data from these diverse sources is key to unlocking comprehensive insights.
Other global regulators are actively pursuing similar objectives. The European Medicines Agency (EMA), for instance, has set an ambitious target for 2025 to achieve seamless RWD use across Europe through initiatives like DARWIN EU (Data Analysis and Real World Interrogation Network). This network aims to provide robust, reliable RWE to support regulatory decision-making throughout the lifecycle of medicines. The overarching common goal among regulatory bodies globally is to establish a harmonised definition and framework for RWD and RWE. This harmonisation is critical because it allows evidence generated in one region to be understood, trusted, and applied to inform decisions elsewhere, fostering international collaboration and accelerating patient access to beneficial therapies.
Why Definitions Matter
Establishing a clear and universally accepted definition for Real-World Data is not merely an academic exercise; it has profound practical implications across the healthcare ecosystem:
- Standards & trust – A single, consistent definition ensures that regulators, researchers, pharmaceutical companies, and healthcare providers all speak the same language. This common understanding builds trust in the data’s provenance and interpretation, which is essential for its acceptance in critical decision-making processes, from drug approvals to clinical guidelines.
- Data quality – Clear expectations around what constitutes RWD drive the development and adoption of common curation, validation, and governance practices. When everyone understands the criteria for “real-world” data, it encourages better data capture at the source, more rigorous cleaning processes, and transparent quality metrics, ultimately enhancing the reliability and utility of the data.
- Faster innovation – A recognised and validated pathway for RWD to contribute to regulatory submissions significantly accelerates innovation. It allows RWD to support new drug approvals, expand existing drug labels for new indications or patient populations, and bolster post-market safety surveillance. This efficiency means that beneficial treatments can reach patients more quickly and safely.
- Patient-centricity – Perhaps most importantly, real-world data captures the experiences of diverse patient populations who rarely fit into the narrow, often highly selective, criteria of traditional clinical trials. By reflecting the heterogeneity of real-world patients – including those with comorbidities, varying adherence patterns, and different socioeconomic backgrounds – RWD provides insights into how treatments perform in the “messy” reality of everyday care. This patient-centric approach ensures that medical advancements are relevant and effective for the broadest possible patient base, leading to more equitable and personalized healthcare.
RWD vs. RWE: From Raw Data to Actionable Insights
Think of RWD as raw ingredients and Real-World Evidence (RWE) as the finished meal. RWD becomes RWE only after rigorous cleaning, analysis and clinical interpretation.
The Journey
- Collection – Data flow in from EHRs, claims, registries, apps, sensors.
- Curation – Remove errors, standardise codes, de-identify records.
- Analysis – Apply statistics, causal-inference methods and machine learning.
- Evidence generation – Place findings in clinical context to guide real decisions.
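The four stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a production pipeline: the records, the `SALT` value, the plausibility range for HbA1c, and the helper names `curate` and `analyse` are all hypothetical, and a salted hash is a simple pseudonymisation step rather than full regulatory-grade de-identification.

```python
import hashlib
from statistics import mean

# Collection: hypothetical raw records as they might arrive from different sources.
RAW_RECORDS = [
    {"patient_id": "A-001", "source": "ehr", "dx_code": "e11.9", "hba1c": 8.1},
    {"patient_id": "A-002", "source": "claims", "dx_code": "E11.9", "hba1c": None},
    {"patient_id": "A-003", "source": "registry", "dx_code": " E11.9 ", "hba1c": 7.4},
    {"patient_id": "A-004", "source": "ehr", "dx_code": "E11.9", "hba1c": -1.0},  # entry error
]

SALT = "demo-salt"  # assumption: a real system would manage this secret securely


def curate(record):
    """Curation: standardise the code, blank implausible values, pseudonymise the ID."""
    value = record["hba1c"]
    if value is not None and not (3.0 <= value <= 20.0):  # illustrative range only
        value = None  # treat implausible lab values as missing
    digest = hashlib.sha256((SALT + record["patient_id"]).encode()).hexdigest()
    return {
        "pseudo_id": digest[:12],
        "source": record["source"],
        "dx_code": record["dx_code"].strip().upper(),
        "hba1c": value,
    }


def analyse(records):
    """Analysis: a deliberately simple descriptive summary."""
    values = [r["hba1c"] for r in records if r["hba1c"] is not None]
    return {
        "n_patients": len(records),
        "n_with_hba1c": len(values),
        "mean_hba1c": round(mean(values), 2),
    }


curated = [curate(r) for r in RAW_RECORDS]
summary = analyse(curated)
print(summary)
```

The final stage, evidence generation, is the part code cannot do: a clinician or epidemiologist still has to judge whether a summary like this answers the clinical question and whether two usable lab values out of four records is enough to act on.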
RWE in Action
- Post-market surveillance – FDA Sentinel links data sources to detect uncommon safety signals.
- Label expansion – Palbociclib’s approval for male breast-cancer patients was underpinned by EHR-based evidence.
- Point-of-care decisions – Clinicians use RWE dashboards to tailor therapy for patients with multiple comorbidities.
Where Does Real-World Data Come From? Key Sources Explained
Real-World Data is inherently diverse, originating from the myriad points of interaction within the healthcare system. Understanding these primary sources is crucial for appreciating the richness and complexity of RWD, as well as its potential and limitations:
- Electronic Health Records (EHRs) – These are digital versions of a patient’s paper chart, maintained by healthcare providers. EHRs are a cornerstone of RWD, containing a wealth of clinical information including:
  - Clinical notes: Free-text entries by physicians, nurses, and other clinicians detailing patient encounters, symptoms, diagnoses, and treatment plans. While rich in detail, extracting structured insights from these notes often requires advanced Natural Language Processing (NLP) techniques.
  - Lab results: Quantitative and qualitative data from blood tests, biopsies, and other diagnostic procedures.
  - Imaging reports: Radiologist interpretations of X-rays, MRIs, CT scans, etc. (though the images themselves are often stored separately).
  - Medication orders and administration records: Details on prescriptions, dosages, and adherence.
  - Problem lists: Coded diagnoses (e.g., ICD codes) that track a patient’s health conditions over time.
  - Demographics: Basic patient information like age, gender, and address.
The primary strength of EHRs lies in their longitudinal view of a patient’s health journey, making them ideal for studying disease progression, treatment pathways, and long-term outcomes. However, challenges include data fragmentation across different healthcare systems, variability in data entry practices, and the inherent messiness of free-text data.
- Insurance Claims & Billing Data – These administrative records are generated when healthcare services are rendered and billed to insurance companies. They represent a vast repository of RWD, often spanning millions of lives, and are particularly valuable for population-level analyses. Key data elements include:
  - Diagnosis codes (e.g., ICD-10): The reasons for patient visits.
  - Procedure codes (e.g., CPT): The medical services performed.
  - Drug codes (e.g., NDC): The medications dispensed.
  - Provider and facility identifiers: Information about where and by whom care was delivered.
  - Cost and reimbursement data: Financial aspects of care.
Claims data are excellent for utilization patterns, cost-effectiveness analyses, and comparative-effectiveness research at scale due to their large population coverage and focus on billed services. Their limitations stem from being primarily for billing purposes, meaning they may lack clinical depth, omit services not billed, or contain coding errors.
- Product & Disease Registries – These are purpose-built, organized systems that collect uniform data to evaluate specified outcomes for a population defined by a particular disease, condition, or exposure to a medical product. They are often maintained by professional societies, patient advocacy groups, or pharmaceutical companies. Examples include cancer registries, cystic fibrosis registries, or post-market surveillance registries for specific medical devices.
  - Strengths: Registries are critical for rare diseases where large-scale RCTs are impractical, and for tracking long-term outcomes and safety profiles of specific products over many years. They often collect more granular clinical data than claims and can be designed to capture specific endpoints relevant to the disease or product.
  - Limitations: They can be expensive to maintain, may suffer from selection bias (patients who enroll might differ from those who don’t), and data quality can vary depending on the rigor of data collection protocols.
- Patient-Generated Health Data (PGHD) – This rapidly growing category includes health-related data created, recorded, or gathered by patients or their caregivers. It offers a unique, continuous, and real-life context that traditional clinic visits often miss. Sources include:
  - Wearables: Smartwatches, fitness trackers, and continuous glucose monitors providing data on heart rate, sleep patterns, activity levels, and physiological parameters.
  - Mobile apps: Applications for symptom tracking, medication adherence, mental health monitoring, or chronic disease management.
  - Home monitoring devices: Blood pressure cuffs, scales, or pulse oximeters that transmit data directly to healthcare providers.
  - Patient-Reported Outcomes (PROs): Data directly reported by patients about their health status, symptoms, and quality of life, often collected via surveys or digital platforms.
PGHD adds invaluable continuous, granular, and contextual data reflecting daily life and patient experience. Programmes like NEST (National Evaluation System for health Technology) help integrate these data safely and effectively into broader evidence generation. Challenges include data validity, interoperability with clinical systems, and ensuring patient privacy and data security.
RWD vs. Randomized Controlled Trials (RCTs): A Head-to-Head Comparison
RCTs are unrivalled for proving causality under ideal conditions, while RWD shows how interventions perform in the messy real world. Together they give the full picture.
Attribute | Randomized Controlled Trials (RCTs) | Real-World Data (RWD) Studies |
---|---|---|
Setting | Controlled research sites | Routine care settings |
Population | Highly selected | Broad, heterogeneous |
Intervention | Randomised, often blinded | Observed as prescribed |
Purpose | Efficacy | Effectiveness, safety, utilisation |
Cost | High | Lower |
Timeline | Fixed, usually shorter | Continuous, often longer |
Internal validity | Very high | Variable; needs adjustment |
External validity | Limited | High |
Rare events | Hard to study | Easier with large datasets |
Working Together
- Hybrid trials embed RCT randomisation inside routine-care data capture.
- External control arms use historical RWD when a placebo is unethical or impractical.
- Recruitment & feasibility – RWD helps identify eligible patients and suitable sites quickly.
- Contextualisation – Post-approval RWD confirms a therapy’s value across broader populations.
The Power of RWD and RWE: Benefits and Applications
Why It Matters
- Faster insight – Data already exist, cutting months or years off evidence generation.
- Lower cost – Leveraging routine data is cheaper than launching new trials of similar size.
- Long-term safety – Continuous follow-up reveals rare or delayed effects.
- Disease understanding – Large cohorts uncover natural history and unmet needs.
- Value-based care – Payers use RWE to link reimbursement to real outcomes.
Real-World Applications
- Post-market safety studies – Mandatory for many approvals.
- Label expansion – New indications without starting from scratch.
- Pragmatic trial design – RWD defines criteria, endpoints and site selection.
- Health-economics & outcomes research (HEOR) – Quantifies cost, quality-of-life and budget impact.
- Crisis response – During COVID-19, RWD clarified vaccine effectiveness and comorbidity risks, including in studies of people living with HIV.
Navigating the Problems: Challenges and Limitations of RWD
While the promise of Real-World Data is immense, its effective utilization is not without significant problems. Understanding and proactively addressing these challenges is paramount to generating reliable Real-World Evidence.
Data Quality
One of the most pervasive challenges with RWD is its inherent variability in quality. Records created primarily for billing, administrative, or routine clinical care purposes were not originally designed for research. This often leads to issues such as:
- Missing fields: Incomplete patient records where crucial information (e.g., symptom onset, specific lab values) was not consistently captured.
- Miscoded entries: Errors in diagnosis or procedure codes, or the use of outdated coding systems.
- Inconsistent units or formats: Variations in how data is recorded (e.g., blood pressure in mmHg vs. kPa, or different date formats) across different systems or even within the same system over time.
- Lack of granularity: Data may be too high-level to answer specific research questions.
To mitigate these issues, robust data cleaning, transformation, and standardization processes are essential. This includes rigorous provenance checks to understand the origin and context of the data, and the development of transparent quality metrics to assess data completeness, accuracy, and consistency. Investing in data governance frameworks and automated quality checks is crucial for building trust in RWD.
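As a rough illustration of what such checks can look like, the sketch below harmonises blood-pressure units and date formats and then reports a simple completeness metric. Everything in it is an assumption made for the example: the three records, the two accepted date formats (with slashes read as day/month/year), and the helper names `to_mmhg`, `parse_date` and `completeness`.

```python
from datetime import datetime

KPA_TO_MMHG = 7.50062  # 1 kPa is approximately 7.50062 mmHg
# Assumption: only these two date formats occur, and slashes mean day/month/year.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def to_mmhg(value, unit):
    """Convert a pressure reading to mmHg, rounded to one decimal place."""
    if value is None:
        return None
    if unit == "mmHg":
        return round(float(value), 1)
    if unit == "kPa":
        return round(value * KPA_TO_MMHG, 1)
    raise ValueError(f"unknown pressure unit: {unit!r}")


def parse_date(text):
    """Return a date for any recognised format, or None if the text is unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def completeness(records, field):
    """Share of records in which the field is populated."""
    present = sum(1 for r in records if r.get(field) is not None)
    return present / len(records)


# Hypothetical readings from three systems with different conventions.
RAW = [
    {"systolic": 120, "unit": "mmHg", "date": "2023-03-01"},
    {"systolic": 16.0, "unit": "kPa", "date": "01/03/2023"},
    {"systolic": None, "unit": None, "date": "March 2023"},
]

clean = [
    {"systolic_mmhg": to_mmhg(r["systolic"], r["unit"]), "date": parse_date(r["date"])}
    for r in RAW
]
report = {f: round(completeness(clean, f), 2) for f in ("systolic_mmhg", "date")}
print(report)
```

Publishing a report like this alongside the dataset is one way to make quality visible: a reader can see that a third of the records lack a usable reading before deciding whether the data fit their question.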
Interoperability
Healthcare data exists in a highly fragmented ecosystem. Different institutions, clinics, and even departments within the same hospital often use disparate systems that “speak” different “dialects.” For example, one system might record diagnoses in ICD codes while another uses SNOMED CT, and proprietary lab codes can vary widely from site to site. This lack of interoperability, meaning seamless communication and data exchange between systems, makes it incredibly challenging to merge and analyze datasets across multiple sources. To overcome this, the industry is increasingly adopting common data models and standards like FHIR (Fast Healthcare Interoperability Resources) and CDISC (Clinical Data Interchange Standards Consortium). Furthermore, active data-harmonisation pipelines and semantic mapping tools are required to translate and integrate disparate datasets into a unified, analyzable format, enabling a more comprehensive view of patient populations.
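At its simplest, semantic mapping is a lookup from each site's local code to a shared concept, with anything unmapped set aside for review rather than silently dropped. The sketch below assumes a hand-built table; the site names, local codes and concept labels are invented for illustration and do not come from any real terminology or from FHIR or CDISC themselves.

```python
# Hypothetical mapping from (site, local code) to a shared concept label.
LOCAL_TO_SHARED = {
    ("site_a", "GLU-01"): "glucose_serum",
    ("site_b", "LAB0042"): "glucose_serum",
    ("site_a", "HBA-1C"): "hba1c",
}


def harmonise(records):
    """Attach a shared concept to each record; collect unmapped records separately."""
    mapped, unmapped = [], []
    for rec in records:
        concept = LOCAL_TO_SHARED.get((rec["site"], rec["local_code"]))
        if concept is None:
            unmapped.append(rec)  # keep for manual review instead of discarding
        else:
            mapped.append({**rec, "concept": concept})
    return mapped, unmapped


# Hypothetical lab results from two sites using different local codes.
RESULTS = [
    {"site": "site_a", "local_code": "GLU-01", "value": 5.4},
    {"site": "site_b", "local_code": "LAB0042", "value": 6.1},
    {"site": "site_b", "local_code": "XYZ-9", "value": 1.0},
]

mapped, unmapped = harmonise(RESULTS)
print(len(mapped), "mapped;", len(unmapped), "awaiting review")
```

Once both glucose results carry the same concept label they can be pooled in one analysis, while the unmapped count doubles as a running measure of how much harmonisation work remains.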
Analysis & Privacy
Because RWD is observational – meaning researchers do not control the intervention or patient assignment – it is highly susceptible to bias and confounding. Factors like patient selection bias (e.g., sicker patients receiving a new drug), confounding by indication (the reason a drug was prescribed is also related to the outcome), and missing data can distort findings. Modern causal-inference techniques are indispensable for drawing valid conclusions from RWD. These include methods like propensity scores (to balance covariates between treatment groups), target-trial emulation (to design observational studies that mimic randomized trials), and instrumental variables.
At the same time, the sensitive nature of health data necessitates stringent privacy and security measures. Regulations such as HIPAA in the U.S. and GDPR in Europe demand robust de-identification protocols and secure analysis environments. This has led to the rise of Trusted Research Environments (TREs) or Trusted Data Lakehouses, which provide secure, controlled spaces where researchers can analyze de-identified data without direct access to raw patient identifiers, ensuring that valuable insights are generated without compromising patient privacy.
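To make the confounding problem and the propensity-score idea concrete, here is a toy example using inverse probability weighting, one of several ways propensity scores are applied. The cohort is entirely made up: sicker patients are more likely to get the new drug and less likely to recover, so a naive comparison makes the drug look harmful. The propensity score is estimated here as the treated share within each severity stratum; real studies typically fit a model such as logistic regression over many covariates, and weighting only corrects for confounders that were actually measured.

```python
from collections import defaultdict


def make_cohort():
    """Build a hypothetical cohort of (severe, treated, recovered) tuples."""
    counts = [
        # severe, treated, n_patients, n_recovered
        (1, 1, 80, 40),
        (1, 0, 20, 6),
        (0, 1, 20, 18),
        (0, 0, 80, 56),
    ]
    cohort = []
    for severe, treated, n, recovered in counts:
        cohort += [(severe, treated, 1)] * recovered
        cohort += [(severe, treated, 0)] * (n - recovered)
    return cohort


def naive_difference(cohort):
    """Recovery rate in treated minus untreated, ignoring severity."""
    treated = [y for _, t, y in cohort if t == 1]
    control = [y for _, t, y in cohort if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def propensity_by_stratum(cohort):
    """Estimate P(treated | severe) as the treated share within each stratum."""
    n, n_treated = defaultdict(int), defaultdict(int)
    for severe, treated, _ in cohort:
        n[severe] += 1
        n_treated[severe] += treated
    return {s: n_treated[s] / n[s] for s in n}


def ipw_difference(cohort):
    """Inverse-probability-weighted difference in recovery rates."""
    ps = propensity_by_stratum(cohort)
    sums = {1: [0.0, 0.0], 0: [0.0, 0.0]}  # arm -> [weighted outcomes, total weight]
    for severe, treated, y in cohort:
        e = ps[severe]
        w = 1 / e if treated else 1 / (1 - e)
        sums[treated][0] += w * y
        sums[treated][1] += w
    return sums[1][0] / sums[1][1] - sums[0][0] / sums[0][1]


cohort = make_cohort()
naive = naive_difference(cohort)
weighted = ipw_difference(cohort)
print(f"naive: {naive:+.2f}, weighted: {weighted:+.2f}")
```

In this constructed cohort the naive difference is about -0.04, while the weighted difference is about +0.20, which matches the benefit seen within each severity stratum. The reversal is the point of the example: the same data support opposite conclusions depending on whether the analysis accounts for who received the drug.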
Frequently Asked Questions about the Real World Data Definition
It’s natural to have questions when diving into something as powerful and transformative as Real-World Data. Let’s tackle some of the most common ones that pop up when discussing the real world data definition and its impact. Think of this as a quick chat with a friendly expert!
What is a simple example of real-world data?
Imagine you visit your doctor because you’re feeling a bit under the weather. When the doctor diagnoses you with, say, “Seasonal Allergies” and records it in your medical chart, that diagnosis code becomes a piece of real world data. It’s collected as part of your routine care, not for a specific research study, and it gives us a glimpse into your health status.
Similarly, when your pharmacy fills a prescription for you, that prescription record is another excellent example. It shows what medication you’re taking, the dosage, and when you got it. Even a blood pressure reading automatically sent from your home monitoring device directly to your doctor’s system counts! These are all bits of information gathered in the natural flow of healthcare, and they help paint a picture of health in the real world.
Why is real-world data becoming so important?
Real-World Data is truly having its moment, and for good reason! Its growing importance stems from several key shifts in healthcare and technology.
First, think about the digitalization of healthcare. More and more, patient records, lab results, and even appointment schedules are all digital. This means a massive amount of data is being generated every second, creating an incredibly rich resource that we can now tap into.
Second, we’ve seen incredible leaps in advanced analytics, including artificial intelligence (AI) and machine learning (ML). These powerful tools allow us to sift through, organize, and make sense of the huge, complex datasets that RWD represents. What was once just a jumble of information can now be transformed into meaningful insights, helping us understand health trends and treatment impacts like never before.
Third, there’s a strong push for faster, more efficient drug development. Regulatory bodies and pharmaceutical companies are keenly aware that RWD can accelerate the journey of new medicines from lab to patient. It can support new uses for existing drugs and provide crucial safety monitoring long after a drug hits the market, often more efficiently than traditional methods alone.
Finally, RWD helps us focus on what truly matters: patient-centric outcomes. Because RWD captures the experiences of diverse patient populations in their everyday lives, it offers a far more complete picture of how treatments actually perform. This allows us to see the real impact on patient outcomes and quality of life, leading to more personalized and effective care.
Can real-world evidence completely replace randomized controlled trials?
This is a fantastic question, and the short answer is: not entirely, and we don’t believe it should! Think of Real-World Evidence (RWE) and Randomized Controlled Trials (RCTs) not as competitors, but as powerful allies. They are truly complementary approaches, each bringing unique strengths to the table.
RCTs remain the gold standard for establishing the efficacy of a medical product. This means figuring out if a treatment works under ideal, highly controlled conditions. Their strict design, which often includes randomization and blinding, helps minimize bias, allowing researchers to confidently say, “Yes, this intervention caused that effect.” Their high internal validity makes them excellent for proving cause and effect.
However, RWE is ideal for questions RCTs can’t answer easily or even at all. For example, RCTs typically involve a carefully selected group of patients, often excluding those with other health conditions or who are very elderly. RWE, on the other hand, gives us insights into how treatments perform in the diverse, “messy” real world, including patients with multiple health issues, varying adherence levels, and different lifestyles. This provides crucial information about a treatment’s effectiveness in routine clinical practice.
RWE is also perfect for understanding long-term safety profiles, exploring rare side effects that might only appear after years of use or in very large populations. It’s also invaluable for studying rare diseases, where it’s simply not feasible to recruit enough patients for a large RCT.
So, by working together, RCTs provide the strong initial proof of concept, while RWE expands our understanding, showing how those treatments perform in the wild. This collaboration creates a more comprehensive and robust body of evidence, ultimately accelerating medical progress and ensuring patients receive the safest, most effective care.
Conclusion: The Future is Data-Driven
Real-World Data and the evidence it unlocks are reshaping how we find, evaluate and pay for medical innovation. By complementing the rigour of RCTs with the breadth of routine-care information, healthcare can move faster and serve a wider, more diverse population.
Looking ahead we expect:
- Smarter AI/ML to mine unstructured text, images and multi-omics.
- Federated learning so models train where data live, preserving privacy.
- Scalable data lakehouses that blend structured and raw data for on-demand analytics.
Lifebit’s federated AI platform already brings these pieces together. Our Trusted Research Environment (TRE), Trusted Data Lakehouse (TDL) and R.E.A.L. analytics layer enable secure, real-time collaboration for biopharma, governments and public-health agencies across five continents.