In research and healthcare, the size of the datasets needed to solve crucial problems continues to increase. Developments including the digitisation of healthcare tools, the accumulation of electronic health records and massively reduced costs for high-throughput technologies such as genome sequencing all contribute to these large datasets.
However, securely storing and analysing these large, sensitive datasets is becoming significantly harder. There are three key reasons for this:
Data federation solves the problem of data access without compromising data security. In its simplest terms, data federation is a software process that enables multiple databases to work together as one. This technology is highly relevant for accessing sensitive biomedical health data: the data remains within the appropriate jurisdictional boundaries, while metadata is centralised and searchable, and researchers can be virtually linked to where the data resides for analysis.
This is an alternative to a model in which data is moved or duplicated and then centrally housed. When data is moved it becomes vulnerable to interception, and moving large datasets is often very costly for researchers.
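To make the contrast concrete, the following is a minimal, hypothetical sketch of the federated pattern: each site computes a local aggregate over its own records, and a coordinator combines only those aggregates, so raw rows never leave their host site. All names (sites, fields, functions) are illustrative, not part of any specific product.

```python
# Hypothetical sketch: a federated query in which raw records never
# leave their host site. Each site returns only summary statistics,
# and the coordinator combines them into a global result.

def local_aggregate(records):
    """Runs at each site: returns only aggregates, never raw rows."""
    values = [r["ldl_cholesterol"] for r in records]
    return {"n": len(values), "total": sum(values)}

def federated_mean(sites):
    """Runs at the coordinator: combines per-site aggregates into a global mean."""
    aggregates = [local_aggregate(records) for records in sites.values()]
    n = sum(a["n"] for a in aggregates)
    total = sum(a["total"] for a in aggregates)
    return total / n

# Two jurisdictions, each holding its own patient records locally.
sites = {
    "site_uk": [{"ldl_cholesterol": 3.1}, {"ldl_cholesterol": 2.8}],
    "site_dk": [{"ldl_cholesterol": 3.5}],
}
print(federated_mean(sites))  # global mean computed from aggregates only
```

In a real deployment the per-site step would run behind each organisation's own infrastructure, with only the small aggregate payloads crossing jurisdictional boundaries.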
The video below highlights Thorben Seeger, Lifebit’s Chief Business Development Officer, discussing how researchers are limited in their ability to access and analyse sensitive data and how organisations are solving this problem using data federation.
It is clear that data federation is the future for enabling secure genomic and health data access at a global level, as it brings multiple advantages compared to traditional methods of data access.
This article highlights the crucial requirements to enable data federation, which include:
There are four prerequisites to performing health data federation for research, either as a researcher or organisation, which are:
Because immense datasets must be processed, computational resources are an important consideration. A robust database infrastructure is also required for efficient data processing and integrated data analysis, and handling such large volumes of data demands a highly scalable platform.
The scale of distributed multi-omics and clinical datasets available today has brought an increasing shift towards commercial cloud infrastructure.
Being cloud-based provides ultimate flexibility and the ‘elastic’ nature of cloud computing means researchers only pay for what they need.
Achieving a federated connection to where the data resides requires a platform that can communicate with distributed data sources and other platforms. Typically, this will require:
Once the relevant infrastructure and data access requirements are in place, researchers will still be limited in the novel insights they can gain if the data cannot be effectively combined to enhance its statistical power. Common Data Models (CDMs) are crucial to ensuring data is interoperable. Several have grown in popularity in the health sciences sector recently, including the Observational Medical Outcomes Partnership (OMOP) CDM for clinical-genomic data.
Harmonising health data to OMOP provides structure according to common international standards which ensures it is fully interoperable with other clinical datasets from other labs or clinics. This fully enables the integration and analysis of datasets across distributed sources and platforms.
Additionally, extraction, transformation and loading (ETL) pipelines that automate this work, processing and converting raw data into analysis-ready data, further simplify the process for researchers.
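As a small illustration of such an ETL step, the sketch below harmonises one site-specific patient record into an OMOP-style person row. The gender concept IDs shown (8507 for male, 8532 for female) are standard OMOP concepts, but the source field names and the mapping function itself are hypothetical.

```python
# Hypothetical ETL sketch: convert a raw, site-specific record into an
# OMOP-style person row. 8507/8532 are the standard OMOP gender
# concept IDs; the input field names are illustrative.

GENDER_CONCEPTS = {"m": 8507, "male": 8507, "f": 8532, "female": 8532}

def to_omop_person(raw):
    """Transform one raw clinical record into an OMOP person row."""
    return {
        "person_id": raw["patient_ref"],
        "gender_concept_id": GENDER_CONCEPTS[raw["sex"].strip().lower()],
        "year_of_birth": int(raw["dob"][:4]),  # assumes ISO dates, e.g. "1984-03-02"
    }

raw_record = {"patient_ref": 1001, "sex": "F", "dob": "1984-03-02"}
print(to_omop_person(raw_record))
# {'person_id': 1001, 'gender_concept_id': 8532, 'year_of_birth': 1984}
```

A production pipeline would add validation, vocabulary lookups and loading into the target database, but the core transform step takes this shape.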
Combining these datasets securely via federation then allows researchers to increase the statistical power of their research. For example, one genome-wide association study revealed that increasing sample size by 10-fold led to an approximately 100-fold increase in findings, enabling disease-causing genetic variants of interest to be more easily validated and studied. Secure access to fully standardised and interoperable large datasets via federation can help accelerate research by providing greater power for clinical studies.
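The relationship between sample size and power can be illustrated with a back-of-the-envelope calculation (not taken from the cited study): the approximate power of a two-sided two-sample z-test at the 5% significance level for a small effect size, as the per-group sample size grows.

```python
import math

# Illustrative only: approximate power of a two-sided two-sample z-test
# at the 5% significance level, for effect size d and per-group sample
# size n. Power ~ Phi(d * sqrt(n/2) - z_0.975), where Phi is the
# standard normal CDF.

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power(d, n):
    z_crit = 1.959964  # z_0.975 for a two-sided test at alpha = 0.05
    return normal_cdf(d * math.sqrt(n / 2.0) - z_crit)

# Power for a small effect (d = 0.2) rises steeply with sample size,
# which is why pooling cohorts across sites pays off.
for n in (100, 1000, 10000):
    print(n, round(power(0.2, n), 3))
```

For a small effect size, a study with 100 participants per group is badly underpowered, while pooled cohorts an order of magnitude or two larger detect the same effect almost surely.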
Featured resource:
Read Lifebit's whitepaper on data standardisation best practices
Summary
It is clear that data federation can bring many wide ranging benefits to researchers. It can provide secure access to global cohorts of data to help power analysis and ultimately answer important research questions. To enable federated data analysis, researchers and organisations need standardised, interoperable data, appropriate infrastructure including APIs, authentication and analytics technology and robust security measures.
By enabling data federation, organisations can provide researchers with secure data access and analysis, ensuring they spend time and effort on what matters most: gaining new insights into health and disease.
Author: Hannah Gaimster, PhD
Contributors: Hadley E. Sheppard, PhD and Amanda White
About Lifebit
At Lifebit, we develop secure federated data analysis solutions for clients including Genomics England, NIHR Cambridge Biomedical Research Centre, Danish National Genome Centre and Boehringer Ingelheim to help researchers turn data into discoveries.
Interested in learning more about Lifebit’s federated data solution?