Genomics sequencing projects are taking place in countries around the world to enable population-level genomic medicine. Genomics is the study of the complete set of DNA in a person or other organism. With DNA underpinning a large proportion of an individual’s health and disease status, genomics is beginning to gain traction in clinical settings, as part of a personalised medicine approach.
Combining information on a patient’s clinical outcomes—which examine observable changes in health and wellbeing—with their genomic data can help us better understand how their genome affects their disease risk. Increasingly, breakthroughs in diagnostics, drug development, and targeted therapies are being made possible by advances in our understanding of the genome.
However, researchers can struggle to access and analyse the relevant genomic data to power their research. There are several reasons for this:
Data security and patient privacy are at risk when data is moved. Strict national regulatory frameworks (such as the General Data Protection Regulation, GDPR) that differ from country to country make it nearly impossible to collaborate across borders using a traditional model of data sharing.
Even if researchers can access disparate datasets, these may not be in the correct format to enable easy collaboration, or may not be provided on an easy-to-use, low-code platform.
Consequently, large-scale genomic data migration has become unfeasible. Data federation is a breakthrough approach that is increasingly used to provide researchers with secure access to data without moving it.
The video below demonstrates how federated data analysis functions. Historically, researchers have accessed genomic data by downloading it from disparate sources and analysing it together in a centralised location (steps 1 and 2). Federated analysis (step 3) allows distributed genomic data from multiple sources to be analysed in parallel, saving the researcher time and money while keeping the sensitive data secure.
There are four important requirements for an organisation or researcher to be able to perform data federation. These are:
appropriate computing infrastructure,
authentication and analytics technology,
standardised and interoperable data,
and robust security measures.
A federated approach to genomic data analysis allows researchers and clinicians to combine global cohorts of genomics data, to maximise new scientific discoveries that can be made when this data is securely combined. This article discusses key examples where data federation is enabling genomics research worldwide.
Federated data analysis and the COVID-19 pandemic
In the UK, genomic medicine efforts have been spearheaded by Genomics England and its 100,000 Genomes Project, one of the largest cohorts of rare disease and cancer patients globally.
During the pandemic, Genomics England worked with the NHS to deliver whole genome sequencing of up to 20,000 COVID-19 intensive care patients and up to 15,000 people with mild symptoms. This allowed researchers to query, analyse and collaborate over these very large sets of genomic and medical data in real time. The enhanced functionality and automated tools helped researchers understand the underlying genetic factors that may explain what makes some patients more susceptible to the virus, or more severely ill when infected.
Multi-party data federation can increase global collaboration
A consortium was formed between Lifebit and its partners as part of the Data and Analytics Research Environments UK (DARE UK) programme, which is funded by UK Research & Innovation and delivered in partnership with Health Data Research UK (HDR UK) and ADR UK (Administrative Data Research UK).
The consortium set out to build a federated ‘virtual’ link between the TREs of NIHR Cambridge BRC and Genomics England. With multi-party federation, distributed genomic data sources could be accessed and utilised simultaneously without having to physically move the data.
Professor Serena Nik-Zainal, NIHR Research Professor and Honorary Consultant in Clinical Genetics, University of Cambridge said:
“This technology has the potential to remove the geographical, logistical, and financial barriers associated with moving exceptionally large datasets. For genomics research, the potential to undertake research across multiple datasets means access to much greater and more diverse data. Applied at scale, this means huge potential for new discoveries, particularly for research into rare diseases and for reducing health inequalities.”
By enabling rapid access to data and secure data sharing, all at reduced cost, these efforts are changing the nature of research collaboration globally for the better. Federation also reduces the time burden on researchers conducting analyses over integrated cohorts, maximising the time they can spend generating new insights and accelerating novel discoveries.
Federated data analysis in newborn genomic screening
One example of where federated data analysis is being introduced to link global cohorts of data for secure analysis is in Greece.
Here, PlumCare RWE and Lifebit have begun a partnership to support Greece’s pioneering national newborn genomic sequencing program, First Steps.
Using Lifebit’s federated technology, researchers will be able to access and analyse data securely in combination with global cohorts, whilst the data itself is kept safe, private and in place in its secure environment.
In conclusion, data federation can aid researchers, organisations and governments in securely accessing and analysing genomic data in a variety of ways. Federated data analysis is especially valuable in genomics research as it circumvents issues surrounding data privacy and security and the movement of large data sets. Data federation can give researchers secure access to genomic data sets globally, enabling them to run analyses, find answers to pressing research issues, and accelerate scientific discoveries.
Author: Hannah Gaimster, PhD
Contributors: Hadley E. Sheppard, PhD and Amanda White
In the last twenty years, there has been an explosion in the production of patient-derived biomedical data. This includes datasets derived from clinical-genomic, Electronic Health Records (EHRs), and real-world data (RWD) sources, which, when utilised together, can hold the answers to the underlying causes of disease.
Unfortunately, the transformative potential of this health data has yet to be realised. To preserve patient privacy, much of the world’s health data is stored within institutional siloed environments that are unavailable to researchers or are difficult to access, link and analyse. Even when researchers can access this data, they are not always equipped with the resources and tools to derive meaningful insights from that data.
To support research and innovation through the power of data, solutions are needed to enable data access, linkage, and analysis while maintaining security.
Data federation has emerged as a solution to increase the usability of sensitive biomedical health data for diagnosing and treating disease. Through data federation, researchers can be virtually linked to datasets of interest that are safely housed in highly secure computing environments known as Trusted Research Environments (TREs). The data is never physically moved or copied, and the data controller or custodian maintains control.
This article describes how data federation technologies are used to develop end-to-end federated data platforms, which ultimately help democratise access to and the usability of data. By developing ways to fairly distribute secure access to data, tools, and knowledge, the scientific and health communities will accelerate global collaborations and therapeutic findings that ultimately benefit patients.
What is an end-to-end federated data platform?
In its simplest terms, data federation is a software process that allows multiple databases to function as one. This technology is highly relevant for storing sensitive biomedical health data: the data remains within appropriate jurisdictional boundaries, while metadata (information that describes the data) is centralised and searchable (an alternative to a model in which data is moved or duplicated and then centrally housed).
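As a minimal illustration of this pattern (a sketch, not any vendor’s actual implementation), the snippet below uses two independent SQLite databases as stand-in “sites”: the same aggregate query is sent to each, and only the counts are combined, so row-level data never leaves its source.

```python
import sqlite3

def make_site(rows):
    """Create an independent in-memory database representing one 'site'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (id INTEGER, diagnosis TEXT)")
    conn.executemany("INSERT INTO patients VALUES (?, ?)", rows)
    return conn

# Two sites holding separate cohorts; row-level data stays local to each.
site_a = make_site([(1, "T2D"), (2, "asthma"), (3, "T2D")])
site_b = make_site([(1, "T2D"), (2, "rare_x")])

def federated_count(sites, diagnosis):
    """Send the same aggregate query to every site; combine only the counts."""
    query = "SELECT COUNT(*) FROM patients WHERE diagnosis = ?"
    return sum(conn.execute(query, (diagnosis,)).fetchone()[0] for conn in sites)

print(federated_count([site_a, site_b], "T2D"))  # 3 patients across both sites
```

In a real federation the “sites” would be remote databases reached through secure APIs rather than local connections, but the principle is the same: the query travels to the data, and only aggregates travel back.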
Federated architectures of individual organisations may be connected together into a federated data platform, enabling data access and computation for users across organisations. A prominent example of efforts towards federated data platforms is the UK National Health Service’s work to securely connect UK health data for approved research use.
Federated data platforms are indeed democratising data access and providing a means for approved users to securely query data irrespective of their physical proximity to where that data resides, but access alone is only one step in enabling the wider research community. Incorporating federated analysis into the platform equips researchers and clinicians to derive meaningful insights from that data.
In an example of an end-to-end federated platform for genomic medicine, genomic or phenotypic clinical data is first collected and transformed into interoperable formats. Next, the data is ingested into the federated architecture, which allows authorised users to access and combine it with other disparate sources to perform federated queries and build unique and valuable cohorts. Researchers can then use analytical tools and pipelines built into the platform, and strict security measures govern the export of results, enabling therapeutic progress, discovery and informed clinical decisions.
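The export-governance step can be illustrated with small-cell suppression, a common disclosure-control technique: aggregate results leave the platform only if every cell is large enough not to risk re-identification. The function name and threshold below are hypothetical.

```python
# Hypothetical sketch of a results-export ("airlock") check. Aggregate
# counts may be exported only if no cell is small enough to single out
# individual patients. The threshold of 5 is purely illustrative.
MIN_CELL_COUNT = 5

def airlock_approve(result_counts):
    """Approve export only when every aggregate count clears the threshold."""
    blocked = {k: v for k, v in result_counts.items() if v < MIN_CELL_COUNT}
    if blocked:
        raise ValueError(f"export blocked, small cells: {sorted(blocked)}")
    return result_counts

safe = airlock_approve({"T2D": 120, "asthma": 48})  # passes the check
# airlock_approve({"rare_x": 2}) would raise: a count of 2 could identify patients
```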
Beyond FAIR: How can low/no-code tools contribute to data democratisation?
Following FAIR principles – ensuring datasets are Findable, Accessible, Interoperable and Reusable – is an essential step towards promoting data quality and the democratisation of data, but it does not address all of the challenges associated with deriving insight. In particular, researchers and clinicians without a data science background may be at a disadvantage when using analytical tools that require coding.
Interestingly, the software industry is currently shifting towards “no/low-code” tools to support a wider range of end users with and without a data science background, thus enabling full democratisation of access to genomic data and the insights derived. Examples of low/no-code resources include the following:
The Galaxy Community: An initiative within ELIXIR, a federated data infrastructure that brings together life science data sources across Europe, this research forum offers a web-based platform to facilitate computational research for a variety of “omics” types and is specifically targeted at users without programming experience.
The Dependency Map (DepMap) Portal: With a mission of mapping the landscape of cancer vulnerabilities across all cancers, DepMap offers an easy-to-use graphical user interface to explore cancer vulnerabilities from available chemical and genetic perturbation data using analytical and visualisation tools.
The National Institute of Health’s (NIH) Common Fund Data Ecosystem (CFDE) Search Portal: The CFDE is a comprehensive resource for datasets generated through NIH funding, with the ultimate goal to make data more usable and useful for researchers and clinicians. There are interactive search functions and visualisations of gene-specific, compound-specific, and disease-specific data and more to empower researchers with and without a data science background.
If tools such as those described above could consistently be implemented within federated data platforms, researchers of diverse backgrounds could spend more time on what matters most – accessing global cohorts securely to progress therapeutic discovery.
Important considerations when democratising data access
A core benefit of federated data platforms is that they can democratise access to health data in a secure manner. While this brings huge potential for advancing medical research, there must be strict regulations over how data is governed and accessed that are applied at the organisational and researcher-level, in order to engender public and participant trust.
In line with the surge in data regulations arising across global jurisdictions, there is an increasing prevalence of accreditation schemes to audit and certify the “owner” of data management platforms. To guarantee ethical and secure usage of federated platforms, the safety and governance of these infrastructures must be regularly reviewed and measured against all aspects relevant to data security and governance, from implementing industry-recognised data protection frameworks, standards and information security measures to compliance with local data regulations and commitments to interoperability.
Access to the data within these federated platforms must be appropriately reviewed and governed by the data controllers. Implementing such governance, alongside regulatory bodies that oversee the use of data, can help foster public trust in federated research and ensure data use is in the interest of both the public and participants.
Summary
Federated data platforms are emerging as essential entities globally that can scale with increasing volumes of data and ensure its protection, all the while enabling secure access for approved research. This ultimately democratises data access and creates widespread benefit sharing. End-to-end platforms take this one step further by providing researchers with the analytical tools they need to derive insight from biomedical data, regardless of background.
Moving forward, it will be interesting to understand the wider implementation of accreditation frameworks and bodies that regulate how end-to-end federated data platforms are governed in order to ensure the best practices with this data and that the interests of the public and patients are protected.
Lifebit works proactively with clients, including Genomics England, the Danish National Genome Centre, Boehringer Ingelheim, NIHR Cambridge Biomedical Research Centre, and others to comply with sensitive data requirements and establish their end-to-end federated solutions. We ensure that organisations can meet and exceed industry standards amidst the changing regulatory and regional landscape – enabling valuable research at scale to improve patients’ lives.
Authors: Hadley E. Sheppard, PhD
Contributors: Hannah Gaimster, PhD and Amanda White
The term federation comes from the Latin foedus (genitive foederis), which translates as a treaty, agreement, or contract. When the term federated is used, it usually refers to linking autonomously operating objects.1
The word federated may be more familiar when it’s used in the context of governments. For example, states can be federated to form a single country, or multiple companies can function together as a federation. The advantages gained when states or objects join together are clear: by combining forces they can be more powerful and have more impact than by working alone, in isolation.
Defining federated data analysis
Data federation is solving the problem of data access, without compromising data security. In its simplest terms: Data federation is a software process that enables numerous databases to work together as one. Using this technology is highly relevant for accessing sensitive biomedical health data, as the data remains within appropriate jurisdictional boundaries, while metadata is centralised and searchable and researchers can be virtually linked to where it resides for analysis.
This is an alternative to a model in which data is moved or duplicated then centrally housed – when data is moved it becomes vulnerable to interception and movement of large datasets is often very costly for researchers.
Federated architectures of individual organisations may be connected together into a federated data platform, enabling data access for users across organisations.
Federated data analysis takes access a step further and brings approved researchers’ analysis and computation to where the data resides. Federated data analysis allows researchers to analyse data across multiple distinct organisations in a secure manner.
Is federated data analysis the same as federated learning?
These terms refer to quite distinct ideas. Federated learning is when researchers train machine learning (ML) algorithms/models across dispersed data/platforms whilst the data remains in place.2
Federated data analysis, however, is when researchers perform joint querying and analyses on data across distributed locations or platforms. So, whilst the two share the concept of enabling research analyses to be done without centralising data collection, they have different purposes and outcomes.
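A toy sketch of federated analysis makes the distinction concrete: each site computes summary statistics locally, only those aggregates travel to the coordinator, and the coordinator combines them into a pooled result. All function names and values here are illustrative.

```python
def local_summary(values):
    """Runs inside each site's trusted environment; only aggregates leave."""
    return {"n": len(values), "total": sum(values)}

def pooled_mean(summaries):
    """The central coordinator combines per-site aggregates, never raw values."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / n

# Raw measurements stay at each site; the coordinator sees only {n, total}.
site_summaries = [local_summary([4.0, 6.0]), local_summary([5.0, 5.0, 10.0])]
print(pooled_mean(site_summaries))  # 6.0
```

Federated learning follows the same stay-in-place principle but exchanges model parameters rather than summary statistics, which is why the two approaches serve different purposes.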
The diagram below shows how federated data analysis works. More traditional methods of data access typically involve researchers downloading the data from the separate locations where it is stored to an institutional computing cluster (steps 1 and 2). In federated data analysis, however, the analysis is securely brought to where the distributed data lies (step 3).
What are the different levels of federation?
There is no single approach to federation and various configurations will entail different legal, compliance, and technical requirements (Table 1). Full federation refers to when both data and compute access are federated over distributed compute and databases to allow querying and joint analyses on the data. However, there also exists a potential for partial federation (I and II) when either compute access or data access are federated and compute or databases are distributed.
1) No Federation
DEFINITION: Manual data access and analysis across different organisations; results are aggregated and sent back to a central platform
REQUIREMENTS: Manual intervention; containerised/portable, versioned and Findable, Accessible, Interoperable, and Reusable (FAIR) tools/algorithms that users can run in different environments
EXAMPLES WHERE THIS LEVEL IS USEFUL: Optimal when federated access to organisations is not enabled, eg a researcher downloads publicly available genomic data from various sources and combines/analyses it in-house
2) Partial federation I
DEFINITION: Federated compute access, distributed compute and joint analyses
REQUIREMENTS: Requires a central, unified and federated platform and federated access for compute (i.e. via Application Programming Interface (API))
EXAMPLES WHERE THIS LEVEL IS USEFUL: Optimal when security and governance clearance is provided and federated linkage is permitted, eg the Trustworthy Federated Data Analytics (TFDA) consortium that enables federated learning on disparate clinical imaging datasets
3) Partial federation II
DEFINITION: Federated data access, distributed databases and joint analyses
REQUIREMENTS: Requires a central, unified and federated platform and federated access for data queries and retrieval (i.e. via API, database queries)
EXAMPLES WHERE THIS LEVEL IS USEFUL: Optimal when security and governance clearance is provided and federated linkage is permitted, eg the ELIXIR federated data platform connecting Europe’s data sources
4) Full Federation
DEFINITION: Federated data and compute access – joint querying over distributed data and joint analysis
REQUIREMENTS: Requires a central, unified and federated platform across each network in the federation
Today, huge datasets are the norm in research and healthcare. There are many reasons for this, including the digitisation of many healthcare tools and massively reduced costs for high throughput technologies like sequencing.3
This has led to a large increase in the amount and variety of many types of healthcare and biomedical data, including:
CLINICAL TRIAL DATA: Clinical trials are performed to assess the effects of one treatment compared to another, and to see if a new drug offers any improvements. The number of clinical trials being conducted worldwide, and therefore the amount of data generated during them, is constantly increasing; there was a fivefold increase in the number of trials between 2004 and 2013, for example.4
REAL WORLD DATA (RWD):
Per the definition by the US Food and Drug Administration, real-world data (RWD) in the medical and healthcare field “are the data relating to patient health status and/or the delivery of healthcare routinely collected from a variety of sources”. The amount of RWD is increasing, which is not surprising when considering where this data comes from: electronic health records (EHRs), social media accounts, wearable devices and many other tech-driven sources.5
NEXT GENERATION SEQUENCING (NGS):
Next generation sequencing produces billions of short reads per experiment.6 Furthermore, it is estimated that over 60 million patients will have had their genome sequenced in a healthcare context by 2025.7
ADDITIONAL ‘OMICS’ TECHNOLOGIES:
These include proteomic, transcriptomic and metabolomic approaches, and the amount of data produced by these studies is increasing every year.8 One example where researchers have taken multiple omics approaches is The Cancer Genome Atlas Program, which contains 2.5 petabytes of omics data.
These large datasets are often coupled with artificial intelligence (AI) and advanced analytics approaches. These complex additional analyses and combinations of data further contribute to the huge increase in the quantity of health data. Furthermore, these are sensitive datasets, which may contain identifiable patient data or data that may be commercially sensitive, so there are valid security and privacy concerns surrounding safe access. Taken together, this creates a tough problem: enabling researchers to securely access, analyse and draw novel insights from this data.
Today, researchers are stifled by restrictions on data access. Agreements to enable data sharing between organisations are slow, typically taking organisations six months or more to gain data access. 9 Further, this data access is often provided via inflexible platforms with limited functionality.
Data access and usability has become a significant barrier in research progress and therapeutic discovery. Researchers should be enabled to access and use this health data to make new discoveries and improve patient outcomes. An effective way to achieve secure data interoperability and access across the distributed healthcare ecosystem is through federated data analysis.
The solution: federation enables secure access to data
More data, more insights
Federation allows researchers to move their data out of silos and into action, where new insights can be gained. One genome-wide association study showed that increasing sample size 10-fold led to an approximately 100-fold increase in findings, enabling genetic variants of interest to be more easily validated and studied.10
Better value for money
Reduced cost for researchers and organisations due to limited data movement and storage, i.e. costly copying or moving data is not needed.
Increased compliance
Increasing regulations (eg GDPR) mean that data often cannot cross jurisdictional borders; federation allows organisations to adhere to these regulations, as no data movement or copying is required.
Increased sustainability
Federated data access across cloud-based platforms represents the most environmentally sustainable way to share data – it reduces resource consumption by minimising data duplication and completely eliminating transfers of files.11, 12
Federated data analysis in action: using data to tackle the Covid-19 pandemic
Worldwide, large population genomics sequencing projects are being established with the aim of implementing population-level genomic medicine. Genomics is the study of the complete set of DNA in a person or other organism. With DNA underpinning a large proportion of an individual’s health and disease status, a genomic medicine approach is increasingly being applied in clinical settings.
Here, the study of clinical outcomes (measurable changes in health and well-being) is combined with genomics so researchers can better understand how a person’s genome contributes to disease. Increasingly, advances in our understanding of the genome are contributing to improvements in disease diagnosis, drug discovery and targeted therapeutics.13, 14
In the UK, the genomic medicine efforts have been spearheaded by Genomics England and its 100,000 Genomes Project, one of the largest cohorts of rare disease, infection and cancer patients globally.
During the pandemic, Genomics England worked with the NHS to deliver whole genome sequencing of up to 20,000 COVID-19 intensive care patients, and up to 15,000 people with mild symptoms. This allowed researchers to query, analyse and collaborate over these very large sets of genomic and medical data in seconds. Its enhanced functionality and automated tools helped researchers understand the underlying genetic factors that may explain what makes some patients more susceptible to the virus, or more severely ill when infected.15
The ability to obtain secure data access going forward, whilst meeting the needs of researchers, organisations and governments, will support research at scale so the maximum scientific insights can be gained, in the quickest time frame.
What else has health data federation helped us achieve so far?
It’s not too surprising that federation is gaining popularity amongst big data initiatives across the life sciences sector and having an impressive impact. Examples across industries worldwide include:
POLICY / GOVERNMENTS: A recent Genome UK policy paper produced by the UK government outlined a plan to set up a federated infrastructure for the management of UK genomics data resources. The paper states that adopting a federated approach to genome data access will enable wide-reaching benefits for patients and the NHS, and will ensure that a patient’s genomic data can inform their care throughout their life.
STANDARDS SETTING ORGANISATIONS: The Global Alliance for Genomics and Health (GA4GH), which was set up to promote the international sharing of genomic and health-related data, endorses a federated analysis approach.7 GA4GH states that federation offers organisations more control without limiting collaboration and openness, whilst remaining flexible and adaptable to specific contexts.
INTERNATIONAL BIOBANKS:
Canadian Distributed Infrastructure for Genomics (CanDIG) employs federation to draw insights from both genomic and clinical datasets.16 In Canada, each province has its own health data privacy legislation, so data generated in each province must follow provincial governance laws. The CanDIG platform tackles this using a fully distributed federated data model, enabling federated querying and analysis while ensuring that local data governance laws are complied with.
Australian Genomics, the national genomics service in Australia is also developing a federated repository of genomic and phenotypic data to bridge the gap between its national health system and state-funded genetic services.17
Recently, multi-party federation was successfully demonstrated between trusted research environments (TREs) for the first time in the UK, linking the TREs of the University of Cambridge and Genomics England – this approach has the potential to remove the geographical, logistical, and financial barriers associated with moving exceptionally large datasets. It will heavily reduce the current time burden on researchers to conduct their analysis over integrated cohorts.
Deploying federated data analysis: standards and infrastructure required
It is clear that federation is the future for enabling secure genomic and health data access, but what is required to start employing this approach as a researcher or organisation?
1. SUITABLE INFRASTRUCTURE A robust database infrastructure is required for straightforward data handling and integrated data analyses. A highly scalable platform is needed to deal with such large data; these platforms are often cloud-based to offer the required flexibility.
2. ADVANCED API, AUTHENTICATION AND ANALYSIS TECHNOLOGY A platform that can interface with distributed data sources and other platforms is needed so that federated linkage can be achieved. This typically requires:
a set of APIs that enable computational coordination and communication (i.e. federation) between platforms,
the ability to merge authentication/authorisation systems for researchers accessing data across platforms.
all the downstream tools needed to run analyses on the federated data, built into the platform.
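These requirements can be sketched in miniature. The in-process example below is purely illustrative (the class, token scheme and function names are all hypothetical): every node accepts the same credential (merged authentication), exposes a common run interface (the federation API), and executes the submitted analysis where the data lives.

```python
# Hypothetical sketch of a federation API layer: one token accepted by all
# nodes (merged authentication), a common submission endpoint, and analysis
# run where the data lives. Names and interfaces are illustrative only.

APPROVED_TOKENS = {"researcher-42"}

class FederationNode:
    def __init__(self, name, records):
        self.name = name
        self._records = records  # stays inside this node's environment

    def run(self, token, analysis):
        """The federation API: callers submit an analysis, not a download."""
        if token not in APPROVED_TOKENS:
            raise PermissionError("authentication failed")
        return analysis(self._records)  # compute travels to the data

def federated_run(nodes, token, analysis):
    return [node.run(token, analysis) for node in nodes]

nodes = [FederationNode("cambridge", [61, 72, 58]),
         FederationNode("london", [64, 70])]
counts = federated_run(nodes, "researcher-42", lambda recs: len(recs))
print(counts)  # [3, 2]
```

A production platform would replace the shared token set with a real identity federation protocol and the in-process calls with authenticated network APIs, but the division of responsibilities is the same.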
3. STANDARDISED DATA Data needs to be standardised to a common data model (CDM) such as the Observational Medical Outcomes Partnership (OMOP) model. When health data are all structured to a common international standard, merging and analysing datasets across distributed sources and platforms becomes possible, as the datasets are interoperable.
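As a sketch of what this standardisation step might look like, the snippet below reshapes one hypothetical local record into rows named after OMOP CDM tables and columns (person, condition_occurrence). The mapping function is invented for illustration, and the concept IDs are placeholders for whatever a site’s vocabulary mapping actually produces.

```python
# Illustrative mapping of a local record into OMOP-CDM-shaped rows.
# Table and column names follow the OMOP CDM; the concept IDs are
# placeholders standing in for a real vocabulary mapping.
GENDER_CONCEPTS = {"F": 8532, "M": 8507}  # illustrative concept IDs

def to_omop(local):
    person = {
        "person_id": local["patient_id"],
        "gender_concept_id": GENDER_CONCEPTS[local["sex"]],
        "year_of_birth": int(local["dob"][:4]),
    }
    condition = {
        "person_id": local["patient_id"],
        "condition_concept_id": local["diagnosis_code"],  # pre-mapped concept
        "condition_start_date": local["diagnosed_on"],
    }
    return person, condition

person, condition = to_omop({
    "patient_id": 101, "sex": "F", "dob": "1980-03-02",
    "diagnosis_code": 201826, "diagnosed_on": "2021-06-15",
})
print(person["year_of_birth"])  # 1980
```

Once every site emits rows with the same column names and vocabularies, a federated query written once can run unchanged against each of them.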
4. MAXIMUM SECURITY Finally, as federation of patient or volunteer data will involve highly sensitive health information, it is imperative to ensure all data is secure. This will involve:
strong data encryption standards. Data should be encrypted at all stages including at rest (eg when data is in storage), in transfer (eg when data is moving between storage buckets and compute machines) and during analysis.
pseudonymisation of data
role-based access control, so that data can only be decrypted by authenticated staff, with the security network imposing additional constraints on which specific users can access, view, or edit encrypted files.
consideration of how results will be securely exported, i.e via an Airlock process.
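One common way to implement the pseudonymisation step (a sketch, not a prescription) is a keyed hash of the identifier: pseudonyms remain stable, so records can still be linked across datasets, but they cannot be reversed without the secret key, which stays with the data controller. The key and identifier below are made up for illustration.

```python
import hashlib
import hmac

# Pseudonymisation via a keyed hash: the secret key is held only by the
# data controller, so pseudonyms are stable (good for record linkage)
# but cannot be reversed without the key. Key and ID are illustrative.
SECRET_KEY = b"held-by-the-data-controller"

def pseudonymise(patient_id: str) -> str:
    digest = hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

record = {"patient_id": "NHS-1234567", "diagnosis": "T2D"}
safe_record = {**record, "patient_id": pseudonymise(record["patient_id"])}

# The same input always yields the same pseudonym, so linkage still works.
print(safe_record["patient_id"])
```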
What’s next for data federation?
The next section considers the future opportunities and challenges for data federation.
1. DEMOCRATISING ACCESS TO DATA AND INSIGHTS To advance health data research safely, there must be strict regulations over how data is governed and accessed, applied at the organisational and researcher level. The increased security and reduced cost that federation brings can help securely democratise health data access, enabling more data custodians to share, access and collaborate over data, and widening access to the insights derived from it.
2. IMPROVING REPRESENTATION / DIVERSITY IN GLOBAL HEALTH DATA The vast majority of genomic studies performed to date represent populations of European ancestry.18
This lack of diversity in genomics research is a real issue as the potential insights that can be gained from such studies (for example, increased understanding of disease prevalence, early detection and diagnosis, rational drug design and improved patient care and outcomes) may not be relevant to the underrepresented populations absent from a sample, leading to misdiagnosis, poor understanding of conditions and inconsistent delivery of care.19
As a result, genomic medicine does not always benefit all people equally. A worldwide, concerted engagement effort and increased transparency are required to improve public trust and willingness to engage in research among underrepresented communities.
However, it is also possible that federated platforms, with their associated benefit of lower cost, could help make big data analytics more accessible to lower- and middle-income countries. Additionally, they can help improve the diversity of the cohorts that can be built and accessed via federated networks.
3. ENHANCING COLLABORATION ACROSS THE WORLD’S DISTRIBUTED DATA An additional aim for the future is to enable widespread multi-party federation20, 21 that spans international borders and sectors, eg international public-private federation. This will require significant work to develop the policies and governance procedures across countries and companies with different compliance requirements. However, federated data analysis can provide secure data access at scale, enabling large collaborative data ecosystems to be built. This will allow research to be done across larger and more diverse datasets, improving outcomes.
Summary
In conclusion, data federation can aid researchers, organisations and governments in a variety of ways. It can give researchers secure access to large data sets from around the world, enabling them to run analyses, find answers to pressing research issues, and make scientific discoveries. Federated data analysis maximises financial efficiency by avoiding costly data transfers. Data federation can also help support global collaboration and democratise data access to support fair benefit distribution.
Nik-Zainal S, et al. Multi-party trusted research environment federation: Establishing infrastructure for secure analysis across different clinical-genomic datasets (2022). doi:10.5281/zenodo.7085536.
Introduction
In research and healthcare, the size of datasets needed to solve crucial problems is continuing to increase. New technologies including the digitisation of healthcare tools, the accumulation of electronic healthcare records and massively reduced costs for high throughput technologies like genome sequencing all contribute to these large datasets.
These biomedical datasets can help provide answers to important questions and ultimately improve patient outcomes. Recent landmark studies that have utilised the power of big data to derive healthcare insights include the 100,000 Genomes study on rare diseases and the work detailing the host factors underlying severe COVID-19 which was conducted on almost 60,000 individuals. Both of these important studies used data from the UK’s national genomics initiative Genomics England to uncover crucial new scientific insights on important diseases.
However, secure storage and analysis of these large, sensitive datasets is becoming significantly harder. There are three key reasons for this:
Globally, there are increasing restrictions on data access to help keep sensitive information private (e.g. the General Data Protection Regulation, GDPR).
These datasets are large and can be hard to manage, making it difficult for researchers to identify the right data for their analyses.
Datasets reside in disparate labs and clinics across the globe and, as a consequence, are all too commonly siloed.
Data federation as a solution
Data federation is solving the problem of data access, without compromising data security. In its simplest terms: Data federation is a software process that enables numerous databases to work together as one. This technology is highly relevant for accessing sensitive biomedical health data, as the data remains within appropriate jurisdictional boundaries, while metadata is centralised and searchable, and researchers can be virtually linked to the data where it resides for analysis.
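To make the idea concrete, the sketch below shows (in deliberately simplified form) the "centralised, searchable metadata" half of this model: a central catalogue describes what each site holds, and researchers search it without any record-level data leaving its home site. All site, dataset and field names here are invented for illustration.

```python
# Illustrative sketch only: a central, searchable metadata catalogue that
# describes what each federated site holds, while the records themselves
# never leave their home sites. All names below are hypothetical.
CATALOGUE = [
    {"site": "site_a", "dataset": "rare_disease_wgs", "samples": 1200,
     "fields": ["genotype", "phenotype"]},
    {"site": "site_b", "dataset": "covid_host_genetics", "samples": 58000,
     "fields": ["genotype", "severity"]},
]

def find_datasets(required_field):
    """Search the central catalogue; no patient-level data is touched."""
    return [entry for entry in CATALOGUE if required_field in entry["fields"]]

# A researcher discovers which sites can answer a genotype-based question,
# then requests federated access to those sites rather than downloading data.
hits = find_datasets("genotype")
print([h["dataset"] for h in hits])
```

In a real platform the catalogue search would be an API call and access would be mediated by authorisation checks, but the principle is the same: discovery happens centrally, data stays put.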
The video below highlights Thorben Seeger, Lifebit’s Chief Business Development Officer, discussing how researchers are limited in their ability to access and analyse sensitive data and how organisations are solving this problem using data federation.
This article highlights the crucial requirements to enable data federation, which include:
Appropriate computing infrastructure
Authentication and analytics technology
Standardised, interoperable data
Best in class security measures
What is required for employing a federated data analysis approach?
There are four prerequisites for performing health data federation for research, whether as a researcher or an organisation:
1. Scalable infrastructure
Computational resources are an important consideration, given the need to process immense datasets. A robust database infrastructure is also required for efficient data processing and integrated data analysis; handling such large volumes of data demands a highly scalable platform.
The scale of distributed multi-omics and clinical datasets available today has brought an increasing shift towards commercial cloud infrastructure.
Being cloud-based provides flexibility, and the 'elastic' nature of cloud computing means researchers only pay for the resources they use.
2. Advanced APIs, authentication and analytics technology
Achieving a federated connection to where the data resides requires a platform that can communicate with distributed data sources and other platforms. Typically, this will require:
A set of APIs that enable computational coordination and communication between the platforms participating in the federation.
The ability to integrate an authentication/authorisation system, ensuring that only authorised users can access data across platforms.
The platform should have all the downstream tools necessary to perform analytics on federated data. This will enable researchers to perform data analysis, ultimately helping to accelerate research and allowing novel insights to be gained.
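The API and authentication requirements above can be sketched together as a minimal federation client. This is a toy illustration under stated assumptions, not any real platform's API: the class, token handling and endpoint names are all invented, and the network transport is omitted.

```python
# Hypothetical sketch of the API-layer requirements: a federation client
# that (1) coordinates calls to partner platforms and (2) refuses calls
# from unauthorised users. Endpoint and token names are invented.
class FederationClient:
    def __init__(self, token, authorised_tokens):
        self.token = token
        self._authorised = authorised_tokens  # stands in for a real auth service

    def _check_auth(self):
        """Authorisation gate: every federated call passes through here."""
        if self.token not in self._authorised:
            raise PermissionError("token not authorised for federated access")

    def query(self, site, endpoint, params):
        """Coordinate a call to a partner platform (transport omitted).

        A real client would POST to something like
        https://{site}/api/{endpoint} with the token in a header.
        """
        self._check_auth()
        return {"site": site, "endpoint": endpoint, "params": params,
                "status": "accepted"}

client = FederationClient("token-123", authorised_tokens={"token-123"})
print(client.query("site_a", "cohort-count", {"gene": "BRCA2"})["status"])
```

An unauthorised token would raise `PermissionError` before any request leaves the client, mirroring the requirement that authentication is enforced across platforms.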
3. Standardised data
Once the relevant infrastructure and data access requirements are in place, researchers will still be limited in the insights they can gain if the data cannot be effectively combined to enhance its statistical power. Common Data Models (CDMs) are crucial to ensuring data is interoperable; several have grown in popularity in the health sciences sector recently, including the Observational Medical Outcomes Partnership (OMOP) CDM for clinical-genomic data.
Harmonising health data to OMOP provides structure according to common international standards which ensures it is fully interoperable with other clinical datasets from other labs or clinics. This fully enables the integration and analysis of datasets across distributed sources and platforms.
Additionally, extraction, transformation and loading (ETL) pipelines that automate the work of converting raw data into analysis-ready data further simplify this process for researchers.
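A minimal ETL step of this kind might look like the sketch below, which transforms raw, site-specific records into a simplified OMOP-style `person` table. The raw field names are hypothetical and only a small subset of the OMOP person table's columns is shown; the gender concept IDs (8507 for male, 8532 for female) are the standard OMOP vocabulary values.

```python
# Illustrative ETL sketch: transform raw site-specific records into a
# simplified OMOP-style "person" table. Raw field names are hypothetical.
RAW_RECORDS = [
    {"patient_ref": "P-001", "sex": "F", "birth_year": "1984"},
    {"patient_ref": "P-002", "sex": "M", "birth_year": "1990"},
]

# Standard OMOP concept IDs for gender.
GENDER_CONCEPT = {"F": 8532, "M": 8507}

def to_omop_person(raw, person_id):
    """Extract and transform one raw record into an OMOP-like person row."""
    return {
        "person_id": person_id,
        "gender_concept_id": GENDER_CONCEPT[raw["sex"]],
        "year_of_birth": int(raw["birth_year"]),
    }

# "Load" step: build the harmonised table, assigning sequential person IDs.
person_table = [to_omop_person(r, i + 1) for i, r in enumerate(RAW_RECORDS)]
print(person_table[0])
```

Once every site runs an equivalent transformation, a query written against the common model works unchanged across all of them, which is what makes federated analysis over the harmonised data possible.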
Combining these datasets securely via federation then allows researchers to increase the statistical power of their research. For example, one genome-wide association study revealed that increasing sample size by 10-fold led to an approximately 100-fold increase in findings, enabling disease-causing genetic variants of interest to be more easily validated and studied. Secure access to large, standardised and interoperable datasets via federation can help accelerate research by providing greater statistical power for clinical studies.
4. Robust security measures
Patient or volunteer data used for research may contain highly sensitive health information. To protect patient and public privacy and build trust in the use of health data, it is vital to follow strict data security measures, including:
Strict data encryption standards. Data should be encrypted at all stages, including at rest (such as when it resides in storage), in transit (such as when data moves between storage buckets and computing machines), and during analysis.
Data pseudonymisation. Sometimes referred to as ‘de-identification’, this refers to the removal of any personal identifiers from a dataset, ensuring that a participant’s privacy is maintained.
Role-based access control to data. This ensures that only authorised users can decrypt data, with the security framework imposing additional restrictions on which users can access, view, or edit encrypted files.
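The pseudonymisation measure above can be illustrated with a short sketch: direct identifiers are stripped, and the record key is replaced with a salted, keyed hash so the same person always maps to the same pseudonym without the raw identifier ever leaving the site. The secret, field names and truncation length are all illustrative choices, not a prescribed scheme.

```python
# Illustrative pseudonymisation sketch: drop direct identifiers and replace
# the record key with a keyed hash kept consistent within the site.
import hashlib
import hmac

# Hypothetical site-local secret, held only by the data custodian.
SITE_SECRET = b"site-local-secret"

def pseudonymise(record):
    """Return a de-identified copy of the record.

    The pseudonym is deterministic per person (same input, same output),
    so records for one participant can still be linked for analysis.
    """
    pseudonym = hmac.new(SITE_SECRET, record["nhs_number"].encode(),
                         hashlib.sha256).hexdigest()[:16]
    # Keep only research-relevant fields; name and NHS number are dropped.
    return {"pseudo_id": pseudonym, "diagnosis": record["diagnosis"]}

safe = pseudonymise({"nhs_number": "943 476 5919",
                     "name": "A. Patient", "diagnosis": "T2D"})
assert "name" not in safe and "nhs_number" not in safe
```

Real deployments layer further protections on top (key rotation, separation of the pseudonymisation service from analysts, and re-identification risk review), but the core idea is this one-way, keyed mapping.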
Under this model, data cannot be exported or downloaded out of the environment. Users can only export appropriate, aggregate-level data via a secure airlock process, which allows authorised personnel to approve and validate the purpose of any data download.
The airlock policy can be fully enforced through workspaces, where no data may be extracted apart from previously whitelisted or explicitly authorised cases.
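A toy version of such an airlock rule is sketched below: only aggregate-level results above a minimum group size may leave the environment, and only after sign-off. The threshold, field names and return shape are invented for illustration; real airlock policies are richer (and audited).

```python
# Hypothetical airlock-rule sketch: gate every export on (1) human sign-off,
# (2) aggregate-only data, and (3) a minimum group size, since small counts
# can re-identify individuals. All names and thresholds are illustrative.
MIN_GROUP_SIZE = 5

def airlock_approve(export, approved_by):
    """Return (allowed, reason) for a requested export."""
    if not approved_by:
        return (False, "export requires sign-off by authorised personnel")
    if export["type"] != "aggregate":
        return (False, "record-level data may not leave the environment")
    if export["group_size"] < MIN_GROUP_SIZE:
        return (False, "aggregate group too small to release")
    return (True, "approved")

ok, reason = airlock_approve({"type": "aggregate", "group_size": 120},
                             approved_by="data-manager")
print(ok, reason)
```

The same checks would typically run automatically in the workspace and again at human review, so that nothing below the disclosure threshold can be released even by an approver.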
Summary
It is clear that data federation can bring many wide ranging benefits to researchers. It can provide secure access to global cohorts of data to help power analysis and ultimately answer important research questions. To enable federated data analysis, researchers and organisations need standardised, interoperable data, appropriate infrastructure including APIs, authentication and analytics technology and robust security measures.
By enabling data federation, organisations can provide researchers with secure data access and analysis, ensuring they spend time and effort on what matters most: gaining new insights into health and disease.
Author: Hannah Gaimster, PhD
Contributors: Hadley E. Sheppard, PhD and Amanda White
In research and healthcare, the size of datasets needed to solve crucial problems is continuing to increase. New technologies including the digitisation of healthcare tools, the accumulation of electronic healthcare records and massively reduced costs for high throughput technologies like genome sequencing all contribute to these large datasets.
These biomedical datasets can help provide answers to important questions and ultimately improve patient outcomes. Recent landmark studies that have utilised the power of big data to derive healthcare insights include the 100,000 Genomes study on rare diseases and the work detailing the host factors underlying severe COVID-19 which was conducted on almost 60,000 individuals. Both of these important studies used data from the UK’s national genomics initiative Genomics England to uncover crucial new scientific insights on important diseases.
However, secure storage and analysis of these large, sensitive datasets is becoming significantly harder. There are three key reasons for this:
Globally, there are increasing restrictions on data access to help keep sensitive information private (e.g. the General Data Protection Regulation, GDPR).
These datasets are large and can be hard to manage, making it difficult for researchers to identify the right data for their analyses.
Datasets reside in disparate labs and clinics in locations across the globe. Because of this, they are all too commonly siloed, as strict data governance laws do not allow the data to be moved or copied.
Data federation is solving the problem of data access, without compromising data security
Researchers and clinicians are missing out on the potential that these huge health datasets can bring as they are difficult to access and combine for analysis for risk of compromising security. Research progress and patient benefits are stalling due to inefficient models for secure health data access.
Data federation as a solution
Data federation is solving the problem of data access, without compromising data security. In its simplest terms: Data federation is a software process that enables numerous databases to work together as one. Using this technology is highly relevant for accessing sensitive biomedical health data, as the data remains within appropriate jurisdictional boundaries, while metadata is centralised and searchable and researchers can be virtually linked to where it resides for analysis.
Federated architectures of individual organisations may be connected together into a federated data platform, enabling data access for users across organisations.
Federated data analysis takes access a step further and brings approved researchers' analysis and computation to where the data resides. Federated data analysis allows researchers to analyse data across multiple distinct organisations in a secure manner.
With federation, data is never moved or copied, so security is maintained throughout data analysis and querying. There are other important advantages to using federated data analysis, which are summarised in the table below.
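The "compute comes to the data" idea can be made concrete with a toy sketch: each site computes a local summary of its own records, and only those aggregates (never row-level data) are combined centrally. The site names and measurements below are invented for illustration.

```python
# Toy sketch of federated analysis: each site runs the same local summary
# over its own records, and only aggregates cross organisational boundaries.
# Site names and values are invented for illustration.
SITE_DATA = {
    "site_a": [5.1, 4.8, 6.0],       # e.g. a biomarker measurement per patient
    "site_b": [5.6, 5.9, 5.2, 6.1],
}

def local_summary(values):
    """Runs *at the site*; returns aggregates only, never raw records."""
    return {"n": len(values), "total": sum(values)}

def federated_mean(site_summaries):
    """Runs centrally, combining the per-site aggregates."""
    n = sum(s["n"] for s in site_summaries)
    total = sum(s["total"] for s in site_summaries)
    return total / n

summaries = [local_summary(v) for v in SITE_DATA.values()]
print(round(federated_mean(summaries), 3))
```

The combined mean is identical to what a pooled analysis would produce, but no patient-level value ever left its site; more sophisticated federated methods (regressions, GWAS) follow the same aggregate-exchange pattern.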
There are five key benefits of data federation
Maximum security Federated data analysis maximises security because data is never copied or moved. Organisations maintain full security controls over their data. Additionally, organisations can create permissioned-based access to guarantee that only the right people have access to the required data for their work.
Increased novel insights Federated data analysis enables the use of all available data to power insights. When disparate cohorts are combined to increase sample numbers, studies gain statistical power and findings. For example, one genome-wide association study revealed that increasing sample size by 10-fold led to an approximately 100-fold increase in findings, enabling disease-causing genetic variants of interest to be more easily validated and studied. Secure access to larger datasets via federation can help accelerate research by providing greater statistical power for clinical studies.
Increased compliance Sensitive personal data such as healthcare data cannot traverse jurisdictional borders due to rising local, national, and international restrictions (e.g., GDPR). Federation enables organisations to fully comply with these rules because no data transfer or copying is necessary.
Data federation can ultimately help democratise access to data and insights gained
The benefits of expanded security and decreased costs that data federation brings serve to safely democratise valuable access to health and biomedical information, ultimately empowering researchers to share safely, access and collaborate over data worldwide.
In the case of genomics, the majority of research undertaken to date has focused on populations of European heritage. This lack of diversity in genomics research is a serious problem because it can result in misdiagnosis, inadequate understanding of conditions, and inconsistent care delivery. As a result, not everyone benefits equally from genomic medicine. To boost confidence and encourage participation in research among underrepresented communities, a global, focused engagement effort, alongside enhanced transparency, is needed to build public trust.
Public and patient trust remains a key factor in participant recruitment, particularly for historically marginalised populations. In a federated data access model, the public's data remains in the secure control of the data custodian, which could help engender increased trust. However, data access agreements must be negotiated in a manner acceptable to research participants, particularly those from historically underrepresented, marginalised or vulnerable groups.
It is also possible that federated platforms, with their associated benefits of lower cost, could help make big data analytics more accessible to lower and middle income countries. Additionally, this could help improve diversity of the cohorts that can be built and accessed via federated networks.
Ultimately, data federation can help democratise data access and promote global collaboration to help ensure equitable benefits sharing
Summary
In summary, data federation can bring many wide ranging benefits to researchers. It can provide secure access to global cohorts of data to help power their analysis, answer important research questions and lead to scientific discovery. Federated data analysis offers maximum value for money as costly data transfers are avoided. Ultimately, data federation can help democratise data access and promote global collaboration to help ensure equitable benefits sharing.
Look out for the next blog in our series where we will take a detailed look into the key technical requirements that are required for organisations to enable data federation.
Author: Hannah Gaimster, PhD
Contributors: Hadley E. Sheppard, PhD and Amanda White
In research and healthcare, the size of datasets needed to solve crucial problems is continuing to increase. New technologies including the digitisation of healthcare tools, the accumulation of electronic healthcare records and massively reduced costs for high throughput technologies like genome sequencing all contribute to these large datasets.
These vast datasets can help provide answers to important questions and ultimately change lives for the better. Recent landmark studies that have utilised the power of big data in health research include the 100,000 Genomes study on rare diseases and the work detailing the host factors underlying severe COVID-19 which was conducted on almost 60,000 individuals.
However, secure storage and analysis of these large, sensitive datasets is becoming significantly harder. There are three key reasons for this:
Globally, there are increasing restrictions on data access to help keep sensitive information private (e.g. the General Data Protection Regulation, GDPR).
These datasets are large and can be hard to manage, making it difficult for researchers to effectively identify the right data for their analyses.
Datasets reside in disparate labs and clinics in locations across the globe, and they are all too commonly siloed.
Researchers and clinicians are missing out on the potential that these huge health datasets can bring as they are difficult to access and combine for analysis for risk of compromising security. Research progress and patient benefits are stalling due to inefficient models for secure health data access.
Data federation as a solution
This article explores how data federation is solving the problem of data access, without compromising data security.
In its simplest terms: Data federation is a software process that enables numerous databases to work together as one. Using this technology is highly relevant for accessing sensitive biomedical health data, as the data remains within appropriate jurisdictional boundaries, while metadata (information about the data) is centralised and searchable.
Data federation is an alternative to a model in which data is moved or duplicated and then centrally housed. When data is moved it becomes vulnerable to interception, and moving large datasets is often very costly for researchers. Instead, approved users access the data via linking technologies such as Application Programming Interfaces (APIs).
Federated architectures of individual organisations may be connected together into a federated data platform, enabling data access for users across organisations.
Federated data analysis takes access a step further and brings approved researchers' analysis and computation to where the data resides. Federated data analysis allows researchers to analyse data across multiple distinct organisations in a secure manner.
The video below highlights Professor Serena Nik-Zainal, Professor of Genomic Medicine and Bioinformatics at The University of Cambridge, discussing why researchers need to securely access and analyse health data and how organisations are solving this problem using data federation.
How does federated data analysis work?
The video below demonstrates how federated data analysis functions. Historically, data access has typically required researchers to access and analyse data by downloading it from its disparate sources and analysing it together within a centralized location (steps 1 and 2). Federated analysis (step 3) allows the distributed data from multiple sources to be analysed in parallel, saving the researcher time and money, while also keeping the data secure.
A federated approach to data analysis allows researchers and clinicians to combine global cohorts of data, to maximise new scientific discoveries that can be made when this data is securely combined.
Where is federated data analysis being used?
Whilst this groundbreaking technology is still relatively new, federated architectures and data federation are increasingly becoming a trusted solution that ensures data security while enabling global collaboration. At Lifebit, we employ a federated architecture as part of the Lifebit Platform, which is being used by leading research organisations, precision medicine initiatives and government biobanks globally, including Genomics England, The Greek Newborn Genomic Screening project and the Danish National Genome Center.
Below are other key examples where federated data analysis is gaining traction across the public sector and industries worldwide:
GOVERNMENT: The UK government has published a strategy paper Genome UK outlining its ambition to create a federated infrastructure for the administration of UK genomics data resources. A federated approach to genomic data access, according to the report, will provide substantial benefits for patients and the national health service (NHS) and ensure that a patient’s genomic information can guide their care for the duration of their lives.
STANDARDS IMPLEMENTING ORGANISATIONS: Federated data analysis is supported by the Global Alliance for Genomics and Health (GA4GH), which was established to encourage the international secure exchange of genomic and health-related data.
According to GA4GH, federation gives organisations more authority over their sensitive data without restricting openness and collaboration, promoting flexibility and adaptability.
GLOBAL BIOBANKS: The Canadian Distributed Infrastructure for Genomics (CanDIG) uses federation to glean new information from both genomic and clinical datasets. Because each Canadian province has its own legislation protecting the privacy of health data, data generated in each province must abide by provincial governance standards. With a fully distributed federated data analysis model that allows federated querying and analysis while guaranteeing that local data governance laws are followed, the CanDIG platform addresses this compliance challenge with provincial law.
To close the gap between its national health system and state-funded genetic services, Australian Genomics, the country’s national genomics service, is also creating a federated library of genomic and phenotypic data.
PHARMACEUTICAL FIRMS: Boehringer Ingelheim recently announced an approach that uses data federation to speed up its research and development efforts. This will provide strong analytical capabilities and secure worldwide biobank connectivity, creating a secure "dataland" for analytics and research and, ultimately, hastening the development of novel medications and enhancing patient outcomes.
Summary
In summary, federated data analysis is a crucial approach to secure data access that enables authorised research on combined data. It provides maximum security while ensuring that disparate datasets can be combined for analysis.
Look out for the next blog in our series where we will take a deep dive into some of the other key benefits of data federation.