CloudOS Meets nf-core: Standardising Cloud-Native Bioinformatics Pipelines

Lifebit
Bioinformaticians in the pursuit of best practices

The democratisation of Next Generation Sequencing (NGS) technologies has made omics data more available than ever before. Bioinformaticians are now confronted with tackling the NGS data deluge in order to deliver actionable insights. Transforming raw data into impactful insights requires bioinformaticians to process omics data with bioinformatics pipelines. To date, however, no validated standards have been widely established by the bioinformatics community.

As with omics data, the landscape of bioinformatics tools and software is evolving at a rapid pace, making it hard for bioinformaticians to keep up with the latest releases. Some workflows, however, are considered best practice in the field, such as the GATK Best Practices workflow for standardising secondary NGS data analysis, defined by the Broad Institute, and Illumina's DRAGEN, the accelerated version of BWA/GATK that I discussed extensively in my previous blog post. Yet even when best practices such as the GATK workflow are recommended, issues arise when implementation is left in the hands of individual researchers, especially when pipelines are customised to run in local environments. This ultimately prevents interoperability, a central pillar of the FAIR (findable, accessible, interoperable, reusable) principles.

The development of scientific workflow management systems stems from reproducibility, standardisation and portability issues in bioinformatics

Reproducibility crisis in Bioinformatics (source)

There is growing alarm about results that cannot be reproduced, especially in the field of bioinformatics: about 52% of researchers agree that there is a reproducibility crisis (read about how some researchers tackle reproducibility within their own laboratories). Reproducibility, together with standardisation, portability, and data governance issues in bioinformatics, has led to the organic development of powerful scientific workflow management systems, which allow users to build pipelines and improve portability by abstracting the computing infrastructure from the pipeline logic. Workflow management systems capture the exact methodology that bioinformaticians have followed for a specific in silico experiment, thereby greatly improving the reproducibility of computational analyses.

A widely adopted workflow management system is Nextflow, an open-source, container-based programming framework that allows workflows to run more efficiently. More than half of the large international pharmaceutical companies involved in bioinformatics analysis have adopted Nextflow, as it has been demonstrated to be superior in terms of scalability and data handling, and more numerically stable, than other open-source workflow management systems. Nextflow is based on the dataflow programming model, which simplifies and streamlines the writing of complex distributed pipelines. Parallelisation is defined implicitly by each process's input and output declarations, producing parallel applications that can easily be scaled up or down depending on real-time compute requirements. The automation inherent to Nextflow minimises human error and increases reproducibility across different compute environments.
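To illustrate the dataflow model, here is a minimal Nextflow (DSL2) sketch; the process name, file pattern and commands are illustrative, not part of any nf-core pipeline. Because the process declares a single file as its input, Nextflow schedules one task per file emitted by the channel, so parallelisation falls out of the declarations rather than any explicit threading code:

```nextflow
// Minimal illustrative pipeline -- process and file names are hypothetical.
nextflow.enable.dsl = 2

process COUNT_READS {
    input:
    path reads          // one FASTQ file per task

    output:
    path "${reads}.count"

    script:
    """
    echo \$(( \$(zcat -f ${reads} | wc -l) / 4 )) > ${reads}.count
    """
}

workflow {
    // The channel fans out: each matching file becomes an independent,
    // implicitly parallel invocation of COUNT_READS.
    Channel.fromPath('data/*.fastq.gz') | COUNT_READS
}
```

Running the same script on a laptop, an HPC scheduler or the cloud only requires changing the executor configuration, not the pipeline logic, which is what makes the abstraction portable.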

A community-driven initiative to implement best practices for Nextflow pipeline development

The nf-core framework (source)

In late 2017, the nf-core framework was created to remove the silos among bioinformaticians working on their own isolated copies of workflows built with Nextflow. nf-core's ultimate goal is to tear down these barriers so the community can work together to deliver best-practice, peer-reviewed pipelines that can be used by anyone, whether individual researchers, developers or research facilities. nf-core pipelines are easy to use (they can be run with a single command!), are bundled with containers (Docker or Singularity) providing all required software dependencies, have continuous integration testing and stable release tags, and provide extensive documentation.
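As a sketch of that "single command" claim, a tagged nf-core release can be launched as below; the pipeline, revision and input paths are illustrative examples, and the exact parameters vary per pipeline (check each pipeline's docs):

```shell
# Run a tagged nf-core release, with dependencies supplied via Docker.
# Pipeline name, revision and file names here are illustrative.
nextflow run nf-core/rnaseq -r 3.14.0 \
    -profile docker \
    --input samplesheet.csv \
    --outdir results
```

Pinning a release with `-r` and resolving software through the bundled container profile is what makes the run reproducible on another machine.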

How Lifebit has been contributing to Nextflow best practices & nf-core

Last year, Lifebit contributed to the nf-core community initiative by developing our DeepVariant pipeline in adherence to the nf-core guidelines. If you're curious about what it takes to develop a best-practice nf-core pipeline, check out Lifebit bioinformatician Phil Palmer's account of his experience making a pipeline nf-core compliant.

One of Lifebit's core missions is to make all the data, tools and computational resources needed to run reproducible bioinformatics analyses available in one place. To facilitate the use of gold-standard Nextflow pipelines, we have therefore made all stable nf-core pipelines available on the CloudOS platform, allowing CloudOS users to easily access nf-core approved pipelines within their own CloudOS environment.

The benefits of running nf-core bioinformatics pipelines through CloudOS in your own AWS or other cloud

Why run nf-core stable release pipelines through CloudOS rather than the command line interface, you ask? At Lifebit, we always prioritise CloudOS features that reinforce standardisation, reproducibility and auditability, thereby ensuring that all stages of a bioinformatics analysis are FAIR. By running nf-core stable release pipelines in CloudOS, you will be able to:

  • Run nf-core stable release pipelines through a graphical user interface (GUI), with both job deployment and scheduling built in
  • Manage all your data, environment and pipelines in one place
  • Scale – you have access to a near-infinite compute infrastructure
  • Monitor jobs in real-time and easily visualise results
  • Painlessly collaborate with the rest of your team through 1-click cloning and sharing of any analysis
  • Reduce costs of running nf-core stable release pipelines with AWS spot instances
  • Above all, have analysis versioning with the only GitHub-like system that versions every element of an analysis (data, pipeline, resources used, costs, owner, duration, timestamps, etc.), rather than versioning the nf-core pipeline alone

Stay tuned: we will continue to grow our pipeline catalogue on CloudOS as more stable releases of Nextflow-based pipelines are curated and released through the nf-core initiative.

A special thanks to the nf-core team for their help & suggestions with this article!

We would like to know what you think! Please fill out the following form or contact us at hello@lifebit.ai. We welcome your comments and suggestions!

