Nextflow Tutorial: Developing nf-core DeepVariant, a Google Variant Caller
Lifebit
If you are a bioinformatician with a bit of Nextflow knowledge and want to level up your Nextflow skills, then you are in the right place. That was me just one short week ago…
It’s still me now, but I’ve learnt a lot since then. Hopefully, this blog post can help you quickly improve your Nextflow ability and get you started with nf-core. If you don’t know what nf-core is, it’s worth first taking 5 minutes to read this introduction blog. If you are thinking of making your own nf-core pipeline, I would highly recommend you follow these useful instructions for doing so.
This blog post is essentially a list of things I wish I knew before I started making an nf-core pipeline, which would have made things easier. Currently, the only way to really get this information is from reading through the many nf-core pipelines and code reviews.
While not essential, it’s beneficial if:
- You’re familiar with nf-core and Nextflow
- You’re used to working with git and GitHub
- You have a workflow you’re thinking of adding that meets the nf-core guidelines
Parameters
It’s best to move all of your “params” definitions (e.g. params.fasta) into your nextflow.config file instead of main.nf. If they are all defined in one place within the config file, they are far easier to overwrite or remove. It’s also often worth setting their defaults to false rather than other values such as an empty string. This also applies to parameters such as genome, in order to force users to explicitly choose the genome version and avoid accidents.
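Here is a minimal sketch of what such a params block in nextflow.config might look like (the defaults shown are illustrative):
params {
    fasta  = false
    fai    = false
    bed    = false
    genome = false // force an explicit choice of genome version
}
With everything defined here, any of these can be overridden on the command line (e.g. --fasta my.fasta) or by another config file.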
Channels
How it can feel using channels in Nextflow
The most difficult part of Nextflow is channels. Using them feels like an elaborate game of hot potato where you have to make sure that the correct files are passed to the correct process. If you’re not careful with both channels and hot potato, it’s likely to end badly.
Why is it worthwhile using them at all? Using channels to handle input files, instead of calling file() directly, ensures that files are properly staged. Calling file() directly requires the task to have access to that part of the disk/storage, which can be a problem in some environments. At Lifebit, we are far too familiar with this problem due to our use of cloud infrastructure such as Amazon’s AWS instances.
Channels have other benefits too! For example, you can use their built-in functionality for handling missing files. Here is what such a channel might look like for processing a bed file input:
bed = Channel
    .fromPath(params.bed)
    .ifEmpty { exit 1, "please specify --bed option (--bed bedfile)" }
Another tip for using channels is to make sure that they have informative names, for example, bed: short and to the point. In general, it’s a good idea to make your code explicit. One example of this is to name the elements when remapping a channel in order to increase readability:
// going from this
fastaChannel.map { file -> tuple(1, file[0], file[1], file[2], file[3], file[4]) }
// to something more like this with explicit filenames
fastaChannel.map { fasta, fai, fastagz, gzfai, gzi -> tuple(1, fasta, fai, fastagz, gzfai, gzi) }
Here you can see that all of the files have explicit names. While this is more verbose, it is clearer and more readable, which in most cases is what matters most when writing nf-core pipelines, and indeed most Nextflow pipelines.
As a side note on the topic of channels, be careful with your operators in Nextflow. I learnt the hard way the difference between the mix() and merge() operators. I was getting intermittent errors: running exactly the same command, sometimes it would work and sometimes it wouldn’t. Ten commits later, I realised that mix() doesn’t preserve the order of your files while merge() does, and this was causing the error. To avoid this kind of issue, you can find the full list of Nextflow operators in the documentation here.
// this
fastaChannel = Channel.from(fastaCh).mix(bedCh, faiCh, fastaGzCh, gzFaiCh, gziCh).collect()
// should have been this
fastaChannel = Channel.from(fastaCh).merge(bedCh, faiCh, fastaGzCh, gzFaiCh, gziCh).collect()
Logic
Make sure that any logic, for instance conditionals such as if statements, lives outside of your script block and is handled in Nextflow itself. This increases traceability and makes debugging easier. Let’s take an example:
// this (checks if params.fai is false before executing samtools)
process preprocess_fai {
    input:
    file fasta from fasta

    output:
    file "${fasta}.fai" into fai

    script:
    """
    [[ "${params.fai}" == false ]] && samtools faidx $fasta
    """
}

// can be rewritten as this
process preprocess_fai {
    when:
    !params.fai

    input:
    file fasta from fasta

    output:
    file "${fasta}.fai" into fai

    script:
    """
    samtools faidx $fasta
    """
}
Notice the when: block, which ensures that the process is only executed if params.fai is false, i.e. has not been provided. This can be incredibly useful for only executing code when certain conditions are met, and you can chain together multiple conditions like so:
if (params.fasta && !params.fai) {
    // only executed if a fasta is provided and no fai is given
}
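The same chained condition can also be used directly inside a when: block. Here is a minimal sketch along the lines of the process above (the output name is illustrative):
process preprocess_fai {
    when:
    params.fasta && !params.fai

    input:
    file fasta from fasta

    output:
    file "${fasta}.fai" into fai

    script:
    """
    samtools faidx $fasta
    """
}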
The power of conda
In order to release an nf-core pipeline, it is recommended that software dependencies are installed with conda. Fortunately, most of this is generated as a template by the nf-core create command, part of nf-core/tools. This is very powerful, as it means that you can support both Docker and Singularity containers while only maintaining one file. Nextflow also supports conda natively, so with one file the pipeline can support three different software dependency systems. Even if you don’t use nf-core, this is a good way to install software dependencies for Nextflow projects whenever possible. There is a large number of packages available through conda, and it makes updating software dependencies really simple.
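As a rough sketch of how this looks on the Nextflow side, the config can expose all three options as profiles (the profile names and container tag below are illustrative; the nf-core template generates the real thing for you):
// the same container, built from environment.yml, backs both Docker and Singularity
process.container = 'nfcore/deepvariant:latest'

profiles {
    conda       { process.conda = "$baseDir/environment.yml" }
    docker      { docker.enabled = true }
    singularity { singularity.enabled = true }
}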
General tips
The following are some general tips that I won’t write about in detail to keep the blog post short:
- Don’t use absolute paths for executables in containers as this will need to be manually changed every time the software is updated
- Separate config files can be useful. For example, with DeepVariant this was done for the reference genomes, and the same approach works for different computing environments, such as a computing cluster vs a cloud environment (see the sketch after this list)
- Keep your style consistent, for example indentation, and use either camelCase or snake_case for process names
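As a rough sketch of how separate config files can be wired in from nextflow.config (the file and profile names here are illustrative):
// nextflow.config
includeConfig 'conf/base.config'

profiles {
    standard { includeConfig 'conf/genomes.config' }
    cluster  { includeConfig 'conf/cluster.config' }
    cloud    { includeConfig 'conf/cloud.config' }
}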
Final words on practice and simplicity
A closing tip is to make sure you don’t get too lost in the details. If all of this seems rather overwhelming, then perhaps the most important thing is to just get started coding. If you’re unsure of anything, you can always ask the Nextflow or nf-core community via their Gitter channels, as they are always happy to help.
If all of this seemed straightforward, then perhaps it is best to think at a higher level and ask whether there is any way you can simplify your code. I know some programmers prefer not to spend too much time planning, as that means less time coding. However, planning and adhering to these guidelines can really be worth it. Ultimately, the goal is for other people to use and share your pipeline, and it’s by following these principles that others can understand and use your code. Why do we write code in the first place if not for people to use it?
Thank you for reading, and thanks to everyone at nf-core for all their help and being so welcoming.
Now go build your own nf-core pipeline!
We would like to know what you think! Contact us at hello@lifebit.ai; we welcome your comments and suggestions!