Handling failing jobs with Nextflow

Luca Cozzuto
2 min read · Oct 8, 2019


I decided to start writing short blog posts about hacks and tricks to help other people not struggle too much with coding… Who knows, maybe this will actually save someone some time and headaches! :)

One of the things I like the most about Nextflow is the ability to handle the jobs that fail for whatever reason.
In the past my beautiful but fragile pipelines (proudly written in Perl) did not really handle this… So let’s take advantage of this built-in ability of Nextflow and see how to make use of it.
As mentioned previously, Nextflow keeps the configuration separate from the main code. Within the configuration we can specify categories of resources (flagged with “withLabel”) needed by some processes.

We can then assign a process to a given category using the “label” directive.
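For instance, a minimal sketch (the label name “big_mem” and the resource values here are arbitrary examples, not from the original post):

```groovy
// nextflow.config — define a resource category with "withLabel"
process {
    withLabel: big_mem {
        memory = 8.GB
        time   = 2.h
    }
}
```

In the pipeline script, a process then opts into that category simply by declaring `label 'big_mem'` among its directives.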

Now we can make a category with the ability to retry a failed execution automatically, using the “errorStrategy” directive. Setting it to “retry” re-executes that process automatically, up to “maxRetries” times.

In the example below I created a configuration for a kind of process that can be executed up to 4 times.
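The configuration could look like the following sketch (the label name “retriable” is an illustrative assumption; the retry count and the memory/time values are the ones discussed in this post):

```groovy
// nextflow.config — processes with this label are retried automatically,
// with resources that grow on each attempt
process {
    withLabel: retriable {
        errorStrategy = 'retry'
        maxRetries    = 4
        memory        = { 3.GB * task.attempt }
        time          = { 6.h * task.attempt }
    }
}
```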

Thanks to this part:

```groovy
memory = { 3.GB * task.attempt }
time   = { 6.h * task.attempt }
```

Each execution will increase both the time and the memory limit.

In this particular process, failures caused by running out of time or out of memory were giving me exit status 140, so I decided to retry only in that case and to fail in all the others.

The way to do that is to assign the value of maxRetries depending on “task.exitStatus”: a variable that captures the exit status of that process.

I’ll assign 4 to maxRetries in case of exit status 140 and just 1 in any other case:

```groovy
maxRetries = { task.exitStatus == 140 ? 4 : 1 }
```

In case we know that a failure is a possibility and we want the pipeline to continue anyway, we can define the value of “errorStrategy” depending on the number of attempts (task.attempt). We can try up to 3 times and then ignore that process.
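A sketch of that idea (the label name “optional_step” is an illustrative assumption):

```groovy
// Retry on the first failures, then ignore the process from the third attempt on
process {
    withLabel: optional_step {
        errorStrategy = { task.attempt < 3 ? 'retry' : 'ignore' }
        maxRetries    = 3
    }
}
```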

However, this failure will be reported in the log, and if you relaunch the pipeline with the “-resume” option these failed processes will be submitted again.

Sometimes the failure can be caused by a resource that is temporarily unavailable, such as a missing internet connection. To make the pipeline wait for a given time before a new attempt, you can use a syntax like the following:
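A common idiom is to sleep inside a dynamic “errorStrategy” closure before returning 'retry', here with an exponential back-off (the label name and the delay values are illustrative assumptions):

```groovy
// Wait longer before each new attempt (400 ms, 800 ms, 1600 ms, ...)
process {
    withLabel: needs_network {
        errorStrategy = { sleep(Math.pow(2, task.attempt) * 200 as long); return 'retry' }
        maxRetries    = 5
    }
}
```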

So, now you can make your pipeline more resilient to single failures without having to adjust the resources by hand each time!
