[BioRuby] Improve rake/snakemake/nextflow.io?

Mon Mar 2 17:00:24 UTC 2015

Hi Yannick,
that's an interesting topic.
I have been working for a while on a Ruby package to handle pipelines and
distributed analyses in our Bioinformatics core: the code is here
https://github.com/fstrozzi/bioruby-pipengine .

With this solution we have decided to stick to a simple approach, i.e.
pipeline templates written in YAML where you can put raw command lines
with simple placeholders that get substituted at run time according to your
project and samples. So the DSL is reduced to a minimum and the tool then
creates runnable scripts that can be sent through a queuing system. There
is also simple error control for jobs, plus checkpoints to skip
already completed steps for a given pipeline.
This is *very* Illumina-centric and so far it works only through a
Torque/PBS scheduler (this is what we have in-house). It is a bit rough but
we have been using it for more than 2 years now and we are quite happy. I
know it has also been used in other places. I've recently started a Scala
implementation of this code (https://github.com/fstrozzi/PipEngine), to
make it more portable and also to introduce a number of improvements. It's
still very much a work in progress, but among other things we want to add
support for multiple queuing systems, step dependencies and Docker.
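
To make the template idea above concrete, here is a minimal Ruby sketch of
placeholder substitution with checkpoints. The template keys, the
<placeholder> syntax, the checkpoint file names and the helper names (fill,
job_script) are all made up for illustration; this is not PipEngine's actual
format or code, just the general approach under those assumptions.

```ruby
require "yaml"

# Hypothetical template: keys and <placeholder> syntax are invented for this
# sketch and are NOT PipEngine's real format.
TEMPLATE = <<~YAML
  pipeline: mapping
  steps:
    align:
      run: "bwa mem <index> <sample_path>/<sample>_R1.fastq.gz <sample_path>/<sample>_R2.fastq.gz > <sample>.sam"
    sort:
      run: "samtools sort -o <sample>.bam <sample>.sam"
YAML

# Replace every <name> placeholder with its value; fail loudly on unknown names.
def fill(command, values)
  command.gsub(/<(\w+)>/) do
    values.fetch(Regexp.last_match(1)) { |key| raise KeyError, "no value for <#{key}>" }
  end
end

# Build the text of a shell script for one sample. Each step is wrapped in a
# checkpoint test, so re-running the script skips steps that already finished.
# "set -e" stops the script at the first failing command, so the checkpoint
# file is only touched after a successful step.
def job_script(template, values)
  lines = ["#!/bin/bash", "set -e"]
  template.fetch("steps").each do |name, step|
    checkpoint = ".#{values.fetch('sample')}.#{name}.done"
    lines << "if [ ! -e #{checkpoint} ]; then"
    lines << "  #{fill(step.fetch('run'), values)}"
    lines << "  touch #{checkpoint}"
    lines << "fi"
  end
  lines.join("\n") + "\n"
end

template = YAML.safe_load(TEMPLATE)
values = { "sample" => "sampleA", "sample_path" => "/data/run1", "index" => "/ref/genome.fa" }
puts job_script(template, values)
```

The generated text could then be written to a file and handed to whatever
scheduler is available; the submission step is left out here because it
depends on the local queuing system.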

Anyway, the point with these solutions, in my opinion, is that I do not
think there can be a perfect tool that fits every purpose, scenario
or environment. There was a similar discussion on the biocore mailing
list some time ago and it turned out that many centres either use their own
systems or take existing solutions, such as Bpipe, and modify
them to fit their needs. Nextflow is also a very nice tool.

In the end we have done the same and developed a solution that, despite
its limitations, fits our needs and our way of structuring and
organising the data analyses.

Cheers
Francesco

On Mon, 2 Mar 2015 at 16:50 Yannick Wurm <y.wurm at qmul.ac.uk> wrote:

> Hi all,
>
> dumb question that hasn't been asked/discussed here for a while...
> What's the easiest way to make a *simple* pipeline?
>
> Two contenders that come up in google:
>
> * snakemake
>   http://metagenomic-methods-for-microbial-ecologists.readthedocs.org/en/latest/day-1/#merge-paired-end-illumina-data
>
> * nextflow
>   http://www.nextflow.io/example4.html
>   This one clearly allows grouping of files (e.g. read_pairs)
>
> Any other rake/make-killers?
>
> Criteria I think are important are:
>  * simple syntax (yaml?)
>  * easy wild-carding syntax/DSL
>       XXX.bam requires #{basename($_)}.sam
>  * easy grouping of files (for paired reads; for samples split across
> multiple files)
>  * easy error checking & failing
>    - e.g. checking that output files are not empty
>    - e.g. checking that files have same length (when appropriate)
>    - e.g. checking return code or presence/absence of specific text in
> stdout or stderr
>
> The additional killer would be amazing visual progress output & if it
> learnt how long specific steps are likely to take, to provide an ETA.
>
> Cheers,
>
> Yannick
>
>
> -------------------------------------------------------
> Yannick Wurm - http://wurmlab.github.io
> Ants, Genomes & Evolution ⋅ y.wurm at qmul.ac.uk ⋅ skype:yannickwurm ⋅ +44
> 207 882 3049
> 5.03A Fogg ⋅ School of Biological & Chemical Sciences ⋅ Queen Mary,
> University of London ⋅ Mile End Road ⋅ E1 4NS London ⋅ UK