[Bioperl-pipeline] better re-run management

Elia Stupka elia@fugu-sg.org
Thu, 22 Aug 2002 15:17:35 +0800 (SGT)


Not sure how we want to do this, but I was thinking that a typical problem
with the pipeline is that things go wrong in the middle (data or
whatever), and then often all the rules etc. get deleted and reloaded, and
often stuff gets reloaded again from the next step onwards... the end
result is that the rule etc. tables will not actually contain a clean
history of what has been done to the sequence.

A few things we might want to look into:

XML update functionality, i.e. I loaded a pipeline XML template, steps
1-6 ran fine, but step 7 was wrong, so everything failed from there. I go
and change the XML file, run XMLtoDB.pl -update, and it rewrites only the
analyses that have changed, i.e. 7 onwards (see the sketch below).
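
Something like this, as a very rough sketch of the -update logic. The
table and column names (analysis, logic_name, program, parameters), the
XML layout, and the connection details are all guesses for illustration,
not the real bioperl-pipeline schema:

#!/usr/bin/perl -w
# Sketch of an -update mode: rewrite only analyses whose definition
# in the XML differs from what is already loaded in the database.
use strict;
use DBI;
use XML::Simple;
use Digest::MD5 qw(md5_hex);

my $dbh = DBI->connect('dbi:mysql:pipeline', 'user', 'pass',
                       { RaiseError => 1 });
my $xml = XMLin('pipeline.xml', ForceArray => ['analysis']);

foreach my $ana (@{ $xml->{analysis} }) {
    my $logic_name = $ana->{logic_name};
    my $new_sum = md5_hex(($ana->{program}    || '')
                        . ($ana->{parameters} || ''));

    # What is currently loaded for this analysis, if anything?
    my ($old_program, $old_params) = $dbh->selectrow_array(
        'SELECT program, parameters FROM analysis WHERE logic_name = ?',
        undef, $logic_name);

    if (!defined $old_program) {
        # brand new analysis: just insert it
        $dbh->do('INSERT INTO analysis (logic_name, program, parameters)
                  VALUES (?, ?, ?)', undef,
                 $logic_name, $ana->{program}, $ana->{parameters});
    }
    elsif (md5_hex(($old_program || '') . ($old_params || '')) ne $new_sum) {
        # changed analysis: rewrite it, leaving earlier steps untouched
        $dbh->do('UPDATE analysis SET program = ?, parameters = ?
                  WHERE logic_name = ?', undef,
                 $ana->{program}, $ana->{parameters}, $logic_name);
        # ...this is also where stale rules/jobs for this analysis
        # would get cleaned out...
    }
    # unchanged analyses are left alone, so their history survives
}
$dbh->disconnect;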

One of the things that would aid this process would be to have some kind
of completed_analysis table, so that once all jobs are finished for an
analysis, it is ticked as finished (e.g. RepeatMasker now for Ciona). It
would also help when there is a WAIT_FOR_ALL condition: the complex code
that checks whether all jobs have finished would run only once, and at any
later time you could just query that table (see the sketch after this
paragraph).
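
A rough sketch of what that could look like. Again the job table, its
status values, and the column names are assumptions, not the actual
schema (and REPLACE INTO is MySQL-specific):

use strict;
use DBI;

my $dbh = DBI->connect('dbi:mysql:pipeline', 'user', 'pass',
                       { RaiseError => 1 });

# One row per analysis whose jobs have all finished
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS completed_analysis (
        analysis_id INT NOT NULL PRIMARY KEY,
        completed   DATETIME NOT NULL
    )
});

# Run the expensive "are all jobs done?" check once, then record it
sub mark_if_complete {
    my ($analysis_id) = @_;
    my ($unfinished) = $dbh->selectrow_array(
        q{SELECT COUNT(*) FROM job
          WHERE analysis_id = ? AND status != 'COMPLETED'},
        undef, $analysis_id);
    if ($unfinished == 0) {
        $dbh->do(q{REPLACE INTO completed_analysis VALUES (?, NOW())},
                 undef, $analysis_id);
    }
    return $unfinished == 0;
}

# Later, a WAIT_FOR_ALL condition becomes a cheap lookup
sub analysis_is_complete {
    my ($analysis_id) = @_;
    my ($done) = $dbh->selectrow_array(
        q{SELECT 1 FROM completed_analysis WHERE analysis_id = ?},
        undef, $analysis_id);
    return defined $done;
}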

It's basically a way to "safeguard" the analysis, logic and data of the
parts of a pipeline that are done and dusted. This would also enable, for
example, packaging up whole databases to run multiple analyses over the
half-baked databases, e.g. package the Fugu genome with blasts, repeats,
and genscans, and then run different flavours of genebuilding without
losing the history of what has been done before and with what logic.

How does it sound?

Elia

********************************
* http://www.fugu-sg.org/~elia *
* tel:    +65 6874 1467        *
* mobile: +65 9030 7613        *
* fax:    +65 6779 1117        *
********************************