[Bioperl-pipeline] Changes to come (long long mail)

Shawn Hoon shawnh at fugu-sg.org
Sat Aug 16 01:06:06 EDT 2003


I'm halfway through adding more functionality to biopipe. I've been mulling over the idea of allowing analyses to be chained in memory, and I hope this doesn't go against any biopipe philosophy ha.. if there are any. These changes will require modifications to the xml and schema.

Motivation
---------------
During the execution of a series of analyses, the system requires that each analysis has some place to store (in a db) or dump (to a file) its results in order to pass them between analyses. This means that one

1) will store all intermediate results, so that in the event that an analysis fails, you can rerun from the last failed analysis.
2) will need to design a dumper/schema in which to hold the intermediate results.

1) saves compute time, while 2) requires the programmer to do work: design temporary databases, dbadaptors etc.

An alternative to this is to write a Combo-Runnable, for example BlastEst2Genome, which is not very modular or extensible.

Sometimes the cost of doing 2) outweighs the savings from 1), especially if the analyses are mini jobs that run quickly. So for the scenario where we have a series of analyses that run fast, and we are only interested in storing the result of the last analysis, it makes sense to allow chaining of jobs in memory.

My current use case:

Running a targeted est2genome/genewise to map cdnas/proteins to a genome.

The strategy is to run a blast of the sequence against the genome with high cutoffs to map the approximate location, then run a sensitive est2genome or genewise against the smaller region.

In my case, I only want to run the alignments on the top 2 blast hits (2 haplotypes).

So rather than doing the following:

est->
    Analysis: Run Blast against genome -> Output (store blast hit)
    Analysis: setup_est2genome -> Input (fetch_top_2 blast_hit)
    Analysis: Est2Genome -> Output (store gene)

I now do the following:

est->
    Analysis: Run Blast against genome
        -> Chain_Output (with filter attached) && (Output (store blast hit) {Optional})
            -> Analysis (setup_est2genome)
    Analysis: Est2Genome -> Output (store gene)

	
We no longer need a temporary blast hit database, but we can still have the hits stored if we want to by attaching an additional output iohandler.
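
To give a rough idea of how this could look in the xml (the element and attribute names below are just a sketch of what I have in mind, not final syntax), the blast analysis might carry two output handlers:

    <analysis id="2">
      <logic_name>Blast</logic_name>
      <!-- chain handler: pass the filtered hits in memory to analysis 3 -->
      <output_iohandler type="chain" next_analysis="3">
        <filter module="FilterTop2Hits"/>
      </output_iohandler>
      <!-- optional handler: also store the raw blast hits in a db -->
      <output_iohandler type="db" iohandler="store_blast_hit"/>
    </analysis>

The chain handler takes the place of the temporary blast hit database, and the db handler is only there if you still want the hits stored.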


The Guts
---------------

What I'm proposing is to have a grouping of rules.

A rule group means that I will chain a group of analyses in a single job.

Sample rule table:

+---------+---------------+---------+------+---------+
| rule_id | rule_group_id | current | next | action  |
+---------+---------------+---------+------+---------+
|       1 |             1 |       1 |    2 | NOTHING |
|       2 |             2 |       2 |    3 | CHAIN   |
|       3 |             3 |       3 |    4 | NOTHING |
+---------+---------------+---------+------+---------+

Analysis1: InputCreate
Analysis2: Blast
Analysis3: SetupEst2Genome
Analysis4: Est2Genome

So here we have 3 rule groups. Each job will have its own rule group.
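
On the xml side, the rules section for this pipeline might then look something like the following (again illustrative rather than final syntax, mirroring the table above):

    <rule_group id="1">
      <rule current="1" next="2" action="NOTHING"/>
    </rule_group>
    <rule_group id="2">
      <rule current="2" next="3" action="CHAIN"/>
    </rule_group>
    <rule_group id="3">
      <rule current="3" next="4" action="NOTHING"/>
    </rule_group>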

For a single est input, it will create 3 jobs during the course of the pipeline execution.
Job 1: Input Create (fetch all ests and create blast jobs)
Job 2: Blast (blast est against database)
       Output is chained to Analysis 3 (setup est2genome) using an IOHandler of type chain with a blast filter attached
Job 3: Run Analysis 4 (est2genome) on the jobs created by Analysis 3

Chaining only occurs between Analysis 2 and Analysis 3.

If Job 2 fails, the blast and setup_est2genome analyses will have to be rerun.

You could imagine having multiple analyses chained within a rule_group.

I have working code for this. The next thing I'm still thinking about is a stronger form of datatype definition between the runnables, which is currently not very strongly enforced. It will probably be based on Martin's (or Pise's or EMBOSS's) analysis data definition interface. We can have this information defined at the runnable layer, at the bioperl-run wrapper layer, or both.
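
As a very rough sketch of what I mean (the names here are made up for illustration), each runnable could declare its inputs and outputs along these lines:

    <datatypes runnable="Blast">
      <input  name="query" type="Bio::PrimarySeqI"/>
      <output name="hits"  type="Bio::Search::Hit::HitI" multiple="1"/>
    </datatypes>
    <datatypes runnable="SetupEst2Genome">
      <input  name="hits" type="Bio::Search::Hit::HitI" multiple="1"/>
      <output name="jobs" type="Bio::Pipeline::Job"     multiple="1"/>
    </datatypes>

With something like this in place, the pipeline could check at setup time that the output datatype of one analysis in a chain matches the input datatype of the next, rather than finding out at runtime.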

Once this is done, we can have a hierarchical organization of the pipelines:

- chaining analyses within rule groups
- chaining rule groups (add a rule_group relationship table), defined within 1 xml
- chaining pipelines (add a meta_pipeline table), which means re-using different xmls as long as the inputs and outputs of the first and last analyses of the pipelines match.


I would like some help with regard to this application definition interface, if people are interested or have comments...

sorry for the long mail.. if you've read this far.

shawn


