[Bioperl-pipeline] things to do

Shawn Hoon shawnh at fugu-sg.org
Sat Jan 25 02:36:15 EST 2003


Hi Jerm,
	welcome back..god, it's gonna be a year soon since we were in Hinxton!
Caution: this mail is long.
Dev has been quiet over the last couple of months, either
because biopipe is underused...or the code is getting stable..somewhat.
I am beginning to see that the work on biopipe is now more than ever centered
on writing XMLs (pipeline instances), the fun stuff. I have written a simple
way of doing system tests for pipelines, one for each pipeline. These are what I consider
to be the more generic pipelines, as they are non-specific to the input/output databases,
working mainly with flat files. So the current pipelines that have tests are:

1) Blast Pipe, a simple flat-file-based blast pipeline for demo
2) Phylip Pipe, a more complex pipeline that takes protein seqs and runs
		them through a couple of phylip programs, resulting in trees
3) Cdna2Genome Pipe, a cdna to genome alignment pipeline that produces
		     SeqFeatures which may be stored in BioSQL or dumped as gff files

4) ProteinAnnotation Pipe (no tests yet but coming soon...)

5) Protein Family Clustering Pipe (no test yet..)

Take a look at them and see if they run fine for you.

Other pipelines that are very DOABLE but just not done yet:

6) pairwise alignment stuff for non-coding sequences

Of course, there is also our genome annotation pipe that is under perpetual development..
it has been left untouched for some time, but I will need to work on it in the next few weeks.

Some statistical modules would be nice too...

I think we can partition these more generic pipelines from more specific ones that people can
deposit.

I think the runnable ('running programs on sequences') part of the pipeline is quite mature, and new
pipelines can really be implemented without major design issues. The main thing that is now quite
apparent is data preparation: I think a major part of analysis involves data extraction and sanitization.

InputCreates:
I think a lot of the conceptualization involved in designing pipelines is made a whole lot easier if one
doesn't have to worry too much about how to translate input data into jobs. This is where the
InputCreates may or may not be doing such a great job. InputCreates have become our designated box for
containing 'hacky' code that sets up an analysis. I can see how these may grow unwieldy and too ad hoc
to be used by others, even if they are meant for hacky code. In any case, it works currently. We have
modules for setting up file-based analyses (basically, given a set of files: split them up, convert to a
certain format, create jobs etc.) and db analyses (get dbIDs and create jobs).
I have been toying around with a module that, given key words, fetches sequences remotely
from NCBI using Bio::DB::GenBank and runs them through the pipeline.
I think raw data is easy to handle with the Bio::DB::*, Bio::Index::* and Bio::SeqIO::* modules.
The challenge will come from computed data like features, genes, and other biological objects.
If we had more reusable ways of extracting these data, that would be great.
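
For the keyword-fetching module, this is roughly the shape of it (a sketch, not the committed code; the
query string and output file here are made up):

  use strict;
  use Bio::DB::GenBank;
  use Bio::DB::Query::GenBank;
  use Bio::SeqIO;

  # Rough sketch: turn a keyword query into a flat file that a
  # file-based InputCreate can then split into jobs.
  my $query  = Bio::DB::Query::GenBank->new(-db    => 'nucleotide',
                                            -query => 'takifugu[ORGN] AND actin');
  my $gb     = Bio::DB::GenBank->new;
  my $stream = $gb->get_Stream_by_query($query);
  my $out    = Bio::SeqIO->new(-file   => ">input_seqs.fa",
                               -format => 'fasta');
  while (my $seq = $stream->next_seq) {
      $out->write_seq($seq);
  }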

Filters:
This is majorly underdeveloped. I think a lot of the logic in filter scripts out there is wasted by not
being reusable. We should have a better framework/interface into which we can plug different filters for
different uses. It could be object-centric (features, sequences, trees etc.). Currently, filters are
attached to IOHandlers. In some sense, they may also become their own runnables, so that an entire
pipeline is just filters. Because biopipe is flexible, both are valid solutions. I think we should think
about how to develop filters. I think people want filters that allow human eyeballing and input, as well
as backup of filtered data. Hilmar, I think, has started some neat filtering code for SeqFeatures which
we could use and extend.
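
To make that concrete, here is one possible shape for a pluggable, object-centric filter. Everything
here (the interface name, the filter() contract) is invented for illustration, not existing biopipe code:

  package Bio::Pipeline::FilterI;
  # hypothetical interface: a filter takes a list of objects of one
  # kind (features, seqs, trees...) and returns the subset that passes
  use strict;

  sub filter {
      my ($self, @objs) = @_;
      die ref($self), " does not implement filter()";
  }

  package MyLengthFilter;
  # toy implementation: keep sequences above a length cutoff
  use strict;
  use vars qw(@ISA);
  @ISA = qw(Bio::Pipeline::FilterI);

  sub new {
      my ($class, %args) = @_;
      return bless { min_length => $args{-min_length} || 100 }, $class;
  }

  sub filter {
      my ($self, @seqs) = @_;
      return grep { $_->length >= $self->{min_length} } @seqs;
  }

  1;

An IOHandler or a runnable could then hold a list of such objects and chain them, whichever of the two
solutions we settle on.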

Merging Pipelines:
We have XML that allows people to share pipelines. At some point, Elia pointed out that it would be neat
if people could merge pipelines together without much fuss. More explicit datatype definitions
(like EMBOSS's acd) between runnables would be the way to go for this, I think...something worth exploring.
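
Purely as a sketch of what I mean (invented syntax, not the current biopipe schema), each analysis in
the XML could declare what it consumes and produces, so merged pipelines can be sanity-checked:

  <analysis id="1">
    <runnable>Bio::Pipeline::Runnable::Blast</runnable>
    <!-- invented tags: declared types would let us verify that the
         output of one pipeline's last analysis matches the input of
         the first analysis of the next -->
    <input_datatype  object="Bio::PrimarySeqI"/>
    <output_datatype object="Bio::Search::HSP::GenericHSP"/>
  </analysis>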

Pipeline Optimization:
We started developing biopipe with the goal of flexibility, but I know there is plenty of room for
optimization in biopipe. At this point, we are still encountering leaky db connections resulting in
'too many connections' errors. We should also look at benchmarking the pipelines, which Frans has been
doing of late with the number of jobs he is sending to the poor 60+ node farm ;).
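
On the leaky connections, one stopgap we could wrap around the adaptors is a retry-with-backoff on
connect (sketch only; the DSN and retry policy are made up):

  use strict;
  use DBI;

  # retry a flaky connect a few times before giving up, instead of
  # letting one transient 'too many connections' error kill the job
  sub connect_with_retry {
      my ($dsn, $user, $pass, $tries) = @_;
      $tries ||= 5;
      for my $i (1 .. $tries) {
          my $dbh = DBI->connect($dsn, $user, $pass,
                                 { RaiseError => 0, PrintError => 0 });
          return $dbh if $dbh;
          warn "connect attempt $i failed: $DBI::errstr\n";
          sleep 2 ** $i;    # back off 2,4,8... seconds before retrying
      }
      die "could not connect to $dsn after $tries attempts";
  }

  my $dbh = connect_with_retry("dbi:mysql:biopipe;host=localhost", "root", "");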

Job Tracking Interface:
We should have better interfaces for job tracking and management in Biopipe. This would be something
really good to develop. Right now, we use SQL to count jobs, delete jobs etc. A better API, a separate
application, or even a shell to allow one to query jobs and stop/start/delete/pause jobs would be cool.
We should also flesh out better wrappers around the underlying BatchSubmission systems to utilize their
more sophisticated functions. For example, one shortfall in Biopipe is that if a job fails due to too
many db connection errors, it may not be able to update the job table to say that it has failed. As
such, its state gets stuck in SUBMITTED and it never gets resubmitted. We should have a smarter way of
querying LSF/PBS to determine that a particular job is taking too long and, by querying its state
through bjobs -jobid, figure out it's no longer running, set it to failed and rerun...
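
A rough sketch of the LSF half of that; the 'STAT is the third column' parsing matches stock bjobs
output, but treat the details as assumptions to verify:

  use strict;

  # decide whether a job biopipe thinks is SUBMITTED is in fact dead
  sub lsf_job_is_dead {
      my ($job_id) = @_;
      my @out = `bjobs $job_id 2>&1`;
      return 1 if !@out || $out[0] =~ /not found/i;   # fell out of LSF
      foreach my $line (@out[1 .. $#out]) {           # skip header line
          my (undef, undef, $stat) = split /\s+/, $line;
          next unless $stat;
          return 1 if $stat eq 'EXIT' || $stat eq 'ZOMBI';
          return 0 if $stat =~ /^(RUN|PEND|PSUSP|USUSP|SSUSP|DONE)$/;
      }
      return 0;
  }

  # hypothetical usage inside a sweeper script:
  # $job->set_status('FAILED') if lsf_job_is_dead($job->queue_id);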

Result Viewing:
Not really part of biopipe, since I/O sources are abstracted out of biopipe. But practically, I think
the particular pipelines and data that one generates will drive the development of the data
visualization software. We have been using GBrowse closely with the protein annotation pipeline for
really quick viewing of the features through BioSQL (and I mean quick: one config file and we are up!).
I think the plans that others have for viewing trees and alignments would be great too. Integration
with BioSQL to store richer objects is important. Right now the failsafe solution is to dump things out
to files, since bioperl has the richest set of modules for file I/O. Not something great for
high-throughput stuff, even though Bio::DB::Fasta scales quite well.
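
For the flat-file fallback, Bio::DB::Fasta really is painless; a minimal example of indexed retrieval
(the file and id here are made up):

  use strict;
  use Bio::DB::Fasta;

  # index a (possibly huge) fasta dump once, then pull sequences and
  # subsequences by id without reparsing the file
  my $db  = Bio::DB::Fasta->new('pipeline_output.fa');
  my $seq = $db->get_Seq_by_id('contig_1');          # a Bio::PrimarySeqI
  print $db->seq('contig_1', 1, 500), "\n" if $seq;  # substring fetch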
 
That's all I can think of right now ;) sorry for the long mail. This is also for the entire list,
for people who have ideas and want to pick things up and code, or see what biopipe is up to....

Very interested to hear what you think.



shawn

On Fri, 24 Jan 2003, Jerm wrote:

> Hey guys.
> 
> It's been a while since I've done anything on the biopipe. I thought I would
> get a picture of the TO DO LIST, so that I'll be able to get back into the
> development circle.
> 
> Can someone please give me a feel on what you guys are tackling at the
> moment?
> 
> thanks.
> Jerm
> 
> _______________________________________________
> bioperl-pipeline mailing list
> bioperl-pipeline at bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-pipeline
> 

-- 
********************************
* Shawn Hoon
* http://www.fugu-sg.org/~shawnh
********************************


