[Bioperl-l] Re: Pipeline Input/Output refactoring plan

Michele Clamp michele@sanger.ac.uk
Wed, 13 Mar 2002 11:46:37 +0000 (GMT)


Hi Elia and Jerm,

This is a preliminary mail - you got us thinking over here, and there will
be a followup mail saying what we would like to do with the
RunnableDB/Runnable system.

On Tue, 12 Mar 2002, Elia Stupka wrote:

>Ok, Jerm and I thought instead of clogging mail boxes we would do the
>long-honoured thing of sitting down and writing some specs, for your
>perusal for 24 hours, then we start coding :)
>
>The idea is to start from one specific problem, input fetching and output
>storing generalisation. We want to do this by adding a couple of tables
>and adaptors.
>
>***************************
>
>input_id would become the internal id of an input table consisting of:
>
>Table input:
>
>input_id #internal id, usual auto-increment int
>IO_adaptor_id #foreign key to the input_adaptor table
>input_name #this is a varchar that specifies accession
>            number/identifier, etc., or it can be set to 'all' to fetch_all
>            from the adaptor

We really like the IOAdaptor thing.  I'm not convinced about storing it in
the database like that though.  We were thinking more of having it in the
analysisprocess table (one input_adaptor column and one output_adaptor
column).  Presumably you want to run the same analysis on differently
shaped inputs which may come from different databases, is that right?
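
To make the two layouts concrete, here is a minimal sketch using Python's
sqlite3 module with an in-memory database. Only the names input,
IO_adaptor_id, input_name, analysisprocess, input_adaptor and output_adaptor
come from this thread; the column types, the analysis_id and name columns,
and the sample row values (including the "genscan" analysis name) are
assumptions for illustration, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Layout from the quoted proposal: every input row points at an adaptor row.
conn.execute("""
    CREATE TABLE input (
        input_id      INTEGER PRIMARY KEY AUTOINCREMENT,
        IO_adaptor_id INTEGER NOT NULL,   -- foreign key to IO_adaptor
        input_name    TEXT NOT NULL       -- identifier, or 'all'
    )
""")

# Alternative layout: the adaptor choice lives on the analysis itself.
conn.execute("""
    CREATE TABLE analysisprocess (
        analysis_id    INTEGER PRIMARY KEY AUTOINCREMENT,  -- assumed column
        name           TEXT NOT NULL,                      -- assumed column
        input_adaptor  TEXT NOT NULL,
        output_adaptor TEXT NOT NULL
    )
""")

# Sample rows (values are illustrative only).
conn.execute(
    "INSERT INTO input (IO_adaptor_id, input_name) VALUES (?, ?)",
    (1, "all"),
)
conn.execute(
    "INSERT INTO analysisprocess (name, input_adaptor, output_adaptor) "
    "VALUES (?, ?, ?)",
    ("genscan", "RawContigAdaptor", "PredictionAdaptor"),
)

input_row = conn.execute(
    "SELECT input_id, IO_adaptor_id, input_name FROM input"
).fetchone()
analysis_row = conn.execute(
    "SELECT name, input_adaptor, output_adaptor FROM analysisprocess"
).fetchone()
```

In the first layout the adaptor can vary per input; in the second it is fixed
per analysis, which means one less table to populate.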

>
>Table IO_adaptor:
>
>IO_adaptor_id #usual internal id
>db_locator #string with host, user, etc. to connect to db
>dbadaptor_module #DBAdaptor module to be used,
>                  e.g. Bio::EnsEMBL::Compara::DBAdaptor
>biodb_adaptor_module #this is used to specify biodatabase 
>                      adaptor to be used, such as 
>                      Bio::EnsEMBL::Compara::GenomeDB
>biodb_name #specifies name of biodb, e.g. swissprot
>IO_adaptor #specifies input/output adaptor, e.g. RawContigAdaptor, or
>           ProteinAdaptor, FamilyAdaptor or registry supported sequence
>           fetcher 
>IO_adaptor_method #specifies the method to use on the IO Adaptor class,
>                   e.g. fetch_by_dbID or get_Seq_by_id
>

Now this has us foxed, as we (well, me especially) don't really understand
the biodb stuff.  Can you explain the reasoning behind this?
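
For reference, the table-driven lookup the quoted proposal describes (adaptor
class and method both named in an IO_adaptor row) might look roughly like the
sketch below. It is written in Python rather than Perl purely for
illustration. The RawContigAdaptor class here is a hypothetical stand-in, not
the real Ensembl adaptor, and the ADAPTORS registry, fetch_input function and
sample identifiers are all assumptions.

```python
class RawContigAdaptor:
    """Hypothetical stand-in for an adaptor class; not the real Ensembl API."""

    def __init__(self, contigs):
        self._contigs = contigs  # dict of identifier -> sequence

    def fetch_by_dbID(self, identifier):
        return self._contigs[identifier]

    def fetch_all(self):
        return list(self._contigs.values())


# Assumed registry mapping the IO_adaptor column value to a class.
ADAPTORS = {"RawContigAdaptor": RawContigAdaptor}


def fetch_input(io_adaptor_row, input_name, store):
    """Resolve adaptor and method from a row, then fetch the input."""
    adaptor = ADAPTORS[io_adaptor_row["IO_adaptor"]](store)
    if input_name == "all":
        # 'all' means run on the whole database via fetch_all.
        return adaptor.fetch_all()
    # Otherwise call whichever method the row names, by string lookup.
    method = getattr(adaptor, io_adaptor_row["IO_adaptor_method"])
    return method(input_name)


# Illustrative data: made-up identifiers and sequences.
store = {"contig1": "ACGT", "contig2": "GGCC"}
row = {"IO_adaptor": "RawContigAdaptor", "IO_adaptor_method": "fetch_by_dbID"}

single = fetch_input(row, "contig2", store)
everything = fetch_input(row, "all", store)
```

Nothing about how to fetch is hard coded in the caller; the cost is that a
wrong class or method name in the table only fails at run time.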

>
>The idea is not to leave anything hard coded in the runnable about how it
>should fetch its input and write its output.

Agreed - you mean the RunnableDB, yes?

>
>The RunnableDB will use the specified adaptors and method calls to get the
>input, and the same for the output. If "all" is specified rather than an
>identifier, the runnable will take all sequences and run on whole-db level
>(e.g. family stuff)

I was thinking more of one adaptor per input type, i.e. there is no choice
of methods.  The number of input types we have is very small.
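
A minimal sketch of that alternative, again in Python for illustration only:
each input type gets exactly one adaptor with a single fixed fetch method, so
no method name needs to be stored anywhere. The class names, the "contig" and
"protein" type keys, and the sample data are assumptions, not existing
pipeline code.

```python
from abc import ABC, abstractmethod


class InputAdaptor(ABC):
    """Common interface: one fetch method, no choice of methods."""

    def __init__(self, source):
        self._source = source  # dict of identifier -> object

    @abstractmethod
    def fetch(self, input_id):
        """Return the input object for this identifier."""


class ContigInput(InputAdaptor):
    def fetch(self, input_id):
        return ("contig", self._source[input_id])


class ProteinInput(InputAdaptor):
    def fetch(self, input_id):
        return ("protein", self._source[input_id])


# The set of input types is small, so a fixed mapping is enough.
INPUT_ADAPTORS = {"contig": ContigInput, "protein": ProteinInput}


def fetch_for_type(input_type, input_id, source):
    """Pick the single adaptor for this input type and fetch from it."""
    return INPUT_ADAPTORS[input_type](source).fetch(input_id)


# Illustrative data: made-up identifiers and sequences.
contig_result = fetch_for_type("contig", "c1", {"c1": "ACGT"})
protein_result = fetch_for_type("protein", "p1", {"p1": "MKV"})
```
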

>On the input side things should plug and play reasonably well without
>modifying too much code, since so much information is given in the adaptor
>table.

I'm worried about having to set up even more stuff in a database.  People
have enough trouble loading up an analysisprocess table as it is.  I would
like people to take a read-only ensembl/biosql database, build a RunnableDB
and point it at that database.  Actually, thinking about it, we're ok here -
apart from the IO_Adaptor table which I don't understand.

>
>On the output side, internal methods (or factory classes) would need to be
>written in order to be able to support multiple outputs, but at the moment
>it's secondary in priority (basically mainly interesting for supporting
>both bioperl-db and ensembl, e.g. write genes as seqfeatures in
>biosql or as gene objects in ensembl)

We have GeneAdaptors and FeatureAdaptors and PredictionAdaptors already
and these can be reused.

>
>We've gone through a few complex pipeline cases, and we think it could
>work nicely and cleanly.
>
>In order to start coding on this I wonder whether you want us to work on
>main trunk or whether you would like to branch before we do that, since it
>will break every existing runnabledb right now.

There are strong noises for branching from some corners here :-)  

>Theoretically it should be a breeze to make this even support the Registry
>in bioperl or other specific registry-supported methods for fetching
>sequences such as a web biofetch, etc., thus allowing us to run the pipeline
>without local sequences, and store only results.

This sounds very nice indeed.

-- 
And so as the stripey-winged owl's genome of Fate 
is decoded by the great sequencer of Time,
and as the big grep of Eternity uses all the cpu of Destiny
I come to the end of the mail.