[Open-bio-l] bio-pipeline schema in

Elia Stupka elia@fugu-sg.org
Wed, 20 Mar 2002 16:40:42 +0800 (SGT)


Hello,

I've just committed our first stab at the schema for the new bio-pipeline.
I will soon import the bioperl-pipeline CVS module, where we will work on
the Perl port of the pipeline. The aim is to please the generic bioperl
user, who might want to fetch some sequences from GenBank and run them
through their own pipeline, while gradually porting some (hopefully all)
of the ensembl pipeline to this schema.

The schema is in biosql-schema/sql/biopipelinedb-mysql.sql

The schema was more or less described the other day, save a few changes.
We have renamed the old IO table to datasource, and we have added another
table because we realised we will have cases where we want to take
multiple input ids (for example two sequences to crossmatch). So we have
added a new IO table:

**********
Table IO
**********
IO_id (internal id)
datasource_id (foreign key to the datasource table, which has all the
locator, adaptor, etc. stuff)
IO_type (input or output)
**********
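
To make the relationship between the two tables concrete, here is a rough
sketch using Python's sqlite3 module as a stand-in for MySQL. Only IO_id,
datasource_id and IO_type come from the description above; the locator and
adaptor columns, the sample rows and the helper names are assumptions for
illustration. The real definitions are in the committed .sql file.

```python
import sqlite3

# Illustrative stand-in for the MySQL schema. Only IO_id, datasource_id and
# IO_type are taken from the post; everything else is an assumption.
SCHEMA = """
CREATE TABLE datasource (
    datasource_id INTEGER PRIMARY KEY,
    locator       TEXT NOT NULL,   -- assumed column: where the data lives
    adaptor       TEXT NOT NULL    -- assumed column: module used for access
);
CREATE TABLE IO (
    IO_id         INTEGER PRIMARY KEY,
    datasource_id INTEGER NOT NULL REFERENCES datasource(datasource_id),
    IO_type       TEXT NOT NULL CHECK (IO_type IN ('input', 'output'))
);
"""


def build_demo_db():
    """Create an in-memory database with two inputs and one output."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Hypothetical datasources: two sequence sources and one result store.
    conn.executemany(
        "INSERT INTO datasource VALUES (?, ?, ?)",
        [
            (1, "hypothetical_locator_a", "HypotheticalSeqAdaptor"),
            (2, "hypothetical_locator_b", "HypotheticalSeqAdaptor"),
            (3, "hypothetical_locator_out", "HypotheticalFeatureAdaptor"),
        ],
    )
    # Two input rows (e.g. the two sequences to crossmatch) and one output row.
    conn.executemany(
        "INSERT INTO IO VALUES (?, ?, ?)",
        [(1, 1, "input"), (2, 2, "input"), (3, 3, "output")],
    )
    conn.commit()
    return conn


def io_rows(conn, io_type):
    """Return (IO_id, locator, adaptor) for every IO row of the given type."""
    cur = conn.execute(
        "SELECT IO.IO_id, datasource.locator, datasource.adaptor "
        "FROM IO JOIN datasource USING (datasource_id) "
        "WHERE IO.IO_type = ? ORDER BY IO.IO_id",
        (io_type,),
    )
    return cur.fetchall()
```

Calling io_rows(build_demo_db(), "input") would list both input sources
with their locator and adaptor, which is the multiple-input case the extra
table is meant to cover.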

Then the analysis table keys off this IO table, and has a runnable column
(instead of module), so via these two keys when the analysis object is
passed to the runnabledb it knows the input adaptors, the output adaptor
and the runnable to use.
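
As a rough illustration of that lookup (in Python rather than Perl, and
not the actual bioperl-pipeline code), the sketch below resolves an
analysis into its input adaptors, output adaptor and runnable. How several
IO rows are tied to one analysis is not spelled out here, so the io_ids
field is purely an assumption, as are all class, function and adaptor
names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class IORow:
    io_id: int
    adaptor: str   # would be looked up via datasource_id in the real schema
    io_type: str   # "input" or "output"


@dataclass(frozen=True)
class Analysis:
    analysis_id: int
    runnable: str             # the runnable column (replaces module)
    io_ids: Tuple[int, ...]   # ASSUMPTION: how IO rows attach to an analysis


def resolve_analysis(analysis: Analysis, io_table: Dict[int, IORow]) -> dict:
    """Collect what a runnabledb-like object would need for one analysis."""
    rows = [io_table[io_id] for io_id in analysis.io_ids]
    inputs = [row.adaptor for row in rows if row.io_type == "input"]
    outputs = [row.adaptor for row in rows if row.io_type == "output"]
    # The post speaks of "the output adaptor", so exactly one is expected.
    if len(outputs) != 1:
        raise ValueError("expected exactly one output IO row")
    return {
        "runnable": analysis.runnable,
        "input_adaptors": inputs,
        "output_adaptor": outputs[0],
    }
```

With two input rows and one output row, resolve_analysis would hand back
both input adaptors alongside the single output adaptor and the runnable
name.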

The LSFid column in the job table has been renamed to queue_id, since we
are planning to allow local use as well as PBS, etc.

All column and table names follow the sane new-style table_id naming
scheme.

The class column has been removed from both job and input_analysis since
that is all encapsulated in the datasource table.

The input table is now made of an internal_id, a foreign key to the
datasource table, and a name which corresponds to the identifier.

We are going to start coding the bioperl-pipeline modules, and the first
three test cases we want to get working in the next few months are:

[preliminary: get one simple runnable like repeatmasker to work in the new
schema] :)

1) Have 3 genomes in ensembl schemas and run a pipeline to tblastx all
against all.

2) Generate families between 3 genomes and store clustalw alignments for
them.

3) Generate and store conserved syntenic regions (using our protein
ensembl-compara stuff, which is already working) between organisms and run
DBA (DnaBlockAligner) on the non-coding portions of the conserved segments.

etc.etc.etc. :)

Elia

-- 
********************************
* http://www.fugu-sg.org/~elia *
* tel:    +65 874 1467         *
* mobile: +65 90307613         *
* fax:    +65 777 0402         *
********************************