[Open-bio-l] Schema for genes & features & mappings to assemblies

Tue, 23 Apr 2002 10:09:09 +0100 (BST)

On Tue, 23 Apr 2002, Elia Stupka wrote:

> > Ensembl's sweet spot is assemblies+automatic pipeline (ala Ensembl). Most
> > people get put off by how much "stuff" there is inside Ensembl but in fact
> 
> I just had to jump in here. We over here are keen on, and are working on,
> getting more into bioperl, biosql and bioperl-pipeline. Ideally I think we
> should aim for the same sweet automatic feeling that Ensembl gives for
> genome annotation to become one of the many sweet features of
> a bioperl-pipeline. The notable difference is that it would be just one
> of many sweet spots: it could also be used for much smaller
> jobs (a simple blast pipeline for lab sequences, for example) with the same
> ease, and would know how to interact with multiple dbs, etc.

I think what is happening here is very natural (Ensembl expertise
trickling down in some sense without us entangling people in Ensembl's
schedule - which is often pretty forced). 

I *do* think people will be surprised at how clean the next generation
Ensembl schema is. We are working on it, and on its adaptors, in a very
steady manner. It really unifies lots of ideas that have been floating around
for a while in a clean-and-explicit schema, with a clean-and-explicit code
layer on top. However, BioSQL is also very clean.

> 
> > The big benefits are (a) schema and data which can be downloaded for
> > human, mouse, zebra, fugu, (and soon... anopheles) and which are guaranteed
> > to work (b) a very functional web site which is portable (c) the ability to
> > run automatic systems which scale up to a "please completely annotate this
> > genome in 2 weeks" scale
> 
> I admit I cannot see any of us or bioperl moving soon to some fancy
> website building and it'd be useless. Probably more on the lines of
> genquire, or a bioperl-gui... but apart from that there would be no issue
> once things are setup to port bioperl-pipeline over to ensembl with a
> couple of parsers or the other way around.
> 

Flexibility and reuse are the watchwords here. David - I think you are best
off backing BioSQL - it is the schema which represents the most complete and
shared view of dna+features.

We do need to discuss assemblies. I vote for "flat" one-level assemblies
(a set of contigs forms a chromosome), a la Ensembl. I believe the
assumed hierarchical nature of assemblies (a) is mainly a consequence of
how an assembly is put together, and the intermediates in the hierarchy
between contigs of DNA and chromosomes are nearly never stable, and
(b) means you always have to use software to do conversions and can never
do them easily with SQL (PL/SQL probably can...).
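
To make point (b) concrete, here is a rough sketch of what a flat assembly
allows in plain SQL. Everything in it is an assumption for illustration: the
"assembly" table, its columns and the chr_to_contig helper are made up and are
not the actual Ensembl or BioSQL schema, and each contig is assumed to be used
in full (contig coordinates start at 1).

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical flat assembly table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE assembly (
               chromosome TEXT,
               contig_id  TEXT,
               chr_start  INTEGER,  -- 1-based, inclusive
               chr_end    INTEGER,  -- 1-based, inclusive
               contig_ori INTEGER   -- 1 = forward, -1 = reverse
           )"""
    )
    conn.executemany(
        "INSERT INTO assembly VALUES (?, ?, ?, ?, ?)",
        [
            ("chr1", "ctgA", 1, 1000, 1),
            ("chr1", "ctgB", 1001, 1500, -1),
        ],
    )
    return conn


def chr_to_contig(conn, chromosome, pos):
    """Map a chromosome position to (contig_id, contig_pos) in one query.

    Returns None when no contig covers the position.
    """
    return conn.execute(
        """SELECT contig_id,
                  CASE WHEN contig_ori = 1
                       THEN ? - chr_start + 1
                       ELSE chr_end - ? + 1
                  END
           FROM assembly
           WHERE chromosome = ? AND ? BETWEEN chr_start AND chr_end""",
        (pos, pos, chromosome, pos),
    ).fetchone()


if __name__ == "__main__":
    db = build_demo_db()
    print(chr_to_contig(db, "chr1", 250))
    print(chr_to_contig(db, "chr1", 1001))
```

Because there is only one level, the conversion is a single SELECT with no
recursion. With intermediate levels between contig and chromosome you would
need one such step per level, which is where the software layer creeps in.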

Other options are :

  (a) multi-level (Thomas')

  (b) zero level (Lincoln likes this). The schema stores contigs as
"features" on DNA Sequences which are chromosome length.

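For comparison, a rough sketch of the zero-level option (b), where a contig is
just another feature on a chromosome-length sequence. The "seqfeature" table,
its columns and the features_in_region helper are hypothetical names invented
for this example, not taken from any real schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database where contigs are ordinary features."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE seqfeature (
               seq_name   TEXT,     -- chromosome-length sequence
               type       TEXT,     -- 'contig', 'gene', ...
               name       TEXT,
               feat_start INTEGER,  -- 1-based, inclusive
               feat_end   INTEGER   -- 1-based, inclusive
           )"""
    )
    conn.executemany(
        "INSERT INTO seqfeature VALUES (?, ?, ?, ?, ?)",
        [
            ("chr1", "contig", "ctgA", 1, 1000),
            ("chr1", "contig", "ctgB", 1001, 1500),
            ("chr1", "gene", "geneX", 950, 1050),
        ],
    )
    return conn


def features_in_region(conn, seq_name, start, end, feature_type=None):
    """Return (type, name, start, end) for features overlapping a region.

    Contigs are found with the same overlap query as any other feature.
    """
    sql = (
        "SELECT type, name, feat_start, feat_end FROM seqfeature "
        "WHERE seq_name = ? AND feat_start <= ? AND feat_end >= ?"
    )
    params = [seq_name, end, start]
    if feature_type is not None:
        sql += " AND type = ?"
        params.append(feature_type)
    sql += " ORDER BY feat_start, name"
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(features_in_region(db, "chr1", 900, 1100))
    print(features_in_region(db, "chr1", 900, 1100, "contig"))
```

Here no coordinate conversion is needed at all, since everything already lives
in chromosome coordinates; the price is that the contig's own coordinate
system is not represented in this sketch.
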
> Elia

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------