[Bioperl-l] Re: Bio::EnsemblLite::UpdateableDB

Jason Stajich jason@chg.mc.duke.edu
Sat, 15 Jul 2000 14:54:21 -0400 (EDT)


On Sun, 16 Jul 2000, Ewan Birney wrote:

> On Fri, 14 Jul 2000, Jason Stajich wrote:
> 
> > Preliminary version of the module Bio::EnsemblLite::UpdateableDB is
> > checked in to bioperl (bioperl-db repository).
> > 
> > This just does basic stuff by talking to a mysql db server: adding a Bio::Seq 
> > to the db, removing a seq, and fetching a seq from the db and making it into a
> > Bio::Seq object.  This had to be separate code with separate tables 
> > (from ensembl) because I am not expecting sequences to be part of contigs
> > at this time.
> 
> Hmmmm. I guess this was always going to go this way, but I feel that
> EnsemblLite and Ensembl are forking too early in the process. I guess we
> could claim that
> 
> 	Ensembl - database for fragmentary genomes
> 
> 	EnsemblLite - database for sequences (stand-alone)

> With code reuse occurring principally because they are both based on bioperl
> and design reuse because they have to handle roughly speaking the same
> things. Are we missing some sort of alignment between Ensembl and
> EnsemblLite?

I guess this is the most accurate assessment.  Obviously I'd like to see
more overlap, because that means less duplicate coding, but it also means
we have to break down the goals better.  I didn't really want to fork from
Ensembl, but it seems to be addressing the data from a different
perspective.  Maybe we should talk about the goals of EnsemblLite again.

I am most interested in better integrating 'public domain' genome data
with laboratory-produced experimental data (i.e. 'OUR' sequences for
BAC123X12).  In the best of all possible worlds I would like to be able
to do the following (Ewan and I have had this discussion before, but I
would like to throw it out there and see what the opinions are):

- build a virtual contig (from 100 kb to a couple of Mb) between markers
  D2SXX and D2SXXX that consists of both public-domain data and data
  produced experimentally in-house.
- have annotations and features included and updated automagically from
  public sources.
- analyze this X Mb of sequence, finding and identifying known and
  predicted genes (this is Ensembl-like stuff), match them up with
  observed and reported data, find homologies; essentially try to
  understand what this sequence does, because we think it might be
  involved in disease Y.

This is really hard to do right now, but it is also really what I think
researchers want to do.  Computers should make this easy: instead of
clicking away at multiple genome web sites, we should be able to pull
together the known information and sprinkle in our own data.  Maybe this
is what commercial services provide and I am just not in the know... =)
 
> 
> BTW - Jason - have you handled the "how to store a SeqFeature::Generic"
> type problem in the SQL?

Check out the schema in sql/ensembl-lite-mysql-addon.sql  (I'll have a
pretty graphic on the ensembl wiki by next week).

dna_description - describes the sequence and its accession number (I
                  didn't build in multiple accession numbers right now)
generic_feature - a generic feature for a sequence
                  (name, strand, source, start & end positions)
feature_detail  - tag/value pairs that exist for a feature
feature_detail_association - associates details with generic features

I haven't built in multi-level feature support just yet, only generic
sequence features.  A rough sketch of what the table definitions might
look like is below.
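
(This is only a hand-waving sketch via DBI; the column names are
illustrative guesses, not necessarily what actually sits in
sql/ensembl-lite-mysql-addon.sql.)

  #!/usr/bin/perl -w
  # sketch only: illustrative column names, not the real addon schema
  use strict;
  use DBI;

  my $dbh = DBI->connect('dbi:mysql:ensembl_lite', 'user', 'pass',
                         { RaiseError => 1 });

  # describe a stored sequence and its (single) accession number
  $dbh->do(q{ CREATE TABLE dna_description (
                  dna_id      INT NOT NULL PRIMARY KEY,
                  accession   VARCHAR(40),
                  description TEXT ) });

  # a generic feature attached to a sequence
  $dbh->do(q{ CREATE TABLE generic_feature (
                  feature_id INT NOT NULL PRIMARY KEY,
                  dna_id     INT NOT NULL,
                  name       VARCHAR(40),
                  source     VARCHAR(40),
                  strand     TINYINT,
                  seq_start  INT,
                  seq_end    INT ) });

  # tag/value pairs describing a feature
  $dbh->do(q{ CREATE TABLE feature_detail (
                  detail_id INT NOT NULL PRIMARY KEY,
                  tag       VARCHAR(40),
                  value     TEXT ) });

  # many-to-many link between features and their details
  $dbh->do(q{ CREATE TABLE feature_detail_association (
                  feature_id INT NOT NULL,
                  detail_id  INT NOT NULL ) });

  $dbh->disconnect;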

This structure may need to be thought out more carefully; it depends on
where people think this will go and whether or not we want to really fork
away from Ensembl.
I can imagine an OO strategy where there is an abstract feature object
and specific implementations (repeat, exon, gene, generic, EST, BLAST
similarity, misc, etc.) which all derive from it.  This could be
translated into SQL with some work, and a lot of it is already done in
Ensembl, but using contigs rather than seqs.  To do it just for a single
sequence would essentially be storing an EMBL/GenBank file in SQL, and we
have to ask why that would be worthwhile.
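
To make that concrete, here is a minimal sketch of what such a hierarchy
could look like in perl (the class and method names are made up for
illustration; they don't exist in bioperl or Ensembl):

  package EnsemblLite::Feature;              # hypothetical abstract base
  use strict;

  sub new {
      my ($class, %args) = @_;
      my $self = { start  => $args{-start},
                   end    => $args{-end},
                   strand => $args{-strand},
                   source => $args{-source} };
      return bless $self, $class;
  }

  # each concrete feature must say how it maps onto the SQL tables
  sub to_sql_rows { die "to_sql_rows() not implemented in ", ref(shift); }

  package EnsemblLite::Feature::Repeat;      # one concrete implementation
  use vars qw(@ISA);
  @ISA = qw(EnsemblLite::Feature);

  sub to_sql_rows {
      my ($self) = @_;
      # stored as a generic_feature row plus one tag/value detail
      return ( { table => 'generic_feature', name => 'repeat',
                 start => $self->{start},    end  => $self->{end} },
               { table => 'feature_detail',  tag  => 'repeat_family',
                 value => $self->{family} || 'unknown' } );
  }

  1;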

I guess what needs to happen is that the goals of EnsemblLite have to be
outlined, and I'd like some input from others on this.  What do you need
that you can't do with bioperl scripts?  Are you dealing with collections
of information in different places that you want to integrate into one
place?  Researchers want simple one-stop shopping for their information.
If they can view data in one place that is actually a representation of
multiple information sources, it is all the easier to work out what it
means.


> > 
> > So there is a set of add-on tables to provide a seq description 
> > and an association of generic features to seqs.
> > 
> > sql/ensembl-lite-mysql-addon.sql - add ons
> > sql/ensembl-lite.sql - ensembl-lite code (only the dna table is really
> >                        used from this at present)
> > 
> > I have been trying to put the EnsemblLite spec that I will
> > propose on ensembl.org's wiki, but have been getting web errors.  (It
> > just KNOWS everyone is getting ready for summer holidays).  Will try again
> > over the weekend.  
> > 
> > short TODO list (lest you think this is really finished code being
> >                  submitted)
> > 
> >  - implement _update function for updating seqs
> 
> What does this function do? 

The Bio::DB::UpdateableSeqI interface has four functions that must be
implemented (in addition to the SeqI interface).  The write_seqs function
takes three arguments: references to arrays of sequences to be updated,
deleted, and added to the sequence database.  It then calls _update_seq,
_remove_seq, or _add_seq for each element in the corresponding array.

I have written the fetch-a-seq, add-a-seq, and remove-a-seq pieces, but
not update-a-seq (i.e. here is my seq, it already exists in the database
with an id number; update the information about it in the db so it
matches the sequence object I am passing in).
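
In outline the dispatch looks something like this (just a sketch; see
Bio::DB::UpdateableSeqI in bioperl-db for the real interface):

  sub write_seqs {
      my ($self, $update_ref, $delete_ref, $add_ref) = @_;

      foreach my $seq ( @{ $update_ref || [] } ) {
          $self->_update_seq($seq);       # not implemented yet
      }
      foreach my $seq ( @{ $delete_ref || [] } ) {
          $self->_remove_seq($seq);
      }
      foreach my $seq ( @{ $add_ref || [] } ) {
          $self->_add_seq($seq);
      }
  }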
 
> >  - implement get_PrimarySeq_stream
> >  - table schema discussion with interested parties, it is not really
> >    CORRECT to refer to a table of sequences as 'dna' if some are protein
> >    seqs...  Include a graphical table schema on doc.
> >  - Start looking at the analysis pipeline runnable/runnabledb system from
> >    Ensembl and how we can hook into it.
> > 
> 
> I suspect this means you will be able to reuse our "Runnables", but you
> will need different "RunnableDBs".
>
> 
> Runnable reuse will be a big win. We might want to specialise some
> runnables in something like FeatureProducingRunnable...
> 

Yes, yes, code reuse is good.  I just have to read more to figure out how
your "runnables" work.  Just haven't had any time lately.

 
Jason Stajich
jason@chg.mc.duke.edu
http://galton.mc.duke.edu/~jason/
(919)684-1806 (office) 
(919)684-2275 (fax) 
Center for Human Genetics - Duke University Medical Center
http://wwwchg.mc.duke.edu/