[BioSQL-l] Consistency between bio* projects

Wed Jan 19 16:35:18 EST 2005

Most of the specifics of which fields the annotations are stored in 
within the BioPerl Annotation model are documented in the 
Bio::SeqIO::genbank and Bio::SeqIO::embl POD. Some of the info is in the 
Bio::Seq::RichSeq (and its interface) documentation.

Brian has written down a lot of the nuances in his HOWTOs as well; see 
http://bioperl.org/HOWTOs/.

The object model differs between projects. BioPerl differentiates 
between Features (things with a defined location in the sequence) and 
Annotations (things attached to an entire record).
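
To make the distinction concrete, here is a minimal sketch in Python 
(not BioPerl code). The class names, field names, and the accession 
"X00001" are all hypothetical and only illustrate the idea: a Feature 
carries its own location, while an Annotation hangs off the whole record.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    """Something with a defined location on the sequence (hypothetical type)."""
    primary_tag: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    strand: int = 1


@dataclass
class Record:
    """A sequence record holding located features and record-wide annotations."""
    accession: str
    sequence: str
    features: List[Feature] = field(default_factory=list)
    # Annotations have no location; they describe the entire record.
    annotations: Dict[str, List[str]] = field(default_factory=dict)

    def add_annotation(self, key: str, value: str) -> None:
        """Attach a value to the whole record under the given key."""
        self.annotations.setdefault(key, []).append(value)

    def feature_sequence(self, feature: Feature) -> str:
        """Return the subsequence covered by a feature (forward strand only)."""
        return self.sequence[feature.start - 1:feature.end]


# Hypothetical example data, not taken from any real database entry.
record = Record(accession="X00001", sequence="ATGGCCATTGTAATGGGCCGC")
record.add_annotation("keyword", "example")
record.features.append(Feature("CDS", 1, 9))
```

Only the Feature can answer "where on the sequence?"; the annotation 
can only answer "what is true of this record?".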

Some proposals for a next-generation (NG) object model will probably 
not distinguish these types as much, and will remove the dependency on 
things being sequence-centric. (Our parsers were written first to be 
able to round-trip the formats where most of the data was stored: the 
sequence repositories, so GenBank, EMBL, and SwissProt.) An NG model 
will probably be more feature-centric, likely in the DAS flavor of 
things.

I'm not sure what will really happen; it requires people to roll up 
their sleeves and get some prototypes out there for us to really play 
with.

-jason
On Jan 17, 2005, at 3:00 PM, Hilmar Lapp wrote:

>
> On Jan 16, 2005, at 6:41 PM, mark.schreiber at group.novartis.com wrote:
>
>> It would seem that what is needed is a mapping of each field from a
>> file format to a field in a BioSQL table. I think initially this would
>> only need to be done for EMBL, SwissProt and GenBank.
>
> Note that this has at least been started for bioperl to the extent 
> that the destination in the bioperl object model is documented. 
> (Jason, Brian, anything you wanted to comment?)
>
> Once you know where it is in the bioperl object model, it is 
> relatively straightforward to predict where it ends up in the schema; 
> still, it's not written down in plain text anywhere, I think.
>
>> In many ways I prefer the idea of developing a SQL API which would be
>> more robust and would serve to define what is expected of each
>> procedure call. However I think it should be achievable for the
>> schema. In fact there is no reason why both cannot co-exist. For any
>> API there should be a possible implementation so naturally the schema
>> could be used to generate an API. People could then happily make other
>> schemata that fit the API which may be optimised for their needs.
>
> Right - I guess so far my idea was that the object model is the API, 
> and the OR mapper implements the bridge to your chosen schema.
>
> Clearly, the problem with this level of API is that it's not 
> cross-bio* by definition since we don't use the exact same object 
> model.
>
>>
>> Does anyone have a recent UML or similar diagram for the schema?
>
> There is an ERD in the doc directory. Other than that, there is no UML 
> model.
>
>> I can then use this to suggest mappings from GenBank fields to the
>> API. I think it may be easier in many cases to follow bioperl's lead.
>> BioJava seems to follow the 'store everything that isn't a feature as
>> a bioentry_qualifier' approach so I just need to add some special
>> cases.
>>
>> Hilmar, would you be prepared to do any work on the BioPerl side for
>> synchronization of the two?
>
> Certainly, if it is really needed. Generally speaking, I would not 
> want to introduce object model and genbank-to-object model mapping 
> changes to bioperl if they openly break backward compatibility unless 
> everybody agrees to go forward. It's also not necessarily needed; the 
> OR mapping code (bioperl-db) may be the better place, depending on 
> scope and what's involved.
>
> 	-hilmar
>
>
>>
>> - Mark
>>
>> Hilmar Lapp <hlapp at gnf.org>
>> 01/15/2005 01:58 AM
>>
>>
>>         To:     Mark Schreiber/GP/Novartis at PH
>>         cc:     biosql-l at open-bio.org
>>         Subject:        Re: [BioSQL-l] Consistency between bio* 
>> projects
>>
>>
>>
>> On Friday, January 14, 2005, at 01:10  AM,
>> mark.schreiber at group.novartis.com wrote:
>>>  Unfortunately, Bioperl stores identifiers as
>>> follows:
>>>
>>> Bioentry.bioentry_id is the unique internal reference number
>>> Bioentry.name is the GI number
>>
>> The GI number goes to Bioentry.Identifier, which was designated for
>> storing the identifier within an external database.
>>
>> Bioentry.name should hold the locus name, which for contigs and many
>> other entries etc will be identical to the accession (but not the GI
>> number!).
>>
>> If you find it in Bioentry.name then I suspect you weren't loading
>> from genbank or embl formatted input?
>>
>>> From memory the basic idea of BioSQL was to define a schema that bio*
>>> projects could both read and write from in a language independent
>>> manner. For reasons best left to the designers (mostly I think cause
>>> MySQL couldn't handle stored procedures) the level of interaction is
>>> right down at the schema level.
>>
>> Right. Also, not all database drivers in all languages support stored
>> procedure calls equally well. In e.g. PostgreSQL and Oracle you can
>> always get around this by writing a view and putting an INSTEAD OF
>> INSERT (or UPDATE) trigger on it that will then call the procedure,
>> but this is clearly not even close to an option in MySQL.
>>
>> It's maybe worth considering opening a dichotomy here between MySQL
>> and the rest, to provide people who need it with a SQL-level API that
>> both perl and java will use. People who are interested in this by
>> definition will not be interested in MySQL anyway.
>>
>>> Unfortunately this means that the way data is stored needs to be
>>> very consistent between projects if any APIs that use BioSQL are to
>>> be portable. My biojava API cannot be applied to a DB previously set
>>> up with bioperl, which was the original idea behind BioSQL in the
>>> first place.
>>>
>>> Help!!
>>
>> I think you're raising a great point. Indeed, such a contract hasn't
>> really been written. We're probably one of few who use both perl and
>> java to access a biosql database (and I'm not using biojava as the
>> object model on the java side, which is why I'm not running into this
>> problem). (Note as an aside that you could also write adaptors that
>> transform between the SymGene and the Biojava model when storing or
>> retrieving objects from/to the database.)
>>
>> It'd be great if you were willing to take the lead for getting this
>> all spelled out and laid down in a document?
>>
>>                  -hilmar
>> -- 
>> -------------------------------------------------------------
>> Hilmar Lapp                            email: lapp at gnf.org
>> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
>> -------------------------------------------------------------
>>
> -- 
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
>
>
--
Jason Stajich
Duke University
jason at cgt.duhs.duke.edu