[BioSQL-l] Consistency between bio* projects
Jason Stajich
jason at cgt.duhs.duke.edu
Wed Jan 19 16:35:18 EST 2005
Most of the specifics of which fields the annotations are stored in
within the BioPerl Annotation model are documented in the
Bio::SeqIO::genbank and Bio::SeqIO::embl POD. Some of the info is in the
Bio::Seq::RichSeq (and its interface) documentation.
Brian has written down a lot of the nuances in his HOWTOs as well; see
http://bioperl.org/HOWTOs/.
The object model is different for different projects - Bioperl
differentiates between Features (things with a defined location in
sequence) and Annotations (things attached to an entire record).
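To make the distinction concrete, here is a minimal sketch in Python (not
Bioperl code; the class names, field names, and the accession "X00001" are
all made up for illustration). It assumes 1-based, inclusive coordinates.
Features carry a location and can be queried by position, while annotations
hang off the record as a whole:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    """Hypothetical feature: something with a defined location on the sequence."""
    primary_tag: str
    start: int  # 1-based, inclusive (an assumption of this sketch)
    end: int    # 1-based, inclusive
    strand: int = 1

    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class Record:
    """Hypothetical record holding a sequence, its features and its annotations."""
    accession: str
    sequence: str
    # Annotations describe the entire record and carry no location.
    annotations: Dict[str, List[str]] = field(default_factory=dict)
    features: List[Feature] = field(default_factory=list)

    def add_annotation(self, key: str, value: str) -> None:
        self.annotations.setdefault(key, []).append(value)

    def features_overlapping(self, start: int, end: int) -> List[Feature]:
        # Only features can answer a positional query; annotations cannot.
        return [f for f in self.features if f.start <= end and f.end >= start]


record = Record(accession="X00001", sequence="ATGGCCATTGTAATGGGCCGC")
record.add_annotation("keyword", "example")
record.features.append(Feature("CDS", 1, 21))
record.features.append(Feature("misc_feature", 4, 9))
hits = record.features_overlapping(10, 12)
```

A more feature-centric model would presumably fold the annotations
dictionary into the feature list, for example as features spanning the
whole record, but that is only one possible design.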
Some proposals for a next-generation (NG) object model will probably
not distinguish these types as much, and will remove the dependency on
things being sequence-centric. (Our parsers were written first to be
able to roundtrip the formats where most of the data was stored, i.e.
the sequence repositories: genbank, embl, swissprot.) An NG model will
probably be more feature-centric, likely in the DAS flavor of things.
I'm not sure what will really happen; it requires people to roll up their
sleeves and get some prototypes out there for us to really play with.
-jason
On Jan 17, 2005, at 3:00 PM, Hilmar Lapp wrote:
>
> On Jan 16, 2005, at 6:41 PM, mark.schreiber at group.novartis.com wrote:
>
>> It would seem that what is needed is a mapping of each field from a
>> file format to a field in a BioSQL table. I think initially this would
>> only need to be done for EMBL, SwissProt and GenBank.
>
> Note that this has at least been started for bioperl to the extent
> that the destination in the bioperl object model is documented.
> (Jason, Brian, anything you wanted to comment?)
>
> Once you know where it is in the bioperl object model, it is
> relatively straightforward to predict where it ends up in the schema;
> still, it's not written down in plain text anywhere, I think.
>
>> In many ways I prefer the idea of developing a SQL API which would be
>> more robust and would serve to define what is expected of each
>> procedure call. However I think it should be achievable for the schema.
>> In fact there is no reason why both cannot co-exist. For any API there
>> should be a possible implementation so naturally the schema could be
>> used to generate an API. People could then happily make other schemata
>> that fit the API which may be optimised for their needs.
>
> Right - I guess so far my idea was that the object model is the API,
> and the OR mapper implements the bridge to your chosen schema.
>
> Clearly, the problem with this level of API is that it's not
> cross-bio* by definition since we don't use the exact same object
> model.
>
>>
>> Does anyone have a recent UML or similar diagram for the schema?
>
> There is an ERD in the doc directory. Other than that, there is no UML
> model.
>
>> I can then use this to suggest mappings from GenBank fields to the
>> API. I think it may be easier in many cases to follow bioperl's lead.
>> BioJava seems to follow the 'store everything that isn't a feature as
>> a bioentry_qualifier' approach so I just need to add some special
>> cases.
>>
>> Hilmar, would you be prepared to do any work on the BioPerl side for
>> synchronization of the two?
>
> Certainly, if it is really needed. Generally speaking, I would not
> want to introduce object model and genbank-to-object model mapping
> changes to bioperl if they openly break backward compatibility unless
> everybody agrees to go forward. It's also not necessarily needed; the
> OR mapping code (bioperl-db) may be the better place, depending on
> scope and what's involved.
>
> -hilmar
>
>
>>
>> - Mark
>>
>>
>>
>>
>>
>> Hilmar Lapp <hlapp at gnf.org>
>> 01/15/2005 01:58 AM
>>
>>
>> To: Mark Schreiber/GP/Novartis at PH
>> cc: biosql-l at open-bio.org
>> Subject: Re: [BioSQL-l] Consistency between bio* projects
>>
>>
>>
>> On Friday, January 14, 2005, at 01:10 AM,
>> mark.schreiber at group.novartis.com wrote:
>>> Unfortunately, Bioperl stores identifiers as
>>> follows:
>>>
>>> Bioentry.bioentry_id is the unique internal reference number
>>> Bioentry.name is the GI number
>>
>> The GI number goes to Bioentry.Identifier, which was designated for
>> the purpose of storing the identifier within an external database.
>>
>> Bioentry.name should hold the locus name, which for contigs and many
>> other entries will be identical to the accession (but not the GI
>> number!).
>>
>> If you find it in Bioentry.name then I suspect you weren't loading
>> from genbank or embl formatted input?
>>
>>> From memory the basic idea of BioSQL was to define a schema that bio*
>>> projects could both read from and write to in a language-independent
>>> manner. For reasons best left to the designers (mostly, I think,
>>> because MySQL couldn't handle stored procedures) the level of
>>> interaction is right down at the schema level.
>>
>> Right. Also, not all database drivers in all languages support stored
>> procedure calls equally well. In e.g. PostgreSQL and Oracle you can
>> always get around this by writing a view and putting an INSTEAD OF
>> INSERT (or UPDATE) trigger on it that will then call the procedure,
>> but this is clearly not even close to an option in MySQL.
>>
>> It's maybe worth considering whether to open a dichotomy here between
>> MySQL and the rest, to provide people who need it with a SQL-level API
>> that both perl and java will use. People who are interested in this by
>> definition will not be interested in MySQL anyway.
>>
>>> Unfortunately this means that the way data is stored needs to be very
>>> consistent between projects if any APIs that use BioSQL are to be
>>> portable. My biojava API cannot be applied to a DB previously set up
>>> with bioperl, which was the original idea behind BioSQL in the first
>>> place.
>>>
>>> Help!!
>>
>> I think you're raising a great point. Indeed, such a contract hasn't
>> really been written. We're probably one of few who use both perl and
>> java to access a biosql database (and I'm not using biojava as the
>> object model on the java side, which is why I'm not running into this
>> problem). (Note as an aside that you could also write adaptors that
>> transform between the SymGene and the Biojava model when storing or
>> retrieving objects from/to the database.)
>>
>> It'd be great if you were willing to take the lead for getting this
>> all spelled out and laid down in a document?
>>
>> -hilmar
>> --
>> -------------------------------------------------------------
>> Hilmar Lapp email: lapp at gnf.org
>> GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
>> -------------------------------------------------------------
>>
--
Jason Stajich
Duke University
jason at cgt.duhs.duke.edu