[Biojava-l] Change Proposal regarding References

Tue May 31 23:42:30 EDT 2005

OK, I'll bear that in mind. 

Most annotations currently have rank implied by the order they were
loaded, as the underlying class in the commonly-used SimpleAnnotation is
a LinkedHashMap which preserves order of iteration. We can use this
property of LinkedHashMap to assign ranks as annotations pass into
BioSQL. Retrieval will be slightly harder but not impossible - it will
involve loading annotations of all kinds from the database into a
temporary sorted map of rank->annotation then creating the
SimpleAnnotation object to be returned from the value set of this
temporary map ordered by key. (BioSQLSequenceAnnotation will have to be
changed to use SimpleAnnotation on retrieving data - currently it uses
SmallAnnotation which is not ordered).
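
A rough sketch of that round trip, using only java.util collections. The
RankRoundTrip and Row names are hypothetical stand-ins, not BioJava or
BioSQL classes, and the sketch assumes ranks are unique across all the
annotation kinds being merged:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in classes; none of these names come from BioJava or BioSQL.
class RankRoundTrip {

    /** One stored row: the rank column plus the annotation key and value. */
    static final class Row {
        final int rank;
        final Object key;
        final Object value;

        Row(int rank, Object key, Object value) {
            this.rank = rank;
            this.key = key;
            this.value = value;
        }
    }

    /** Saving: assign ranks 1, 2, 3, ... in the iteration order of the map. */
    static List<Row> toRows(Map<Object, Object> annotations) {
        List<Row> rows = new ArrayList<Row>();
        int rank = 1;
        for (Map.Entry<Object, Object> e : annotations.entrySet()) {
            rows.add(new Row(rank++, e.getKey(), e.getValue()));
        }
        return rows;
    }

    /** Loading: sort rows by rank in a temporary map, then rebuild an ordered map. */
    static Map<Object, Object> fromRows(List<Row> rows) {
        TreeMap<Integer, Row> byRank = new TreeMap<Integer, Row>();
        for (Row row : rows) {
            byRank.put(row.rank, row);
        }
        Map<Object, Object> annotations = new LinkedHashMap<Object, Object>();
        for (Row row : byRank.values()) {
            annotations.put(row.key, row.value);
        }
        return annotations;
    }

    public static void main(String[] args) {
        Map<Object, Object> original = new LinkedHashMap<Object, Object>();
        original.put("COMMENT", "first");
        original.put("KEYWORDS", "second");
        original.put("SOURCE", "third");

        List<Row> rows = toRows(original);
        Collections.reverse(rows); // simulate rows coming back in arbitrary order

        Map<Object, Object> restored = fromRows(rows);
        System.out.println(new ArrayList<Object>(restored.keySet()));
        // prints [COMMENT, KEYWORDS, SOURCE]
    }
}
```

The TreeMap plays the part of the temporary sorted map, and the final
LinkedHashMap stands in for the ordered storage inside SimpleAnnotation.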

For sequences annotated with things other than SimpleAnnotation objects
or their subtypes, you will find the annotations come back in a
different order. However I'm not sure if this is the case anywhere at
present.

I should also point out that we should be using the 'bioentry_reference'
and 'reference' tables, and not 'bioentry_dbxref' as I mistakenly
mentioned in the original post.

Note that the 'identifier' and 'provider' fields in BibRef are optional
and only for use when a PubMed/Medline etc. value has been specified in
the original file. The BioSQL persistence layer will ignore both of them
if either is set to null.
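
As a minimal sketch of that rule: the helper below is hypothetical, not
part of BioJava or the BioSQL persistence layer. It assumes 'provider'
means the name held in the 'publisher' field described in the original
post, and it applies the upper-casing that post mentions:

```java
import java.util.Locale;

// Hypothetical helper; not part of BioJava or the BioSQL persistence layer.
class ReferenceCrossRef {

    /**
     * Returns {provider, identifier} with the provider name upper-cased,
     * or null when either value is missing, in which case both are ignored.
     */
    static String[] crossRefOrNull(String providerName, String identifier) {
        if (providerName == null || identifier == null) {
            return null;
        }
        return new String[] { providerName.toUpperCase(Locale.ROOT), identifier };
    }

    public static void main(String[] args) {
        // "12345" is a made-up placeholder ID, not a real citation.
        String[] complete = crossRefOrNull("pubmed", "12345");
        System.out.println(complete[0] + " " + complete[1]);
        System.out.println(crossRefOrNull("PUBMED", null) == null);
    }
}
```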

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199
---------------------------------------------
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its content to any
other person. Thank you.
---------------------------------------------

> -----Original Message-----
> From: mark.schreiber at novartis.com 
> [mailto:mark.schreiber at novartis.com] 
> Sent: Wednesday, June 01, 2005 11:18 AM
> To: Richard HOLLAND
> Cc: biojava-l at biojava.org; 
> biojava-l-bounces at portal.open-bio.org; OBDA BioSQL
> Subject: Re: [Biojava-l] Change Proposal regarding References
> 
> 
> I'd support this and might be able to help out with advice or words of
> encouragement (coffee at least) for the first few steps.
>
> I would also encourage you to look into the rank column of the
> appropriate BioSQL tables. The rank column is intended to help preserve
> the order of comments, dbxrefs, references, qualifiers etc. so that when
> you dump something out in Genbank format you get everything in the same
> order it was read in. I'm not sure Biojava makes sensible use of rank
> columns at the moment.
> 
> - Mark
> 
> "Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
> Sent by: biojava-l-bounces at portal.open-bio.org
> 06/01/2005 11:02 AM
>
>         To:      <biojava-l at biojava.org>, "OBDA BioSQL" <biosql-l at open-bio.org>
>         cc:      (bcc: Mark Schreiber/GP/Novartis)
>         Subject: [Biojava-l] Change Proposal regarding References
> 
> 
> Hi all,
> 
> This is a two-pronged change proposal - first to allow BioJava to make
> correct use of the bioentry_dbxref tables in BioSQL, and second to allow
> it to parse reference information correctly from EMBL, Genbank, Genpept,
> GenXML, and SwissProt records and store them within Sequence objects in
> a consistent manner.
>
> Currently, references are loaded from only some of the above formats.
> Depending on the format, they are stored in different ways within the
> Sequence object.
> 
> Genbank references are stored with each line of the record as a separate
> annotation, e.g. one annotation with a key saying REFERENCE and a value
> giving a location, another with a key saying AUTHOR and a value listing
> them, etc. As simple String/String annotations, they get persisted to
> the bioentry_qualifier_value table in BioSQL. As multiple references are
> read, they get stored with the same keys, so you end up with Annotations
> for these keys containing ArrayLists of potentially different arity,
> depending on which of the original references had which optional fields
> included (e.g. PUBMED or MEDLINE). This makes it impossible to
> accurately reconstruct the original reference information when exporting
> the sequence to a file.
> 
> EMBL/Swissprot references do almost the same thing, except the parser
> here gathers up the various reference tags from the file and wraps each
> set in its own ReferenceAnnotation class, which is just a map which gets
> flattened out and persisted to bioentry_qualifier_value as String/String
> annotation pairs as above. When loaded back in from BioSQL the
> ReferenceAnnotation objects are not recreated, and you end up with the
> same ArrayList problem as above, leading to the same problem when trying
> to export the sequence to a file.
> 
> Another problem here is that the two approaches only understand their
> own methods when it comes to exporting references in their own file
> formats. So, the Genbank exporter cannot export references that were
> loaded from EMBL/Swissprot, and vice versa.
> 
> Not good!
> 
> So, I propose the following:
> 
> 1) Change the file format parsers above to create, when reading
>    sequences from file, an org.biojava.bibliography.BibRef object for
>    each inputted reference. This object can then be stored against the
>    Sequence as an annotation, with the key of BibRef.class. As with all
>    other kinds of annotation, if multiple references are loaded then the
>    value of the annotation should be an ArrayList of the various BibRef
>    objects. If only one reference is loaded, then the value should be
>    the single BibRef object itself.
> 2) Change the file format parsers above to understand, when writing
>    sequences to file, how to convert BibRef annotations into their own
>    formats.
> 3) There is no restriction on which of the established BibRef subtypes
>    from org.biojava.bibliography.* you can actually use to annotate the
>    sequence. Usually you'll be wanting a BiblioJournalArticle object.
>    However, you MUST use certain fields as follows:
>    a) use the 'identifier' field to store the PubMed or MedLine ID
>       (purely the ID, not prefixed with anything).
>    b) use the 'publisher' field to store a BiblioOrganisation object
>       with name set to 'PUBMED' or 'MEDLINE' as appropriate (must be
>       upper case - if not, it will get changed to upper case on
>       persistence to BioSQL, so you might as well stick it in upper
>       case to start with).
>    c) use the 'type' field to store a TYPE_* value from BibRefSupport
>       to indicate what sort of resource this reference refers to (in
>       most cases you'll want TYPE_JOURNAL_ARTICLE).
> 4) Alter BioSQLSequenceDB.persistBioentryProperty() to check for
>    annotations with the key of BibRef.class or any of its established
>    subtypes as above, and use special behaviour to persist these to the
>    bioentry_dbxref table (and related tables as appropriate).
> 5) Alter BioSQLSequenceAnnotation.initAnnotations() to check for and
>    load the bioentry_dbxref data as BibRef.class annotations.
> 
> Any suggestions/changes/volunteers/violent objections? I can manage
> steps 4 and 5 myself quite easily, but will need help from everyone out
> there in updating the file parsers to use this proposed mechanism.
> 
> cheers,
> Richard
> 
> Richard Holland
> Bioinformatics Specialist
> Genome Institute of Singapore
> 60 Biopolis Street, #02-01 Genome, Singapore 138672
> Tel: (65) 6478 8000   DID: (65) 6478 8199
> Email: hollandr at gis.a-star.edu.sg
> 
> 
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at biojava.org
> http://biojava.org/mailman/listinfo/biojava-l
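
The reconstruction problem described in the quoted proposal (one list of
values per key, with optional fields giving lists of different lengths)
can be illustrated with plain java.util collections. This is only a
sketch: the class and method names are hypothetical, the AUTHORS and
PUBMED keys are illustrative rather than the exact keys the parsers use,
and the values are placeholders, not real citations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-collections illustration; no BioJava classes are used here.
class FlattenedReferences {

    /** Merge per-reference fields into one key -> list-of-values map. */
    static Map<String, List<String>> flatten(List<Map<String, String>> references) {
        Map<String, List<String>> flat = new LinkedHashMap<String, List<String>>();
        for (Map<String, String> reference : references) {
            for (Map.Entry<String, String> field : reference.entrySet()) {
                List<String> values = flat.get(field.getKey());
                if (values == null) {
                    values = new ArrayList<String>();
                    flat.put(field.getKey(), values);
                }
                values.add(field.getValue());
            }
        }
        return flat;
    }

    /** Build one reference; pubmedId may be null because the field is optional. */
    static Map<String, String> reference(String authors, String pubmedId) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("AUTHORS", authors);
        if (pubmedId != null) {
            fields.put("PUBMED", pubmedId);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Placeholder values only; these are not real citations.
        List<Map<String, String>> idOnFirst =
                Arrays.asList(reference("Author A", "111"), reference("Author B", null));
        List<Map<String, String>> idOnSecond =
                Arrays.asList(reference("Author A", null), reference("Author B", "111"));

        System.out.println(flatten(idOnFirst));
        System.out.println(flatten(idOnSecond));
        // Both inputs flatten to the same map, so the original grouping is lost.
        System.out.println(flatten(idOnFirst).equals(flatten(idOnSecond)));
    }
}
```

Two different sets of references collapse to the same flattened map,
which is why keeping one object per reference, as the proposal suggests
with BibRef, preserves what the flat String/String form cannot.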