[Biojava-l] Change Proposal regarding References

Tue May 31 23:18:08 EDT 2005

I'd support this and might be able to help out with advice or words of 
encouragement (coffee at least) for the first few steps.

I would also encourage you to look into the rank column of the appropriate 
BioSQL tables. The rank column is intended to help preserve the order of 
comments, dbxrefs, references, qualifiers, etc., so that when you dump 
something out in Genbank format you get everything in the same order it 
was read in. I'm not sure Biojava makes sensible use of rank columns at 
the moment.
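
As a rough illustration of what the rank column is for, here is a minimal 
sketch in Java. It assumes a hypothetical in-memory row type (RankedValue) 
and does not use the BioJava or BioSQL APIs: rows that come back from the 
database in arbitrary order are sorted by rank to restore file order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: RankedValue is NOT a BioJava or BioSQL class.
class RankDemo {
    // One row of a rank-bearing table, e.g. a comment or qualifier value.
    static final class RankedValue {
        final int rank;
        final String value;

        RankedValue(int rank, String value) {
            this.rank = rank;
            this.value = value;
        }
    }

    // Returns the values sorted by ascending rank, i.e. in the order
    // in which they were originally read from the file.
    static List<String> inFileOrder(List<RankedValue> rows) {
        List<RankedValue> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((RankedValue r) -> r.rank));
        List<String> out = new ArrayList<>();
        for (RankedValue r : sorted) {
            out.add(r.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<RankedValue> rows = new ArrayList<>();
        rows.add(new RankedValue(2, "second comment"));
        rows.add(new RankedValue(1, "first comment"));
        System.out.println(inFileOrder(rows));
    }
}
```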

- Mark

"Richard HOLLAND" <hollandr at gis.a-star.edu.sg>
Sent by: biojava-l-bounces at portal.open-bio.org
06/01/2005 11:02 AM

        To:     <biojava-l at biojava.org>, "OBDA BioSQL" <biosql-l at open-bio.org>
        cc:     (bcc: Mark Schreiber/GP/Novartis)
        Subject:        [Biojava-l] Change Proposal regarding References

Hi all,

This is a two-pronged change proposal - first to allow BioJava to make 
correct use of the bioentry_dbxref tables in BioSQL, and second to allow 
it to parse reference information correctly from EMBL, Genbank, Genpept, 
GenXML, and SwissProt records and store them within Sequence objects in a 
consistent manner.

Currently, references are loaded from only some of the above formats. 
Depending on the format, they are stored in different ways within the 
Sequence object. 

Genbank references are stored with each line of the record as a separate 
annotation, e.g. one annotation with a key saying REFERENCE and a value 
giving a location, another with a key saying AUTHOR and a value listing 
the authors, and so on. As simple String/String annotations, they get 
persisted to the bioentry_qualifier_value table in BioSQL. As multiple 
references are read, they get stored with the same keys, so you end up 
with Annotations for these keys containing ArrayLists of potentially 
different arity, depending on which of the original references had which 
optional fields included (e.g. PUBMED or MEDLINE). This makes it 
impossible to accurately reconstruct the original reference information 
when exporting the sequence to a file.
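
The arity problem can be seen in a minimal sketch. This assumes a plain 
Map standing in for a BioJava Annotation, and the field values are 
made-up placeholders. It is not the actual parser code, only an 
illustration of the per-key lists described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a plain Map stands in for a BioJava Annotation.
class FlattenedReferences {
    // Appends one reference's fields to the shared per-key lists,
    // mimicking the flattened storage described in the email.
    static void addReference(Map<String, List<String>> annotation,
                             Map<String, String> fields) {
        for (Map.Entry<String, String> e : fields.entrySet()) {
            annotation.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                      .add(e.getValue());
        }
    }

    // Builds an annotation from two placeholder references, only the
    // second of which carries the optional PUBMED field.
    static Map<String, List<String>> buildExample() {
        Map<String, List<String>> annotation = new LinkedHashMap<>();

        Map<String, String> first = new LinkedHashMap<>();
        first.put("REFERENCE", "1 (bases 1 to 100)");
        first.put("AUTHOR", "Author A");

        Map<String, String> second = new LinkedHashMap<>();
        second.put("REFERENCE", "2 (bases 50 to 200)");
        second.put("AUTHOR", "Author B");
        second.put("PUBMED", "0000000"); // placeholder ID

        addReference(annotation, first);
        addReference(annotation, second);
        return annotation;
    }

    public static void main(String[] args) {
        // REFERENCE and AUTHOR each hold two values, PUBMED holds one,
        // and nothing records which reference the PUBMED ID came from.
        System.out.println(buildExample());
    }
}
```

With only the lists to go on, an exporter cannot tell whether the single 
PUBMED value belongs to the first or the second reference.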

EMBL/Swissprot references do almost the same thing, except that the parser 
here gathers up the various reference tags from the file and wraps each 
set in its own ReferenceAnnotation class. This is just a map, which gets 
flattened out and persisted to bioentry_qualifier_value as String/String 
annotation pairs as above. When loaded back in from BioSQL, the 
ReferenceAnnotation objects are not recreated, and you end up with the 
same ArrayList problem as above, leading to the same problem when trying 
to export the sequence to a file.

Another problem here is that each exporter only understands references 
stored the way its own parser stores them. So, the Genbank exporter cannot 
export references that were loaded from EMBL/Swissprot, and vice versa.

Not good!

So, I propose the following:

1) Change the file format parsers above to create, when reading 
   sequences from file, an org.biojava.bibliography.BibRef object for 
   each reference read. This object can then be stored against the 
   Sequence as an annotation, with the key of BibRef.class. As with all 
   other kinds of annotation, if multiple references are loaded then the 
   value of the annotation should be an ArrayList of the various BibRef 
   objects. If only one reference is loaded, then the value should be 
   the single BibRef object itself.
2) Change the file format parsers above to understand, when writing 
   sequences to file, how to convert BibRef annotations into their own 
   formats.
3) There is no restriction on which of the established BibRef subtypes 
   from org.biojava.bibliography.* you can actually use to annotate the 
   sequence. Usually you'll be wanting a BiblioJournalArticle object. 
   However, you MUST use certain fields as follows:
   a) use the 'identifier' field to store the PubMed or MedLine ID 
      (purely the ID, not prefixed with anything).
   b) use the 'publisher' field to store a BiblioOrganisation object 
      with name set to 'PUBMED' or 'MEDLINE' as appropriate (must be 
      upper case - if not, it will get changed to upper case on 
      persistence to BioSQL, so you might as well stick it in upper 
      case to start with).
   c) use the 'type' field to store a TYPE_* value from BibRefSupport 
      to indicate what sort of resource this reference refers to (in 
      most cases you'll want TYPE_JOURNAL_ARTICLE).
4) Alter BioSQLSequenceDB.persistBioentryProperty() to check for 
   annotations with the key of BibRef.class or any of its established 
   subtypes as above, and use special behaviour to persist these to the 
   bioentry_dbxref table (and related tables as appropriate).
5) Alter BioSQLSequenceAnnotation.initAnnotations() to check for and 
   load the bioentry_dbxref data as BibRef.class annotations.
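
The annotation convention in step 1 and the field rules in step 3 could 
be sketched roughly as follows. This is only an illustration under 
stated assumptions: Ref is a hypothetical stand-in for 
org.biojava.bibliography.BibRef, a plain Map stands in for the 
Sequence's Annotation, and the type string is a placeholder rather than 
a real TYPE_* constant.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: Ref stands in for BibRef, and a plain Map
// stands in for a Sequence's Annotation. Neither is the BioJava API.
class ReferenceAnnotationSketch {
    static final class Ref {
        final String identifier;    // bare PubMed or MedLine ID, no prefix
        final String publisherName; // "PUBMED" or "MEDLINE"
        final String type;          // placeholder for a TYPE_* value

        Ref(String identifier, String publisherName, String type) {
            this.identifier = identifier;
            // Normalise to upper case up front, as step 3b recommends.
            this.publisherName = publisherName.toUpperCase(Locale.ROOT);
            this.type = type;
        }
    }

    // Stores a reference under the Ref.class key: a single Ref if it is
    // the only one, otherwise an ArrayList of all of them (step 1).
    static void addReference(Map<Object, Object> annotation, Ref ref) {
        Object existing = annotation.get(Ref.class);
        if (existing == null) {
            annotation.put(Ref.class, ref);
        } else if (existing instanceof List) {
            @SuppressWarnings("unchecked")
            List<Ref> refs = (List<Ref>) existing;
            refs.add(ref);
        } else {
            List<Ref> refs = new ArrayList<>();
            refs.add((Ref) existing);
            refs.add(ref);
            annotation.put(Ref.class, refs);
        }
    }

    // Reads the references back as a list, whichever form was stored.
    static List<Ref> getReferences(Map<Object, Object> annotation) {
        Object value = annotation.get(Ref.class);
        if (value == null) {
            return Collections.emptyList();
        }
        if (value instanceof List) {
            @SuppressWarnings("unchecked")
            List<Ref> refs = (List<Ref>) value;
            return refs;
        }
        return Collections.singletonList((Ref) value);
    }

    public static void main(String[] args) {
        Map<Object, Object> annotation = new HashMap<>();
        addReference(annotation, new Ref("0000000", "pubmed", "JournalArticle"));
        addReference(annotation, new Ref("1111111", "MEDLINE", "JournalArticle"));
        for (Ref r : getReferences(annotation)) {
            System.out.println(r.publisherName + " " + r.identifier);
        }
    }
}
```

Because each reference stays a single object, its optional fields travel 
with it, and a writer for any format can read them back through one 
accessor instead of matching up parallel lists.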

Any suggestions/changes/volunteers/violent objections? I can manage steps 
4 and 5 myself quite easily, but will need help from everyone out there in 
updating the file parsers to use this proposed mechanism.

cheers,
Richard

Richard Holland
Bioinformatics Specialist
Genome Institute of Singapore
60 Biopolis Street, #02-01 Genome, Singapore 138672
Tel: (65) 6478 8000   DID: (65) 6478 8199
Email: hollandr at gis.a-star.edu.sg
