[Biojava-l] RefSeq bioJava parser problem

Matthew Pocock matthew_pocock@yahoo.co.uk
Tue, 14 May 2002 21:33:21 +0100


Hi.

Anyone who writes parsers should take a look at the 
org.biojava.bio.program.tagvalue package. It has cleaner support for 
this kind of parsing problem. We should be able to refactor sequenceIO 
to re-use a lot of this API, giving a much more modular framework for 
handling nasty things like reference entries. It assumes that a file 
can be broken into tags with zero or more values, and that a given value 
stream may itself represent a sub-document of tag-value pairs; e.g. 
feature tables are a sub-document of both embl and genbank entries, and 
features are sub-documents of feature tables - the handlers for these 
can be re-used easily.
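
To make the tags-with-sub-documents idea concrete, here is a minimal 
sketch in plain Java. It is not the org.biojava.bio.program.tagvalue 
API: the TagValueSketch class, its Node type, the SAMPLE lines and the 
12-column tag width are all assumptions made for illustration. Because 
each REFERENCE keeps its own child tags, a reference without a PUBMED 
line simply has no PUBMED child, so fields cannot drift out of step 
between references.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the BioJava tagvalue API. */
class TagValueSketch {

    /** One tag, its values, and any nested tags (its sub-document). */
    static class Node {
        final String tag;
        final List<String> values = new ArrayList<>();
        final List<Node> children = new ArrayList<>();

        Node(String tag) { this.tag = tag; }

        /** First child with the given tag, or null if there is none. */
        Node child(String name) {
            for (Node c : children) {
                if (c.tag.equals(name)) return c;
            }
            return null;
        }
    }

    // Assumption: tags live in the first 12 columns, as in GenBank entries.
    static final int TAG_WIDTH = 12;

    // Two references cut down from a GenBank-style entry; the second has no PUBMED.
    static final String[] SAMPLE = {
        "REFERENCE   5  (residues 1 to 167)",
        "  TITLE     Structural organization and chromosomal assignment of the human",
        "            obese gene",
        "  MEDLINE   96070903",
        "   PUBMED   7499240",
        "REFERENCE   6  (residues 1 to 167)",
        "  TITLE     Genomic structure and promoter analysis of the human obese gene",
        "  MEDLINE   96223958",
    };

    static List<Node> parse(String[] lines) {
        List<Node> top = new ArrayList<>();
        Node current = null; // most recent top-level tag
        Node last = null;    // most recent tag at any depth; takes continuation lines
        for (String line : lines) {
            boolean wide = line.length() > TAG_WIDTH;
            String tag = (wide ? line.substring(0, TAG_WIDTH) : line).trim();
            String value = wide ? line.substring(TAG_WIDTH).trim() : "";
            if (tag.length() == 0) {
                // Blank tag column: one more value for the previous tag.
                if (last != null && value.length() > 0) last.values.add(value);
                continue;
            }
            Node node = new Node(tag);
            if (value.length() > 0) node.values.add(value);
            if (Character.isWhitespace(line.charAt(0)) && current != null) {
                current.children.add(node); // indented tag: part of the sub-document
            } else {
                top.add(node);
                current = node;
            }
            last = node;
        }
        return top;
    }

    public static void main(String[] args) {
        for (Node ref : parse(SAMPLE)) {
            Node pubmed = ref.child("PUBMED");
            System.out.println(ref.values.get(0) + " -> PUBMED "
                    + (pubmed == null ? "absent" : pubmed.values.get(0)));
        }
    }
}
```

A listener-style design would fire events instead of building Node 
objects, but the nesting rule would be the same.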

There is currently drop-in support for embl- and genbank-like file 
formats, and for those of you on jdk1.4, there is an implementation that 
processes lines into tag/value pairs, or splits a value into a list of 
values based upon a regular expression (very handy).
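
The regular-expression idea can be sketched with java.util.regex from 
jdk1.4. Again, this is not the BioJava implementation: the 
RegexSplitSketch class and its three patterns are hypothetical, chosen 
to fit the GenBank-style lines and the db_xref values quoted further 
down in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the BioJava tagvalue API. */
class RegexSplitSketch {

    // Optional indent, a tag of non-space characters, whitespace, then the value.
    static final Pattern TAG_VALUE = Pattern.compile("^\\s*(\\S+)\\s+(.*)$");

    // Assumed author separator: a comma followed by whitespace, or the word "and".
    static final Pattern AUTHOR_SEP = Pattern.compile(",\\s+|\\s+and\\s+");

    // Assumed parameter:value shape of values such as "LocusID:946".
    static final Pattern PARAM_VALUE = Pattern.compile("^([^:]+):(.*)$");

    /** Splits a line into {tag, value}, or returns null if it does not match. */
    static String[] tagValue(String line) {
        Matcher m = TAG_VALUE.matcher(line);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(2).trim() };
    }

    /** Splits one value into a list of values around the given separator. */
    static List<String> splitValue(String value, Pattern separator) {
        return Arrays.asList(separator.split(value));
    }

    /** Splits "param:value" into {param, value}, or returns null if there is no colon. */
    static String[] paramValue(String item) {
        Matcher m = PARAM_VALUE.matcher(item);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] tv = tagValue("  AUTHORS   Gong,D.W., Bi,S., Pratley,R.E. and Weintraub,B.D.");
        System.out.println(tv[0] + " = " + splitValue(tv[1], AUTHOR_SEP));

        String[] pv = paramValue("LocusID:946");
        System.out.println(pv[0] + " = " + pv[1]);
    }
}
```

Keeping each pattern as a named constant means a different file format 
only needs different patterns, not different code.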

Coupled with the new annotation property objects, it provides a very 
easy way to build object-trees from text.

If anyone does take a look and gets lost, please contact me and I will 
attempt to make the documentation more explicit.

Matthew

Cox, Greg wrote:
> Unfortunately, not.  This is probably the weakest point in BioJava's parsing
> right now.  
> 
> As you may have noticed, there's a more serious problem with the reference
> information.  If a reference doesn't contain a field that others do, nothing
> is added under that key, causing them to get out of sync.  For example:
> 
> REFERENCE
> 	TITLE foo
> 	TITLE bar
> 	AUTHOR wanner
> 
> When this gets turned into a biojava sequence, TITLE has [foo, bar] and
> AUTHOR has [wanner] but there's no way to tell which one wanner goes with.
> Good luck
> 
> Greg
> 	
> 
> 
>>-----Original Message-----
>>From: wanner.de@pg.com [mailto:wanner.de@pg.com]
>>Sent: Tuesday, May 14, 2002 11:41 AM
>>To: biojava-l@biojava.org
>>Subject: [Biojava-l] RefSeq bioJava parser problem
>>
>>
>>Hi,
>>
>>Appreciate the responses to the refSeq question. We've been able to put
>>together a reliable parser using the example in TestRefSeqPrt.
>>
>>Have an additional question now. Are there any utility methods within
>>bioJava that can be used to handle parsed values that are returned by
>>bioJava in list form?
>>
>>For example, the following value was returned from bioJava for a sequence
>>annotation with key MEDLINE:
>>
>>     [98127055, 99357812]
>>
>>
>>Another example is the value that was returned from bioJava for a feature
>>annotation with key db_xref:
>>
>>     [LocusID:946, MIM:604405]
>>
>>bioJava does good work in accumulating the information together and
>>placing it under a specific annotation. Does anyone know if there are
>>methods to extract list members or parameter/value pairs already
>>available in bioJava?
>>
>>thx,
>>Dave
>>
>>
>>>>LOCUS       NP_000221                167 aa    linear   PRI 29-JAN-2002
>>>>DEFINITION  leptin precursor; leptin (murine obesity homolog); obesity; obesity
>>>>            (murine homolog, leptin) [Homo sapiens].
>>>>ACCESSION   NP_000221
>>>>PID         g4557715
>>>>VERSION     NP_000221.1  GI:4557715
>>>>DBSOURCE    REFSEQ: accession NM_000230.1
>>>>KEYWORDS    .
>>>>SOURCE      human.
>>>>  ORGANISM  Homo sapiens
>>>>            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>>>>            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
>>>>REFERENCE   1  (residues 1 to 167)
>>>>  AUTHORS   Friedman JM, Leibel RL, Siegel DS, Walsh J and Bahary N.
>>>>  TITLE     Molecular mapping of the mouse ob mutation
>>>>  JOURNAL   Genomics 11 (4), 1054-1062 (1991)
>>>>  MEDLINE   92147101
>>>>   PUBMED   1686014
>>>>REFERENCE   2  (residues 1 to 167)
>>>>  AUTHORS   Zhang Y, Proenca R, Maffei M, Barone M, Leopold L and Friedman JM.
>>>>  TITLE     Positional cloning of the mouse obese gene and its human homologue
>>>>  JOURNAL   Nature 372 (6505), 425-432 (1994)
>>>>  MEDLINE   95075453
>>>>   PUBMED   7984236
>>>>  REMARK    Erratum:[[published erratum appears in Nature 1995 Mar
>>>>            30;374(6521):479]]
>>>>REFERENCE   3  (residues 1 to 167)
>>>>  AUTHORS   Masuzaki H, Ogawa Y, Isse N, Satoh N, Okazaki T, Shigemoto M, Mori
>>>>            K, Tamura N, Hosoda K, Yoshimasa Y et al.
>>>>  TITLE     Human obese gene expression. Adipocyte-specific expression and
>>>>            regional differences in the adipose tissue
>>>>  JOURNAL   Diabetes 44 (7), 855-858 (1995)
>>>>  MEDLINE   95309556
>>>>   PUBMED   7789654
>>>>REFERENCE   4  (residues 1 to 167)
>>>>  AUTHORS   Green ED, Maffei M, Braden VV, Proenca R, DeSilva U, Zhang Y, Chua
>>>>            SC Jr, Leibel RL, Weissenbach J and Friedman JM.
>>>>  TITLE     The human obese (OB) gene: RNA expression pattern and mapping on
>>>>            the physical, cytogenetic, and genetic maps of chromosome 7
>>>>  JOURNAL   Genome Res. 5 (1), 5-12 (1995)
>>>>  MEDLINE   96352898
>>>>   PUBMED   8717050
>>>>REFERENCE   5  (residues 1 to 167)
>>>>  AUTHORS   Isse N, Ogawa Y, Tamura N, Masuzaki H, Mori K, Okazaki T, Satoh N,
>>>>            Shigemoto M, Yoshimasa Y, Nishi S et al.
>>>>  TITLE     Structural organization and chromosomal assignment of the human
>>>>            obese gene
>>>>  JOURNAL   J. Biol. Chem. 270 (46), 27728-27733 (1995)
>>>>  MEDLINE   96070903
>>>>   PUBMED   7499240
>>>>REFERENCE   6  (residues 1 to 167)
>>>>  AUTHORS   Gong,D.W., Bi,S., Pratley,R.E. and Weintraub,B.D.
>>>>  TITLE     Genomic structure and promoter analysis of the human obese gene
>>>>  JOURNAL   J. Biol. Chem. 271 (8), 3971-3974 (1996)
>>>>  MEDLINE   96223958
>>>>REFERENCE   7  (residues 1 to 167)
>>>>  AUTHORS   Niki T, Mori H, Tamori Y, Kishimoto-Hashirmoto M, Ueno H, Araki S,
>>>>            Masugi J, Sawant N, Majithia HR, Rais N et al.
>>>>  TITLE     Human obese gene: molecular screening in Japanese and Asian Indian
>>>>            NIDDM patients associated with obesity
>>>>  JOURNAL   Diabetes 45 (5), 675-678 (1996)
>>>>  MEDLINE   96198511
>>>>   PUBMED   8621021
>>>>REFERENCE   8  (residues 1 to 167)
>>>>  AUTHORS   Comuzzie,A.G., Hixson,J.E., Almasy,L., Mitchell,B.D., Mahaney,M.C.,
>>>>            Dyer,T.D., Stern,M.P., MacCluer,J.W. and Blangero,J.
>>>>  TITLE     A major quantitative trait locus determining serum leptin levels
>>>>            and fat mass is located on human chromosome 2
>>>>  JOURNAL   Nat. Genet. 15 (3), 273-276 (1997)
>>>>  MEDLINE   97207647
>>>>   PUBMED   9054940
>>>>REFERENCE   9  (residues 1 to 167)
>>>>  AUTHORS   Clement,K., Vaisse,C., Lahlou,N., Cabrol,S., Pelloux,V.,
>>>>            Cassuto,D., Gourmelen,M., Dina,C., Chambaz,J., Lacorte,J.M.,
>>>>            Basdevant,A., Bougneres,P., Lebouc,Y., Froguel,P. and Guy-Grand,B.
>>>>  TITLE     A mutation in the human leptin receptor gene causes obesity and
>>>>            pituitary dysfunction
>>>>  JOURNAL   Nature 392 (6674), 398-401 (1998)
>>>>  MEDLINE   98196670
>>>>   PUBMED   9537324
>>>>REFERENCE   10 (residues 1 to 167)
>>>>  AUTHORS   Friedman,J.M. and Halaas,J.L.
>>>>  TITLE     Leptin and the regulation of body weight in mammals
>>>>  JOURNAL   Nature 395 (6704), 763-770 (1998)
>>>>  MEDLINE   99010835
>>>>COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
>>>>            reference sequence was derived from U43653.1.
>>>>            Summary: This gene is similar to the mouse obesity gene (ob). The
>>>>            protein encoded by this gene is secreted by white adipocytes. In
>>>>            the mouse study, mutations in this gene are linked to severe and
>>>>            morbid obesity.
>>>>FEATURES             Location/Qualifiers
>>>>     source          1..167
>>>>                     /organism="Homo sapiens"
>>>>                     /db_xref="taxon:9606"
>>>>                     /chromosome="7"
>>>>                     /map="7q31.3"
>>>>     Protein         1..167
>>>>                     /product="leptin precursor"
>>>>                     /note="leptin (murine obesity homolog); obesity (murine
>>>>                     homolog, leptin)"
>>>>     sig_peptide     1..21
>>>>     Region          22..167
>>>>                     /region_name="Leptin"
>>>>                     /note="Leptin"
>>>>                     /db_xref="CDD:pfam02024"
>>>>     mat_peptide     22..167
>>>>                     /product="leptin"
>>>>     CDS             1..167
>>>>                     /gene="LEP"
>>>>                     /coded_by="NM_000230.1:57..560"
>>>>                     /db_xref="LocusID:3952"
>>>>                     /db_xref="MIM:164160"
>>>>ORIGIN
>>>>        1 mhwgtlcgfl wlwpylfyvq avpiqkvqdd tktliktivt rindishtqs vsskqkvtgl
>>>>       61 dfipglhpil tlskmdqtla vyqqiltsmp srnviqisnd lenlrdllhv lafskschlp
>>>>      121 wasgletlds lggvleasgy stevvalsrl qgslqdmlwq ldlspgc
>>>>//
>>>>
>>>>_______________________________________________
>>>>Biojava-l mailing list  -  Biojava-l@biojava.org
>>>>http://biojava.org/mailman/listinfo/biojava-l