[Biojava-l] RefSeq bioJava parser problem
Matthew Pocock
matthew_pocock@yahoo.co.uk
Tue, 14 May 2002 21:33:21 +0100
Hi.
Anyone who writes parsers should take a look at the
org.biojava.bio.program.tagvalue package. It has cleaner support for
this kind of parsing problem. We should be able to refactor sequenceIO
to re-use a lot of this API, giving a much more modular framework for
handling nasty things like reference entries. It assumes that a file
can be broken into tags with zero or more values, and that a given value
stream may itself represent a sub-document of tag-value pairs; e.g.
feature tables are a sub-document of both embl and genbank entries, and
features are sub-documents of feature tables - the handlers for these
can be re-used easily.
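To make the sub-document idea concrete, here is a minimal sketch in plain
Java. It does not use the tagvalue API: the class ReferenceGrouping, the
method parseReferences and the two-reference input are all made up for
illustration. Each REFERENCE line opens its own sub-document, so the fields
of one reference stay together instead of being pooled per key, which is
the problem Greg describes in his message quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not BioJava API: one map of tag/value pairs per REFERENCE. */
final class ReferenceGrouping {

    /**
     * A line starting with REFERENCE opens a new sub-document. Indented lines
     * that follow are treated as tag/value pairs belonging to that sub-document.
     * Any other top-level line closes the current reference.
     */
    public static List<Map<String, String>> parseReferences(List<String> lines) {
        List<Map<String, String>> references = new ArrayList<>();
        Map<String, String> current = null;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith("REFERENCE")) {
                current = new LinkedHashMap<>();
                references.add(current);
            } else if (current != null && line.startsWith(" ")) {
                // Split into the tag and the rest of the line.
                String[] parts = line.trim().split("\\s+", 2);
                current.put(parts[0], parts.length > 1 ? parts[1] : "");
            } else {
                current = null; // some other top-level tag: the reference has ended
            }
        }
        return references;
    }

    public static void main(String[] args) {
        // Made-up input in the spirit of the TITLE/AUTHOR example quoted below.
        List<String> entry = Arrays.asList(
            "REFERENCE   1",
            "  TITLE     foo",
            "REFERENCE   2",
            "  TITLE     bar",
            "  AUTHOR    wanner");
        // prints [{TITLE=foo}, {TITLE=bar, AUTHOR=wanner}]
        System.out.println(parseReferences(entry));
    }
}
```

With one map per reference, a missing field is simply absent from that
reference's map, and the other references are unaffected.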
There is currently drop-in support for embl- and genbank-like file
formats, and for those of you on jdk1.4, there is an implementation that
processes lines into tag/value pairs, or splits a value into a list of
values based upon a regular expression (very handy).
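As a rough illustration of regular-expression splitting, the sketch below
uses only java.util.regex (new in jdk1.4), not the tagvalue classes. The
class ValueSplitting and its two methods are hypothetical names, and the
separators (';' for lists, ':' for pairs) are assumptions based on the
taxonomy and db_xref values in the record quoted below.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch, not BioJava API: split values with regular expressions. */
final class ValueSplitting {

    // Assumed separators: ';' between list members, ':' between name and value.
    private static final Pattern LIST_SEPARATOR = Pattern.compile("\\s*;\\s*");
    private static final Pattern PAIR_SEPARATOR = Pattern.compile(":");

    /** Splits a single value such as "a; b; c;" into its members. */
    public static List<String> splitList(String value) {
        // Pattern.split drops trailing empty strings, so a final ';' is harmless.
        return Arrays.asList(LIST_SEPARATOR.split(value.trim()));
    }

    /** Turns members such as "LocusID:946" into name/value pairs, splitting at the first ':'. */
    public static Map<String, String> toPairs(List<String> values) {
        Map<String, String> pairs = new LinkedHashMap<>();
        for (String value : values) {
            String[] parts = PAIR_SEPARATOR.split(value, 2);
            // A repeated name overwrites the earlier value in this simple sketch.
            pairs.put(parts[0].trim(), parts.length > 1 ? parts[1].trim() : "");
        }
        return pairs;
    }

    public static void main(String[] args) {
        // prints [Eukaryota, Metazoa, Chordata]
        System.out.println(splitList("Eukaryota; Metazoa; Chordata;"));
        // prints {LocusID=946, MIM=604405}
        System.out.println(toPairs(Arrays.asList("LocusID:946", "MIM:604405")));
    }
}
```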
Coupled with the new annotation property objects, it provides a very
easy way to build object-trees from text.
If anyone does take a look and gets lost, please contact me and I will
attempt to make the documentation more explicit.
Matthew
Cox, Greg wrote:
> Unfortunately, not. This is probably the weakest point in BioJava's parsing
> right now.
>
> As you may have noticed, there's a more serious problem with the reference
> information. If a reference doesn't contain a field that others do, nothing
> is added under that key, causing them to get out of sync. For example:
>
> REFERENCE
> TITLE foo
> TITLE bar
> AUTHOR wanner
>
> When this gets turned into a biojava sequence, TITLE has [foo, bar] and
> AUTHOR has [wanner] but there's no way to tell which one wanner goes with.
> Good luck
>
> Greg
>
>
>
>>-----Original Message-----
>>From: wanner.de@pg.com [mailto:wanner.de@pg.com]
>>Sent: Tuesday, May 14, 2002 11:41 AM
>>To: biojava-l@biojava.org
>>Subject: [Biojava-l] RefSeq bioJava parser problem
>>
>>
>>Hi,
>>
>>Appreciate the responses to the refSeq question. We've been able to put
>>together a reliable parser using the example in TestRefSeqPrt.
>>
>>Have an additional question now. Are there any utility methods within
>>bioJava that can be used to handle parsed values that are returned by
>>bioJava in list form?
>>
>>For example the following value was returned from bioJava for a sequence
>>annotation with key MEDLINE:
>>
>> [98127055, 99357812]
>>
>>
>>Another example is the value that was returned from bioJava
>>for a feature annotation with key db_xref:
>>
>> [LocusID:946, MIM:604405]
>>
>>bioJava does good work in accumulating the information
>>together and placing it under a specific annotation. Does
>>anyone know if there are methods to extract list members or
>>parameter/value pairs already available in bioJava?
>>
>>thx,
>>Dave
>>
>>
>>>>LOCUS NP_000221 167 aa linear PRI 29-JAN-2002
>>>>DEFINITION leptin precursor; leptin (murine obesity homolog); obesity; obesity
>>>> (murine homolog, leptin) [Homo sapiens].
>>>>ACCESSION NP_000221
>>>>PID g4557715
>>>>VERSION NP_000221.1 GI:4557715
>>>>DBSOURCE REFSEQ: accession NM_000230.1
>>>>KEYWORDS .
>>>>SOURCE human.
>>>> ORGANISM Homo sapiens
>>>> Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
>>>> Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
>>>>REFERENCE 1 (residues 1 to 167)
>>>> AUTHORS Friedman JM, Leibel RL, Siegel DS, Walsh J and Bahary N.
>>>> TITLE Molecular mapping of the mouse ob mutation
>>>> JOURNAL Genomics 11 (4), 1054-1062 (1991)
>>>> MEDLINE 92147101
>>>> PUBMED 1686014
>>>>REFERENCE 2 (residues 1 to 167)
>>>> AUTHORS Zhang Y, Proenca R, Maffei M, Barone M, Leopold L and Friedman JM.
>>>> TITLE Positional cloning of the mouse obese gene and its human homologue
>>>> JOURNAL Nature 372 (6505), 425-432 (1994)
>>>> MEDLINE 95075453
>>>> PUBMED 7984236
>>>> REMARK Erratum:[[published erratum appears in Nature 1995 Mar
>>>> 30;374(6521):479]]
>>>>REFERENCE 3 (residues 1 to 167)
>>>> AUTHORS Masuzaki H, Ogawa Y, Isse N, Satoh N, Okazaki T, Shigemoto M, Mori
>>>> K, Tamura N, Hosoda K, Yoshimasa Y et al.
>>>> TITLE Human obese gene expression. Adipocyte-specific expression and
>>>> regional differences in the adipose tissue
>>>> JOURNAL Diabetes 44 (7), 855-858 (1995)
>>>> MEDLINE 95309556
>>>> PUBMED 7789654
>>>>REFERENCE 4 (residues 1 to 167)
>>>> AUTHORS Green ED, Maffei M, Braden VV, Proenca R, DeSilva U, Zhang Y, Chua
>>>> SC Jr, Leibel RL, Weissenbach J and Friedman JM.
>>>> TITLE The human obese (OB) gene: RNA expression pattern and mapping on
>>>> the physical, cytogenetic, and genetic maps of chromosome 7
>>>> JOURNAL Genome Res. 5 (1), 5-12 (1995)
>>>> MEDLINE 96352898
>>>> PUBMED 8717050
>>>>REFERENCE 5 (residues 1 to 167)
>>>> AUTHORS Isse N, Ogawa Y, Tamura N, Masuzaki H, Mori K, Okazaki T, Satoh N,
>>>> Shigemoto M, Yoshimasa Y, Nishi S et al.
>>>> TITLE Structural organization and chromosomal assignment of the human
>>>> obese gene
>>>> JOURNAL J. Biol. Chem. 270 (46), 27728-27733 (1995)
>>>> MEDLINE 96070903
>>>> PUBMED 7499240
>>>>REFERENCE 6 (residues 1 to 167)
>>>> AUTHORS Gong,D.W., Bi,S., Pratley,R.E. and Weintraub,B.D.
>>>> TITLE Genomic structure and promoter analysis of the human obese gene
>>>> JOURNAL J. Biol. Chem. 271 (8), 3971-3974 (1996)
>>>> MEDLINE 96223958
>>>>REFERENCE 7 (residues 1 to 167)
>>>> AUTHORS Niki T, Mori H, Tamori Y, Kishimoto-Hashirmoto M, Ueno H, Araki S,
>>>> Masugi J, Sawant N, Majithia HR, Rais N et al.
>>>> TITLE Human obese gene: molecular screening in Japanese and Asian Indian
>>>> NIDDM patients associated with obesity
>>>> JOURNAL Diabetes 45 (5), 675-678 (1996)
>>>> MEDLINE 96198511
>>>> PUBMED 8621021
>>>>REFERENCE 8 (residues 1 to 167)
>>>> AUTHORS Comuzzie,A.G., Hixson,J.E., Almasy,L., Mitchell,B.D., Mahaney,M.C.,
>>>> Dyer,T.D., Stern,M.P., MacCluer,J.W. and Blangero,J.
>>>> TITLE A major quantitative trait locus determining serum leptin levels
>>>> and fat mass is located on human chromosome 2
>>>> JOURNAL Nat. Genet. 15 (3), 273-276 (1997)
>>>> MEDLINE 97207647
>>>> PUBMED 9054940
>>>>REFERENCE 9 (residues 1 to 167)
>>>> AUTHORS Clement,K., Vaisse,C., Lahlou,N., Cabrol,S., Pelloux,V.,
>>>> Cassuto,D., Gourmelen,M., Dina,C., Chambaz,J., Lacorte,J.M.,
>>>> Basdevant,A., Bougneres,P., Lebouc,Y., Froguel,P. and Guy-Grand,B.
>>>> TITLE A mutation in the human leptin receptor gene causes obesity and
>>>> pituitary dysfunction
>>>> JOURNAL Nature 392 (6674), 398-401 (1998)
>>>> MEDLINE 98196670
>>>> PUBMED 9537324
>>>>REFERENCE 10 (residues 1 to 167)
>>>> AUTHORS Friedman,J.M. and Halaas,J.L.
>>>> TITLE Leptin and the regulation of body weight in mammals
>>>> JOURNAL Nature 395 (6704), 763-770 (1998)
>>>> MEDLINE 99010835
>>>>COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The
>>>> reference sequence was derived from U43653.1.
>>>> Summary: This gene is similar to the mouse obesity gene (ob). The
>>>> protein encoded by this gene is secreted by white adipocytes. In
>>>> the mouse study, mutations in this gene are linked to severe and
>>>> morbid obesity.
>>>>FEATURES Location/Qualifiers
>>>> source 1..167
>>>> /organism="Homo sapiens"
>>>> /db_xref="taxon:9606"
>>>> /chromosome="7"
>>>> /map="7q31.3"
>>>> Protein 1..167
>>>> /product="leptin precursor"
>>>> /note="leptin (murine obesity homolog); obesity (murine
>>>> homolog, leptin)"
>>>> sig_peptide 1..21
>>>> Region 22..167
>>>> /region_name="Leptin"
>>>> /note="Leptin"
>>>> /db_xref="CDD:pfam02024"
>>>> mat_peptide 22..167
>>>> /product="leptin"
>>>> CDS 1..167
>>>> /gene="LEP"
>>>> /coded_by="NM_000230.1:57..560"
>>>> /db_xref="LocusID:3952"
>>>> /db_xref="MIM:164160"
>>>>ORIGIN
>>>> 1 mhwgtlcgfl wlwpylfyvq avpiqkvqdd tktliktivt rindishtqs vsskqkvtgl
>>>> 61 dfipglhpil tlskmdqtla vyqqiltsmp srnviqisnd lenlrdllhv lafskschlp
>>>> 121 wasgletlds lggvleasgy stevvalsrl qgslqdmlwq ldlspgc
>>>>//
>>>>
>>>>_______________________________________________
>>>>Biojava-l mailing list - Biojava-l@biojava.org
>>>>http://biojava.org/mailman/listinfo/biojava-l
>>>>