[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
mark.schreiber at novartis.com
Wed Jun 7 09:09:27 UTC 2006
Presumably the XML it produces should validate against the DTD? It should
also parse anything that validates against the DTD. I think that would be
the baseline for the behaviour of the parser.
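
A minimal sketch of that baseline check, using only the JDK's SAX parser with DTD validation switched on. The class name DtdCheck, the method validate and the toy DOCTYPE are made up for illustration and are not part of BioJava; the toy DTD is a cut-down stand-in, not the real INSDSeq DTD.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of BioJava.
class DtdCheck {

    // Toy stand-in for a DTD (assumption): NOT the real INSDSeq DTD.
    static final String TOY_DOCTYPE =
        "<!DOCTYPE INSDReference [\n"
        + "<!ELEMENT INSDReference (INSDReference_reference, INSDReference_position?)>\n"
        + "<!ELEMENT INSDReference_reference (#PCDATA)>\n"
        + "<!ELEMENT INSDReference_position (#PCDATA)>\n"
        + "]>\n";

    /** Parses xml with DTD validation on and returns the validity errors found. */
    static List<String> validate(String xml) {
        final List<String> problems = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // check against the DTD named in the DOCTYPE
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        // Validity errors are recoverable: record them and carry on.
                        problems.add("line " + e.getLineNumber() + ": " + e.getMessage());
                    }
                });
        } catch (Exception e) {
            // Not well-formed, or the parser could not be set up.
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return problems;
    }

    public static void main(String[] args) {
        String doc = TOY_DOCTYPE
            + "<INSDReference>"
            + "<INSDReference_reference>1</INSDReference_reference>"
            + "<INSDReference_position>1..16732</INSDReference_position>"
            + "</INSDReference>";
        // An empty list means the document validated against the toy DTD.
        System.out.println(validate(doc));
    }
}
```

An empty result means the document validated. The same check could be run on whatever the writer emits, with the DOCTYPE pointing at the real DTD instead of the toy one.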
Richard Holland <richard.holland at ebi.ac.uk>
Sent by: biojava-l-bounces at lists.open-bio.org
06/07/2006 05:01 PM
To: Seth Johnson <johnson.biotech at gmail.com>
cc: biojava-l at lists.open-bio.org, (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
OK, I've updated INSDseqFormat to 1.4, or my interpretation of it based
on what the guys next door told me. Please let me know if you have
trouble running the XML it produces through any other parsers that can
read it, or if it throws a wobbly whilst reading stuff you are 100% sure
is valid.
cheers,
Richard
On Mon, 2006-06-05 at 12:28 -0400, Seth Johnson wrote:
> I agree with you on that one. However, the problem might be a little
> deeper. The same '?' characters appear in the INSDseq format, enclosed in
> <INSDReference_reference> tags, and cause the following exception.
> This tells me that the '?' are actually values that are being
> incorrectly parsed. Further examination of the .dtd reveals that
> INSDseqFormat.java is tailored towards INSDSeq v. 1.3, whereas the
> files I obtain are in INSDSeq v. 1.4 (which, among other things,
> contains a new tag, <INSDReference_position>). Here are links to both
> .dtd's:
>
> http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.3.dtd.txt
>
> http://www.ebi.ac.uk/embl/Documentation/DTD/INSDC_V1.4.dtd.txt
>
> I think it might be worth accommodating the changes in the INSDseq
> format, though I'm not sure how that would affect the '?' in Genbank.
>
> Seth
>
> ======================
> org.biojava.bio.BioException: Could not read sequence
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
>     at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> Caused by: org.biojava.bio.seq.io.ParseException: org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
>     at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:250)
>     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
>     ... 1 more
> Caused by: org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
>     at org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:901)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
>     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
>     at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97)
>     at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246)
>     ... 2 more
> Java Result: -1
> ======================
>
> ~~~~~~~~~~~~~~~~~~~~~~
> <INSDSeq_references>
> <INSDReference>
> <INSDReference_reference>?</INSDReference_reference>
> <INSDReference_position>1..16732</INSDReference_position>
> <INSDReference_authors>
> <INSDAuthor>Bjornerfeldt,S.</INSDAuthor>
> <INSDAuthor>Webster,M.T.</INSDAuthor>
> <INSDAuthor>Vila,C.</INSDAuthor>
> </INSDReference_authors>
> <INSDReference_title>Relaxation of Selective Constraint on Dog
> Mitochondrial DNA Following Domestication</INSDReference_title>
> <INSDReference_journal>Unpublished</INSDReference_journal>
> </INSDReference>
> <INSDReference>
> <INSDReference_reference>?</INSDReference_reference>
> <INSDReference_position>1..16732</INSDReference_position>
> <INSDReference_authors>
> <INSDAuthor>Bjornerfeldt,S.</INSDAuthor>
> <INSDAuthor>Webster,M.T.</INSDAuthor>
> <INSDAuthor>Vila,C.</INSDAuthor>
> </INSDReference_authors>
> <INSDReference_journal>Submitted (06-APR-2006) to the
> EMBL/GenBank/DDBJ databases. Evolutionary Biology, Evolutionary
> Biology, Norbyvagen 18D, Uppsala 752 36,
> Sweden</INSDReference_journal>
> </INSDReference>
> </INSDSeq_references>
> ~~~~~~~~~~~~~~~~~~~~~~
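
To see how widespread such placeholder values are in a batch of INSDSeq XML before parsing it, a small SAX scan along the following lines may help. This is only a diagnostic sketch, not the fix that went into INSDseqFormat. The class and method names are hypothetical, and the rule that a usable reference value starts with a digit is an assumption made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic for illustration; not part of BioJava.
class ReferenceScan {

    /** Returns every INSDReference_reference value that does not start with a digit. */
    static List<String> oddReferenceValues(String xml) {
        final List<String> odd = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder text; // non-null only inside the element of interest

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("INSDReference_reference".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("INSDReference_reference".equals(qName)) {
                    String value = text.toString().trim();
                    // Assumption: a usable value begins with the reference number.
                    if (value.isEmpty() || !Character.isDigit(value.charAt(0))) {
                        odd.add(value);
                    }
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return odd;
    }
}
```

Run over the excerpt quoted above, such a scan would flag both '?' entries, which supports the reading that they are values rather than stray markup.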
>
> On 6/5/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > Hmmm... interesting. I _could_ put in a special case that ignores the
> > question marks, but that wouldn't be 'nice' really - this is more of a
> > problem with the program that is producing the Genbank files than a
> > problem with the parser trying to read them. '?' is not a valid tag in
> > the official Genbank format, and has no meaning attached to it that I
> > can work out, so I'm reluctant to make the parser recognise it.
> >
> > I'd suggest you contact the people who write the software you are using
> > to produce the Genbank files and ask them if they could stick to the
> > rules!
> >
> > In the meantime you could work around the problem by stripping the
> > question marks in some kind of pre-processor before passing it onto
> > BioJavaX for parsing.
> >
> > cheers,
> > Richard
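
The pre-processing workaround suggested above might look roughly like the following sketch. It uses only the JDK, and the class and method names (QuestionMarkFilter, strip) are invented for illustration; they are not part of BioJavaX. It assumes the stray markers always sit on a line of their own, as in the records quoted in this thread, so a '?' inside a qualifier value is left alone.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical pre-processor for illustration; not part of BioJavaX.
class QuestionMarkFilter {

    /** Returns the record text with every line that consists solely of '?' removed. */
    static String strip(String genbankText) {
        StringBuilder out = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new StringReader(genbankText))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Assumption: a stray marker is a line holding nothing but '?'.
                if (line.trim().equals("?")) {
                    continue;
                }
                out.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String record = "KEYWORDS    .\n?\nFEATURES             Location/Qualifiers\n?\n//\n";
        System.out.print(strip(record));
    }
}
```

The cleaned text can then be wrapped in a BufferedReader over a StringReader and handed to the existing parsing code in place of the original file reader.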
> >
> > On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote:
> > > Removing '?' (or several of them in my case) avoids the following exception:
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > org.biojava.bio.BioException: Could not read sequence
> > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > >     at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> > > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957
> > >     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > >     ... 1 more
> > > Java Result: -1
> > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > I don't know where that previous tokenization problem came from, since
> > > I can no longer reproduce it. This time it's more or less
> > > straightforward.
> > > Here's the original file with question marks:
> > > ============================
> > > LOCUS       DQ415957                1437 bp    mRNA    linear   VRT 01-JUN-2006
> > > DEFINITION  Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA,
> > >             complete cds.
> > > ACCESSION   DQ415957
> > > VERSION     DQ415957.1  GI:89513612
> > > KEYWORDS    .
> > > SOURCE      Unknown.
> > >   ORGANISM  Unknown.
> > >             Unclassified.
> > > ?
> > > ?
> > > FEATURES             Location/Qualifiers
> > > ?
> > >      gene            1..1437
> > >                      /gene="cmg2a"
> > >      CDS             1..1437
> > >                      /gene="cmg2a"
> > >                      /note="cell surface receptor; similar to anthrax toxin
> > >                      receptor 2 (ANTXR2, ATR2, CMG2)"
> > >                      /codon_start=1
> > >                      /product="capillary morphogenesis protein 2A"
> > >                      /protein_id="ABD74633.1"
> > >                      /db_xref="GI:89513613"
> > >                      /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS
> > >                      GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS
> > >                      EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY
> > >                      GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS
> > >                      SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT
> > >                      VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC
> > >                      CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS
> > >                      TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL
> > >                      RRQYDRVSVMRPTSADKGRCMNFSRTQH"
> > > ORIGIN
> > >         1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc
> > >        61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg
> > >       121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat
> > >       181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga
> > >       241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc
> > >       301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact
> > >       361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga
> > >       421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat
> > >       481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg
> > >       541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc
> > >       601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc
> > >       661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa
> > >       721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc
> > >       781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa
> > >       841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc
> > >       901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc
> > >       961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt cgttattaaa
> > >      1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga cccggaaccc
> > >      1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc
> > >      1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag
> > >      1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag
> > >      1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga
> > >      1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg
> > >      1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa
> > > //
> > >
> > > ============================
> > >
> > >
> > > On 6/5/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > > Hi again.
> > > >
> > > > Could you remove the offending question mark from the GenBank file and
> > > > try it again to see if that fixes it? The parser should just ignore it
> > > > but apparently does not. The error looks weird to me because the
> > > > tokenization for a DNA GenBank file _does_ contain the letter 't'! Not
> > > > sure what's going on here.
> > > ...
> > > >
> > > > cheers,
> > > > Richard
> > > >
> > > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote:
> > > > > Hello again Richard,
> > > > >
> > > > > No sooner had I reported that the last parsing exception was fixed
> > > > > than another one came up with the Genbank format:
> > > > > --------------------------------------
> > > > > org.biojava.bio.seq.io.ParseException: DQ431065
> > > > > org.biojava.bio.BioException: Could not read sequence
> > > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > >     at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151)
> > > > >     at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246)
> > > > >     at exonhit.parsers.GenBankParser.main(GenBankParser.java:326)
> > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065
> > > > >     at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > > > >     at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > >     ... 3 more
> > > > > org.biojava.bio.seq.io.ParseException:
> > > > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization
> > > > > doesn't contain character: 't'
> > > > > ----------------------------------------
> > > > > The Genbank file that caused it is as follows:
> > > > > =========================================
> > > > > LOCUS       DQ431065                 425 bp    DNA     linear   INV 01-JUN-2006
> > > > > DEFINITION  Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial
> > > > >             sequence; mitochondrial.
> > > > > ACCESSION   DQ431065
> > > > > VERSION     DQ431065.1  GI:90102206
> > > > > KEYWORDS    .
> > > > > SOURCE      Vaccinium corymbosum
> > > > >   ORGANISM  Vaccinium corymbosum
> > > > >             Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
> > > > >             Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
> > > > >             asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae;
> > > > >             Vaccinium.
> > > > > ?
> > > > > REFERENCE   2  (bases 1 to 425)
> > > > >   AUTHORS   Naik,L.D. and Rowland,L.J.
> > > > >   TITLE     Expressed Sequence Tags of cDNA clones from subtracted library of
> > > > >             Vaccinium corymbosum
> > > > >   JOURNAL   Unpublished (2005)
> > > > > FEATURES             Location/Qualifiers
> > > > >      source          1..425
> > > > >                      /organism="Vaccinium corymbosum"
> > > > >                      /mol_type="genomic DNA"
> > > > >                      /cultivar="Bluecrop"
> > > > >                      /db_xref="taxon:69266"
> > > > >                      /tissue_type="Flower buds"
> > > > >                      /clone_lib="Subtracted cDNA library of Vaccinium
> > > > >                      corymbosum"
> > > > >                      /dev_stage="399 hour chill unit exposure"
> > > > >                      /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I"
> > > > >      rRNA            <1..>425
> > > > >                      /product="16S ribosomal RNA"
> > > > > ORIGIN
> > > > >         1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac
> > > > >        61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt
> > > > >       121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt
> > > > >       181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta
> > > > >       241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat
> > > > >       301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta
> > > > >       361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag
> > > > >       421 cgtaa
> > > > > //
> > > > > ==================================
> > > > > I think it's the presence of the '?' at the beginning of the line?!?!
> > > > > I'm not sure whether the information that was supposed to be present
> > > > > instead of those question marks is absent from the original ASN.1
> > > > > batch file or whether it's a bug in the NCBI ASN2GO software. It looks
> > > > > to me that the former is the case, since the file from the NCBI website
> > > > > contains much more information than the batch file. Just bringing this
> > > > > to everyone's attention.
> > > > >
> > > > >
> > > > > --
> > > > > Best Regards,
> > > > >
> > > > >
> > > > > Seth Johnson
> > > > > Senior Bioinformatics Associate
> > > > >
> > > > > Ph: (202) 470-0900
> > > > > Fx: (775) 251-0358
> > > > >
> > > > > On 6/2/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > > > > Hi Seth.
> > > > > >
> > > > > > Your second point, about the authors string not being read correctly in
> > > > > > Genbank format, has been fixed (or should have been if I got the code
> > > > > > right!). Could you check the latest version of biojava-live out of CVS
> > > > > > and give it another go? Basically the parser did not recognise the
> > > > > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > > > > NCBI, which is what I based the parser on.
> > > > > ...
> > > > > >
> > > > > > cheers,
> > > > > > Richard
> > > > > >
> > > > > >
> > > > --
> > > > Richard Holland (BioMart Team)
> > > > EMBL-EBI
> > > > Wellcome Trust Genome Campus
> > > > Hinxton
> > > > Cambridge CB10 1SD
> > > > UNITED KINGDOM
> > > > Tel: +44-(0)1223-494416
_______________________________________________
Biojava-l mailing list - Biojava-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biojava-l