[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files

Wed Jun 7 11:56:04 UTC 2006

That'd be nice, except the DTD has bugs in it! I've pointed this out to
them already but no fixes have been made yet.
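
If anyone wants to try validating the output anyway, something along
these lines should do it. This is only a rough sketch using the plain
JAXP SAX API, not anything from BioJava: the class name DtdCheck and
its validate method are made up for illustration. It assumes the XML
carries a DOCTYPE declaration that embeds or points at the DTD (for an
external DTD the parser will try to fetch it).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of BioJava.
class DtdCheck {

    /** Parses the XML with DTD validation on and returns any problems found. */
    static List<String> validate(String xml) {
        final List<String> problems = new ArrayList<String>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // check against the DTD named in the DOCTYPE
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    // Validity errors arrive here; the default handler silently ignores them.
                    problems.add("line " + e.getLineNumber() + ": " + e.getMessage());
                }
            });
        } catch (Exception e) {
            // Malformed XML, I/O trouble, parser configuration problems.
            problems.add("fatal: " + e.getMessage());
        }
        return problems;
    }

    public static void main(String[] args) {
        // Toy DTD embedded in the document, purely for demonstration.
        String dtd = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b (#PCDATA)>]>";
        System.out.println(validate(dtd + "<a><b>x</b></a>"));
        System.out.println(validate(dtd + "<a><c/></a>"));
    }
}
```

An empty list means the document validated. Of course, with the DTD
itself being buggy, a non-empty list doesn't necessarily mean the XML
is at fault.
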

On Wed, 2006-06-07 at 17:09 +0800, mark.schreiber at novartis.com wrote:
> Presumably the XML it produces should validate against the DTD? It should
> also parse anything that validates against the DTD. I think that would be
> the baseline for the behaviour of the parser.
> 
> Richard Holland <richard.holland at ebi.ac.uk>
> Sent by: biojava-l-bounces at lists.open-bio.org
> 06/07/2006 05:01 PM
> 
>  
>         To:     Seth Johnson <johnson.biotech at gmail.com>
>         cc:     biojava-l at lists.open-bio.org, (bcc: Mark Schreiber/GP/Novartis)
>         Subject:        Re: [Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
> 
> 
> OK, I've updated INSDseqFormat to 1.4, or my interpretation of it based
> on what the guys next door told me. Please let me know if you have
> trouble running the XML it produces through any other parsers that can
> read it, or if it throws a wobbly whilst reading stuff you are 100% sure
> is valid.
> 
> cheers,
> Richard
> 
> On Mon, 2006-06-05 at 12:28 -0400, Seth Johnson wrote:
> > I agree with you on that one.  However, the problem might be a little
> > deeper.  The same '?' characters appear in the INSDseq format, inside
> > <INSDReference_reference> tags, and cause the following exception.
> > This tells me that the '?' are actually values that are being
> > incorrectly parsed.  Further examination of the DTD reveals that
> > INSDseqFormat.java is tailored towards INSDSeq v. 1.3, whereas the
> > files I obtain are in INSDSeq v. 1.4 (which, among other things,
> > contains a new tag, <INSDReference_position>).  Here are links to both
> > DTDs:
> > 
> > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDSeq_v1.3.dtd.txt
> > 
> > http://www.ebi.ac.uk/embl/Documentation/DTD/INSDC_V1.4.dtd.txt
> > 
> > I think it might be worth accommodating the changes in the INSDseq
> > format, though I'm not sure how that would affect the '?' in Genbank.
> > 
> > Seth
> > 
> > ======================
> > org.biojava.bio.BioException: Could not read sequence
> >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> > Caused by: org.biojava.bio.seq.io.ParseException: org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
> >         at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:250)
> >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> >         ... 1 more
> > Caused by: org.biojava.bio.seq.io.ParseException: Bad reference line found: ?
> >         at org.biojavax.bio.seq.io.INSDseqFormat$INSDseqHandler.endElement(INSDseqFormat.java:901)
> >         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:633)
> >         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1241)
> >         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(XMLDocumentFragmentScannerImpl.java:1685)
> >         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:368)
> >         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:834)
> >         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
> >         at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:148)
> >         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1242)
> >         at javax.xml.parsers.SAXParser.parse(SAXParser.java:375)
> >         at org.biojavax.utils.XMLTools.readXMLChunk(XMLTools.java:97)
> >         at org.biojavax.bio.seq.io.INSDseqFormat.readRichSequence(INSDseqFormat.java:246)
> >         ... 2 more
> > Java Result: -1
> > ======================
> > 
> > ~~~~~~~~~~~~~~~~~~~~~~
> > <INSDSeq_references>
> >     <INSDReference>
> >       <INSDReference_reference>?</INSDReference_reference>
> >       <INSDReference_position>1..16732</INSDReference_position>
> >       <INSDReference_authors>
> >         <INSDAuthor>Bjornerfeldt,S.</INSDAuthor>
> >         <INSDAuthor>Webster,M.T.</INSDAuthor>
> >         <INSDAuthor>Vila,C.</INSDAuthor>
> >       </INSDReference_authors>
> >       <INSDReference_title>Relaxation of Selective Constraint on Dog
> > Mitochondrial DNA Following Domestication</INSDReference_title>
> >       <INSDReference_journal>Unpublished</INSDReference_journal>
> >     </INSDReference>
> >     <INSDReference>
> >       <INSDReference_reference>?</INSDReference_reference>
> >       <INSDReference_position>1..16732</INSDReference_position>
> >       <INSDReference_authors>
> >         <INSDAuthor>Bjornerfeldt,S.</INSDAuthor>
> >         <INSDAuthor>Webster,M.T.</INSDAuthor>
> >         <INSDAuthor>Vila,C.</INSDAuthor>
> >       </INSDReference_authors>
> >       <INSDReference_journal>Submitted (06-APR-2006) to the
> > EMBL/GenBank/DDBJ databases. Evolutionary Biology, Evolutionary
> > Biology, Norbyvagen 18D, Uppsala 752 36,
> > Sweden</INSDReference_journal>
> >     </INSDReference>
> >   </INSDSeq_references>
> > ~~~~~~~~~~~~~~~~~~~~~~
> > 
> > On 6/5/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > Hmmm... interesting. I _could_ put in a special case that ignores the
> > > question marks, but that wouldn't be 'nice' really - this is more of a
> > > problem with the program that is producing the Genbank files than a
> > > problem with the parser trying to read them. '?' is not a valid tag in
> > > the official Genbank format, and has no meaning attached to it that I
> > > can work out, so I'm reluctant to make the parser recognise it.
> > >
> > > I'd suggest you contact the people who write the software you are using
> > > to produce the Genbank files and ask them if they could stick to the
> > > rules!
> > >
> > > In the meantime you could work around the problem by stripping the
> > > question marks in some kind of pre-processor before passing it onto
> > > BioJavaX for parsing.
> > >
> > > cheers,
> > > Richard
> > >
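
A pre-processor along those lines could be as simple as the sketch
below. It is plain Java with no BioJava calls; the class name
QuestionMarkFilter and the method strip are made up for illustration,
and it assumes the stray '?' always sits alone on its own line, as in
the Genbank records quoted in this thread. It does nothing for the '?'
values inside <INSDReference_reference> in the INSDseq XML.

```java
// Hypothetical helper, not part of BioJava.
class QuestionMarkFilter {

    /**
     * Returns the flat-file text with every line that consists solely of
     * a '?' (ignoring surrounding whitespace) removed. All other lines
     * are kept unchanged, each terminated with '\n'.
     */
    static String strip(String flatFile) {
        if (flatFile.length() == 0) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        for (String line : flatFile.split("\\r?\\n")) {
            if (line.trim().equals("?")) {
                continue; // drop the stray marker line
            }
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String record = "KEYWORDS    .\n?\n?\n"
                + "FEATURES             Location/Qualifiers\n?\n//\n";
        System.out.print(strip(record));
    }
}
```

The cleaned text can then be wrapped in a BufferedReader over a
StringReader and handed to the parser in the usual way. For the large
daily update files, a line-by-line streaming version would be kinder
on memory than holding a whole file in one String.
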
> > > On Mon, 2006-06-05 at 11:39 -0400, Seth Johnson wrote:
> > > > Removing '?' (or several of them in my case) avoids the following exception:
> > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > org.biojava.bio.BioException: Could not read sequence
> > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
> > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ415957
> > > >         at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > >         ... 1 more
> > > > Java Result: -1
> > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > > > I don't know where that previous tokenization problem came from, since
> > > > I can no longer reproduce it.  This time it's more or less
> > > > straightforward.
> > > > Here's the original file with question marks:
> > > > ============================
> > > > LOCUS       DQ415957                1437 bp    mRNA    linear   VRT 01-JUN-2006
> > > > DEFINITION  Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA,
> > > >             complete cds.
> > > > ACCESSION   DQ415957
> > > > VERSION     DQ415957.1  GI:89513612
> > > > KEYWORDS    .
> > > > SOURCE      Unknown.
> > > >   ORGANISM  Unknown.
> > > >             Unclassified.
> > > > ?
> > > > ?
> > > > FEATURES             Location/Qualifiers
> > > > ?
> > > >      gene            1..1437
> > > >                      /gene="cmg2a"
> > > >      CDS             1..1437
> > > >                      /gene="cmg2a"
> > > >                      /note="cell surface receptor; similar to anthrax toxin
> > > >                      receptor 2 (ANTXR2, ATR2, CMG2)"
> > > >                      /codon_start=1
> > > >                      /product="capillary morphogenesis protein 2A"
> > > >                      /protein_id="ABD74633.1"
> > > >                      /db_xref="GI:89513613"
> > > > /translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS
> > > > GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS
> > > > EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY
> > > > GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS
> > > > SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT
> > > > VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC
> > > > CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS
> > > > TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL
> > > >                      RRQYDRVSVMRPTSADKGRCMNFSRTQH"
> > > > ORIGIN
> > > >         1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc
> > > >        61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg
> > > >       121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat
> > > >       181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga
> > > >       241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc
> > > >       301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact
> > > >       361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga
> > > >       421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat
> > > >       481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg
> > > >       541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc
> > > >       601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc
> > > >       661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa
> > > >       721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc
> > > >       781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa
> > > >       841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc
> > > >       901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc
> > > >       961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt cgttattaaa
> > > >      1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga cccggaaccc
> > > >      1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc
> > > >      1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag
> > > >      1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag
> > > >      1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga
> > > >      1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg
> > > >      1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa
> > > > //
> > > >
> > > > ============================
> > > >
> > > >
> > > > On 6/5/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > > > Hi again.
> > > > >
> > > > > Could you remove the offending question mark from the GenBank file and
> > > > > try it again to see if that fixes it? The parser should just ignore it,
> > > > > but apparently it does not. The error looks weird to me because the
> > > > > tokenization for a DNA GenBank file _does_ contain the letter 't'! Not
> > > > > sure what's going on here.
> > > > ...
> > > > >
> > > > > cheers,
> > > > > Richard
> > > > >
> > > > > On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote:
> > > > > > Hello again Richard,
> > > > > >
> > > > > > No sooner had I mentioned the fix for the last parsing exception than
> > > > > > another one came up with the Genbank format:
> > > > > > --------------------------------------
> > > > > > org.biojava.bio.seq.io.ParseException: DQ431065
> > > > > > org.biojava.bio.BioException: Could not read sequence
> > > > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > > > > >         at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151)
> > > > > >         at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246)
> > > > > >         at exonhit.parsers.GenBankParser.main(GenBankParser.java:326)
> > > > > > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065
> > > > > >         at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > > > > >         at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > > > > >         ... 3 more
> > > > > > org.biojava.bio.seq.io.ParseException:
> > > > > > org.biojava.bio.symbol.IllegalSymbolException: This tokenization
> > > > > > doesn't contain character: 't'
> > > > > > ----------------------------------------
> > > > > > The Genbank file that caused it is as follows:
> > > > > > =========================================
> > > > > > LOCUS       DQ431065                 425 bp    DNA     linear   INV 01-JUN-2006
> > > > > > DEFINITION  Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial
> > > > > >             sequence; mitochondrial.
> > > > > > ACCESSION   DQ431065
> > > > > > VERSION     DQ431065.1  GI:90102206
> > > > > > KEYWORDS    .
> > > > > > SOURCE      Vaccinium corymbosum
> > > > > >   ORGANISM  Vaccinium corymbosum
> > > > > >             Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
> > > > > >             Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
> > > > > >             asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae;
> > > > > >             Vaccinium.
> > > > > > ?
> > > > > > REFERENCE   2  (bases 1 to 425)
> > > > > >   AUTHORS   Naik,L.D. and Rowland,L.J.
> > > > > >   TITLE     Expressed Sequence Tags of cDNA clones from subtracted library of
> > > > > >             Vaccinium corymbosum
> > > > > >   JOURNAL   Unpublished (2005)
> > > > > > FEATURES             Location/Qualifiers
> > > > > >      source          1..425
> > > > > >                      /organism="Vaccinium corymbosum"
> > > > > >                      /mol_type="genomic DNA"
> > > > > >                      /cultivar="Bluecrop"
> > > > > >                      /db_xref="taxon:69266"
> > > > > >                      /tissue_type="Flower buds"
> > > > > >                      /clone_lib="Subtracted cDNA library of Vaccinium
> > > > > >                      corymbosum"
> > > > > >                      /dev_stage="399 hour chill unit exposure"
> > > > > >                      /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I"
> > > > > >      rRNA            <1..>425
> > > > > >                      /product="16S ribosomal RNA"
> > > > > > ORIGIN
> > > > > >         1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac
> > > > > >        61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt
> > > > > >       121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt
> > > > > >       181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta
> > > > > >       241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat
> > > > > >       301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta
> > > > > >       361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag
> > > > > >       421 cgtaa
> > > > > > //
> > > > > > ==================================
> > > > > > I think it's the presence of the '?' at the beginning of the line?!?!
> > > > > > I'm not sure whether the information that was supposed to be present
> > > > > > instead of those question marks is absent from the original ASN.1
> > > > > > batch file or whether it's a bug in the NCBI ASN2GO software.  It looks
> > > > > > to me like the former is the case, since the file from the NCBI website
> > > > > > contains much more information than the batch file. Just bringing this
> > > > > > to everyone's attention.
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best Regards,
> > > > > >
> > > > > >
> > > > > > Seth Johnson
> > > > > > Senior Bioinformatics Associate
> > > > > >
> > > > > > Ph: (202) 470-0900
> > > > > > Fx: (775) 251-0358
> > > > > >
> > > > > > On 6/2/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > > > > > Hi Seth.
> > > > > > >
> > > > > > > Your second point, about the authors string not being read correctly in
> > > > > > > Genbank format, has been fixed (or should have been if I got the code
> > > > > > > right!). Could you check the latest version of biojava-live out of CVS
> > > > > > > and give it another go? Basically the parser did not recognise the
> > > > > > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > > > > > NCBI, which is what I based the parser on.
> > > > > > ...
> > > > > > >
> > > > > > > cheers,
> > > > > > > Richard
> > > > > > >
> > > > > > >
> > > > > --
> > > > > Richard Holland (BioMart Team)
> > > > > EMBL-EBI
> > > > > Wellcome Trust Genome Campus
> > > > > Hinxton
> > > > > Cambridge CB10 1SD
> > > > > UNITED KINGDOM
> > > > > Tel: +44-(0)1223-494416
> > > > >
> > > > >
> > > >
> > > >
> > > --
> > > Richard Holland (BioMart Team)
> > > EMBL-EBI
> > > Wellcome Trust Genome Campus
> > > Hinxton
> > > Cambridge CB10 1SD
> > > UNITED KINGDOM
> > > Tel: +44-(0)1223-494416
> > >
> > >
> > 
> > 
-- 
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416