[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
Seth Johnson
johnson.biotech at gmail.com
Mon Jun 5 15:39:40 UTC 2006
Removing the '?' lines (there were several in my case) avoids the following exception:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
org.biojava.bio.BioException: Could not read sequence
at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
at exonhit.parsers.GenBankParser.main(GenBankParser.java:348)
Caused by: org.biojava.bio.seq.io.ParseException: DQ415957
at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
... 1 more
Java Result: -1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I don't know where the earlier tokenization error came from, since
I can no longer reproduce it. This time the failure is more or less
straightforward.
Here's the original file with question marks:
============================
LOCUS DQ415957 1437 bp mRNA linear VRT 01-JUN-2006
DEFINITION Danio rerio capillary morphogenesis protein 2A (cmg2a) mRNA,
complete cds.
ACCESSION DQ415957
VERSION DQ415957.1 GI:89513612
KEYWORDS .
SOURCE Unknown.
ORGANISM Unknown.
Unclassified.
?
?
FEATURES Location/Qualifiers
?
gene 1..1437
/gene="cmg2a"
CDS 1..1437
/gene="cmg2a"
/note="cell surface receptor; similar to anthrax toxin
receptor 2 (ANTXR2, ATR2, CMG2)"
/codon_start=1
/product="capillary morphogenesis protein 2A"
/protein_id="ABD74633.1"
/db_xref="GI:89513613"
/translation="MTKENLWSVATTATLFFCLCFSSFKAETPSCHGAYDLYFVLDRS
GSVSTDWSEIYDFVKNLTERFVSPNLRVSFIVFSSRAEIVLPLTGDRSEINKGLKTLS
EVNPAGETYMHEGIKLATEQMKKEPKKSSSIIVALTDGKLETYIHQLTIDEADSARKY
GARVYCVGVKDFDEEQLADVADSKEQVFPVKGGFQALKGIVNSILKQSCTEILTVEPS
SVCVNQSFDIVLRGNGFAVGRQTEGVICSFIVDGVTYKQKPTKVKIDYILCPAPVLYT
VGQQMEVLISLNSGTSYITSAFIITASSCSDGTVVAIVFLVLFLLLALALMWWFWPLC
CTVVIKDPPPQRPPPPPPKLEPDPEPKKKWPTVDASYYGGRGAGGIKRMEVRWGEKGS
TEEGARLEMAKNAVVSIQEESEEPMVKKPRAPAQTCHQSESKWYTPIRGRLDALWALL
RRQYDRVSVMRPTSADKGRCMNFSRTQH"
ORIGIN
1 atgacaaagg aaaatctctg gagcgtggca accacggcga ctcttttctt ctgtttatgc
61 ttttcatctt ttaaagcgga aaccccatct tgtcatggtg cctacgacct gtactttgtg
121 ttggaccgat ctggaagtgt ttcgactgac tggagtgaaa tctatgactt tgtcaaaaat
181 cttacagaga gatttgtgag tccaaatctg cgagtgtcct tcattgtttt ttcatcaaga
241 gcagagattg tgttaccgct cactggagac aggtcagaaa ttaataaagg cctgaagacc
301 ttaagtgagg tcaatccagc tggagaaaca tacatgcatg aaggaattaa attggcaact
361 gaacaaatga aaaaagagcc taaaaagtcc tctagtatta ttgtggcctt gactgatgga
421 aagcttgaaa cgtatatcca tcaactcact attgacgagg ctgattcagc aaggaagtat
481 ggggctcgtg tgtactgtgt tggtgtaaaa gactttgatg aagaacagct agccgatgtg
541 gctgattcca aggagcaagt gttcccagtc aaaggaggct ttcaggctct caaaggcatc
601 gttaactcga tcctcaagca atcatgcacc gaaatcctaa cagtggaacc gtccagcgtc
661 tgcgtgaacc agtcctttga cattgttttg agagggaacg ggttcgcagt ggggagacaa
721 acagaaggag tcatctgcag tttcatagtg gatggagtta cttacaaaca aaaaccaacc
781 aaagtgaaga ttgactacat cctatgtcct gctccagtgc tgtatacagt tggacagcaa
841 atggaggttc tgatcagttt gaacagtgga acatcatata tcaccagtgc tttcatcatc
901 actgcctctt catgttcgga cggcacagtg gtggccattg tgttcttggt gctttttctc
961 ctgttggctt tggctctgat gtggtggttc tggcctctat gctgcactgt cgttattaaa
1021 gacccacctc cacaaagacc tcctccacct ccacctaagc tagagccaga cccggaaccc
1081 aagaagaagt ggccaactgt ggatgcatct tactatgggg gaagaggagc tggtggaatc
1141 aaacgcatgg aggtccgttg gggagaaaaa gggtctacag aggaaggtgc aagactagag
1201 atggctaaga atgcagtagt gtcaatacaa gaggaatcag aagaacccat ggtcaaaaag
1261 ccaagagcac ctgcacaaac atgccatcaa tctgaatcca agtggtatac accaatcaga
1321 ggccgtcttg acgcactgtg ggctcttttg cggcggcaat atgaccgagt ttcagttatg
1381 cgaccaactt ctgcagataa gggtcgctgt atgaatttca gtcgcacgca gcattaa
//
============================
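A minimal sketch of the workaround described above, assuming the stray lines contain nothing but '?' characters and optional whitespace. The class and method names (QuestionMarkFilter, isPlaceholderLine, stripPlaceholderLines) are hypothetical and not part of BioJava; the idea is only to clean the flat file before handing it to the GenBank reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a BioJava class: drops lines made up only of '?'.
public class QuestionMarkFilter {

    /** True if the line holds nothing but '?' characters and whitespace. */
    public static boolean isPlaceholderLine(String line) {
        String trimmed = line.trim();
        return !trimmed.isEmpty() && trimmed.chars().allMatch(c -> c == '?');
    }

    /** Returns a copy of the input with every placeholder line removed. */
    public static List<String> stripPlaceholderLines(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (!isPlaceholderLine(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Fragment modelled on the DQ415957 record; spacing is illustrative.
        List<String> record = Arrays.asList(
            "  ORGANISM  Unknown.",
            "            Unclassified.",
            "?",
            "?",
            "FEATURES             Location/Qualifiers",
            "?",
            "     gene            1..1437",
            "//");
        stripPlaceholderLines(record).forEach(System.out::println);
    }
}
```

For a real file, the lines could be read with java.nio.file.Files.readAllLines and the cleaned text written back out before parsing. Lines that merely contain a '?' inside other text (for example in a /note qualifier) are left alone.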
On 6/5/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> Hi again.
>
> Could you remove the offending question mark from the GenBank file and
> try it again to see if that fixes it? The parser should just ignore it,
> but apparently it does not. The error looks weird to me because the
> tokenization for a DNA GenBank file _does_ contain the letter 't'! Not
> sure what's going on here.
...
>
> cheers,
> Richard
>
> On Mon, 2006-06-05 at 10:37 -0400, Seth Johnson wrote:
> > Hello again Richard,
> >
> > No sooner had I written about the fix for the last parsing exception
> > than another one came up with the Genbank format:
> > --------------------------------------
> > org.biojava.bio.seq.io.ParseException: DQ431065
> > org.biojava.bio.BioException: Could not read sequence
> > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> > at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151)
> > at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246)
> > at exonhit.parsers.GenBankParser.main(GenBankParser.java:326)
> > Caused by: org.biojava.bio.seq.io.ParseException: DQ431065
> > at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
> > at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> > ... 3 more
> > org.biojava.bio.seq.io.ParseException:
> > org.biojava.bio.symbol.IllegalSymbolException: This tokenization
> > doesn't contain character: 't'
> > ----------------------------------------
> > The Genbank file that caused it is as follows:
> > =========================================
> > LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006
> > DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial
> > sequence; mitochondrial.
> > ACCESSION DQ431065
> > VERSION DQ431065.1 GI:90102206
> > KEYWORDS .
> > SOURCE Vaccinium corymbosum
> > ORGANISM Vaccinium corymbosum
> > Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
> > Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
> > asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae;
> > Vaccinium.
> > ?
> > REFERENCE 2 (bases 1 to 425)
> > AUTHORS Naik,L.D. and Rowland,L.J.
> > TITLE Expressed Sequence Tags of cDNA clones from subtracted library of
> > Vaccinium corymbosum
> > JOURNAL Unpublished (2005)
> > FEATURES Location/Qualifiers
> > source 1..425
> > /organism="Vaccinium corymbosum"
> > /mol_type="genomic DNA"
> > /cultivar="Bluecrop"
> > /db_xref="taxon:69266"
> > /tissue_type="Flower buds"
> > /clone_lib="Subtracted cDNA library of Vaccinium
> > corymbosum"
> > /dev_stage="399 hour chill unit exposure"
> > /note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I"
> > rRNA <1..>425
> > /product="16S ribosomal RNA"
> > ORIGIN
> > 1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac
> > 61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt
> > 121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt
> > 181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta
> > 241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat
> > 301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta
> > 361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag
> > 421 cgtaa
> > //
> > ==================================
> > I think the cause is the '?' at the beginning of a line.
> > I'm not sure whether the information that was supposed to be present
> > instead of those question marks is absent from the original ASN.1
> > batch file or whether it's a bug in the NCBI asn2gb software. It looks
> > to me like the former is the case, since the file from the NCBI website
> > contains much more information than the batch file. Just bringing this
> > to everyone's attention.
> >
> >
> > --
> > Best Regards,
> >
> >
> > Seth Johnson
> > Senior Bioinformatics Associate
> >
> > Ph: (202) 470-0900
> > Fx: (775) 251-0358
> >
> > On 6/2/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > Hi Seth.
> > >
> > > Your second point, about the authors string not being read correctly in
> > > Genbank format, has been fixed (or should have been if I got the code
> > > right!). Could you check the latest version of biojava-live out of CVS
> > > and give it another go? Basically the parser did not recognise the
> > > CONSRTM tag, as it is not mentioned in the sample record provided by
> > > NCBI, which is what I based the parser on.
> > ...
> > >
> > > cheers,
> > > Richard
> > >
> > >
> --
> Richard Holland (BioMart Team)
> EMBL-EBI
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> UNITED KINGDOM
> Tel: +44-(0)1223-494416
>
>
--
Best Regards,
Seth Johnson
Senior Bioinformatics Associate
Ph: (202) 470-0900
Fx: (775) 251-0358