[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
Seth Johnson
johnson.biotech at gmail.com
Mon Jun 5 14:37:31 UTC 2006
Hello again Richard,
No sooner had I reported that the last parsing exception was fixed than
another one came up, again with Genbank format:
--------------------------------------
org.biojava.bio.seq.io.ParseException: DQ431065
org.biojava.bio.BioException: Could not read sequence
at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
at exonhit.parsers.GenBankParser.getGBSequences(GenBankParser.java:151)
at exonhit.parsers.GenBankParser.runGBparser(GenBankParser.java:246)
at exonhit.parsers.GenBankParser.main(GenBankParser.java:326)
Caused by: org.biojava.bio.seq.io.ParseException: DQ431065
at org.biojavax.bio.seq.io.GenbankFormat.readRichSequence(GenbankFormat.java:245)
at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
... 3 more
org.biojava.bio.seq.io.ParseException:
org.biojava.bio.symbol.IllegalSymbolException: This tokenization
doesn't contain character: 't'
----------------------------------------
The Genbank file that caused it is as follows:
=========================================
LOCUS DQ431065 425 bp DNA linear INV 01-JUN-2006
DEFINITION Reticulitermes sp. ALS-2006c 16S ribosomal RNA gene, partial
sequence; mitochondrial.
ACCESSION DQ431065
VERSION DQ431065.1 GI:90102206
KEYWORDS .
SOURCE Vaccinium corymbosum
ORGANISM Vaccinium corymbosum
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
asterids; Ericales; Ericaceae; Vaccinioideae; Vaccinieae;
Vaccinium.
?
REFERENCE 2 (bases 1 to 425)
AUTHORS Naik,L.D. and Rowland,L.J.
TITLE Expressed Sequence Tags of cDNA clones from subtracted library of
Vaccinium corymbosum
JOURNAL Unpublished (2005)
FEATURES Location/Qualifiers
source 1..425
/organism="Vaccinium corymbosum"
/mol_type="genomic DNA"
/cultivar="Bluecrop"
/db_xref="taxon:69266"
/tissue_type="Flower buds"
/clone_lib="Subtracted cDNA library of Vaccinium
corymbosum"
/dev_stage="399 hour chill unit exposure"
/note="Vector: pCR4TOPO; Site_1: Eco R I; Site_2: Eco R I"
rRNA <1..>425
/product="16S ribosomal RNA"
ORIGIN
1 cgcctgttta tcaaaaacat cttttcttgt tagtttttga agtatggcct gcccgctgac
61 tttagtgttg aagggccgcg gtattttgac cgtgcaaagg tagcatagtc attagttctt
121 taattgtgat ctggtatgaa tggcttgacg aggcatgggc tgtcttaatt ttgaattgtt
181 tattgaattt ggtctttgag ttaaaattct tagatgtttt tatgggacga gaagacccta
241 tagagtttaa catttattat ggtccttttc tgtttgtgag ggctcactgg gccgtctaat
301 atgttttgtt ggggtgatgg gagggaataa tttaacccct cctttttatt attatattta
361 tttatattta tttgatccat ttattttgat tgtaagatta aattacctta gggataacag
421 cgtaa
//
==================================
I think the cause is the '?' at the beginning of the line.
I'm not sure whether the information that was supposed to be present
instead of that question mark is absent from the original ASN.1
batch file or whether it's a bug in the NCBI ASN2GB software. It looks to me
like the former is the case, since the file from the NCBI website contains
much more information than the batch file. Just bringing this to
everyone's attention.
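If the '?' line does turn out to be the trigger (that is only my guess, and the exception text itself complains about the character 't'), one possible stopgap is to drop such lines before the record reaches the parser. Below is a minimal sketch using only the standard Java library. The class and method names are made up for illustration and are not part of BioJava.

```java
// Hypothetical pre-filter, not a BioJava class: removes lines that contain
// nothing but a single '?' from a Genbank flat-file record held in a String.
class PlaceholderLineFilter {

    // Returns the record with every line that trims to "?" removed.
    // All other lines are kept unchanged and terminated with '\n'.
    static String stripPlaceholderLines(String record) {
        StringBuilder out = new StringBuilder();
        for (String line : record.split("\\r?\\n")) {
            if (line.trim().equals("?")) {
                continue; // skip the placeholder line
            }
            out.append(line).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the record above, for demonstration only.
        String record = "ACCESSION DQ431065\n?\nREFERENCE 2 (bases 1 to 425)\n//\n";
        System.out.print(stripPlaceholderLines(record));
    }
}
```

The cleaned text could then be wrapped in a StringReader and BufferedReader and handed to whatever reader the parser expects. This only hides the symptom; the missing information in the batch file would still be missing.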
--
Best Regards,
Seth Johnson
Senior Bioinformatics Associate
Ph: (202) 470-0900
Fx: (775) 251-0358
On 6/2/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> Hi Seth.
>
> Your second point, about the authors string not being read correctly in
> Genbank format, has been fixed (or should have been if I got the code
> right!). Could you check the latest version of biojava-live out of CVS
> and give it another go? Basically the parser did not recognise the
> CONSRTM tag, as it is not mentioned in the sample record provided by
> NCBI, which is what I based the parser on.
...
>
> cheers,
> Richard
>
>