[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files

Seth Johnson johnson.biotech at gmail.com
Tue Jun 6 15:03:23 UTC 2006


I've found the cause of the incorrect formatting (a command-line option
for Release formatting), and most of the sequences now parse
correctly.  However, some of them still throw the exception below.  I hope
I'm not being too much of a nuisance.
~~~~~~~~~~~~~~~~~~
org.biojava.bio.BioException: Could not read sequence
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
        at exonhit.parsers.GenBankParser.main(GenBankParser.java:366)
Caused by: org.biojava.bio.seq.io.ParseException: Bad ID line found:
DX588312  standard; DNA ; GSS; 25 BP.
        at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:321)
        at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
        ... 1 more
Java Result: -1
~~~~~~~~~~~~~~~~~~
Here's the entire sequence file:
==================
ID   DX588312  standard; DNA ; GSS; 25 BP.
XX
AC   DX588312;
XX
SV   DX588312.1
DT   18-MAY-2006
XX
DE   Lewinski-HIVchimera-HeLa-MLVGagPuro-11D09.rev HIVmGag MLV/HIV chimera
DE   Integration Site Library Homo sapiens genomic, genomic survey sequence.
XX
KW   GSS.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini;
OC   Hominidae; Homo.
XX
RN   [1]
RP   1-25
RA   Lewinski M.K., Yamashita M., Emerman M., Ciuffi A., Marshall H., Crawford
RA   G., Collins F., Shinn P., Leipzig J., Hannenhalli S., Berry C.C., Ecker
RA   J.R., Bushman F.D.;
RT   "Retroviral DNA Integration: Viral and Cellular Determinants of Target
RT   Site Selection";
RL   PLoS Pathog. 0:0-0 (2006).
XX
CC   Contact: Bushman FD
CC   Department of Microbiology
CC   University of Pennsylvania School of Medicine
CC   402C Johnson Pavilion, 3610 Hamilton Walk, Philadelphia, PA 19104-6076,
CC   USA
CC   Tel: 215 573 8732
CC   Fax: 215 573 4856
CC   Email: bushman at mail.med.upenn.edu
CC   The hg17 freeze of the human genome was used.
CC   Class: shotgun.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..25
FT                   /organism="Homo sapiens"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:9606"
FT                   /cell_line="HeLa"
FT                   /clone_lib="HIVmGag MLV/HIV chimera Integration Site
FT                   Library"
FT                   /note="HeLa cells were infected with an HIV-based
FT                   chimeric virus with MLV MA, p12 and CA substituted for
FT                   HIV MA and CA and the puromycin resistance gene in place
FT                   of nef. Cells were selected with puromycin for 2 weeks.
FT                   Genomic DNA was extracted, digested with MseI, and
FT                   ligated to a linker. Viral-host DNA junctions were
FT                   amplified by nested PCR and cloned into TOPO TA vectors."
XX
SQ   Sequence 25 BP; 13 A; 0 C; 5 G; 7 T; 0 other;
     agaagtaaaa atgtagatat gatta                                              25
//
==================
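
One visible oddity in the ID line above is the space before the semicolon
in "DNA ;". Whether that is what EMBLFormat rejects is only a guess. The
sketch below does not use BioJava at all, and both patterns and the
IdLineCheck class are hypothetical, not the library's real ones. It only
shows how a pattern that forbids whitespace before a semicolon would fail
on this line, while a lenient one, or a normalized line, would pass.

```java
import java.util.regex.Pattern;

public class IdLineCheck {
    // Hypothetical strict pattern (NOT BioJava's): no whitespace is
    // allowed between a field and the semicolon that follows it.
    static final Pattern STRICT = Pattern.compile(
        "^(\\S+)\\s+(\\S+);\\s+(\\S+);\\s+(\\S+);\\s+(\\d+)\\s+BP\\.$");

    // Hypothetical lenient pattern: optional whitespace before each semicolon.
    static final Pattern LENIENT = Pattern.compile(
        "^(\\S+)\\s+(\\S+)\\s*;\\s+(\\S+)\\s*;\\s+(\\S+)\\s*;\\s+(\\d+)\\s+BP\\.$");

    // Remove any whitespace that sits directly before a semicolon.
    static String normalize(String idLine) {
        return idLine.replaceAll("\\s+;", ";");
    }

    public static void main(String[] args) {
        // ID line as printed in the ParseException message (without the "ID" tag).
        String id = "DX588312  standard; DNA ; GSS; 25 BP.";
        System.out.println("strict:     " + STRICT.matcher(id).matches());
        System.out.println("lenient:    " + LENIENT.matcher(id).matches());
        System.out.println("normalized: " + normalize(id));
    }
}
```

If the stray space does turn out to be the trigger, pre-processing the ID
lines this way before handing the file to the parser might be a stopgap
until the converter output is fixed.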

On 6/6/06, Seth Johnson <johnson.biotech at gmail.com> wrote:
> I see now! It looks like the ASN2GB converter is taking some liberties
> with EMBL format.  I'll try to experiment with command line options of
> that software and if all else fails get hold of the NCBI developers.
>
> On 6/6/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > The program used to generate that EMBL file is doing it incorrectly - it
> > is missing the XX tag after the feature table, and is also missing the
> > SQ tag before the sequence begins.
> >
> > If you generated it using BJX then that's my problem to fix so let me
> > know ASAP if that is the case!
> >
> > cheers,
> > Richard
> >


