[Biojava-l] Parsing Genbank/EMBL/XML Sequences from binary NCBI ASN.1 daily update files
Richard Holland
richard.holland at ebi.ac.uk
Tue Jun 6 15:32:37 UTC 2006
That's an easy one!
The EMBLFormat parser is very strict about extraneous whitespace. In the
example you give, there's an extra space after the DNA token in the ID
line ("DNA ;" rather than "DNA;"). Whitespace is not officially allowed at
that point, so the regex doesn't allow for it; the ID line doesn't get
recognised and the parser throws an exception.
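
To illustrate the failure mode, here is a minimal sketch in Java. The pattern below is a hypothetical stand-in, not the actual regex used in EMBLFormat, and the class and method names (EmblIdLineCheck, isStrictIdLine, normalizeIdLine) are invented for this example. It shows why a space before the semicolon defeats a strict match, and one possible workaround: stripping that whitespace from ID lines before passing the file to the parser.

```java
import java.util.regex.Pattern;

class EmblIdLineCheck {
    // Hypothetical strict pattern (an assumption, NOT copied from EMBLFormat):
    // each token must be followed immediately by ';' and then a single space.
    static final Pattern STRICT_ID = Pattern.compile(
        "^ID\\s+(\\S+)\\s+standard; (\\S+); (\\S+); (\\d+) BP\\.$");

    // True if the line matches the strict ID layout assumed above.
    static boolean isStrictIdLine(String line) {
        return STRICT_ID.matcher(line).matches();
    }

    // Workaround sketch: on ID lines only, drop any whitespace that sits
    // directly before a semicolon. All other lines are returned unchanged.
    static String normalizeIdLine(String line) {
        if (!line.startsWith("ID")) {
            return line;
        }
        return line.replaceAll("\\s+;", ";");
    }

    public static void main(String[] args) {
        String bad = "ID DX588312 standard; DNA ; GSS; 25 BP.";
        System.out.println("as generated: " + isStrictIdLine(bad));
        System.out.println("normalised:   " + isStrictIdLine(normalizeIdLine(bad)));
    }
}
```

Pre-processing the file this way is only a stopgap; the cleaner fix is to get the generating program to emit the ID line without the stray space.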
cheers,
Richard
On Tue, 2006-06-06 at 11:03 -0400, Seth Johnson wrote:
> I've found the cause of the incorrect formatting (command line option
> for Release formatting) and most of the sequences are parsed
> correctly. However, some of them cause the exception below. I hope
> I'm not being too much of a nuisance.
> ~~~~~~~~~~~~~~~~~~
> org.biojava.bio.BioException: Could not read sequence
> at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:112)
> at exonhit.parsers.GenBankParser.main(GenBankParser.java:366)
> Caused by: org.biojava.bio.seq.io.ParseException: Bad ID line found:
> DX588312 standard; DNA ; GSS; 25 BP.
> at org.biojavax.bio.seq.io.EMBLFormat.readRichSequence(EMBLFormat.java:321)
> at org.biojavax.bio.seq.io.RichStreamReader.nextRichSequence(RichStreamReader.java:109)
> ... 1 more
> Java Result: -1
> ~~~~~~~~~~~~~~~~~~
> Here's the entire sequence file:
> ==================
> ID DX588312 standard; DNA ; GSS; 25 BP.
> XX
> AC DX588312;
> XX
> SV DX588312.1
> DT 18-MAY-2006
> XX
> DE Lewinski-HIVchimera-HeLa-MLVGagPuro-11D09.rev HIVmGag MLV/HIV chimera
> DE Integration Site Library Homo sapiens genomic, genomic survey sequence.
> XX
> KW GSS.
> XX
> OS Homo sapiens (human)
> OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
> OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini;
> OC Hominidae; Homo.
> XX
> RN [1]
> RP 1-25
> RA Lewinski M.K., Yamashita M., Emerman M., Ciuffi A., Marshall H., Crawford
> RA G., Collins F., Shinn P., Leipzig J., Hannenhalli S., Berry C.C., Ecker
> RA J.R., Bushman F.D.;
> RT "Retroviral DNA Integration: Viral and Cellular Determinants of Target
> RT Site Selection";
> RL PLoS Pathog. 0:0-0 (2006).
> XX
> CC Contact: Bushman FD
> CC Department of Microbiology
> CC University of Pennsylvania School of Medicine
> CC 402C Johnson Pavilion, 3610 Hamilton Walk, Philadelphia, PA 19104-6076,
> CC USA
> CC Tel: 215 573 8732
> CC Fax: 215 573 4856
> CC Email: bushman at mail.med.upenn.edu
> CC The hg17 freeze of the human genome was used.
> CC Class: shotgun.
> XX
> FH Key Location/Qualifiers
> FH
> FT source 1..25
> FT /organism="Homo sapiens"
> FT /mol_type="genomic DNA"
> FT /db_xref="taxon:9606"
> FT /cell_line="HeLa"
> FT /clone_lib="HIVmGag MLV/HIV chimera Integration Site
> FT Library"
> FT /note="HeLa cells were infected with an HIV-based
> FT chimeric virus with MLV MA, p12 and CA substituted for
> FT HIV MA and CA and the puromycin resistance gene in place
> FT of nef. Cells were selected with puromycin for 2 weeks.
> FT Genomic DNA was extracted, digested with MseI, and
> FT ligated to a linker. Viral-host DNA junctions were
> FT amplified by nested PCR and cloned into TOPO TA vectors."
> XX
> SQ Sequence 25 BP; 13 A; 0 C; 5 G; 7 T; 0 other;
> agaagtaaaa atgtagatat gatta 25
> //
> ==================
>
> On 6/6/06, Seth Johnson <johnson.biotech at gmail.com> wrote:
> > I see now! It looks like the ASN2GB converter is taking some liberties
> > with EMBL format. I'll try to experiment with command line options of
> > that software and if all else fails get hold of the NCBI developers.
> >
> > On 6/6/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > > The program used to generate that EMBL file is doing it incorrectly - it
> > > is missing the XX tag after the feature table, and is also missing the
> > > SQ tag before the sequence begins.
> > >
> > > If you generated it using BJX then that's my problem to fix so let me
> > > know ASAP if that is the case!
> > >
> > > cheers,
> > > Richard
> > >
--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416