[Bioperl-l] Bug in genbank parsing: CONTIG gaps
Chris Fields
cjfields at uiuc.edu
Fri May 5 14:24:05 UTC 2006
I'm not sure it's a valid CONTIG file without the join(...). This is a chunk
from the longer file Michael used as an example here (NW_925173). I believe
the CONTIG line is currently handled like a feature, so I think it goes
through Bio::SeqIO::FTHelper, which is where Michael mentions his bugfix is;
I think it's getting beaten up in there somehow. I may see what happens if
it's treated like a WGS line (stored as a Bio::Annotation::SimpleValue
object), with the whole mess just globbed together as is.
Chris
...
FEATURES             Location/Qualifiers
     source          1..44976370
                     /organism="Homo sapiens"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:9606"
                     /chromosome="11"
CONTIG      join(AADB02014316.1:1..1482320,gap(67),AADB02014317.1:1..577321,
            gap(441),AADB02014318.1:1..173584,gap(676),
            AADB02014319.1:1..377558,gap(20),
            complement(AADB02014320.1:1..431263),gap(20),
            AADB02014321.1:1..794957,gap(1241),AADB02014322.1:1..1366198,
            gap(6446),AADB02014323.1:1..3366,gap(20),AADB02014324.1:1..4771,
            gap(4611),AADB02014325.1:1..383881,gap(20),
            complement(AADB02014326.1:1..381633),gap(1930),
            complement(AADB02014327.1:1..460053),gap(20),
            AADB02014328.1:1..4186,gap(1587),
...
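
To make the "glob it together" idea concrete, here is a minimal sketch in
Python (not Perl, and not BioPerl code). It assumes a CONTIG value is a
comma-separated list of accession.version:start..end segments and gap(N)
elements, optionally wrapped in join(...) and with segments optionally
wrapped in complement(...), which is all the excerpt above shows. The names
glob_contig and parse_contig and both regular expressions are hypothetical.
The sketch treats join(X) and bare X as the same thing, and it raises an
error on anything it does not recognise rather than looping.

```python
import re

# Hypothetical patterns for the two kinds of CONTIG element seen in the excerpt.
GAP = re.compile(r"gap\((\d*)\)")
SEGMENT = re.compile(r"([A-Za-z0-9_]+\.\d+):(\d+)\.\.(\d+)")


def glob_contig(lines):
    """Join a CONTIG line and its continuation lines into one string."""
    text = "".join(line.strip() for line in lines)
    if text.startswith("CONTIG"):
        text = text[len("CONTIG"):].strip()
    return text


def parse_contig(text):
    """Split a globbed CONTIG value into ('seq', ...) and ('gap', ...) tuples."""
    # Assumption: join(X) and bare X mean the same thing, so drop the wrapper.
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    parts = []
    # str.split returns a finite list, so this loop always terminates.
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        strand = 1
        if item.startswith("complement(") and item.endswith(")"):
            strand = -1
            item = item[len("complement("):-1]
        gap = GAP.fullmatch(item)
        seg = SEGMENT.fullmatch(item)
        if gap:
            # gap() with no number is recorded as a gap of unknown length.
            length = int(gap.group(1)) if gap.group(1) else None
            parts.append(("gap", length))
        elif seg:
            acc, start, end = seg.groups()
            parts.append(("seq", acc, int(start), int(end), strand))
        else:
            # Fail loudly instead of retrying the same input forever.
            raise ValueError(f"unrecognised CONTIG element: {item!r}")
    return parts
```

Because the input is split once up front, a malformed element can only
produce an error, never an endless loop. A truncated value (such as the
excerpt above, which stops mid-list) is rejected the same way.
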
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp
> Sent: Thursday, May 04, 2006 5:39 PM
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bug in genbank parsing: CONTIG gaps
>
> The two notations are equivalent and syntactically correct, or so I
> believe ... I don't think 100% verbatim preservation should be the
> goal. Or am I missing the point?
>
> On May 4, 2006, at 6:27 PM, Chris Fields wrote:
>
> > Here's another odd bit. This is what I got for the CONTIG line when I
> > passed a simple contig file (NW_925062, with one join) through
> > Bio::SeqIO:
> >
> > -----------------------------------
> > ....
> > FEATURES             Location/Qualifiers
> >      source          1..8541
> >                      /db_xref="taxon:9606"
> >                      /mol_type="genomic DNA"
> >                      /chromosome="11"
> >                      /organism="Homo sapiens"
> > CONTIG      AADB02014027.1:1..8541
> >
> > //
> > -----------------------------------
> > Here's the original:
> > -----------------------------------
> > FEATURES             Location/Qualifiers
> >      source          1..8541
> >                      /organism="Homo sapiens"
> >                      /mol_type="genomic DNA"
> >                      /db_xref="taxon:9606"
> >                      /chromosome="11"
> > CONTIG      join(AADB02014027.1:1..8541)
> > //
> > -----------------------------------
> >
> > Looks like it lopped out the 'join' here as well.
> >
> > Chris
> >
> >> -----Original Message-----
> >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields
> >> Sent: Thursday, May 04, 2006 1:41 PM
> >> To: bioperl-l at lists.open-bio.org
> >> Subject: Re: [Bioperl-l] Bug in genbank parsing: CONTIG gaps
> >>
> >> Are you using the CONTIG record or the full GenBank file? I see
> >> problems with both (using bioperl-live) which seem unrelated to one
> >> another. The full file seems to be running a bit slow b/c the full
> >> GenBank record is huge (~55 MB) but the CONTIG file does exactly what
> >> you said (runs out of memory).
> >>
> >> Chris
> >>
> >>> -----Original Message-----
> >>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Michael Rogoff
> >>> Sent: Tuesday, May 02, 2006 10:32 PM
> >>> To: bioperl-l at lists.open-bio.org
> >>> Subject: [Bioperl-l] Bug in genbank parsing: CONTIG gaps
> >>>
> >>>
> >>> I've encountered a pretty serious bug in Bio::SeqIO when parsing
> >>> certain genbank files that contain CONTIG entries with gaps. One such
> >>> record is NW_925173.
> >>>
> >>> When I try to parse this file using Bio::SeqIO::genbank, it will
> >>> enter an infinite loop and spin until it runs out of memory.
> >>>
> >>> I'm pretty certain it relates to this bug:
> >>> http://bugzilla.bioperl.org/show_bug.cgi?id=1319 which seems to
> >>> indicate that genbank records with CONTIG gaps are not valid and
> >>> can't be parsed. But this bug actually claims to be fixed, which is
> >>> strange, since looking at the code for FTLocationFactory (where the
> >>> loop is) it's still right there. I assume that this may be fixed in
> >>> other contexts but is still not fixed in Bio::SeqIO::genbank? Or am I
> >>> doing something wrong?
> >>>
> >>> I think that this should probably be filed as an open bug. I would
> >>> think that even if bioperl isn't interested in parsing this type of
> >>> file via SeqIO, certainly you'd want to ensure that no finite input
> >>> file would send the parser into an infinite loop. Have others
> >>> encountered this problem? Is there any plan to address it?
> >>>
> >>> Thanks very much for any information or help!
> >>>
> >>> -Mike
> >>>
> >>> P.S. I've played around with my version of FTLocationFactory and it
> >>> seems to actually work and parse the gaps. I'm not sure if I've
> >>> created other bugs or if it works in all cases, but at least the
> >>> parser doesn't die. I also don't know that my hacky code is
> >>> appropriate for putting back in to BioPerl, but I'm happy to provide
> >>> it if someone wants to check it out and/or consider it for checkin.
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================