[Bioperl-l] Bug in genbank parsing: CONTIG gaps

Chris Fields cjfields at uiuc.edu
Fri May 5 21:56:29 UTC 2006


Okay, I have changed the way the CONTIG line is handled in
Bio::SeqIO::genbank.  It was being handled as a feature; it is now stored
as a Bio::Annotation::SimpleValue object whose value is the entire CONTIG
section.  It seems to pass tests fine, but I'm working on Windows and my
wife's iBook went to the great desktop in the sky (motherboard), so I can't
test it on a Mac.  Pulling the file down with Bio::DB::GenBank (using the
no-redirect flag) works without crashing.
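
For anyone following along, here is a rough sketch of the idea, written in
Python rather than Perl so nobody mistakes it for the actual
Bio::SeqIO::genbank code.  The function names (collect_contig, split_join)
are made up for illustration, and the sample record is a shortened fragment
modeled on NW_925173 with a closing parenthesis added so it is complete.
The point is only that the CONTIG section is glued together as one string,
gaps included, instead of being run through the feature-location parser.

```python
def collect_contig(lines):
    """Return the CONTIG section of a GenBank record as a single string.

    Hypothetical helper for illustration; not part of BioPerl.
    """
    parts = []
    in_contig = False
    for line in lines:
        if line.startswith("CONTIG"):
            in_contig = True
            parts.append(line[len("CONTIG"):].strip())
        elif in_contig and line.startswith(" "):
            # Indented lines continue the CONTIG section; keep them as is.
            parts.append(line.strip())
        elif in_contig:
            # Any unindented line (another keyword, "//") ends the section.
            break
    return "".join(parts)


def split_join(value):
    """Split a CONTIG value into its elements in one pass.

    Assumes elements are comma-separated and contain no commas themselves,
    as in the records quoted in this thread. A bare element without
    join(...) is treated as a one-element list.
    """
    inner = value
    if value.startswith("join(") and value.endswith(")"):
        inner = value[len("join("):-1]
    return inner.split(",")


# Shortened fragment modeled on NW_925173 (closing parenthesis added).
record = [
    "FEATURES             Location/Qualifiers",
    "CONTIG      join(AADB02014316.1:1..1482320,gap(67),",
    "            AADB02014317.1:1..577321)",
    "//",
]

value = collect_contig(record)
print(value)
print(split_join(value))
```

Because split_join makes a single pass over a finite string, it cannot spin
forever on a gap(...) element, whatever else it gets wrong.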

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Chris Fields
> Sent: Friday, May 05, 2006 9:24 AM
> To: 'Hilmar Lapp'
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bug in genbank parsing: CONTIG gaps
> 
> I'm not sure it's a valid CONTIG file w/o the join(...). This is a chunk
> from the longer file Michael used as an example here (NW_925173). I believe
> the CONTIG line is currently handled like a feature so I think it goes
> through Bio::SeqIO::FTHelper, which is where Michael mentions his bugfix is;
> I think it's getting beaten up in there somehow. I may see what happens if
> it's treated like a WGS line (like a Bio::Annotation::SimpleValue object)
> and just glob the whole mess together as is.
> 
> 
> Chris
> 
> ...
> FEATURES             Location/Qualifiers
>      source          1..44976370
>                      /organism="Homo sapiens"
>                      /mol_type="genomic DNA"
>                      /db_xref="taxon:9606"
>                      /chromosome="11"
> CONTIG      join(AADB02014316.1:1..1482320,gap(67),AADB02014317.1:1..577321,
>             gap(441),AADB02014318.1:1..173584,gap(676),
>             AADB02014319.1:1..377558,gap(20),
>             complement(AADB02014320.1:1..431263),gap(20),
>             AADB02014321.1:1..794957,gap(1241),AADB02014322.1:1..1366198,
>             gap(6446),AADB02014323.1:1..3366,gap(20),AADB02014324.1:1..4771,
>             gap(4611),AADB02014325.1:1..383881,gap(20),
>             complement(AADB02014326.1:1..381633),gap(1930),
>             complement(AADB02014327.1:1..460053),gap(20),
>             AADB02014328.1:1..4186,gap(1587),
> ...
> 
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp
> > Sent: Thursday, May 04, 2006 5:39 PM
> > To: Chris Fields
> > Cc: bioperl-l at lists.open-bio.org
> > Subject: Re: [Bioperl-l] Bug in genbank parsing: CONTIG gaps
> >
> > The two notations are equivalent and syntactically correct, or so I
> > believe ... I don't think 100% verbatim preservation should be the
> > goal. Or am I missing the point?
> >
> > On May 4, 2006, at 6:27 PM, Chris Fields wrote:
> >
> > > Here's another odd bit.  This is what I get for the CONTIG line when I
> > > passed a simple contig file (NW_925062, with one join) through
> > > Bio::SeqIO:
> > >
> > > -----------------------------------
> > > ....
> > > FEATURES             Location/Qualifiers
> > >      source          1..8541
> > >                      /db_xref="taxon:9606"
> > >                      /mol_type="genomic DNA"
> > >                      /chromosome="11"
> > >                      /organism="Homo sapiens"
> > > CONTIG      AADB02014027.1:1..8541
> > >
> > > //
> > > -----------------------------------
> > > Here's the original:
> > > -----------------------------------
> > > FEATURES             Location/Qualifiers
> > >      source          1..8541
> > >                      /organism="Homo sapiens"
> > >                      /mol_type="genomic DNA"
> > >                      /db_xref="taxon:9606"
> > >                      /chromosome="11"
> > > CONTIG      join(AADB02014027.1:1..8541)
> > > //
> > > -----------------------------------
> > >
> > > Looks like it lopped out the 'join' here as well.
> > >
> > > Chris
> > >
> > >> -----Original Message-----
> > >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > >> bounces at lists.open-bio.org] On Behalf Of Chris Fields
> > >> Sent: Thursday, May 04, 2006 1:41 PM
> > >> To: bioperl-l at lists.open-bio.org
> > >> Subject: Re: [Bioperl-l] Bug in genbank parsing: CONTIG gaps
> > >>
> > > >> Are you using the CONTIG record or the full GenBank file?  I see
> > > >> problems with both (using bioperl-live) which seem unrelated to one
> > > >> another.  The full file seems to be running a bit slow b/c the full
> > > >> GenBank record is huge (~55 MB) but the CONTIG file does exactly what
> > > >> you said (runs out of memory).
> > >>
> > >> Chris
> > >>
> > >>> -----Original Message-----
> > >>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > >>> bounces at lists.open-bio.org] On Behalf Of Michael Rogoff
> > >>> Sent: Tuesday, May 02, 2006 10:32 PM
> > >>> To: bioperl-l at lists.open-bio.org
> > >>> Subject: [Bioperl-l] Bug in genbank parsing: CONTIG gaps
> > >>>
> > >>>
> > > >>> I've encountered a pretty serious bug in Bio::SeqIO when parsing certain
> > > >>> genbank files that contain CONTIG entries with gaps.  One such record is
> > > >>> NW_925173.
> > >>>
> > > >>> When I try to parse this file using Bio::SeqIO::genbank, it will enter an
> > > >>> infinite loop and spin until it runs out of memory.
> > >>>
> > > >>> I'm pretty certain it relates to this bug:
> > > >>> http://bugzilla.bioperl.org/show_bug.cgi?id=1319 which seems to indicate
> > > >>> that genbank records with CONTIG gaps are not valid and can't be parsed.
> > > >>> But this bug actually claims to be fixed, which is strange, since looking
> > > >>> at the code for FTLocationFactory (where the loop is) it's still right
> > > >>> there.  I assume that this may be fixed in other contexts but is still not
> > > >>> fixed in Bio::SeqIO::genbank?  Or am I doing something wrong?
> > >>>
> > > >>> I think that this should probably be filed as an open bug.  I would think
> > > >>> that even if bioperl isn't interested in parsing this type of file via
> > > >>> SeqIO, certainly you'd want to ensure that no finite input file would send
> > > >>> the parser into an infinite loop.  Have others encountered this problem?
> > > >>> Is there any plan to address it?
> > >>>
> > >>> Thanks very much for any information or help!
> > >>>
> > >>> -Mike
> > >>>
> > > >>> P.S.  I've played around with my version of FTLocationFactory and it seems
> > > >>> to actually work and parse the gaps.  I'm not sure if I've created other
> > > >>> bugs or if it works in all cases, but at least the parser doesn't die.  I
> > > >>> also don't know that my hacky code is appropriate for putting back in to
> > > >>> BioPerl, but I'm happy to provide it if someone wants to check it out
> > > >>> and/or consider it for checkin.
> > >>>
> > >>>
> > >>>
> > >>> _______________________________________________
> > >>> Bioperl-l mailing list
> > >>> Bioperl-l at lists.open-bio.org
> > >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > >>
> > >> _______________________________________________
> > >> Bioperl-l mailing list
> > >> Bioperl-l at lists.open-bio.org
> > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > >
> >
> > --
> > ===========================================================
> > : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> > ===========================================================



