[Biopython-dev] [Bug 2745] Bio.GenBank.LocationParserError with a GenBank CON file

bugzilla-daemon at portal.open-bio.org
Fri Jan 30 15:11:56 UTC 2009


------- Comment #2 from biopython-bugzilla at maubp.freeserve.co.uk  2009-01-30 10:11 EST -------
It's the "gap(unk100)" entries that are breaking the location parser in
Bruce's examples.  Similarly, even "gap()" entries of unknown length, as in
this record, will fail:

LOCUS       AH007743     7832 bp    DNA             CON       26-MAY-1999
DEFINITION  Gallus gallus ornithine transcarbamylase (OTC) gene, complete cds.
VERSION     AH007743.1  GI:4927367
SOURCE      chicken.
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Archosauria;
            Aves; Neognathae; Galliformes; Phasianidae; Phasianinae; Gallus.
FEATURES             Location/Qualifiers
     source          1..7832
                     /organism="Gallus gallus"
CONTIG      join(AF065630.1:1..1903,gap(),AF065631.1:1..435,gap(),

This example is based on ftp://ftp.ncbi.nih.gov/genbank/README.genbank, although
that file does not describe the new gap terms.  Older versions of the release
notes do, e.g.

========================= [start quote] =========================

3.4.15 CONTIG Format

  As an alternative to SEQUENCE, a CONTIG record can be present
following the ORIGIN record. A join() statement utilizing a syntax
similar to that of feature locations (see the Feature Table specification
mentioned in Section 3.4.12) provides the accession numbers and basepair
ranges of other GenBank sequences which contribute to a large-scale
biological object, such as a chromosome or complete genome. Here is
an example of the use of CONTIG :

CONTIG      join(AE003590.3:1..305900,AE003589.4:61..306076,

            [ lines removed for brevity ]


However, the CONTIG join() statement can also utilize a special operator
which is *not* part of the syntax for feature locations:

        gap()     : Gap of unknown length.

        gap(X)    : Gap with an estimated integer length of X bases.

                    To be represented as a run of n's of length X
                    in the sequence that can be constructed from
                    the CONTIG line join() statement .

        gap(unkX) : Gap of unknown length, which is to be represented
                    as an integer number (X) of n's in the sequence that
                    can be constructed from the CONTIG line join() statement.

                    The value of this gap operator consists of the 
                    literal characters 'unk', followed by an integer.

Here is an example of a CONTIG line join() that utilizes the gap() operator:

CONTIG      join(complement(AADE01002756.1:1..10234),gap(1206),

The first and last elements of the join() statement may be a gap() operator.
But if so, then those gaps should represent telomeres, centromeres, etc.

Consecutive gap() operators are illegal.

========================= [end quote] =========================

Evidently Biopython doesn't cope with these CONTIG lines - but then they do
have a different syntax to the feature locations.  I never understood why the
current code tries to parse the CONTIG string into a SeqFeature object in the
first place.
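
To illustrate treating the CONTIG join() on its own terms rather than as a
feature location, here is a minimal sketch in plain Python.  It is not
Biopython code: the names split_top_level and parse_contig are hypothetical,
and the accessions in the usage note are made up.  It only tokenises the
statement and classifies the three gap forms quoted above; it does not build
a sequence or a SeqFeature.

```python
import re

# Matches gap(), gap(X) and gap(unkX) as described in the quoted release notes.
_GAP_RE = re.compile(r"^gap\((unk)?(\d*)\)$")


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return [p for p in parts if p]


def parse_contig(contig):
    """Return a list of ("segment", text) and ("gap", length, estimated) tuples.

    length is None for gap(); estimated is True only for gap(X), where X is
    an estimated length rather than a placeholder run of n's.
    """
    contig = "".join(contig.split())  # drop spaces and line wrapping
    if not (contig.startswith("join(") and contig.endswith(")")):
        raise ValueError("expected a join(...) statement")
    elements = []
    previous_was_gap = False
    for token in split_top_level(contig[len("join("):-1]):
        if token.startswith("gap("):
            match = _GAP_RE.match(token)
            if match is None or (match.group(1) and not match.group(2)):
                raise ValueError("malformed gap operator: %r" % token)
            if previous_was_gap:
                raise ValueError("consecutive gap() operators are illegal")
            digits = match.group(2)
            length = int(digits) if digits else None
            estimated = bool(digits) and match.group(1) is None
            elements.append(("gap", length, estimated))
            previous_was_gap = True
        else:
            # Left as an opaque string; a real parser would go on to handle
            # accession:start..end and complement(...) here.
            elements.append(("segment", token))
            previous_was_gap = False
    return elements
```

For example, parse_contig("join(XX000001.1:1..100,gap(unk100),XX000002.1:1..50)")
yields a segment, then ("gap", 100, False), then another segment, while a
statement with two adjacent gap operators raises ValueError.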
