[Biopython-dev] More 'fun' with GenBank

Lenna Peterson arklenna at gmail.com
Tue Jan 15 17:19:48 UTC 2013


+1 for f_loc4. The FeatureLocation/CompoundLocation classes will hopefully
make handling joins and other GenBank operators a little more logical. Not
to mention my CoordinateMapper is based on this branch!

Lenna



On Tue, Jan 15, 2013 at 11:41 AM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> On Tue, Jan 15, 2013 at 3:54 PM, Kai Blin
> <kai.blin at biotech.uni-tuebingen.de> wrote:
> > Hi folks,
> >
> > as people are hitting my web service with all sorts of wonky GenBank
> > files, I've stumbled over another one that throws the GenBank parser off
> > track.
> >
> > The culprit is a SeqFeature with a location line like:
> >
> >      CDS             join(complement(4093..4338),complement(3876..4011),
> >                      complement(3655..3809),complement(3284..3585),
> >                      complement(2421..2813),complement(2057..2303))
> >
> > Now, the way I read the GenBank spec, this is not a valid location line,
> > but should instead be a complement() of joins(). Unfortunately, the NCBI
> > seems to disagree with its own specs, and put the record into their
> > Nucleotide database as CABT02000004, which means that by all practical
> > purposes, it _is_ a valid GenBank file and the parser should cope.
>
> That should work - for a while GenBank and EMBL didn't agree about
> joins on the complement strand, one did complement(join(a..b,c..d))
> and the other join(complement(c..d),complement(a..b)), notice the
> order of the sub-regions flips.
>
> > The parser looks at this location and creates a feature on the -1
> > strand, from 4092:2303. This is caused by by the feature location
> > calculation on
> >
> https://github.com/biopython/biopython/blob/master/Bio/GenBank/__init__.py#L1049
> > and the lines after.
> >
> > In short, we do
> >             s = cur_feature.sub_features[0].location.start
> >             e = cur_feature.sub_features[-1].location.end
> >             cur_feature.location = SeqFeature.FeatureLocation(s, e,
> strand)
>
> For join feature locations, the sub-feature locations should be fine
> but the overall feature location is a bit weird/broken for negative
> and mixed strands.
>
> This was one of the things the re-factoring on this branch aimed to
> fix, https://github.com/peterjc/biopython/tree/f_loc4/
> http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html
>
> I was intending to bring this up again after the next release (which
> could be later this month or February 2012), but perhaps it would
> be worth doing now?
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev
>



More information about the Biopython-dev mailing list