[Bioperl-l] *major* error in genbank parser or am i just insane?

Chris Mungall cjm@fruitfly.org
Tue, 6 Aug 2002 13:14:20 -0700 (PDT)


Maybe I'm just hugely confused about split locations, but I think there is
something deeply wrong with how the genbank parser deals with
reverse-complemented split locations.

it seems that if you parse this

 mRNA    complement(join(1..100,201..200,

and then use a SeqIO stream of format genbank to write it out again, you get
this:

 mRNA    join(1..100,201..200,

which is highly disturbing: the complement() wrapper, and with it the strand,
is gone.

Looking at FTHelper, it seems that when a split location object is created,
the strand is set on the parent splitlocation, but not on the individual
simple sublocations.
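
To make the suspected failure mode concrete, here is a minimal sketch in
Python (not BioPerl code). The class and function names are made up for
illustration, the coordinates are invented, and the writer that consults only
the sublocation strands is an assumption about how the strand could get lost,
not a description of what FTHelper or the genbank writer actually does.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SimpleLocation:
    start: int
    end: int
    strand: Optional[int] = None  # None models "strand never set"


@dataclass
class SplitLocation:
    sublocations: List[SimpleLocation] = field(default_factory=list)
    strand: Optional[int] = None


def build_split(ranges, strand, propagate):
    """Build a split location; optionally copy the strand to each sublocation."""
    subs = [SimpleLocation(s, e, strand if propagate else None) for s, e in ranges]
    return SplitLocation(subs, strand)


def to_genbank(loc):
    """Hypothetical writer that only looks at the sublocations' strands."""
    body = ",".join(f"{s.start}..{s.end}" for s in loc.sublocations)
    text = f"join({body})"
    if loc.sublocations and all(s.strand == -1 for s in loc.sublocations):
        text = f"complement({text})"
    return text


ranges = [(1, 100), (201, 300)]  # made-up coordinates, not from a real record
lost = to_genbank(build_split(ranges, -1, propagate=False))
kept = to_genbank(build_split(ranges, -1, propagate=True))
print(lost)  # join(1..100,201..300)
print(kept)  # complement(join(1..100,201..300))
```

In this toy model the parent knows it is on the minus strand in both cases,
but only the version that copies the strand down to the sublocations writes
the complement() back out.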

I'm about to commit a fix for this, but I just need a sanity check first:
surely this is one of the most commonly used modules in bioperl? someone
would have noticed this by now? I mean this is 50% of mRNAs on a genomic
entry, and orientation is kind of important in the scheme of things. Elia
- haven't you populated a biosql instance from genbank? Didn't all your
mRNAs come out on the forward strand? Or were you only doing cDNA records?

Looking at the SeqIO tests, there seems to be a test for a similar case, using
testfuzzy.genbank (although this seems to be a weird made-up example of a
different case altogether), and this doesn't round-trip correctly.
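
For reference, the round-trip property being checked can be sketched like
this in Python. This is only an illustration of the idea: parse_location,
format_location and round_trips are hypothetical helpers that handle just the
plain join(...) and complement(join(...)) forms with exact coordinates, and
they are not part of the BioPerl test suite.

```python
import re


def parse_location(text):
    """Parse 'join(a..b,...)', optionally wrapped in 'complement(...)'."""
    text = text.strip()
    strand = 1
    m = re.fullmatch(r"complement\((.*)\)", text)
    if m:
        strand = -1
        text = m.group(1)
    m = re.fullmatch(r"join\((.*)\)", text)
    if not m:
        raise ValueError(f"unsupported location: {text!r}")
    ranges = []
    for part in m.group(1).split(","):
        start, end = part.split("..")
        ranges.append((int(start), int(end)))
    return strand, ranges


def format_location(strand, ranges):
    """Write the location back out, restoring complement() for strand -1."""
    body = "join(" + ",".join(f"{s}..{e}" for s, e in ranges) + ")"
    return f"complement({body})" if strand == -1 else body


def round_trips(text):
    """True if parsing and re-writing reproduces the input exactly."""
    return format_location(*parse_location(text)) == text


# Made-up coordinates, not taken from a real record.
print(round_trips("complement(join(1..100,201..300))"))
print(round_trips("join(1..100,201..300)"))
```

A test along these lines on a reverse-strand mRNA is the kind of check that
would catch the strand being dropped on output.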

I'm not sure if this is a recently introduced bug (surely it must be?) or
something that's been around for a while. I have only tested 1.0.2 and the
main CVS branch.

Hmm, I'm not sure I know how to fix this properly - I can set the strand of
the sublocations, but this has weird results when you try to export in genbank
format again. I'll go ahead and commit this anyway sometime today unless
anyone has any objections, as it can't be worse than the current handling.

I committed a test data file in t/data/revcomp_mrna.gb