[Bioperl-l] *major* error in genbank parser or am i just insane?
Hilmar Lapp
hlapp@gnf.org
Tue, 6 Aug 2002 17:16:05 -0600
I doubt that the full cross-product of location types and GenBank entries
has ever been tested, so something may easily have slipped
through.
Getting the business logic right here is also not really trivial, due
to orthogonal semantics (apparently; Chris, are you sure about this?):
complement(join(1..100,201..300)) would mean take 1..100 and 201..300
from the minus strand, concatenate, and reverse complement?
In the SplitLocation, it would mean take those from the plus strand,
concatenate, then reverse complement ... Anyway, since the
complement() apparently got lost entirely, something must be wrong
here.
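
To make the two readings concrete, here is a minimal Python sketch (a toy
model, not BioPerl code). It assumes the usual GenBank reading, in which
complement(join(a,b)) means "join a and b on the plus strand, then reverse
complement", and that this is equivalent to join(complement(b),complement(a)).
The function names, the sequence, and the coordinates are all made up for
illustration.

```python
# Toy model of GenBank location semantics; names and data are hypothetical.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def subseq(seq, start, end):
    """Slice using 1-based, inclusive plus-strand coordinates."""
    return seq[start - 1:end]


def plain_join(seq, spans):
    """join(a,b,...): concatenate the spans as read from the plus strand."""
    return "".join(subseq(seq, s, e) for s, e in spans)


def complement_of_join(seq, spans):
    """complement(join(a,b,...)): join on the plus strand, then revcomp."""
    return revcomp(plain_join(seq, spans))


def join_of_complements(seq, spans):
    """join(complement(b),complement(a)): revcomp each span, reversed order."""
    return "".join(revcomp(subseq(seq, s, e)) for s, e in reversed(spans))


seq = "AAACCCGGGTTTACGT"
spans = [(1, 4), (7, 10)]

minus_strand_mrna = complement_of_join(seq, spans)
same_by_parts = join_of_complements(seq, spans)
lost_complement = plain_join(seq, spans)
```

Under these assumptions the first two results are identical, while dropping
the complement() on output (the round-trip symptom described below) yields a
different, plus-strand sequence.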
Confused,
-hilmar
On Tuesday, August 6, 2002, at 04:36 PM, Chris Mungall wrote:
>
>
> On Tue, 6 Aug 2002, Jason Stajich wrote:
>
>> I would really love it for someone to do an overhaul on this. If you
>> have use cases which break for you, then something is wrong. I think
>> the Location objects were messed with recently, and I don't remember
>> how we were setting strand originally or in FTHelper.
>
> the semantics of start/end seem a bit odd too - one would expect
> start/end on a split location to be the min/max from the individual
> sublocs; this doesn't seem to be the case
>
>> The regular expressions currently cannot parse all possible cases of
>> crazy join(complement(..)); these would need to be handled with a
>> grammar or a more formal regular expression.
>
> ok; this is just the standard case of a revcomped mrna, eg half the
> features on a genbank NT record
>
>> -jason
>>
>> On Tue, 6 Aug 2002, Chris Mungall wrote:
>>
>>>
>>> maybe i'm just hugely confused about split locations, but i think
>>> there is something deeply terribly wrong with how the genbank
>>> parser is dealing with revcomped split locations.
>>>
>>> it seems that if you parse this
>>>
>>> mRNA complement(join(1..100,201..200,
>>>
>>> then use a seqio stream of format genbank to spit it out again you
>>> get this:
>>>
>>> mRNA join(1..100,201..200,
>>>
>>> which is highly disturbing
>>>
>>> looking at FTHelper it seems that when a split location object is
>>> created, the strand is set in the parent splitlocation, but not in
>>> the individual simple sublocations.
>>>
>>> I'm about to commit a fix for this, but I just need a sanity check
>>> first: surely this is one of the most commonly used modules in
>>> bioperl? someone would have noticed this by now? I mean this is 50%
>>> of mRNAs on a genomic entry, and orientation is kind of important in
>>> the scheme of things. Elia - haven't you populated a biosql instance
>>> from genbank? Didn't all your mRNAs come out on the forward strand?
>>> Or were you only doing cDNA records?
>>>
>>> Looking at SeqIO, it seems to test for a similar case, using
>>> testfuzzy.genbank (although this seems to be a weird made up example
>>> of a different case altogether) and this doesn't round trip
>>> correctly.
>>>
>>> I'm not sure if this is a recently introduced bug (surely it must
>>> be?) or something that's been around for a while. i only tested
>>> 1.0.2 and the main cvs branch.
>>>
>>> hmm, not sure if i know how to fix this - i can set the strand of the
>>> sublocations but this has weird results when you try and export in
>>> genbank format again. i'll go ahead and commit this anyway sometime
>>> today unless anyone has any objections, as it can't be worse than
>>> the current handling.
>>>
>>> i committed a test datafile in t/data/revcomp_mrna.gb
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l@bioperl.org
>>> http://bioperl.org/mailman/listinfo/bioperl-l
>>>
>>
>>
>
--
-------------------------------------------------------------
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
-------------------------------------------------------------