[Biopython-dev] Replacing SeqFeature sub_features with compound feature locations

Lenna Peterson arklenna at gmail.com
Tue Jul 24 21:08:44 UTC 2012


>> The documentation suggests using + to combine FeatureLocations, which
>> invites the use of sum. However, sum doesn't work properly. I explain
>> why in my StackOverflow question:
>> http://stackoverflow.com/questions/11624955/avoiding-python-sum-default-start-arg-behavior
>
> Huh, I hadn't anticipated that - but I agree trying to use sum seems
> natural.
>
>> I have considered a number of workarounds:
>>
>> 1. Implementing __radd__ on FeatureLocation to return self if other ==
>> 0 allows sum() to work in place, but I am uncomfortable with
>> hard-coding such a condition.
>
> Another idea is to define FeatureLocation or CompoundFeature
> addition with an integer to expose the current private method _shift.
> i.e. Apply an offset to the co-ordinates. Something I'd been pondering
> as a (previously unrelated) enhancement. In this interpretation, adding
> zero would have no effect on the co-ordinates and thus as a side
> effect should also make sum(locations) work. We'd need to test this
> to see if that actually works.

Yes, this works fine:

Modifying FeatureLocation.__add__ with the condition:

    if isinstance(other, int):
        return self._shift(other)

and adding FeatureLocation.__radd__:

    def __radd__(self, other):
        return self.__add__(other)

After these changes, FeatureLocation(3,6) + 3 yields [6:9] and
sum([FeatureLocation(3,6), FeatureLocation(10,13)]) yields join{[3:6],
[10:13]}. Adding FeatureLocations with + also still works, as does
summing lists with more than two elements.
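
For anyone wanting to try the idea outside of Biopython, here is a minimal
self-contained sketch. Loc and Join are hypothetical toy stand-ins, not the
real FeatureLocation or compound location classes; only the __add__/__radd__
pattern is meant to carry over. sum() starts from the integer 0, so the first
step is 0 + Loc, which falls through to Loc.__radd__(0) and becomes a shift
by zero.

```python
class Loc:
    """Toy stand-in for FeatureLocation (assumption: NOT the Biopython class)."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def _shift(self, offset):
        # Return a new location moved by `offset` positions.
        return Loc(self.start + offset, self.end + offset)

    def __add__(self, other):
        if isinstance(other, int):
            return self._shift(other)      # Loc + int applies an offset
        if isinstance(other, Loc):
            return Join([self, other])     # Loc + Loc builds a join
        return NotImplemented

    def __radd__(self, other):
        # Reached for int + Loc, including the implicit 0 that sum() starts with.
        return self.__add__(other)

    def __repr__(self):
        return "[%d:%d]" % (self.start, self.end)


class Join:
    """Toy stand-in for a compound location (hypothetical)."""

    def __init__(self, parts):
        self.parts = list(parts)

    def __add__(self, other):
        if isinstance(other, Loc):
            return Join(self.parts + [other])
        return NotImplemented

    def __repr__(self):
        return "join{%s}" % ", ".join(repr(p) for p in self.parts)


print(Loc(3, 6) + 3)
print(sum([Loc(3, 6), Loc(10, 13)]))
print(sum([Loc(3, 6), Loc(10, 13), Loc(20, 25)]))
```

The toy Join deliberately leaves out Join + int and Join + Join to keep the
sketch short.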

>
>> 2. Changing the location to subclass set and use xrange for generation
>> would easily allow a number of things: an empty location
>> (FeatureLocation(0,0) prints as [0:0]), union for iteration, and the
>> 'magic' of merging abutting locations that you mention. However, using
>> + and sum() on sets is dubious from a mathematically pure standpoint,
>> and this would only work for ExactPositions. Note that I haven't
>> attempted this yet and it may have disadvantages even for
>> ExactPositions that I've failed to imagine.
>>
>> Let me know your thoughts.
>
> I wouldn't think of FeatureLocation(0,0) aka [0:0] as an empty
> location, but rather as a between location - in this case between
> the last and first base on a circular genome. In Genbank notation
> for a circular genome of length 1234, this would be 1234^1
> (already an annoying special case we have to handle in the
> parser and the writer - although I'd have to check the code
> to see if we store this as [0:0] or [1234:1234] since both make
> sense).

If the length is 1234, [1234] would be an index error. I don't think
[1233:1233] would make sense either; for space-counted genomic
coordinates (http://alternateallele.blogspot.co.uk/2012/03/genome-coordinate-conventions.html),
the index refers to the space to the left of the base pair. By that
convention, [0:0] would refer to the gap between the last base and the
first base.
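
Python slicing uses the same between-base convention, so a tiny example may
help. The four-letter string is a made-up stand-in for a circular genome; the
point is only that an integer index must name an existing base, while slice
boundaries name the gaps between bases.

```python
seq = "ACGT"   # toy sequence (assumption), imagined as circular
n = len(seq)

# Slice boundaries count the spaces between bases, so an empty slice
# names a gap rather than a base.
gap_at_origin = seq[0:0]   # space to the left of the first base
gap_at_end = seq[n:n]      # space to the right of the last base

# Indexing, unlike slicing, has to land on an actual base.
try:
    seq[n]
    index_ok = True
except IndexError:
    index_ok = False

# On a circle the space after the last base is the space before the first,
# so gap n and gap 0 coincide once reduced modulo the length.
same_gap = (n % n) == (0 % n)

print(repr(gap_at_origin), repr(gap_at_end), index_ok, same_gap)
```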

>
> On the other hand, a CompoundLocation with zero parts might
> make sense. There is something to be said for simply having
> a single (upgraded) FeatureLocation object with a parts list,
> which in the typical case would be length one, and proxy
> methods for start/end as currently defined in CompoundLocation.
> Maybe I should try that on another branch... it might be more
> elegant overall.
>

I haven't tested sum() on CompoundLocations but I would guess they
would need similar treatment to FeatureLocation. Should
CompoundLocation + int also shift each part? I agree that an
"upgraded" FeatureLocation could be more elegant.
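
To make the question concrete, here is a rough sketch of what a single
"upgraded" location with a parts list might look like if + int shifted every
part. Everything here is hypothetical: the Location class, storing parts as
(start, end) tuples, and taking start/end as the min/max over the parts are my
assumptions for illustration, not what CompoundLocation currently does.

```python
class Location:
    """Hypothetical 'upgraded' location holding a list of (start, end) parts."""

    def __init__(self, *parts):
        self.parts = [tuple(p) for p in parts]

    @property
    def start(self):
        # Assumption: overall start is the smallest part start.
        # Raises ValueError for a location with zero parts.
        return min(s for s, _ in self.parts)

    @property
    def end(self):
        # Assumption: overall end is the largest part end.
        return max(e for _, e in self.parts)

    def __add__(self, other):
        if isinstance(other, int):
            # Shift every part by the same offset.
            return Location(*[(s + other, e + other) for s, e in self.parts])
        if isinstance(other, Location):
            # Concatenate the parts, preserving order.
            return Location(*(self.parts + other.parts))
        return NotImplemented

    def __radd__(self, other):
        # Only int + Location ends up here (e.g. the 0 that sum() starts with).
        if isinstance(other, int):
            return self + other
        return NotImplemented

    def __repr__(self):
        spans = ", ".join("[%d:%d]" % p for p in self.parts)
        return spans if len(self.parts) == 1 else "join{%s}" % spans


joined = sum([Location((3, 6)), Location((10, 13))])
print(joined, joined.start, joined.end)
print(joined + 5)
```

With one class there is only one __add__/__radd__ pair to maintain, and the
single-part case prints the same way a plain location does.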


