[Biopython] SeqIO feature.location.start and end for genes spanning origin

Fri May 9 09:50:11 UTC 2014

[Would anyone else like to comment on the proposed new properties
on the feature location for biological start and end values?]

On Fri, May 9, 2014 at 12:20 AM, Richard Llewellyn <llewelr at gmail.com> wrote:
>>
>> On Thu, May 8, 2014 at 6:47 PM, Richard Llewellyn wrote:
>>
>> >> > What numbers are you hoping to get out of this location?
>> >
>> > Great question.  I can see that having 0,end is useful as a flag for origin
>> > spanning.  However, it is also the least informative, as neither 0 nor the
>> > end are actual locations of the gene starting/ending.
>>
>> Your view "least informative" is subjective. They are end points of
>> the "fake" exons used to describe the feature, and more importantly
>> represent the min/max of the region spanned (very useful for drawing
>> features or considering intersections etc).
>
> Ok.  But I do pause on the statement that these 'represent the min/max
> of the region spanned,' as from my perspective I don't think of a gene
> as spanning the entire chromosome.  Regardless, if I check for this
> special case, no big deal for me.  I realize you have many other
> constraints I haven't needed to worry about.
>
>> > My code would have
>> > expected the start and end to be the sequence locations (so start >> end),
>> > and it would have marked this as a special case of origin spanning.  But it
>> > does require special handling.  I currently use negative numbers for the
>> > start in this situation, though this has its own problems.
>>
>> You mean the biological start and end?
>
> Well, I didn't expect the biological start and end, but I did expect that
> the start and end would represent the start and end of the genes.

The trouble is there are many different interpretations (see below).

>>
>> You didn't answer my question
>
> I sensed quicksand ;-)

Rightly so - this is a tricky business.

>> but I am guessing you wanted start 879 (adjusted for
>> Python counting?) and end 490883 given this location string:
>> complement(join(490883..490885,1..879))
>
>  I would have been fine with start 490883, and end 879.  But I'm not
> pushing for such. I do think most users would naively assume that
> the set of start and end of a gene feature would contain the start
> and end, regardless of order or strand.  At least I did.

I find it hard to give a rationale for start 490883 and end 879 except
that they are the first and last numbers in the GenBank location string:
complement(join(490883..490885,1..879))

Note this location is equivalent to this alternative presentation
(which is closer to how we store it in Biopython):

 join(complement(1..879),complement(490883..490885))

The meaning is {reverse strand from 879 to 1} plus
{reverse strand from 490885 to 490883}, which is why I was
talking about biological start 879 and end 490883.
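
To make the counting explicit, here is a minimal sketch in plain Python
(no Biopython needed) of how the two GenBank ranges map onto Python-style
coordinates, and why the current start and end come out as 0 and 490885.
The helper name genbank_to_python is made up for this illustration; it is
not part of Biopython.

```python
def genbank_to_python(first, last):
    """Convert a 1-based inclusive GenBank range to a 0-based, end-exclusive pair."""
    return first - 1, last

# The two ranges from complement(join(490883..490885,1..879))
parts = [genbank_to_python(1, 879), genbank_to_python(490883, 490885)]
print(parts)  # [(0, 879), (490882, 490885)]

# The current start/end are the min/max over all the parts
start = min(s for s, e in parts)
end = max(e for s, e in parts)
print(start, end)  # 0 490885

# The feature length is the sum of the part lengths, not end - start
length = sum(e - s for s, e in parts)
print(length)  # 882
```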

>> That would be possible for any stranded feature, although it
>> is less well defined for strand-less features (which in GenBank
>> and EMBL are by default forward strand).
>> Anyway, using NC_005213 as the example:
>>
>> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk
>>
>> Sample code:
>>
>>   >>> from Bio import SeqIO
>>   >>> record = SeqIO.read("NC_005213.gbk", "genbank")
>>   >>> feature = record.features[4] # the CDS record spanning origin
>>   >>> print(feature.location)
>>   join{[0:879](-), [490882:490885](-)}
>>   >>> print(feature.location.start)
>>   0
>>   >>> print(feature.location.end)
>>   490885
>>
>> Perhaps there is a case for "biological" start and end properties
>> (calculated from the current data structure on demand)? i.e.
>> something like this:
>>
>>   >>> first = feature.location.parts[0]
>>   >>> first.end if first.strand == -1 else first.start
>>   ExactPosition(879)
>>   >>> last = feature.location.parts[-1]
>>   >>> last.start if last.strand == -1 else last.end
>>   ExactPosition(490882)
>>
>> Note as a convenience, even basic non-compound locations have
>> a parts property - returning a list of one entry, themselves. So that
>> code should work in general :)
>
> Maybe so.  I use left and right for the biopython start and end
> locations myself, to distinguish between biological starts and ends,
> which I calculate with something like your logic, unless around origin.

Great.

With hindsight the Biopython location could have used left and
right for the current properties start and end, where left <= right
regardless of the strand. But we're stuck with that naming choice
now, thus my suggestion of adding bio_start and bio_end alternatives.
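
As a rough sketch of what such properties might compute, here is the same
logic as the expressions quoted earlier, written as two functions. This is
only an illustration, not an implementation in Biopython: Part and Compound
are hypothetical stand-ins for the real location classes (so the example
runs without Biopython installed), and bio_start/bio_end are just the
names proposed in this thread. It assumes the parts are held in the order
shown in the example above.

```python
from collections import namedtuple


class Part(namedtuple("Part", "start end strand")):
    """Hypothetical stand-in for a simple location (0-based, end-exclusive)."""

    @property
    def parts(self):
        # Like the real simple locations, a single part lists itself.
        return [self]


# Hypothetical stand-in for a compound (join) location.
Compound = namedtuple("Compound", "parts")


def bio_start(location):
    """Biological start: taken from the first part, flipped on the reverse strand."""
    first = location.parts[0]
    return first.end if first.strand == -1 else first.start


def bio_end(location):
    """Biological end: taken from the last part, flipped on the reverse strand."""
    last = location.parts[-1]
    return last.start if last.strand == -1 else last.end


# The origin-spanning CDS from NC_005213, as stored in the example above
origin_cds = Compound([Part(0, 879, -1), Part(490882, 490885, -1)])
print(bio_start(origin_cds), bio_end(origin_cds))  # 879 490882
```

Because a simple location also exposes parts, the same two functions
should cover non-compound features without a special case.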

>
> I appreciate that len(feature) is correct, even for around origin.

Yep - we have unit tests to double check this matches the length
of the sequence pulled out using the extract method, even for
nasty cases like mixed-strand trans-splicing.

>> The potential enhancement would be to define these as new
>> properties, feature.location.bio_start and feature.location.bio_end
>> or similar naming? Would that be useful? What would you name them?
>
> I do rather like that, especially for cases around origin.  bio_start/end
> or transcription_start/end -- nah too long.  I like bio_.  If nothing else,
> would serve as a reminder that start and end are not necessarily the
> transcribed start and end.

The proposed bio_start and bio_end would also be different for
simple reverse strand features, e.g.

complement(1001..2000)

start = 1000 (Python counting)
end = 2000
strand = -1
len = 1000

This would get:

bio_start = 2000
bio_end = 1000

Notice start < end (which we might have better called left/right),
while bio_start > bio_end.
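
The single-part case can be checked with a few lines of plain Python. This
is only a sketch; bio_ends is a made-up helper name, not an existing
Biopython function.

```python
def bio_ends(start, end, strand):
    """Return (bio_start, bio_end) for a single-part location."""
    return (end, start) if strand == -1 else (start, end)

# complement(1001..2000) in Python counting
start, end, strand = 1000, 2000, -1
print(end - start)                   # 1000
print(bio_ends(start, end, strand))  # (2000, 1000)

# The same span on the forward strand keeps its order
print(bio_ends(start, end, +1))      # (1000, 2000)
```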

Peter