[Biopython] SeqIO feature.location.start and end for genes spanning origin
Peter Cock
p.j.a.cock at googlemail.com
Fri May 9 09:50:11 UTC 2014
[Would anyone else like to comment on the proposed new properties
on the feature location for biological start and end values?]
On Fri, May 9, 2014 at 12:20 AM, Richard Llewellyn <llewelr at gmail.com> wrote:
>>
>> On Thu, May 8, 2014 at 6:47 PM, Richard Llewellyn wrote:
>>
>> >> > What numbers are you hoping to get out of this location?
>> >
>> > Great question. I can see that having 0,end is useful as a flag for origin
>> > spanning. However, it is also the least informative, as neither 0 or the
>> > end are actual locations of the gene starting/ending.
>>
>> Your view "least informative" is subjective. They are end points of
>> the "fake" exons used to describe the feature, and more importantly
>> represent the min/max of the region spanned (very useful for drawing
>> features or considering intersections etc).
>
> Ok. But I do pause on the statement that these 'represent the min/max
> of the region spanned,' as from my perspective I don't think of a gene
> as spanning the entire chromosome. Regardless, if I check for this
> special case, no big deal for me. I realize you have many other
> constraints I haven't needed to worry about.
>
>> > My code would have
>> > expected the start and end to be the sequence locations (so start >> end),
>> > and it would have marked this as a special case of origin spanning. But it
>> > does require special handling. I currently use negative numbers for the
>> > start in this situation, though this has its own problems.
>>
>> You mean the biological start and end?
>
> Well, I didn't expect the biological start and end, but I did expect that
> the start and end would represent the start and end of the genes.
The trouble is there are many different interpretations (see below).
>>
>> You didn't answer my question
>
> I sensed quicksand ;-)
Rightly so - this is a tricky business.
>> but I am guessing you wanted start 879 (adjusted for
>> Python counting?) and end 490883 given this location string:
>> complement(join(490883..490885,1..879))
>
> I would have been fine with start 490883, and end 879. But I'm not
> pushing for such. I do think most users would naively assume that
> the set of start and end of a gene feature would contain the start
> and end, regardless of order or strand. At least I did.
I find it hard to give a rationale for start 490883 and end 879 except
that they are the first and last numbers in the GenBank location string:
complement(join(490883..490885,1..879))
Note this location is equivalent to this alternative presentation
(which is closer to how we store it in Biopython):
join(complement(1..879),complement(490883..490885))
The meaning is {reverse strand from 879 to 1} plus
{reverse strand 490885 to 490883}, which is why I was
talking about biological start 879 to end 490883.
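As a rough illustration of that rewriting (plain Python, not Biopython
code; the function names complement_of_join and reading_order are made
up for this sketch, and coordinates stay in GenBank's one-based
inclusive counting):

```python
def complement_of_join(parts):
    """Rewrite complement(join(p1, ..., pn)) as
    join(complement(pn), ..., complement(p1)).

    Each input part is a (first, last) pair in GenBank counting.
    Returns (first, last, strand) triples, with strand -1 throughout.
    """
    return [(first, last, -1) for first, last in reversed(parts)]


def reading_order(part):
    """Return the (from, to) positions in the direction the part is read."""
    first, last, strand = part
    return (last, first) if strand == -1 else (first, last)


# complement(join(490883..490885,1..879))
parts = complement_of_join([(490883, 490885), (1, 879)])
print(parts)
# [(1, 879, -1), (490883, 490885, -1)]
print([reading_order(p) for p in parts])
# [(879, 1), (490885, 490883)]
```

The first number read is 879 and the last is 490883, which are the
biological start and end referred to above.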
>> That would be possible for any stranded feature, although it
>> is less well defined for strand-less features (which in GenBank
>> and EMBL are by default forward strand).
>> Anyway, using NC_005213 as the example:
>>
>> ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Nanoarchaeum_equitans_Kin4_M_uid58009/NC_005213.gbk
>>
>> Sample code:
>>
>> >>> from Bio import SeqIO
>> >>> record = SeqIO.read("NC_005213.gbk", "genbank")
>> >>> feature = record.features[4] # the CDS record spanning origin
>> >>> print(feature.location)
>> join{[0:879](-), [490882:490885](-)}
>> >>> print(feature.location.start)
>> 0
>> >>> print(feature.location.end)
>> 490885
>>
>> Perhaps there is a case for "biological" start and end properties
>> (calculated from the current data structure on demand)? i.e.
>> something like this:
>>
>> >>> loc = feature.location
>> >>> loc.parts[0].end if loc.parts[0].strand == -1 else loc.parts[0].start
>> ExactPosition(879)
>> >>> loc.parts[-1].start if loc.parts[-1].strand == -1 else loc.parts[-1].end
>> ExactPosition(490882)
>>
>> Note as a convenience, even basic non-compound locations have
>> a parts property - returning a list of one entry, themselves. So that
>> code should work in general :)
>
> Maybe so. I use left and right for the biopython start and end
> locations myself, to distinguish between biological starts and ends,
> which I calculate with something like your logic, unless around origin.
Great.
With hindsight the Biopython location could have used left and
right for the current properties start and end, where left <= right
regardless of the strand. But we're stuck with that naming choice
now, thus my suggestion of adding bio_start and bio_end alternatives.
>
> I appreciate that len(feature) is correct, even for around origin.
Yep - we have unit tests to double check this matches the length
of the sequence pulled out using the extract method, even for
nasty cases like mixed-strand trans-splicing.
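To illustrate why the length has to come from the parts rather than
from end minus start, here is a small sketch using plain (start, end)
tuples in Python counting. This is not how Biopython implements it
internally, and the helper name total_length is hypothetical:

```python
def total_length(parts):
    """Sum the lengths of half-open (start, end) intervals (Python counting)."""
    return sum(end - start for start, end in parts)


# The origin-spanning CDS from NC_005213:
# join{[0:879](-), [490882:490885](-)}
parts = [(0, 879), (490882, 490885)]

print(total_length(parts))
# 882

# Naive "end minus start" over the whole location covers almost
# the entire chromosome instead.
span = max(end for _, end in parts) - min(start for start, _ in parts)
print(span)
# 490885
```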
>> The potential enhancement would be to define these as new
>> properties, feature.location.bio_start and feature.location.bio_end
>> or similar naming? Would that be useful? What would you name them?
>
> I do rather like that, especially for cases around origin. bio_start/end
> or transcription_start/end -- nah too long. I like bio_. If nothing else,
> would serve as a reminder that start and end are not necessarily the
> transcribed start and end.
The proposed bio_start and bio_end would also be different for
simple reverse strand features, e.g.
complement(1001..2000)
start = 1000 (Python counting)
end = 2000
strand = -1
len = 1000
This would get:
bio_start = 2000
bio_end = 1000
Notice start < end (which we might have better called left/right),
while bio_start > bio_end.
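A minimal sketch of the proposed logic, assuming each location is
reduced to a list of (start, end, strand) tuples in Python counting
with start <= end in every part. The bio_start and bio_end functions
here are stand-ins for the properties being proposed; they are not
existing Biopython API:

```python
def bio_start(parts):
    """Biological start: where reading of the first part begins."""
    start, end, strand = parts[0]
    return end if strand == -1 else start


def bio_end(parts):
    """Biological end: where reading of the last part finishes."""
    start, end, strand = parts[-1]
    return start if strand == -1 else end


# Simple reverse strand feature, complement(1001..2000)
simple = [(1000, 2000, -1)]
print(bio_start(simple), bio_end(simple))
# 2000 1000

# Origin-spanning CDS, join{[0:879](-), [490882:490885](-)}
spanning = [(0, 879, -1), (490882, 490885, -1)]
print(bio_start(spanning), bio_end(spanning))
# 879 490882

# Forward strand feature: same as the ordinary start and end
forward = [(1000, 2000, +1)]
print(bio_start(forward), bio_end(forward))
# 1000 2000
```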
Peter