[Biopython] sequence coordinate mapping

Peter biopython at maubp.freeserve.co.uk
Fri Jun 18 18:10:24 UTC 2010


On Fri, Jun 18, 2010 at 3:08 PM, Chris Fields <cjfields at illinois.edu> wrote:
> On Jun 18, 2010, at 8:39 AM, Peter wrote:
>> Perhaps one approach would be to do this in the SeqFeature. If we
>> define a SeqFeature's length in the natural way, then we have
>> len(SeqFeature) == len(SeqFeature.extract(parent_seq)).
>> Now we have two coordinates systems, 0 to len(SeqFeature) and
>> the regions it describes on the parent sequence. Then we could
>> discuss a pair of methods on the SeqFeature for converting
>> between the two coordinate systems. Once you have that, the
>> special case of amino acid coordinates is much easier to do
>> (account for where the start codon is, divide by three).
>>
>> I've made another commit on the __contains__ branch to
>> also implement __len__ for the SeqFeature:
>> http://github.com/peterjc/biopython/commit/74b264acacd228d64859d28d75e2c30a8030d03f
>>
>> Peter
>
> We essentially do this with Bioperl features, locations, and ranges
> (in fact, the coordinate system previously mentioned use these).
> Basically, anything that is-a Range can be compared to anything else
> that is-a Range (this has bitten us a well :).  Beyond the module
> documentation the test suite has a bit more on it, and Aaron Mackey
> has a presentation up on slideshare that touches upon the
> Bio::Coordinate implementation:
>
> http://www.slideshare.net/bosc_2008/mackey-bio-perl-bosc2008

Thanks for the link - I didn't see much in the presentation that I
hadn't seen in the documentation though. I guess the BioPerl
unit tests would be worth checking out. Thanks Chris.

Back in Biopython land (where we seem to have adopted similar
but different names like locations and positions), I had a go at
doing one of the mappings - from feature coordinates to parent
coordinates (e.g. CDS back to genome, or PFAM domain back
to protein if your SeqRecord is for a protein sequence):

http://github.com/peterjc/biopython/tree/feature-coords

As you can tell from the rather lengthy docstring and doctests,
this is quite hairy and difficult to explain. The way Python
sub-setting works also complicates how to translate the
break point between two subfeatures (exons), since you
may want to use the number as the end of the first exon or
as the start of the second exon. I've implemented the
mapping so that single letter access works as expected:

http://github.com/peterjc/biopython/tree/74b264acacd228d64859d28d75e2c30a8030d03f

I'm pretty sure we can do the reverse mapping from a
parent sequence coordinate to a feature coordinate,
although I can come up with pathological examples where
this is not a one to one mapping, but one to many. e.g. a
ribosomal slippage where a base gets used twice. In this
case we could raise an error, or maybe more simply take
the first match.

I'm not convinced about adding these methods just yet -
but the relatively simple work to support "len(feature)"
and "x in feature" look like useful additions.

Peter




More information about the Biopython mailing list