[Biopython-dev] SeqRecord locations; was: Beta code in the official releases?

Thu Sep 6 05:57:04 UTC 2012

Hi guys,

To add my two cents, I am in favor of creating a dynamic SeqRecord
coordinate system using SeqFeature. However, I think it would also be
good if we set some limitations as there are so many ways that slicing
and addition could be used to create new SeqRecords, and anticipating
all these scenarios may create an over-engineered (and probably
slower) SeqRecord.

Some scenarios that I can think now:

1. Slicing SeqRecord objects using step values > 1 (e.g. new_seq = seq[1:120:3])
2. Adding two or more SeqRecord objects with noncontiguous coordinate
(i.e. end coordinate of the first sequence is not directly followed by
the second sequence's start coordinate), and then slice the resulting
object

So maybe some limitations that we could set are:

1. Only update the coordinates if slicing step is 1 (or -1), otherwise
discard it.
2. Only update the coordinates if addition is between contiguous
coordinates, otherwise discard it.

Personally, I think this would cover most use cases for slicing while
allowing us to keep it simple.

As for the name, 'region' sounds better than 'location'. Maybe
'coverage'? I don't have any strong preference between these, but
'subregion' doesn't feel that nice.

Finally, for the coordinate system, I imagine it will use Python's
coordinate system, too? (zero-based, half-open, and the parsers /
writers should do the conversion). Should we also reverse the
coordinates if the objects are sliced in reverse (e.g.
seqrecord[::-1]) or simply inverse the strand value but keep the
coordinates unchanged?

regards,
Bow

On Thu, Sep 6, 2012 at 3:34 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Sep 6, 2012 at 1:10 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>>
>> In my mind, the main technical issue regarding MAF and AlignIO
>> and the common alignment object is the lack of a common way
>> of handling the idea of start/end (and sometimes strand) for
>> each sequence (in a consistent co-ordinate system using Python
>> counting). Evidently I haven't manage to adequately convey my
>> interpretation/concern.
>>
>> Some file formats like EMBOSS' have these number explicitly
>> but we're not parsing them:
>> http://lists.open-bio.org/pipermail/biopython/2012-September/008142.html
>>
>> In the case of "fasta-m10" the numbers are stored in private
>> properties as a 'short term' hack:
>> http://lists.open-bio.org/pipermail/biopython-dev/2012-June/009744.html
>>
>> Others like Stockholm have identifier/start-end as a combined
>> names (but this is not mandatory). Here the start and end are
>> being stored in the annotations dictionary (as unparsed strings,
>> still using 1-based co-ordinates).
>>
>> In MAF the start/end are explicit and much more important.
>> It would be near pointless to parse the the file ignoring these.
>> Maybe your approach is good enough for MAF, and we
>> should have adopted it as is, and delayed better integration
>> with the other AlignIO formats?
>>
>> i.e. This is a general limitation in AlignIO and the object
>> model, somewhat annoying in the formats already supported,
>> but information critical to the MAF format.
>>
>> I was expecting a convention for this to fall out of Bow's GSoC
>> work for 'pairwise alignments' in SearchIO - but the object
>> model he came up with was not SeqRecord based (many
>> of the file formats he was using didn't include sequences).
>>
>> Right now my inclination is still to add a location property to
>> the SeqRecord, usually a FeatureLocation, but it could also
>> be the proposed CompoundLocation for more complex cases.
>> The question then is if/when this would be propagated, e.g.
>> SeqRecord slicing/addition.
>> http://lists.open-bio.org/pipermail/biopython-dev/2012-May/009646.html
>> http://lists.open-bio.org/pipermail/biopython-dev/2012-July/009803.html
>>
>> So the wheels are turning, but slowly. I have not had as
>> much time to dedicate to this as I would like - but other
>> smaller or less inter-connected things are much easer to
>> review and merge.
>
> To expand on the SeqRecord.location property idea, I am
> thinking about (in the typical use cases) using a normal
> FeatureLocation object (from Bio.SeqFeature) where the
> start, end or strand are in the same co-ordinate system
> as the sequence of the SeqRecord.
>
> i.e. For a protein fragment, they would be in amino acids.
> For a nucleotide fragment, they would be in base pairs.
>
> Note that you might want to describe the CDS region
> for a protein sequence (which would be possible even
> for a join using the proposed CompoundLocation), so
> maybe 'location' is the wrong name here, perhaps
> 'fragment' or 'subregion', or something is clearer?
>
> When I talked about adding SeqRecords, and what would
> the combined SeqRecord's location be, we could use
> FeatureLocation addition (as defined on the branch for
> CompoundLocation objects).
>
> For slicing a SeqRecord, provided len(record.location)
> == len(record), this is well defined. However, I expect
> that quite often if used for alignments, what we will have
> instead is len(record.location) = len(record.seq.ungapped())
> so we might be able to update the sub-record's location
> if we count the gap characters and factor them in. This
> equality could be verified in the SeqRecord __init__
> (which would require the gap character, but the AlignIO
> parsers should all set that).
>
> I would like slicing to update the start/end because
> slicing alignment objects seems to be a quite common
> operation - so if you started from an alignment file
> using start/end (like Stockholm or MAF) it would be
> good to update these fields for the sub-alignment.
>
> This feels like it would work, but would it be useful or
> just over engineering? Would a simple static location
> property which is not automatically propagated in
> SeqRecord manipulations be enough (at least initially)?
>
> If so, is Brad's suggestion to just use special values in
> the annotations dictionary a simpler way forward (where
> we already have policies in place for handling generic
> annotation during SeqRecord annotation - in general
> dropping it)?
>
> If so, would this be keys 'start', 'end', 'strand' for
> integer start and end using Python counting, and
> a strand value of +1 or -1 for forward and reverse?
> [We could use strand None for unavailable as in
> the SeqFeature location object, but I think no entry
> in the dictionary is nicer here].
>
> Peter
> _______________________________________________
> Biopython-dev mailing list
> Biopython-dev at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython-dev