[Biopython-dev] [BioPython] about the SeqRecord slicing

Fri Mar 27 10:29:10 UTC 2009

On Fri, Mar 27, 2009 at 8:22 AM, Jose Blanca <jblanca at btc.upv.es> wrote:
> On Thursday 26 March 2009 16:32:23 Peter wrote:
>
>> You'd also want the SeqRecord to support __add__ (and __radd__) so
>> that two SeqRecord objects can be added together.  I have thought
>> about this before, and it is a *much* more complicated issue due to
>> the meta data.  In general the only safe and unambiguous choice is to
>> exclude it from the combined record:
>> * sequence - just add (using normal rules for adding Seq objects)
>> * name/id/description - if the two agree, use that?  Otherwise default
>> to a blank value?
>> * annotations - for each keyed value, you could combine the entries?
>> Or just throw them all away?
>> * letter_annotations - if an entry is present in both you can combine
>> it.  Otherwise throw them away?
>> * features - these could be combined, adjusting the locations for one
>> record's features as appropriate
>
> As I said before I think that the same problem is presented when you do a
> slice. If I have the sequence of a gene named X with some annotations and I
> slice a part, is it still named geneX? Should the annotations be kept?

The annotation problems when slicing a SeqRecord are similar, but I think
things are worse when adding two SeqRecords together.
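
To make the addition rules quoted above concrete, here is a rough sketch.
The Record and Feature classes and the add_records function are hypothetical
stand-ins written for illustration only; this is not the SeqRecord API. The
choice to keep only those annotations on which both records agree is an
assumption, picked as one conservative reading of the open questions above.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for a sequence feature (not a Biopython class)."""
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    type: str = "misc"


@dataclass
class Record:
    """Hypothetical stand-in for a SeqRecord, holding the same kinds of data."""
    seq: str
    id: str = ""
    name: str = ""
    description: str = ""
    annotations: dict = field(default_factory=dict)
    letter_annotations: dict = field(default_factory=dict)
    features: list = field(default_factory=list)


def add_records(a, b):
    """Combine two records following the rules discussed in this thread."""
    offset = len(a.seq)  # features of b move right by the length of a

    def agreed(x, y):
        # Use the value only if both records agree, otherwise blank it.
        return x if x == y else ""

    return Record(
        seq=a.seq + b.seq,
        id=agreed(a.id, b.id),
        name=agreed(a.name, b.name),
        description=agreed(a.description, b.description),
        # Assumption: keep only annotation entries identical in both records.
        annotations={k: v for k, v in a.annotations.items()
                     if k in b.annotations and b.annotations[k] == v},
        # Per-letter annotation survives only if present in both records.
        letter_annotations={k: list(v) + list(b.letter_annotations[k])
                            for k, v in a.letter_annotations.items()
                            if k in b.letter_annotations},
        features=list(a.features) + [
            Feature(f.start + offset, f.end + offset, f.type)
            for f in b.features
        ],
    )


left = Record(seq="ATG", id="geneX",
              annotations={"organism": "example", "source": "left"},
              letter_annotations={"quality": [30, 31, 32]},
              features=[Feature(0, 3, "start_codon")])
right = Record(seq="GCCTAA", id="geneX",
               annotations={"organism": "example", "source": "right"},
               letter_annotations={"quality": [33, 34, 35, 36, 37, 38]},
               features=[Feature(3, 6, "stop_codon")])
combined = add_records(left, right)
```

In this sketch the conflicting "source" annotation is dropped, and the
stop_codon feature from the second record ends up at 6..9 in the combined
record.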

For slicing, there are a few sub-cases:
- per-letter-annotation can be sliced too - easy.
- features - we retain only features fully inside the new sub-sequence (the
  borderline features which cross the slice boundary are a small problem -
  excluding them is the simplest solution to code and explain).
- id/name - debatable.  Currently kept.
- description - debatable.  Consider a description which says "whole genome",
  that doesn't really apply to a partial sequence.  On the other hand, it may.
  Currently kept for the sub-record.
- annotations - again debatable.  Without context information, we can't guess.
  The only sensible options are to keep it all (as in CVS) or to keep none of it.
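
As a rough sketch of the slicing behaviour listed above: the Record and
Feature classes and the slice_record function below are hypothetical
stand-ins written for illustration, not the real SeqRecord code, and the
sketch only handles non-negative indices with a step of one. It keeps the
id, name, description and annotations, as described for the current code.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for a sequence feature (not a Biopython class)."""
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    type: str = "misc"


@dataclass
class Record:
    """Hypothetical stand-in for a SeqRecord, holding the same kinds of data."""
    seq: str
    id: str = ""
    name: str = ""
    description: str = ""
    annotations: dict = field(default_factory=dict)
    letter_annotations: dict = field(default_factory=dict)
    features: list = field(default_factory=list)


def slice_record(rec, start, end):
    """Return the sub-record covering rec.seq[start:end]."""
    if not 0 <= start <= end <= len(rec.seq):
        raise ValueError("expected 0 <= start <= end <= len(seq)")
    return Record(
        seq=rec.seq[start:end],
        # id/name/description/annotations: debatable, kept unchanged here.
        id=rec.id,
        name=rec.name,
        description=rec.description,
        annotations=dict(rec.annotations),
        # Per-letter annotation is sliced along with the sequence.
        letter_annotations={k: v[start:end]
                            for k, v in rec.letter_annotations.items()},
        # Keep only features fully inside the slice, shifted to new coordinates.
        features=[Feature(f.start - start, f.end - start, f.type)
                  for f in rec.features
                  if start <= f.start and f.end <= end],
    )


gene = Record(
    seq="ATGGCCATTGTA",
    id="geneX",
    description="whole gene",
    annotations={"organism": "example"},
    letter_annotations={"quality": list(range(30, 42))},
    features=[Feature(0, 3, "start_codon"),
              Feature(2, 5, "crossing"),
              Feature(4, 10, "domain")],
)
sub = slice_record(gene, 3, 12)
```

Here the start_codon feature lies outside the slice and the "crossing"
feature spans the slice boundary, so both are dropped; only the "domain"
feature survives, moved to 1..7 in the sub-record.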

I think it is worth keeping the id/name in general (consider typical use cases
like cropping a domain from a gene, or cropping columns off an alignment).
I would be OK with dropping the contents of the annotations dictionary and
description in order to avoid ambiguity, but this would prevent certain tasks.

Peter