[Biopython] Additions to the SeqRecord

Thu Nov 12 12:04:32 UTC 2009

Hello all,

Something we added in Biopython 1.50 was the ability to slice a SeqRecord,
which tries to do something sensible with all the annotation - in particular
per-letter-annotation (like quality scores) and features (which have locations)
are handled as you would naturally expect.

Something you can look forward to in our next release (assuming no
major issues crop up in testing) is adding SeqRecord objects together.
Again, this will try to do something unambiguous with the annotation.

I have two motivational examples in mind which combine slicing and
addition of SeqRecord objects to edit a record while preserving as much
annotation as possible. For example, removing a section of sequence,
say letters from 100 to 200:

from Bio import SeqIO
record = SeqIO.read(...)
deletion_mutant = record[:100] + record[200:]

(The above would make sense for both protein and nucleotide records.)
Or, for a circular nucleotide sequence (like a plasmid or many small
genomes), you might want to shift the origin, e.g. by 150 bases:

shifted = record[150:] + record[:150]
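
To make the slice coordinates concrete, here is a minimal sketch of the same
two edits on a plain Python string standing in for the record. The 300-letter
sequence is made up for illustration. Python slices are zero-based and
half-open, so the deletion drops indices 100 to 199 inclusive (100 letters).

```python
# Plain-string stand-in for a 300-letter record; the letters are arbitrary.
sequence = "".join("ACGT"[i % 4] for i in range(300))

# Deletion: drops the 100 letters at zero-based indices 100..199.
deletion_mutant = sequence[:100] + sequence[200:]

# Origin shift for a circular sequence: old index 150 becomes new index 0.
shifted = sequence[150:] + sequence[:150]

print(len(deletion_mutant))  # 200
print(len(shifted))  # 300
# Every letter of the shifted sequence maps back to (i + 150) modulo the length.
print(all(shifted[i] == sequence[(i + 150) % 300] for i in range(300)))  # True
```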

You can already do both these examples with the latest (unreleased) code.
However, the situation with the annotation isn't ideal. When slicing a record,
there is no way to know for sure whether annotation that is not tied to a
location still applies to the daughter sequence. Because of this ambiguity,
when we added SeqRecord slicing in Biopython 1.50 we did not copy the dbxrefs
list and annotations dictionary to the daughter record. That is, you currently
have to do this manually (if required), for example:

deletion_mutant = record[:100] + record[200:]
deletion_mutant.dbxrefs = record.dbxrefs[:]
deletion_mutant.annotations = record.annotations.copy()
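
If this manual step is needed in several places, it could be wrapped in a
small helper. The sketch below is only an illustration: the function name
copy_general_annotation is hypothetical (it is not part of Biopython), and
SimpleNamespace objects with made-up values stand in for SeqRecord objects so
that the snippet runs without Biopython installed.

```python
from types import SimpleNamespace


def copy_general_annotation(parent, daughter):
    """Give daughter shallow copies of parent's dbxrefs and annotations.

    Hypothetical helper, not part of Biopython. It only assumes both objects
    have a dbxrefs list and an annotations dict, as a SeqRecord does.
    """
    daughter.dbxrefs = parent.dbxrefs[:]
    daughter.annotations = parent.annotations.copy()
    return daughter


# Stand-in objects with made-up values, used instead of real SeqRecords.
parent = SimpleNamespace(dbxrefs=["Example:1"], annotations={"topology": "circular"})
daughter = SimpleNamespace(dbxrefs=[], annotations={})

copy_general_annotation(parent, daughter)
daughter.dbxrefs.append("Example:2")  # does not touch the parent's list
print(parent.dbxrefs)  # ['Example:1']
print(daughter.annotations)  # {'topology': 'circular'}
```

Note that these are shallow copies: a mutable value stored inside the
annotations dict (a list, say) would still be shared between parent and
daughter.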

I would like to propose changing the SeqRecord slice behaviour to
blindly copy the dbxrefs list and annotations dict to the daughter record
(just like the id, name and description are already blindly copied even
though they may not make sense for the daughter record). Then these
slicing+addition examples will "just work" without the user having to
explicitly copy the dbxrefs and annotations dict.
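
As a rough illustration of the proposed behaviour, here is a toy stand-in
class. It is not Biopython's SeqRecord and not the code in the commit
referenced below. ToyRecord and its values are made up, and the addition rule
(keep the id and annotation entries only where both operands agree, pool the
dbxrefs) is an assumption about what "unambiguous" could mean.

```python
class ToyRecord:
    """Toy stand-in for a SeqRecord; NOT Biopython's implementation."""

    def __init__(self, seq, id="<unknown id>", dbxrefs=None, annotations=None):
        self.seq = seq
        self.id = id
        # Always store fresh copies so records never share these containers.
        self.dbxrefs = list(dbxrefs or [])
        self.annotations = dict(annotations or {})

    def __getitem__(self, index):
        if not isinstance(index, slice):
            return self.seq[index]  # a single letter, as for a string
        # Proposed behaviour: blindly copy dbxrefs and annotations to the
        # daughter, just as the id is already copied.
        return ToyRecord(self.seq[index], self.id, self.dbxrefs, self.annotations)

    def __add__(self, other):
        # Assumption: keep the id and annotation entries only where both
        # operands agree, and pool the dbxrefs without duplicates.
        new_id = self.id if self.id == other.id else "<unknown id>"
        dbxrefs = self.dbxrefs + [x for x in other.dbxrefs if x not in self.dbxrefs]
        annotations = {
            key: value
            for key, value in self.annotations.items()
            if key in other.annotations and other.annotations[key] == value
        }
        return ToyRecord(self.seq + other.seq, new_id, dbxrefs, annotations)


record = ToyRecord("ACGT" * 75, id="toy1", dbxrefs=["Example:1"],
                   annotations={"topology": "circular"})
shifted = record[150:] + record[:150]
print(shifted.id, len(shifted.seq))  # toy1 300
print(shifted.dbxrefs)  # ['Example:1']
print(shifted.annotations)  # {'topology': 'circular'}
```

Because both slices inherit the parent's dbxrefs and annotations, the two
halves agree with each other and the combined record keeps them without any
manual copying.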

This is a backwards-incompatible change, but with hindsight it is
perhaps the more natural behaviour. We would of course highlight this
in the release notes (maybe with some worked examples on the blog).

Does changing SeqRecord slicing like this seem like a good idea?

Peter

P.S. The code changes required are very small (two extra lines); see
this commit on my experimental branch on GitHub for details. Most
of the changes are documentation and unit tests for this work:
http://github.com/peterjc/biopython/commit/41e944f338476a79bd7f8998196df21a1c06d4f7