[Biopython-dev] [Biopython] SeqRecord reverse complement method?

Wed Sep 30 15:27:03 UTC 2009

Hi all,

A few months back on the main mailing list, Cedar and I were
talking about taking a SeqRecord, and how to write out its reverse
complement to a file. The thread is archived here:
http://lists.open-bio.org/pipermail/biopython/2009-June/005307.html

Cedar - I cc'd you, as I am not sure if you are on the dev list.
I expect this could get technical pretty quickly, so I wanted
to float this idea on the dev list first...

-----------------------------------------------------------------

So, the background this this discussion:

Unless there is some complicated annotation to transfer,
using Biopython as is, making a new SeqRecord using
the reverse complement sequence of the old SeqRecord
isn't very hard, see:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement

This has meant that generally the current status quo isn't
a problem (at least for me). However, what prompted me
to work on this issue was a real world example.

We have a draft genome where after doing a basic
annotation, it would make sense to flip the strands. I
want to be able to load our current GenBank file, apply
the reverse complement, and have all the annotated
features recalculated to match. With more and more
sequencing projects, this isn't such an odd thing to
want to do.

Dealing with the details of potentially complex locations
in SeqFeature object's isn't very nice, so I think it would
be useful to have this particular functionality built into
Biopython. It is also a small step towards making the
SeqRecord more Seq like (which in general seems a
good idea).

On Thu, Jun 25, 2009 at 12:20 AM, Peter wrote:
>
> What you are doing is fine - although personally I might wrap up the
> first line as a function, as done in the tutorial:
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement
>
> While we could add a reverse_complement() method to the SeqRecord
> (and other Seq methods, like translate etc), there is one big problem:
> What to do with the annotation. If your record used to have a name
> based on an accession or a GI number, then this really does not apply
> to the reverse complement (or a translation etc). We could do something
> arbitrary like adding an "rc_" prefix (or variants) but I think the only safe
> answer is to make the user think about this and do what is appropriate
> in their context. And as you have demonstrated, this can still be done
> in one line :)
>
> I make a habit of using this as a justification, but I feel the zen of
> Python "Explicit is better than implicit" applies quite well here.

I've been thinking about this on and off since then, and I still
maintain that for much of the annotation there is no easy answer.
For the sequence itself, the behaviour is well defined. For all
the annotation, there are three possible actions:
(a) User supplies a new value
(b) Reuse the old value
(c) No annotation (the default for a new SeqRecord)

We can do something sensible with the features (if present) and
it will probably make sense to copy but reverse any per-letter
annotation (if present).

On a github branch I have posted some experimental code
which adds a reverse_complement() method to the SeqRecord.
I propose to give the new reverse_complement() a set of
optional arguments (id, name, etc) following the same names
as the existing attributes (and __init__ arguments), allowing
the user to choose between these three actions.

Assuming the general scheme is popular, I'm quite open
to discussing changing these defaults. But for the first
implementation this is what I picked: For the id, name and
description I still lean towards making the user decide this,
and therefore the default is (c). Likewise for the annotations
dictionary and the database cross refs.

For the features and per-letter-annotation, I would opt to
make the default behaviour be to reuse the old data, option
(b) above. For the per-letter-annotation (the restricted
dictionary, letter_annotations) this just means reversing
each entry. For the features, this means reversing the
order of the features, switching their strands (if set), and
calculating the new coordinates (taking care of all the
possible fuzzy locations and sub-features).

The code is here is anyone wants to look at the
technical details:
http://github.com/peterjc/biopython/commits/seqrecords

Peter