[Biopython] Getting the sequence for a SeqFeature

Fri Nov 6 12:22:03 UTC 2009

Hi all,

I am planing to add a new method to the SeqFeature object, but
would like a little feedback first. This email is really just the
background - I'll write up a few examples later to try and make
this a bit clearer...

A task that comes up every so often on the mailing lists, which I
have needed to do myself in the past, is getting the nucleotide
sequences for features in a GenBank file, e.g.
http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html
http://lists.open-bio.org/pipermail/biopython/2009-October/005695.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005991.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005997.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006958.html

Often, once you have the nucleotide sequence, you'll want to
translate it, e.g. CDS features or mat_peptides as here:
http://lists.open-bio.org/pipermail/bioperl-l/2009-October/031493.html

If you parse a GenBank file (or an EMBL file etc) with SeqIO, you
typically get a SeqRecord object with the the full nucleotide sequence
(as record.seq, a Seq object) and a list of features (as record.features,
a list of SeqFeature objects).

For most prokaryotic features, things are fairly easy - you just need
the (non fuzzy) start and end positions of the SeqFeature, and the
strand. Then slice the parent sequence, and take the reverse
complement if required.

However, there are also rare cases like joins to consider (e.g. a
ribosomal slippage), but joins are common if you deal with eukaryotes
since intron/exon splicing is normal. Here you need to look at the
subfeatures, and their locations - and indeed their strands, as there
are a few mixed strand features in real GenBank files.

In the above examples I have been thinking about genomes, or
any nucleic sequence - but the same applies to proteins where
the features might be the positions of domains. All the same issues
apply except for strands.

As noted in the linked threads, I have some working code currently on
a github branch with unit tests which seems to handle all this. I would
like to include this in Biopython, but first would like a little feedback on
the proposed interface.

What I am proposing is adding a method to the SeqFeature object
taking the parent sequence (as a Seq like object, or even a string)
as a required argument. This would return the region of the parent
sequence described by the feature location and strands (including
any subfeatures for joins).

This could instead be done as a stand alone function, or as a
method of the Seq object (as I suggested back in 2007). However,
on reflection, I think the SeqFeature is most appropriate.
http://lists.open-bio.org/pipermail/biopython/2007-September/003706.html

With this basic functionality in place, it would then be much easier
to take a parent SeqRecord and a child SeqFeature, and build a
child SeqRecord taking the sequence from the parent SeqRecord
(using the above new code), and annotation from the SeqFeature.
This could (later) be added to Biopython as well, perhaps as a
method of the SeqRecord.

As this email is already very long, I'll delay giving any examples.

Peter