[Biopython-dev] Getting nucleotide sequence for GenBank features

Peter biopython at maubp.freeserve.co.uk
Tue Nov 3 23:41:57 UTC 2009


On Wed, Oct 28, 2009 at 12:50 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Wed, Oct 28, 2009 at 12:07 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> I think this should be part of Biopython proper (with unit tests etc), and
>> would like to discuss where to put it. My ideas include:
>>
>> (1) Method of the SeqFeature object taking the parent sequence (as a
>> string, Seq, ...?) as a required argument. Would return an object of the
>> same type as the parent sequence passed in.
>>
>> (2) Separate function, perhaps in Bio.SeqUtils taking the parent
>> sequence (as a string, Seq, ...?) and a SeqFeature object. Would
>> return an object of the same type as the parent sequence passed in.
>>
>> (3) Method of the Seq object taking a SeqFeature, returning a Seq.
>> [A drawback is Bio.Seq currently does not depend on Bio.SeqFeature]
>>
>> (4) Method of the SeqRecord object taking a SeqFeature. Could
>> return a SeqRecord using annotation from the SeqFeature. Complex.
>>
>> Any other ideas?
>>
>> We could even offer more than one of these approaches, but ideally
>> there should be one obvious way for the end user to do this. My
>> question is, which is most intuitive? I quite like idea (1).
>>
>> In terms of code complexity, I expect (1), (2) and (3) to be about the
>> same. Building a SeqRecord in (4) is trickier.
>
> Actually, thinking about this over lunch, for many of the use cases
> we do want to turn a SeqFeature into a SeqRecord - either for the
> nucleotides, or in some cases their translation. And if doing this,
> doing something sensible with the SeqFeature annotation (qualifiers)
> seems generally useful. This could still be done with approaches
> (1) and (2) as well as (4).

Kyle at least seems to like idea (4), so much so that he has gone
ahead and coded up something:
http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006941.html

Certainly there are good reasons for wanting to take a SeqFeature
and the parent sequence (SeqRecord or Seq) and create a SeqRecord
(either plain nucleotides or translated into protein), e.g. pretty
much all non-trivial GenBank to FASTA conversions. Offering this as
a SeqRecord method, option (4), might be the best approach.
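
To make option (4) concrete, here is a rough sketch of turning a
feature plus its parent record into a new record and writing it as
FASTA. SimpleRecord, SimpleFeature, feature_to_record and to_fasta are
hypothetical stand-ins invented for illustration, not Biopython classes
or functions. The sketch assumes forward-strand, single-span features
and qualifiers stored as lists of strings, and the choice of locus_tag
for the id and product for the description is only one possible
convention.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleRecord:
    """Hypothetical stand-in for a SeqRecord (plain string sequence)."""
    id: str
    seq: str
    description: str = ""


@dataclass
class SimpleFeature:
    """Hypothetical stand-in for a SeqFeature (forward strand only)."""
    start: int
    end: int
    qualifiers: dict = field(default_factory=dict)


def feature_to_record(parent, feature):
    """Build a new record for the feature, reusing its qualifiers."""
    # Assumption: each qualifier maps to a list of strings.
    locus_tag = feature.qualifiers.get("locus_tag", [None])[0]
    product = feature.qualifiers.get("product", [""])[0]
    # Fall back to an id built from the parent id and the coordinates.
    new_id = locus_tag or "%s_%i_%i" % (parent.id, feature.start, feature.end)
    return SimpleRecord(
        id=new_id,
        seq=parent.seq[feature.start:feature.end],
        description=product,
    )


def to_fasta(record):
    """Format a record as a single FASTA entry."""
    header = " ".join(part for part in (record.id, record.description) if part)
    return ">%s\n%s\n" % (header, record.seq)
```

The point of the sketch is that the annotation handling (which
qualifier becomes the id, what goes in the description) is the part
that needs design decisions, while the slicing itself is trivial.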

However, I think this sits on top of the more fundamental step
of just extracting the sequence (without worrying about the
annotation). Here, as noted above, I currently favour adding
a method to the SeqFeature, option (1). For the method name,
how about get_sequence, extract_sequence or maybe just
extract?
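
As a minimal sketch of how option (1) might behave, the following uses
a hypothetical SimpleFeature class (not the real SeqFeature) with the
method called extract, one of the candidate names above. It assumes
the parent sequence is a plain DNA string, that coordinates are
zero-based and half-open as in Python slicing, and that a joined
feature is simply the concatenation of its parts in the order listed.
Returning the same type as the parent for Seq objects is not covered.

```python
# Complement table for unambiguous DNA letters only (a simplification).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


class SimpleFeature:
    """Hypothetical stand-in for a SeqFeature, for illustration only."""

    def __init__(self, start, end, strand=1, sub_features=None):
        self.start = start
        self.end = end
        self.strand = strand
        self.sub_features = sub_features or []

    def extract(self, parent_sequence):
        """Return the part of parent_sequence described by this feature."""
        if self.sub_features:
            # Joined location: concatenate each part in the order given.
            return "".join(
                part.extract(parent_sequence) for part in self.sub_features
            )
        piece = parent_sequence[self.start:self.end]
        if self.strand == -1:
            # Reverse strand: take the reverse complement of the slice.
            piece = piece.translate(_COMPLEMENT)[::-1]
        return piece


parent = "AAACCCGGGTTT"
simple = SimpleFeature(3, 6)
reverse = SimpleFeature(0, 4, strand=-1)
joined = SimpleFeature(0, 12, sub_features=[SimpleFeature(0, 3),
                                            SimpleFeature(9, 12)])
print(simple.extract(parent))   # CCC
print(reverse.extract(parent))  # GTTT
print(joined.extract(parent))   # AAATTT
```

Taking the parent sequence as an argument keeps the feature object
independent of any particular record, which is the main attraction of
option (1).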

Peter


