[Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)

Tue May 5 15:26:20 UTC 2009

On Tue, Apr 21, 2009 at 2:55 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
>> I have also been thinking about how I would (re)design the SeqFeature
>> and FeatureLocation objects.  In particular I would want to put the
>> strand as part of the same object as the location, and also any
>> join-locations.  I would still want to cope with fuzzy locations, but
>> make the non-fuzzy approximations more prominent in comparison.  Also,
>> I really don't like the way joins are currently stored as more
>> SeqFeatures in the sub_features list (plus this kind of blocks
>> alternative usage for child/parent nesting that might be nice for GFF
>> files).
>>
>> The prime use case to keep in mind is taking a feature location (even
>> a join), and using this to extract that region of nucleotides from the
>> parent sequence (i.e. a Seq object or a SeqRecord object, as now both
>> can be sliced).

I've written code to do this in test_SeqIO_features.py, which cross
checks the nucleotides pulled out from a GenBank files based on the
SeqFeature, against what the NCBI provide in FASTA format.  This seems
to work OK, but has not been tested extensively (e.g. running it on
drosophila or arabidopsis would be good).

It could make sense to expose this functionality directly in
Biopython, maybe as a method of the SeqRecord taking a SeqFeature (or
the index of a feature in that record), returning a Seq object (or
perhaps a SeqRecord using the feature's annotation).

e.g.
>>> from Bio import SeqIO
>>> record = SeqIO.read(open("NC_005816.gb"),"genbank")
>>> record.extract_feature_seq(6)
Seq('GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAG...TAA',
IUPACAmbiguousDNA())
>>> feature = record.features[6]
>>> record.extract_feature_seq(feature)
Seq('GTGAACAAACAACAACAAACTGCGCTGAATATGGCGCGATTTATCAGAAGCCAG...TAA',
IUPACAmbiguousDNA())

Alternatively, rather than introducing a new method (e.g.
"extract_feature_seq" as in the above example) we could overload the
__getitem__ method of the SeqRecord, i.e. overloading the slice
mechanism so a SeqFeature can alternatively be given, e.g.
record[feature].  Note that passing the index of a feature wouldn't
work as record[6] currently means the seventh letter, rather than the
seventh feature.

Note that just passing a SeqFeature's FeatureLocation is not enough,
as this lacks the strand information, and also any sub-features and
associated location operator (i.e. join).

> I forgot to mention the second major use case I'm concerned about,
> which is recovering the GenBank/EMBL style location string.  I have
> looked at this in the past, by adding methods to the FeatureLocation
> and all the Position objects, but it is complicated by the fact the
> Position objects don't know if they are at the start or end (and for
> the start locations we need to add one to convert from Python
> counting).  This is the main block on having Bio.SeqIO support writing
> GenBank (or EMBL) files with their features included.

See Bug 2294 for writing GenBank files:
http://bugzilla.open-bio.org/show_bug.cgi?id=2294
I've just checked in some code to record the features when writing
GenBank files with Bio.SeqIO.  I solved the feature location issue by
introducing a private function which knows about all the currently
used AbstractPosition objects - the code is actually pretty short.

Peter