[Biopython-dev] SeqFeature and FeatureLocation objects (was Bio.GFF)

Peter Cock p.j.a.cock at googlemail.com
Thu Apr 23 14:06:14 UTC 2009


On Thu, Apr 23, 2009 at 1:36 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Hi all;
>
>> > Unless you are thinking of having an object representation as being too
>> > heavy, the non-light part of SeqFeature is all the FeatureLocation
>> > fuzziness.
>>
>> I've just had a quick go at what should be a 100% backwards compatible
>> modification to the FeatureLocation class to store ExactPosition start
>> or end positions as integers.  The idea should be more memory
>> efficient, using the complex position objects only when required.
>
> I like the idea here but I would go a step further and get rid of
> FeatureLocation, collapsing the start and end location onto the
> SeqFeature itself. FeatureLocation is basically just a holder for a
> start and end coordinates. In this version, you would store the
> positions plus extensions and fuzzy type on the Feature, and then
> instantiate fuzzy objects on demand.
>
> I took a look at the resource usage of these objects versus
> a lightweight implementation. For a GFF file with 70k features, the
> maximum memory usage is 128M versus 111M for the lightweight
> version. So the improvement is rather modest, ~15%.

Thanks for that.  Perhaps the variant idea of using a single
reference for each location would save more (currently it uses two
references, one for the object and one for the integer - so in general
we are wasting memory on a pointer to None).
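
To make the single-reference idea concrete, here is a rough sketch.  The
class names below are simplified stand-ins written for this example (they
are not the actual Bio.SeqFeature code), and CompactLocation is a
hypothetical name.  Each end of the location uses one slot, holding a
plain integer for an exact position and a position object only when the
position is fuzzy:

```python
class ExactPosition:
    """Simplified stand-in for an exact position (assumption, not Biopython's class)."""
    __slots__ = ("position",)

    def __init__(self, position):
        self.position = position


class BeforePosition(ExactPosition):
    """Simplified stand-in for a fuzzy position such as '<5'."""
    __slots__ = ()


class CompactLocation:
    """Hypothetical location using a single reference per end."""
    __slots__ = ("_start", "_end")

    def __init__(self, start, end):
        self._start = self._pack(start)
        self._end = self._pack(end)

    @staticmethod
    def _pack(pos):
        # Collapse exact positions to bare integers; keep fuzzy objects as they are.
        if type(pos) is ExactPosition:
            return pos.position
        return pos

    @staticmethod
    def _unpack(value):
        # Rebuild the position object only when somebody asks for it.
        if isinstance(value, int):
            return ExactPosition(value)
        return value

    @property
    def start(self):
        return self._unpack(self._start)

    @property
    def end(self):
        return self._unpack(self._end)


loc = CompactLocation(ExactPosition(5), BeforePosition(10))
```

Callers still see position objects via loc.start and loc.end, so the
public behaviour should stay the same while the common case only stores
an integer.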

Certainly merging the SeqFeature and FeatureLocation should save even
more memory.  We could do this with full backward compatibility by
generating the FeatureLocation object on request (using a property
method for the SeqFeature's location), and this can also trigger a
deprecation warning.  We'd have to think about what to do with the
SeqFeature's __init__ method more carefully.
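
As a sketch of what that might look like (the classes here are minimal
stand-ins for illustration only, and the __init__ signature is an
assumption, since that is the part still to be decided):

```python
import warnings


class FeatureLocation:
    """Minimal stand-in holding a start and end coordinate."""

    def __init__(self, start, end):
        self.start = start
        self.end = end


class SeqFeature:
    """Stand-in feature storing its coordinates directly."""

    def __init__(self, start, end):
        # Hypothetical signature; the real __init__ takes a location object.
        self.start = start
        self.end = end

    @property
    def location(self):
        # Build the old style object on request and warn the caller.
        warnings.warn(
            "SeqFeature.location is deprecated, use start and end instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return FeatureLocation(self.start, self.end)


feature = SeqFeature(10, 50)
```

Existing code calling feature.location would keep working, but each
access creates a new FeatureLocation, so changes made to the returned
object would not be reflected on the feature.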

>> I forgot to mention the second major use case I'm concerned about,
>> which is recovering the GenBank/EMBL style location string.  I have
>> looked at this in the past, by adding methods to the FeatureLocation
>> and all the Position objects, but it is complicated by the fact the
>> Position objects don't know if they are at the start or end (and for
>> the start locations we need to add one to convert from Python
>> counting).  This is the main block on having Bio.SeqIO support writing
>> GenBank (or EMBL) files with their features included.
>
> I admittedly haven't looked at this in a while, but this was
> designed to be round tripped. The GenBank Record class can be
> written out back in GenBank format, and test_GenBank explicitly
> checks that the start and end records are the same.

Yes - the Bio.GenBank.Record class should round-trip; from memory, it
stores feature locations as strings.

I'm interested in writing a SeqRecord out as a GenBank file (which we
can already do, but without the features).  This would let you do things
like load an EMBL or GFF3 file as a SeqRecord, and output it as a
GenBank file.
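
The missing piece for writing features is the location string.  A rough
sketch of the start/end asymmetry, working from plain integers rather
than Position objects (location_string and its arguments are made up for
this example, and it ignores joins and the other fuzzy types):

```python
def location_string(start, end, strand=1, start_fuzzy=False, end_fuzzy=False):
    """Build a simple GenBank/EMBL style location from Python style coordinates."""
    # Python counting is zero-based and half-open, while GenBank is one-based
    # and inclusive, so only the start needs one added to it.
    left = ("<" if start_fuzzy else "") + str(start + 1)
    right = (">" if end_fuzzy else "") + str(end)
    if start + 1 == end and not (start_fuzzy or end_fuzzy):
        # A single base is written as one number rather than a range.
        text = right
    else:
        text = left + ".." + right
    if strand == -1:
        text = "complement(%s)" % text
    return text
```

For example, location_string(0, 9) gives "1..9".  The function has to be
told which value is the start, which is exactly what the Position
objects do not know about themselves.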

Peter



