[Biopython-dev] about the SeqRecord and SeqFeature classes

Peter biopython at maubp.freeserve.co.uk
Tue Sep 23 14:37:29 UTC 2008


On Tue, Sep 23, 2008 at 1:40 PM, Jose Blanca <jblanca at btc.upv.es> wrote:
> Hi:
> I'm still interested in the design of the Sequence and Alignment classes. For
> my work I need sequence classes with some extended features. I need a
> SequenceWithQuality class and a Seq class capable of holding information
> about features located in different regions of the sequence.
> I could use SeqRecord for the sequence with features and extend Seq for the
> SequenceWithQuality, but I have found some problems with this approach.

I would also like to be able to have SeqRecord or Seq objects with a
quality sequence.  This is probably more important than a general "per
letter annotation" system for sequences.  Would you want to use
integers, floats or characters for the quality scores?
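
To make the question concrete, here is a rough sketch of a sequence
that carries one integer score per letter.  The class name
SeqWithQuality and its attributes are made up for illustration; this
is not existing Biopython code, and integers are just one of the three
options.

```python
class SeqWithQuality:
    """Hypothetical immutable sequence with one integer score per letter."""

    def __init__(self, letters, qualities):
        if len(letters) != len(qualities):
            raise ValueError("need exactly one quality score per letter")
        self.letters = str(letters)
        # Stored as a tuple so the scores cannot drift out of step
        # with the letters after construction.
        self.qualities = tuple(int(q) for q in qualities)

    def __len__(self):
        return len(self.letters)

    def mean_quality(self):
        """Average score over the whole sequence (0.0 if it is empty)."""
        if not self.qualities:
            return 0.0
        return sum(self.qualities) / float(len(self.qualities))


read = SeqWithQuality("ACGT", [40, 30, 20, 10])
average = read.mean_quality()
```

Using floats or characters instead would only change the conversion
in __init__, but the length check would stay the same.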

> SeqRecord still doesn't have a __getitem__ method.

What do you think of the __getitem__ method proposed in attachment 942
on Bug 2507?
http://bugzilla.open-bio.org/show_bug.cgi?id=2507
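
For anyone who has not looked at the bug, the kind of behaviour under
discussion can be sketched roughly as below.  This is NOT the code in
the attachment; Record and Feature are stand-in classes invented for
this example, and dropping features that are only partly inside the
slice is just one possible policy.

```python
class Feature:
    """Stand-in feature: a type plus a half-open [start, end) span."""

    def __init__(self, type, start, end):
        self.type = type
        self.start = start
        self.end = end


class Record:
    """Hypothetical record class; not Biopython's SeqRecord."""

    def __init__(self, seq, features=()):
        self.seq = seq
        self.features = list(features)

    def __getitem__(self, index):
        if not isinstance(index, slice):
            # A single position just returns the letter.
            return self.seq[index]
        start, stop, step = index.indices(len(self.seq))
        if step != 1:
            raise ValueError("this sketch only handles contiguous slices")
        # Keep features lying fully inside the slice, shifting their
        # coordinates so they refer to the new, shorter sequence.
        kept = [
            Feature(f.type, f.start - start, f.end - start)
            for f in self.features
            if start <= f.start and f.end <= stop
        ]
        return Record(self.seq[index], kept)


record = Record("AAACCCGGGTTT", [Feature("CDS", 3, 9), Feature("gene", 0, 12)])
sub = record[2:10]
```

Here sub keeps only the CDS feature, now at positions 1 to 7 of the
sliced sequence; the gene feature extends past the slice and is
dropped.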

> Also, SeqRecord exposes the implementation of the features collection,
> it's a public list. That I think is a limitation. For instance, we could be
> interested in checking whether an added feature lies inside the region
> covered by the sequence.

Yes, because it is currently a public list we can't easily stop the
user putting inappropriate features (or other objects) into the list.
A list-like sub-class with some brains behind it might be one
backwards compatible approach.  But do we really need to worry about
this?
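
As a minimal sketch of the list sub-class idea (the name
CheckedFeatureList, the seq_length argument and the Feature stand-in
are all hypothetical, and real features would carry a location object
rather than bare start/end integers):

```python
from collections import namedtuple

# Stand-in for a feature with a half-open [start, end) span.
Feature = namedtuple("Feature", ["type", "start", "end"])


class CheckedFeatureList(list):
    """Hypothetical list sub-class that rejects out-of-range features."""

    def __init__(self, seq_length, features=()):
        super().__init__()
        self._seq_length = seq_length
        for feature in features:
            self.append(feature)

    def _check(self, feature):
        try:
            start, end = feature.start, feature.end
        except AttributeError:
            raise TypeError("expected an object with start and end attributes")
        if not (0 <= start <= end <= self._seq_length):
            raise ValueError("feature %r lies outside the sequence" % (feature,))
        return feature

    def append(self, feature):
        super().append(self._check(feature))

    def insert(self, index, feature):
        super().insert(index, self._check(feature))

    def extend(self, features):
        for feature in features:
            self.append(feature)


features = CheckedFeatureList(seq_length=12)
features.append(Feature("CDS", 3, 9))
try:
    features.append(Feature("gene", 5, 20))
except ValueError as err:
    rejected = str(err)
```

Note that this sketch leaves item assignment, slice assignment and +=
unchecked, so a complete version would need to override those too,
which is part of why I wonder whether it is worth the trouble.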

> We also can't ask for features by their name or type.

You can work around this by creating a lookup dictionary, e.g.
http://www.warwick.ac.uk/go/peter_cock/python/genbank/#indexing_features
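
In outline, the workaround looks something like the sketch below.
The Feature class is a stand-in with just a type and a qualifiers
dictionary (values held as lists), and index_features is a helper
name made up for this example.

```python
class Feature:
    """Stand-in feature: a type and a dict of qualifier lists."""

    def __init__(self, type, qualifiers):
        self.type = type
        self.qualifiers = qualifiers


def index_features(features, key="locus_tag", feature_type="CDS"):
    """Map each qualifier value to the position of its feature in the list."""
    index = {}
    for position, feature in enumerate(features):
        if feature.type != feature_type:
            continue
        for value in feature.qualifiers.get(key, []):
            if value in index:
                raise ValueError("duplicate %s: %s" % (key, value))
            index[value] = position
    return index


features = [
    Feature("gene", {"locus_tag": ["NEQ010"]}),
    Feature("CDS", {"locus_tag": ["NEQ010"]}),
    Feature("CDS", {"locus_tag": ["NEQ011"]}),
]
cds_index = index_features(features)
```

With this, features[cds_index["NEQ010"]] picks out the matching CDS
without scanning the list each time, although the index has to be
rebuilt if the feature list changes.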

Perhaps we could add a "lookup feature" function given say an
annotation key (e.g. "locus_tag") and value (e.g. "NEQ010") plus
perhaps feature type (e.g. "CDS").
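
Something along these lines, say, where the function name
lookup_feature and its arguments are only a suggestion and the
features are simple stand-in objects:

```python
from types import SimpleNamespace


def lookup_feature(features, key, value, feature_type=None):
    """Return the first feature whose qualifier `key` contains `value`.

    If feature_type is given, features of any other type are skipped.
    """
    for feature in features:
        if feature_type is not None and feature.type != feature_type:
            continue
        if value in feature.qualifiers.get(key, []):
            return feature
    raise KeyError("no feature with %s=%s" % (key, value))


# Stand-in features with just a type and a dict of qualifier lists.
features = [
    SimpleNamespace(type="gene", qualifiers={"locus_tag": ["NEQ010"]}),
    SimpleNamespace(type="CDS", qualifiers={"locus_tag": ["NEQ010"]}),
]
hit = lookup_feature(features, "locus_tag", "NEQ010", feature_type="CDS")
```

This is a plain linear scan, so it needs no separate index to keep up
to date, at the cost of being slower for repeated lookups on large
records.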

> I understand that keeping compatibility is paramount for BioPython and I share
> that concern. I also understand that having two classes to do the same job is
> not a nice thing.

I agree.  Especially now that Bio.SeqIO and AlignIO seem to be working
out pretty well and these are pretty tied into the SeqRecord object.

> Nevertheless I have been thinking about these issues and I have
> implemented a non-mutable sequence class with these ideas in mind. I
> plan to use this implementation to write an Alignment class capable of
> dealing with ESTs assemblies.

Dealing nicely with EST assemblies is a valuable goal.

> The biggest difference between this proposal and the code currently in
> BioPython is the LocatableFeature and Location classes. LocatableFeature is
> equivalent to SeqFeature, but while SeqFeature is mostly a struct with no
> methods LocatableFeature has a __getitem__, __len__ and complement.
> Location is inspired by the BioRange BioPerl class.

I personally don't like the current way Biopython stores the location
for SeqFeatures containing sub-features (e.g. anything with a join).
The join-location can only be determined from a combination of the
location of each sub-feature.  However, this standard is currently
implemented and stable, and supported in Biopython's BioSQL wrapper.
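
To illustrate what I mean, here is a simplified sketch of that layout.
Span and Feature are stand-ins written for this example (forward
strand only, half-open coordinates), not the real classes, and
join_parts and extract are hypothetical helpers.

```python
from collections import namedtuple

# Half-open [start, end) span on the forward strand.
Span = namedtuple("Span", ["start", "end"])


class Feature:
    """Stand-in feature: overall span plus optional sub-features."""

    def __init__(self, type, span, sub_features=()):
        self.type = type
        self.span = span
        self.sub_features = list(sub_features)


def join_parts(feature):
    """Recover the pieces of a join by walking the sub-features."""
    if not feature.sub_features:
        return [feature.span]
    return [sub.span for sub in feature.sub_features]


def extract(seq, feature):
    """Concatenate the sequence under each piece, in order."""
    return "".join(seq[part.start:part.end] for part in join_parts(feature))


seq = "AAACCCTTTGGG"
# Roughly a GenBank-style join(1..3,7..9): the parent only records the
# overall extent, while the individual pieces live in the sub-features.
cds = Feature(
    "CDS",
    Span(0, 9),
    sub_features=[Feature("CDS", Span(0, 3)), Feature("CDS", Span(6, 9))],
)
coding = extract(seq, cds)
```

The parent's own span covers positions 0 to 9, including the CCC that
is not part of the join, so any code that wants the true location has
to remember to look at the sub-features instead.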

> I would like to have equivalent functions in BioPython and I'm willing to help
> in adapting the current BioPython classes. I would appreciate hearing
> your suggestions and criticisms about the classes that I'm sending.
> Best regards,

If there are enough people interested in re-working the
Seq/MutableSeq/SeqRecord objects with an API break, we could seriously
discuss this as part of a hypothetical "Biopython 2.0".  Once we move
from CVS to SVN it would also be possible to set up a branch in the
repository to experiment there.  However, I think there is still
plenty of potential for improving things in a backwards compatible
manner (and have opened several enhancement bugs on bugzilla for this).
I would like to try and tackle these before breaking the existing
API.

Peter


