[Biopython-dev] [Biopython] Filtering SeqRecord feature list / nested SeqFeatures

Wed Aug 26 11:36:36 UTC 2009

Hi all,

I've retitled this thread (originally on the main list) to focus on the
more general idea of filtering SeqRecord feature list (as that has
very little to do with SQLAlchemy) and how this interact with
nested SeqFeature objects.

On Wed, Aug 26, 2009, Peter wrote:
> On Wed, Aug 26, 2009, Kyle Ellrott wrote:
>> I've added a new database function lookupFeature to quickly search for
>> sequences features without have to load all of them for any particular
>> sequence.
>> ...
>
> Interesting - and potentially useful if you are interested in just
> part of the genome (e.g. an operon).
>
> Have you tested this on composite features (e.g. a join)?
> Without looking into the details of your code this isn't clear.
>
> I wonder how well this would scale with a big BioSQL database
> ...
>
> On the other hand, if all the record's features have already been
> loaded into memory, there would just be thousands of locations
> to look at - it might be quicker.
>
> This brings me to another idea for how this interface might work,
> via the SeqRecord - how about adding a method like this:
>
> def filtered_features(self, start=None, end=None, type=None):
>
> Note I think it would also be nice to filter on the feature type (e.g.
> CDS or gene). This method would return a sublist of the full
> feature list (i.e. a list of those SeqFeature objects within the
> range given, and of the appropriate type). This could initially
> be implemented with a simple loop, but there would be scope
> for building an index or something more clever.
>
> [Note we are glossing over some potentially ambiguous
> cases with complex composite locations, where the "start"
> and "end" may differ from the "span" of the feature.]
>
> The DBSeqRecord would be able to do the same (just inherit
> the method), but you could try doing this via an SQL query, ...

Brad, it occurred to me this idea (a filtered_features method
on the SeqRecord) might cause trouble with what I believe you
have in mind for parsing GFF files into nested SeqFeatures.
Is that still your plan?

In particular, if you have save a CDS feature within a gene
feature, and the user asked for all the CDS features, simply
scanning the top level features list would miss it.

Would it be safe to assume (or even enforce) that subfeatures
are always *with* the location spanned by the parent feature?
Even with this proviso, a daughter feature may still be small
enough to pass a start/end filter, even if the parent feature
is not. Again, scanning the top level features list would miss
it.

All these issues go away if we continue to treat the SeqRecord
features list as a flat list, and only use the SeqFeature
subfeatures list purely for storing composite locations (i.e.
sub regions of the parent feature - not for true subfeatures).

There are other downsides to using nested SubFeatures,
it will probably require a lot of reworking of the GenBank
output due to how composite features like joins are
currently stored, and I haven't even looked at the BioSQL
side of things. You may have looked at that already
though, so I may just be worrying about nothing.

Peter