[Biopython-dev] [Biopython] Filtering SeqRecord feature list / nested SeqFeatures

Mon Aug 31 12:54:52 UTC 2009

Peter and Kyle;

> I've retitled this thread (originally on the main list) to focus on the
> more general idea of filtering SeqRecord feature list (as that has
> very little to do with SQLAlchemy) and how this interacts with
> nested SeqFeature objects.

Sorry to have missed this thread in real time; I was out of town
last week. Generally, it is great we are focusing on standard
queries and building up APIs to make them more intuitive. Nice.

> Brad, it occurred to me this idea (a filtered_features method
> on the SeqRecord) might cause trouble with what I believe you
> have in mind for parsing GFF files into nested SeqFeatures.
> Is that still your plan?

Yes, that is still the idea, although I haven't dug into it much
since the last time we discussed this. This is the direct translation
of the GFF way of handling multiple transcripts and coding features,
and seems like the intuitive way to handle the problem.

> In particular, if you have saved a CDS feature within a gene
> feature, and the user asked for all the CDS features, simply
> scanning the top level features list would miss it.

I think we'll be okay here. With nesting, everything would still be
stored in the seqfeature table. The seqfeature_relationship table
defines the nesting relationship, but for the sake of queries all of
the features can be treated as a flat list directly related to the
bioentry of interest.
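
As a rough illustration of the flat-query idea, here is a minimal
sketch using sqlite3 from the standard library. The two tables are a
toy, heavily simplified stand-in for seqfeature and
seqfeature_relationship: the column names (type, parent_id, child_id)
and the helper feature_ids_of_type are assumptions made up for this
example and are not the real BioSQL schema.

```python
import sqlite3

# Toy stand-in for the BioSQL tables. Column names here are
# hypothetical and much simpler than the real schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE seqfeature (
        seqfeature_id INTEGER PRIMARY KEY,
        bioentry_id   INTEGER NOT NULL,
        type          TEXT NOT NULL
    );
    CREATE TABLE seqfeature_relationship (
        parent_id INTEGER NOT NULL,
        child_id  INTEGER NOT NULL
    );
    """
)

# Bioentry 10 has a gene (1) with a nested CDS (2) plus a top-level CDS (3).
# Bioentry 11 has an unrelated CDS (4).
conn.executemany(
    "INSERT INTO seqfeature VALUES (?, ?, ?)",
    [(1, 10, "gene"), (2, 10, "CDS"), (3, 10, "CDS"), (4, 11, "CDS")],
)
conn.execute("INSERT INTO seqfeature_relationship VALUES (1, 2)")


def feature_ids_of_type(conn, bioentry_id, feature_type):
    """Return feature IDs of one type for a bioentry, ignoring nesting."""
    rows = conn.execute(
        "SELECT seqfeature_id FROM seqfeature "
        "WHERE bioentry_id = ? AND type = ? ORDER BY seqfeature_id",
        (bioentry_id, feature_type),
    )
    return [row[0] for row in rows]


cds_ids = feature_ids_of_type(conn, 10, "CDS")
```

The query never touches the relationship table, so the nested CDS (2)
and the top-level CDS (3) come back together.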

As a second step, you would need to reconstitute the nested
relationship if that is of interest, but for the query example of
"give me all features of this type in this region" you could return
a simple flat iterator of them.
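
On the in-memory side, the same flat view could be sketched as a
recursive generator. The Feature class below is a hypothetical
stand-in for a nested SeqFeature (its attribute names are assumptions
for this example, not the Biopython API), and iter_features is an
illustrative helper, not an existing method.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Feature:
    """Hypothetical stand-in for a SeqFeature with nested children."""
    type: str
    start: int
    end: int
    sub_features: List["Feature"] = field(default_factory=list)


def iter_features(features, feature_type, start, end) -> Iterator[Feature]:
    """Yield matching features at any nesting depth as a flat stream."""
    for feature in features:
        # Test the feature itself against the type and region filter.
        if feature.type == feature_type and start <= feature.start and feature.end <= end:
            yield feature
        # Always descend: a child can match even when its parent does not.
        yield from iter_features(feature.sub_features, feature_type, start, end)


gene = Feature("gene", 100, 900, [Feature("CDS", 200, 400), Feature("CDS", 500, 800)])
top_level = [gene, Feature("CDS", 1000, 1200)]

hits = [(f.start, f.end) for f in iter_features(top_level, "CDS", 150, 450)]
```

Here the gene spans 100..900 and fails the 150..450 region filter, but
its first CDS is still returned because the walk descends regardless.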

> Would it be safe to assume (or even enforce) that subfeatures
> are always *within* the location spanned by the parent feature?
> Even with this proviso, a daughter feature may still be small
> enough to pass a start/end filter, even if the parent feature
> is not. Again, scanning the top level features list would miss
> it.

The within assumption makes sense to me here. There may be
pathological cases that fall outside of this, but no examples are
coming to mind right now.
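
If we did want to enforce the within assumption, a check along these
lines could flag any pathological cases at load time. This is only a
sketch: Feature is again a hypothetical stand-in for a nested
SeqFeature, and find_escaping_children is a made-up helper name.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feature:
    """Hypothetical stand-in for a SeqFeature with nested children."""
    type: str
    start: int
    end: int
    sub_features: List["Feature"] = field(default_factory=list)


def find_escaping_children(parent: Feature) -> List[Tuple[Feature, Feature]]:
    """Return (parent, child) pairs where the child extends past its parent."""
    bad = []
    for child in parent.sub_features:
        if child.start < parent.start or child.end > parent.end:
            bad.append((parent, child))
        # Check deeper levels against their own immediate parent.
        bad.extend(find_escaping_children(child))
    return bad


gene = Feature("gene", 100, 900, [Feature("CDS", 200, 400), Feature("CDS", 850, 950)])
violations = [(p.type, c.start, c.end) for p, c in find_escaping_children(gene)]
```

In this made-up example the second CDS runs to 950, past the end of
the gene at 900, so it is the only pair reported.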

> There are other downsides to using nested SubFeatures,
> it will probably require a lot of reworking of the GenBank
> output due to how composite features like joins are
> currently stored, and I haven't even looked at the BioSQL
> side of things. You may have looked at that already
> though, so I may just be worrying about nothing.

Agreed. My thought was to prototype this with GFF and then think
further about GenBank features. Initially, I just want to get the
GFF parsing documented and in the Biopython repository, and then the
BioSQL storage would be a logical next step.

Brad