[Biopython] SQL Alchemy based BioSQL

Wed Aug 26 11:10:44 UTC 2009

On Wed, Aug 26, 2009 at 2:01 AM, Kyle Ellrott<kellrott at gmail.com> wrote:
> I've added a new database function lookupFeature to quickly search for
> sequences features without have to load all of them for any particular
> sequence.
> Because it's a non-standard function, I've taken the opportunity to
> play around with some more dynamic search features.
> Once we get the interface for these types of searches locked down on
> lookupFeature, a similar system could be implemented in the standard
> 'lookup' call.

I'm not sure about that - all the other "lookup" functions already in
BioSeqDatabase return DBSeqRecord objects don't they? See
below for an alternative...

> The work is posted at http://github.com/kellrott/biopython

You could have posted this on the dev list, but this is debatable.
If it all gets too technical we should probably move the thread...

> The following is an example of a working search, that pulls all of the
> protein_ids from NC_004663.1 between 60,000 and 70,000 on the positive
> strand.
>
> import sys
> from BioSQL import BioSQLAlchemy as BioSeqDataBase
>
> server = BioSeqDataBase.open_database( driver="mysql", user='test',
> host='localhost', db='testdb' )
> db = server[ 'bacteria' ]
>
> seq = db.lookup( version="NC_004663.1" )
>
> features = db.lookupFeatures( BioSeqDataBase.Column('strand') == 1,
>        BioSeqDataBase.Column('start_pos') < 70000,
>        BioSeqDataBase.Column('end_pos') > 60000,
>        bioentry_id = seq._primary_id, name="protein_id" )
>
> #print len(features)
> for feature in features:
>        print feature
>

Interesting - and potentially useful if you are interested in just
part of the genome (e.g. an operon).

Have you tested this on composite features (e.g. a join)?
Without looking into the details of your code this isn't clear.

I wonder how well this would scale with a big BioSQL database
with hundreds of bioentry rows, and millions of seqfeature
and location rows? You'd have to search all the location rows,
filtering on the seqfeature_id linked to the bioentry_id you
wanted. The performance would depend on the database
server, the database software, how big the database is, and
any indexing etc.

Have you signed up to the BioSQL mailing list yet Kyle? It may
help for discussing things like the SQL indexing.

On the other hand, if all the record's features have already been
loaded into memory, there would just be thousands of locations
to look at - it might be quicker.

This brings me to another idea for how this interface might work,
via the SeqRecord - how about adding a method like this:

def filtered_features(self, start=None, end=None, type=None):

Note I think it would also be nice to filter on the feature type (e.g.
CDS or gene). This method would return a sublist of the full
feature list (i.e. a list of those SeqFeature objects within the
range given, and of the appropriate type). This could initially
be implemented with a simple loop, but there would be scope
for building an index or something more clever.

[Note we are glossing over some potentially ambiguous
cases with complex composite locations, where the "start"
and "end" may differ from the "span" of the feature.]

The DBSeqRecord would be able to do the same (just inherit
the method), but you could try doing this via an SQL query, to
get the database to tell you which seqfeature_ids are wanted,
and then return those (existing) SeqFeature objects. [Note we
should avoid returning new SeqFeature objects, as it could be
very confusing to have multiple SeqFeature instances for the
same feature in the database - as well as wasting memory,
and time to build the new objects.]

Peter