[Biopython-dev] A modification to BioSQL

Mon Jun 22 20:11:02 UTC 2015

All,

I’ve been using the BioSQL schema with Bioperl and would like to start doing the same with Biopython, but there’s a limitation I’d like to fix. Here’s the relevant table in the BioSQL schema, seqfeature:

     Column     |         Type          |                        Modifiers                        | Storage  | Stats target | Description 
----------------+-----------------------+---------------------------------------------------------+----------+--------------+-------------
 seqfeature_id  | integer               | not null default nextval('seqfeature_pk_seq'::regclass) | plain    |              | 
 bioentry_id    | integer               | not null                                                | plain    |              | 
 type_term_id   | integer               | not null                                                | plain    |              | 
 source_term_id | integer               | not null                                                | plain    |              | 
 display_name   | character varying(64) |                                                         | extended |              | 
 rank           | integer               | not null default 0                                      | plain    |              | 

Note the required field, source_term_id. In the work I’ve been doing with Bioperl I’ve been setting this “source term” to different values (e.g. “NCBI”) depending on where the tag/value data in the feature comes from.
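To make the idea concrete, here is a rough sketch of looking up (or creating) a source term by name. It runs against an in-memory SQLite database with deliberately simplified stand-ins for the ontology and term tables, so the column lists are assumptions rather than the full BioSQL definitions, and get_or_create_source_term is a hypothetical helper, not something in Bioperl or Biopython.

```python
import sqlite3

# Simplified stand-ins for the BioSQL ontology and term tables.
# Assumption: the real tables carry more columns and constraints.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ontology (
        ontology_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL UNIQUE
    );
    CREATE TABLE term (
        term_id     INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        ontology_id INTEGER NOT NULL REFERENCES ontology(ontology_id),
        UNIQUE (name, ontology_id)
    );
""")

SOURCE_ONTOLOGY = "SeqFeature Sources"


def get_or_create_source_term(conn, source_name):
    """Return the term_id for source_name, creating rows if needed.

    Hypothetical helper for illustration only.
    """
    # Make sure the ontology that groups all source terms exists.
    conn.execute("INSERT OR IGNORE INTO ontology (name) VALUES (?)",
                 (SOURCE_ONTOLOGY,))
    (ontology_id,) = conn.execute(
        "SELECT ontology_id FROM ontology WHERE name = ?",
        (SOURCE_ONTOLOGY,)).fetchone()
    # Make sure the term itself exists inside that ontology.
    conn.execute("INSERT OR IGNORE INTO term (name, ontology_id) VALUES (?, ?)",
                 (source_name, ontology_id))
    (term_id,) = conn.execute(
        "SELECT term_id FROM term WHERE name = ? AND ontology_id = ?",
        (source_name, ontology_id)).fetchone()
    return term_id


ncbi_id = get_or_create_source_term(conn, "NCBI")
default_id = get_or_create_source_term(conn, "EMBL/GenBank/SwissProt")
```

Each distinct source name gets its own term row, and the resulting id is what would go into seqfeature.source_term_id.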

But here’s the code that makes a persistent feature, from BioSQL/Loader.py:

    def _load_seqfeature_basic(self, feature_type, feature_rank, bioentry_id):
        """Load the first tables of a seqfeature and returns the id (PRIVATE).

        This loads the "key" of the seqfeature (ie. CDS, gene) and
        the basic seqfeature table itself.
        """
        ontology_id = self._get_ontology_id('SeqFeature Keys')
        seqfeature_key_id = self._get_term_id(feature_type,
                                              ontology_id=ontology_id)
        # XXX source is always EMBL/GenBank/SwissProt here; it should depend on
        # the record (how?)
        source_cat_id = self._get_ontology_id('SeqFeature Sources')
        source_term_id = self._get_term_id('EMBL/GenBank/SwissProt',
                                           ontology_id=source_cat_id)

        sql = r"INSERT INTO seqfeature (bioentry_id, type_term_id, " \
              r"source_term_id, rank) VALUES (%s, %s, %s, %s)"
        self.adaptor.execute(sql, (bioentry_id, seqfeature_key_id,
                                   source_term_id, feature_rank + 1))
        seqfeature_id = self.adaptor.last_id('seqfeature')

        return seqfeature_id

This code always sets the source term to “EMBL/GenBank/SwissProt”, and it cannot be set to anything else. A better idea is to have a way to set and get this, e.g. source(), just as you can set the “type” of the feature. One way to do this is to subclass SeqFeature to make DBSeqFeature, just as Seq is subclassed to make DBSeq and SeqRecord is subclassed to make DBSeqRecord in BioSQL/BioSeq.py.
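A minimal sketch of the shape this could take, not a finished design. So that it runs without Biopython installed, SeqFeatureStandIn stands in for Bio.SeqFeature.SeqFeature; DBSeqFeature, its source attribute, and source_term_name are all hypothetical names that do not exist in Biopython today.

```python
DEFAULT_SOURCE = "EMBL/GenBank/SwissProt"


class SeqFeatureStandIn(object):
    """Stand-in for Bio.SeqFeature.SeqFeature (assumption: only 'type' matters here)."""

    def __init__(self, type=""):
        self.type = type


class DBSeqFeature(SeqFeatureStandIn):
    """Hypothetical feature subclass that carries a settable source term."""

    def __init__(self, type="", source=DEFAULT_SOURCE):
        super(DBSeqFeature, self).__init__(type=type)
        self.source = source


def source_term_name(feature):
    """Pick the source term name the loader would store for a feature.

    Features without a 'source' attribute (or with an empty one) fall back
    to the value that is hard-coded today, so existing behaviour is kept.
    """
    return getattr(feature, "source", None) or DEFAULT_SOURCE


ncbi_feature = DBSeqFeature(type="CDS", source="NCBI")
plain_feature = SeqFeatureStandIn(type="gene")
```

With something like this, _load_seqfeature_basic could take the feature (or just the source name) as an extra argument and pass source_term_name(feature) to _get_term_id in place of the literal 'EMBL/GenBank/SwissProt'. Plain SeqFeature objects would load exactly as they do now.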

So I propose to fork, code, and send a pull request for this. What do you think?

Thanks again,

Brian O.