[Biopython-dev] bioperl-db

Thu Jul 12 20:13:48 EDT 2001

As I think I've mentioned before, I'm visiting EBI before going
to BOSC next week.  Part of my hope while here is to get better
integration between biopython and bioperl.  This afternoon (and
evening) I worked on being able to use a MySQL database loaded
from the bioperl-db schema.

Here's some of what it looks like, with comments starting with ">>> #"

josiah> python
Python 2.1a2 (#1, Feb 11 2001, 00:48:26)
[GCC 2.95.3 19991030 (prerelease)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> import BioSeqDatabase
>>> server = BioSeqDatabase.open_database(user = "root", db="minidb")
>>> server.keys()
['genbank']
>>> db = server["genbank"]
>>> # can also do 'display_id' and 'primary_id'
>>> record = db.lookup(accession = "L26462")
>>> # size of the sequence
>>> len(record)
3002
>>> # This is akin to the bioperl way to access the sequence record
>>> seq = record.primary_seq.seq
>>> # Except that biopython's Seq is a first-class object
>>> # In bioperl this is a primary_seq.subseq(0, 1)
>>> seq[0]
'A'
>>> # The first fetches the full string from the database,
>>> # then shows the 1st 5 chars; the second only fetches 5 characters
>>> # from the database
>>> seq.tostring()[:5], seq[:5].tostring()
('ACCTC', 'ACCTC')
>>> # More showing off :)
>>> seq[-1]
'C'
>>> seq.tostring()[-5:], seq[-5:].tostring()
('CTATC', 'CTATC')
>>> # Yep, subslices can be subsliced
>>> seq.tostring()[-5:], seq[-10:][5:].tostring()
('CTATC', 'CTATC')
>>> from Bio import utils
>>> # This shows that the Alphabet is correct
>>> utils.translate_to_stop(record.primary_seq.seq)
Seq('TSYLTPLITPLIVTLWVVSDFLFICIFDCIKRSLVFYLLFPKT', IUPACProtein())
>>> # Added a new 'Species' object for this, based on bioperl's
>>> record.species
<BioSeq.Species instance at 0x817afbc>
>>> str(record.species)
'Eukaryota Metazoa Chordata Craniata Vertebrata Mammalia Eutheria Primates
Catarrhini Hominidae Homo sapiens'
>>> record.species.binomial()
'Homo sapiens'
>>> # Loaded sequence features into the biopython Bio.SeqFeature.SeqFeature
>>> len(record.seq_features)
26
>>> for feature in record.seq_features[:4]:
...     print str(feature)
...
type: source
location: (0..3002)
ref: None:None
strand: 1
qualifiers:
        Key: db_xref, Value: taxon:9606
        Key: haplotype, Value: C4
        Key: note, Value: sequence found in a Melanesian population
        Key: organism, Value: Homo sapiens

type: variation
location: (110..111)
ref: None:None
strand: 1
qualifiers:
        Key: replace, Value: t

type: variation
location: (262..263)
ref: None:None
strand: 1
qualifiers:
        Key: note, Value: Rsa I polymorphism
        Key: replace, Value: t

type: variation
location: (272..273)
ref: None:None
strand: 1
qualifiers:
        Key: replace, Value: c

>>> # Much more is supported - see the source!
>>>

If you are interested, the files are at
  http://www.biopython.org/~dalke/BioSeqDatabase.py
  http://www.biopython.org/~dalke/BioSeq.py

It may be somewhat harder reading than usual because
  - this is prototype code
  - I've not used the bioperl object layer before
  - I've not used the bioperl-db schema before
  - I've not used MySQL before
  - there are almost no comments in the 550+ LOC

I found out some things in the process:

1) The biopython SeqFeature currently must be used like:

   feature = SeqFeature()
   feature.type = "variation"
   ...

I would much rather prefer allowing the values to be set through
the constructor, as in

   feature = SeqFeature(type = "variation", ...)

This is part of my design philosophy, which is to minimize
modification of a data structure after it has been created.

2) is strand part of the feature or the location?

In bioperl-db and in bioperl, the strand information is part
of the Range.  In SeqFeature, the strand is part of the SeqFeature
and not of the FeatureLocation (which is our equivalent to the
Range).

This is a problem when a SeqFeature has several subfeatures.
In the current scheme, the parent SeqFeature keeps track of
a global start/end location for all its children.  It also needs
a strand.  What strand is used if the subchildren are on
different strands?  (Does that possibility make sense?)

In any case, I'm not sure enough about how to use a SeqFeature
that I can't figure out if it's really wrong or not.

3) Related to that, what's the type used when there are subfeatures?
The SeqFeature documentation says in the case of a join( ... )
the subfeatures have a typename of parent.type + "_span"

>     CDS    join(1..10,30..40,50..60)
>
>    The the top level feature would be a CDS from 1 to 60, and the sub
>    features would be of 'CDS_span' type and would be from 1 to 10, 30 to
>    40 and 50 to 60, respectively

but in Genbank._FeatureConsumer it says:
>             # add _join or _order to the name to make the type clear

I don't like either one.  Why does the SeqFeature need a type at all?

4) feature qualifiers shouldn't really be a dictionary

The SeqFeature keeps track of the qualifiers through a dictionary.
That's a problem because if there are two qualifiers with the
same name then there's a collision, and one gets over written.
(For example,

LOCUS       HUM2C18X01   1668 bp    DNA             PRI       24-AUG-1993
DEFINITION  Homo sapiens cytochrome P4502C18 (CYP2C18) gene, 5' flank and
exon
            1.
ACCESSION   L16869
VERSION     L16869.1  GI:291599

has two /citations

     exon            <1270..1437
                     /gene="CYP2C18"
                     /note="transcription start site not determined"
                     /citation=[1]
                     /citation=[3]
                     /number=1
                     /evidence=experimental

)

To get around this problem, the GenBank parser has:

> If there are multiple qualifier keys with the same name we
> would lose some info in the dictionary, so we append a unique
> number to the end of the name in case of conflicts.
> """
> # if we've got a key from before, add it to the dictionary of
> # qualifiers
> if self._cur_qualifier_key:
>     # get a unique name
>     unique_name = self._cur_qualifier_key
>     counter = 1
>     while self._cur_feature.qualifiers.has_key(unique_name):
>         unique_name = self._cur_qualifier_key + str(counter)
>         counter = counter + 1
>
>     self._cur_feature.qualifiers[unique_name] = \
>                                              self._cur_qualifier_value

which means the qualifier key has a number on the end of it.
That's messing with user visible data in a way that I definitely
don't regard as kosher.  (In my new bioperl-db integration code I
still keep workaround but minimize it somewhat by not appending
a 0 the first time around.  That isn't the right solution either.)

One way I've solved this problem before is to make a hybrid dict/list
class, where you can refer to
  obj["key"]
to get the value associated with the named key - a string - or
  obj[4]
to get the value in the 5th position.

This only works if the key can never be a string.

I'm not suggesting it as a solution, only pointing out that other
alternatives exist.

- What's the numbering system of the FeatureLocation?
When the GenBank parser uses it there is no conversion from
GenBank's numbering system
(where [i,j] means i<=position<=j and position == 1 is the first term)
to Python's
(where [i,j] means i<=position<j and position == 0 is the first term)

I strongly insist that all numbers be converted to normal Python
semantics and only translated at I/O boundaries (to/from files, to/from
GUIs, etc.)  Otherwise people will get confused.  Okay, at least *I'll*
get confused.

                    Andrew
                    dalke at acm.org