[Bioperl-l] Database Retrieval

Mon Aug 7 22:44:54 UTC 2006

On 8/7/06 1:53 PM, "Sendu Bala" <bix at sendu.me.uk> wrote:

> Sean Davis wrote:
>> 
>> On 8/7/06 11:51 AM, "Sendu Bala" <bix at sendu.me.uk> wrote:
>> 
>>> Actually, would there be any interest in a bioperl interface to the UCSC
>>> tables? It could probably be done using their DAS server, and end up as
>>> a very easy-to-use alternative to the Ensembl API for limited or one-off
>>> queries.
>> 
>> I haven't used their DAS server in a couple of years--it is probably worth
>> my trying it again at some point.  I have gone the route of maintaining a
>> local ucsc mirror of the database.  There are obvious advantages to doing
>> this, including speed of access, no access limits, and the power and
>> flexibility of SQL.  With the simplicity of MySQL, one can even rsync
>> directory from UCSC's mysql data directory directly, eliminating the need to
>> do a table import step.  I haven't gone so far as to implement a perl-based
>> wrapper, but one could envision doing so for the most commonly-used tables,
>> perhaps using Rose::DB::Object or DBIx::Class as a base with other methods
>> added as needed based on the data contained in the table.
> 
> Yes, using MySQL would probably be better. Any interface would ideally
> use the UCSC MySQL server by default (with enforced X second waits),
> with an option to set the address to your own server.

This is the route I have gone.

> Do you want to go ahead and look into making those classes for accessing
> the common tables? It's in my plan to make various aspects of genomic
> data retrieval a strength of bioperl as opposed to a surprising missing
> link (http://www.bioperl.org/wiki/Getting_Genomic_Sequences); I'll get
> to that in a few weeks but if you lay the ground work or better yet
> complete everything before then that would be great! :)

So, there is a sketch of what things would look like here:

http://watson.nci.nih.gov/~sdavis/Bio-DB-UCSC.tar.gz

You can install it using the usual:

 perl Makefile.PL
 make test
 make install

You will need to have installed:

 Rose::DB::Object

As this is what ORM I am using.  It is very crude and only includes the
refLink and refFlat tables so far, but adding other tables is pretty
straightforward, as you can see from the code.  I would love to hear
comments.  Basically, to use, you can do something like that shown in the
synopsis and output is given below:

NAME
    Bio::DB::UCSC - Access UCSC MySQL tables nicely

SYNOPSIS
      use Bio::DB::UCSC::RefLink::Manager;

      my $reflinks = Bio::DB::UCSC::RefLink::Manager->get_reflinks(
          query => [
                    mrnaAcc => {like => 'NM_00002%'},
                   ],
          );

      foreach my $reflink (@$reflinks) {
          print "Accession:  ",$reflink->mrnaAcc,"\n";
          print "    Gene ID:  ",$reflink->locusLinkId,"\n";
          print "    Locations: \n";
          # and get all locations for each reflink
          # reflink table is related to refflat table
          my $refflats = $reflink->refflats;
          foreach my $refflat (@$refflats) {
              print "    Chrom: ",$refflat->chrom,
                    "    Transcription Start:  ",$refflat->txStart,"\n";
          }
      }

OUTPUT:
Accession:  NM_000020
    Gene ID:  94
    Locations: 
    Chrom: chr12    Transcription Start:  50587468
Accession:  NM_000026
    Gene ID:  158
    Locations: 
    Chrom: chr22    Transcription Start:  39072508
Accession:  NM_000022
    Gene ID:  100
    Locations: 
    Chrom: chr20    Transcription Start:  42681577
Accession:  NM_000027
    Gene ID:  175
    Locations: 
    Chrom: chr4    Transcription Start:  178588917
Accession:  NM_000028
    Gene ID:  178
    Locations: 
    Chrom: chr1    Transcription Start:  100088632
Accession:  NM_000023
    Gene ID:  6442
    Locations: 
    Chrom: chr17    Transcription Start:  45598389
Accession:  NM_000029
    Gene ID:  183
    Locations: 
    Chrom: chr1    Transcription Start:  228904891
Accession:  NM_000025
    Gene ID:  155
    Locations: 
    Chrom: chr8    Transcription Start:  37939672
Accession:  NM_000021
    Gene ID:  5663
    Locations: 
    Chrom: chr14    Transcription Start:  72672931
Accession:  NM_000024
    Gene ID:  154
    Locations: 
    Chrom: chr5    Transcription Start:  148186368