[Bioperl-l] Database Retrieval

Tue Aug 8 13:42:53 UTC 2006

...
> These have relatively clean, well-defined APIs; UCSC does not.  If you
> have access to the UCSC source code, just take a look at joiner.doc to
> see the mess.  Accessing NCBI is quite a different matter than
> accessing UCSC, I think.

Yes, I think every database is different.  The critical thing is to  
get the data flowing first, then worry about getting it into the  
appropriate objects.

So, how would you design a generic interface to access anything in  
UCSC, either remotely or locally (MySQL)?  That would be the start.   
It can be modified from there.

>> If you have the critical backend class made (remote or local access
>> to the database), an interface could be designed similar to
>> Bio::DB::GenBank.
>
> That critical backend is not straightforward, as noted above, but I'll
> think about it more.

> Unlike GenBank, where each "object" is the same, there is no such
> single entity at UCSC, so returning data from UCSC is potentially much
> more complicated, with special cases for refSeq, knownGene, ESTs,
> mRNAs, BACs, SNPs, CpG islands, etc.  All I'm saying is that the design
> of UCSC places some constraints on at least the implementation of the
> interface, if not also on the design of the API.

NCBI has the same issue; dbSNP returns several different formats, but
only XML clusters are recognized in Bioperl.  Taxonomy access also
returns several formats (XML is used in Bioperl).

The key would be to map those special cases so that each returns data
in the format you expect Bioperl to eventually use, normally XML or
text.  There are a few exceptions (EntrezGene uses ASN.1).  You could
also allow an override; EUtilities accepts the parameter 'retmode', so
you can override the return mode specified for the mapped databases.

As an example, here's a small bit from EUtilities in the BEGIN block:

     %DATABASE = ('pubmed'           => 'xml',
                  'protein'          => 'text',
                  'nucleotide'       => 'text',
                  'nuccore'          => 'text',
                  'nucgss'           => 'text',
                  'nucest'           => 'text',
                  'structure'        => 'text',
                  'genome'           => 'text',
                  'books'            => 'xml',
                  'cancerchromosomes'=> 'xml',
                  'cdd'              => 'xml',
                  'domains'          => 'xml',
                  'gene'             => 'asn1',
...
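
To make the mapping-plus-override idea concrete, here is a minimal
sketch.  It is not the actual EUtilities code: resolve_retmode() is a
hypothetical helper name, and the hash holds only three of the entries
from the excerpt above.  It just shows a per-database default that an
explicit 'retmode' argument can override.

```perl
use strict;
use warnings;

# Default return mode per database; a subset of the mapping shown above.
my %DATABASE = (
    pubmed  => 'xml',
    protein => 'text',
    gene    => 'asn1',
);

# Hypothetical helper (an assumption for illustration, not a Bioperl API):
# choose the return mode for a database, letting an explicit 'retmode'
# argument take precedence over the mapped default.
sub resolve_retmode {
    my (%args) = @_;
    my $db = $args{db};
    die "No database given\n" unless defined $db;

    # An explicit override wins, even for databases not in the map.
    return $args{retmode} if defined $args{retmode};

    die "Unknown database '$db'\n" unless exists $DATABASE{$db};
    return $DATABASE{$db};
}

print resolve_retmode(db => 'gene'), "\n";                    # mapped default
print resolve_retmode(db => 'gene', retmode => 'xml'), "\n";  # override
```

A UCSC backend could use the same pattern, with table or track names
(refSeq, knownGene, and so on) as the keys instead of Entrez database
names.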

Chris

> Sean

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign