[Bioperl-l] UCSC database backend

Chris Fields cjfields at uiuc.edu
Wed Aug 9 19:21:57 UTC 2006


...

> Before we get too far down this line of thought, keep in mind that this
> will be dozens of GB of sequence and database tables.  See here for details:
> 
> http://genome.ucsc.edu/admin/mirror.html
> 
> The sequences include all of GenBank, essentially.  The MySQL tables ALONE
> (no sequence) for only ONE human assembly are on the order of 10 GB--not the
> kind of thing you can download in a few minutes (or even hours).  Just to
> keep in mind....

Yes, there was a recent bug related to the packing order for very large
files (>4 GB, I believe).  I'm hoping Lincoln takes a look at it soon for
further suggestions, as the proposed changes would require reindexing
everything.  However, the proposed fix did work well for the submitter.

> On another point, the strength of UCSC is not in obtaining sequence, but in
> mapping to the genome.  I think getting actual sequence should be secondary
> here, if for no other reason than there are trivially easy ways of getting
> sequence information from elsewhere given an accession or ID.  There is
> simply too much information to be stored locally for most people and getting
> the data remotely from UCSC doesn't seem possible currently.
> 
> Sean

Then we could use this primarily to return location and other information
instead.  Anyone interested in sequence can use the location info to
retrieve sequences remotely (via Bio::DB::GenBank or similar) or locally
(Bio::DB::Fasta).

The key is to get this set up in some basic way so that people can start
using it, make suggestions, etc.  Sendu, any suggestions?

Chris




