[Bioperl-l] UCSC database backend

Thu Aug 10 14:21:04 UTC 2006

Sendu,

Sean indicates that the sequences would be held in flatfiles.  The  
trick would be grabbing location information from a particular MySQL  
table, then using that to retrieve the sequence slice from the  
indexed flatfile.

MySQL table --> SeqFeatureI(?) --> Bio::LocationI (Simple/Split/Fuzzy etc.)
  --> sequence slice from indexed file

This would be relatively easy if the MySQL table records which flatfile
is used; I don't know whether it does.  If not, maybe use an .ini file
to map the tables to flatfiles?
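
To make the .ini idea concrete, here is a rough sketch of what such a mapping could look like and how it might be read.  Everything in it is an assumption for illustration: the section/key layout, the table names, the file paths, and the helper name parse_table_map() are all made up, and nothing here comes from UCSC or BioPerl.

```perl
use strict;
use warnings;

# Hypothetical mapping file contents; the layout, table names and paths
# are invented for illustration only.
my $ini_text = <<'INI';
; UCSC table name = flatfile holding the sequence it refers to
[hg18]
knownGene = /data/ucsc/hg18/hg18.fa
refGene   = /data/ucsc/hg18/hg18.fa

[mm8]
refGene   = /data/ucsc/mm8/mm8.fa
INI

# Parse "[section]" and "key = value" lines into $map->{section}{key}.
sub parse_table_map {
    my ($text) = @_;
    my (%map, $section);
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:[;#].*)?$/;    # skip blank and comment lines
        if ($line =~ /^\s*\[([^\]]+)\]\s*$/) {
            $section = $1;                      # start of a new assembly block
        }
        elsif (defined $section
            && $line =~ /^\s*([^=\s]+)\s*=\s*(.*?)\s*$/) {
            $map{$section}{$1} = $2;            # table name => flatfile path
        }
        else {
            die "Unparseable line in mapping: '$line'\n";
        }
    }
    return \%map;
}

my $map = parse_table_map($ini_text);
print "$map->{hg18}{knownGene}\n";
```

The path looked up this way could then be handed to whatever indexed-flatfile handle is in use (Bio::DB::Fasta, for instance) before slicing.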

If you wanted something from GenBank:

MySQL table --> SeqFeatureI(?) --> Bio::LocationI (Simple/Split/Fuzzy etc.)
  --> sequence slice from GenBank file

The GenBank file slice could be retrieved remotely via  
Bio::DB::GenBank if you didn't want a local GenBank installation:

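use Bio::DB::GenBank;

# ask NCBI for FASTA output
my $ncbi = Bio::DB::GenBank->new(-format => 'fasta');
# later, once $id, $start, $end and $strand are known from the table...
$ncbi->seq_start($start);
$ncbi->seq_stop($end);
$ncbi->strand($strand);    # NCBI convention: 1 = plus, 2 = minus
my $seq = $ncbi->get_Seq_by_id($id);    # fetch just that slice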

Bio::DB::Fasta and Bio::DB::GenBank both implement  
Bio::DB::RandomAccessI.  A requirement for sequence retrieval could  
be a DB handle that is-a Bio::DB::RandomAccessI.

Bio::SeqFeatureI's spliced_seq() uses a similar idea: using an
optional DB handle, it pieces together sequence slices based on location
information from a seqfeature.  One possible issue: lack of
correspondence between the local MySQL database and the remote
GenBank database.  This would require the user to automate updating
their local databases once a week or so.

There are a few problems which should be easily worked around:

1)  Bio::DB::Fasta can't handle very large files
(http://bugzilla.open-bio.org/show_bug.cgi?id=2063).  There is a proposed
fix in Bugzilla, but I'm not sure about the idea of dynamically
determining the packing/unpacking (32-bit vs 64-bit) based on file size.

2)  I think sequence coordinates in UCSC start at 0; in BioPerl they
start at 1.  Easy enough to convert, but something to keep in mind.
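
For point 2, the conversion might look something like the sketch below.  It assumes the UCSC tables store a 0-based start and an exclusive end (so only the start needs shifting) and a '+'/'-' strand column; the sub name ucsc_to_bioperl() is made up, and the returned keys are simply chosen to match the -start/-end/-strand arguments that Bio::Location::Simple->new() takes.

```perl
use strict;
use warnings;

# Convert a UCSC-style interval (assumed: 0-based start, exclusive end,
# strand given as '+' or '-') into 1-based inclusive coordinates with a
# numeric strand.  Hypothetical helper, not part of BioPerl.
sub ucsc_to_bioperl {
    my ($start0, $end0, $strand) = @_;
    die "empty or inverted interval: $start0, $end0\n" if $start0 >= $end0;

    my %strand_for = ('+' => 1, '-' => -1);
    return (
        -start  => $start0 + 1,    # shift the start by one
        -end    => $end0,          # exclusive 0-based end == inclusive 1-based end
        -strand => $strand_for{$strand} // 0,    # 0 when the strand is unknown
    );
}

# Example values are invented, not taken from a real table.
my %loc = ucsc_to_bioperl(999, 2000, '-');
printf "%d..%d strand %d\n", @loc{qw(-start -end -strand)};
# prints "1000..2000 strand -1"
```

The length is the same either way (2000 - 999 = 2000 - 1000 + 1 = 1001), which makes a handy sanity check when wiring this in.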

Chris

On Aug 10, 2006, at 2:14 AM, Sendu Bala wrote:

> Sean Davis wrote:
>>
>> Before we get too far down this line of thought, keep in mind that this
>> will be dozens of Gb of sequence and database tables.  See here for
>> details:
>>
>> http://genome.ucsc.edu/admin/mirror.html
>>
>> The sequences include all of genbank, essentially.  The mysql tables
>> ALONE (no sequence) for only ONE human assembly is on the order of
>> 10Gb--not the kind of thing you can download in a few minutes (or even
>> hours).  Just to keep in mind....
>
> I think if someone needs heavy-duty access to genomic data, they'll find
> the disk space. That wouldn't be the problem. The problem would be
> finding an easy way of getting the data, which is where I hoped
> something like a UCSC frontend would come in.
>
>
>> On another point, the strength of UCSC is not in obtaining sequence, but
>> in mapping to the genome.  I think getting actual sequence should be
>> secondary here, if for no other reason than there are trivially easy
>> ways of getting sequence information from elsewhere given an accession
>> or ID.  There is simply too much information to be stored locally for
>> most people and getting the data remotely from UCSC doesn't seem
>> possible currently.
>
> The work would certainly be highly valuable even if it didn't allow for
> sequence retrieval, but from my own point of view my main interest was
> exactly the retrieval of arbitrary bits of genomic sequence - for which
> there is no accession or ID that can be used to query some other
> database.
>
> How does the website table browser frontend allow retrieval of sequence
> data?

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign