[Bioperl-l] UCSC database backend
Chris Fields
cjfields at uiuc.edu
Thu Aug 10 14:21:04 UTC 2006
Sendu,
Sean indicates that the sequences would be held in flatfiles. The
trick would be grabbing location information from a particular MySQL
table, then using that to retrieve the sequence slice from the
indexed flatfile.
MySQL table --> SeqFeatureI(?) --> Bio::LocationI (Simple/Split/Fuzzy etc.)
  --> sequence slice from indexed file
Would be relatively easy if the MySQL table contains information
about which flatfile is used; that I don't know. If not, maybe use
an .ini file to map the tables to flatfiles?
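A rough sketch of what such a mapping could look like, in plain Perl. The section name, table names, and paths below are made up for illustration, and parse_table_map is a hypothetical helper, not anything that exists in BioPerl:

```perl
use strict;
use warnings;

# Hypothetical .ini content: table names and paths are illustrative only.
my $ini_text = <<'END_INI';
[flatfiles]
; MySQL table = indexed flatfile
hg18_refGene = /data/ucsc/hg18/hg18.fa
mm8_refGene  = /data/ucsc/mm8/mm8.fa
END_INI

# Parse "key = value" lines into a hash, skipping blank lines,
# comments (; or #) and [section] headers.
sub parse_table_map {
    my ($text) = @_;
    my %map;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:[;#].*)?$/;   # blank lines and comments
        next if $line =~ /^\s*\[.*\]\s*$/;     # section headers
        if ($line =~ /^\s*([^=\s]+)\s*=\s*(.*?)\s*$/) {
            $map{$1} = $2;
        }
    }
    return \%map;
}

my $map = parse_table_map($ini_text);
print $map->{hg18_refGene}, "\n";
```

The path looked up this way would then be handed to whatever indexed-flatfile handle is in use.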
If you wanted something from GenBank:
MySQL table --> SeqFeatureI(?) --> Bio::LocationI (Simple/Split/Fuzzy etc.)
  --> sequence slice from GenBank file
The GenBank file slice could be retrieved remotely via
Bio::DB::GenBank if you didn't want a local GenBank installation:
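    use Bio::DB::GenBank;

    my $ncbi = Bio::DB::GenBank->new(-format => 'fasta');
    # later, once $start, $end, $strand and $id are known...
    $ncbi->seq_start($start);
    $ncbi->seq_stop($end);
    $ncbi->strand($strand);
    # fetches only the requested slice of the record
    my $seq = $ncbi->get_Seq_by_id($id);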
Bio::DB::Fasta and Bio::DB::GenBank both implement
Bio::DB::RandomAccessI. A requirement for sequence retrieval could
be a DB handle that is-a Bio::DB::RandomAccessI.
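One way that requirement could be enforced, as a minimal sketch in plain Perl. The helper name check_db_handle and the My::MockDB class are made up for illustration; only the interface name Bio::DB::RandomAccessI comes from BioPerl, and the mock stands in for a real handle so the example runs without BioPerl installed:

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical stand-in for a real handle such as Bio::DB::Fasta.
# It only declares the inheritance; it implements nothing.
{
    package My::MockDB;
    our @ISA = ('Bio::DB::RandomAccessI');
    sub new { return bless {}, shift }
}

# Accept any object that is-a Bio::DB::RandomAccessI, reject everything else.
sub check_db_handle {
    my ($db) = @_;
    die "need a DB handle that is-a Bio::DB::RandomAccessI\n"
        unless blessed($db) && $db->isa('Bio::DB::RandomAccessI');
    return $db;
}

my $db = check_db_handle(My::MockDB->new);
print ref($db), " accepted\n";
```

Checking the interface rather than a concrete class would let local (indexed flatfile) and remote (GenBank) handles be swapped freely.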
Bio::SeqFeatureI's spliced_seq() uses a similar idea: using an
optional DB handle, piece together sequence slices based on location
information from a seqfeature. One possible issue: lack of
correspondence between the local MySQL database and the remote
GenBank database. This would require the user to automate updating
their local databases once a week or so.
There are a few problems which should be easily worked around:
1) Bio::DB::Fasta can't handle very large files
(http://bugzilla.open-bio.org/show_bug.cgi?id=2063). There is a proposed fix
in Bugzilla, but I'm not sure about the idea of dynamically
determining the packing/unpacking (32-bit vs 64-bit) based on file size.
2) I think sequences in UCSC start with 0; in bioperl sequences
start with 1. Easy enough, but something to keep in mind.
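For point 2, a minimal sketch of the conversion, assuming the table follows the usual UCSC convention of 0-based, half-open intervals (start included, end excluded) and that the target is a 1-based, closed BioPerl location. The function names are made up for illustration:

```perl
use strict;
use warnings;

# UCSC-style [start, end), 0-based  ->  BioPerl-style [start, end], 1-based.
# Only the start shifts; the exclusive 0-based end equals the inclusive
# 1-based end.
sub ucsc_to_bioperl {
    my ($start, $end) = @_;
    return ($start + 1, $end);
}

# The reverse direction, for writing coordinates back out.
sub bioperl_to_ucsc {
    my ($start, $end) = @_;
    return ($start - 1, $end);
}

# The first 100 bases of a sequence: 0..100 in UCSC terms.
my ($s, $e) = ucsc_to_bioperl(0, 100);
print "$s..$e\n";
```

The length of the feature is the same either way: end - start in UCSC terms, end - start + 1 in BioPerl terms.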
Chris
On Aug 10, 2006, at 2:14 AM, Sendu Bala wrote:
> Sean Davis wrote:
>>
>> Before we get too far down this line of thought, keep in mind that this
>> will be dozens of Gb of sequence and database tables. See here for
>> details:
>>
>> http://genome.ucsc.edu/admin/mirror.html
>>
>> The sequences include all of genbank, essentially. The mysql tables ALONE
>> (no sequence) for only ONE human assembly is on the order of 10Gb--not the
>> kind of thing you can download in a few minutes (or even hours). Just to
>> keep in mind....
>
> I think if someone needs heavy-duty access to genomic data, they'll find
> the discspace. That wouldn't be the problem. The problem would be
> finding an easy way of getting the data, which is where I hoped
> something like a UCSC frontend would come in.
>
>
>> On another point, the strength of UCSC is not in obtaining sequence, but in
>> mapping to the genome. I think getting actual sequence should be secondary
>> here, if for no other reason than there are trivially easy ways of getting
>> sequence information from elsewhere given an accession or ID. There is
>> simply too much information to be stored locally for most people and getting
>> the data remotely from UCSC doesn't seem possible currently.
>
> The work would certainly be highly valuable even if it didn't allow for
> sequence retrieval, but from my own point of view my main interest was
> exactly the retrieval of arbitrary bits of genomic sequence - for which
> there is no accession or ID that can be used to query some other
> database.
>
> How does the website table browser frontend allow retrieval of sequence
> data?
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign