[Bioperl-l] UCSC database backend

Wed Aug 9 18:11:37 UTC 2006

> Chris,
> 
> Once I get CVS access, I will commit what I have done (as long as it
> "works").
> 
> Now for the details.  Keep in mind that for many of the "sequences"
> available from UCSC, there is no actual "sequence" stored in the database;
> rather they are stored in flat files not accessible directly via SQL.
> Therefore, a sequence would be "abstract" in the sense of being a "join
> location" on the chromosome, and even that isn't quite right, as the mRNA
> sequence != genomic alignment sequence.  Also, there are many different
> tables that maintain "sequence" information.  So, implementing RandomAccessI
> is not going to be straightforward and will require some assumptions about
> what will be searched.  In fact, since the same "sequence" can be in many
> different tables, there may need to be a way of specifying where the search
> is done (what table(s)).
> 
> Sean

Sean,

Okay, makes sense.  So, the MySQL database holds the sequence information
(location, etc) and the actual sequences (mRNA, EST, genomic) are in various
flat files.  Seems like this calls for a helper set-up script to index the
appropriate sequence flat files and possibly load the MySQL database table
information.  Bio::DB::Fasta could be used for indexing the sequence files
as it's pretty fast.
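
A rough sketch of the indexing idea, assuming BioPerl is installed and the sequence file is plain FASTA. The helper name fetch_region and the path in the example call are made up for illustration; Bio::DB::Fasta->new and its seq method are the existing Bio::DB::Fasta calls, and they take 1-based, inclusive coordinates.

```perl
use strict;
use warnings;

# Hypothetical helper: open (and, on first use, index) a FASTA file and
# return the subsequence for 1-based, inclusive coordinates.
sub fetch_region {
    my ($fasta_path, $id, $start, $end) = @_;
    require Bio::DB::Fasta;    # loaded on first use; needs BioPerl installed
    my $db = Bio::DB::Fasta->new($fasta_path);
    return $db->seq($id, $start, $end);
}

# Example call (the path and ID are placeholders):
# my $dna = fetch_region('/path/to/chr1.fa', 'chr1', 1001, 2000);
```

The index is built the first time the file is opened and reused afterwards, so a set-up script would mostly just need to open each file once. A real backend would probably keep the $db handle around rather than reopening it for every request.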

So, if I were to retrieve a particular sequence (region of scaffold of
genomic DNA for instance), I would need:

1)  unique ID or name for the sequence
2)  start-end coordinates (in UCSC terms, I suppose; UCSC starts with 0, if
I remember correctly?)
3)  table to retrieve data from
4)  either the location of indexed sequence files or a flat-file db handler
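
On point 2, a small sketch of the coordinate conversion, assuming the usual UCSC table convention (0-based start, end not included) and the BioPerl convention (1-based, both ends included). The function names are hypothetical.

```perl
use strict;
use warnings;

# UCSC table coordinates: 0-based start, exclusive end.
# BioPerl coordinates:    1-based start, inclusive end.
# Only the start shifts; the end value is numerically the same.
sub ucsc_to_bioperl {
    my ($ucsc_start, $ucsc_end) = @_;
    return ($ucsc_start + 1, $ucsc_end);
}

sub bioperl_to_ucsc {
    my ($start, $end) = @_;
    return ($start - 1, $end);
}

# The first 100 bases of a chromosome are 0..100 in a UCSC table.
my ($start, $end) = ucsc_to_bioperl(0, 100);
print "$start-$end\n";    # prints "1-100"
```

Doing this conversion in one place inside the backend would let callers stay in BioPerl coordinates throughout.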

These could all be set upon instantiation for sequence retrieval:

$factory = Bio::DB::UCSC::Sequence->new(-table     => $table,
                                        -seq_start => $start,
                                        -seq_end   => $end,
                                        -db        => $handler);

# returns Bio::PrimarySeq::Fasta via Bio::DB::Fasta DB handler

$seq = $factory->get_Seq_by_id($id);

If you just want the sequence associated with an ID, the location info
(whether it is Simple, Split, Fuzzy, etc) could be used to retrieve the
subsequence from the appropriate flatfile dependent on the table used.

$factory = Bio::DB::UCSC::Sequence->new(-table => $table,
                                        -db    => $handler);

# returns Bio::PrimarySeq::Fasta via Bio::DB::Fasta DB handler

$seq = $factory->get_Seq_by_id($id);  
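
To illustrate how a split location could drive the subsequence retrieval, here is a minimal sketch. It assumes the table row carries comma-separated exonStarts and exonEnds lists in UCSC coordinates (0-based start, exclusive end), as the gene prediction tables do, and that the genomic sequence is already in hand as a string. The function name spliced_seq is hypothetical, and minus-strand features would still need reverse-complementing.

```perl
use strict;
use warnings;

# Join the exon blocks of a feature into one spliced sequence.
# $genomic     : genomic sequence as a plain string
# $exon_starts : e.g. "2,8,"  (0-based starts, trailing comma allowed)
# $exon_ends   : e.g. "4,10," (exclusive ends)
sub spliced_seq {
    my ($genomic, $exon_starts, $exon_ends) = @_;
    my @starts = split /,/, $exon_starts;    # split drops the trailing empty field
    my @ends   = split /,/, $exon_ends;
    die "exon start/end counts differ\n" unless @starts == @ends;

    my $seq = '';
    for my $i (0 .. $#starts) {
        # substr is 0-based, so UCSC starts can be used directly
        $seq .= substr($genomic, $starts[$i], $ends[$i] - $starts[$i]);
    }
    return $seq;
}

print spliced_seq('AACCGGTTAACC', '2,8,', '4,10,'), "\n";    # prints "CCAA"
```

In a real backend the substr calls would be replaced by lookups through the indexed flat-file handle, with the start shifted to 1-based coordinates first.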

Would something like that be appropriate?  Not sure if I'm missing
something.  Sendu may have other suggestions/additions; I'm letting the
coffee talk now.

Chris