[Bioperl-l] TFBS databases, Bio::Matrix::PSM suitable?

Chris Fields cjfields at uiuc.edu
Tue Aug 22 15:12:20 UTC 2006


...
> Yes, Taxonomy being what I'm familiar with I was thinking of doing it
> the same way, especially given that there are so many completely
> different kinds of information you would want to get out of a TFBS
> database. I'll look into how it is 'normally' done if anyone suggests
> that would be better.

How you implement these is completely up to you.  I'll make suggestions, as
will others, but you don't have to follow them.  As long as it works (and is
legal, as Hilmar points out), people will be happy.

...
> It's better because we're talking about a multiple alignment almost
> always with more than 2 sequences, so SimilarityPair would not be
> appropriate...

Yep, my thought as well.  The reason I proposed SimilarityPair was for
using the matrix to scan data, but that's really the purview of
SearchIO.

> > Or have the Bio::DB module set up to grab either your
> > 'instance' sequences by ID (where you could possibly implement
> > RandomAccessI)
> 
> ... though having said that you'd still want access to the individual
> sequences by ID.

My thought was to use the DB module to create a stream of chunks of raw data
based on what you are accessing (sequence, alignment, matrix, etc).  The
stream could be handled by the proper IO class to get a Matrix stream
(Bio::Matrix::PSM::IO::transfac), sequence stream (Bio::SeqIO::*), or
alignment stream (Bio::AlignIO::*), based on implementing RandomAccessI-like
interfaces:

get_Matrix_by_id() # single matrix
get_Matrices_by_id() # stream of matrices
get_Alignment_by_id() # single alignment
get_Alignments_by_id() # stream of alignments
get_Stream_by_id() # stream of sequences

You could have this all encompassed in one specific class, or have separate
'plugin' DB classes implement specific interfaces, so they could perform
focused, specific tasks:

Bio::DB::Transfac # main interface

Plugins:
Bio::DB::Transfac::align # handles alignments
Bio::DB::Transfac::matrix # handles matrices
Bio::DB::Transfac::seq # handles sequences

my $db = Bio::DB::Transfac->new(-format => 'align',
                                -db     => $tfac);

my $alignio = $db->get_Alignments_by_id(\@ids);

while (my $aln = $alignio->next_aln) {
    # ... do work here
}
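
To make the plugin idea concrete, here is a minimal, self-contained sketch
of how the main class could pick a plugin from -format and hand back a
stream.  Everything in it is an assumption for illustration: the
Bio::DB::Transfac::* packages are defined inline as toys, the
Bio::DB::Transfac::Stream class is a made-up stand-in for a real AlignIO
object, and the "database" is an in-memory hash of invented data rather
than anything from TRANSFAC.

```perl
use strict;
use warnings;

# Hypothetical plugin: maps IDs to raw alignment chunks.
# A real version would hand the chunks to an AlignIO parser instead.
package Bio::DB::Transfac::align;

sub new {
    my ($class, %args) = @_;
    return bless { db => $args{-db} }, $class;
}

# Return a stream for the requested IDs; unknown IDs are silently skipped.
sub get_Alignments_by_id {
    my ($self, $ids) = @_;
    my @chunks = map  { $self->{db}{$_} }
                 grep { exists $self->{db}{$_} } @$ids;
    return Bio::DB::Transfac::Stream->new(@chunks);
}

# Made-up stream class with a next_aln() iterator (stand-in for AlignIO).
package Bio::DB::Transfac::Stream;

sub new {
    my ($class, @items) = @_;
    return bless { items => [@items] }, $class;
}

sub next_aln {
    my ($self) = @_;
    return shift @{ $self->{items} };    # undef once the stream is exhausted
}

# Main interface: builds the plugin class name from -format and delegates.
package Bio::DB::Transfac;

sub new {
    my ($class, %args) = @_;
    my $format = $args{-format} or die "-format is required\n";
    die "Invalid format name '$format'\n" unless $format =~ /^\w+$/;
    my $plugin = "Bio::DB::Transfac::$format";
    die "No plugin for format '$format'\n" unless $plugin->can('new');
    return $plugin->new(%args);
}

package main;

# Toy in-memory "database": ID => raw aligned block (invented data).
my $tfac = {
    site1 => "seqA ACGT-A\nseqB ACGTTA\n",
    site2 => "seqC GG-CA\nseqD GGTCA\n",
};

my $db      = Bio::DB::Transfac->new(-format => 'align', -db => $tfac);
my $alignio = $db->get_Alignments_by_id([qw(site1 site2 missing)]);

my $count = 0;
while (my $aln = $alignio->next_aln) {
    $count++;
    print "alignment $count:\n$aln";
}
```

In this sketch new() returns the plugin object itself, so each plugin only
has to provide the methods for its own data type; having the main class
wrap the plugin and forward calls would be an equally reasonable choice.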

The AlignIO plugin would need to be created, with a next_aln() method
implemented initially to handle the data, but that shouldn't be too hard if
the sequences are in an already-aligned format.  The matrix IO is already
there.  I suppose you could even leave out the sequence methods and just
retrieve the individual sequences from the alignment if needed using
SimpleAlign's each_seq() or each_seq_with_id().
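
As a rough illustration of why the parsing side stays small when the
sequences arrive already aligned, here is a sketch that reads one block
and then pulls a single sequence back out by ID.  It assumes an invented
"ID ALIGNED_SEQUENCE" line format (not the real TRANSFAC layout) and uses
plain hashes where a real plugin would build Bio::SimpleAlign objects;
parse_aligned_block() and seqs_with_id() are hypothetical helper names.

```perl
use strict;
use warnings;

# Parse one already-aligned block into a plain hash-based "alignment".
# Assumed toy format: one "ID ALIGNED_SEQUENCE" pair per line.
sub parse_aligned_block {
    my ($block) = @_;
    my (@order, %seq_for, $width);
    for my $line (split /\n/, $block) {
        next unless $line =~ /\S/;                 # skip blank lines
        my ($id, $seq) = split ' ', $line, 2;
        die "Malformed line: '$line'\n" unless defined $seq && length $seq;
        $seq =~ s/\s+$//;                          # drop trailing whitespace
        $width = length $seq unless defined $width;
        die "Row '$id' does not match alignment width $width\n"
            unless length($seq) == $width;         # all rows must be equal length
        push @order, $id;
        $seq_for{$id} = $seq;
    }
    return { ids => \@order, seqs => \%seq_for, length => $width };
}

# Toy analogue of fetching the sequence(s) with a given ID from an alignment.
sub seqs_with_id {
    my ($aln, $id) = @_;
    return exists $aln->{seqs}{$id} ? ($aln->{seqs}{$id}) : ();
}

my $aln   = parse_aligned_block("seqA ACGT-A\nseqB ACGTTA\n");
my ($seq) = seqs_with_id($aln, 'seqB');

print "columns: $aln->{length}\n";
print "seqB: $seq\n";
```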

You could also use the Bio::DB::Taxonomy approach and return the object
directly, without relying on an IO-like class.  It's completely up to you.

BTW, the reason I suggested RandomAccessI is that we are trying to create
comparable interfaces for UCSC to get LocationI, etc; a similar idea could
be used here.  Setting the basic interface up isn't hard if it is based on
the principles used in RandomAccessI.  

...

Chris




