[Bioperl-l] TFBS databases, Bio::Matrix::PSM suitable?

Tue Aug 22 11:39:53 UTC 2006

Sendu Bala wrote:

>I'm looking to extract data from some Transcription Factor Binding Site 
>(TFBS) databases. For example, matrix, sequence and known position 
>information out of Transfac flatfiles.
>
>Currently there is Bio::Matrix::PSM::IO::transfac, but it only gives you 
>the PSM matrices, not the 'instance' sequences. Bio::Matrix::PSM also 
>has this to say:
>
>  
>
Sendu,
Transfac is not an open database, so you cannot get the instance data 
anyway. There was a discussion on that recently. Since Bioperl is a 
completely open project, I am not sure it makes sense to put effort 
into supporting something that is not open. Even if you have access to 
the data files (which I believe Transfac does not allow in general) and 
can develop additional methods/modules, how can the rest of us use, 
debug or support them?
Stefan

>  
>
>>=head1 DESCRIPTION
>>
>>To handle a combination of site matrices and/or their corresponding
>>sequence matches (instances). This object inherits from
>>Bio::Matrix::PSM::SiteMatrix, so you can use the respective
>>methods. It may also hold an array of Bio::Matrix::PSM::InstanceSite
>>objects, but you will have to retrieve these through the
>>Bio::Matrix::PSM::Psm->instances method (see below). To some extent
>>this is an expanded SiteMatrix object, holding data from analyses that
>>also deal with sequence matches of a particular matrix.
>>
>>
>>=head2 DESIGN ISSUES
>>
>>This does not make too much sense to me. I am mixing PSM with PSM
>>sequence matches. Though they are very closely related, I am not
>>satisfied by the way this is implemented here.  Heikki suggested
>>different objects when one has something like meme. But does this mean
>>we have to write different objects for mast, meme, transfac,
>>theiresias, etc.?  To me the best way is to return a SiteMatrix object +
>>an array of InstanceSite objects, and then mast will return undef for
>>SiteMatrix and transfac will return undef for InstanceSite. Probably I
>>cannot see some other design issues that might arise from such an
>>approach, but it seems more straightforward.  Hilmar does not like
>>this because it is an exception from the general BioPerl rules. Should
>>I leave this as an option?  Also the header rightfully belongs to the
>>driver object, and could be retrieved as hashes.  I do not think it
>>can be done any other way, unless we want to create even one more
>>object with very unclear content.
>>    
>>
>
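
To make the convention in the quoted DESIGN ISSUES text concrete, here is a minimal sketch of "return a SiteMatrix plus an array of InstanceSite, with undef for whichever half the format lacks". It uses plain hashes as stand-ins for the real objects. The function name split_result and the record layout are hypothetical and are not part of Bio::Matrix::PSM.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a driver's "next result" call. It returns a
# pair (matrix, instances); either element is undef when the source
# format does not carry that kind of data.
sub split_result {
    my ($record) = @_;
    return ($record->{matrix}, $record->{instances});
}

# transfac-like record: a matrix but no sequence matches (made-up data)
my ($tf_matrix, $tf_instances) =
    split_result({ matrix => { id => 'example_matrix', width => 6 } });

# mast-like record: sequence matches but no matrix (made-up data)
my ($mast_matrix, $mast_instances) =
    split_result({ instances => [ { seq => 'ACGTGA', start => 12 } ] });

# Callers have to check each half before using it.
for my $case ([ 'transfac', $tf_matrix,   $tf_instances ],
              [ 'mast',     $mast_matrix, $mast_instances ]) {
    my ($name, $matrix, $instances) = @$case;
    printf "%s: matrix=%s instances=%s\n",
        $name,
        defined $matrix    ? 'present'           : 'undef',
        defined $instances ? scalar(@$instances) : 'undef';
}
```

The cost of this convention is visible in the loop: every caller needs a defined() check on both halves, which is probably part of why it counts as an exception to the usual BioPerl style.
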
>I actually want to get even more kinds of data out, so rather than 
>extend Bio::Matrix::PSM::IO::transfac and related modules in some way, 
>would it be more appropriate to have something like 
>Bio::DB::TFBS::transfac which had a number of methods that gave specific 
>kinds of objects? We could have get_psm() which gives a normal 'pure' 
>Bio::Matrix::PSM with no InstanceSite objects, get_aln() which returns a 
>Bio::SimpleAlign for the 'instance' sequences that were used to generate 
>a given PSM, and get_map() which returns a new special kind of Bio::Map 
>with binding site position information.
>
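
A rough sketch of the shape of the interface proposed above, with one accessor per kind of data. The method names get_psm, get_aln and get_map come from the proposal. The package name My::TFBS::Sketch, the constructor argument and the plain-hash return values are assumptions made for illustration only. A real module would presumably return Bio::Matrix::PSM, Bio::SimpleAlign and Bio::Map objects.

```perl
use strict;
use warnings;

package My::TFBS::Sketch;   # hypothetical stand-in for the proposed DB module

sub new {
    my ($class, %args) = @_;
    # 'records' maps an accession to a hash with matrix, sites and
    # positions entries; all of it is made-up sample data here.
    return bless { records => $args{records} || {} }, $class;
}

# Shared lookup: returns undef (or an empty list) for an unknown accession.
sub _field {
    my ($self, $accession, $field) = @_;
    my $record = $self->{records}{$accession} or return;
    return $record->{$field};
}

sub get_psm { my ($self, $acc) = @_; return $self->_field($acc, 'matrix') }
sub get_aln { my ($self, $acc) = @_; return $self->_field($acc, 'sites') }
sub get_map { my ($self, $acc) = @_; return $self->_field($acc, 'positions') }

package main;

my $db = My::TFBS::Sketch->new(
    records => {
        X0001 => {
            matrix    => { width => 4 },
            sites     => [ 'ACGT', 'ACGA' ],
            positions => [ { start => 10, end => 13 } ],
        },
    },
);

my $aln = $db->get_aln('X0001');
print "instance sequences: @$aln\n";
```

Each accessor hands back one "pure" kind of data, so the matrix object no longer needs to carry its instance sequences.
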
>Another way it makes a little more sense for this to be a 'DB' module 
>and not an IO one is that there are multiple huge Transfac data files in 
>the database, with related and cross-referenced information. To extract 
>the complete information you would want to parse them all and create 
>indexes for fast lookups later, not something you really expect of an IO 
>module.
>
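
The "parse once, index, look up quickly later" idea above can be sketched with a byte-offset index over a flat file. This sketch assumes records that start with an AC accession line and end with a "//" line. The sample data is made up, and build_index and fetch_record are hypothetical helper names, not existing BioPerl functions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Made-up sample data in an assumed flat-file layout: two-letter line
# codes, records terminated by "//". Not real Transfac content.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "AC  X0001\nID  example_one\n//\nAC  X0002\nID  example_two\n//\n";
close $fh;

# Pass 1: remember the byte offset where each record starts, keyed by accession.
sub build_index {
    my ($file) = @_;
    my %offset_for;
    open my $in, '<', $file or die "Cannot open $file: $!";
    my $record_start = 0;
    while (my $line = <$in>) {
        if ($line =~ /^AC\s+(\S+)/) {
            $offset_for{$1} = $record_start;
        }
        elsif ($line =~ m{^//}) {
            $record_start = tell($in);   # next record begins after the terminator
        }
    }
    close $in;
    return \%offset_for;
}

# Later lookups seek straight to the record instead of re-reading the file.
sub fetch_record {
    my ($file, $index, $accession) = @_;
    my $offset = $index->{$accession};
    return unless defined $offset;
    open my $in, '<', $file or die "Cannot open $file: $!";
    seek($in, $offset, 0);
    my $record = '';
    while (my $line = <$in>) {
        last if $line =~ m{^//};
        $record .= $line;
    }
    close $in;
    return $record;
}

my $index = build_index($path);
print fetch_record($path, $index, 'X0002');
```

A real DB module would also persist the index between runs and resolve cross-references between files. Both are left out here.
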
>
>Thoughts anyone?