[Bioperl-l] TFBS databases, Bio::Matrix::PSM suitable?

Tue Aug 22 14:27:33 UTC 2006

>===== Original Message From Sendu Bala <bix at sendu.me.uk> =====
>Stefan Kirov wrote:
>> Sendu Bala wrote:
>>
>>> I'm looking to extract data from some Transcription Factor Binding
>>> Site (TFBS) databases. For example, matrix, sequence and known
>>> position information out of Transfac flatfiles.
>>>
>>> Currently there is Bio::Matrix::PSM::IO::transfac, but it only gives
>>> you the PSM matrices, not the 'instance' sequences. Bio::Matrix::PSM
>>> also has this to say:
>>
>> Transfac is not an open database so, you cannot get the instance data
>> anyway.
>
>You can. It is in the sites.dat file and often in the matrix.dat file.
>It is also available freely and publicly via at least 2 websites.
>
Could you please post the URLs? Last time I checked, Transfac was specifically
forbidding people from providing the data files. This may have changed.

>
>> There was a discussion on that recently. Since Bioperl is a
>> completely open project, I am not sure it makes sense to put effort
>> into supporting something that is not open, even if you have access to
>> the data files (which I believe Transfac does not allow in general)
>
>It does allow it; you just have to pay for fast access to the latest
>data. Or you can use older data for free via the web. A Bio::DB module
>could provide access to either.
>
>
> > how the rest of us can use it or debug/support it?
>
>It may be possible to include a small example subset of the data in
>t/data; there is after all already t/data/transfac.dat (which is a small
>matrix.dat file).
The test files are good only if there is access to the full data set. By their
nature, test files can cover only a representative set of scenarios to check
that the installation is valid. They can in no way check that the code is in
sync with the full data set.
>
>
>In any case, I don't see that your argument is valid. Why should bioperl
>be restricted to only dealing with 'open' data sources? 
As I said, because most developers will not be able to take care of the
module.

>If someone is
>willing to develop and maintain a module that deals with a data source,
>it makes no difference if that source is open or not - it is useful
>either way to other people who also have access to that data. If there
>comes a time that the maintainer can no longer maintain it and it stops
>working because the data format changes, and no one knows the new
>format, it can be deprecated.

In an ideal world this may work. Imagine a situation where the code is out of
sync with the data format and no one is really able to check that. Then a user
with access to the data source would get burned by trying to use the bioperl
module. A natural reaction is then to blame bioperl (and probably a correct
one too).

>
>Is there some 'popularity' threshold that must be passed before it is
>'worth' adding a database module to Bioperl? Why should there be one?
>The cost of having one is a few kb in disc storage space, the benefit
>extremely large to the person who might want to use it.
The cost is usually much larger, both in support and maintenance.
>There may be an
>argument that core shouldn't become cluttered with too much stuff that
>the majority of people won't use, but how is that line drawn? 
This is not the point. The core should not get cluttered with code that is not
maintained. In general, more widely used modules are better maintained, but the
real disaster would be a poorly maintained module with a large audience.
>I don't
>personally use the majority of bioperl modules, but I don't think they
>should all be removed. And clearly the idea of having PWM, transfac
>related modules in bioperl has been deemed acceptable in the past, or we
>wouldn't have Bio::Matrix::PSM::transfac.
Actually, Bio::Matrix::PSM::transfac should have been deprecated, in my opinion.
I stated that more than a year ago.
I agree that, in general, a transfac module is necessary and useful (this is
why I started developing one in the first place), but I doubt it is reasonable
to support one without access to the underlying data structure.