[Bioperl-l] TFBS databases, Bio::Matrix::PSM suitable?

Tue Aug 22 14:58:33 UTC 2006

skirov wrote:
>> Stefan Kirov wrote:
>>> Sendu Bala wrote:
>>> 
>>> Transfac is not an open database, so you cannot get the instance
>>> data anyway.
>> 
>> You can. It is in the sites.dat file and often in the matrix.dat
>> file. It is also available freely and publicly via at least 2
>> websites.
>> 
> Could you please post the urls? Last time I checked Transfac was
> specifically forbidding people from providing the data files. This
> may have changed.

I meant the 'instance' data is freely available, not the .dat files
specifically.

http://www.gene-regulation.com/pub/databases.html#transfac (free
registration required)

http://www.cbil.upenn.edu/cgi-bin/tess/tess

>>> how the rest of us can use it or debug/support it?
>> 
>> It may be possible to include a small example subset of the data in
>> t/data; there is after all already t/data/transfac.dat (which is a
>> small matrix.dat file).
> 
> The test files are good only if there is access to the full data set.
> By their nature, test files can cover only a representative set of
> scenarios to check the validity of the installation; this in no way
> could be a check for synchronization between the full data set and
> the code.

I'm not sure what you mean. Do you think that before a genbank parser
can be released, all genbank files in existence must be supplied in the
test suite to ensure it really does work on everyone's machine? The test
data need only be representative, and if it isn't good enough and a user
discovers a problem, a bug is reported and fixed as normal.

>> If someone is willing to develop and maintain a module that deals
>> with a data source, it makes no difference if that source is open
>> or not - it is useful either way to other people who also have
>> access to that data. If there comes a time that the maintainer can
>> no longer maintain it and it stops working because the data format
>> changes, and no one knows the new format, it can be deprecated.
> 
> In an ideal world this may work. Imagine a situation where the code is
> out of sync with the data format and no one is really able to check
> that. Then a user with access to the data source would get burned by
> trying to use the bioperl module. A natural reaction is then to blame
> bioperl (and probably a correct one too).

Well, yes, of course. This is the problem faced by 100% of the parsers
in bioperl. They work until the file format changes, and then hopefully
there is someone around who will fix the problem.

I don't see that the fear that it may not work in the future is a reason
not to want it at all. Everything in bioperl may not work in the future.

> The cost is usually much larger, both in support and maintenance.

That cost is borne by the developer that chooses to maintain the module.
That would be me in this case, and it isn't a problem for me.

> This is not the point. The core should not get cluttered with code
> that is not maintained. In general more widely used modules are
> better maintained, but the real disaster would be a poorly
> maintained module with a large audience.

Who says it won't be maintained? I will maintain it. The very second I
can no longer maintain it and no one else can, it can be deprecated to
avoid clutter. I don't see the problem. But in any case see below - 
anyone could probably maintain it.

> I agree that a transfac module is necessary and useful in general
> (this is why I started developing one in the first place), but I
> doubt it is reasonable to support one without access to the
> underlying data structure.

I have access to the pro data files. Everyone has access to 
http://www.biobase-international.com/pages/index.php?id=117 which I 
think documents changes since the last version (in this case, there were 
no changes to the data format since 10.1). Everyone has access to the 
websites.