[Biopython-dev] Parsing TRANSFAC matrices with Bio.Motif

Tue Aug 7 07:18:43 UTC 2012

Hi Michiel,

On Tue, Aug 7, 2012 at 8:40 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
> Dear all,
>
> Currently Bio.Motif has some support for writing TRANSFAC files but not for reading TRANSFAC files. I would like to add such a parser to Bio.Motif. Do you all agree that it fits in this module? Note that the TRANSFAC files very much look like EMBL files, and therefore contain much more information than what is currently in a Bio.Motif._Motif.Motif object (the object to be generated by Bio.Motif.read(handle, "transfac")). Perhaps the easiest is to add an attribute .annotations to Bio.Motif._Motif.Motif objects, and use it as a dictionary to store the EMBL-like annotations under their 2-letter keys.
>
That would certainly be a valuable addition. I didn't add it as a
format because it might get a bit confusing for users. TRANSFAC
itself (trademarked, afaik) is distributed by the BIObase company and
is not available unless you pay them for a license (you have to
register even for the "publicly available" version, which comes with
a license too). If you do, you get access to a number of
interconnected datasets, including information about what they call
"matrices", "sites", "transcription factors" and "classes". If we
want to support their file types, we should decide whether to support
the matrix file only or the other ones as well. The confusing part is
that many programs use "transfac-like" formats, i.e. files very
similar to the part of the "matrix" file that corresponds to the PWM
itself (for example, see
http://www.benoslab.pitt.edu/stamp/help.html).
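
To make the "transfac-like" idea concrete, here is a minimal sketch of
reading one such record in plain Python. It assumes a simplified
layout: two-letter keys at the start of each line, a "P0" (or "PO")
header naming the letters, numbered count rows that may end in a
consensus letter, "XX" separators and a closing "//". The function
name parse_transfac_like and the sample record are made up for
illustration; this is not existing Bio.Motif code, and real TRANSFAC
files contain more than this handles.

```python
def parse_transfac_like(lines):
    """Parse one TRANSFAC-like matrix record (hypothetical helper).

    Returns (annotations, counts): annotations maps each two-letter key
    to a list of values, counts maps each letter to a list of numbers.
    """
    annotations = {}
    counts = {}
    letters = None
    for line in lines:
        line = line.rstrip()
        if not line or line.startswith("XX"):
            continue  # blank lines and XX separators carry no data
        if line.startswith("//"):
            break  # end of record
        key, _, value = line.partition(" ")
        value = value.strip()
        if key in ("P0", "PO"):
            # Header row of the matrix: names the columns (letters).
            letters = value.split()
            counts = {letter: [] for letter in letters}
        elif letters is not None and key.isdigit():
            # Numbered matrix row; zip() drops a trailing consensus letter.
            for letter, field in zip(letters, value.split()):
                counts[letter].append(float(field))
        else:
            # Everything else is kept as an EMBL-like annotation.
            annotations.setdefault(key, []).append(value)
    return annotations, counts


# Made-up sample record, assumed layout only.
sample = """ID  M00001
NA  example
XX
P0      A      C      G      T
01      1      2      2      0      S
02      2      1      2      0      R
XX
//
""".splitlines()

annotations, counts = parse_transfac_like(sample)
```

Keeping the annotations in a dictionary under their two-letter keys
follows the suggestion in the quoted message; whether that dictionary
belongs on the motif itself or on a separate record object is a
separate design question.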

> On a related note, currently Bio.Motif._Motif.Motif objects also perform functions that are more appropriate for a separate PWM (position-weight matrix) class within Bio.Motif. It may be a good idea to have a separate PWM class for this functionality.

Currently, the Bio.Motif.Motif class represents something
sequence-like. It can be seen either as a set of instances
(.add_instance(), .search_instance()) or as a PWM (.log_odds(),
.search_pwm(), etc.). It can hold some annotation (a name, etc.), but
in my mind it is the core of the functionality for "motif" analysis.
I can imagine other types of motifs (we discussed regexp- or
HMM-based motifs) that could subclass Motif, and I think this should
be the role of the Motif class. As for the annotations, I would
rather vote for something more similar to SeqRecord and Seq, where a
new class (MotifRecord?) would hold all the annotation data from
TRANSFAC or some such DB, and the Motif would remain more
sequence-like. With respect to moving the PWM-related functionality
to a separate class, I'm not sure. I think it is valuable to be able
to load instances from a file and then convert them to a PWM. It
could be done with separate classes, but I'm not sure it would be
easier then...
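
As a rough illustration of the SeqRecord/Seq-style split, the sketch
below keeps a sequence-like motif that can turn its instances into a
count matrix, and wraps it in a record that only carries annotations.
Both class names (SimpleMotif, MotifRecord) and all attributes are
hypothetical; nothing here is existing Bio.Motif code, and the count
matrix stands in for a full PWM (no pseudocounts or log-odds).

```python
from collections import Counter


class SimpleMotif:
    """Sequence-like core: a set of equal-length instances (hypothetical)."""

    def __init__(self, alphabet="ACGT"):
        self.alphabet = alphabet
        self.instances = []

    def add_instance(self, instance):
        if self.instances and len(instance) != len(self.instances[0]):
            raise ValueError("all instances must have the same length")
        self.instances.append(instance.upper())

    def counts(self):
        """Return per-letter count columns computed from the instances."""
        matrix = {letter: [] for letter in self.alphabet}
        for column in zip(*self.instances):
            tally = Counter(column)
            for letter in self.alphabet:
                matrix[letter].append(tally[letter])
        return matrix


class MotifRecord:
    """Annotation wrapper, analogous to SeqRecord around Seq (hypothetical)."""

    def __init__(self, motif, annotations=None):
        self.motif = motif
        self.annotations = dict(annotations or {})


motif = SimpleMotif()
for site in ("ACGT", "ACGA", "TCGT"):
    motif.add_instance(site)

# Database-style annotations live on the record, not on the motif.
record = MotifRecord(motif, {"ID": "M00001", "NA": "example"})
count_matrix = record.motif.counts()
```

With this layout the motif stays usable on its own (instances in,
counts out), while a TRANSFAC-style parser would return the record.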

best
Bartek
-- 
Bartek Wilczynski