[Bioperl-l] Position scoring matrix objects
Stefan Kirov
skirov at utk.edu
Fri Jul 25 14:27:06 EDT 2003
I am doing some research on cis-regulatory sites. I have looked through
the bioperl mailing list and the module documentation, and it seems to me
that there are no objects well suited to this task (holding and working
with position scoring matrices and their occurrences). So I wrote some
motif-related Perl modules that might (or might not) be of general
interest, and I would like to hear what people on the mailing list think
about them. I would also be happy to get any suggestions and criticism. By
the way, what I have done so far works only for DNA.
Here are the classes I have designed. There is no abstract interface at
the moment. If people consider this important I can change it. I am
documenting these modules and I am trying to follow the BioPerl structure.
SiteMatrix.
Synopsis: holds a position scoring matrix description and provides
methods to extract different information from this object.
Methods:
new: constructs the object from a position scoring matrix. Takes its
arguments as a hash; the individual vectors can be supplied either as
strings or as arrays
iupac- return IUPAC compliant consensus as a string
score- Returns the score as a real number
IC- information content. Returns a real number
id- identifier. Returns a string
accession- accession number. Returns a string
seq- returns the simple consensus sequence (the highest-probability base
at each position, or N if that probability is too low)
next_pos- returns, for the current position, the probability of each
letter, the IUPAC symbol, the IUPAC probability and the simple consensus
letter. Rewinds at the end. Returns a hash.
pos- current position get/set. Returns an integer.
regexp- construct a regular expression based on IUPAC consensus. For
example AGWV will be [Aa][Gg][AaTt][AaCcGg]
width- width of the matrix (number of positions). Returns an integer.
get_string- gets the probability vector for a single base as a string.
Throws an exception if the argument is not in {A,C,G,T}.
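To make the regexp method more concrete, here is a minimal sketch of how an IUPAC consensus could be expanded into such a pattern. It is not the actual SiteMatrix code: the helper name iupac_to_regexp and the %IUPAC table are only illustrative, and the table is the standard IUPAC nucleotide code.

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
my %IUPAC = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',
    R => 'AG',  Y => 'CT',  S => 'CG',  W => 'AT',
    K => 'GT',  M => 'AC',  B => 'CGT', D => 'AGT',
    H => 'ACT', V => 'ACG', N => 'ACGT',
);

# Hypothetical helper (not part of SiteMatrix): expand a consensus into
# one character class per position, accepting upper and lower case.
sub iupac_to_regexp {
    my ($consensus) = @_;
    my $pattern = '';
    for my $symbol (split //, uc $consensus) {
        my $bases = $IUPAC{$symbol}
            or die "Unknown IUPAC symbol '$symbol'\n";
        my $class = '';
        $class .= $_ . lc($_) for split //, $bases;
        $pattern .= "[$class]";
    }
    return $pattern;
}

my $pattern = iupac_to_regexp('AGWV');
print "$pattern\n";    # prints [Aa][Gg][AaTt][AaCcGg]
```

Building explicit character classes instead of relying on the /i modifier keeps the pattern usable as a plain string, which is how the regexp method is described above.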
When creating the object, the constructor checks for matrix entries that
equal 0. If one is found, it increases the count for all positions by
one and recalculates the frequencies. Potential bug: if you supply
frequencies rather than counts and one of the entries is 0, the values
will change significantly. However, you should never have a frequency
that equals 0.
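The zero handling can be sketched roughly as follows. This is only an illustration under my own assumptions: the helper name counts_to_frequencies is made up, the input is a hash of count arrays, and "all positions" is read as "every cell of the matrix".

```perl
use strict;
use warnings;

# Hypothetical helper: $counts is { A => [...], C => [...], G => [...], T => [...] }.
# If any cell is 0, add a pseudocount of 1 to every cell (assumed reading
# of the behaviour described above), then turn counts into frequencies.
sub counts_to_frequencies {
    my ($counts) = @_;
    my @bases = qw(A C G T);
    my $width = scalar @{ $counts->{A} };

    # Look for at least one zero cell anywhere in the matrix.
    my $has_zero = 0;
    for my $base (@bases) {
        $has_zero = 1 if grep { $_ == 0 } @{ $counts->{$base} };
    }
    my $pseudo = $has_zero ? 1 : 0;

    # Normalise each position (column) so its four values sum to 1.
    my %freq;
    for my $pos (0 .. $width - 1) {
        my $total = 0;
        $total += $counts->{$_}[$pos] + $pseudo for @bases;
        $freq{$_}[$pos] = ($counts->{$_}[$pos] + $pseudo) / $total for @bases;
    }
    return \%freq;
}

my $freq = counts_to_frequencies({
    A => [8, 0], C => [1, 5], G => [1, 5], T => [0, 0],
});
for my $base (qw(A C G T)) {
    print "$base: @{ $freq->{$base} }\n";
}
```

With real counts the shift is small, but if the input already holds frequencies below 1, adding 1 to every cell flattens each position towards 0.25, which is the potential bug mentioned above.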
Throws an exception if:
You mix arrays and strings as input (for example, the A vector is given
as an array and C as a string).
The position vector is (0,0,0,0).
One of the probability vectors is shorter than the rest.
The probabilities for A, C, G and T do not add up to 1 when you use
strings as input vectors.
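For illustration, the first three checks could look something like the sketch below. The function name check_vectors and the error messages are hypothetical, not what the constructor really uses. The sum-to-1 check is left out because it depends on how the string values are scaled.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical checker, not the actual SiteMatrix constructor.
# Takes A, C, G and T vectors, each an array reference or a string.
sub check_vectors {
    my (%vec) = @_;
    my @bases = qw(A C G T);

    # All four vectors must be of the same kind.
    my $arrays = grep { ref($vec{$_}) eq 'ARRAY' } @bases;
    die "Mixed array and string input\n"
        if $arrays != 0 && $arrays != @bases;

    # Work on arrays from here on; a string gives one character per position.
    my %col;
    for my $base (@bases) {
        $col{$base} = $arrays ? [ @{ $vec{$base} } ] : [ split //, $vec{$base} ];
    }

    # All four vectors must have the same length.
    my $width = @{ $col{A} };
    for my $base (@bases) {
        die "Vector for $base differs in length from A\n"
            if @{ $col{$base} } != $width;
    }

    # No position may be (0,0,0,0).
    for my $pos (0 .. $width - 1) {
        my $nonzero = grep {
            my $v = $col{$_}[$pos];
            !(looks_like_number($v) && $v == 0)
        } @bases;
        die "Position " . ($pos + 1) . " is (0,0,0,0)\n" unless $nonzero;
    }
    return 1;
}

print "ok\n" if check_vectors(A => [8, 0], C => [1, 5], G => [1, 5], T => [0, 0]);

# Mixing an array (A, G, T) with a string (C) is rejected.
eval { check_vectors(A => [1, 0], C => '10', G => [0, 1], T => [0, 0]) };
print "Rejected: $@" if $@;
```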
Examples:
A probability vector as a string can be "8913a09", where a is actually
10. This is done merely for compatibility with meme and transfac.
my ($a,$c,$g,$t,$score,$ic,$mid,$seq)=@_; # $a..$t: either arrayrefs or strings
my %param=(pA=>$a,pC=>$c,pG=>$g,pT=>$t,IC=>$ic,e_val=>$score,id=>$mid);
my $site=SiteMatrix->new(%param);
my $regexp=$site->regexp;
# Count the non-overlapping matches of the motif in the sequence
my $count=()=$seq=~/$regexp/g;
print "Motif $mid is present $count times in this sequence\n";
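As a small illustration of the string form mentioned above, this is roughly how a string such as "8913a09" could be decoded into one value per position. The helper name decode_vector is hypothetical. Whether the decoded values should then be read as tenths (so that a means a probability of 1.0) is my guess from the example and is not done in the code.

```perl
use strict;
use warnings;

# Hypothetical helper: one character per position, digits 0-9 stand for
# themselves and 'a' stands for 10.
sub decode_vector {
    my ($string) = @_;
    my @values;
    for my $char (split //, $string) {
        if    ($char =~ /^[0-9]$/) { push @values, $char + 0 }
        elsif (lc($char) eq 'a')   { push @values, 10 }
        else  { die "Unexpected character '$char' in vector '$string'\n" }
    }
    return @values;
}

my @values = decode_vector('8913a09');
print "@values\n";    # prints 8 9 1 3 10 0 9
```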
Parsers that return SiteMatrix objects:
Meme (the one distributed with bioperl does not work, and I was unable
to get answers from the list or the developer)
new(file)- associates the object with a meme file. Throws exception if
the file is HTML format.
parse_next- returns the next motif in the file as a SiteMatrix object
Transfac
The methods are pretty much the same, but SiteMatrix object might have
empty fields- for example transfac entry will not contain score and
information content:
new
At the moment the parsers are implemented as two separate classes. This
should probably change so that they follow the same interface. There is
also no rigorous check for format violations.
InstanceSite holds information about an instance of a matrix A in the
sequence B.
Methods:
new: creates object from a hash
id- sequence id
mid- motif id
sequence
relpos- position relative to the transcription start site, usually
negative. Will be calculated if the sequence length and position are
supplied
matrix- gets/sets the SiteMatrix associated with this instance
diff- gets the number of mismatches between the instance sequence and
the regexp of the SiteMatrix
Example:
my %input=(score=>$score,start=>$pos,motif=>$id,seqid=>$llid,seq=>$sequence);
my $instance=InstanceSite->new(%input);
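To show what relpos and diff are meant to compute, here is a rough sketch. Both helper names are hypothetical. For relpos I assume the sequence is an upstream region whose last base lies immediately before the transcription start site, and that start is 1-based. For diff I assume the pattern consists only of bracketed character classes, one per position, like the one returned by regexp.

```perl
use strict;
use warnings;

# Hypothetical helper: position relative to the TSS, assuming the sequence
# ends just before the TSS and $start is 1-based. The last base is -1.
sub relative_position {
    my ($start, $seq_length) = @_;
    return $start - $seq_length - 1;
}

# Hypothetical helper: count positions where the instance base is not in
# the corresponding character class of the pattern.
sub count_mismatches {
    my ($pattern, $instance) = @_;
    my @classes = $pattern =~ /\[([^\]]+)\]/g;
    die "Instance and pattern differ in length\n"
        if @classes != length($instance);
    my $mismatches = 0;
    for my $i (0 .. $#classes) {
        my $base = substr($instance, $i, 1);
        $mismatches++ if index($classes[$i], $base) < 0;
    }
    return $mismatches;
}

print relative_position(951, 1000), "\n";                          # prints -50
print count_mismatches('[Aa][Gg][AaTt][AaCcGg]', 'AGCT'), "\n";    # prints 2
```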
The Mast parser will return an array of InstanceSite objects. Very rudimentary.
--
Stefan Kirov, Ph.D.
University of Tennessee/Oak Ridge National Laboratory
1060 Commerce Park, Oak Ridge
TN 37830-8026
USA
tel +865 576 5120
fax +865 241 1965
e-mail: skirov at utk.edu
sao at ornl.gov