[BioRuby-cvs] bioruby/lib/bio/util contingency_table.rb,1.2,1.3
Katayama Toshiaki
k at pub.open-bio.org
Mon Feb 27 13:03:16 UTC 2006
Update of /home/repository/bioruby/bioruby/lib/bio/util
In directory pub.open-bio.org:/tmp/cvs-serv2931
Modified Files:
contingency_table.rb
Log Message:
* RDoc is moved in the head and folded for readability
Index: contingency_table.rb
===================================================================
RCS file: /home/repository/bioruby/bioruby/lib/bio/util/contingency_table.rb,v
retrieving revision 1.2
retrieving revision 1.3
diff -C2 -d -r1.2 -r1.3
*** contingency_table.rb 13 Dec 2005 14:58:37 -0000 1.2
--- contingency_table.rb 27 Feb 2006 13:03:14 -0000 1.3
***************
*** 1,6 ****
- module Bio
#
! # bio/util/contingency_table.rb - Statistical contingency table analysis for aligned sequences
#
# Copyright:: Copyright (C) 2005 Trevor Wennblom <trevor at corevx.com>
--- 1,5 ----
#
! # = bio/util/contingency_table.rb - Statistical contingency table analysis for aligned sequences
#
# Copyright:: Copyright (C) 2005 Trevor Wennblom <trevor at corevx.com>
***************
*** 9,13 ****
# $Id$
#
! #
#--
#
--- 8,236 ----
# $Id$
#
! # == Synopsis
! #
! # The Bio::ContingencyTable class provides basic statistical contingency table
! # analysis for two positions within aligned sequences.
! #
! # When ContingencyTable is instantiated the set of characters in the
! # aligned sequences may be passed to it as an array. This is
! # important since it uses these characters to create the table's rows
! # and columns. If this array is not passed it will use it's default
! # of an amino acid and nucleotide alphabet in lowercase along with the
! # clustal spacer '-'.
! #
! # To get data from the table the most used functions will be
! # chi_square and contingency_coefficient:
! #
! # ctable = Bio::ContingencyTable.new()
! # ctable['a']['t'] += 1
! # # .. put more values into the table
! # puts ctable.chi_square
! # puts ctable.contingency_coefficient # between 0.0 and 1.0
! #
! # The contingency_coefficient represents the degree of correlation of
! # change between two sequence positions in a multiple-sequence
! # alignment. 0.0 indicates no correlation, 1.0 is the maximum
! # correlation.
! #
! #
! # == Further Reading
! #
! # * http://en.wikipedia.org/wiki/Contingency_table
! # * http://www.physics.csbsju.edu/stats/exact.details.html
! # * Numerical Recipes in C by Press, Flannery, Teukolsky, and Vetterling
! # #
! # == Usage
! #
! # What follows is an example of ContingencyTable in typical usage
! # analyzing results from a clustal alignment.
! #
! # require 'bio'
! # require 'bio/contingency_table'
! #
! # seqs = {}
! # max_length = 0
! # Bio::ClustalW::Report.new( IO.read('sample.aln') ).to_a.each do |entry|
! # data = entry.data.strip
! # seqs[entry.definition] = data.downcase
! # max_length = data.size if max_length == 0
! # raise "Aligned sequences must be the same length!" unless data.size == max_length
! # end
! #
! # VERBOSE = true
! # puts "i\tj\tchi_square\tcontingency_coefficient" if VERBOSE
! # correlations = {}
! #
! # 0.upto(max_length - 1) do |i|
! # (i+1).upto(max_length - 1) do |j|
! # ctable = Bio::ContingencyTable.new()
! # seqs.each_value { |seq| ctable.table[ seq[i].chr ][ seq[j].chr ] += 1 }
! #
! # chi_square = ctable.chi_square
! # contingency_coefficient = ctable.contingency_coefficient
! # puts [(i+1), (j+1), chi_square, contingency_coefficient].join("\t") if VERBOSE
! #
! # correlations["#{i+1},#{j+1}"] = contingency_coefficient
! # correlations["#{j+1},#{i+1}"] = contingency_coefficient # Both ways are accurate
! # end
! # end
! #
! # require 'yaml'
! # File.new('results.yml', 'a+') { |f| f.puts correlations.to_yaml }
! #
! #
! # == Tutorial
! #
!
! # ContingencyTable returns the statistical significance of change
! # between two positions in an alignment. If you would like to see how
! # every possible combination of positions in your alignment compares
! # to one another you must set this up yourself. Hopefully the
! # provided examples will help you get started without too much
! # trouble.
! #
! # def lite_example(sequences, max_length, characters)
! #
! # %w{i j chi_square contingency_coefficient}.each { |x| print x.ljust(12) }
! # puts
! #
! # 0.upto(max_length - 1) do |i|
! # (i+1).upto(max_length - 1) do |j|
! # ctable = Bio::ContingencyTable.new( characters )
! # sequences.each do |seq|
! # i_char = seq[i].chr
! # j_char = seq[j].chr
! # ctable.table[i_char][j_char] += 1
! # end
! # chi_square = ctable.chi_square
! # contingency_coefficient = ctable.contingency_coefficient
! # [(i+1), (j+1), chi_square, contingency_coefficient].each { |x| print x.to_s.ljust(12) }
! # puts
! # end
! # end
! #
! # end
! #
! # allowed_letters = Array.new
! # allowed_letters = 'abcdefghijk'.split('')
! #
! # seqs = Array.new
! # seqs << 'abcde'
! # seqs << 'abcde'
! # seqs << 'aacje'
! # seqs << 'aacae'
! #
! # length_of_every_sequence = seqs[0].size # 5 letters long
! #
! # lite_example(seqs, length_of_every_sequence, allowed_letters)
! #
! #
! # Producing the following results:
! #
! # i j chi_square contingency_coefficient
! # 1 2 0.0 0.0
! # 1 3 0.0 0.0
! # 1 4 0.0 0.0
! # 1 5 0.0 0.0
! # 2 3 0.0 0.0
! # 2 4 4.0 0.707106781186548
! # 2 5 0.0 0.0
! # 3 4 0.0 0.0
! # 3 5 0.0 0.0
! # 4 5 0.0 0.0
! #
! # The position i=2 and j=4 has a high contingency coefficient
! # indicating that the changes at these positions are related. Note
! # that i and j are arbitrary, this could be represented as i=4 and j=2
! # since they both refer to position two and position four in the
! # alignment. Here are some more examples:
! #
! # seqs = Array.new
! # seqs << 'abcde'
! # seqs << 'abcde'
! # seqs << 'aacje'
! # seqs << 'aacae'
! # seqs << 'akcfe'
! # seqs << 'akcfe'
! #
! # length_of_every_sequence = seqs[0].size # 5 letters long
! #
! # lite_example(seqs, length_of_every_sequence, allowed_letters)
! #
! #
! # Results:
! #
! # i j chi_square contingency_coefficient
! # 1 2 0.0 0.0
! # 1 3 0.0 0.0
! # 1 4 0.0 0.0
! # 1 5 0.0 0.0
! # 2 3 0.0 0.0
! # 2 4 12.0 0.816496580927726
! # 2 5 0.0 0.0
! # 3 4 0.0 0.0
! # 3 5 0.0 0.0
! # 4 5 0.0 0.0
! #
! # Here we can see that the strength of the correlation of change has
! # increased when more data is added with correlated changes at the
! # same positions.
! #
! # seqs = Array.new
! # seqs << 'abcde'
! # seqs << 'abcde'
! # seqs << 'kacje' # changed first letter
! # seqs << 'aacae'
! # seqs << 'akcfa' # changed last letter
! # seqs << 'akcfe'
! #
! # length_of_every_sequence = seqs[0].size # 5 letters long
! #
! # lite_example(seqs, length_of_every_sequence, allowed_letters)
! #
! #
! # Results:
! #
! # i j chi_square contingency_coefficient
! # 1 2 2.4 0.534522483824849
! # 1 3 0.0 0.0
! # 1 4 6.0 0.707106781186548
! # 1 5 0.24 0.196116135138184
! # 2 3 0.0 0.0
! # 2 4 12.0 0.816496580927726
! # 2 5 2.4 0.534522483824849
! # 3 4 0.0 0.0
! # 3 5 0.0 0.0
! # 4 5 2.4 0.534522483824849
! #
! # With random changes it becomes more difficult to identify correlated
! # changes, yet positions two and four still have the highest
! # correlation as indicated by the contingency coefficient. The best
! # way to improve the accuracy of your results, as is often the case
! # with statistics, is to increase the sample size.
! #
! #
! # == A Note on Efficiency
! #
!
! # ContingencyTable is slow. It involves many calculations for even a
! # seemingly small five-string data set. Even worse, it's very
! # dependent on matrix traversal, and this is done with two dimensional
! # hashes which dashes any hope of decent speed.
! #
!
! # Finally, half of the matrix is redundant and positions could be
! # summed with their companion position to reduce calculations. For
! # example the positions (5,2) and (2,5) could both have their values
! # added together and just stored in (2,5) while (5,2) could be an
! # illegal position. Also, positions (1,1), (2,2), (3,3), etc. will
! # never be used.
! #
! # The purpose of this package is flexibility and education. The code
! # is short and to the point in aims of achieving that purpose. If the
! # BioRuby project moves towards C extensions in the future a
! # professional caliber version will likely be created.
! #
! #
#--
#
***************
*** 30,252 ****
#
! =begin rdoc
! bio/util/contingency_table.rb - Statistical contingency table analysis for aligned sequences
!
! == Synopsis
!
! The Bio::ContingencyTable class provides basic statistical contingency table
! analysis for two positions within aligned sequences.
!
! When ContingencyTable is instantiated the set of characters in the aligned sequences may be
! passed to it as an array. This is important since it uses these characters
! to create the table's rows and columns. If this array is not passed it will
! use it's default of an amino acid and nucleotide alphabet in lowercase along with the
! clustal spacer '-'.
!
! To get data from the table the most used functions will be chi_square and contingency_coefficient:
! ctable = Bio::ContingencyTable.new()
! ctable['a']['t'] += 1
! # .. put more values into the table
! puts ctable.chi_square
! puts ctable.contingency_coefficient # between 0.0 and 1.0
!
! The contingency_coefficient represents the degree of correlation of change between two
! sequence positions in a multiple-sequence alignment. 0.0 indicates no correlation, 1.0 is the
! maximum correlation.
!
!
! == Further Reading
!
! * http://en.wikipedia.org/wiki/Contingency_table
! * http://www.physics.csbsju.edu/stats/exact.details.html
! * Numerical Recipes in C by Press, Flannery, Teukolsky, and Vetterling
!
!
! == Usage
!
! What follows is an example of ContingencyTable in typical usage analyzing results from a clustal alignment.
!
! require 'bio'
! require 'bio/contingency_table'
!
! seqs = {}
! max_length = 0
! Bio::ClustalW::Report.new( IO.read('sample.aln') ).to_a.each do |entry|
! data = entry.data.strip
! seqs[entry.definition] = data.downcase
! max_length = data.size if max_length == 0
! raise "Aligned sequences must be the same length!" unless data.size == max_length
! end
!
! VERBOSE = true
! puts "i\tj\tchi_square\tcontingency_coefficient" if VERBOSE
! correlations = {}
!
! 0.upto(max_length - 1) do |i|
! (i+1).upto(max_length - 1) do |j|
! ctable = Bio::ContingencyTable.new()
! seqs.each_value { |seq| ctable.table[ seq[i].chr ][ seq[j].chr ] += 1 }
!
! chi_square = ctable.chi_square
! contingency_coefficient = ctable.contingency_coefficient
! puts [(i+1), (j+1), chi_square, contingency_coefficient].join("\t") if VERBOSE
!
! correlations["#{i+1},#{j+1}"] = contingency_coefficient
! correlations["#{j+1},#{i+1}"] = contingency_coefficient # Both ways are accurate
! end
! end
!
! require 'yaml'
! File.new('results.yml', 'a+') { |f| f.puts correlations.to_yaml }
!
!
! == Tutorial
!
! ContingencyTable returns the statistical significance of change between two positions in an alignment.
! If you would like to see how every possible combination of positions in your alignment compares to one another
! you must set this up yourself. Hopefully the provided examples will help you get started without
! too much trouble.
!
! def lite_example(sequences, max_length, characters)
!
! %w{i j chi_square contingency_coefficient}.each { |x| print x.ljust(12) }
! puts
!
! 0.upto(max_length - 1) do |i|
! (i+1).upto(max_length - 1) do |j|
! ctable = Bio::ContingencyTable.new( characters )
! sequences.each do |seq|
! i_char = seq[i].chr
! j_char = seq[j].chr
! ctable.table[i_char][j_char] += 1
! end
! chi_square = ctable.chi_square
! contingency_coefficient = ctable.contingency_coefficient
! [(i+1), (j+1), chi_square, contingency_coefficient].each { |x| print x.to_s.ljust(12) }
! puts
! end
! end
!
! end
!
! allowed_letters = Array.new
! allowed_letters = 'abcdefghijk'.split('')
!
! seqs = Array.new
! seqs << 'abcde'
! seqs << 'abcde'
! seqs << 'aacje'
! seqs << 'aacae'
!
! length_of_every_sequence = seqs[0].size # 5 letters long
!
! lite_example(seqs, length_of_every_sequence, allowed_letters)
!
!
! Producing the following results:
!
! i j chi_square contingency_coefficient
! 1 2 0.0 0.0
! 1 3 0.0 0.0
! 1 4 0.0 0.0
! 1 5 0.0 0.0
! 2 3 0.0 0.0
! 2 4 4.0 0.707106781186548
! 2 5 0.0 0.0
! 3 4 0.0 0.0
! 3 5 0.0 0.0
! 4 5 0.0 0.0
!
! The position i=2 and j=4 has a high contingency coefficient indicating that the changes at these
! positions are related. Note that i and j are arbitrary, this could be represented as i=4 and j=2
! since they both refer to position two and position four in the alignment. Here are some more examples:
!
! seqs = Array.new
! seqs << 'abcde'
! seqs << 'abcde'
! seqs << 'aacje'
! seqs << 'aacae'
! seqs << 'akcfe'
! seqs << 'akcfe'
!
! length_of_every_sequence = seqs[0].size # 5 letters long
!
! lite_example(seqs, length_of_every_sequence, allowed_letters)
!
!
! Results:
!
! i j chi_square contingency_coefficient
! 1 2 0.0 0.0
! 1 3 0.0 0.0
! 1 4 0.0 0.0
! 1 5 0.0 0.0
! 2 3 0.0 0.0
! 2 4 12.0 0.816496580927726
! 2 5 0.0 0.0
! 3 4 0.0 0.0
! 3 5 0.0 0.0
! 4 5 0.0 0.0
!
! Here we can see that the strength of the correlation of change has increased when more data is added with correlated changes at the same positions.
!
! seqs = Array.new
! seqs << 'abcde'
! seqs << 'abcde'
! seqs << 'kacje' # changed first letter
! seqs << 'aacae'
! seqs << 'akcfa' # changed last letter
! seqs << 'akcfe'
!
! length_of_every_sequence = seqs[0].size # 5 letters long
!
! lite_example(seqs, length_of_every_sequence, allowed_letters)
!
!
! Results:
!
! i j chi_square contingency_coefficient
! 1 2 2.4 0.534522483824849
! 1 3 0.0 0.0
! 1 4 6.0 0.707106781186548
! 1 5 0.24 0.196116135138184
! 2 3 0.0 0.0
! 2 4 12.0 0.816496580927726
! 2 5 2.4 0.534522483824849
! 3 4 0.0 0.0
! 3 5 0.0 0.0
! 4 5 2.4 0.534522483824849
!
! With random changes it becomes more difficult to identify correlated changes, yet positions two
! and four still have the highest correlation as indicated by the contingency coefficient. The
! best way to improve the accuracy of your results, as is often the case with statistics, is to
! increase the sample size.
!
!
! == A Note on Efficiency
!
! ContingencyTable is slow. It involves many calculations for even a seemingly small five-string data set.
! Even worse, it's very dependent on matrix traversal, and this is done with two dimensional hashes which
! dashes any hope of decent speed.
!
! Finally, half of the matrix is redundant and positions could be summed with their companion position to reduce
! calculations. For example the positions (5,2) and (2,5) could both have their values added together and
! just stored in (2,5) while (5,2) could be an illegal position. Also, positions (1,1), (2,2), (3,3), etc.
! will never be used.
!
! The purpose of this package is flexibility and education. The code is short and to the point in
! aims of achieving that purpose. If the BioRuby project moves towards C extensions in the future a
! professional caliber version will likely be created.
!
!
! == Author
! Trevor Wennblom <trevor at corevx.com>
!
!
! == Copyright
! Copyright (C) 2005 Trevor Wennblom
! Licensed under the same terms as BioRuby.
!
! =end
class ContingencyTable
--- 253,257 ----
#
! module Bio
class ContingencyTable
More information about the bioruby-cvs
mailing list