Class: Bio::ContingencyTable
- Defined in:
- lib/bio/util/contingency_table.rb
Overview
bio/util/contingency_table.rb - Statistical contingency table analysis for aligned sequences
- Author
-
Trevor Wennblom <[email protected]>
- Copyright
-
Copyright © 2005-2007 Midwinter Laboratories, LLC (midwinterlabs.com)
- License
-
The Ruby License
Description
The Bio::ContingencyTable class provides basic statistical contingency table analysis for two positions within aligned sequences.
When ContingencyTable is instantiated the set of characters in the aligned sequences may be passed to it as an array. This is important since it uses these characters to create the table’s rows and columns. If this array is not passed it will use it’s default of an amino acid and nucleotide alphabet in lowercase along with the clustal spacer ‘-’.
To get data from the table the most used functions will be chi_square and contingency_coefficient:
ctable = Bio::ContingencyTable.new()
ctable['a']['t'] += 1
# .. put more values into the table
puts ctable.chi_square
puts ctable.contingency_coefficient # between 0.0 and 1.0
The contingency_coefficient represents the degree of correlation of change between two sequence positions in a multiple-sequence alignment. 0.0 indicates no correlation, 1.0 is the maximum correlation.
Further Reading
-
Numerical Recipes in C by Press, Flannery, Teukolsky, and Vetterling
Usage
What follows is an example of ContingencyTable in typical usage analyzing results from a clustal alignment.
require 'bio'
seqs = {}
max_length = 0
Bio::ClustalW::Report.new( IO.read('sample.aln') ).to_a.each do |entry|
data = entry.data.strip
seqs[entry.definition] = data.downcase
max_length = data.size if max_length == 0
raise "Aligned sequences must be the same length!" unless data.size == max_length
end
VERBOSE = true
puts "i\tj\tchi_square\tcontingency_coefficient" if VERBOSE
correlations = {}
0.upto(max_length - 1) do |i|
(i+1).upto(max_length - 1) do |j|
ctable = Bio::ContingencyTable.new()
seqs.each_value { |seq| ctable.table[ seq[i].chr ][ seq[j].chr ] += 1 }
chi_square = ctable.chi_square
contingency_coefficient = ctable.contingency_coefficient
puts [(i+1), (j+1), chi_square, contingency_coefficient].join("\t") if VERBOSE
correlations["#{i+1},#{j+1}"] = contingency_coefficient
correlations["#{j+1},#{i+1}"] = contingency_coefficient # Both ways are accurate
end
end
require 'yaml'
File.new('results.yml', 'a+') { |f| f.puts correlations.to_yaml }
Tutorial
ContingencyTable returns the statistical significance of change between two positions in an alignment. If you would like to see how every possible combination of positions in your alignment compares to one another you must set this up yourself. Hopefully the provided examples will help you get started without too much trouble.
def lite_example(sequences, max_length, characters)
%w{i j chi_square contingency_coefficient}.each { |x| print x.ljust(12) }
puts
0.upto(max_length - 1) do |i|
(i+1).upto(max_length - 1) do |j|
ctable = Bio::ContingencyTable.new( characters )
sequences.each do |seq|
i_char = seq[i].chr
j_char = seq[j].chr
ctable.table[i_char][j_char] += 1
end
chi_square = ctable.chi_square
contingency_coefficient = ctable.contingency_coefficient
[(i+1), (j+1), chi_square, contingency_coefficient].each { |x| print x.to_s.ljust(12) }
puts
end
end
end
allowed_letters = Array.new
allowed_letters = 'abcdefghijk'.split('')
seqs = Array.new
seqs << 'abcde'
seqs << 'abcde'
seqs << 'aacje'
seqs << 'aacae'
length_of_every_sequence = seqs[0].size # 5 letters long
lite_example(seqs, length_of_every_sequence, allowed_letters)
Producing the following results:
i j chi_square contingency_coefficient
1 2 0.0 0.0
1 3 0.0 0.0
1 4 0.0 0.0
1 5 0.0 0.0
2 3 0.0 0.0
2 4 4.0 0.707106781186548
2 5 0.0 0.0
3 4 0.0 0.0
3 5 0.0 0.0
4 5 0.0 0.0
The position i=2 and j=4 has a high contingency coefficient indicating that the changes at these positions are related. Note that i and j are arbitrary, this could be represented as i=4 and j=2 since they both refer to position two and position four in the alignment. Here are some more examples:
seqs = Array.new
seqs << 'abcde'
seqs << 'abcde'
seqs << 'aacje'
seqs << 'aacae'
seqs << 'akcfe'
seqs << 'akcfe'
length_of_every_sequence = seqs[0].size # 5 letters long
lite_example(seqs, length_of_every_sequence, allowed_letters)
Results:
i j chi_square contingency_coefficient
1 2 0.0 0.0
1 3 0.0 0.0
1 4 0.0 0.0
1 5 0.0 0.0
2 3 0.0 0.0
2 4 12.0 0.816496580927726
2 5 0.0 0.0
3 4 0.0 0.0
3 5 0.0 0.0
4 5 0.0 0.0
Here we can see that the strength of the correlation of change has increased when more data is added with correlated changes at the same positions.
seqs = Array.new
seqs << 'abcde'
seqs << 'abcde'
seqs << 'kacje' # changed first letter
seqs << 'aacae'
seqs << 'akcfa' # changed last letter
seqs << 'akcfe'
length_of_every_sequence = seqs[0].size # 5 letters long
lite_example(seqs, length_of_every_sequence, allowed_letters)
Results:
i j chi_square contingency_coefficient
1 2 2.4 0.534522483824849
1 3 0.0 0.0
1 4 6.0 0.707106781186548
1 5 0.24 0.196116135138184
2 3 0.0 0.0
2 4 12.0 0.816496580927726
2 5 2.4 0.534522483824849
3 4 0.0 0.0
3 5 0.0 0.0
4 5 2.4 0.534522483824849
With random changes it becomes more difficult to identify correlated changes, yet positions two and four still have the highest correlation as indicated by the contingency coefficient. The best way to improve the accuracy of your results, as is often the case with statistics, is to increase the sample size.
A Note on Efficiency
ContingencyTable is slow. It involves many calculations for even a seemingly small five-string data set. Even worse, it’s very dependent on matrix traversal, and this is done with two dimensional hashes which dashes any hope of decent speed.
Finally, half of the matrix is redundant and positions could be summed with their companion position to reduce calculations. For example the positions (5,2) and (2,5) could both have their values added together and just stored in (2,5) while (5,2) could be an illegal position. Also, positions (1,1), (2,2), (3,3), etc. will never be used.
The purpose of this package is flexibility and education. The code is short and to the point in aims of achieving that purpose. If the BioRuby project moves towards C extensions in the future a professional caliber version will likely be created.
Instance Attribute Summary collapse
-
#characters ⇒ Object
readonly
Returns the value of attribute characters.
- #table ⇒ Object
Instance Method Summary collapse
-
#chi_square ⇒ Object
Report the chi square of the entire table.
-
#chi_square_element(i, j) ⇒ Object
Report the chi-square relation of two elements in the table.
-
#column_sum(j) ⇒ Object
Report the sum of all values in a given column.
-
#column_sum_all ⇒ Object
Report the sum of all values in all columns.
-
#contingency_coefficient ⇒ Object
Report the contingency coefficient of the table.
-
#expected(i, j) ⇒ Object
Calculate e, the expected value.
-
#initialize(characters_in_sequences = nil) ⇒ ContingencyTable
constructor
Create a ContingencyTable that has characters_in_sequence.size rows and characters_in_sequence.size columns for each row.
-
#row_sum(i) ⇒ Object
Report the sum of all values in a given row.
-
#row_sum_all ⇒ Object
(also: #table_sum_all)
Report the sum of all values in all rows.
Constructor Details
#initialize(characters_in_sequences = nil) ⇒ ContingencyTable
Create a ContingencyTable that has characters_in_sequence.size rows and characters_in_sequence.size columns for each row
Arguments
-
characters_in_sequences
: (optional) The allowable characters that will be present in the aligned sequences.
- Returns
-
ContingencyTable
object to be filled with values and calculated upon
256 257 258 259 260 |
# File 'lib/bio/util/contingency_table.rb', line 256 def initialize(characters_in_sequences = nil) @characters = ( characters_in_sequences or %w{a c d e f g h i k l m n p q r s t v w y - x u} ) tmp = Hash[*@characters.collect { |v| [v, 0] }.flatten] @table = Hash[*@characters.collect { |v| [v, tmp.dup] }.flatten] end |
Instance Attribute Details
#characters ⇒ Object (readonly)
Returns the value of attribute characters.
247 248 249 |
# File 'lib/bio/util/contingency_table.rb', line 247 def characters @characters end |
Instance Method Details
#chi_square ⇒ Object
Report the chi square of the entire table
Arguments
-
none
- Returns
-
Float
chi square value
332 333 334 335 336 337 338 339 340 341 342 |
# File 'lib/bio/util/contingency_table.rb', line 332 def chi_square total = 0 c = @characters max = c.size - 1 @characters.each do |i| # Loop through every row in the ContingencyTable @characters.each do |j| # Loop through every column in the ContingencyTable total += chi_square_element(i, j) end end total end |
#chi_square_element(i, j) ⇒ Object
Report the chi-square relation of two elements in the table
Arguments
-
i
: row -
j
: column
- Returns
-
Float
chi-square of an intersection
351 352 353 354 355 |
# File 'lib/bio/util/contingency_table.rb', line 351 def chi_square_element(i, j) eij = expected(i, j) return 0 if eij == 0 ( @table[i][j] - eij )**2 / eij end |
#column_sum(j) ⇒ Object
Report the sum of all values in a given column
Arguments
-
j
: Column to sum
- Returns
-
Integer
sum of column
280 281 282 283 284 |
# File 'lib/bio/util/contingency_table.rb', line 280 def column_sum(j) total = 0 @table.each { |row_key, column| total += column[j] } total end |
#column_sum_all ⇒ Object
Report the sum of all values in all columns.
-
This is the same thing as asking for the sum of all values in the table.
Arguments
-
none
- Returns
-
Integer
sum of all columns
294 295 296 297 298 |
# File 'lib/bio/util/contingency_table.rb', line 294 def column_sum_all total = 0 @characters.each { |j| total += column_sum(j) } total end |
#contingency_coefficient ⇒ Object
Report the contingency coefficient of the table
Arguments
-
none
- Returns
-
Float
contingency_coefficient of the table
363 364 365 366 |
# File 'lib/bio/util/contingency_table.rb', line 363 def contingency_coefficient c_s = chi_square Math.sqrt(c_s / (table_sum_all + c_s) ) end |
#expected(i, j) ⇒ Object
Calculate e, the expected value.
Arguments
-
i
: row -
j
: column
- Returns
-
Float
e(sub:ij) = (r(sub:i)/N) * (c(sub:j))
322 323 324 |
# File 'lib/bio/util/contingency_table.rb', line 322 def expected(i, j) (row_sum(i).to_f / table_sum_all) * column_sum(j) end |
#row_sum(i) ⇒ Object
Report the sum of all values in a given row
Arguments
-
i
: Row to sum
- Returns
-
Integer
sum of row
268 269 270 271 272 |
# File 'lib/bio/util/contingency_table.rb', line 268 def row_sum(i) total = 0 @table[i].each { |k, v| total += v } total end |
#row_sum_all ⇒ Object Also known as: table_sum_all
Report the sum of all values in all rows.
-
This is the same thing as asking for the sum of all values in the table.
Arguments
-
none
- Returns
-
Integer
sum of all rows
308 309 310 311 312 |
# File 'lib/bio/util/contingency_table.rb', line 308 def row_sum_all total = 0 @characters.each { |i| total += row_sum(i) } total end |