Class: NaiveBayesClassifier
- Inherits: Object
- Defined in: lib/unsupervised-language-detection/naive-bayes-classifier.rb
Instance Attribute Summary
- #category_names ⇒ Object
  Returns the value of attribute category_names.
- #num_categories ⇒ Object (readonly)
  Returns the value of attribute num_categories.
- #prior_category_counts ⇒ Object (readonly)
  Returns the value of attribute prior_category_counts.
- #prior_token_count ⇒ Object (readonly)
  Returns the value of attribute prior_token_count.
Class Method Summary
- .train_em(max_epochs, training_examples) ⇒ Object
  Performs a Naive Bayes EM algorithm with two classes.
Instance Method Summary
- #classify(tokens) ⇒ Object
  Returns the index (not the name) of the category the tokens are classified under.
- #get_posterior_category_probabilities(tokens) ⇒ Object
  Returns p(category | tokens), for each category, in an array.
- #get_prior_category_probability(category_index) ⇒ Object
  Returns p(category).
- #get_token_probability(token, category_index) ⇒ Object
  Returns p(token | category).
- #initialize(options = {}) ⇒ NaiveBayesClassifier (constructor)
  Takes an options hash; num_categories is the number of categories we want to classify.
- #train(example, category_index, probability = 1) ⇒ Object
  Given a labeled training example (i.e., an array of tokens and its probability of belonging to a certain category), updates the parameters of the Naive Bayes model.
Constructor Details
#initialize(options = {}) ⇒ NaiveBayesClassifier
Parameters (keys of the options hash)
- num_categories: number of categories we want to classify. Defaults to 2.
- prior_category_counts: array of parameters for a Dirichlet prior that we place on the prior probabilities of each category. (In other words, these are “virtual counts” of the number of times we have seen each category previously.) Set the array to all zeros if you want to use maximum likelihood estimates. Defaults to values drawn uniformly at random from the unit interval if nothing is set.
- prior_token_count: parameter for a beta prior that we place on p(token|category). (In other words, this is a “virtual count” of the number of times we have seen each token previously.) Set to 0 if you want to use maximum likelihood estimates. Defaults to 0.0001.
- category_names: array of names for the categories. Defaults to the category indices as strings ("0", "1", and so on).
# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 10
def initialize(options = {})
  options = {:num_categories => 2, :prior_token_count => 0.0001}.merge(options)
  @num_categories = options[:num_categories]
  @prior_token_count = options[:prior_token_count]
  @prior_category_counts = options[:prior_category_counts] || Array.new(@num_categories) { rand }
  @category_names = options[:category_names] || (0..num_categories-1).map(&:to_s).to_a

  # `@token_counts[category][token]` is the (weighted) number of times we have seen `token` with this category.
  @token_counts = Array.new(@num_categories) do
    Hash.new { |h, token| h[token] = 0 }
  end

  # `@total_token_counts[category]` is always equal to `@token_counts[category].sum`.
  @total_token_counts = Array.new(@num_categories, 0)

  # `@category_counts[category]` is the (weighted) number of training examples we have seen with this category.
  @category_counts = Array.new(@num_categories, 0)
end
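The option handling in the constructor can be tried out in isolation. The sketch below is a minimal stand-in, not library code: `DEFAULTS` and `resolve_options` are hypothetical names introduced only for illustration, and the logic simply mirrors the merge-then-fallback pattern shown in the listing.

```ruby
# Hypothetical stand-in for the constructor's option handling (not part of the library).
DEFAULTS = { :num_categories => 2, :prior_token_count => 0.0001 }

def resolve_options(options = {})
  # Caller-supplied keys override the defaults.
  opts = DEFAULTS.merge(options)
  n = opts[:num_categories]
  # Fall back to random virtual counts in [0, 1) when none are given.
  opts[:prior_category_counts] ||= Array.new(n) { rand }
  # Fall back to the category indices as strings.
  opts[:category_names] ||= (0..n - 1).map(&:to_s)
  opts
end

# Three categories with all-zero prior counts (maximum likelihood estimates).
resolved = resolve_options(:num_categories => 3, :prior_category_counts => [0, 0, 0])
```

Because an explicitly passed array is truthy in Ruby, even an all-zero `prior_category_counts` array is kept as given; only a missing (or `nil`) value triggers the random default.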
Instance Attribute Details
#category_names ⇒ Object
Returns the value of attribute category_names.
# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 3
def category_names
  @category_names
end
#num_categories ⇒ Object (readonly)
Returns the value of attribute num_categories.
# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 2
def num_categories
  @num_categories
end
#prior_category_counts ⇒ Object (readonly)
Returns the value of attribute prior_category_counts.
# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 2
def prior_category_counts
  @prior_category_counts
end
#prior_token_count ⇒ Object (readonly)
Returns the value of attribute prior_token_count.
# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 2
def prior_token_count
  @prior_token_count
end
Class Method Details
.train_em(max_epochs, training_examples) ⇒ Object
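Performs a Naive Bayes EM algorithm with two classes. Each of the max_epochs epochs builds a fresh classifier from the posterior category probabilities that the previous epoch's classifier assigns to every training example (an array of tokens); the classifier from the final epoch is returned.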
# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 47
def self.train_em(max_epochs, training_examples)
  prev_classifier = NaiveBayesClassifier.new
  max_epochs.times do
    classifier = NaiveBayesClassifier.new

    # E-M training
    training_examples.each do |example|
      # E-step: for each training example, recompute its classification probabilities.
      posterior_category_probs = prev_classifier.get_posterior_category_probabilities(example)

      # M-step: for each category, recompute the probability of generating each token.
      posterior_category_probs.each_with_index do |p, category|
        classifier.train(example, category, p)
      end
    end
    prev_classifier = classifier
    # TODO: add a convergence check, so we can break out early if we want.
  end
  return prev_classifier
end
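The TODO in the listing leaves the convergence check open. One possible approach, sketched here purely as an assumption, is to compare the posterior arrays produced in two consecutive epochs and stop once the largest change falls below a tolerance. The helper names `max_posterior_change` and `converged?` and the default tolerance are hypothetical and do not exist in the library.

```ruby
# Hypothetical helpers for an early-exit check (not part of NaiveBayesClassifier).
# Each argument is an array of posterior arrays, one per training example,
# e.g. [[0.9, 0.1], [0.2, 0.8]].
def max_posterior_change(prev_posteriors, curr_posteriors)
  changes = prev_posteriors.zip(curr_posteriors).flat_map do |prev, curr|
    prev.zip(curr).map { |a, b| (a - b).abs }
  end
  # An empty training set has no change at all.
  changes.max || 0.0
end

def converged?(prev_posteriors, curr_posteriors, tolerance = 1e-6)
  max_posterior_change(prev_posteriors, curr_posteriors) < tolerance
end
```

Using such a check inside the loop would require collecting the posteriors of each epoch, which the listing above does not currently do.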
Instance Method Details
#classify(tokens) ⇒ Object
Returns the index (not the name) of the category the tokens are classified under.
# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 69
def classify(tokens)
  max_prob, max_category = -1, -1

  if tokens.empty?
    # If the example is empty, find the category with the highest prior probability.
    (0..@num_categories - 1).each do |i|
      prior_prob = get_prior_category_probability(i)
      max_prob, max_category = prior_prob, i if prior_prob > max_prob
    end
  else
    # Otherwise, find the category with the highest posterior probability.
    get_posterior_category_probabilities(tokens).each_with_index do |prob, category|
      max_prob, max_category = prob, category if prob > max_prob
    end
  end

  return max_category
end
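Both branches of the listing perform the same argmax scan. The standalone sketch below isolates that scan and shows how a returned index could be mapped back to a name; `argmax` and the example names are hypothetical, chosen only for illustration.

```ruby
# Hypothetical helper: index of the largest probability, scanning left to right.
# The strict `>` comparison means the first index wins on ties, and an empty
# array yields -1, matching the initial values used in the listing.
def argmax(probabilities)
  max_prob, max_index = -1, -1
  probabilities.each_with_index do |prob, index|
    max_prob, max_index = prob, index if prob > max_prob
  end
  max_index
end

# Mapping an index back to a name (illustrative names, not from the library).
names = ["english", "other"]
winner = names[argmax([0.2, 0.8])]
```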
#get_posterior_category_probabilities(tokens) ⇒ Object
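Returns p(category | tokens), for each category, in an array indexed by category. If every unnormalized posterior is 0, the array is returned as all zeros rather than normalized. For an empty token array the product over tokens is nil and the method raises; #classify handles empty input separately by using the prior probabilities.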
# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 89
def get_posterior_category_probabilities(tokens)
  unnormalized_posterior_probs = (0..@num_categories-1).map do |category|
    p = tokens.map { |token| get_token_probability(token, category) }.reduce(:*) # p(tokens | category)
    p * get_prior_category_probability(category) # p(tokens | category) * p(category)
  end
  normalization = unnormalized_posterior_probs.reduce(:+)
  normalization = 1 if normalization == 0
  return unnormalized_posterior_probs.map{ |p| p / normalization }
end
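The listing multiplies token probabilities directly, which can underflow to 0.0 for long token arrays. A common alternative, which is not what the library does, is to sum logarithms and normalize with the log-sum-exp trick. The sketch below assumes all probabilities are strictly positive; `log_space_posteriors` is a hypothetical name.

```ruby
# Hypothetical helper, not part of NaiveBayesClassifier.
# token_probs_by_category[c] holds p(token | c) for each token of one example;
# priors[c] holds p(c). All values are assumed to be strictly positive.
def log_space_posteriors(token_probs_by_category, priors)
  log_scores = token_probs_by_category.each_with_index.map do |token_probs, category|
    # log p(c) + sum over tokens of log p(token | c)
    token_probs.reduce(Math.log(priors[category])) { |sum, prob| sum + Math.log(prob) }
  end

  # Log-sum-exp trick: subtract the largest score before exponentiating,
  # so the largest weight is exactly 1.0 and the sum cannot be zero.
  max_score = log_scores.max
  weights = log_scores.map { |score| Math.exp(score - max_score) }
  total = weights.reduce(:+)
  weights.map { |w| w / total }
end

# Two tokens per category: agrees with the direct product.
short = log_space_posteriors([[0.5, 0.5], [0.1, 0.1]], [0.5, 0.5])

# 1000 tokens per category: the direct product would underflow to 0.0.
long = log_space_posteriors([Array.new(1000, 1e-5), Array.new(1000, 1e-6)], [0.5, 0.5])
```

For the short example the result matches the direct computation, 0.125 / 0.13 for the first category. For the long example the posteriors still sum to 1, whereas the direct products are both 0.0.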
#get_prior_category_probability(category_index) ⇒ Object
Returns p(category).
# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 110
def get_prior_category_probability(category_index)
  denom = @category_counts.reduce(:+) + @prior_category_counts.reduce(:+)
  if denom == 0
    return 0
  else
    return (@category_counts[category_index] + @prior_category_counts[category_index]).to_f / denom
  end
end
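To see what the virtual counts do to the estimate, the same arithmetic can be run on plain arrays. This is a standalone sketch: `smoothed_category_probability` is a hypothetical name, and the counts are made-up numbers for illustration.

```ruby
# Hypothetical standalone version of the formula in the listing:
# (observed count + virtual count) / (all observed counts + all virtual counts).
def smoothed_category_probability(category_counts, prior_category_counts, category_index)
  denom = category_counts.reduce(:+) + prior_category_counts.reduce(:+)
  return 0 if denom == 0
  (category_counts[category_index] + prior_category_counts[category_index]).to_f / denom
end

# 3 examples seen in category 0 and 1 in category 1.
with_prior    = smoothed_category_probability([3, 1], [1, 1], 0) # (3 + 1) / (4 + 2)
without_prior = smoothed_category_probability([3, 1], [0, 0], 0) # 3 / 4
```

With one virtual count per category the estimate is pulled from 0.75 toward 0.5; with all-zero prior counts it is the maximum likelihood estimate.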
#get_token_probability(token, category_index) ⇒ Object
Returns p(token | category).
# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 100
def get_token_probability(token, category_index)
  denom = @total_token_counts[category_index] + @token_counts[category_index].size * @prior_token_count
  if denom == 0
    return 0
  else
    return ((@token_counts[category_index][token] || 0) + @prior_token_count).to_f / denom
  end
end
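The smoothed estimate can likewise be reproduced on a plain hash of counts for a single category. In this sketch `smoothed_token_probability` is a hypothetical name and the counts are illustrative. It uses `fetch`, so looking up an unseen token does not add it to the hash, unlike the default-block hash used in the constructor listing.

```ruby
# Hypothetical standalone version of the formula in the listing, for one category:
# (count of token + virtual count) / (total count + distinct tokens * virtual count).
def smoothed_token_probability(token_counts, token, prior_token_count)
  total = token_counts.values.reduce(0, :+)
  denom = total + token_counts.size * prior_token_count
  return 0 if denom == 0
  (token_counts.fetch(token, 0) + prior_token_count).to_f / denom
end

counts = { "the" => 3, "cat" => 1 }
seen   = smoothed_token_probability(counts, "the", 1) # (3 + 1) / (4 + 2 * 1)
unseen = smoothed_token_probability(counts, "dog", 1) # (0 + 1) / (4 + 2 * 1)
mle    = smoothed_token_probability(counts, "dog", 0) # 0 / 4
```

A nonzero virtual count gives unseen tokens a small nonzero probability, which keeps a single unseen token from zeroing out the whole product over tokens.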
#train(example, category_index, probability = 1) ⇒ Object
Given a labeled training example (i.e., an array of tokens and its probability of belonging to a certain category), update the parameters of the Naive Bayes model.
Parameters
- example: an array of tokens.
- category_index: the index of the category this example belongs to.
- probability: the probability that the example belongs to the category.
# File 'lib/unsupervised-language-detection/naive-bayes-classifier.rb', line 38
def train(example, category_index, probability = 1)
  example.each do |token|
    @token_counts[category_index][token] += probability
    @total_token_counts[category_index] += probability
  end
  @category_counts[category_index] += probability
end
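Because `probability` is added instead of 1, an example can contribute fractional counts to a category. The sketch below shows that update on a single plain hash; `add_weighted_example` and the token arrays are hypothetical and stand in for one category's counts.

```ruby
# Hypothetical helper mirroring the per-token update in the listing for one category.
def add_weighted_example(token_counts, example, probability = 1)
  # Every occurrence of a token adds `probability` to its count.
  example.each { |token| token_counts[token] += probability }
  token_counts
end

counts = Hash.new(0)
add_weighted_example(counts, %w[le chat le], 0.75) # example belongs to the category with p = 0.75
add_weighted_example(counts, %w[le chien], 0.25)   # example belongs to the category with p = 0.25
```

After both calls, "le" has a weighted count of 0.75 + 0.75 + 0.25 = 1.75, "chat" has 0.75, and "chien" has 0.25.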