Class: Linnaeus::Classifier
- Defined in:
- lib/linnaeus/classifier.rb
Overview
Classify documents against the Bayesian corpus.
lc = Linnaeus::Classifier.new(<options hash>)
lc.classify 'a string of text' #a wild category appears
lc.classification_scores 'a different string of text' #a hash of categories and scores
Constructor Options
- persistence_class
-
A class implementing persistence. The default (Linnaeus::Persistence) uses Redis.
- stopwords_class
-
A class that emits a set of stopwords. The default is Linnaeus::Stopwords.
- skip_stemming
-
Set to true to skip Porter stemming.
- encoding
-
Force text to use this character set. UTF-8 by default.
- redis_connection
-
An instantiated Redis connection, allowing you to reuse an existing Redis connection.
- redis_host
-
Passed to persistence class constructor. Defaults to “127.0.0.1”.
- redis_port
-
Passed to persistence class constructor. Defaults to “6379”.
- redis_db
-
Passed to persistence class constructor. Defaults to “0”.
- redis_*
-
See Linnaeus::Persistence for the remaining options, which are passed directly to the Redis client connection.
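As a rough illustration of what a class passed as persistence_class might look like, the sketch below defines a hypothetical in-memory stand-in. It implements only the two read methods that the classifier source on this page calls on its store (get_categories and get_words_with_count_for_category). The class name, the constructor signature, and the add_word helper are assumptions made for illustration; the real Linnaeus::Persistence interface may require more than this.

```ruby
# Hypothetical in-memory stand-in for a persistence class.
# Only the two read methods used by the classifier source are sketched;
# a real persistence class would also need whatever training relies on.
class InMemoryPersistence
  # Assumption: the constructor receives the classifier's options hash.
  def initialize(opts = {})
    @opts = opts
    @corpus = {}
  end

  # Illustrative helper (not part of Linnaeus): record occurrences of a word.
  def add_word(category, word, count = 1)
    words = (@corpus[category] ||= {})
    words[word] = words.fetch(word, 0) + count
  end

  # All category names currently stored.
  def get_categories
    @corpus.keys
  end

  # A word => count hash for one category; unknown words are absent (nil).
  def get_words_with_count_for_category(category)
    @corpus.fetch(category, {}).dup
  end
end

store = InMemoryPersistence.new
store.add_word('fruit', 'apple', 3)
store.add_word('fruit', 'pear')
store.add_word('tool', 'hammer', 2)
```

Returning a plain hash without a default value matters here, because the classifier treats a nil lookup as "word not seen in this category".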
Instance Method Summary
-
#classification_scores(text) ⇒ Object
Returns a hash of scores for each category in the Bayesian corpus.
-
#classify(text) ⇒ Object
The most likely category for a document.
Methods inherited from Linnaeus
#count_word_occurrences, #initialize
Constructor Details
This class inherits a constructor from Linnaeus
Instance Method Details
#classification_scores(text) ⇒ Object
Returns a hash of scores for each category in the Bayesian corpus. Each score is a sum of log-probabilities, so scores are normally negative; the closer a score is to 0, the more likely the match.
Parameters
- text
-
a string of text to classify.
Returns
A hash mapping each category to its score.
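Read from the source listing for this method, the score for a category appears to be computed as follows. The symbols are notation introduced here, not names from the library: W(d) is the set of distinct words that count_word_occurrences returns for the text d, n_{c,w} is the stored count of word w in category c, and N_c is the sum of all word counts in c.

```latex
\mathrm{score}(c) \;=\; \sum_{w \in W(d)} \ln \frac{\tilde{n}_{c,w}}{N_c},
\qquad
\tilde{n}_{c,w} =
\begin{cases}
n_{c,w} & \text{if } w \text{ has been seen in } c,\\
0.1     & \text{otherwise,}
\end{cases}
\qquad
N_c = \sum_{w} n_{c,w}.
```

Each term is the logarithm of a ratio that is at most 1, so every term is at most 0. This is why a score nearer to 0 indicates a better match.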
    # File 'lib/linnaeus/classifier.rb', line 37

    def classification_scores(text)
      scores = {}

      @db.get_categories.each do |category|
        words_with_count_for_category = @db.get_words_with_count_for_category category
        total_word_count_sum_for_category = words_with_count_for_category.values.reduce(0){|sum, count| sum += count.to_i}

        scores[category] = 0
        count_word_occurrences(text).each do |word, count|
          tmp_score = (words_with_count_for_category[word].nil?) ? 0.1 : words_with_count_for_category[word].to_i
          scores[category] += Math.log(tmp_score / total_word_count_sum_for_category.to_f)
        end
      end

      scores
    end
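To make the scoring loop easier to follow, here is a minimal standalone sketch of the same computation. It is not the library's code: the corpus is a plain Ruby hash with integer counts (an assumption for illustration, in place of the Redis-backed store), the input is an already tokenized word list, and sketch_scores and UNSEEN_WORD_COUNT are hypothetical names.

```ruby
# Standalone sketch of the scoring loop shown above.
# Assumption: the corpus is a plain hash of category => { word => count }.
UNSEEN_WORD_COUNT = 0.1 # stand-in count the source uses for unseen words

def sketch_scores(corpus, words)
  corpus.each_with_object({}) do |(category, counts), scores|
    total = counts.values.sum.to_f
    # One log term per distinct word, mirroring the source, which ignores
    # how many times a word occurs in the text being classified.
    scores[category] = words.uniq.sum(0.0) { |word|
      Math.log(counts.fetch(word, UNSEEN_WORD_COUNT) / total)
    }
  end
end

corpus = {
  'fruit' => { 'apple' => 3, 'pear' => 1 },
  'tool'  => { 'hammer' => 2, 'saw' => 2 }
}

scores = sketch_scores(corpus, %w[apple pear])
# 'fruit' scores ln(3/4) + ln(1/4); 'tool' scores 2 * ln(0.1/4),
# so 'fruit' is closer to 0 and is the better match.
```

The 0.1 stand-in keeps an unseen word from producing ln(0), which would drive the whole category score to negative infinity.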
#classify(text) ⇒ Object
The most likely category for a document.
Parameters
- text
-
a string of text to classify.
Returns
A string representing the most likely category, or an empty string if there are no categories to score against.
    # File 'lib/linnaeus/classifier.rb', line 61

    def classify(text)
      scores = classification_scores(text)
      if scores.any?
        (scores.sort_by { |a| -a[1] })[0][0]
      else
        ''
      end
    end
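The selection step can be sketched on its own: given a hash of scores, pick the category with the highest (least negative) score, or return an empty string when there is nothing to choose from. The method name most_likely_category is hypothetical, and max_by is used here in place of the source's sort_by; the two agree except possibly when scores tie.

```ruby
# Sketch of the selection step: the highest score is the one closest to 0,
# because scores are sums of logarithms of ratios no greater than 1.
def most_likely_category(scores)
  return '' if scores.empty?

  scores.max_by { |_category, score| score }.first
end

example_scores = { 'fruit' => -1.67, 'tool' => -7.38 }

most_likely_category(example_scores) # => "fruit"
most_likely_category({})             # => ""
```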