Class: Groupie
- Inherits: Object
- Hierarchy: Object > Groupie
- Defined in: lib/groupie.rb, lib/groupie/group.rb, lib/groupie/version.rb, lib/groupie/tokenizer.rb
Overview
This extends Groupie and adds a version number
Defined Under Namespace
Classes: Error, Group, Tokenizer
Constant Summary
- VERSION = '0.6.0'
Instance Attribute Summary
- #smart_weight ⇒ Object
  Returns the value of attribute smart_weight.
Class Method Summary
- .tokenize(object) ⇒ Array<String>
  Turn a String (or anything else that responds to #to_s) into an Array of String tokens.
- .version ⇒ Object
Instance Method Summary
- #[](group) ⇒ Groupie::Group
  Access an existing Group or create a new one.
- #add_word(word) ⇒ Object
  Private method used by Groups to register known words with the parent Groupie.
- #classify(entry, strategy = :sum) ⇒ Hash<Object, Float>
  Classify a single word against all groups, returning the probability distribution.
- #classify_text(words, strategy = :sum) ⇒ Hash<Object, Float>
  Classify a text by taking the average of all word classifications.
- #default_weight ⇒ Float
  Default weight is used when smart_weight is enabled.
- #initialize(smart_weight: false) ⇒ Groupie (constructor)
  A new instance of Groupie.
- #unique_words ⇒ Array<String>
  Return the known words, excluding those in the most popular quartile.
Constructor Details
#initialize(smart_weight: false) ⇒ Groupie
Returns a new instance of Groupie.
    # File 'lib/groupie.rb', line 16

    def initialize(smart_weight: false)
      @groups = {}
      @smart_weight = smart_weight
      @known_words = Set.new
    end
Instance Attribute Details
#smart_weight ⇒ Object
Returns the value of attribute smart_weight.
    # File 'lib/groupie.rb', line 13

    def smart_weight
      @smart_weight
    end
Class Method Details
.tokenize(object) ⇒ Array<String>
Turn a String (or anything else that responds to #to_s) into an Array of String tokens. This attempts to remove most common punctuation marks and types of whitespace.
    # File 'lib/groupie.rb', line 27

    def self.tokenize(object)
      Tokenizer.new(object).to_tokens
    end
.version ⇒ Object
    # File 'lib/groupie/version.rb', line 7

    def self.version
      VERSION
    end
Instance Method Details
#[](group) ⇒ Groupie::Group
Access an existing Group or create a new one.
    # File 'lib/groupie.rb', line 35

    def [](group)
      @groups[group] ||= Group.new(group, self)
    end
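The access-or-create behaviour comes from the `||=` assignment on the internal hash. A minimal standalone sketch of that pattern follows; `Registry` and `FakeGroup` are hypothetical stand-ins used only for illustration and are not part of the library.

```ruby
# Hypothetical stand-in for Groupie::Group; it only keeps a name and its owner.
FakeGroup = Struct.new(:name, :owner)

# Hypothetical stand-in for Groupie, reduced to the lookup behaviour.
class Registry
  def initialize
    @groups = {}
  end

  # Return the stored group for this name, creating it on first access.
  def [](name)
    @groups[name] ||= FakeGroup.new(name, self)
  end

  def size
    @groups.size
  end
end

registry = Registry.new
first = registry[:spam]   # creates the group
second = registry[:spam]  # returns the same object again
puts first.equal?(second)
puts registry.size
```

Because the group is created on first access, callers never need a separate registration step before using a group name.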
#add_word(word) ⇒ Object
Private method used by Groups to register known words with the parent Groupie.
    # File 'lib/groupie.rb', line 119

    def add_word(word)
      @known_words << word
    end
#classify(entry, strategy = :sum) ⇒ Hash<Object, Float>
Classify a single word against all groups, returning the probability distribution.
    # File 'lib/groupie.rb', line 61

    def classify(entry, strategy = :sum)
      # Calculate default weight once outside of the loop
      default_weight = self.default_weight
      # Each group calculates the count, then reduces it to a score: <group name, score>
      per_group_score = @groups.transform_values do |group|
        apply_count_strategy(default_weight + group.count(entry), strategy)
      end
      # When we have no scores, we have no results, so abort early
      # Note that when smart_weight is enabled we always have a score.
      total_score = per_group_score.values.sum
      return {} if total_score.zero?

      # Final results must be within 0.0..1.0, so divide each score by the total score
      per_group_score.transform_values { |group_score| group_score.to_f / total_score }
    end
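The normalisation step can be tried in isolation. The sketch below is a standalone approximation, not the library's code: `normalize_scores` is a hypothetical helper, and it assumes the `:sum` strategy uses each count unchanged, since `apply_count_strategy` is not shown on this page.

```ruby
# Hypothetical helper mirroring the last steps of #classify.
# Assumption: with the :sum strategy the score equals the (weighted) count.
def normalize_scores(per_group_count, default_weight = 0.0)
  per_group_score = per_group_count.transform_values { |count| default_weight + count }
  total_score = per_group_score.values.sum
  # No scores at all means there is nothing to distribute.
  return {} if total_score.zero?

  # Divide by the total so the results add up to 1.0.
  per_group_score.transform_values { |score| score.to_f / total_score }
end

distribution = normalize_scores({ spam: 3, ham: 1 })
unknown_word = normalize_scores({ spam: 0, ham: 0 })
weighted_unknown = normalize_scores({ spam: 0, ham: 0 }, 2.0)
p distribution
p unknown_word
p weighted_unknown
```

With counts of 3 and 1 the scores are 3/4 and 1/4. A word that no group has seen yields an empty Hash, unless a non-zero default weight gives every group the same baseline score.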
#classify_text(words, strategy = :sum) ⇒ Hash<Object, Float>
Classify a text by taking the average of all word classifications.
    # File 'lib/groupie.rb', line 45

    def classify_text(words, strategy = :sum)
      words &= unique_words if strategy == :unique
      group_score_sums, hits = calculate_group_scores(words, strategy)

      group_score_sums.each.with_object({}) do |(group, sum), averages|
        averages[group] = sum / hits
      end
    end
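The averaging step can be illustrated without the library. The sketch below assumes that `calculate_group_scores` (not shown on this page) sums the per-word distributions and counts only words with a non-empty result as hits; `average_distributions` is a hypothetical helper written under that assumption.

```ruby
# Hypothetical helper: average several per-word distributions.
# Assumption: words that produced no classification ({}) are not counted as hits.
def average_distributions(distributions)
  hits = 0
  sums = Hash.new(0.0)
  distributions.each do |distribution|
    next if distribution.empty?

    hits += 1
    distribution.each { |group, score| sums[group] += score }
  end
  # Same shape as the loop in #classify_text: divide each sum by the hit count.
  sums.each.with_object({}) { |(group, sum), averages| averages[group] = sum / hits }
end

per_word = [
  { spam: 0.75, ham: 0.25 }, # first word leans towards spam
  {},                        # unknown word, ignored
  { spam: 0.25, ham: 0.75 }  # second known word leans towards ham
]
averages = average_distributions(per_word)
p averages
```

Here the two known words cancel out, so both groups end up with an average of 0.5.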
#default_weight ⇒ Float
Default weight is used when smart_weight is enabled. Each word’s count is increased by the default_weight value, which is the average frequency of each unique word we know about.
Example: if we have indexed 1000 total words, of which 500 were unique, the default_weight would be 1000 / 500 = 2.0.
    # File 'lib/groupie.rb', line 105

    def default_weight
      # Default weight only applies when smart weight is enabled
      return 0.0 unless smart_weight

      # If we don't know any words, the weight is also zero
      return 0.0 unless @known_words.any?

      # Gather counts and calculate
      total_words = @groups.each_value.sum(&:total_word_count)
      total_unique_words = @known_words.count
      total_words / total_unique_words.to_f
    end
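The same arithmetic can be worked through on a small data set. This is a standalone sketch: the `group_word_counts` Hash is made-up sample data standing in for the word counts that the groups hold internally.

```ruby
require 'set'

# Made-up sample data: per-group <word, count> dictionaries.
group_word_counts = {
  spam: { 'buy' => 3, 'now' => 1 },
  ham: { 'hello' => 2, 'now' => 2 }
}

# Total number of indexed words across all groups: (3 + 1) + (2 + 2) = 8.
total_words = group_word_counts.each_value.sum { |counts| counts.values.sum }

# Unique words known across all groups: buy, now, hello = 3.
known_words = group_word_counts.each_value.flat_map(&:keys).to_set

# Average frequency per unique word: 8 / 3.
default_weight = total_words / known_words.count.to_f
puts default_weight
```

With smart_weight enabled, this value is added to every group's count for a word, so a word that no group has seen still receives an even, non-zero score.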
#unique_words ⇒ Array<String>
Return the known words, excluding those in the most popular quartile: any word counted more often than the frequency at the 75% mark of the sorted counts is dropped. Why do this? So the most common (and thus meaningless) words are ignored and less common words gain more predictive power.
This is used by the :unique strategy of the classifier.
    # File 'lib/groupie.rb', line 84

    def unique_words
      # Iterate over all Groups and merge their <word, count> dictionaries into one
      total_count = @groups.inject({}) do |total, (_name, group)|
        total.merge!(group.word_counts) { |_key, o, n| o + n }
      end
      # Extract the word count that's at the top 75%
      top_quartile_index = [((total_count.size * 3) / 4) - 1, 1].max
      top_quartile_frequency = total_count.values.sort[top_quartile_index]
      # Throw out all words which have a count that's above this frequency
      total_count.reject! { |_word, count| count > top_quartile_frequency }
      total_count.keys
    end
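The quartile cut-off is easier to follow on concrete numbers. The sketch below repeats the index arithmetic on a made-up, already merged word count Hash; `drop_most_common` is a hypothetical helper, and it assumes at least two known words, since the index has a floor of 1.

```ruby
# Hypothetical helper repeating the cut-off logic on a merged <word, count> Hash.
# Assumption: word_counts holds at least two words.
def drop_most_common(word_counts)
  top_quartile_index = [((word_counts.size * 3) / 4) - 1, 1].max
  top_quartile_frequency = word_counts.values.sort[top_quartile_index]
  # Keep words whose count does not exceed the cut-off frequency.
  word_counts.reject { |_word, count| count > top_quartile_frequency }.keys
end

word_counts = { 'the' => 50, 'viagra' => 3, 'meeting' => 2, 'invoice' => 1 }
# Four words: index = (4 * 3) / 4 - 1 = 2, sorted counts = [1, 2, 3, 50], cut-off = 3.
kept = drop_most_common(word_counts)
p kept

# The :unique strategy intersects the input words with this list.
filtered = %w[the invoice the meeting] & kept
p filtered
```

In this sample only 'the' exceeds the cut-off of 3, so it is removed before classification, and the intersection also drops duplicate input words.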