Class: Groupie

Inherits:
Object
Defined in:
lib/groupie.rb,
lib/groupie/group.rb,
lib/groupie/version.rb,
lib/groupie/tokenizer.rb

Overview

The version file reopens Groupie to add a version number (VERSION and .version).

Defined Under Namespace

Classes: Error, Group, Tokenizer

Constant Summary

VERSION =
'0.6.0'

Constructor Details

#initialize(smart_weight: false) ⇒ Groupie

Returns a new instance of Groupie.

Parameters:

  • smart_weight (true, false) (defaults to: false)

    Whether smart weight is enabled.



# File 'lib/groupie.rb', line 16

def initialize(smart_weight: false)
  @groups = {}
  @smart_weight = smart_weight
  @known_words = Set.new
end

Instance Attribute Details

#smart_weight ⇒ Object

Returns the value of attribute smart_weight.



# File 'lib/groupie.rb', line 13

def smart_weight
  @smart_weight
end

Class Method Details

.tokenize(object) ⇒ Array<String>

Turn a String (or anything else that responds to #to_s) into an Array of String tokens. This attempts to remove most common punctuation marks and types of whitespace.

Parameters:

  • object (String, #to_s)

Returns:

  • (Array<String>)


# File 'lib/groupie.rb', line 27

def self.tokenize(object)
  Tokenizer.new(object).to_tokens
end

.version ⇒ Object

Returns the current version string, VERSION.



# File 'lib/groupie/version.rb', line 7

def self.version
  VERSION
end

Instance Method Details

#[](group) ⇒ Groupie::Group

Access an existing Group or create a new one.

Parameters:

  • group (Object)

    The name of the group to access.

Returns:

  • (Groupie::Group)


# File 'lib/groupie.rb', line 35

def [](group)
  @groups[group] ||= Group.new(group, self)
end
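
The lookup-or-create behaviour comes from the `||=` operator on a Hash entry. The following is a minimal sketch of that pattern in isolation; `SketchGroup` is a hypothetical stand-in and not the real Groupie::Group class.

```ruby
# Hypothetical stand-in for Groupie::Group, used only for this illustration.
SketchGroup = Struct.new(:name)

groups = {}

# The first access stores a new object under the key...
first = (groups[:spam] ||= SketchGroup.new(:spam))
# ...and later accesses return that same stored object instead of building another.
second = (groups[:spam] ||= SketchGroup.new(:spam))

same_object = first.equal?(second)
```

Because the right-hand side is only evaluated when the key is missing (or holds nil/false), repeated calls with the same name keep returning the same group object.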

#add_word(word) ⇒ Object

Private method used by Groups to register known words with the Groupie.



# File 'lib/groupie.rb', line 119

def add_word(word)
  @known_words << word
end

#classify(entry, strategy = :sum) ⇒ Hash<Object, Float>

Classify a single word against all groups, returning the probability distribution.

Parameters:

  • entry (String)

    A word to be classified

  • strategy (Symbol) (defaults to: :sum)

    The strategy used to reduce each group's count to a score.

Returns:

  • (Hash<Object, Float>)

    Hash with <group, probability> pairings. Probabilities are always in 0.0..1.0 and add up to 1.0 (i.e. it’s a probability distribution). An empty Hash is returned when the total score is zero.

Raises:



# File 'lib/groupie.rb', line 61

def classify(entry, strategy = :sum)
  # Calculate default weight once outside of the loop
  default_weight = self.default_weight
  # Each group calculates the count, then reduces it to a score: <group name, score>
  per_group_score = @groups.transform_values do |group|
    apply_count_strategy(default_weight + group.count(entry), strategy)
  end
  # When we have no scores, we have no results, so abort early
  # Note that when smart_weight is enabled we always have a score.
  total_score = per_group_score.values.sum
  return {} if total_score.zero?

  # Final results must be within 0.0..1.0, so divide each score by the total score
  per_group_score.transform_values { |group_score| group_score.to_f / total_score }
end
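
The normalization step at the end of #classify can be illustrated on its own. This is a sketch under stated assumptions: `normalize_scores` is a hypothetical helper, the counts are invented, and the :sum strategy is assumed to leave each count unchanged (apply_count_strategy is not shown on this page).

```ruby
# Hypothetical helper mirroring the last lines of #classify:
# divide every score by the total so the results add up to 1.0.
def normalize_scores(scores)
  total = scores.values.sum
  # No scores at all means no result, as in the early return of #classify
  return {} if total.zero?

  scores.transform_values { |score| score.to_f / total }
end

# Invented per-group counts for a single word (smart_weight disabled)
counts = { spam: 3, ham: 1 }

distribution = normalize_scores(counts)
no_result = normalize_scores({ spam: 0, ham: 0 })
```

With these counts, `spam` receives three quarters of the probability mass and `ham` one quarter, while a word that no group has seen yields an empty Hash.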

#classify_text(words, strategy = :sum) ⇒ Hash<Object, Float>

Classify a text by taking the average of all word classifications.

Parameters:

  • words (Array<String>)

    List of words to be classified

  • strategy (Symbol) (defaults to: :sum)

Returns:

  • (Hash<Object, Float>)

    Hash with <group, score> pairings. Scores are always in 0.0..1.0

Raises:



# File 'lib/groupie.rb', line 45

def classify_text(words, strategy = :sum)
  words &= unique_words if strategy == :unique
  group_score_sums, hits = calculate_group_scores(words, strategy)

  group_score_sums.each.with_object({}) do |(group, sum), averages|
    averages[group] = sum / hits
  end
end
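
The averaging itself can be sketched without the library. The private calculate_group_scores method is not shown on this page, so the way sums and hits are gathered below is an assumption; the per-word distributions are invented values of the kind #classify returns.

```ruby
# Invented per-word results, shaped like the return value of #classify
per_word = [
  { spam: 1.0, ham: 0.0 },
  { spam: 0.5, ham: 0.5 }
]

# Assumption: sum each group's probability over all words that produced a result
group_score_sums = Hash.new(0.0)
per_word.each do |distribution|
  distribution.each { |group, score| group_score_sums[group] += score }
end
hits = per_word.size

# Same reduction as in #classify_text: divide each sum by the number of hits
averages = group_score_sums.each.with_object({}) do |(group, sum), result|
  result[group] = sum / hits
end
```

Since every per-word result is a probability distribution, the averaged scores also stay within 0.0..1.0.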

#default_weight ⇒ Float

Default weight is used when smart_weight is enabled. Each word’s count is increased by the default_weight value, which is the average frequency of each unique word we know about.

Example: if we have indexed 1000 total words, of which 500 were unique, the default_weight would be 1000 / 500 = 2.0.

Returns:

  • (Float)

    The default weight for all words



# File 'lib/groupie.rb', line 105

def default_weight
  # Default weight only applies when smart weight is enabled
  return 0.0 unless smart_weight

  # If we don't know any words, the weight is also zero
  return 0.0 unless @known_words.any?

  # Gather counts and calculate
  total_words = @groups.each_value.sum(&:total_word_count)
  total_unique_words = @known_words.count
  total_words / total_unique_words.to_f
end
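
The arithmetic from the example can be checked in plain Ruby. The values below are hypothetical stand-ins for Groupie's internal state (per-group total word counts and the size of the known-words set); the second half assumes the :sum strategy leaves counts unchanged.

```ruby
# Hypothetical state: 1000 indexed words across two groups, 500 of them unique
group_totals = { spam: 600, ham: 400 }
known_word_count = 500

# Average frequency of each unique word
default_weight = group_totals.values.sum / known_word_count.to_f

# A word no group has seen has a count of 0 everywhere, so with smart weight
# every group still scores default_weight and the result is an even split.
unseen_scores = group_totals.keys.to_h { |group| [group, default_weight + 0] }
total = unseen_scores.values.sum
unseen_distribution = unseen_scores.transform_values { |score| score / total }
```

This is why classification always yields a score once smart weight is enabled and at least one word is known: the default weight keeps the total above zero.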

#unique_words ⇒ Array<String>

Return the known words, excluding the 4th quartile of most popular words. Why do this? So the most common (and thus least meaningful) words are ignored and less common words gain more predictive power.

This is used by the :unique strategy of the classifier.

Returns:

  • (Array<String>)

    The words whose total count does not exceed the top-quartile frequency.


# File 'lib/groupie.rb', line 84

def unique_words
  # Iterate over all Groups and merge their <word, count> dictionaries into one
  total_count = @groups.inject({}) do |total, (_name, group)|
    total.merge!(group.word_counts) { |_key, o, n| o + n }
  end
  # Extract the word count that's at the top 75%
  top_quartile_index = [((total_count.size * 3) / 4) - 1, 1].max
  top_quartile_frequency = total_count.values.sort[top_quartile_index]
  # Throw out all words which have a count that's above this frequency
  total_count.reject! { |_word, count| count > top_quartile_frequency }
  total_count.keys
end
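
The quartile cut-off can be traced on a small dictionary. This sketch repeats the index and threshold arithmetic of #unique_words on invented counts; the `word_counts` Hash stands in for the merged per-group dictionaries.

```ruby
# Invented merged <word, count> dictionary (8 distinct words)
word_counts = {
  'the' => 40, 'and' => 30, 'ruby' => 3, 'gem' => 2,
  'token' => 2, 'yard' => 1, 'quartile' => 1, 'hash' => 1
}

# Index of the count sitting at the 75% mark of the sorted counts
top_quartile_index = [((word_counts.size * 3) / 4) - 1, 1].max
top_quartile_frequency = word_counts.values.sort[top_quartile_index]

# Keep only words whose count does not exceed that frequency
kept_words = word_counts.reject { |_word, count| count > top_quartile_frequency }.keys
```

Here the index is 5 and the sorted counts are 1, 1, 1, 2, 2, 3, 30, 40, so the threshold is 3 and only 'the' and 'and' are dropped.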