Module: ClassifierReborn::Hasher

Defined in:
lib/classifier-reborn/extensions/hasher.rb

Class Method Details

.word_hash(str, enable_stemmer = true, tokenizer: Tokenizer::Whitespace, token_filters: [TokenFilter::Stopword]) ⇒ Object

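Returns a Hash mapping each word, as a Symbol, to its Integer frequency in the document. The string is tokenized and passed through the token filters (stopword removal by default, plus stemming unless enable_stemmer is false); each remaining word is then interned and counted.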
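
To make the shape of the return value concrete, here is a minimal sketch of the counting step alone. It assumes a naive whitespace split and applies no stopword or stemming filters, so `count_words` is a hypothetical helper, not part of ClassifierReborn.

```ruby
# Hypothetical helper illustrating only the counting step of word_hash.
# It uses a plain whitespace split and no library tokenizers or filters.
def count_words(str)
  counts = Hash.new(0)          # missing keys default to 0
  str.split.each do |word|      # naive whitespace split (assumption)
    counts[word.intern] += 1    # keys are Symbols, values are Integers
  end
  counts
end

counts = count_words('the cat saw the dog')
counts[:the]   # looked up by Symbol, not by String
counts[:bird]  # a word that never occurred reads as 0 via the default
```

Because the hash is built with `Hash.new(0)`, looking up a word that was never seen returns 0 rather than nil, which is convenient when combining frequencies.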



# File 'lib/classifier-reborn/extensions/hasher.rb', line 19

def word_hash(str, enable_stemmer = true,
              tokenizer: Tokenizer::Whitespace,
              token_filters: [TokenFilter::Stopword])
  if token_filters.include?(TokenFilter::Stemmer)
    unless enable_stemmer
      token_filters.reject! do |token_filter|
        token_filter == TokenFilter::Stemmer
      end
    end
  else
    token_filters << TokenFilter::Stemmer if enable_stemmer
  end
  words = tokenizer.call(str)
  token_filters.each do |token_filter|
    words = token_filter.call(words)
  end
  d = Hash.new(0)
  words.each do |word|
    d[word.intern] += 1
  end
  d
end
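
The source above adjusts `token_filters` in place with `reject!` and `<<`, so an array passed in by the caller can change across calls. The sketch below mirrors only that branch, using hypothetical lambdas (`STOPWORD`, `STEMMER`) and a hypothetical `resolve_filters` helper as stand-ins for the library's filter modules.

```ruby
# Hypothetical stand-ins for TokenFilter::Stopword and TokenFilter::Stemmer.
STOPWORD = ->(words) { words.reject { |w| %w[the a of].include?(w) } }
STEMMER  = ->(words) { words.map { |w| w.sub(/s\z/, '') } } # crude, illustration only

# Mirrors the branch at the top of word_hash: the array is modified in place.
def resolve_filters(token_filters, enable_stemmer)
  if token_filters.include?(STEMMER)
    # Stemmer was requested in the list but disabled by the flag: drop it.
    token_filters.reject! { |f| f == STEMMER } unless enable_stemmer
  else
    # Stemmer is absent from the list but enabled by the flag: append it.
    token_filters << STEMMER if enable_stemmer
  end
  token_filters
end

shared = [STOPWORD]
resolve_filters(shared, true)   # the caller's array now also holds STEMMER

safe = [STOPWORD]
resolve_filters(safe.dup, true) # passing a copy leaves `safe` untouched
```

If a filter list is reused between calls, passing a copy (for example with `dup`) is one way to keep the original list unchanged.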