Module: ClassifierReborn::TokenFilter::Stopword

Defined in:
lib/classifier-reborn/extensions/token_filter/stopword.rb

Overview

This filter removes stopwords for the configured language from the given tokens.

Constant Summary

STOPWORDS_PATH =
[File.expand_path(File.dirname(__FILE__) + '/../../../../data/stopwords')]
STOPWORDS =

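A lazily loaded hash of stopword data, keyed by language. The file for a language is read on first access; a language with no file in any STOPWORDS_PATH directory yields an empty array.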

Hash.new do |hash, language|
  hash[language] = []

  STOPWORDS_PATH.each do |path|
    if File.exist?(File.join(path, language))
      hash[language] = Set.new File.read(File.join(path, language.to_s)).force_encoding('utf-8').split
      break
    end
  end

  hash[language]
end
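
The lazy loading above relies on the default block of Hash.new, which runs only when a missing key is looked up. The following is a minimal sketch of that pattern, not the library's own code: the temporary directory, the sample words, and the local variable names are assumptions made for illustration.

```ruby
require 'set'
require 'tmpdir'

# Hypothetical stand-in for the bundled data directory: one file per language.
demo_dir = Dir.mktmpdir('stopwords')
File.write(File.join(demo_dir, 'en'), "the and of\n")

search_paths = [demo_dir]

# The default block runs only on the first lookup of a missing key.
# It stores the result in the hash, so later lookups skip the file read.
stopwords = Hash.new do |hash, language|
  hash[language] = []

  search_paths.each do |path|
    file = File.join(path, language.to_s)
    if File.exist?(file)
      hash[language] = Set.new(File.read(file).force_encoding('utf-8').split)
      break
    end
  end

  hash[language]
end

stopwords.key?('en')            # false: nothing has been loaded yet
stopwords['en'].include?('the') # true: the file is read on first access
stopwords.key?('en')            # true: the result is now cached
stopwords['xx']                 # []: no file exists for this language
```

Because the result is stored under the key, each language file is read at most once per process.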

Class Method Summary

Class Method Details

.add_custom_stopword_path(path) ⇒ Object

Adds a custom directory to the front of STOPWORDS_PATH, so that user-created stopword files in it take precedence over the bundled ones.



# File 'lib/classifier-reborn/extensions/token_filter/stopword.rb', line 24

def add_custom_stopword_path(path)
  STOPWORDS_PATH.unshift(path)
end
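
Since the path is added with unshift and the lookup stops at the first directory that contains a file for the language, a custom directory shadows the bundled one. The sketch below illustrates that ordering with plain arrays and temporary directories; the directory names and word lists are hypothetical.

```ruby
require 'tmpdir'

# Hypothetical directories standing in for the bundled data and a user's files.
bundled_dir = Dir.mktmpdir('bundled')
custom_dir  = Dir.mktmpdir('custom')
File.write(File.join(bundled_dir, 'en'), "the and\n")
File.write(File.join(custom_dir, 'en'), "foo bar\n")

search_paths = [bundled_dir]
search_paths.unshift(custom_dir) # the same operation add_custom_stopword_path performs

# The first directory that contains a file for the language wins.
winner = search_paths.find { |dir| File.exist?(File.join(dir, 'en')) }
words  = File.read(File.join(winner, 'en')).split
# words is ["foo", "bar"], read from custom_dir
```

As far as the listing shows, a language that has already been loaded stays cached in STOPWORDS, so a path added afterwards would presumably not affect it.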

.call(tokens) ⇒ Object

Returns the tokens with stopwords removed. A token is dropped when it may be a stopword and is either at most two characters long or listed in the stopword data for the current language.



# File 'lib/classifier-reborn/extensions/token_filter/stopword.rb', line 16

def call(tokens)
  tokens.reject do |token|
    token.maybe_stopword? &&
      (token.length <= 2 || STOPWORDS[@language].include?(token))
  end
end
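
The rejection rule can be tried in isolation. This is a sketch under stated assumptions: demo_token is a hypothetical String subclass that only mimics maybe_stopword?, and the stopword set is made up; the library's real token class is defined elsewhere.

```ruby
require 'set'

# Hypothetical token type: a String that records whether it may be a stopword.
demo_token = Class.new(String) do
  def initialize(text, maybe_stopword: true)
    super(text)
    @maybe_stopword = maybe_stopword
  end

  def maybe_stopword?
    @maybe_stopword
  end
end

stopwords = Set.new(%w[the and]) # assumed stopword list for this demo

tokens = [
  demo_token.new('the'),                       # listed stopword
  demo_token.new('ox'),                        # two characters or fewer
  demo_token.new('quick'),                     # ordinary word
  demo_token.new('and', maybe_stopword: false) # listed, but exempt
]

# Same predicate as in call: both conditions must hold for a token to be dropped.
kept = tokens.reject do |token|
  token.maybe_stopword? &&
    (token.length <= 2 || stopwords.include?(token))
end
# kept contains "quick" and "and"
```

The last token survives because the length and list checks are only consulted when maybe_stopword? is true.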

.language=(language) ⇒ Object

Sets the language whose stopword list is used when filtering.



# File 'lib/classifier-reborn/extensions/token_filter/stopword.rb', line 43

def language=(language)
  @language = language
end