Module: ClassifierReborn::Tokenizer::Whitespace

Defined in:
lib/classifier-reborn/extensions/tokenizer/whitespace.rb

Overview

Tokenizes the given input into whitespace-separated terms. It is mainly intended for text in languages that put a space between words, such as English and French.

Class Method Summary

Class Method Details

.call(str) ⇒ Object



# File 'lib/classifier-reborn/extensions/tokenizer/whitespace.rb', line 16

def call(str)
  tokens = str.gsub(/[^\p{WORD}\s]/, '').downcase.split.collect do |word|
    Token.new(word, stemmable: true, maybe_stopword: true)
  end
  symbol_tokens = str.scan(/[^\s\p{WORD}]/).collect do |word|
    Token.new(word, stemmable: false, maybe_stopword: false)
  end
  tokens += symbol_tokens
  tokens
end
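
The following is a minimal, self-contained sketch of the same two-pass logic, meant only to show what kind of output the method produces. It does not load the gem: DemoToken is a hypothetical stand-in for the library's Token class, and demo_whitespace_tokenize is an illustrative name, not part of ClassifierReborn.

```ruby
# Hypothetical stand-in for the library's Token class (assumption, illustration only).
DemoToken = Struct.new(:text, :stemmable, :maybe_stopword)

# Illustrative re-implementation of the two passes shown in the listing above.
def demo_whitespace_tokenize(str)
  # Pass 1: delete every character that is neither a word character nor
  # whitespace, lowercase the remainder, and split on runs of whitespace.
  words = str.gsub(/[^\p{WORD}\s]/, '').downcase.split.map do |word|
    DemoToken.new(word, true, true)
  end

  # Pass 2: collect each non-word, non-whitespace character as its own
  # token, in its original form.
  symbols = str.scan(/[^\s\p{WORD}]/).map do |symbol|
    DemoToken.new(symbol, false, false)
  end

  # Symbol tokens are appended after all word tokens.
  words + symbols
end

tokens = demo_whitespace_tokenize("Hello, World! It's fine.")
puts tokens.map(&:text).inspect
# prints ["hello", "world", "its", "fine", ",", "!", "'", "."]
```

Two consequences are visible in this sketch: punctuation inside a word is removed before splitting, so "It's" becomes the single term "its", and symbol tokens come after all word tokens rather than at their original positions.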