Module: ClassifierReborn::Tokenizer::Whitespace
Defined in: lib/classifier-reborn/extensions/tokenizer/whitespace.rb
Overview
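Tokenizes the given input into whitespace-separated terms. It is mainly intended for languages that put a space between words, such as English and French. Characters that are neither word characters nor whitespace (punctuation and other symbols) are returned as separate single-character tokens.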
Class Method Summary
Class Method Details
.call(str) ⇒ Object
# File 'lib/classifier-reborn/extensions/tokenizer/whitespace.rb', line 16

def call(str)
  tokens = str.gsub(/[^\p{WORD}\s]/, '').downcase.split.collect do |word|
    Token.new(word, stemmable: true, maybe_stopword: true)
  end
  symbol_tokens = str.scan(/[^\s\p{WORD}]/).collect do |word|
    Token.new(word, stemmable: false, maybe_stopword: false)
  end
  tokens += symbol_tokens
  tokens
end
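To see what the two passes in the listing above produce, here is a minimal standalone sketch that mirrors the same regular expressions without loading the gem. `SketchToken` and `sketch_whitespace_tokenize` are hypothetical names introduced only for this illustration; `SketchToken` is a stand-in for the library's `Token` class, on the assumption that `Token` wraps the word text and carries the `stemmable` and `maybe_stopword` flags.

```ruby
# Hypothetical stand-in for the library's Token class (assumption: the real
# class holds the word text plus these two flags).
SketchToken = Struct.new(:text, :stemmable, :maybe_stopword)

# Hypothetical helper that mirrors the two passes of Whitespace.call.
def sketch_whitespace_tokenize(str)
  # Pass 1: delete every character that is neither a word character nor
  # whitespace, lowercase the result, and split on runs of whitespace.
  words = str.gsub(/[^\p{WORD}\s]/, '').downcase.split.map do |word|
    SketchToken.new(word, true, true)
  end

  # Pass 2: collect each non-word, non-whitespace character as its own token.
  symbols = str.scan(/[^\s\p{WORD}]/).map do |sym|
    SketchToken.new(sym, false, false)
  end

  # Symbol tokens are appended after all word tokens.
  words + symbols
end

tokens = sketch_whitespace_tokenize("Hello, world! It's fine.")
puts tokens.map(&:text).inspect
# prints ["hello", "world", "its", "fine", ",", "!", "'", "."]
```

Two consequences of this approach are visible in the output: the apostrophe is stripped before splitting, so "It's" becomes the single word token "its", and symbol tokens are grouped at the end of the result rather than kept at their original positions in the input.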