Class: Boilerpipe::UnicodeTokenizer

Inherits: Object

Defined in: lib/boilerpipe/util/unicode_tokenizer.rb

Constant Summary

INVISIBLE_SEPARATOR = "\u2063"

WORD_BOUNDARY = Regexp.new('\b')

NOT_WORD_BOUNDARY = Regexp.new("[\u2063]*([\\\"'\\.,\\!\\@\\-\\:\\;\\$\\?\\(\\)\/])[\u2063]*")

Class Method Summary

.tokenize(text) ⇒ Object

Splits text into tokens on word boundaries and whitespace, keeping the punctuation characters matched by NOT_WORD_BOUNDARY attached to the adjacent word.

Class Method Details

.tokenize(text) ⇒ `Object`

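Tokenizes text in five steps:

1. Replace every word boundary with the invisible separator (U+2063).
2. Strip the invisible separators on either side of the punctuation characters matched by NOT_WORD_BOUNDARY, so that these characters do not split a token.
3. Replace each run of spaces or invisible separators with a single space.
4. Trim leading and trailing whitespace.
5. Split the result on spaces.

The result is an array of token strings.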

# File 'lib/boilerpipe/util/unicode_tokenizer.rb', line 13

def self.tokenize(text)
  text.gsub(WORD_BOUNDARY, INVISIBLE_SEPARATOR)
    .gsub(NOT_WORD_BOUNDARY, '\1')
    .gsub(/[ \u2063]+/, ' ')
    .strip
    .split(/[ ]+/)
end
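
The following is a minimal sketch of how the method behaves on a few ASCII inputs. To keep it runnable without the gem installed, it copies the constants and method body shown above into a stand-in module; the name TokenizerSketch is hypothetical and is not part of Boilerpipe.

```ruby
# Hypothetical stand-in that mirrors the constants and method documented above.
module TokenizerSketch
  INVISIBLE_SEPARATOR = "\u2063"
  WORD_BOUNDARY = Regexp.new('\b')
  NOT_WORD_BOUNDARY = Regexp.new("[\u2063]*([\\\"'\\.,\\!\\@\\-\\:\\;\\$\\?\\(\\)\/])[\u2063]*")

  def self.tokenize(text)
    text.gsub(WORD_BOUNDARY, INVISIBLE_SEPARATOR) # mark every word boundary
      .gsub(NOT_WORD_BOUNDARY, '\1')              # unmark boundaries around listed punctuation
      .gsub(/[ \u2063]+/, ' ')                    # collapse separators and spaces
      .strip
      .split(/[ ]+/)
  end
end

# Listed punctuation (comma, exclamation mark) stays attached to its word.
p TokenizerSketch.tokenize('Hello, world!')    # => ["Hello,", "world!"]

# The hyphen is in the punctuation list, so hyphenated words stay whole.
p TokenizerSketch.tokenize('state-of-the-art') # => ["state-of-the-art"]

# '+' is not in the list, so the boundaries around it split the input.
p TokenizerSketch.tokenize('x+y')              # => ["x", "+", "y"]
```

The last case shows why step 2 matters: only characters in the NOT_WORD_BOUNDARY character class are protected from splitting, and any other non-word character becomes its own token.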
