Class: Boilerpipe::UnicodeTokenizer
- Inherits:
- Object
- Object
- Boilerpipe::UnicodeTokenizer
- Defined in:
- lib/boilerpipe/util/unicode_tokenizer.rb
Constant Summary
- INVISIBLE_SEPARATOR = "\u2063"
- WORD_BOUNDARY = Regexp.new('\b')
- NOT_WORD_BOUNDARY = Regexp.new("[\u2063]*([\\\"'\\.,\\!\\@\\-\\:\\;\\$\\?\\(\\)\/])[\u2063]*")
Class Method Summary
- .tokenize(text) ⇒ Object
Tokenizes text: replaces word boundaries with the invisible separator, strips the invisible separators around punctuation (non-word boundaries), collapses runs of spaces or invisible separators into a single space, trims the result, and splits it into words on spaces.
Class Method Details
.tokenize(text) ⇒ Object
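Tokenizes text in five steps:
- Replace each word boundary with the invisible separator (U+2063).
- Strip the invisible separators adjacent to the punctuation characters matched by NOT_WORD_BOUNDARY.
- Replace each run of spaces or invisible separators with a single space.
- Trim leading and trailing whitespace.
- Split into words on spaces.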
    # File 'lib/boilerpipe/util/unicode_tokenizer.rb', line 13
    def self.tokenize(text)
      text.gsub(WORD_BOUNDARY, INVISIBLE_SEPARATOR)
          .gsub(NOT_WORD_BOUNDARY, '\1')
          .gsub(/[ \u2063]+/, ' ')
          .strip
          .split(/[ ]+/)
    end
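To make the effect of each step easier to see, here is a minimal, self-contained sketch that copies the constants and the pipeline shown above into a stand-in module. The module name `TokenizerSketch` and the `trace` helper are hypothetical and exist only for illustration; they are not part of Boilerpipe. The sketch suggests that punctuation listed in NOT_WORD_BOUNDARY stays attached to the neighboring word, while other symbols (such as `+`) end up as separate tokens.

```ruby
# Illustrative stand-in for Boilerpipe::UnicodeTokenizer.
# TokenizerSketch and .trace are hypothetical names; the constants and the
# chain of substitutions are copied from the listing above.
module TokenizerSketch
  INVISIBLE_SEPARATOR = "\u2063"
  WORD_BOUNDARY = Regexp.new('\b')
  NOT_WORD_BOUNDARY = Regexp.new("[\u2063]*([\\\"'\\.,\\!\\@\\-\\:\\;\\$\\?\\(\\)\/])[\u2063]*")

  # Returns the intermediate string after each step before the final split.
  def self.trace(text)
    marked = text.gsub(WORD_BOUNDARY, INVISIBLE_SEPARATOR) # mark word boundaries
    glued  = marked.gsub(NOT_WORD_BOUNDARY, '\1')          # drop separators around punctuation
    spaced = glued.gsub(/[ \u2063]+/, ' ')                 # collapse spaces and separators
    [marked, glued, spaced, spaced.strip]
  end

  # Same result as the pipeline above: trim, then split on spaces.
  def self.tokenize(text)
    trace(text).last.split(/[ ]+/)
  end
end

p TokenizerSketch.tokenize('Hello, world!') # => ["Hello,", "world!"]
p TokenizerSketch.tokenize('well-known')    # => ["well-known"]
p TokenizerSketch.tokenize('a+b')           # => ["a", "+", "b"]
```

With the gem itself loaded, the equivalent call would presumably be `Boilerpipe::UnicodeTokenizer.tokenize('Hello, world!')`.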