Class: Treat::Workers::Processors::Tokenizers::Punkt
- Inherits: Object
- Defined in: lib/treat/workers/processors/tokenizers/punkt.rb
Overview
Tokenization script from the ‘punkt-segmenter’ Ruby gem.
Authors: Willy ([email protected]), Steven Bird ([email protected]), Edward Loper ([email protected]), Joel Nothman ([email protected]).
License: Apache License v2.
Constant Summary
- SentEndChars = ['.', '?', '!']
- ReSentEndChars = /[.?!]/
- InternalPunctuation = [',', ':', ';']
- ReBoundaryRealignment = /^["\')\]}]+?(?:\s+|(?=--)|$)/m
- ReWordStart = /[^\(\"\`{\[:;&\#\*@\)}\]\-,]/
- ReNonWordChars = /(?:[?!)\";}\]\*:@\'\({\[])/
- ReMultiCharPunct = /(?:\-{2,}|\.{2,}|(?:\.\s){2,}\.)/
- ReWordTokenizer = /#{ReMultiCharPunct}|(?=#{ReWordStart})\S+?(?=\s|$|#{ReNonWordChars}|#{ReMultiCharPunct}|,(?=$|\s|#{ReNonWordChars}|#{ReMultiCharPunct}))|\S/
- RePeriodContext = /\S*#{ReSentEndChars}(?=(?<after_tok>#{ReNonWordChars}|\s+(?<next_tok>\S+)))/
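To make the interplay of these patterns more concrete, the following sketch copies three of the constants into a standalone script and scans two sample strings with ReWordTokenizer. The sample strings are illustrative assumptions, not part of the library. Note that a trailing period stays attached to its word at this stage.

```ruby
# Constants copied verbatim from the summary above so the script runs without Treat.
ReWordStart      = /[^\(\"\`{\[:;&\#\*@\)}\]\-,]/
ReNonWordChars   = /(?:[?!)\";}\]\*:@\'\({\[])/
ReMultiCharPunct = /(?:\-{2,}|\.{2,}|(?:\.\s){2,}\.)/
ReWordTokenizer  = /#{ReMultiCharPunct}|(?=#{ReWordStart})\S+?(?=\s|$|#{ReNonWordChars}|#{ReMultiCharPunct}|,(?=$|\s|#{ReNonWordChars}|#{ReMultiCharPunct}))|\S/

# The pattern has no capturing groups, so String#scan returns plain strings.
first  = "Hello, world.".scan(ReWordTokenizer)
second = "Wait -- really?".scan(ReWordTokenizer)

p first   # => ["Hello", ",", "world."]
p second  # => ["Wait", "--", "really", "?"]
```

The comma is split off because it is excluded by ReWordStart, "--" is matched as a unit by ReMultiCharPunct, and "?" is split off because it appears in ReNonWordChars.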
Class Method Summary
- .tokenize(entity, options = {}) ⇒ Object
Perform tokenization of the entity and add the resulting tokens as its children.
Class Method Details
.tokenize(entity, options = {}) ⇒ Object
Perform tokenization of the entity and add the resulting tokens as its children.
Options: none.
# File 'lib/treat/workers/processors/tokenizers/punkt.rb', line 23

def self.tokenize(entity, options = {})
  entity.check_hasnt_children
  s = entity.to_s
  s.scan(ReWordTokenizer).each do |token|
    if SentEndChars.include?(token[-1])
      entity << Treat::Entities::Token.from_string(token[0..-2])
      entity << Treat::Entities::Token.from_string(token[-1..-1])
    else
      entity << Treat::Entities::Token.from_string(token)
    end
  end
end
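The post-processing step in the method above can be tried in isolation with plain strings. This is a minimal sketch, not the library's API: `split_sentence_end` is a hypothetical helper, and it returns an array of strings instead of appending Treat::Entities::Token objects to an entity.

```ruby
# Same values as the SentEndChars constant in the summary; renamed to avoid clashes.
SENT_END_CHARS = ['.', '?', '!']

# Hypothetical helper: takes raw tokens (as produced by scanning with
# ReWordTokenizer) and detaches a trailing sentence-ending character.
def split_sentence_end(raw_tokens)
  raw_tokens.flat_map do |token|
    if SENT_END_CHARS.include?(token[-1])
      [token[0..-2], token[-1..-1]]  # word part, then the final character
    else
      [token]
    end
  end
end

p split_sentence_end(['Hello', ',', 'world.'])  # => ["Hello", ",", "world", "."]
p split_sentence_end(['really', '?'])           # => ["really", "", "?"]
```

In this sketch, a token that consists only of a sentence-ending character yields an empty string before the character itself. How Treat::Entities::Token.from_string handles an empty string is not shown on this page.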