Class: Treat::Workers::Processors::Tokenizers::Punkt

Inherits:
Object
Defined in:
lib/treat/workers/processors/tokenizers/punkt.rb

Overview

Tokenization script from the ‘punkt-segmenter’ Ruby gem.

Authors: Willy ([email protected]), Steven Bird ([email protected]), Edward Loper ([email protected]), Joel Nothman ([email protected]). License: Apache License v2.

Constant Summary

SentEndChars =
['.', '?', '!']
ReSentEndChars =
/[.?!]/
InternalPunctuation =
[',', ':', ';']
ReBoundaryRealignment =
/^["\')\]}]+?(?:\s+|(?=--)|$)/m
ReWordStart =
/[^\(\"\`{\[:;&\#\*@\)}\]\-,]/
ReNonWordChars =
/(?:[?!)\";}\]\*:@\'\({\[])/
ReMultiCharPunct =
/(?:\-{2,}|\.{2,}|(?:\.\s){2,}\.)/
ReWordTokenizer =
/#{ReMultiCharPunct}|(?=#{ReWordStart})\S+?(?=\s|$|#{ReNonWordChars}|#{ReMultiCharPunct}|,(?=$|\s|#{ReNonWordChars}|#{ReMultiCharPunct}))|\S/
RePeriodContext =
/\S*#{ReSentEndChars}(?=(?<after_tok>#{ReNonWordChars}|\s+(?<next_tok>\S+)))/
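
As a rough illustration of how these patterns combine, the sketch below copies ReWordStart, ReNonWordChars, ReMultiCharPunct and ReWordTokenizer from the list above into a standalone script and scans two sentences. The sample strings and the `samples` variable are made up for this example, and the script does not load Treat.

```ruby
# Constants copied verbatim from the summary above (Treat is not required).
ReWordStart      = /[^\(\"\`{\[:;&\#\*@\)}\]\-,]/
ReNonWordChars   = /(?:[?!)\";}\]\*:@\'\({\[])/
ReMultiCharPunct = /(?:\-{2,}|\.{2,}|(?:\.\s){2,}\.)/
ReWordTokenizer  = /#{ReMultiCharPunct}|(?=#{ReWordStart})\S+?(?=\s|$|#{ReNonWordChars}|#{ReMultiCharPunct}|,(?=$|\s|#{ReNonWordChars}|#{ReMultiCharPunct}))|\S/

# Hypothetical sample inputs, chosen only for illustration.
samples = ["Hello, world! It works.", "Wait... what?"]

samples.each do |text|
  # The pattern has no capture groups, so String#scan returns plain strings.
  p text.scan(ReWordTokenizer)
end
# ["Hello", ",", "world", "!", "It", "works."]
# ["Wait", "...", "what", "?"]
```

At this stage a sentence-final period stays attached to the preceding word ("works."), while "!" and "?" are already separated because they belong to ReNonWordChars.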

Class Method Summary

Class Method Details

.tokenize(entity, options = {}) ⇒ Object

Perform tokenization of the entity and add the resulting tokens as its children.

Options: none.



# File 'lib/treat/workers/processors/tokenizers/punkt.rb', line 23

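def self.tokenize(entity, options = {})
  # Guard: the entity is expected to have no children yet.
  entity.check_hasnt_children

  s = entity.to_s

  s.scan(ReWordTokenizer).each do |token|
    if SentEndChars.include?(token[-1])
      # Split a trailing '.', '?' or '!' into a token of its own.
      # For a one-character token such as "!", token[0..-2] is "".
      entity << Treat::Entities::Token.from_string(token[0..-2])
      entity << Treat::Entities::Token.from_string(token[-1..-1])
    else
      entity << Treat::Entities::Token.from_string(token)
    end
  end
end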
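
The splitting branch of the method can be tried out without Treat. The sketch below is a minimal approximation that works on plain strings instead of Treat::Entities::Token objects. The helper name `split_sentence_end` and the constant `SENT_END_CHARS` are hypothetical and not part of the library.

```ruby
# Same characters as SentEndChars in the constant summary.
SENT_END_CHARS = ['.', '?', '!'].freeze

# Hypothetical helper: mirrors the if/else branch of .tokenize,
# but returns an array of strings instead of adding Token children.
def split_sentence_end(raw_tokens)
  raw_tokens.flat_map do |token|
    if SENT_END_CHARS.include?(token[-1])
      # Everything before the last character, then the last character.
      [token[0..-2], token[-1..-1]]
    else
      [token]
    end
  end
end

p split_sentence_end(['It', 'works.'])  # ["It", "works", "."]
p split_sentence_end(['!'])             # ["", "!"]
p split_sentence_end(['...'])           # ["..", "."]
```

The last two calls show edge cases of this branch: a token that consists only of a sentence-ending character yields an empty string before it, and an ellipsis loses its final period. How Treat::Entities::Token.from_string handles an empty string is not shown in this listing.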