Class: Treat::Workers::Processors::Tokenizers::Stanford
- Inherits: Object
  - Object
  - Treat::Workers::Processors::Tokenizers::Stanford
- Defined in: lib/treat/workers/processors/tokenizers/stanford.rb
Overview
Tokenization provided by the Stanford Penn Treebank (PTB) style tokenizer. Most punctuation is split from adjoining words; verb contractions and the Anglo-Saxon genitive of nouns are split into their component morphemes, and each morpheme is tagged separately. N.B. Contrary to standard PTB tokenization, double quotes (") are NOT changed to doubled single forward and backward quotes (`` and '') by default.
Constant Summary
- DefaultOptions = { directional_quotes: false, escape_characters: false }
  Default options for the tokenizer.
- @@tokenizer = nil
  Hold one instance of the tokenizer.
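Because the defaults are merged with whatever the caller passes, a caller only has to name the options it wants to change. A minimal sketch of that merge using plain Ruby hashes follows; the names `DEFAULT_OPTIONS` and `effective_options` are illustrative stand-ins and are not part of Treat.

```ruby
# Stand-in for the class's DefaultOptions constant (same keys and values).
DEFAULT_OPTIONS = { directional_quotes: false, escape_characters: false }.freeze

# Hypothetical helper: caller-supplied keys override the defaults,
# and any key the caller omits keeps its default value.
def effective_options(overrides = {})
  DEFAULT_OPTIONS.merge(overrides)
end

p effective_options                            # both options stay off
p effective_options(directional_quotes: true)  # only one option is changed
```

Hash#merge returns a new hash, so the frozen defaults are never modified by a call.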
Class Method Summary
- .add_tokens(entity, tokens, options) ⇒ Object
  Add the tokens to the entity.
- .tokenize(entity, options = {}) ⇒ Object
  Perform tokenization of the entity and add the resulting tokens as its children.
Class Method Details
.add_tokens(entity, tokens, options) ⇒ Object
Add the tokens to the entity.
# File 'lib/treat/workers/processors/tokenizers/stanford.rb', line 37
def self.add_tokens(entity, tokens, options)
  tokens.each do |token|
    val = token.value
    # Unless escaped forms were requested, map each escape sequence
    # back to its literal character.
    unless options[:escape_characters]
      Treat..ptb.escape_characters.each do |key, value|
        val.gsub!(value, key)
      end
    end
    # Unless directional quotes were requested, collapse `` and ''
    # back into a plain double quote.
    unless options[:directional_quotes]
      val.gsub!(/``/, '"')
      val.gsub!(/''/, '"')
    end
    entity << Treat::Entities::Token.from_string(val)
  end
end
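The two clean-up steps in add_tokens can be tried in isolation on plain strings. The sketch below is a stand-alone approximation, not Treat's code: `PTB_ESCAPES`, `DEFAULTS` and `normalize_token` are hypothetical names, and the escape map holds only the two PTB bracket escapes as an assumed sample of what the real configuration contains.

```ruby
# Assumed sample escape map (literal character => PTB escape sequence).
# The real map lives in Treat's configuration and may differ.
PTB_ESCAPES = {
  '(' => '-LRB-',
  ')' => '-RRB-'
}.freeze

# Same keys and values as the class's DefaultOptions.
DEFAULTS = { directional_quotes: false, escape_characters: false }.freeze

# Hypothetical helper applying both clean-up steps to one token string.
def normalize_token(raw, options = {})
  options = DEFAULTS.merge(options)
  val = raw.dup # work on a copy so the caller's string is untouched

  # Step 1: turn escape sequences back into literal characters.
  unless options[:escape_characters]
    PTB_ESCAPES.each { |literal, escaped| val.gsub!(escaped, literal) }
  end

  # Step 2: collapse PTB directional quotes into a plain double quote.
  unless options[:directional_quotes]
    val.gsub!(/``/, '"')
    val.gsub!(/''/, '"')
  end

  val
end

puts normalize_token('-LRB-')
puts normalize_token('``')
puts normalize_token("''", directional_quotes: true)
```

With both options left at their defaults, the helper returns literal characters and plain double quotes; turning an option on simply skips the corresponding step.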
.tokenize(entity, options = {}) ⇒ Object
Perform tokenization of the entity and add the resulting tokens as its children.
Options:
- (Boolean) :directional_quotes => Whether to attempt to get correct directional quotes, replacing "…" by ``…''. Off by default.
# File 'lib/treat/workers/processors/tokenizers/stanford.rb', line 26
def self.tokenize(entity, options = {})
  Treat::Loaders::Stanford.load
  # Caller-supplied options override the defaults.
  options = DefaultOptions.merge(options)
  # Load the tokenizer once and reuse it on later calls.
  @@tokenizer ||= StanfordCoreNLP.load(:tokenize)
  entity.check_hasnt_children
  text = ::StanfordCoreNLP::Annotation.new(entity.to_s)
  @@tokenizer.annotate(text)
  add_tokens(entity, text.get(:tokens), options)
end
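The `@@tokenizer ||= …` line means the tokenizer is loaded on the first call and reused afterwards. The sketch below illustrates that lazy, shared initialization with stand-in classes; `FakePipeline` and `LazyTokenizer` are hypothetical, and the whitespace split is only a placeholder for real PTB tokenization.

```ruby
# Hypothetical stand-in for an expensive pipeline; counts how often it is built.
class FakePipeline
  @builds = 0
  class << self
    attr_accessor :builds
  end

  def initialize
    self.class.builds += 1
  end

  # Placeholder "tokenization": split on whitespace only.
  def annotate(text)
    text.split
  end
end

class LazyTokenizer
  @@pipeline = nil # shared across calls, like @@tokenizer

  def self.tokenize(text)
    @@pipeline ||= FakePipeline.new # built on first use only
    @@pipeline.annotate(text)
  end
end

LazyTokenizer.tokenize('first call builds the pipeline')
LazyTokenizer.tokenize('second call reuses it')
puts FakePipeline.builds
```

One consequence of this pattern is that the instance is shared by every caller for the life of the process, so any per-call configuration has to be applied outside the cached object.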