Class: Treat::Workers::Processors::Tokenizers::PTB
- Inherits: Object
  - Object
  - Treat::Workers::Processors::Tokenizers::PTB
- Defined in: lib/treat/workers/processors/tokenizers/ptb.rb
Overview
Tokenization based on the tokenizer developed by Robert MacIntyre in 1995 for the Penn Treebank project. This tokenizer mostly follows the conventions used by the Penn Treebank. N.B. Contrary to standard PTB tokenization, double quotes (") are NOT changed to doubled single forward and backward quotes (`` and '') by default.
Authors: Utiyama Masao. License: Ruby License.
Constant Summary
- DefaultOptions = { directional_quotes: false }
  Default options for the tokenizer.
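The effect of the default can be illustrated with a small standalone sketch. It does not call the library: `DEFAULT_OPTIONS` mirrors the constant above, and `normalize_quotes` is a hypothetical helper that reproduces only the final quote-handling step, applied here to an already-split token list rather than to the working string.

```ruby
# Mirrors the documented DefaultOptions constant (copied here for illustration).
DEFAULT_OPTIONS = { directional_quotes: false }.freeze

# Hypothetical helper: unless :directional_quotes is set, turn the PTB-style
# `` and '' tokens back into plain double quotes.
def normalize_quotes(tokens, options = {})
  options = DEFAULT_OPTIONS.merge(options) # caller keys override the defaults
  return tokens if options[:directional_quotes]
  tokens.map { |token| token.gsub(/``|''/, '"') }
end

ptb_tokens = ["``", "Hi", "''"]
normalize_quotes(ptb_tokens)                           # => ["\"", "Hi", "\""]
normalize_quotes(ptb_tokens, directional_quotes: true) # => ["``", "Hi", "''"]
```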
Class Method Summary
- .split(string, options) ⇒ Object
- .tokenize(entity, options = {}) ⇒ Object
  Perform tokenization of the entity and add the resulting tokens as its children.
Class Method Details
.split(string, options) ⇒ Object
# File 'lib/treat/workers/processors/tokenizers/ptb.rb', line 41
def self.split(string, options)
  s = " " + string + " "
  # Map curly quotes to ASCII and PTB-style quotes.
  s.gsub!(/‘/,"'")
  s.gsub!(/’/,"'")
  s.gsub!(/“/,"``")
  s.gsub!(/”/,"''")
  s.gsub!(/\s+/," ")
  s.gsub!(/(\s+)''/,'\1"')
  s.gsub!(/(\s+)``/,'\1"')
  s.gsub!(/''(\s+)/,'"\1')
  s.gsub!(/``(\s+)/,'"\1')
  s.gsub!(/ (['`]+)([^0-9].+) /,' \1 \2 ')
  s.gsub!(/([ (\[{<])"/,'\1 `` ')
  # Separate punctuation, brackets and dashes from adjacent words.
  s.gsub!(/\.\.\./,' ... ')
  s.gsub!(/[,;:@\#$%&]/,' \& ')
  s.gsub!(/([^.])([.])([\])}>"']*)[ ]*$/,'\1 \2\3 ')
  s.gsub!(/[?!]/,' \& ')
  s.gsub!(/[\]\[(){}<>]/,' \& ')
  s.gsub!(/--/,' -- ')
  s.sub!(/$/,' ')
  s.sub!(/^/,' ')
  s.gsub!(/"/,' \'\' ')
  # Split possessives and contractions.
  s.gsub!(/([^'])' /,'\1 \' ')
  s.gsub!(/'([sSmMdD]) /,' \'\1 ')
  s.gsub!(/'ll /,' \'ll ')
  s.gsub!(/'re /,' \'re ')
  s.gsub!(/'ve /,' \'ve ')
  s.gsub!(/n't /,' n\'t ')
  s.gsub!(/'LL /,' \'LL ')
  s.gsub!(/'RE /,' \'RE ')
  s.gsub!(/'VE /,' \'VE ')
  s.gsub!(/N'T /,' N\'T ')
  s.gsub!(/ ([Cc])annot /,' \1an not ')
  s.gsub!(/ ([Dd])'ye /,' \1\' ye ')
  s.gsub!(/ ([Gg])imme /,' \1im me ')
  s.gsub!(/ ([Gg])onna /,' \1on na ')
  s.gsub!(/ ([Gg])otta /,' \1ot ta ')
  s.gsub!(/ ([Ll])emme /,' \1em me ')
  s.gsub!(/ ([Mm])ore'n /,' \1ore \'n ')
  s.gsub!(/ '([Tt])is /,' \'\1 is ')
  s.gsub!(/ '([Tt])was /,' \'\1 was ')
  s.gsub!(/ ([Ww])anna /,' \1an na ')
  # Rejoin digit groups that the comma rule split apart.
  while s.sub!(/(\s)([0-9]+) , ([0-9]+)(\s)/, '\1\2,\3\4'); end
  s.gsub!(/\//, ' / ')
  s.gsub!(/\s+/,' ')
  s.strip!
  # Remove directional quotes.
  unless options[:directional_quotes]
    s.gsub!(/``/,'"')
    s.gsub!(/''/,'"')
  end
  s.split(/\s+/)
end
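To see how the substitution rules combine, here is a reduced, standalone sketch that applies only six of the rules from the listing above (punctuation, sentence-final period, `?`/`!`, `'s`-style clitics, `n't`, and `cannot`). `mini_ptb_split` is a hypothetical name, not part of the library, and its output will differ from `.split` on inputs that need the omitted rules.

```ruby
# Hypothetical reduced tokenizer: a subset of the rules shown in .split.
def mini_ptb_split(string)
  s = " " + string + " "                                # pad so rules can anchor on spaces
  s.gsub!(/[,;:@\#$%&]/, ' \& ')                        # isolate punctuation marks
  s.gsub!(/([^.])([.])([\])}>"']*)[ ]*$/, '\1 \2\3 ')   # split off the final period
  s.gsub!(/[?!]/, ' \& ')                               # isolate ? and !
  s.gsub!(/'([sSmMdD]) /, ' \'\1 ')                     # John's -> John 's
  s.gsub!(/n't /, ' n\'t ')                             # can't -> ca n't
  s.gsub!(/ ([Cc])annot /, ' \1an not ')                # cannot -> can not
  s.split                                               # split on runs of whitespace
end

mini_ptb_split("I can't go, but she cannot wait.")
# => ["I", "ca", "n't", "go", ",", "but", "she", "can", "not", "wait", "."]
mini_ptb_split("John's dog barked!")
# => ["John", "'s", "dog", "barked", "!"]
```

The order of the rules matters: punctuation is padded with spaces first, so later rules that require a trailing space (such as `n't `) can still match a word that was originally followed by a comma.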
.tokenize(entity, options = {}) ⇒ Object
Perform tokenization of the entity and add the resulting tokens as its children.
Options:
- :directional_quotes (Boolean): whether to replace double quotes with `` and '' or not.
# File 'lib/treat/workers/processors/tokenizers/ptb.rb', line 25
def self.tokenize(entity, options = {})
  options = DefaultOptions.merge(options)
  entity.check_hasnt_children
  if entity.has_children?
    raise Treat::Exception,
      "Cannot tokenize an #{entity.class} " +
      "that already has children."
  end
  chunks = split(entity.to_s, options)
  chunks.each do |chunk|
    next if chunk =~ /([[:space:]]+)/
    entity << Treat::Entities::Token.
      from_string(chunk)
  end
end
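The control flow of `.tokenize` (merge options, refuse entities that already have children, split, append each chunk) can be sketched without the library. Everything below is a stand-in chosen for illustration: `DemoEntity`, `DEMO_DEFAULTS`, and `demo_tokenize` are hypothetical, plain strings take the place of `Treat::Entities::Token` objects, `ArgumentError` takes the place of `Treat::Exception`, and the splitter is passed in as a block.

```ruby
# Hypothetical stand-in for a Treat entity; not the library's class.
class DemoEntity
  attr_reader :children

  def initialize(text)
    @text = text
    @children = []
  end

  def to_s
    @text
  end

  def has_children?
    !@children.empty?
  end

  def <<(child)
    @children << child
    self
  end
end

DEMO_DEFAULTS = { directional_quotes: false }.freeze

# Same sequence of steps as .tokenize, with the splitter supplied by the caller.
def demo_tokenize(entity, options = {}, &splitter)
  options = DEMO_DEFAULTS.merge(options)
  if entity.has_children?
    raise ArgumentError, "Cannot tokenize a #{entity.class} that already has children."
  end
  splitter.call(entity.to_s, options).each do |chunk|
    next if chunk =~ /[[:space:]]+/ # skip any chunk that contains whitespace
    entity << chunk
  end
  entity
end

sentence = DemoEntity.new("Hello , world !")
demo_tokenize(sentence) { |text, _options| text.split }
sentence.children # => ["Hello", ",", "world", "!"]
```

Calling `demo_tokenize` on `sentence` a second time raises, since the entity now has children; this mirrors the guard in the method above.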