Class: SRX::English::WordSplitter
- Inherits: Object
- Includes: Enumerable
- Defined in: lib/srx/english/word_splitter.rb
Constant Summary
- SPLIT_RULES =
{
  :word => "\\p{Alpha}\\p{Word}*",
  :number => "\\p{Digit}+(?:[:., _/-]\\p{Digit}+)*",
  :punct => "\\p{Punct}",
  :graph => "\\p{Graph}",
  :other => "[^\\p{Word}\\p{Graph}]+"
}
- SPLIT_RE =
/#{SPLIT_RULES.values.map{|v| "(#{v})"}.join("|")}/m
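Because SPLIT_RE wraps each rule in its own capture group and joins them with "|", exactly one group is non-nil for every match, and its position identifies the rule. The sketch below copies the two constants as shown above so it runs without the gem; the classify helper is hypothetical and not part of the library.

```ruby
# Copies of the documented constants, repeated here so the snippet is standalone.
SPLIT_RULES = {
  :word   => "\\p{Alpha}\\p{Word}*",
  :number => "\\p{Digit}+(?:[:., _/-]\\p{Digit}+)*",
  :punct  => "\\p{Punct}",
  :graph  => "\\p{Graph}",
  :other  => "[^\\p{Word}\\p{Graph}]+"
}
SPLIT_RE = /#{SPLIT_RULES.values.map { |v| "(#{v})" }.join("|")}/m

# Hypothetical helper: pairs each match with the name of the rule whose
# capture group matched. Group order follows the key order of SPLIT_RULES.
def classify(text)
  types = SPLIT_RULES.keys
  text.scan(SPLIT_RE).map do |groups|
    index = groups.index { |g| !g.nil? }
    [groups[index], types[index]]
  end
end

tokens = classify("A4 costs 1,5.")
```

The rule order matters: :word is tried first, so "A4" is a word, while "1,5" starts with a digit and falls through to :number.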
Instance Attribute Summary
- #sentence ⇒ Object
Returns the value of attribute sentence.
Instance Method Summary
- #each ⇒ Object
This method iterates over the words in the sentence.
- #initialize(sentence = nil) ⇒ WordSplitter (constructor)
The initializer accepts a sentence, which may be a Sentence instance or a String instance.
Constructor Details
#initialize(sentence = nil) ⇒ WordSplitter
The initializer accepts a sentence, which may be a Sentence instance or a String instance.
The splitter may be created without a sentence, but one must be set through the accessor before the first call to each; otherwise each raises an error.
# File 'lib/srx/english/word_splitter.rb', line 25
def initialize(sentence=nil)
  @sentence = sentence
end
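The deferred-sentence pattern described above can be sketched as follows. SketchSplitter is a hypothetical stand-in, not the library class: it mirrors the documented initializer and nil check, simplifies iteration to whitespace splitting, and assumes a sentence= writer exists (this page only shows the reader).

```ruby
# Hypothetical stand-in so the snippet runs without the gem.
class SketchSplitter
  # Assumption: the real class exposes a writer as well as the reader.
  attr_accessor :sentence

  def initialize(sentence = nil)
    @sentence = sentence
  end

  # Simplified stand-in for #each: same nil guard as the documented method,
  # but it only splits on whitespace.
  def each
    raise "Invalid argument - sentence is nil" if @sentence.nil?
    @sentence.split.each { |token| yield token }
  end
end

splitter = SketchSplitter.new # no sentence yet

# Iterating before a sentence is set raises a RuntimeError.
error_message = begin
  splitter.each { |token| token }
  nil
rescue RuntimeError => e
  e.message
end

# Set the sentence through the accessor, then iterate.
splitter.sentence = "Hello world"
tokens = []
splitter.each { |token| tokens << token }
```

Setting the sentence later lets one splitter object be reused across several sentences.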
Instance Attribute Details
#sentence ⇒ Object
Returns the value of attribute sentence.
# File 'lib/srx/english/word_splitter.rb', line 8
def sentence
  @sentence
end
Instance Method Details
#each ⇒ Object
This method iterates over the tokens of the sentence. For each token it yields four values: the matched string, its type, and its start and end character offsets within the sentence (the end offset is inclusive). The type is one of:
- :word - a regular word that starts with a letter (including words containing digits, like A4)
- :number - a number (including numbers with internal spaces, dashes, slashes, etc.)
- :punct - a single punctuation character (comma, semicolon, full stop, etc.)
- :graph - any other single graphical (visible) character
- :other - anything not covered by the above types (non-visible characters in particular)
# File 'lib/srx/english/word_splitter.rb', line 38
def each
  raise "Invalid argument - sentence is nil" if @sentence.nil?
  @sentence.scan(SPLIT_RE) do |word,number,punct,graph,other|
    start_offset = $~.begin(0)
    end_offset = $~.end(0)-1
    if !word.nil?
      yield word, :word, start_offset, end_offset
    elsif !number.nil?
      yield number, :number, start_offset, end_offset
    elsif !punct.nil?
      yield punct, :punct, start_offset, end_offset
    elsif !graph.nil?
      yield graph, :graph, start_offset, end_offset
    else
      yield other, :other, start_offset, end_offset
    end
  end
end
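A minimal usage sketch of the iteration above. SketchWordSplitter is a hypothetical stand-in that copies the constants and condenses the branching of the documented each, so the snippet runs without the gem; with the gem installed, SRX::English::WordSplitter would presumably be used the same way.

```ruby
# Hypothetical stand-in mirroring the documented class.
class SketchWordSplitter
  include Enumerable

  SPLIT_RULES = {
    :word   => "\\p{Alpha}\\p{Word}*",
    :number => "\\p{Digit}+(?:[:., _/-]\\p{Digit}+)*",
    :punct  => "\\p{Punct}",
    :graph  => "\\p{Graph}",
    :other  => "[^\\p{Word}\\p{Graph}]+"
  }
  SPLIT_RE = /#{SPLIT_RULES.values.map { |v| "(#{v})" }.join("|")}/m

  def initialize(sentence = nil)
    @sentence = sentence
  end

  # Same behavior as the documented method; the if/elsif chain is condensed
  # into a lookup of the first non-nil capture group.
  def each
    raise "Invalid argument - sentence is nil" if @sentence.nil?
    @sentence.scan(SPLIT_RE) do |word, number, punct, graph, other|
      start_offset = $~.begin(0)
      end_offset = $~.end(0) - 1 # inclusive end offset
      groups = [word, number, punct, graph, other]
      index = groups.index { |g| !g.nil? }
      yield groups[index], SPLIT_RULES.keys[index], start_offset, end_offset
    end
  end
end

# Enumerable#to_a packs the four yielded values into one array per token:
# [text, type, start_offset, end_offset].
tokens = SketchWordSplitter.new("Hi, 42!").to_a
```

Since the class includes Enumerable, methods such as to_a, map and select are available once each is defined. Here "Hi" is reported as a :word at offsets 0 to 1 and "42" as a :number at offsets 4 to 5.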