Module: RMMSeg::Algorithm
- Included in:
- ComplexAlgorithm, SimpleAlgorithm
- Defined in:
- lib/rmmseg/algorithm.rb
Overview
An algorithm segments a piece of text into an array of words. This module provides the common operations shared by SimpleAlgorithm and ComplexAlgorithm.
Constant Summary
- NONWORD_CHAR_RE =
Matches a single non-word character, i.e. one that cannot be part of a basic latin word.
/^\W$/
Instance Method Summary
-
#basic_latin?(char) ⇒ Boolean
Determine whether a character is a basic latin character.
-
#find_match_words(index) ⇒ Object
Find all words occurring in the dictionary starting from index.
-
#get_basic_latin_word ⇒ Object
Skip whitespace and punctuation to extract a basic latin word.
-
#initialize(text, token = Token) ⇒ Object
Initialize a new instance of Algorithm; the text will then be segmented by this instance.
-
#next_token ⇒ Object
Get the next Token recognized.
- #nonword_char?(char) ⇒ Boolean
-
#segment ⇒ Object
Segment the string in text into an array of words.
Instance Method Details
#basic_latin?(char) ⇒ Boolean
Determine whether a character is a basic latin character.
# File 'lib/rmmseg/algorithm.rb', line 127
def basic_latin?(char)
  char.length == 1
end
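The one-line check above appears to rely on String#length counting bytes, as it did in Ruby 1.8; that is an assumption about the library's target runtime, not something this page states. On Ruby 1.9 and later, where length counts characters, a roughly equivalent check would compare bytesize instead. The helper name below is hypothetical and is not part of RMMSeg:

```ruby
# Hypothetical stand-in for basic_latin?, assuming the intent is
# "the character occupies a single byte in UTF-8".
def single_byte_char?(char)
  char.bytesize == 1
end

single_byte_char?("a")   # => true
single_byte_char?("中")  # => false (three bytes in UTF-8)
```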
#find_match_words(index) ⇒ Object
Find all words occurring in the dictionary starting from index. The maximum word length is determined by Config.max_word_length.
# File 'lib/rmmseg/algorithm.rb', line 89
def find_match_words(index)
  for i, w in @match_cache
    if i == index
      return w
    end
  end

  dic = Dictionary.instance
  str = String.new
  strlen = 0
  words = Array.new
  i = index

  while i < @chars.length &&
      !basic_latin?(@chars[i]) &&
      strlen < Config.max_word_length
    str << @chars[i]
    strlen += 1

    if dic.has_word?(str)
      words << dic.get_word(str)
    end
    i += 1
  end

  if words.empty?
    words << Word.new(@chars[index], Word::TYPES[:unrecognized])
  end

  @match_cache[@match_cache_idx] = [index, words]
  @match_cache_idx += 1
  @match_cache_idx = 0 if @match_cache_idx == MATCH_CACHE_MAX_LENGTH

  words
end
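To illustrate the prefix-matching idea in isolation, here is a minimal sketch that grows a candidate string one character at a time and records every prefix found in a dictionary. TOY_DICT, MAX_WORD_LENGTH and toy_match_words are hypothetical names; the real Dictionary, Word and Config classes, the basic latin cut-off and the match cache are all left out:

```ruby
require 'set'

# Toy dictionary; the real Dictionary class is not reproduced here.
TOY_DICT = Set.new(%w[研究 研究生 生命]).freeze
MAX_WORD_LENGTH = 4 # stands in for Config.max_word_length

# Collect every dictionary word that starts at +index+ in +chars+.
def toy_match_words(chars, index)
  words = []
  str = +''
  i = index
  while i < chars.length && str.length < MAX_WORD_LENGTH
    str << chars[i]
    words << str.dup if TOY_DICT.include?(str)
    i += 1
  end
  # Fall back to the single character when nothing matched,
  # mirroring the "unrecognized" word in the original.
  words << chars[index] if words.empty?
  words
end

chars = '研究生命'.chars
toy_match_words(chars, 0) # => ["研究", "研究生"]
toy_match_words(chars, 3) # => ["命"]
```

Returning all matching prefixes, rather than only the longest, leaves the choice between candidates to the caller.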
#get_basic_latin_word ⇒ Object
Skip whitespace and punctuation to extract a basic latin word.
# File 'lib/rmmseg/algorithm.rb', line 56
def get_basic_latin_word
  start_pos = nil
  end_pos = nil

  i = @index
  while i < @chars.length &&
      basic_latin?(@chars[i]) &&
      nonword_char?(@chars[i])
    i += 1
  end
  start_pos = @byte_index + i - @index

  while i < @chars.length && basic_latin?(@chars[i])
    break if nonword_char?(@chars[i])
    i += 1
  end
  end_pos = @byte_index + i - @index

  while i < @chars.length &&
      basic_latin?(@chars[i]) &&
      nonword_char?(@chars[i])
    i += 1
  end

  @byte_index += i - @index
  @index = i

  return @token.new(@text[start_pos...end_pos], start_pos, end_pos)
end
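The method above works in three passes: skip leading separators, consume the word, then skip trailing separators. A simplified sketch of that scan follows, assuming pure ASCII input so that character and byte offsets coincide. NONWORD and toy_latin_word are hypothetical names, and the sketch returns a plain pair instead of a token object:

```ruby
NONWORD = /^\W$/ # same pattern as NONWORD_CHAR_RE

# Return [word, next_index] for the Latin word at or after +index+.
def toy_latin_word(chars, index)
  i = index
  i += 1 while i < chars.length && chars[i] =~ NONWORD # leading separators
  start_pos = i
  i += 1 while i < chars.length && chars[i] !~ NONWORD # the word itself
  end_pos = i
  i += 1 while i < chars.length && chars[i] =~ NONWORD # trailing separators
  [chars[start_pos...end_pos].join, i]
end

toy_latin_word('  hello, world'.chars, 0) # => ["hello", 9]
toy_latin_word('  hello, world'.chars, 9) # => ["world", 14]
```

When the input holds only separators, the word comes back empty, which corresponds to the empty token that #next_token skips.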
#initialize(text, token = Token) ⇒ Object
Initialize a new instance of Algorithm; the text will then be segmented by this instance. token is the class used to construct the result tokens.
# File 'lib/rmmseg/algorithm.rb', line 15
def initialize(text, token=Token)
  @text = text
  @chars = text.each_char
  @index = 0
  @byte_index = 0
  @token = token
end
#next_token ⇒ Object
Get the next Token recognized.
# File 'lib/rmmseg/algorithm.rb', line 24
def next_token
  return nil if @index >= @chars.length

  if basic_latin?(@chars[@index])
    token = get_basic_latin_word
  else
    token = get_cjk_word
  end

  if token.start == token.end # empty
    return next_token
  else
    return token
  end
end
#nonword_char?(char) ⇒ Boolean
# File 'lib/rmmseg/algorithm.rb', line 134
def nonword_char?(char)
  NONWORD_CHAR_RE =~ char
end
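Although the signature advertises a Boolean, the =~ operator returns the match offset (0 here) or nil, so the result is truthy or falsy rather than strictly true or false. The sketch below shows one way to coerce it; NONWORD_RE and toy_nonword_char? are hypothetical names used only so the example runs on its own:

```ruby
NONWORD_RE = /^\W$/ # copy of NONWORD_CHAR_RE for a standalone run

def toy_nonword_char?(char)
  !(NONWORD_RE =~ char).nil? # coerce the match offset to true/false
end

toy_nonword_char?(",") # => true
toy_nonword_char?(" ") # => true
toy_nonword_char?("a") # => false
toy_nonword_char?("_") # => false; underscore counts as a word character
```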
#segment ⇒ Object
Segment the string in text into an array of words.
# File 'lib/rmmseg/algorithm.rb', line 42
def segment
  words = Array.new

  token = next_token
  until token.nil?
    words << token.text
    token = next_token
  end

  words
end
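The relationship between #next_token and #segment is a pull loop: keep requesting tokens until nil signals the end of input. The following self-contained sketch reproduces only that loop shape with a whitespace splitter. ToyToken and ToySegmenter are hypothetical and do not use the RMMSeg dictionary or its CJK handling:

```ruby
ToyToken = Struct.new(:text)

class ToySegmenter
  def initialize(text)
    @chars = text.chars
    @index = 0
  end

  # Return the next whitespace-delimited token, or nil when input is exhausted.
  def next_token
    @index += 1 while @index < @chars.length && @chars[@index] =~ /\s/
    return nil if @index >= @chars.length

    start = @index
    @index += 1 while @index < @chars.length && @chars[@index] !~ /\s/
    ToyToken.new(@chars[start...@index].join)
  end

  # Same loop shape as #segment: pull tokens until nil.
  def segment
    words = []
    while (token = next_token)
      words << token.text
    end
    words
  end
end

ToySegmenter.new('ab cd  e').segment # => ["ab", "cd", "e"]
```

Because the instance keeps its position in @index, a second call to segment on the same object returns an empty array; a fresh instance is needed for each text.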