Module: RMMSeg::Algorithm
- Included in: ComplexAlgorithm, SimpleAlgorithm
- Defined in: lib/rmmseg/algorithm.rb
Overview
An algorithm segments a piece of text into an array of words. This module provides the common operations shared by SimpleAlgorithm and ComplexAlgorithm.
Instance Method Summary
-
#basic_latin?(char) ⇒ Boolean
Determine whether a character is a basic Latin character.
-
#find_match_words(chars, index) ⇒ Object
Find all words occurring in the dictionary starting from index.
-
#get_basic_latin_word ⇒ Object
Skip whitespace and punctuation to extract a basic Latin word.
-
#get_cjk_word(chunks) ⇒ Object
Use rules to filter the chunks to get the most appropriate CJK word.
-
#initialize(text) ⇒ Object
Initialize a new instance of Algorithm; the text will then be segmented by this instance.
-
#next_token ⇒ Object
Get the next Token recognized.
-
#nonword_char?(char) ⇒ Boolean
Determine whether a character is a non-word character, that is, one that cannot be part of a basic Latin word.
-
#segment ⇒ Object
Segment the string in text into an array of words.
Instance Method Details
#basic_latin?(char) ⇒ Boolean
Determine whether a character is a basic Latin character. TODO: implement this method more correctly. It currently uses the number of bytes in the character: a one-byte character is considered basic Latin.
    # File 'lib/rmmseg/algorithm.rb', line 147
    def basic_latin?(char)
      char.size == 1
    end
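The TODO above notes that the byte-count test is only an approximation. As a minimal sketch, assuming Ruby 1.9 or later (where String#size counts characters, not bytes, so `char.size == 1` holds for any single character), two stricter checks might look like this. The helper names `basic_latin_by_bytes?` and `basic_latin_by_codepoint?` are hypothetical and not part of RMMSeg.

```ruby
# Hypothetical helpers, not part of RMMSeg.

# In UTF-8, a character occupies exactly one byte only if it is ASCII.
def basic_latin_by_bytes?(char)
  char.bytesize == 1
end

# The Unicode "Basic Latin" block covers U+0000..U+007F.
def basic_latin_by_codepoint?(char)
  char.length == 1 && char.ord <= 0x7F
end

basic_latin_by_bytes?("a")            # => true
basic_latin_by_bytes?("\u4E2D")       # => false (a CJK ideograph, 3 bytes in UTF-8)
basic_latin_by_codepoint?("\u00E9")   # => false (e with acute, Latin-1 Supplement)
```

The code-point variant does not depend on the string's encoding being UTF-8, which is the main reason one might prefer it.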
#find_match_words(chars, index) ⇒ Object
Find all words occurring in the dictionary starting from index. The maximum word length is determined by Config.max_word_length.
    # File 'lib/rmmseg/algorithm.rb', line 117
    def find_match_words(chars, index)
      dic = Dictionary.instance
      str = String.new
      words = Array.new
      i = index

      loop do
        break if i >= chars.length || basic_latin?(chars[i])
        str << chars[i]
        if dic.has_word?(str)
          word = dic.get_word(str)
          words << word
        end
        i += 1
        break if Word.new(str).length >= Config.max_word_length
      end

      if words.empty?
        words << Word.new(chars[index], Word::TYPES[:unrecognized])
      end

      words
    end
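The prefix-growing loop above can be tried in isolation. The following is a minimal sketch, not RMMSeg's implementation: it uses plain strings instead of Word objects, a Set in place of Dictionary.instance, and an explicit `max_word_length` argument in place of Config.max_word_length. The name `match_words` and the sample dictionary are assumptions made for illustration.

```ruby
require "set"

# Hypothetical standalone version of the lookup loop.
# Collects every dictionary word that starts at chars[index], growing the
# candidate one character at a time up to max_word_length characters.
def match_words(chars, index, dict, max_word_length)
  words = []
  str = "".dup
  i = index
  # Stop at the end of input, at a one-byte (basic Latin) character,
  # or once the candidate has reached the maximum length.
  while i < chars.length && chars[i].bytesize > 1 && str.length < max_word_length
    str << chars[i]
    words << str.dup if dict.include?(str)
    i += 1
  end
  # Fall back to the single character when nothing matched.
  words << chars[index] if words.empty?
  words
end

dict  = Set.new(["研究", "研究生", "生命"])
chars = "研究生命起源".chars
match_words(chars, 0, dict, 4)   # => ["研究", "研究生"]
match_words(chars, 4, dict, 4)   # => ["起"]
```

Returning every matching prefix, rather than only the longest, is what lets a later step weigh alternative segmentations against each other.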
#get_basic_latin_word ⇒ Object
Skip whitespace and punctuation to extract a basic Latin word.
    # File 'lib/rmmseg/algorithm.rb', line 57
    def get_basic_latin_word
      word = String.new
      start_pos = nil
      end_pos = nil

      i = @index
      while i < @chars.length && basic_latin?(@chars[i]) && nonword_char?(@chars[i])
        i += 1
      end

      start_pos = @byte_index + i - @index
      while i < @chars.length && basic_latin?(@chars[i])
        break if nonword_char?(@chars[i])
        word << @chars[i]
        i += 1
      end

      end_pos = @byte_index + i - @index
      while i < @chars.length && basic_latin?(@chars[i]) && nonword_char?(@chars[i])
        i += 1
      end

      @byte_index += i - @index
      @index = i

      return Token.new(word, start_pos, end_pos)
    end
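The three phases of this method (skip separators, collect word characters, skip separators again) can be sketched as a standalone function. This is an illustrative rewrite under stated assumptions, not the library's code: `latin_word_at` is a hypothetical name, it returns a plain array instead of a Token, and positions are character indexes, which coincide with byte offsets only because every character in a basic Latin run is one byte long.

```ruby
# Hypothetical standalone version; returns [word, start_pos, end_pos, next_index].
def latin_word_at(chars, index)
  nonword = ->(c) { c =~ /\A\W\z/ }      # truthy for whitespace and punctuation
  ascii   = ->(c) { c.bytesize == 1 }    # one-byte characters only

  i = index
  # 1. Skip leading whitespace and punctuation.
  i += 1 while i < chars.length && ascii[chars[i]] && nonword[chars[i]]
  start_pos = i

  # 2. Accumulate word characters.
  word = "".dup
  while i < chars.length && ascii[chars[i]] && !nonword[chars[i]]
    word << chars[i]
    i += 1
  end
  end_pos = i

  # 3. Skip trailing whitespace and punctuation.
  i += 1 while i < chars.length && ascii[chars[i]] && nonword[chars[i]]

  [word, start_pos, end_pos, i]
end

latin_word_at("  hello, world".chars, 0)   # => ["hello", 2, 7, 9]
latin_word_at("...".chars, 0)              # => ["", 3, 3, 3]
```

The second call shows that a run consisting only of separators yields an empty word, which is the case next_token guards against by checking `token.text.empty?`.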
#get_cjk_word(chunks) ⇒ Object
Use rules to filter the chunks to get the most appropriate CJK word.
    # File 'lib/rmmseg/algorithm.rb', line 91
    def get_cjk_word(chunks)
      i = 0
      while i < @rules.length
        break if chunks.length < 2
        chunks = @rules[i].filter(chunks)
        i += 1
      end

      if chunks.length > 1
        if Config.on_ambiguity == :raise_exception
          raise Ambiguity, "Can't solve ambiguity on #{chunks}"
        end
      end

      word = chunks[0].words[0]
      token = Token.new(word.text, @byte_index, @byte_index+word.byte_size)

      @index += word.length
      @byte_index += word.byte_size

      return token
    end
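The filtering loop applies each rule in turn until a single chunk remains. Here is a minimal sketch of that idea, assuming chunks are plain arrays of word strings and rules are lambdas. The two rules shown (most characters covered, then largest average word length) are illustrative stand-ins, since the contents of @rules are not shown in this section. The names `keep_best`, `pick_first_word`, and the `:select_first` default are hypothetical.

```ruby
# Keep only the chunks that reach the highest score given by the block.
def keep_best(chunks)
  best = chunks.map { |c| yield c }.max
  chunks.select { |c| (yield c) == best }
end

# Apply rules in order, stopping early once at most one chunk is left,
# then return the first word of the first surviving chunk.
def pick_first_word(chunks, rules, on_ambiguity: :select_first)
  rules.each do |rule|
    break if chunks.length < 2
    chunks = rule.call(chunks)
  end
  if chunks.length > 1 && on_ambiguity == :raise_exception
    raise ArgumentError, "can't resolve ambiguity on #{chunks.inspect}"
  end
  chunks[0][0]
end

rules = [
  # Illustrative rule: prefer chunks covering the most characters.
  ->(chunks) { keep_best(chunks) { |c| c.join.length } },
  # Illustrative rule: prefer the largest average word length.
  ->(chunks) { keep_best(chunks) { |c| c.join.length.to_f / c.length } },
]

chunks = [
  ["研究生", "命"],       # 4 characters, 2 words
  ["研究", "生", "命"],   # 4 characters, 3 words
  ["研", "究", "生"],     # 3 characters, 3 words
]
pick_first_word(chunks, rules)   # => "研究生"
```

As in the method above, leftover ambiguity is an error only when the caller asks for it. Otherwise the first surviving chunk wins.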
#initialize(text) ⇒ Object
Initialize a new instance of Algorithm; the text will then be segmented by this instance.
    # File 'lib/rmmseg/algorithm.rb', line 14
    def initialize(text)
      @chars = text.each_char
      @index = 0
      @byte_index = 0
    end
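The other methods index into @chars and call @chars.length, so they need an Array-like object. On Ruby 1.9 and later, the core String#each_char returns an Enumerator when called without a block, and an Enumerator does not support indexing. The sketch below shows the difference. Whether RMMSeg relies on its own each_char for older Rubies is not stated in this section, so treat this as a note on modern Ruby only.

```ruby
text = "abc 中文"

enum  = text.each_char         # Enumerator on Ruby 1.9 and later
chars = text.each_char.to_a    # Array; supports chars[i] and chars.length

enum.respond_to?(:[])   # => false
chars[4]                # => "中"
chars.length            # => 6
```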
#next_token ⇒ Object
Get the next Token recognized.
    # File 'lib/rmmseg/algorithm.rb', line 21
    def next_token
      return nil if @index >= @chars.length

      current = @chars[@index]
      orig_index = @index
      token = nil
      len = 0
      if basic_latin?(current)
        token = get_basic_latin_word
      else
        token = get_cjk_word(create_chunks)
      end

      if token.text.empty?
        return next_token
      else
        return token
      end
    end
#nonword_char?(char) ⇒ Boolean
Determine whether a character is a non-word character, that is, one that cannot be part of a basic Latin word.
    # File 'lib/rmmseg/algorithm.rb', line 153
    def nonword_char?(char)
      /^\W$/ =~ char
    end
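A short demonstration of what this predicate returns for a few ASCII characters. The method body is copied from the listing above, and the sample characters are chosen only for illustration. Note that `=~` yields a match offset or nil, so the result is truthy or falsy rather than strictly true or false.

```ruby
def nonword_char?(char)
  /^\W$/ =~ char
end

nonword_char?(" ")   # => 0   (whitespace is a separator)
nonword_char?(",")   # => 0   (punctuation is a separator)
nonword_char?("a")   # => nil (letters belong to a word)
nonword_char?("7")   # => nil (digits belong to a word)
nonword_char?("_")   # => nil (underscore counts as a word character)
```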
#segment ⇒ Object
Segment the string in text into an array of words.
    # File 'lib/rmmseg/algorithm.rb', line 44
    def segment
      words = Array.new
      loop do
        token = next_token
        break if token.nil?
        words << token.text
      end
      words
    end
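To see how next_token and segment cooperate, here is a self-contained toy with the same shape: next_token returns nil at the end of input, and segment drains it into an array. ToySegmenter is hypothetical and deliberately simplistic. It has no dictionary, so every non-ASCII character becomes its own word, and it skips empty tokens with a loop where the method above recurses.

```ruby
# Hypothetical toy tokenizer, not RMMSeg's: ASCII runs become words,
# every other character becomes a one-character word.
class ToySegmenter
  def initialize(text)
    @chars = text.chars
    @index = 0
  end

  # Return the next non-empty word, or nil when the input is exhausted.
  def next_token
    while @index < @chars.length
      word = @chars[@index].bytesize == 1 ? take_ascii_word : take_char
      return word unless word.empty?
    end
    nil
  end

  # Collect tokens until next_token signals the end with nil.
  def segment
    words = []
    while (token = next_token)
      words << token
    end
    words
  end

  private

  def take_char
    @index += 1
    @chars[@index - 1]
  end

  # Consume one ASCII run: separators are skipped, word characters are kept.
  def take_ascii_word
    word = "".dup
    @index += 1 while ascii_at?(@index) && @chars[@index] =~ /\W/
    while ascii_at?(@index) && @chars[@index] =~ /\w/
      word << @chars[@index]
      @index += 1
    end
    word
  end

  def ascii_at?(i)
    i < @chars.length && @chars[i].bytesize == 1
  end
end

ToySegmenter.new("hello, 世界 ruby!").segment
# => ["hello", "世", "界", "ruby"]
```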