Module: RMMSeg::Algorithm

Included in:
ComplexAlgorithm, SimpleAlgorithm
Defined in:
lib/rmmseg/algorithm.rb

Overview

An algorithm segments a piece of text into an array of words. This module provides the common operations shared by SimpleAlgorithm and ComplexAlgorithm.

Instance Method Summary

Instance Method Details

#basic_latin?(char) ⇒ Boolean

Determine whether a character is a basic Latin character.

TODO: Implement this method in a more correct way. Currently the number of bytes in the character decides the result: a one-byte character is considered basic Latin.

Returns:

  • (Boolean)


# File 'lib/rmmseg/algorithm.rb', line 147

def basic_latin?(char)
  char.size == 1
end
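
The TODO above depends on how "one byte" is measured. As a hedged illustration, assuming an encoding-aware Ruby (1.9 or later) and UTF-8 input, `String#size` counts characters rather than bytes, so a byte-based check would be written with `String#bytesize`. The helper name `one_byte_char?` is hypothetical and not part of RMMSeg.

```ruby
# Hypothetical helper, not part of RMMSeg: a byte-based check for
# encoding-aware Ruby (1.9+), assuming UTF-8 input.
def one_byte_char?(char)
  char.bytesize == 1
end

one_byte_char?("a")       # true: U+0061 is one byte in UTF-8
one_byte_char?("\u4e2d")  # false: U+4E2D is three bytes in UTF-8
one_byte_char?("\u00e9")  # false: U+00E9 is two bytes, although it is a Latin letter
```

The last call shows why the heuristic is only approximate: accented Latin letters fall outside the one-byte range.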

#find_match_words(chars, index) ⇒ Object

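Find all words occurring in the dictionary that start at index. The maximum word length is determined by Config.max_word_length. If no dictionary word matches, the result contains a single unrecognized word made of the character at index.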



# File 'lib/rmmseg/algorithm.rb', line 117

def find_match_words(chars, index)
  dic = Dictionary.instance
  str = String.new
  words = Array.new
  i = index
  
  loop do
    break if i >= chars.length || basic_latin?(chars[i])
    str << chars[i]
    if dic.has_word?(str)
      word = dic.get_word(str)
      words << word
    end
    i += 1
    break if Word.new(str).length >= Config.max_word_length
  end

  if words.empty?
    words << Word.new(chars[index], Word::TYPES[:unrecognized])
  end
  
  words
end
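
The lookup above can be sketched without the library's classes. This is a minimal illustration, assuming a `Set` of strings in place of Dictionary and a constant in place of Config.max_word_length; `DICTIONARY`, `MAX_WORD_LENGTH`, and `prefix_matches` are hypothetical names, and plain strings stand in for Word objects. It also assumes `index` is within bounds.

```ruby
require "set"

# Hypothetical stand-ins for Dictionary and Config.max_word_length.
DICTIONARY = Set.new(["研究", "研究生", "生命"]).freeze
MAX_WORD_LENGTH = 4

# Collect every dictionary word that starts at chars[index],
# growing the candidate one character at a time.
def prefix_matches(chars, index)
  matches = []
  candidate = "".dup
  chars[index, MAX_WORD_LENGTH].each do |char|
    break if char.bytesize == 1 # stop at a one-byte (basic Latin) character
    candidate << char
    matches << candidate.dup if DICTIONARY.include?(candidate)
  end
  # Fall back to the single character when nothing matched.
  matches.empty? ? [chars[index]] : matches
end

chars = "研究生命".chars
prefix_matches(chars, 0) # two matches: "研究" and "研究生"
prefix_matches(chars, 2) # one match: "生命"
prefix_matches(chars, 3) # no match, so the fallback: "命"
```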

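#get_basic_latin_word ⇒ Object

Skip leading whitespace and punctuation, extract a basic Latin word, then skip any trailing whitespace and punctuation. Returns a Token carrying the word and its start and end positions.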



# File 'lib/rmmseg/algorithm.rb', line 57

def get_basic_latin_word
  word = String.new
  start_pos = nil
  end_pos = nil
  
  i = @index
  while i < @chars.length     &&
      basic_latin?(@chars[i]) &&
      nonword_char?(@chars[i])
    i += 1
  end

  start_pos = @byte_index + i - @index
  while i < @chars.length && basic_latin?(@chars[i])
    break if nonword_char?(@chars[i])
    word << @chars[i]
    i += 1
  end

  end_pos = @byte_index + i - @index
  while i < @chars.length      &&
      basic_latin?(@chars[i])  &&
      nonword_char?(@chars[i])
    i += 1
  end

  @byte_index += i - @index
  @index = i
  
  return Token.new(word, start_pos, end_pos)
end
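
The three scanning phases can be tried in isolation. The sketch below is a hypothetical standalone function (`scan_latin_word` is not part of RMMSeg) that takes the character array and a start index instead of using instance state, and returns a Hash instead of a Token. It assumes ASCII-only input, so character and byte offsets coincide.

```ruby
# Hypothetical standalone version of the three-phase scan: skip leading
# non-word characters, collect word characters, skip trailing non-word
# characters. Assumes ASCII-only input.
def scan_latin_word(chars, index)
  nonword = ->(c) { c.match?(/\A\W\z/) }
  i = index
  i += 1 while i < chars.length && nonword.call(chars[i])
  start_pos = i
  i += 1 while i < chars.length && !nonword.call(chars[i])
  end_pos = i
  i += 1 while i < chars.length && nonword.call(chars[i])
  { text: chars[start_pos...end_pos].join, start: start_pos, end: end_pos, next_index: i }
end

result = scan_latin_word("  hello, world".chars, 0)
# result[:text] is "hello", spanning offsets 2...7; scanning resumes at 9
```

When the remaining input holds only non-word characters, the extracted text is empty. That is the case next_token handles by calling itself again.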

#get_cjk_word(chunks) ⇒ Object

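Apply the rules in order to filter the chunks down to the most appropriate CJK word. If more than one chunk remains and Config.on_ambiguity is :raise_exception, an Ambiguity error is raised; otherwise the first word of the first remaining chunk is used.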



# File 'lib/rmmseg/algorithm.rb', line 91

def get_cjk_word(chunks)
  i = 0
  while i < @rules.length
    break if chunks.length < 2
    chunks = @rules[i].filter(chunks)
    i += 1
  end

  if chunks.length > 1
    if Config.on_ambiguity == :raise_exception
      raise Ambiguity, "Can't solve ambiguity on #{chunks}"
    end
  end

  word = chunks[0].words[0]
  token = Token.new(word.text, @byte_index, @byte_index+word.byte_size)
  
  @index += word.length
  @byte_index += word.byte_size

  return token
end
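
The filtering loop is a chain of narrowing steps that stops early once a single candidate remains. The sketch below illustrates that shape only: chunks are modelled as arrays of strings, and the two rules (`LONGEST_TOTAL`, `FEWEST_WORDS`) are placeholders invented for this example, not the rules RMMSeg ships.

```ruby
# Placeholder rules: each receives the candidate chunks and returns the
# ones it prefers. They are illustrative and not RMMSeg's actual rules.
LONGEST_TOTAL = lambda do |chunks|
  best = chunks.map { |c| c.sum(&:length) }.max
  chunks.select { |c| c.sum(&:length) == best }
end
FEWEST_WORDS = lambda do |chunks|
  best = chunks.map(&:length).min
  chunks.select { |c| c.length == best }
end
RULES = [LONGEST_TOTAL, FEWEST_WORDS].freeze

# Apply the rules in order, stopping as soon as one candidate remains.
def filter_chunks(chunks, rules)
  rules.each do |rule|
    break if chunks.length < 2
    chunks = rule.call(chunks)
  end
  chunks
end

# The first rule alone settles this case: one chunk covers four characters.
filter_chunks([["研究", "生"], ["研究生", "命"], ["研", "究"]], RULES)

# Here two chunks survive both rules, which is the ambiguous case that
# get_cjk_word either raises on or resolves by taking the first chunk.
filter_chunks([["研究", "生命"], ["研究生", "命"], ["研", "究", "生命"]], RULES)
```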

#initialize(text) ⇒ Object

Initialize a new instance of Algorithm. The given text will then be segmented by this instance.



# File 'lib/rmmseg/algorithm.rb', line 14

def initialize(text)
  @chars = text.each_char
  @index = 0
  @byte_index = 0
end

#next_token ⇒ Object

Get the next recognized Token, or nil when the end of the text has been reached.



# File 'lib/rmmseg/algorithm.rb', line 21

def next_token
  return nil if @index >= @chars.length

  current = @chars[@index]
  orig_index = @index
  token = nil
  len = 0

  if basic_latin?(current)
    token = get_basic_latin_word
  else
    token = get_cjk_word(create_chunks)
  end

  if token.text.empty?
    return next_token
  else
    return token
  end
end

#nonword_char?(char) ⇒ Boolean

Determine whether a character is a non-word character, that is, one that cannot be part of a basic Latin word.

Returns:

  • (Boolean)


# File 'lib/rmmseg/algorithm.rb', line 153

def nonword_char?(char)
  /^\W$/ =~ char
end
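
A quick way to see which characters the pattern classifies as non-word is to run it on a few samples. The wrapper `nonword?` below is a hypothetical name that converts the match result to true or false; the method above returns the raw result of `=~`, which is 0 on a match and nil otherwise.

```ruby
# Same pattern as nonword_char?, wrapped to return true or false.
def nonword?(char)
  !(/^\W$/ =~ char).nil?
end

nonword?(",") # true
nonword?(" ") # true
nonword?("a") # false
nonword?("7") # false
nonword?("_") # false: the underscore counts as a word character
```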

#segment ⇒ Object

Segment the text given at initialization into an array of words, collecting the text of each Token until next_token returns nil.



# File 'lib/rmmseg/algorithm.rb', line 44

def segment
  words = Array.new
  loop do
    token = next_token
    break if token.nil?
    words << token.text
  end

  words
end
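
The same pull loop works with any object whose next_token returns nil at the end of input. The following sketch uses a hypothetical `WhitespaceTokenizer`, invented for this example, that splits on whitespace; it shares only the loop shape with RMMSeg and does no dictionary-based segmentation.

```ruby
# Hypothetical tokenizer used only to illustrate the pull loop.
class WhitespaceTokenizer
  Token = Struct.new(:text, :start_pos, :end_pos)

  def initialize(text)
    @text = text
    @pos = 0
  end

  # Return the next token, or nil once the text is exhausted.
  def next_token
    match = @text.match(/\S+/, @pos)
    return nil if match.nil?

    @pos = match.end(0)
    Token.new(match[0], match.begin(0), match.end(0))
  end

  # Same shape as #segment: pull tokens until nil, keep each token's text.
  def segment
    words = []
    while (token = next_token)
      words << token.text
    end
    words
  end
end

WhitespaceTokenizer.new("one  two three").segment # three words: one, two, three
```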