Module: RMMSeg::Algorithm

Included in: ComplexAlgorithm, SimpleAlgorithm

Defined in: lib/rmmseg/algorithm.rb

Overview

An algorithm segments a piece of text into an array of words. This module provides the common operations shared by SimpleAlgorithm and ComplexAlgorithm.

Instance Method Summary

#basic_latin?(char) ⇒ Boolean

Determine whether a character is a basic Latin character.
#find_match_words(chars, index) ⇒ Object

Find all words occurring in the dictionary starting from index.
#get_basic_latin_word ⇒ Object

Skip whitespace and punctuation to extract a basic Latin word.
#get_cjk_word(chunks) ⇒ Object

Use rules to filter the chunks to get the most appropriate CJK word.
#initialize(text) ⇒ Object

Initialize a new instance of Algorithm. The text will then be segmented by this instance.
#next_token ⇒ Object

Get the next Token recognized.
#nonword_char?(char) ⇒ Boolean

Determine whether a character is a non-word character, that is, one that cannot be part of a basic Latin word (such as whitespace or punctuation).
#segment ⇒ Object

Segment the string in text into an array of words.

Instance Method Details

#basic_latin?(char) ⇒ `Boolean`

Determine whether a character is a basic Latin character. TODO: Implement this method in a more correct way. Currently I use the number of bytes in the char to determine this: if it is a one-byte char, I consider it basic Latin.

Returns:

(Boolean)




# File 'lib/rmmseg/algorithm.rb', line 147

def basic_latin?(char)
  char.size == 1
end
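
The TODO above describes a byte count. On Ruby 1.9 and later, `String#size` counts characters rather than bytes, so a check in the spirit of that note would need `String#bytesize`. The following is a minimal sketch under that assumption, for UTF-8 input; the helper name `one_byte_char?` is hypothetical and not part of RMMSeg.

```ruby
# Hypothetical helper, not part of RMMSeg: treats a character as basic Latin
# when its UTF-8 encoding occupies exactly one byte (i.e. it is ASCII).
def one_byte_char?(char)
  char.bytesize == 1
end

one_byte_char?("a")   # => true  ("a" is one byte in UTF-8)
one_byte_char?("中")  # => false ("中" is three bytes in UTF-8)
```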

#find_match_words(chars, index) ⇒ `Object`

Find all words occurring in the dictionary starting from index. The maximum word length is determined by Config.max_word_length.

# File 'lib/rmmseg/algorithm.rb', line 117

def find_match_words(chars, index)
  dic = Dictionary.instance
  str = String.new
  words = Array.new
  i = index
  
  loop do
    break if i >= chars.length || basic_latin?(chars[i])
    str << chars[i]
    if dic.has_word?(str)
      word = dic.get_word(str)
      words << word
    end
    i += 1
    break if Word.new(str).length >= Config.max_word_length
  end

  if words.empty?
    words << Word.new(chars[index], Word::TYPES[:unrecognized])
  end
  
  words
end
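
To make the prefix-matching idea concrete, here is a simplified sketch that does not use RMMSeg's classes. `DICT`, `MAX_WORD_LENGTH`, and `match_prefixes` are hypothetical stand-ins for `Dictionary`, `Config.max_word_length`, and `find_match_words`; the sketch returns plain strings instead of `Word` objects and omits the basic Latin check.

```ruby
require "set"

# Hypothetical stand-ins for Dictionary and Config.max_word_length.
DICT = Set.new(["研究", "研究生", "生命"])
MAX_WORD_LENGTH = 4

# Collect every dictionary word that starts at chars[index].
def match_prefixes(chars, index)
  words = []
  str = "".dup
  i = index
  while i < chars.length && str.length < MAX_WORD_LENGTH
    str << chars[i]                        # grow the candidate by one character
    words << str.dup if DICT.include?(str) # keep it if the dictionary knows it
    i += 1
  end
  # Fall back to the single character when nothing matched.
  words << chars[index] if words.empty?
  words
end

chars = "研究生命".chars
match_prefixes(chars, 0)  # => ["研究", "研究生"]
match_prefixes(chars, 3)  # => ["命"]
```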

#get_basic_latin_word ⇒ `Object`

Skip whitespace and punctuation to extract a basic Latin word.

# File 'lib/rmmseg/algorithm.rb', line 57

def get_basic_latin_word
  word = String.new
  start_pos = nil
  end_pos = nil
  
  i = @index
  while i < @chars.length     &&
      basic_latin?(@chars[i]) &&
      nonword_char?(@chars[i])
    i += 1
  end

  start_pos = @byte_index + i - @index
  while i < @chars.length && basic_latin?(@chars[i])
    break if nonword_char?(@chars[i])
    word << @chars[i]
    i += 1
  end

  end_pos = @byte_index + i - @index
  while i < @chars.length      &&
      basic_latin?(@chars[i])  &&
      nonword_char?(@chars[i])
    i += 1
  end

  @byte_index += i - @index
  @index = i
  
  return Token.new(word, start_pos, end_pos)
end
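
The three loops above (skip leading non-word characters, collect the word, skip trailing non-word characters) can be sketched as a standalone function. This is an illustration only: `scan_latin_word` is a hypothetical name, the input is assumed to be ASCII, and positions are character indexes rather than the byte offsets tracked by `@byte_index`.

```ruby
# Hypothetical standalone version of the three-phase scan.
# Returns [word, start_pos, end_pos, next_index].
def scan_latin_word(chars, index)
  nonword = /\A\W\z/
  i = index

  # 1. Skip leading whitespace and punctuation.
  i += 1 while i < chars.length && chars[i] =~ nonword
  start_pos = i

  # 2. Collect word characters until the next non-word character.
  word = "".dup
  while i < chars.length && chars[i] !~ nonword
    word << chars[i]
    i += 1
  end
  end_pos = i

  # 3. Skip trailing whitespace and punctuation so the next scan
  #    starts on a word character.
  i += 1 while i < chars.length && chars[i] =~ nonword

  [word, start_pos, end_pos, i]
end

scan_latin_word("  hello, world".chars, 0)  # => ["hello", 2, 7, 9]
```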

#get_cjk_word(chunks) ⇒ `Object`

Use rules to filter the chunks to get the most appropriate CJK word.

# File 'lib/rmmseg/algorithm.rb', line 91

def get_cjk_word(chunks)
  i = 0
  while i < @rules.length
    break if chunks.length < 2
    chunks = @rules[i].filter(chunks)
    i += 1
  end

  if chunks.length > 1
    if Config.on_ambiguity == :raise_exception
      raise Ambiguity, "Can't solve ambiguity on #{chunks}"
    end
  end

  word = chunks[0].words[0]
  token = Token.new(word.text, @byte_index, @byte_index+word.byte_size)
  
  @index += word.length
  @byte_index += word.byte_size

  return token
end
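
The filtering loop can be illustrated without RMMSeg's rule classes. In the sketch below, chunks are plain arrays of strings and the two rules (`LONGEST_TOTAL`, `FEWEST_WORDS`) are hypothetical examples; the actual contents of `@rules` are not shown in this section. As in the listing above, rules are applied in order until at most one chunk remains, and if several chunks survive, the first one is used.

```ruby
# Hypothetical rules: each takes a list of chunks (arrays of words)
# and returns the subset it prefers.
LONGEST_TOTAL = lambda do |chunks|
  best = chunks.map { |c| c.join.length }.max
  chunks.select { |c| c.join.length == best }
end

FEWEST_WORDS = lambda do |chunks|
  best = chunks.map(&:length).min
  chunks.select { |c| c.length == best }
end

# Apply each rule in turn, stopping early once nothing is left to disambiguate.
def pick_chunk(chunks, rules)
  rules.each do |rule|
    break if chunks.length < 2
    chunks = rule.call(chunks)
  end
  chunks.first
end

chunks = [["研究", "生命"], ["研究生", "命"], ["研", "究", "生"]]
pick_chunk(chunks, [LONGEST_TOTAL, FEWEST_WORDS])  # => ["研究", "生命"]
```

In this example both rules leave two candidates, so the ambiguity is resolved by position alone. That is the situation in which the listing above raises `Ambiguity` when `Config.on_ambiguity` is `:raise_exception`.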

#initialize(text) ⇒ `Object`

Initialize a new instance of Algorithm. The text will then be segmented by this instance.

# File 'lib/rmmseg/algorithm.rb', line 14

def initialize(text)
  @chars = text.each_char
  @index = 0
  @byte_index = 0
end

#next_token ⇒ `Object`

Get the next Token recognized.

# File 'lib/rmmseg/algorithm.rb', line 21

def next_token
  return nil if @index >= @chars.length

  current = @chars[@index]
  orig_index = @index
  token = nil
  len = 0

  if basic_latin?(current)
    token = get_basic_latin_word
  else
    token = get_cjk_word(create_chunks)
  end

  if token.text.empty?
    return next_token
  else
    return token
  end
end

#nonword_char?(char) ⇒ `Boolean`

Determine whether a character is a non-word character, that is, one that cannot be part of a basic Latin word (such as whitespace or punctuation).

Returns:

(Boolean)




# File 'lib/rmmseg/algorithm.rb', line 153

def nonword_char?(char)
  /^\W$/ =~ char
end
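
A short demonstration of how this check behaves on single characters. Note that `=~` returns the match position (`0`) or `nil` rather than `true` or `false`, which is sufficient for use in conditions.

```ruby
def nonword_char?(char)
  /^\W$/ =~ char
end

nonword_char?(" ")  # => 0   (truthy: whitespace is a non-word character)
nonword_char?(",")  # => 0   (truthy: punctuation is a non-word character)
nonword_char?("a")  # => nil (falsy: letters are word characters)
nonword_char?("_")  # => nil (falsy: underscore counts as a word character)
```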

#segment ⇒ `Object`

Segment the string in text into an array of words.

# File 'lib/rmmseg/algorithm.rb', line 44

def segment
  words = Array.new
  loop do
    token = next_token
    break if token.nil?
    words << token.text
  end

  words
end
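
The loop above only relies on `next_token` returning a token with a `text` attribute, or `nil` when the input is exhausted. The sketch below shows that contract in isolation; `StubToken` and `StubAlgorithm` are hypothetical stand-ins and do not perform any real segmentation.

```ruby
# Hypothetical stand-in for RMMSeg's Token.
StubToken = Struct.new(:text, :start_pos, :end_pos)

# Hypothetical class that hands out pre-made tokens one at a time.
class StubAlgorithm
  def initialize(tokens)
    @tokens = tokens.dup
  end

  # Returns the next token, or nil once all tokens are consumed.
  def next_token
    @tokens.shift
  end

  # Same shape as Algorithm#segment: drain next_token into an array of words.
  def segment
    words = []
    while (token = next_token)
      words << token.text
    end
    words
  end
end

tokens = [StubToken.new("hello", 0, 5), StubToken.new("world", 6, 11)]
StubAlgorithm.new(tokens).segment  # => ["hello", "world"]
```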
