Module: Unidecoder

Extended by:
Unidecoder
Included in:
Unidecoder
Defined in:
lib/unidecoder.rb,
lib/unidecoder/version.rb

Overview

Utilities for transliterating UTF-8 strings to ASCII.

Defined Under Namespace

Modules: StringExtensions, Version

Constant Summary

CODEPOINTS =

Maps each codepoint group to its transliteration table, loading the group's YAML file on first access.

Hash.new { |h, k|
  h[k] = YAML::load_file(File.expand_path("../unidecoder/data/#{k}.yml", __FILE__))
}
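
The default-block pattern used by CODEPOINTS can be tried without the YAML data files. The following is a minimal sketch: `loads`, `fake_load`, and `table` are hypothetical stand-ins introduced for illustration, and `fake_load` replaces the `YAML::load_file` call so that the caching behaviour is visible.

```ruby
# Hypothetical stand-in for YAML::load_file: it records every call so we
# can see how often a "file" is actually loaded.
loads = []
fake_load = lambda do |group|
  loads << group
  %w[a b c] # pretend this is the parsed content of "#{group}.yml"
end

# Same shape as CODEPOINTS: the block runs only for a missing key and
# stores its result, so each group is loaded at most once.
table = Hash.new { |h, k| h[k] = fake_load.call(k) }

table["x4f"][1] # first access triggers a load
table["x4f"][2] # second access is served from the hash

puts loads.inspect
```

Because the block assigns `h[k]`, later lookups for the same group never reach the loader again.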

Instance Method Summary

Instance Method Details

#code_group(unpacked_character) ⇒ Object

Returns the Unicode codepoint grouping for the given character.



# File 'lib/unidecoder.rb', line 86

def code_group(unpacked_character)
  "x%02x" % (unpacked_character >> 8)
end
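
As a worked example of the grouping, the sketch below copies `code_group` from the listing above and applies it to two characters. The choice of sample characters is arbitrary.

```ruby
# Copied from the listing above so the example runs on its own.
def code_group(unpacked_character)
  "x%02x" % (unpacked_character >> 8)
end

# "\u00FC" is u-umlaut (U+00FC); "\u4F60" is the first character of the
# "Ni Hao" example (U+4F60).
["\u00FC", "\u4F60"].each do |char|
  codepoint = char.unpack("U")[0]
  puts "U+%04X -> %s" % [codepoint, code_group(codepoint)]
end
```

Shifting right by 8 bits discards the low byte, so all 256 codepoints that share the same high bits fall into one group.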

#decode(string, overrides = nil) ⇒ String

Transliterates UTF-8 characters to ASCII. A character whose lookup raises an error is replaced with "?".

Examples:


Unidecoder.decode("你好")                        #=> "Ni Hao"
Unidecoder.decode("Jürgen Müller", "ü" => "ue")  #=> "Juergen Mueller"
Unidecoder.decode("feliz año", "ñ" => "ni")      #=> "feliz anio"

Parameters:

  • string (#to_s)

    The string or string-like object to transliterate.

  • overrides (Hash) (defaults to: nil)

    A Hash mapping UTF-8 characters to the ASCII replacements to use in place of the defaults. This can be used for language-specific transliterations.

Returns:

  • (String)

    The transliterated string.



# File 'lib/unidecoder.rb', line 35

def decode(string, overrides = nil)
  validate_utf8!(string)
  normalize(string.to_s).gsub(/[^\x00-\x7f]/u) do |char|
    begin
      decode_overridden(char, overrides) or decode_char(char)
    rescue
      "?"
    end
  end
end
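
The override-then-default order can be illustrated without the library's data files. This is a simplified sketch, not the library's implementation: `defaults` is a hypothetical two-entry table standing in for CODEPOINTS, `transliterate` is a hypothetical name, and a missing entry is mapped to "?" directly rather than through `rescue`.

```ruby
# Hypothetical default table; the real method looks characters up in CODEPOINTS.
defaults = { "\u00FC" => "u", "\u00F1" => "n" }

# For every non-ASCII character: try the overrides first, then the
# defaults, and fall back to "?" when neither has an entry.
transliterate = lambda do |string, overrides = nil|
  string.gsub(/[^\x00-\x7f]/) do |char|
    (overrides && overrides[char]) || defaults[char] || "?"
  end
end

puts transliterate.call("J\u00FCrgen", { "\u00FC" => "ue" }) # override wins
puts transliterate.call("J\u00FCrgen")                       # default is used
puts transliterate.call("\u4F60")                            # no entry in this tiny table
```

ASCII characters never reach the block, so they pass through unchanged.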

#decode_char(char) ⇒ Object



# File 'lib/unidecoder.rb', line 76

def decode_char(char)
  unpacked = char.unpack("U")[0]
  CODEPOINTS[code_group(unpacked)][grouped_point(unpacked)]
end

#decode_overridden(char, overrides) ⇒ Object



# File 'lib/unidecoder.rb', line 81

def decode_overridden(char, overrides)
  overrides[char] if overrides
end

#define_normalize(library = nil, &block) ⇒ Object



# File 'lib/unidecoder.rb', line 63

def define_normalize(library = nil, &block)
  return if method_defined? :normalize
  begin
    require library if library
    define_method(:normalize, &block)
  rescue LoadError
  end
end
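
Read from the listing, `define_normalize` appears to implement a "first candidate that loads wins" pattern: a failed `require` is swallowed, and once `normalize` exists later calls do nothing. The sketch below reproduces that logic in a throwaway module. The module `demo`, the library name `no_such_library_for_demo`, and the block bodies are all hypothetical.

```ruby
demo = Module.new do
  # Same logic as the listing above, written as a singleton method so it
  # can be called on the module directly.
  def self.define_normalize(library = nil, &block)
    return if method_defined?(:normalize)
    begin
      require library if library
      define_method(:normalize, &block)
    rescue LoadError
      # Library unavailable: leave normalize undefined for the next candidate.
    end
  end
end

demo.define_normalize("no_such_library_for_demo") { |s| "never:#{s}" } # skipped: LoadError
demo.define_normalize { |s| "fallback:#{s}" }                          # defined
demo.define_normalize { |s| "ignored:#{s}" }                           # no-op: already defined

demo.extend(demo)
puts demo.normalize("abc")
```

Ordering the candidates from most to least preferred means the final, library-free block acts as a guaranteed fallback.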

#encode(codepoint) ⇒ Object

Returns the UTF-8 character for the given Unicode codepoint, passed as a hexadecimal string.



# File 'lib/unidecoder.rb', line 51

def encode(codepoint)
  [codepoint.to_i(16)].pack("U")
end
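
Since `encode` calls `to_i(16)`, the argument is expected to respond to that call, which in practice means a hexadecimal string. A short round trip, with `encode` copied from the listing above:

```ruby
# Copied from the listing above so the example runs on its own.
def encode(codepoint)
  [codepoint.to_i(16)].pack("U")
end

char = encode("4f60")
puts char                        # the character at U+4F60
puts char.unpack("U")[0].to_s(16) # back to the hexadecimal codepoint
```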

#grouped_point(unpacked_character) ⇒ Object

Returns the index of the given character in the YAML file for its codepoint group.



# File 'lib/unidecoder.rb', line 91

def grouped_point(unpacked_character)
  unpacked_character & 255
end
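
The group and the index are the two halves of one codepoint: the high bits select the YAML file and the low byte selects the entry inside it. A small worked example, using U+4F60 as an arbitrary sample:

```ruby
codepoint = "\u4F60".unpack("U")[0] # 0x4F60

group = codepoint >> 8  # high bits: which YAML file (0x4f)
index = codepoint & 255 # low byte: position inside that file (0x60 = 96)

puts "group=x%02x index=%d" % [group, index]

# The two parts recombine into the original codepoint.
recombined = (group << 8) | index
puts recombined == codepoint
```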

#in_yaml_file(character) ⇒ Object

Returns a string indicating which file (and line) contains the transliteration value for the character. This is useful only for development.



# File 'lib/unidecoder.rb', line 58

def in_yaml_file(character)
  unpacked = character.unpack("U")[0]
  "#{code_group(unpacked)}.yml (line #{grouped_point(unpacked) + 2})"
end
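
For a concrete result, the sketch below copies the three helpers from the listings in this document and runs `in_yaml_file` on U+4F60. The `+ 2` offset presumably accounts for 1-based line numbers plus one header line in each YAML file; that reading is an assumption, as the data files are not shown here.

```ruby
# Helpers copied from the listings in this document so the example is
# self-contained.
def code_group(unpacked_character)
  "x%02x" % (unpacked_character >> 8)
end

def grouped_point(unpacked_character)
  unpacked_character & 255
end

def in_yaml_file(character)
  unpacked = character.unpack("U")[0]
  "#{code_group(unpacked)}.yml (line #{grouped_point(unpacked) + 2})"
end

# U+4F60: group "x4f", index 0x60 = 96, so the reported line is 98.
puts in_yaml_file("\u4F60")
```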

#validate_utf8!(string) ⇒ Object



# File 'lib/unidecoder.rb', line 46

def validate_utf8!(string)
  string.unpack("U*")
end
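
`validate_utf8!` relies on `String#unpack` with the "U*" directive, which raises `ArgumentError` when it meets a malformed UTF-8 sequence. In `decode` the call sits outside the `rescue` block, so that error would reach the caller. A minimal sketch, with `validate_utf8!` copied from the listing above and `utf8_ok` as a hypothetical helper:

```ruby
# Copied from the listing above so the example runs on its own.
def validate_utf8!(string)
  string.unpack("U*")
end

# Hypothetical helper that turns the exception into a boolean.
utf8_ok = lambda do |string|
  begin
    validate_utf8!(string)
    true
  rescue ArgumentError
    false
  end
end

puts utf8_ok.call("J\u00FCrgen") # well-formed UTF-8
puts utf8_ok.call("\xFF")        # 0xFF can never start a UTF-8 sequence
```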