Module: Unidecoder
- Extended by: Unidecoder
- Included in: Unidecoder
- Defined in: lib/unidecoder.rb, lib/unidecoder/version.rb
Overview
Utilities for transliterating UTF-8 strings to ASCII.
Defined Under Namespace
Modules: StringExtensions, Version
Constant Summary
- CODEPOINTS =
  Contains Unicode codepoints, loading as needed from YAML files.

    Hash.new { |h, k| h[k] = YAML::load_file(File.expand_path("../unidecoder/data/#{k}.yml", __FILE__)) }
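The lazy loading described for CODEPOINTS relies on the default block of Hash.new, which runs only on the first lookup of a missing key and can store its result back into the hash. The following is a minimal sketch of that pattern; LAZY_TABLE, LOADS, and the generated array contents are hypothetical stand-ins for the real YAML files, which are not read here.

```ruby
# Records which groups have been "loaded" (hypothetical bookkeeping).
LOADS = []

# Stand-in for the YAML-backed table: the block runs only when a key is
# missing, and assigning hash[group] caches the result for later lookups.
LAZY_TABLE = Hash.new do |hash, group|
  LOADS << group
  hash[group] = Array.new(256) { |i| "#{group}:#{i}" }
end

LAZY_TABLE["x00"][233]  # first access triggers the block
LAZY_TABLE["x00"][65]   # second access is served from the cache
LOADS                   # => ["x00"]
```

Because the block assigns to the hash, each group is loaded at most once per process.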
Instance Method Summary
- #code_group(unpacked_character) ⇒ Object
  Returns the Unicode codepoint grouping for the given character.
- #decode(string, overrides = nil) ⇒ String
  Transliterates UTF-8 characters to ASCII.
- #decode_char(char) ⇒ Object
- #decode_overridden(char, overrides) ⇒ Object
- #define_normalize(library = nil, &block) ⇒ Object
- #encode(codepoint) ⇒ Object
  Returns a UTF-8 character for the given Unicode codepoint, passed as a hexadecimal string.
- #grouped_point(unpacked_character) ⇒ Object
  Returns the index of the given character in the YAML file for its codepoint group.
- #in_yaml_file(character) ⇒ Object
  Returns a string indicating which file (and line) contains the transliteration value for the character.
- #validate_utf8!(string) ⇒ Object
Instance Method Details
#code_group(unpacked_character) ⇒ Object
Returns the Unicode codepoint grouping for the given character.

    # File 'lib/unidecoder.rb', line 86
    def code_group(unpacked_character)
      "x%02x" % (unpacked_character >> 8)
    end
#decode(string, overrides = nil) ⇒ String
Transliterates UTF-8 characters to ASCII.

    # File 'lib/unidecoder.rb', line 35
    def decode(string, overrides = nil)
      validate_utf8!(string)
      normalize(string.to_s).gsub(/[^\x00-\x7f]/u) do |char|
        begin
          decode_overridden(char, overrides) or decode_char(char)
        rescue
          "?"
        end
      end
    end
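The decode listing replaces every non-ASCII character through gsub, preferring a caller-supplied override, then the built-in table, and falling back to "?" when the lookup raises. The sketch below mirrors that control flow in isolation. MINI_TABLE and mini_decode are hypothetical names, and the three-entry table stands in for the YAML data, so this illustrates the pattern rather than the library's behavior.

```ruby
# Hypothetical miniature table; the real module loads its data from YAML.
MINI_TABLE = {
  "\u00e9" => "e",   # e with acute accent
  "\u00fc" => "u",   # u with diaeresis
  "\u00df" => "ss"   # sharp s
}.freeze

# Replaces each non-ASCII character: overrides win, then the table,
# and any failed lookup becomes "?".
def mini_decode(string, overrides = nil)
  string.to_s.gsub(/[^\x00-\x7f]/u) do |char|
    begin
      (overrides && overrides[char]) || MINI_TABLE.fetch(char)
    rescue KeyError
      "?"
    end
  end
end

mini_decode("caf\u00e9")                          # => "cafe"
mini_decode("\u00fcber", "\u00fc" => "ue")        # => "ueber"
mini_decode("\u65e5")                             # => "?"
```

Passing overrides as a hash keyed by the original character lets a caller adjust individual transliterations without touching the shared table.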
#decode_char(char) ⇒ Object

    # File 'lib/unidecoder.rb', line 76
    def decode_char(char)
      unpacked = char.unpack("U")[0]
      CODEPOINTS[code_group(unpacked)][grouped_point(unpacked)]
    end
#decode_overridden(char, overrides) ⇒ Object

    # File 'lib/unidecoder.rb', line 81
    def decode_overridden(char, overrides)
      overrides[char] if overrides
    end
#define_normalize(library = nil, &block) ⇒ Object

    # File 'lib/unidecoder.rb', line 63
    def define_normalize(library = nil, &block)
      return if method_defined? :normalize
      begin
        require library if library
        define_method(:normalize, &block)
      rescue LoadError
      end
    end
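The define_normalize listing defines normalize only once, and only if the optional library can be required, so several candidate implementations can be tried in order of preference. The sketch below reproduces that pattern in a standalone module. NormalizeDemo and the library name "hypothetical_missing_library" are made up for illustration, and the choice of String#unicode_normalize as the fallback is an assumption, not something the page states about the real module.

```ruby
module NormalizeDemo
  extend self

  # Defines #normalize from the block unless it already exists.
  # A LoadError from the optional library leaves #normalize undefined,
  # so a later call can supply a fallback.
  def define_normalize(library = nil, &block)
    return if method_defined?(:normalize)
    begin
      require library if library
      define_method(:normalize, &block)
    rescue LoadError
      # Library unavailable: skip this candidate.
    end
  end

  # First candidate depends on a library that does not exist, so it is skipped.
  define_normalize("hypothetical_missing_library") { |str| str.upcase }
  # Second candidate needs no library and becomes the definition.
  define_normalize { |str| str.unicode_normalize(:nfkd) }
  # Third candidate is ignored because #normalize is already defined.
  define_normalize { |str| str.reverse }
end

NormalizeDemo.normalize("\u00e9")  # => "e\u0301" (e followed by a combining acute accent)
```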
#encode(codepoint) ⇒ Object
Returns a UTF-8 character for the given Unicode codepoint, passed as a hexadecimal string.

    # File 'lib/unidecoder.rb', line 51
    def encode(codepoint)
      [codepoint.to_i(16)].pack("U")
    end
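Because encode calls to_i(16) on its argument, the codepoint is expected as a hexadecimal string. A short standalone sketch of the same conversion, using the hypothetical name encode_hex so it does not depend on the module being loaded:

```ruby
# Converts a hexadecimal codepoint string to a one-character UTF-8 string.
def encode_hex(codepoint)
  [codepoint.to_i(16)].pack("U")
end

encode_hex("41")    # => "A"
encode_hex("00e9")  # => "\u00e9"
encode_hex("4f60")  # => "\u4f60"

# unpack("U") reverses the conversion and yields the integer codepoint.
encode_hex("00e9").unpack("U")[0].to_s(16)  # => "e9"
```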
#grouped_point(unpacked_character) ⇒ Object
Returns the index of the given character in the YAML file for its codepoint group.

    # File 'lib/unidecoder.rb', line 91
    def grouped_point(unpacked_character)
      unpacked_character & 255
    end
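Taken together, code_group and grouped_point split a codepoint into a high part (everything above the low 8 bits, formatted as "x" plus hex) and a low part (the low 8 bits, used as an index into a 256-entry group). The sketch below combines both steps in one hypothetical helper, split_codepoint, to show the arithmetic on concrete characters.

```ruby
# Returns [group_name, index] for a single character, using the same
# arithmetic as code_group (>> 8) and grouped_point (& 255).
def split_codepoint(char)
  codepoint = char.unpack("U")[0]
  group = "x%02x" % (codepoint >> 8)
  index = codepoint & 255
  [group, index]
end

split_codepoint("\u00e9")      # => ["x00", 233]  (U+00E9)
split_codepoint("\u4f60")      # => ["x4f", 96]   (U+4F60)
split_codepoint("\u{1f600}")   # => ["x1f6", 0]   (U+1F600)
```

The "%02x" format pads to at least two hex digits, so groups beyond the Basic Multilingual Plane simply get longer names.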
#in_yaml_file(character) ⇒ Object
Returns a string indicating which file (and line) contains the transliteration value for the character. This is useful only for development.

    # File 'lib/unidecoder.rb', line 58
    def in_yaml_file(character)
      unpacked = character.unpack("U")[0]
      "#{code_group(unpacked)}.yml (line #{grouped_point(unpacked) + 2})"
    end
#validate_utf8!(string) ⇒ Object

    # File 'lib/unidecoder.rb', line 46
    def validate_utf8!(string)
      string.unpack("U*")
    end
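validate_utf8! has no explicit check: it relies on unpack("U*") raising ArgumentError when the bytes are not well-formed UTF-8. The sketch below wraps that behavior in a hypothetical predicate, utf8_valid?, to make the effect visible.

```ruby
# Returns true when every byte sequence decodes as UTF-8, false when
# unpack("U*") raises ArgumentError on malformed input.
def utf8_valid?(string)
  string.unpack("U*")
  true
rescue ArgumentError
  false
end

utf8_valid?("h\u00e9llo")  # => true
utf8_valid?("abc\xFF")     # => false
```

In the listing above, the exception is not rescued, so decode propagates it to the caller before any transliteration happens.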