Class: Multibyte::Handlers::UTF8Handler

Inherits:
Object
Defined in:
lib/multibyte/handlers/utf8_handler.rb

Overview

UTF8Handler implements Unicode-aware operations for strings. The Chars proxy uses these operations when $KCODE is set to ‘UTF8’.

Direct Known Subclasses

UTF8HandlerProc

Constant Summary

HANGUL_SBASE =

Hangul character boundaries and properties

0xAC00
HANGUL_LBASE =
0x1100
HANGUL_VBASE =
0x1161
HANGUL_TBASE =
0x11A7
HANGUL_LCOUNT =
19
HANGUL_VCOUNT =
21
HANGUL_TCOUNT =
28
HANGUL_NCOUNT =
HANGUL_VCOUNT * HANGUL_TCOUNT
HANGUL_SCOUNT =
11172
HANGUL_SLAST =
HANGUL_SBASE + HANGUL_SCOUNT
HANGUL_JAMO_FIRST =
0x1100
HANGUL_JAMO_LAST =
0x11FF
UNICODE_WHITESPACE =

All Unicode whitespace codepoints

[
  (0x0009..0x000D).to_a,  # White_Space # Cc   [5] <control-0009>..<control-000D>
  0x0020,          # White_Space # Zs       SPACE
  0x0085,          # White_Space # Cc       <control-0085>
  0x00A0,          # White_Space # Zs       NO-BREAK SPACE
  0x1680,          # White_Space # Zs       OGHAM SPACE MARK
  0x180E,          # White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
  (0x2000..0x200A).to_a, # White_Space # Zs  [11] EN QUAD..HAIR SPACE
  0x2028,          # White_Space # Zl       LINE SEPARATOR
  0x2029,          # White_Space # Zp       PARAGRAPH SEPARATOR
  0x202F,          # White_Space # Zs       NARROW NO-BREAK SPACE
  0x205F,          # White_Space # Zs       MEDIUM MATHEMATICAL SPACE
  0x3000,          # White_Space # Zs       IDEOGRAPHIC SPACE
].flatten.freeze
UNICODE_LEADERS_AND_TRAILERS =

The BOM (byte order mark) can also be treated as whitespace. It is a non-rendering character used to distinguish between little-endian and big-endian encodings. Byte order is not an issue in UTF-8, so the BOM must be ignored.

UNICODE_WHITESPACE + [65279]
UTF8_PAT =

Borrowed from the Kconv library by Shinji KONO - (also as seen on the W3C site)

/\A(?:
 [\x00-\x7f]                                     |
 [\xc2-\xdf] [\x80-\xbf]                         |
 \xe0        [\xa0-\xbf] [\x80-\xbf]             |
 [\xe1-\xef] [\x80-\xbf] [\x80-\xbf]             |
 \xf0        [\x90-\xbf] [\x80-\xbf] [\x80-\xbf] |
 [\xf1-\xf3] [\x80-\xbf] [\x80-\xbf] [\x80-\xbf] |
 \xf4        [\x80-\x8f] [\x80-\xbf] [\x80-\xbf]
)*\z/xn
UNICODE_TRAILERS_PAT =
/(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+\Z/
UNICODE_LEADERS_PAT =
/\A(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+/
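
The Hangul constants above match the arithmetic that the Unicode Standard gives for decomposing a precomposed Hangul syllable into its conjoining jamo. The following is a minimal sketch of that arithmetic, not the handler's own code: the method name decompose_hangul_syllable is hypothetical, and how the handler actually uses these constants is an assumption, since its decomposition helpers are not shown on this page.

```ruby
# Values copied from the constant summary above.
HANGUL_SBASE  = 0xAC00
HANGUL_LBASE  = 0x1100
HANGUL_VBASE  = 0x1161
HANGUL_TBASE  = 0x11A7
HANGUL_VCOUNT = 21
HANGUL_TCOUNT = 28
HANGUL_NCOUNT = HANGUL_VCOUNT * HANGUL_TCOUNT # 588
HANGUL_SCOUNT = 11172
HANGUL_SLAST  = HANGUL_SBASE + HANGUL_SCOUNT  # exclusive upper bound

# Hypothetical helper: splits one precomposed syllable into jamo codepoints.
# Codepoints outside the syllable block are returned unchanged.
def decompose_hangul_syllable(codepoint)
  return [codepoint] unless codepoint >= HANGUL_SBASE && codepoint < HANGUL_SLAST

  s_index  = codepoint - HANGUL_SBASE
  leading  = HANGUL_LBASE + s_index / HANGUL_NCOUNT
  vowel    = HANGUL_VBASE + (s_index % HANGUL_NCOUNT) / HANGUL_TCOUNT
  trailing = HANGUL_TBASE + s_index % HANGUL_TCOUNT

  # A trailing value equal to HANGUL_TBASE means "no trailing consonant".
  trailing == HANGUL_TBASE ? [leading, vowel] : [leading, vowel, trailing]
end

decompose_hangul_syllable(0xD55C) # => [0x1112, 0x1161, 0x11AB]
decompose_hangul_syllable(0xAC00) # => [0x1100, 0x1161]
```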

Class Method Details

.[]=(str, *args) ⇒ Object

Works just like the indexed replace method on string, except instead of byte offsets you specify character offsets.

Example:

s = "Müller"
s.chars[2] = "e" # Replace character with offset 2
s # => "Müeler"

s = "Müller"
s.chars[1, 2] = "ö" # Replace 2 characters at character offset 1
s # => "Möler"


# File 'lib/multibyte/handlers/utf8_handler.rb', line 155

def []=(str, *args)
  replace_by = args.pop
  # Indexed replace with regular expressions already works
  return str[*args] = replace_by if args.first.is_a?(Regexp)
  result = u_unpack(str)
  if args[0].is_a?(Fixnum)
    raise IndexError, "index #{args[0]} out of string" if args[0] >= result.length
    min = args[0]
    max = args[1].nil? ? min : (min + args[1] - 1)
    range = Range.new(min, max)
    replace_by = [replace_by].pack('U') if replace_by.is_a?(Fixnum)
  elsif args.first.is_a?(Range)
    raise RangeError, "#{args[0]} out of range" if args[0].min >= result.length
    range = args[0]
  else
    needle = args[0].to_s
    min = index(str, needle)
    max = min + length(needle) - 1
    range = Range.new(min, max)
  end
  result[range] = u_unpack(replace_by)
  str.replace(result.pack('U*'))
end
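
The core of the integer-offset branch above is "unpack to codepoints, assign into the array, pack again". A minimal standalone sketch of that idea follows. The name replace_chars is hypothetical, and unlike the handler it returns a new string instead of modifying its argument.

```ruby
# Hypothetical helper: replaces `length` characters starting at character
# offset `offset`, working on codepoints rather than bytes.
def replace_chars(str, offset, length, replacement)
  codepoints = str.unpack('U*')
  raise IndexError, "index #{offset} out of string" if offset >= codepoints.length

  codepoints[offset, length] = replacement.unpack('U*')
  codepoints.pack('U*')
end

replace_chars("Müller", 2, 1, "e") # => "Müeler"
replace_chars("Müller", 1, 2, "ö") # => "Möler"
```

Because "ü" occupies two bytes in UTF-8, the same offsets applied to bytes would cut the character in half.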

.capitalize(str) ⇒ Object

Returns a copy of str with the first character converted to uppercase and the remainder to lowercase



# File 'lib/multibyte/handlers/utf8_handler.rb', line 273

def capitalize(str)
  upcase(slice(str, 0..0)) + downcase(slice(str, 1..-1) || '')
end

.center(str, integer, padstr = ' ') ⇒ Object

Works just like String#center, only integer specifies characters instead of bytes.

Example:

"¾ cup".chars.center(8).to_s
# => " ¾ cup  "

"¾ cup".chars.center(8, " ").to_s # Use non-breaking whitespace
# => " ¾ cup  "


# File 'lib/multibyte/handlers/utf8_handler.rb', line 214

def center(str, integer, padstr=' ')
  justify(str, integer, :center, padstr)
end
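
center, ljust and rjust all delegate to a justify helper whose source is not shown on this page. The sketch below is an assumption about what such a helper could look like, not the handler's implementation: it measures the string in codepoints and splits the padding the way String#center does (the left side gets the smaller half).

```ruby
# Hypothetical justify: pads `str` to `width` characters (codepoints).
# `way` is :left, :right or :center, mirroring the symbols used above.
def justify(str, width, way, padstr = ' ')
  raise ArgumentError, 'zero width padding' if padstr.empty?

  missing = width - str.unpack('U*').size
  return str.dup if missing <= 0

  left = case way
         when :right  then missing      # rjust: all padding on the left
         when :left   then 0            # ljust: all padding on the right
         when :center then missing / 2
         else raise ArgumentError, "unknown justification #{way.inspect}"
         end
  right = missing - left

  # Repeat the padding string codepoint by codepoint up to the needed length.
  pad_codepoints = padstr.unpack('U*')
  pad = ->(n) { Array.new(n) { |i| pad_codepoints[i % pad_codepoints.size] }.pack('U*') }

  pad.call(left) + str + pad.call(right)
end

justify("¾ cup", 8, :center) # => " ¾ cup  "
justify("¾ cup", 8, :right)  # => "   ¾ cup"
justify("¾ cup", 8, :left)   # => "¾ cup   "
```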

.codepoints_to_pattern(array_of_codepoints) ⇒ Object

Returns a regular expression pattern that matches the passed Unicode codepoints



# File 'lib/multibyte/handlers/utf8_handler.rb', line 115

def self.codepoints_to_pattern(array_of_codepoints) #:nodoc:
  array_of_codepoints.collect{ |e| [e].pack 'U*' }.join('|') 
end

.compose(str) ⇒ Object

Perform composition on the characters in the string



# File 'lib/multibyte/handlers/utf8_handler.rb', line 311

def compose(str)
  # Compose the unpacked codepoints first, then pack the result into a string
  compose_codepoints(u_unpack(str)).pack('U*')
end

.consumes?(str) ⇒ Boolean

Checks if the string is valid UTF-8.

Returns:

  • (Boolean)


# File 'lib/multibyte/handlers/utf8_handler.rb', line 341

def consumes?(str)
  # Unpack is a little bit faster than regular expressions
  begin
    str.unpack('U*')
    true
  rescue ArgumentError
    false
  end
end
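
The same unpack-based check can be tried in isolation. This is a small sketch with a hypothetical method name, valid_utf8?. It relies on unpack('U*') raising ArgumentError when it meets a malformed sequence.

```ruby
# Hypothetical standalone version of the check above.
def valid_utf8?(str)
  str.unpack('U*')
  true
rescue ArgumentError
  false
end

valid_utf8?("Müller")  # => true
valid_utf8?("caf\xE9") # => false, 0xE9 starts a sequence that is cut short
valid_utf8?("\xFF")    # => false, 0xFF is never a valid lead byte
```

Later Ruby versions also offer String#valid_encoding? as a built-in validity check.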

.decompose(str) ⇒ Object

Perform decomposition on the characters in the string



# File 'lib/multibyte/handlers/utf8_handler.rb', line 306

def decompose(str)
  decompose_codepoints(:canonical, u_unpack(str)).pack('U*')
end

.downcase(str) ⇒ Object

Convert characters in the string to lowercase



# File 'lib/multibyte/handlers/utf8_handler.rb', line 270

def downcase(str); to_case :lowercase_mapping, str; end

.g_length(str) ⇒ Object

Returns the number of grapheme clusters in the string. This method is very likely to be moved or renamed in future versions.



# File 'lib/multibyte/handlers/utf8_handler.rb', line 353

def g_length(str)
  g_unpack(str).length
end
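
To illustrate how a grapheme cluster count differs from a codepoint count, here is a short sketch. It uses String#grapheme_clusters from later Ruby versions (2.5 and up) as a stand-in for g_unpack, whose source is not shown here. Both method names below are hypothetical.

```ruby
# Counts codepoints, as size/length do in this handler.
def codepoint_length(str)
  str.unpack('U*').size
end

# Counts user-perceived characters using Ruby's built-in segmentation.
def grapheme_length(str)
  str.grapheme_clusters.size
end

decomposed = "e\u0301"       # "e" followed by COMBINING ACUTE ACCENT
codepoint_length(decomposed) # => 2
grapheme_length(decomposed)  # => 1
```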

.index(str, *args) ⇒ Object

Returns the position of the passed argument in the string, counting in codepoints



# File 'lib/multibyte/handlers/utf8_handler.rb', line 138

def index(str, *args)
  bidx = str.index(*args)
  bidx ? (u_unpack(str.slice(0...bidx)).size) : nil
end
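
The method above finds a byte offset and then counts the codepoints that precede it. In Ruby versions where String#index already counts characters, the same technique can be sketched by searching on the binary view of the string. The name codepoint_index is hypothetical.

```ruby
# Hypothetical helper: returns the codepoint position of `needle` in `str`.
def codepoint_index(str, needle)
  byte_index = str.b.index(needle.b) # byte offset of the first match
  return nil unless byte_index

  # Count the codepoints in the bytes that precede the match.
  str.byteslice(0, byte_index).unpack('U*').size
end

codepoint_index("Müller", "l") # => 2 (the byte offset is 3)
codepoint_index("Müller", "x") # => nil
```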

.insert(str, offset, fragment) ⇒ Object

Inserts the passed string at specified codepoint offsets



# File 'lib/multibyte/handlers/utf8_handler.rb', line 128

def insert(str, offset, fragment)
  str.replace(
    u_unpack(str).insert(
      offset,
      u_unpack(fragment)
    ).flatten.pack('U*')
  )
end

.ljust(str, integer, padstr = ' ') ⇒ Object

Works just like String#ljust, only integer specifies characters instead of bytes.

Example:

"¾ cup".chars.ljust(8).to_s
# => "¾ cup   "

"¾ cup".chars.ljust(8, " ").to_s # Use non-breaking whitespace
# => "¾ cup   "


# File 'lib/multibyte/handlers/utf8_handler.rb', line 201

def ljust(str, integer, padstr=' ')
  justify(str, integer, :left, padstr)
end

.lstrip(str) ⇒ Object

Does Unicode-aware lstrip



# File 'lib/multibyte/handlers/utf8_handler.rb', line 224

def lstrip(str)
  str.gsub(UNICODE_LEADERS_PAT, '')
end

.normalize(str, form = Multibyte::DEFAULT_NORMALIZATION_FORM) ⇒ Object

Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.

  • str - The string to perform normalization on.

  • form - The form you want to normalize in. Should be one of the following: :c, :kc, :d, or :kd. Default is Multibyte::DEFAULT_NORMALIZATION_FORM.



# File 'lib/multibyte/handlers/utf8_handler.rb', line 288

def normalize(str, form=Multibyte::DEFAULT_NORMALIZATION_FORM)
  # See http://www.unicode.org/reports/tr15, Table 1
  codepoints = u_unpack(str)
  case form
    when :d
      reorder_characters(decompose_codepoints(:canonical, codepoints))
    when :c
      compose_codepoints reorder_characters(decompose_codepoints(:canonical, codepoints))
    when :kd
      reorder_characters(decompose_codepoints(:compatability, codepoints))
    when :kc
      compose_codepoints reorder_characters(decompose_codepoints(:compatability, codepoints))
    else
      raise ArgumentError, "#{form} is not a valid normalization variant", caller
  end.pack('U*')
end
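
The four form symbols correspond to the standard normalization forms NFC, NFD, NFKC and NFKD. The sketch below shows their effect using String#unicode_normalize, which is built into later Ruby versions. It is a substitute for the handler's own decompose, reorder and compose pipeline, and the :kc default is taken from the description above rather than from the value of Multibyte::DEFAULT_NORMALIZATION_FORM, which is not shown.

```ruby
# Maps the handler's form symbols to the names used by unicode_normalize.
NORMALIZATION_FORMS = { c: :nfc, d: :nfd, kc: :nfkc, kd: :nfkd }.freeze

# Hypothetical stand-in for normalize, built on Ruby's own normalizer.
def normalize(str, form = :kc)
  target = NORMALIZATION_FORMS.fetch(form) do
    raise ArgumentError, "#{form} is not a valid normalization variant"
  end
  str.unicode_normalize(target)
end

normalize("\uFB01")      # => "fi"      (compatibility: the fi ligature is split)
normalize("e\u0301", :c) # => "\u00E9"  (canonical composition)
normalize("\u00E9", :d)  # => "e\u0301" (canonical decomposition)
normalize("\uFB01", :c)  # => "\uFB01"  (canonical forms keep the ligature)
```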

.reverse(str) ⇒ Object

Reverses codepoints in the string.



# File 'lib/multibyte/handlers/utf8_handler.rb', line 240

def reverse(str)
  u_unpack(str).reverse.pack('U*')
end

.rjust(str, integer, padstr = ' ') ⇒ Object

Works just like String#rjust, only integer specifies characters instead of bytes.

Example:

"¾ cup".chars.rjust(8).to_s
# => "   ¾ cup"

"¾ cup".chars.rjust(8, " ").to_s # Use non-breaking whitespace
# => "   ¾ cup"


# File 'lib/multibyte/handlers/utf8_handler.rb', line 188

def rjust(str, integer, padstr=' ')
  justify(str, integer, :right, padstr)
end

.rstrip(str) ⇒ Object

Does Unicode-aware rstrip



# File 'lib/multibyte/handlers/utf8_handler.rb', line 219

def rstrip(str)
  str.gsub(UNICODE_TRAILERS_PAT, '')
end

.size(str) ⇒ Object Also known as: length

Returns the number of codepoints in the string



# File 'lib/multibyte/handlers/utf8_handler.rb', line 234

def size(str)
  u_unpack(str).size
end

.slice(str, *args) ⇒ Object Also known as: []

Implements Unicode-aware slice with codepoints. Slicing with a single integer index returns the codepoint (an integer) at that position; ranges and offset/length pairs return a string.



# File 'lib/multibyte/handlers/utf8_handler.rb', line 246

def slice(str, *args)
  if args.size > 2
    raise ArgumentError, "wrong number of arguments (#{args.size} for 1)" # Do as if we were native
  elsif (args.size == 2 && !(args.first.is_a?(Numeric) || args.first.is_a?(Regexp)))
    raise TypeError, "cannot convert #{args.first.class} into Integer" # Do as if we were native
  elsif (args.size == 2 && !args[1].is_a?(Numeric))
    raise TypeError, "cannot convert #{args[1].class} into Integer" # Do as if we were native
  elsif args[0].kind_of? Range
    cps = u_unpack(str).slice(*args)
    cps.nil? ? nil : cps.pack('U*')
  elsif args[0].kind_of? Regexp
    str.slice(*args)
  elsif args.size == 1 && args[0].kind_of?(Numeric)
    u_unpack(str)[args[0]]
  else
    u_unpack(str).slice(*args).pack('U*')
  end
end
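
The different return types are easiest to see side by side. The following sketch reproduces only the integer, range and offset/length cases of the method above under a hypothetical name, codepoint_slice. It also returns nil for any out-of-range slice, which is a choice of this sketch.

```ruby
# Hypothetical reduced version of slice: no Regexp handling, no type checks.
def codepoint_slice(str, *args)
  codepoints = str.unpack('U*')
  if args.size == 1 && args[0].is_a?(Integer)
    codepoints[args[0]]                 # single index: the codepoint itself
  else
    part = codepoints.slice(*args)      # range or offset/length pair
    part && part.pack('U*')             # nil when out of range
  end
end

codepoint_slice("Müller", 1)      # => 252 (codepoint of "ü")
codepoint_slice("Müller", 1..2)   # => "ül"
codepoint_slice("Müller", 1, 3)   # => "üll"
codepoint_slice("Müller", 10..12) # => nil
```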

.strip(str) ⇒ Object

Removes leading and trailing Unicode whitespace



# File 'lib/multibyte/handlers/utf8_handler.rb', line 229

def strip(str)
  str.gsub(UNICODE_LEADERS_PAT, '').gsub(UNICODE_TRAILERS_PAT, '')
end
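
The strip methods depend on patterns built from the whitespace codepoint list plus the BOM. A self-contained sketch of that construction follows. The constant and method names are hypothetical, and the sketch adds Regexp.escape and non-capturing groups, which the patterns shown in the constant summary do not use.

```ruby
# Codepoints copied from UNICODE_WHITESPACE above, plus the BOM (0xFEFF = 65279).
WHITESPACE_CODEPOINTS = [
  (0x0009..0x000D).to_a, 0x0020, 0x0085, 0x00A0, 0x1680, 0x180E,
  (0x2000..0x200A).to_a, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000
].flatten.freeze
LEADERS_AND_TRAILERS = (WHITESPACE_CODEPOINTS + [0xFEFF]).freeze

# One alternation branch per codepoint, escaped so that characters such as
# the plain space stay literal inside the pattern.
ALTERNATION = LEADERS_AND_TRAILERS.map { |cp| Regexp.escape([cp].pack('U')) }.join('|')
LEADERS_PAT  = /\A(?:#{ALTERNATION})+/
TRAILERS_PAT = /(?:#{ALTERNATION})+\z/

# Hypothetical standalone strip.
def unicode_strip(str)
  str.sub(LEADERS_PAT, '').sub(TRAILERS_PAT, '')
end

unicode_strip("\uFEFF\u00A0 hi \u3000") # => "hi"
```

String#strip on its own leaves the no-break space, the ideographic space and the BOM in place, which is the gap these patterns close.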

.tidy_bytes(str) ⇒ Object

Replaces all non-UTF-8 bytes with their ISO-8859-1 or CP1252 equivalents, resulting in a valid UTF-8 string



# File 'lib/multibyte/handlers/utf8_handler.rb', line 358

def tidy_bytes(str)
  str.split(//u).map do |c|
    if !UTF8_PAT.match(c)
      n = c.unpack('C')[0]
      n < 128 ? n.chr :
      n < 160 ? [UCD.cp1252[n] || n].pack('U') :
      n < 192 ? "\xC2" + n.chr : "\xC3" + (n-64).chr
    else
      c
    end
  end.join
end
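
The same repair strategy can be sketched with the encoding tools of later Ruby versions. This is not the handler's code: it swaps the manual byte arithmetic and the UCD.cp1252 table for String#scrub and Ruby's Windows-1252 transcoder, falling back to ISO-8859-1 for bytes that CP1252 leaves undefined. The method name matches the one above only for readability.

```ruby
# Sketch: reinterpret every invalid byte as CP1252 (or Latin-1 as a fallback).
def tidy_bytes(str)
  str.dup.force_encoding(Encoding::UTF_8).scrub do |invalid|
    invalid.bytes.map do |byte|
      raw = [byte].pack('C')
      begin
        raw.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
      rescue EncodingError
        # Byte has no CP1252 mapping; treat it as ISO-8859-1 instead.
        raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
      end
    end.join
  end
end

tidy_bytes("caf\xE9") # => "café"
tidy_bytes("\x80 5")  # => "€ 5"
```

Valid UTF-8 input passes through unchanged, because scrub only yields the invalid byte sequences to the block.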

.translate_offset(str, byte_offset) ⇒ Object

Used to translate an offset from bytes to characters, for instance one received from a regular expression match



# File 'lib/multibyte/handlers/utf8_handler.rb', line 320

def translate_offset(str, byte_offset)
  return nil if byte_offset.nil?
  return 0 if str == ''
  chunk = str[0..byte_offset]
  begin
    begin
      chunk.unpack('U*').length - 1
    rescue ArgumentError => e
      chunk = str[0..(byte_offset+=1)]
      # Stop retrying at the end of the string
      raise e unless byte_offset < chunk.length 
      # We damaged a character, retry
      retry
    end
  # Catch the ArgumentError so we can throw our own
  rescue ArgumentError 
    raise EncodingError.new('malformed UTF-8 character')
  end
end
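
The retry loop above widens the byte chunk until it no longer ends in the middle of a character. A sketch of the same idea without exceptions for control flow is shown below, using String#byteslice and String#valid_encoding? from later Ruby versions. Raising IndexError for an offset beyond the end of the string is a choice of this sketch, not behavior documented for the handler.

```ruby
# Sketch: translates a byte offset into a character offset.
def translate_offset(str, byte_offset)
  return nil if byte_offset.nil?
  return 0 if str.empty?
  raise IndexError, "byte offset #{byte_offset} out of string" if byte_offset >= str.bytesize

  # Take the bytes up to and including byte_offset, then grow the chunk one
  # byte at a time until it no longer cuts a character in half.
  (byte_offset + 1).upto(str.bytesize) do |length|
    chunk = str.byteslice(0, length)
    return chunk.length - 1 if chunk.valid_encoding?
  end
  raise EncodingError, 'malformed UTF-8 character'
end

translate_offset("Müller", 3) # => 2 (byte 3 is the first "l")
translate_offset("Müller", 1) # => 1 (byte 1 is the first byte of "ü")
translate_offset("Müller", 2) # => 1 (byte 2 is the second byte of "ü")
```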

.upcase(str) ⇒ Object

Convert characters in the string to uppercase



# File 'lib/multibyte/handlers/utf8_handler.rb', line 267

def upcase(str); to_case :uppercase_mapping, str; end