Class: Multibyte::Handlers::UTF8Handler
- Inherits: Object
  - Object
    - Multibyte::Handlers::UTF8Handler
- Defined in: lib/multibyte/handlers/utf8_handler.rb
Overview
UTF8Handler implements Unicode-aware operations for strings. The Chars proxy uses these operations when $KCODE is set to ‘UTF8’.
Direct Known Subclasses
Constant Summary
- HANGUL_SBASE =
Hangul character boundaries and properties
0xAC00
- HANGUL_LBASE =
0x1100
- HANGUL_VBASE =
0x1161
- HANGUL_TBASE =
0x11A7
- HANGUL_LCOUNT =
19
- HANGUL_VCOUNT =
21
- HANGUL_TCOUNT =
28
- HANGUL_NCOUNT =
HANGUL_VCOUNT * HANGUL_TCOUNT
- HANGUL_SCOUNT =
11172
- HANGUL_SLAST =
HANGUL_SBASE + HANGUL_SCOUNT
- HANGUL_JAMO_FIRST =
0x1100
- HANGUL_JAMO_LAST =
0x11FF
- UNICODE_WHITESPACE =
All the unicode whitespace
[
  (0x0009..0x000D).to_a, # White_Space # Cc [5] <control-0009>..<control-000D>
  0x0020,                # White_Space # Zs SPACE
  0x0085,                # White_Space # Cc <control-0085>
  0x00A0,                # White_Space # Zs NO-BREAK SPACE
  0x1680,                # White_Space # Zs OGHAM SPACE MARK
  0x180E,                # White_Space # Zs MONGOLIAN VOWEL SEPARATOR
  (0x2000..0x200A).to_a, # White_Space # Zs [11] EN QUAD..HAIR SPACE
  0x2028,                # White_Space # Zl LINE SEPARATOR
  0x2029,                # White_Space # Zp PARAGRAPH SEPARATOR
  0x202F,                # White_Space # Zs NARROW NO-BREAK SPACE
  0x205F,                # White_Space # Zs MEDIUM MATHEMATICAL SPACE
  0x3000,                # White_Space # Zs IDEOGRAPHIC SPACE
].flatten.freeze
- UNICODE_LEADERS_AND_TRAILERS =
The BOM (byte order mark) can also be treated as whitespace: it is a non-rendering character used to distinguish between little-endian and big-endian byte order. Byte order is not an issue in UTF-8, so the BOM must be ignored.
UNICODE_WHITESPACE + [65279]
- UTF8_PAT =
Borrowed from the Kconv library by Shinji KONO - (also as seen on the W3C site)
/\A(?:
  [\x00-\x7f]                                     |
  [\xc2-\xdf] [\x80-\xbf]                         |
  \xe0        [\xa0-\xbf] [\x80-\xbf]             |
  [\xe1-\xef] [\x80-\xbf] [\x80-\xbf]             |
  \xf0        [\x90-\xbf] [\x80-\xbf] [\x80-\xbf] |
  [\xf1-\xf3] [\x80-\xbf] [\x80-\xbf] [\x80-\xbf] |
  \xf4        [\x80-\x8f] [\x80-\xbf] [\x80-\xbf]
)*\z/xn
- UNICODE_TRAILERS_PAT =
/(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+\Z/
- UNICODE_LEADERS_PAT =
/\A(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+/
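The Hangul constants in the summary above match the values the Unicode Standard uses for arithmetic decomposition of precomposed Hangul syllables. The following is a minimal sketch of that arithmetic, assuming the standard algorithm; the `HangulSketch` module and its `decompose` method are hypothetical names for illustration and are not part of the handler.

```ruby
# Hypothetical module for illustration; constant values copied from the summary above.
module HangulSketch
  HANGUL_SBASE  = 0xAC00
  HANGUL_LBASE  = 0x1100
  HANGUL_VBASE  = 0x1161
  HANGUL_TBASE  = 0x11A7
  HANGUL_VCOUNT = 21
  HANGUL_TCOUNT = 28
  HANGUL_NCOUNT = HANGUL_VCOUNT * HANGUL_TCOUNT # 588
  HANGUL_SCOUNT = 11172

  # Split one precomposed syllable codepoint into its jamo codepoints.
  # Codepoints outside the syllable block are returned unchanged.
  def self.decompose(codepoint)
    s_index = codepoint - HANGUL_SBASE
    return [codepoint] unless (0...HANGUL_SCOUNT).cover?(s_index)

    l = HANGUL_LBASE + s_index / HANGUL_NCOUNT
    v = HANGUL_VBASE + (s_index % HANGUL_NCOUNT) / HANGUL_TCOUNT
    t = HANGUL_TBASE + s_index % HANGUL_TCOUNT
    # A trailing index of zero means the syllable has no final consonant.
    t == HANGUL_TBASE ? [l, v] : [l, v, t]
  end
end

# U+D55C (HAN) splits into U+1112, U+1161 and U+11AB.
jamo = HangulSketch.decompose(0xD55C)
```

Note that HANGUL_SLAST, as defined above, is one past the last syllable codepoint, which is why the sketch uses an exclusive range.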
Class Method Summary
-
.[]=(str, *args) ⇒ Object
Works just like the indexed replace method on string, except instead of byte offsets you specify character offsets.
-
.capitalize(str) ⇒ Object
Returns a copy of str with the first character converted to uppercase and the remainder to lowercase.
-
.center(str, integer, padstr = ' ') ⇒ Object
Works just like String#center, only integer specifies characters instead of bytes.
-
.codepoints_to_pattern(array_of_codepoints) ⇒ Object
Returns a regular expression pattern that matches the passed Unicode codepoints.
-
.compose(str) ⇒ Object
Perform composition on the characters in the string.
-
.consumes?(str) ⇒ Boolean
Checks if the string is valid UTF8.
-
.decompose(str) ⇒ Object
Perform decomposition on the characters in the string.
-
.downcase(str) ⇒ Object
Convert characters in the string to lowercase.
-
.g_length(str) ⇒ Object
Returns the number of grapheme clusters in the string.
-
.index(str, *args) ⇒ Object
Returns the position of the passed argument in the string, counting in codepoints.
-
.insert(str, offset, fragment) ⇒ Object
Inserts the passed string at the specified codepoint offset.
-
.ljust(str, integer, padstr = ' ') ⇒ Object
Works just like String#ljust, only integer specifies characters instead of bytes.
-
.lstrip(str) ⇒ Object
Does Unicode-aware lstrip.
-
.normalize(str, form = Multibyte::DEFAULT_NORMALIZATION_FORM) ⇒ Object
Returns the KC normalization of the string by default.
-
.reverse(str) ⇒ Object
Reverses codepoints in the string.
-
.rjust(str, integer, padstr = ' ') ⇒ Object
Works just like String#rjust, only integer specifies characters instead of bytes.
-
.rstrip(str) ⇒ Object
Does Unicode-aware rstrip.
-
.size(str) ⇒ Object
(also: length)
Returns the number of codepoints in the string.
-
.slice(str, *args) ⇒ Object
(also: [])
Implements Unicode-aware slice with codepoints.
-
.strip(str) ⇒ Object
Removes leading and trailing whitespace.
-
.tidy_bytes(str) ⇒ Object
Replaces all non-UTF-8 bytes with their ISO-8859-1 or CP1252 equivalents, resulting in a valid UTF-8 string.
-
.translate_offset(str, byte_offset) ⇒ Object
Used to translate an offset from bytes to characters, for instance one received from a regular expression match.
-
.upcase(str) ⇒ Object
Convert characters in the string to uppercase.
Class Method Details
.[]=(str, *args) ⇒ Object
Works just like the indexed replace method on string, except instead of byte offsets you specify character offsets.
Example:
s = "Müller"
s.chars[2] = "e" # Replace character with offset 2
s # => "Müeler"
s = "Müller"
s.chars[1, 2] = "ö" # Replace 2 characters at character offset 1
s # => "Möler"
# File 'lib/multibyte/handlers/utf8_handler.rb', line 155
def []=(str, *args)
  replace_by = args.pop
  # Indexed replace with regular expressions already works
  return str[*args] = replace_by if args.first.is_a?(Regexp)
  result = u_unpack(str)
  if args[0].is_a?(Fixnum)
    raise IndexError, "index #{args[0]} out of string" if args[0] >= result.length
    min = args[0]
    max = args[1].nil? ? min : (min + args[1] - 1)
    range = Range.new(min, max)
    replace_by = [replace_by].pack('U') if replace_by.is_a?(Fixnum)
  elsif args.first.is_a?(Range)
    raise RangeError, "#{args[0]} out of range" if args[0].min >= result.length
    range = args[0]
  else
    needle = args[0].to_s
    min = index(str, needle)
    max = min + length(needle) - 1
    range = Range.new(min, max)
  end
  result[range] = u_unpack(replace_by)
  str.replace(result.pack('U*'))
end
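The core idea of the integer branch above is to unpack the string into an array of codepoints, assign into that array, and pack the result back. A minimal sketch of just that idea in plain Ruby follows; `replace_codepoints` is a hypothetical helper, not part of the handler, and it returns a new string rather than modifying `str` in place.

```ruby
# Hypothetical helper: replace `length` codepoints starting at codepoint
# offset `offset` with `replacement`, returning a new string.
def replace_codepoints(str, offset, length, replacement)
  codepoints = str.unpack('U*')
  raise IndexError, "index #{offset} out of string" if offset >= codepoints.length

  # Array slice assignment works on codepoints, so multibyte characters
  # are never cut in half.
  codepoints[offset, length] = replacement.unpack('U*')
  codepoints.pack('U*')
end

one  = replace_codepoints("Müller", 2, 1, "e") # same as the first example above
many = replace_codepoints("Müller", 1, 2, "ö") # same as the second example above
```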
.capitalize(str) ⇒ Object
Returns a copy of str with the first character converted to uppercase and the remainder to lowercase.
# File 'lib/multibyte/handlers/utf8_handler.rb', line 273
def capitalize(str)
  upcase(slice(str, 0..0)) + downcase(slice(str, 1..-1) || '')
end
.center(str, integer, padstr = ' ') ⇒ Object
Works just like String#center, only integer specifies characters instead of bytes.
Example:
"¾ cup".chars.center(8).to_s
# => " ¾ cup  "
"¾ cup".chars.center(8, " ").to_s # Use non-breaking whitespace
# => " ¾ cup  "
# File 'lib/multibyte/handlers/utf8_handler.rb', line 214
def center(str, integer, padstr=' ')
  justify(str, integer, :center, padstr)
end
.codepoints_to_pattern(array_of_codepoints) ⇒ Object
Returns a regular expression pattern that matches the passed Unicode codepoints
# File 'lib/multibyte/handlers/utf8_handler.rb', line 115
def self.codepoints_to_pattern(array_of_codepoints) #:nodoc:
  array_of_codepoints.collect{ |e| [e].pack 'U*' }.join('|')
end
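To illustrate how the alternation string is used, the sketch below repeats the one-line body as a standalone method and interpolates its result into a leading-whitespace pattern, in the same way UNICODE_LEADERS_PAT is built. The `sample` list is a small assumed subset chosen for the example, not the full UNICODE_LEADERS_AND_TRAILERS constant.

```ruby
# Standalone copy of the method body shown above.
def codepoints_to_pattern(array_of_codepoints)
  array_of_codepoints.collect { |e| [e].pack('U*') }.join('|')
end

# Assumed sample: SPACE, NO-BREAK SPACE, IDEOGRAPHIC SPACE and the BOM (65279).
sample  = [0x0020, 0x00A0, 0x3000, 0xFEFF]
leaders = /\A(#{codepoints_to_pattern(sample)})+/

# Remove any run of the sample codepoints from the start of the string.
stripped = "\u00A0\u3000 text".sub(leaders, '')
```

The method does no escaping, which is fine for whitespace codepoints; a codepoint that is a regular expression metacharacter, such as 0x7C, would need `Regexp.escape` first.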
.compose(str) ⇒ Object
Perform composition on the characters in the string
# File 'lib/multibyte/handlers/utf8_handler.rb', line 311
def compose(str)
  compose_codepoints u_unpack(str).pack('U*')
end
.consumes?(str) ⇒ Boolean
Checks if the string is valid UTF8.
# File 'lib/multibyte/handlers/utf8_handler.rb', line 341
def consumes?(str)
  # Unpack is a little bit faster than regular expressions
  begin
    str.unpack('U*')
    true
  rescue ArgumentError
    false
  end
end
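The same unpack-and-rescue check can be tried outside the handler. This is a small sketch with a hypothetical method name, `valid_utf8?`; in current Ruby versions the built-in `String#valid_encoding?` is the usual way to ask this question.

```ruby
# Hypothetical standalone version of the check: unpack('U*') raises
# ArgumentError when it meets a malformed UTF-8 sequence.
def valid_utf8?(str)
  str.unpack('U*')
  true
rescue ArgumentError
  false
end

good = valid_utf8?("café")     # well-formed UTF-8
bad  = valid_utf8?("caf\xE9")  # 0xE9 announces a 3-byte sequence that never arrives
```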
.decompose(str) ⇒ Object
Perform decomposition on the characters in the string
# File 'lib/multibyte/handlers/utf8_handler.rb', line 306
def decompose(str)
  decompose_codepoints(:canonical, u_unpack(str)).pack('U*')
end
.downcase(str) ⇒ Object
Convert characters in the string to lowercase
# File 'lib/multibyte/handlers/utf8_handler.rb', line 270
def downcase(str); to_case :lowercase_mapping, str; end
.g_length(str) ⇒ Object
Returns the number of grapheme clusters in the string. This method is very likely to be moved or renamed in future versions.
# File 'lib/multibyte/handlers/utf8_handler.rb', line 353
def g_length(str)
  g_unpack(str).length
end
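The difference between codepoints and grapheme clusters is easiest to see with a combining mark. The sketch below does not use the handler's `g_unpack`, which is not shown here; it assumes Ruby 2.5 or later, where `String#grapheme_clusters` is built in.

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT: two codepoints that
# render as a single user-perceived character.
str = "e\u0301"

codepoint_count = str.unpack('U*').size      # counts codepoints
grapheme_count  = str.grapheme_clusters.size # counts grapheme clusters
```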
.index(str, *args) ⇒ Object
Returns the position of the passed argument in the string, counting in codepoints
# File 'lib/multibyte/handlers/utf8_handler.rb', line 138
def index(str, *args)
  bidx = str.index(*args)
  bidx ? (u_unpack(str.slice(0...bidx)).size) : nil
end
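The method above relies on `String#index` returning a byte offset, which it then converts by counting the codepoints in the prefix. In current Ruby versions `String#index` already counts characters for UTF-8 strings, so the sketch below forces a byte-level search with `String#b` to reproduce the two steps. `codepoint_index` is a hypothetical name, and only a plain string needle is handled.

```ruby
# Hypothetical helper: find `needle` by bytes, then translate the byte
# offset into a codepoint offset by counting codepoints in the prefix.
def codepoint_index(str, needle)
  byte_index = str.b.index(needle.b)
  return nil unless byte_index

  str.byteslice(0, byte_index).unpack('U*').size
end

# "ü" occupies two bytes, so the first "l" sits at byte offset 3
# but at codepoint offset 2.
position = codepoint_index("Müller", "l")
```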
.insert(str, offset, fragment) ⇒ Object
Inserts the passed string at the specified codepoint offset
# File 'lib/multibyte/handlers/utf8_handler.rb', line 128
def insert(str, offset, fragment)
  str.replace(
    u_unpack(str).insert(
      offset,
      u_unpack(fragment)
    ).flatten.pack('U*')
  )
end
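A short sketch of the same unpack, insert, pack sequence in plain Ruby may help. `insert_at_codepoint` is a hypothetical name; unlike the method above it returns a new string instead of replacing the contents of `str`, and it splats the fragment's codepoints rather than flattening afterwards.

```ruby
# Hypothetical helper: insert `fragment` before the codepoint at `offset`.
def insert_at_codepoint(str, offset, fragment)
  str.unpack('U*').insert(offset, *fragment.unpack('U*')).pack('U*')
end

# The offset counts codepoints, so inserting after "ü" uses offset 2
# even though "ü" takes two bytes.
result = insert_at_codepoint("Müler", 2, "l")
```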
.ljust(str, integer, padstr = ' ') ⇒ Object
Works just like String#ljust, only integer specifies characters instead of bytes.
Example:
"¾ cup".chars.ljust(8).to_s
# => "¾ cup   "
"¾ cup".chars.ljust(8, " ").to_s # Use non-breaking whitespace
# => "¾ cup   "
# File 'lib/multibyte/handlers/utf8_handler.rb', line 201
def ljust(str, integer, padstr=' ')
  justify(str, integer, :left, padstr)
end
.lstrip(str) ⇒ Object
Does Unicode-aware lstrip
# File 'lib/multibyte/handlers/utf8_handler.rb', line 224
def lstrip(str)
  str.gsub(UNICODE_LEADERS_PAT, '')
end
.normalize(str, form = Multibyte::DEFAULT_NORMALIZATION_FORM) ⇒ Object
Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.
- str - The string to perform normalization on.
- form - The form you want to normalize in. Should be one of the following: :c, :kc, :d, or :kd. Default is Multibyte::DEFAULT_NORMALIZATION_FORM.
# File 'lib/multibyte/handlers/utf8_handler.rb', line 288
def normalize(str, form=Multibyte::DEFAULT_NORMALIZATION_FORM)
  # See http://www.unicode.org/reports/tr15, Table 1
  codepoints = u_unpack(str)
  case form
    when :d
      reorder_characters(decompose_codepoints(:canonical, codepoints))
    when :c
      compose_codepoints reorder_characters(decompose_codepoints(:canonical, codepoints))
    when :kd
      reorder_characters(decompose_codepoints(:compatability, codepoints))
    when :kc
      compose_codepoints reorder_characters(decompose_codepoints(:compatability, codepoints))
    else
      raise ArgumentError, "#{form} is not a valid normalization variant", caller
  end.pack('U*')
end
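To see what the four forms do to actual text, the sketch below maps the handler's form symbols onto Ruby's built-in `String#unicode_normalize`. This is an assumption made for illustration: `normalize_sketch` is a hypothetical name, it does not call the handler's own decomposition tables, and the `:kc` default is inferred from the description above rather than read from Multibyte::DEFAULT_NORMALIZATION_FORM.

```ruby
# Hypothetical wrapper: translate :c, :d, :kc, :kd into the names that
# String#unicode_normalize accepts.
def normalize_sketch(str, form = :kc)
  builtin = { c: :nfc, d: :nfd, kc: :nfkc, kd: :nfkd }.fetch(form) do
    raise ArgumentError, "#{form} is not a valid normalization variant"
  end
  str.unicode_normalize(builtin)
end

ligature   = normalize_sketch("\uFB01")      # compatibility form splits the "fi" ligature
composed   = normalize_sketch("e\u0301", :c) # canonical composition yields U+00E9
decomposed = normalize_sketch("\u00E9", :d)  # canonical decomposition yields "e" + U+0301
```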
.reverse(str) ⇒ Object
Reverses codepoints in the string.
# File 'lib/multibyte/handlers/utf8_handler.rb', line 240
def reverse(str)
  u_unpack(str).reverse.pack('U*')
end
.rjust(str, integer, padstr = ' ') ⇒ Object
Works just like String#rjust, only integer specifies characters instead of bytes.
Example:
"¾ cup".chars.rjust(8).to_s
# => "   ¾ cup"
"¾ cup".chars.rjust(8, " ").to_s # Use non-breaking whitespace
# => "   ¾ cup"
# File 'lib/multibyte/handlers/utf8_handler.rb', line 188
def rjust(str, integer, padstr=' ')
  justify(str, integer, :right, padstr)
end
.rstrip(str) ⇒ Object
Does Unicode-aware rstrip
# File 'lib/multibyte/handlers/utf8_handler.rb', line 219
def rstrip(str)
  str.gsub(UNICODE_TRAILERS_PAT, '')
end
.size(str) ⇒ Object Also known as: length
Returns the number of codepoints in the string
# File 'lib/multibyte/handlers/utf8_handler.rb', line 234
def size(str)
  u_unpack(str).size
end
.slice(str, *args) ⇒ Object Also known as: []
Implements Unicode-aware slice with codepoints. Slicing on one point returns the codepoints for that character.
# File 'lib/multibyte/handlers/utf8_handler.rb', line 246
def slice(str, *args)
  if args.size > 2
    raise ArgumentError, "wrong number of arguments (#{args.size} for 1)" # Do as if we were native
  elsif (args.size == 2 && !(args.first.is_a?(Numeric) || args.first.is_a?(Regexp)))
    raise TypeError, "cannot convert #{args.first.class} into Integer" # Do as if we were native
  elsif (args.size == 2 && !args[1].is_a?(Numeric))
    raise TypeError, "cannot convert #{args[1].class} into Integer" # Do as if we were native
  elsif args[0].kind_of? Range
    cps = u_unpack(str).slice(*args)
    cps.nil? ? nil : cps.pack('U*')
  elsif args[0].kind_of? Regexp
    str.slice(*args)
  elsif args.size == 1 && args[0].kind_of?(Numeric)
    u_unpack(str)[args[0]]
  else
    u_unpack(str).slice(*args).pack('U*')
  end
end
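The single-index case is easy to overlook: it returns an integer codepoint, not a one-character string. The sketch below reproduces only the integer, offset-and-length, and range branches in plain Ruby; `slice_codepoints` is a hypothetical name and the argument checks and Regexp branch are left out.

```ruby
# Hypothetical helper covering three of the branches above.
def slice_codepoints(str, *args)
  codepoints = str.unpack('U*')
  if args.size == 1 && args[0].is_a?(Integer)
    codepoints[args[0]] # a single index yields the codepoint itself
  else
    piece = codepoints.slice(*args)
    piece && piece.pack('U*') # nil when the slice is out of range
  end
end

single = slice_codepoints("Müller", 1)     # integer codepoint of "ü"
pair   = slice_codepoints("Müller", 1, 2)  # two codepoints from offset 1
tail   = slice_codepoints("Müller", 1..-1) # everything after the first codepoint
```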
.strip(str) ⇒ Object
Removes leading and trailing whitespace
# File 'lib/multibyte/handlers/utf8_handler.rb', line 229
def strip(str)
  str.gsub(UNICODE_LEADERS_PAT, '').gsub(UNICODE_TRAILERS_PAT, '')
end
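The point of a Unicode-aware strip is that Ruby's own `String#strip` only removes ASCII whitespace and null characters. The sketch below shows the difference with a hypothetical `unicode_strip` helper; its default codepoint list is a short assumed subset of UNICODE_LEADERS_AND_TRAILERS, and it anchors with `\z` rather than the `\Z` used in UNICODE_TRAILERS_PAT.

```ruby
# Hypothetical helper: strip a configurable set of codepoints from both ends.
def unicode_strip(str, codepoints = [0x0009, 0x000A, 0x0020, 0x00A0, 0x3000, 0xFEFF])
  alternation = codepoints.map { |cp| Regexp.escape([cp].pack('U')) }.join('|')
  str.sub(/\A(?:#{alternation})+/, '').sub(/(?:#{alternation})+\z/, '')
end

padded = "\uFEFF\u00A0 text\u3000"

builtin = padded.strip          # leaves the BOM, NO-BREAK SPACE and IDEOGRAPHIC SPACE
unicode = unicode_strip(padded) # removes all of them
```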
.tidy_bytes(str) ⇒ Object
Replaces all non-UTF-8 bytes with their ISO-8859-1 or CP1252 equivalents, resulting in a valid UTF-8 string
# File 'lib/multibyte/handlers/utf8_handler.rb', line 358
def tidy_bytes(str)
  str.split(//u).map do |c|
    if !UTF8_PAT.match(c)
      n = c.unpack('C')[0]
      n < 128 ? n.chr :
      n < 160 ? [UCD.cp1252[n] || n].pack('U') :
      n < 192 ? "\xC2" + n.chr : "\xC3" + (n-64).chr
    else
      c
    end
  end.join
end
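A comparable repair can be sketched in current Ruby with `String#scrub` and the built-in Windows-1252 transcoder. This is an alternative technique, not the handler's implementation: it does not use the `UCD.cp1252` table referenced above, `tidy_bytes_sketch` is a hypothetical name, and results may differ from the handler on edge cases.

```ruby
# Hypothetical helper: reinterpret each invalid byte as Windows-1252 and
# re-encode it as UTF-8, leaving valid sequences untouched.
def tidy_bytes_sketch(str)
  str.scrub do |bad|
    bad.bytes.map do |byte|
      begin
        byte.chr.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
      rescue Encoding::UndefinedConversionError
        # Byte has no Windows-1252 mapping: fall back to the same codepoint.
        [byte].pack('U')
      end
    end.join
  end
end

latin = tidy_bytes_sketch("caf\xE9")   # stray Latin-1 byte for "é"
euro  = tidy_bytes_sketch("10 \x80")   # 0x80 is the euro sign in Windows-1252
```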
.translate_offset(str, byte_offset) ⇒ Object
Used to translate an offset from bytes to characters, for instance one received from a regular expression match
# File 'lib/multibyte/handlers/utf8_handler.rb', line 320
def translate_offset(str, byte_offset)
  return nil if byte_offset.nil?
  return 0 if str == ''
  chunk = str[0..byte_offset]
  begin
    begin
      chunk.unpack('U*').length - 1
    rescue ArgumentError => e
      chunk = str[0..(byte_offset+=1)]
      # Stop retrying at the end of the string
      raise e unless byte_offset < chunk.length
      # We damaged a character, retry
      retry
    end
  # Catch the ArgumentError so we can throw our own
  rescue ArgumentError
    raise EncodingError.new('malformed UTF-8 character')
  end
end
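The translation amounts to counting the complete characters that precede the byte offset, including when the offset falls in the middle of a multibyte character. A minimal sketch of that counting in current Ruby follows; `byte_to_codepoint_offset` is a hypothetical name, and it drops a trailing partial character with `String#scrub` instead of retrying as the method above does.

```ruby
# Hypothetical helper: number of complete codepoints before `byte_offset`.
def byte_to_codepoint_offset(str, byte_offset)
  return nil if byte_offset.nil?

  # byteslice may cut a character in half; scrub('') discards the fragment.
  str.byteslice(0, byte_offset).scrub('').unpack('U*').length
end

at_boundary = byte_to_codepoint_offset("Müller", 3) # first "l": after "M" and "ü"
mid_char    = byte_to_codepoint_offset("Müller", 2) # second byte of "ü"
```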
.upcase(str) ⇒ Object
Convert characters in the string to uppercase
# File 'lib/multibyte/handlers/utf8_handler.rb', line 267
def upcase(str); to_case :uppercase_mapping, str; end