Class: ActiveSupport::Multibyte::Chars
- Includes:
- Comparable
- Defined in:
- lib/active_support/multibyte/chars.rb
Overview
Chars enables you to work transparently with UTF-8 encoding in the Ruby String class without having extensive knowledge about the encoding. A Chars object accepts a string upon initialization and proxies String methods in an encoding-safe manner. All the normal String methods are also implemented on the proxy.
String methods are proxied through the Chars object, and can be accessed through the mb_chars method. Methods which would normally return a String object now return a Chars object so methods can be chained.
"The Perfect String ".mb_chars.downcase.strip.normalize #=> "the perfect string"
Chars objects are perfectly interchangeable with String objects as long as no explicit class checks are made. If certain methods do explicitly check the class, call to_s before you pass Chars objects to them.
bad.explicit_checking_method "T".mb_chars.downcase.to_s
The default Chars implementation assumes that the encoding of the string is UTF-8. If you want to handle different encodings, you can write your own multibyte string handler and configure it through ActiveSupport::Multibyte.proxy_class.
class CharsForUTF32
  def size
    @wrapped_string.size / 4
  end

  def self.accepts?(string)
    string.length % 4 == 0
  end
end
ActiveSupport::Multibyte.proxy_class = CharsForUTF32
Constant Summary
- HANGUL_SBASE =
Hangul character boundaries and properties
0xAC00
- HANGUL_LBASE =
0x1100
- HANGUL_VBASE =
0x1161
- HANGUL_TBASE =
0x11A7
- HANGUL_LCOUNT =
19
- HANGUL_VCOUNT =
21
- HANGUL_TCOUNT =
28
- HANGUL_NCOUNT =
HANGUL_VCOUNT * HANGUL_TCOUNT
- HANGUL_SCOUNT =
11172
- HANGUL_SLAST =
HANGUL_SBASE + HANGUL_SCOUNT
- HANGUL_JAMO_FIRST =
0x1100
- HANGUL_JAMO_LAST =
0x11FF
- UNICODE_WHITESPACE =
All the unicode whitespace
[
  (0x0009..0x000D).to_a, # White_Space # Cc [5] <control-0009>..<control-000D>
  0x0020,                # White_Space # Zs SPACE
  0x0085,                # White_Space # Cc <control-0085>
  0x00A0,                # White_Space # Zs NO-BREAK SPACE
  0x1680,                # White_Space # Zs OGHAM SPACE MARK
  0x180E,                # White_Space # Zs MONGOLIAN VOWEL SEPARATOR
  (0x2000..0x200A).to_a, # White_Space # Zs [11] EN QUAD..HAIR SPACE
  0x2028,                # White_Space # Zl LINE SEPARATOR
  0x2029,                # White_Space # Zp PARAGRAPH SEPARATOR
  0x202F,                # White_Space # Zs NARROW NO-BREAK SPACE
  0x205F,                # White_Space # Zs MEDIUM MATHEMATICAL SPACE
  0x3000,                # White_Space # Zs IDEOGRAPHIC SPACE
].flatten.freeze
- UNICODE_LEADERS_AND_TRAILERS =
The BOM (byte order mark) can also be seen as whitespace; it is a non-rendering character used to distinguish between little and big endian. This is not an issue in UTF-8, so it must be ignored.
UNICODE_WHITESPACE + [65279]
- UNICODE_TRAILERS_PAT =
/(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+\Z/
- UNICODE_LEADERS_PAT =
/\A(#{codepoints_to_pattern(UNICODE_LEADERS_AND_TRAILERS)})+/
- UTF8_PAT =
Instance Attribute Summary
-
#wrapped_string ⇒ Object
(also: #to_s, #to_str)
readonly
Returns the value of attribute wrapped_string.
Class Method Summary
-
.codepoints_to_pattern(array_of_codepoints) ⇒ Object
Returns a regular expression pattern that matches the passed Unicode codepoints.
-
.compose_codepoints(codepoints) ⇒ Object
Compose decomposed characters to the composed form.
-
.consumes?(string) ⇒ Boolean
Returns true when the proxy class can handle the string.
-
.decompose_codepoints(type, codepoints) ⇒ Object
Decompose composed characters to the decomposed form.
-
.g_pack(unpacked) ⇒ Object
Reverse operation of g_unpack.
-
.g_unpack(string) ⇒ Object
Unpack the string at grapheme boundaries.
-
.in_char_class?(codepoint, classes) ⇒ Boolean
Detect whether the codepoint is in a certain character class.
-
.padding(padsize, padstr = ' ') ⇒ Object
:nodoc:.
-
.reorder_characters(codepoints) ⇒ Object
Re-order codepoints so the string becomes canonical.
-
.tidy_bytes(string) ⇒ Object
Replaces all ISO-8859-1 or CP1252 characters by their UTF-8 equivalent resulting in a valid UTF-8 string.
-
.u_unpack(string) ⇒ Object
Unpack the string at codepoints boundaries.
-
.wants?(string) ⇒ Boolean
Returns true if the Chars class can and should act as a proxy for the given string.
Instance Method Summary
-
#+(other) ⇒ Object
Returns a new Chars object containing the other object concatenated to the string.
-
#<=>(other) ⇒ Object
Returns -1, 0 or +1 depending on whether the Chars object is to be sorted before, equal to or after the object on the right side of the operation.
-
#=~(other) ⇒ Object
Like String#=~, only it returns the character offset (in codepoints) instead of the byte offset.
-
#[]=(*args) ⇒ Object
Like String#[]=, except instead of byte offsets you specify character offsets.
-
#acts_like_string? ⇒ Boolean
Enable more predictable duck-typing on String-like classes.
-
#capitalize ⇒ Object
Converts the first character to uppercase and the remainder to lowercase.
-
#center(integer, padstr = ' ') ⇒ Object
Works just like String#center, only integer specifies characters instead of bytes.
-
#compose ⇒ Object
Performs composition on all the characters.
-
#decompose ⇒ Object
Performs canonical decomposition on all the characters.
-
#downcase ⇒ Object
Convert characters in the string to lowercase.
-
#g_length ⇒ Object
Returns the number of grapheme clusters in the string.
-
#include?(other) ⇒ Boolean
Returns true if the wrapped string contains other.
-
#index(needle, offset = 0) ⇒ Object
Returns the position of needle in the string, counting in codepoints.
-
#initialize(string) ⇒ Chars
constructor
:nodoc:.
-
#insert(offset, fragment) ⇒ Object
Inserts the passed string at specified codepoint offsets.
-
#ljust(integer, padstr = ' ') ⇒ Object
Works just like String#ljust, only integer specifies characters instead of bytes.
-
#lstrip ⇒ Object
Strips entire range of Unicode whitespace from the left of the string.
-
#method_missing(method, *args, &block) ⇒ Object
Forward all undefined methods to the wrapped string.
-
#normalize(form = ActiveSupport::Multibyte.default_normalization_form) ⇒ Object
Returns the KC normalization of the string by default.
-
#ord ⇒ Object
Returns the codepoint of the first character in the string.
-
#respond_to?(method, include_private = false) ⇒ Boolean
Returns true if obj responds to the given method.
-
#reverse ⇒ Object
Reverses all characters in the string.
-
#rindex(needle, offset = nil) ⇒ Object
Returns the position of needle in the string, counting in codepoints, searching backward from offset or the end of the string.
-
#rjust(integer, padstr = ' ') ⇒ Object
Works just like String#rjust, only integer specifies characters instead of bytes.
-
#rstrip ⇒ Object
Strips entire range of Unicode whitespace from the right of the string.
-
#size ⇒ Object
(also: #length)
Returns the number of codepoints in the string.
-
#slice(*args) ⇒ Object
(also: #[])
Implements Unicode-aware slice with codepoints.
-
#slice!(*args) ⇒ Object
Like String#slice!, except instead of byte offsets you specify character offsets.
-
#split(*args) ⇒ Object
Works just like String#split, with the exception that the items in the resulting list are Chars instances instead of String.
-
#strip ⇒ Object
Strips entire range of Unicode whitespace from the right and left of the string.
-
#tidy_bytes ⇒ Object
Replaces all ISO-8859-1 or CP1252 characters by their UTF-8 equivalent resulting in a valid UTF-8 string.
-
#upcase ⇒ Object
Convert characters in the string to uppercase.
Constructor Details
#initialize(string) ⇒ Chars
:nodoc:
# File 'lib/active_support/multibyte/chars.rb', line 84
def initialize(string)
  @wrapped_string = string
  @wrapped_string.force_encoding(Encoding::UTF_8) unless @wrapped_string.frozen?
end
Dynamic Method Handling
This class handles dynamic methods through the method_missing method
#method_missing(method, *args, &block) ⇒ Object
Forward all undefined methods to the wrapped string.
# File 'lib/active_support/multibyte/chars.rb', line 95
def method_missing(method, *args, &block)
  if method.to_s =~ /!$/
    @wrapped_string.__send__(method, *args, &block)
    self
  else
    result = @wrapped_string.__send__(method, *args, &block)
    result.kind_of?(String) ? chars(result) : result
  end
end
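The forwarding behaviour described above can be tried without ActiveSupport. The following is a minimal sketch, assuming a hypothetical `TinyProxy` class (not part of the library): String results are re-wrapped so calls can be chained, and bang methods return the proxy itself.

```ruby
# Hypothetical stand-in for a Chars-like proxy; not the ActiveSupport class.
class TinyProxy
  attr_reader :wrapped_string
  alias to_s wrapped_string

  def initialize(string)
    @wrapped_string = string
  end

  # Forward unknown methods to the wrapped string.
  def method_missing(method, *args, &block)
    return super unless @wrapped_string.respond_to?(method)

    result = @wrapped_string.__send__(method, *args, &block)
    if method.to_s.end_with?('!')
      self # bang methods mutate in place, so hand back the proxy
    else
      result.is_a?(String) ? self.class.new(result) : result
    end
  end

  def respond_to_missing?(method, include_private = false)
    @wrapped_string.respond_to?(method, include_private) || super
  end
end

chained = TinyProxy.new('  The Perfect String  ').downcase.strip
puts chained.to_s    # the perfect string
puts chained.length  # 18
```

Non-String results such as the Integer from `length` are returned unwrapped, which mirrors the `kind_of?(String)` check in the listing.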
Instance Attribute Details
#wrapped_string ⇒ Object (readonly) Also known as: to_s, to_str
Returns the value of attribute wrapped_string.
# File 'lib/active_support/multibyte/chars.rb', line 78
def wrapped_string
  @wrapped_string
end
Class Method Details
.codepoints_to_pattern(array_of_codepoints) ⇒ Object
Returns a regular expression pattern that matches the passed Unicode codepoints.
# File 'lib/active_support/multibyte/chars.rb', line 70
def self.codepoints_to_pattern(array_of_codepoints) #:nodoc:
  array_of_codepoints.collect{ |e| [e].pack 'U*' }.join('|')
end
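As an illustration of how such a pattern is used, the sketch below builds an alternation from a short, hand-picked list of codepoints (an assumption for the example, not the full UNICODE_LEADERS_AND_TRAILERS list) and uses it to strip leading and trailing whitespace in plain Ruby.

```ruby
# Three whitespace codepoints chosen for the example:
# SPACE, NO-BREAK SPACE, IDEOGRAPHIC SPACE.
codepoints = [0x0020, 0x00A0, 0x3000]

# Same technique as codepoints_to_pattern: one UTF-8 character per codepoint,
# joined into a regular expression alternation.
alternation = codepoints.map { |cp| [cp].pack('U*') }.join('|')

leaders  = /\A(#{alternation})+/
trailers = /(#{alternation})+\Z/

text = "\u00A0 hello\u3000"
stripped = text.sub(leaders, '').sub(trailers, '')
puts stripped.inspect  # "hello"
```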
.compose_codepoints(codepoints) ⇒ Object
Compose decomposed characters to the composed form.
# File 'lib/active_support/multibyte/chars.rb', line 577
def compose_codepoints(codepoints)
  pos = 0
  eoa = codepoints.length - 1
  starter_pos = 0
  starter_char = codepoints[0]
  previous_combining_class = -1
  while pos < eoa
    pos += 1
    lindex = starter_char - HANGUL_LBASE
    # -- Hangul
    if 0 <= lindex and lindex < HANGUL_LCOUNT
      vindex = codepoints[starter_pos+1] - HANGUL_VBASE rescue vindex = -1
      if 0 <= vindex and vindex < HANGUL_VCOUNT
        tindex = codepoints[starter_pos+2] - HANGUL_TBASE rescue tindex = -1
        if 0 <= tindex and tindex < HANGUL_TCOUNT
          j = starter_pos + 2
          eoa -= 2
        else
          tindex = 0
          j = starter_pos + 1
          eoa -= 1
        end
        codepoints[starter_pos..j] = (lindex * HANGUL_VCOUNT + vindex) * HANGUL_TCOUNT + tindex + HANGUL_SBASE
      end
      starter_pos += 1
      starter_char = codepoints[starter_pos]
    # -- Other characters
    else
      current_char = codepoints[pos]
      current = UCD.codepoints[current_char]
      if current.combining_class > previous_combining_class
        if ref = UCD.composition_map[starter_char]
          composition = ref[current_char]
        else
          composition = nil
        end
        unless composition.nil?
          codepoints[starter_pos] = composition
          starter_char = composition
          codepoints.delete_at pos
          eoa -= 1
          pos -= 1
          previous_combining_class = -1
        else
          previous_combining_class = current.combining_class
        end
      else
        previous_combining_class = current.combining_class
      end
      if current.combining_class == 0
        starter_pos = pos
        starter_char = codepoints[pos]
      end
    end
  end
  codepoints
end
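The Hangul branch of the listing is plain arithmetic on the HANGUL_* constants. A minimal sketch of just that step follows; the helper name `compose_hangul` is hypothetical, and the constants are copied from the Constant Summary.

```ruby
HANGUL_SBASE  = 0xAC00
HANGUL_LBASE  = 0x1100
HANGUL_VBASE  = 0x1161
HANGUL_TBASE  = 0x11A7
HANGUL_VCOUNT = 21
HANGUL_TCOUNT = 28

# Combine a leading consonant (L), a vowel (V) and an optional trailing
# consonant (T) into one precomposed syllable codepoint.
def compose_hangul(l, v, t = nil)
  lindex = l - HANGUL_LBASE
  vindex = v - HANGUL_VBASE
  tindex = t ? t - HANGUL_TBASE : 0
  (lindex * HANGUL_VCOUNT + vindex) * HANGUL_TCOUNT + tindex + HANGUL_SBASE
end

puts format('U+%04X', compose_hangul(0x1112, 0x1161, 0x11AB))  # U+D55C
puts format('U+%04X', compose_hangul(0x1100, 0x1161))          # U+AC00
```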
.consumes?(string) ⇒ Boolean
Returns true when the proxy class can handle the string. Returns false otherwise.
# File 'lib/active_support/multibyte/chars.rb', line 123
def self.consumes?(string)
  # Unpack is a little bit faster than regular expressions.
  string.unpack('U*')
  true
rescue ArgumentError
  false
end
.decompose_codepoints(type, codepoints) ⇒ Object
Decompose composed characters to the decomposed form.
# File 'lib/active_support/multibyte/chars.rb', line 556
def decompose_codepoints(type, codepoints)
  codepoints.inject([]) do |decomposed, cp|
    # if it's a hangul syllable starter character
    if HANGUL_SBASE <= cp and cp < HANGUL_SLAST
      sindex = cp - HANGUL_SBASE
      ncp = [] # new codepoints
      ncp << HANGUL_LBASE + sindex / HANGUL_NCOUNT
      ncp << HANGUL_VBASE + (sindex % HANGUL_NCOUNT) / HANGUL_TCOUNT
      tindex = sindex % HANGUL_TCOUNT
      ncp << (HANGUL_TBASE + tindex) unless tindex == 0
      decomposed.concat ncp
    # if the codepoint is decomposable with the current decomposition type
    elsif (ncp = UCD.codepoints[cp].decomp_mapping) and (!UCD.codepoints[cp].decomp_type || type == :compatability)
      decomposed.concat decompose_codepoints(type, ncp.dup)
    else
      decomposed << cp
    end
  end
end
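The Hangul branch above can be exercised on its own. This is a small sketch with a hypothetical helper `decompose_hangul`; the constants are copied from the Constant Summary, and the range check on the input is left out for brevity.

```ruby
HANGUL_SBASE  = 0xAC00
HANGUL_LBASE  = 0x1100
HANGUL_VBASE  = 0x1161
HANGUL_TBASE  = 0x11A7
HANGUL_VCOUNT = 21
HANGUL_TCOUNT = 28
HANGUL_NCOUNT = HANGUL_VCOUNT * HANGUL_TCOUNT

# Split a precomposed Hangul syllable into its jamo codepoints.
# Assumes cp lies in HANGUL_SBASE...(HANGUL_SBASE + 11172).
def decompose_hangul(cp)
  sindex = cp - HANGUL_SBASE
  l = HANGUL_LBASE + sindex / HANGUL_NCOUNT
  v = HANGUL_VBASE + (sindex % HANGUL_NCOUNT) / HANGUL_TCOUNT
  tindex = sindex % HANGUL_TCOUNT
  tindex.zero? ? [l, v] : [l, v, HANGUL_TBASE + tindex]
end

p decompose_hangul(0xD55C).map { |cp| format('U+%04X', cp) }
# ["U+1112", "U+1161", "U+11AB"]
```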
.g_pack(unpacked) ⇒ Object
Reverse operation of g_unpack.
# File 'lib/active_support/multibyte/chars.rb', line 527
def g_pack(unpacked)
  (unpacked.flatten).pack('U*')
end
.g_unpack(string) ⇒ Object
Unpack the string at grapheme boundaries.
# File 'lib/active_support/multibyte/chars.rb', line 493
def g_unpack(string)
  codepoints = u_unpack(string)
  unpacked = []
  pos = 0
  marker = 0
  eoc = codepoints.length
  while(pos < eoc)
    pos += 1
    previous = codepoints[pos-1]
    current = codepoints[pos]
    if (
        # CR X LF
        one = ( previous == UCD.boundary[:cr] and current == UCD.boundary[:lf] ) or
        # L X (L|V|LV|LVT)
        two = ( UCD.boundary[:l] === previous and in_char_class?(current, [:l,:v,:lv,:lvt]) ) or
        # (LV|V) X (V|T)
        three = ( in_char_class?(previous, [:lv,:v]) and in_char_class?(current, [:v,:t]) ) or
        # (LVT|T) X (T)
        four = ( in_char_class?(previous, [:lvt,:t]) and UCD.boundary[:t] === current ) or
        # X Extend
        five = (UCD.boundary[:extend] === current)
      )
    else
      unpacked << codepoints[marker..pos-1]
      marker = pos
    end
  end
  unpacked
end
.in_char_class?(codepoint, classes) ⇒ Boolean
Detect whether the codepoint is in a certain character class. Returns true when it's in the specified character class and false otherwise. Valid character classes are: :cr, :lf, :l, :v, :lv, :lvt and :t.
Primarily used by the grapheme cluster support.
# File 'lib/active_support/multibyte/chars.rb', line 484
def in_char_class?(codepoint, classes)
  classes.detect { |c| UCD.boundary[c] === codepoint } ? true : false
end
.padding(padsize, padstr = ' ') ⇒ Object
:nodoc:
# File 'lib/active_support/multibyte/chars.rb', line 531
def padding(padsize, padstr=' ') #:nodoc:
  if padsize != 0
    new(padstr * ((padsize / u_unpack(padstr).size) + 1)).slice(0, padsize)
  else
    ''
  end
end
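The padding calculation repeats the pad string once more than strictly needed and then cuts it to length. A standalone sketch follows, assuming a Ruby version where String#length and String#[] already count characters, so no Chars wrapper is involved.

```ruby
# Build exactly padsize characters of padding from padstr.
def padding(padsize, padstr = ' ')
  return '' if padsize.zero?

  # Over-allocate by one repetition, then trim to the requested length.
  (padstr * (padsize / padstr.length + 1))[0, padsize]
end

puts padding(5, 'ab')  # ababa
puts padding(3, '¾')   # ¾¾¾
```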
.reorder_characters(codepoints) ⇒ Object
Re-order codepoints so the string becomes canonical.
# File 'lib/active_support/multibyte/chars.rb', line 540
def reorder_characters(codepoints)
  length = codepoints.length- 1
  pos = 0
  while pos < length do
    cp1, cp2 = UCD.codepoints[codepoints[pos]], UCD.codepoints[codepoints[pos+1]]
    if (cp1.combining_class > cp2.combining_class) && (cp2.combining_class > 0)
      codepoints[pos..pos+1] = cp2.code, cp1.code
      pos += (pos > 0 ? -1 : 1)
    else
      pos += 1
    end
  end
  codepoints
end
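To see the reordering in action without the Unicode database, the sketch below uses a tiny hand-written combining class table (an assumption for the example: COMBINING ACUTE ACCENT has class 230, COMBINING DOT BELOW has class 220, everything else is treated as class 0). The loop itself follows the listing.

```ruby
# Illustrative table only; the real values come from the Unicode database.
COMBINING_CLASS = { 0x0301 => 230, 0x0323 => 220 }.freeze

def reorder(codepoints)
  cps = codepoints.dup
  pos = 0
  while pos < cps.length - 1
    c1 = COMBINING_CLASS.fetch(cps[pos], 0)
    c2 = COMBINING_CLASS.fetch(cps[pos + 1], 0)
    if c1 > c2 && c2 > 0
      # Swap the pair and step back to re-check the previous neighbour.
      cps[pos], cps[pos + 1] = cps[pos + 1], cps[pos]
      pos += (pos > 0 ? -1 : 1)
    else
      pos += 1
    end
  end
  cps
end

# "a" + acute + dot below becomes "a" + dot below + acute.
p reorder([0x61, 0x0301, 0x0323]).map { |cp| format('U+%04X', cp) }
# ["U+0061", "U+0323", "U+0301"]
```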
.tidy_bytes(string) ⇒ Object
Replaces all ISO-8859-1 or CP1252 characters by their UTF-8 equivalent resulting in a valid UTF-8 string.
# File 'lib/active_support/multibyte/chars.rb', line 636
def tidy_bytes(string)
  string.split(//u).map do |c|
    c.force_encoding(Encoding::ASCII) if c.respond_to?(:force_encoding)

    if !ActiveSupport::Multibyte::VALID_CHARACTER['UTF-8'].match(c)
      n = c.unpack('C')[0]
      n < 128 ? n.chr :
      n < 160 ? [UCD.cp1252[n] || n].pack('U') :
      n < 192 ? "\xC2" + n.chr : "\xC3" + (n-64).chr
    else
      c
    end
  end.join
end
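The byte arithmetic in the last two branches re-encodes a single ISO-8859-1 byte as two UTF-8 bytes. The sketch below isolates that step with a hypothetical helper, `latin1_byte_to_utf8`. It deliberately omits the CP1252 lookup for bytes 128..159 (the `UCD.cp1252` table is not reproduced here), so those bytes are treated as plain ISO-8859-1.

```ruby
# Re-encode one ISO-8859-1 byte value (0..255) as a UTF-8 string.
# Assumption: no CP1252 remapping for 128..159.
def latin1_byte_to_utf8(n)
  bytes =
    if n < 128
      [n]             # ASCII is unchanged
    elsif n < 192
      [0xC2, n]       # U+0080..U+00BF
    else
      [0xC3, n - 64]  # U+00C0..U+00FF
    end
  bytes.pack('C*').force_encoding(Encoding::UTF_8)
end

puts latin1_byte_to_utf8(0xE9)  # é
puts latin1_byte_to_utf8(0x41)  # A
```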
.u_unpack(string) ⇒ Object
Unpack the string at codepoints boundaries. Raises an EncodingError when the encoding of the string isn’t valid UTF-8.
Example:
Chars.u_unpack('Café') #=> [67, 97, 102, 233]
# File 'lib/active_support/multibyte/chars.rb', line 471
def u_unpack(string)
  begin
    string.unpack 'U*'
  rescue ArgumentError
    raise EncodingError, 'malformed UTF-8 character'
  end
end
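The same unpack-and-rescue approach can be tried as a free-standing function. This sketch copies the logic of the listing outside the class, so the top-level method name `u_unpack` here is only for the example.

```ruby
# Unpack a UTF-8 string into codepoints, translating the ArgumentError
# raised by String#unpack into an EncodingError.
def u_unpack(string)
  string.unpack('U*')
rescue ArgumentError
  raise EncodingError, 'malformed UTF-8 character'
end

p u_unpack('Café')  # [67, 97, 102, 233]

begin
  u_unpack("\xFF")  # a lone 0xFF byte is never valid UTF-8
rescue EncodingError => e
  puts e.message    # malformed UTF-8 character
end

# Array#pack with 'U*' is the inverse operation.
puts u_unpack('Café').pack('U*')  # Café
```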
.wants?(string) ⇒ Boolean
Returns true if the Chars class can and should act as a proxy for the given string. Returns false otherwise.
# File 'lib/active_support/multibyte/chars.rb', line 118
def self.wants?(string)
  $KCODE == 'UTF8' && consumes?(string)
end
Instance Method Details
#+(other) ⇒ Object
Returns a new Chars object containing the other object concatenated to the string.
Example:
('Café'.mb_chars + ' périferôl').to_s #=> "Café périferôl"
# File 'lib/active_support/multibyte/chars.rb', line 147
def +(other)
  self << other
end
#<=>(other) ⇒ Object
Returns -1, 0 or +1 depending on whether the Chars object is to be sorted before, equal to or after the object on the right side of the operation. It accepts any object that implements to_s. See String#<=> for more details.
Example:
'é'.mb_chars <=> 'ü'.mb_chars #=> -1
# File 'lib/active_support/multibyte/chars.rb', line 139
def <=>(other)
  @wrapped_string <=> other.to_s
end
#=~(other) ⇒ Object
Like String#=~, only it returns the character offset (in codepoints) instead of the byte offset.
Example:
'Café périferôl'.mb_chars =~ /ô/ #=> 12
# File 'lib/active_support/multibyte/chars.rb', line 155
def =~(other)
  translate_offset(@wrapped_string =~ other)
end
#[]=(*args) ⇒ Object
Like String#[]=, except instead of byte offsets you specify character offsets.
Example:
s = "Müller"
s.mb_chars[2] = "e" # Replace character with offset 2
s
#=> "Müeler"
s = "Müller"
s.mb_chars[1, 2] = "ö" # Replace 2 characters at character offset 1
s
#=> "Möler"
# File 'lib/active_support/multibyte/chars.rb', line 231
def []=(*args)
  replace_by = args.pop
  # Indexed replace with regular expressions already works
  if args.first.is_a?(Regexp)
    @wrapped_string[*args] = replace_by
  else
    result = self.class.u_unpack(@wrapped_string)
    if args[0].is_a?(Fixnum)
      raise IndexError, "index #{args[0]} out of string" if args[0] >= result.length
      min = args[0]
      max = args[1].nil? ? min : (min + args[1] - 1)
      range = Range.new(min, max)
      replace_by = [replace_by].pack('U') if replace_by.is_a?(Fixnum)
    elsif args.first.is_a?(Range)
      raise RangeError, "#{args[0]} out of range" if args[0].min >= result.length
      range = args[0]
    else
      needle = args[0].to_s
      min = index(needle)
      max = min + self.class.u_unpack(needle).length - 1
      range = Range.new(min, max)
    end
    result[range] = self.class.u_unpack(replace_by)
    @wrapped_string.replace(result.pack('U*'))
  end
end
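The core of the integer-offset branch is an array splice on codepoints. A minimal sketch follows, using a hypothetical non-mutating helper `replace_chars` that returns a new string instead of changing the receiver.

```ruby
# Replace `length` characters starting at character `offset`.
def replace_chars(string, offset, length, replacement)
  cps = string.unpack('U*')
  raise IndexError, "index #{offset} out of string" if offset >= cps.length

  # Array#[]=(start, length) splices the replacement codepoints in place.
  cps[offset, length] = replacement.unpack('U*')
  cps.pack('U*')
end

puts replace_chars('Müller', 2, 1, 'e')  # Müeler
puts replace_chars('Müller', 1, 2, 'ö')  # Möler
```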
#acts_like_string? ⇒ Boolean
Enable more predictable duck-typing on String-like classes. See Object#acts_like?.
# File 'lib/active_support/multibyte/chars.rb', line 112
def acts_like_string?
  true
end
#capitalize ⇒ Object
Converts the first character to uppercase and the remainder to lowercase.
Example:
'über'.mb_chars.capitalize.to_s #=> "Über"
# File 'lib/active_support/multibyte/chars.rb', line 393
def capitalize
  (slice(0) || chars('')).upcase + (slice(1..-1) || chars('')).downcase
end
#center(integer, padstr = ' ') ⇒ Object
Works just like String#center, only integer specifies characters instead of bytes.
Example:
"¾ cup".mb_chars.center(8).to_s
#=> " ¾ cup  "
"¾ cup".mb_chars.center(8, " ").to_s # Use non-breaking whitespace
#=> " ¾ cup  "
# File 'lib/active_support/multibyte/chars.rb', line 293
def center(integer, padstr=' ')
  justify(integer, :center, padstr)
end
#compose ⇒ Object
Performs composition on all the characters.
Example:
'é'.length #=> 3
'é'.mb_chars.compose.to_s.length #=> 2
# File 'lib/active_support/multibyte/chars.rb', line 435
def compose
  chars(self.class.compose_codepoints(self.class.u_unpack(@wrapped_string)).pack('U*'))
end
#decompose ⇒ Object
Performs canonical decomposition on all the characters.
Example:
'é'.length #=> 2
'é'.mb_chars.decompose.to_s.length #=> 3
# File 'lib/active_support/multibyte/chars.rb', line 426
def decompose
  chars(self.class.decompose_codepoints(:canonical, self.class.u_unpack(@wrapped_string)).pack('U*'))
end
#downcase ⇒ Object
Convert characters in the string to lowercase.
Example:
'VĚDA A VÝZKUM'.mb_chars.downcase.to_s #=> "věda a výzkum"
# File 'lib/active_support/multibyte/chars.rb', line 385
def downcase
  apply_mapping :lowercase_mapping
end
#g_length ⇒ Object
Returns the number of grapheme clusters in the string.
Example:
'क्षि'.mb_chars.length #=> 4
'क्षि'.mb_chars.g_length #=> 3
# File 'lib/active_support/multibyte/chars.rb', line 444
def g_length
  self.class.g_unpack(@wrapped_string).length
end
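The difference between counting codepoints and counting grapheme clusters can be observed with core Ruby alone. This sketch assumes Ruby 2.5 or later and uses String#grapheme_clusters as a stand-in for g_unpack; it is not the Chars implementation and may draw boundaries differently for some scripts.

```ruby
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

puts decomposed.codepoints.length         # 2
puts decomposed.grapheme_clusters.length  # 1

# CR LF is also kept together as a single cluster.
puts "\r\n".grapheme_clusters.length      # 1
```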
#include?(other) ⇒ Boolean
Returns true if the wrapped string contains other. Returns false otherwise.
Example:
'Café'.mb_chars.include?('é') #=> true
# File 'lib/active_support/multibyte/chars.rb', line 188
def include?(other)
  # We have to redefine this method because Enumerable defines it.
  @wrapped_string.include?(other)
end
#index(needle, offset = 0) ⇒ Object
Returns the position of needle in the string, counting in codepoints. Returns nil if needle isn't found.
Example:
'Café périferôl'.mb_chars.index('ô') #=> 12
'Café périferôl'.mb_chars.index(/\w/u) #=> 0
# File 'lib/active_support/multibyte/chars.rb', line 198
def index(needle, offset=0)
  wrapped_offset = self.first(offset).wrapped_string.length
  index = @wrapped_string.index(needle, wrapped_offset)
  index ? (self.class.u_unpack(@wrapped_string.slice(0...index)).size) : nil
end
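The listing finds a byte offset first and then converts it to a character offset by counting the codepoints that precede it. The sketch below reproduces that translation in a current Ruby, where String#index already counts characters, by searching a binary copy of the string. The helper name `char_index` is hypothetical.

```ruby
# Find needle by bytes, then translate the byte offset to a character offset.
def char_index(string, needle)
  byte_offset = string.b.index(needle.b)
  return nil if byte_offset.nil?

  # Count the codepoints in the prefix before the match.
  string.byteslice(0, byte_offset).unpack('U*').size
end

text = 'Café périferôl'
puts text.b.index('ô'.b)    # 14 (byte offset)
puts char_index(text, 'ô')  # 12 (character offset)
```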
#insert(offset, fragment) ⇒ Object
Inserts the passed string at specified codepoint offsets.
Example:
'Café'.mb_chars.insert(4, ' périferôl').to_s #=> "Café périferôl"
# File 'lib/active_support/multibyte/chars.rb', line 172
def insert(offset, fragment)
  unpacked = self.class.u_unpack(@wrapped_string)
  unless offset > unpacked.length
    @wrapped_string.replace(
      self.class.u_unpack(@wrapped_string).insert(offset, *self.class.u_unpack(fragment)).pack('U*')
    )
  else
    raise IndexError, "index #{offset} out of string"
  end
  self
end
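Insertion at a codepoint offset reduces to Array#insert on the unpacked codepoints. A small sketch with a hypothetical non-mutating helper `insert_at`:

```ruby
# Insert fragment at the given character offset and return a new string.
def insert_at(string, offset, fragment)
  cps = string.unpack('U*')
  raise IndexError, "index #{offset} out of string" if offset > cps.length

  cps.insert(offset, *fragment.unpack('U*')).pack('U*')
end

puts insert_at('Café', 4, ' périferôl')  # Café périferôl
puts insert_at('Caf', 0, 'é')            # éCaf
```

An offset equal to the string length is allowed and appends the fragment, matching the `unless offset > unpacked.length` guard in the listing.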
#ljust(integer, padstr = ' ') ⇒ Object
Works just like String#ljust, only integer specifies characters instead of bytes.
Example:
"¾ cup".mb_chars.ljust(8).to_s
#=> "¾ cup   "
"¾ cup".mb_chars.ljust(8, " ").to_s # Use non-breaking whitespace
#=> "¾ cup   "
# File 'lib/active_support/multibyte/chars.rb', line 280
def ljust(integer, padstr=' ')
  justify(integer, :left, padstr)
end
#lstrip ⇒ Object
Strips entire range of Unicode whitespace from the left of the string.
# File 'lib/active_support/multibyte/chars.rb', line 303
def lstrip
  chars(@wrapped_string.gsub(UNICODE_LEADERS_PAT, ''))
end
#normalize(form = ActiveSupport::Multibyte.default_normalization_form) ⇒ Object
Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.
- str - The string to perform normalization on.
- form - The form you want to normalize in. Should be one of the following: :c, :kc, :d, or :kd. Default is ActiveSupport::Multibyte.default_normalization_form.
# File 'lib/active_support/multibyte/chars.rb', line 404
def normalize(form=ActiveSupport::Multibyte.default_normalization_form)
  # See http://www.unicode.org/reports/tr15, Table 1
  codepoints = self.class.u_unpack(@wrapped_string)
  chars(case form
    when :d
      self.class.reorder_characters(self.class.decompose_codepoints(:canonical, codepoints))
    when :c
      self.class.compose_codepoints(self.class.reorder_characters(self.class.decompose_codepoints(:canonical, codepoints)))
    when :kd
      self.class.reorder_characters(self.class.decompose_codepoints(:compatability, codepoints))
    when :kc
      self.class.compose_codepoints(self.class.reorder_characters(self.class.decompose_codepoints(:compatability, codepoints)))
    else
      raise ArgumentError, "#{form} is not a valid normalization variant", caller
  end.pack('U*'))
end
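To get a feel for the four forms, the sketch below uses Ruby's built-in String#unicode_normalize (Ruby 2.2 or later) as a stand-in. It is not the Chars implementation, and the `FORMS` mapping from the short symbols to Ruby's names is an assumption made for the example.

```ruby
# Map the form symbols used by Chars#normalize to the names Ruby expects.
FORMS = { c: :nfc, d: :nfd, kc: :nfkc, kd: :nfkd }.freeze

def normalize(string, form = :kc)
  ruby_form = FORMS.fetch(form) do
    raise ArgumentError, "#{form} is not a valid normalization variant"
  end
  string.unicode_normalize(ruby_form)
end

decomposed = "e\u0301"  # "e" + COMBINING ACUTE ACCENT
ligature   = "\uFB01"   # LATIN SMALL LIGATURE FI

p normalize(decomposed, :c).codepoints  # [233]
p normalize("\u00E9", :d).codepoints    # [101, 769]
puts normalize(ligature, :kc)           # fi
puts normalize(ligature, :c) == ligature  # true
```

The compatibility forms (:kc, :kd) fold characters such as the ligature into their plain equivalents, while the canonical forms (:c, :d) leave it alone.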
#ord ⇒ Object
Returns the codepoint of the first character in the string.
Example:
'こんにちは'.mb_chars.ord #=> 12371
# File 'lib/active_support/multibyte/chars.rb', line 369
def ord
  self.class.u_unpack(@wrapped_string)[0]
end
#respond_to?(method, include_private = false) ⇒ Boolean
Returns true if obj responds to the given method. Private methods are included in the search only if the optional second parameter evaluates to true.
# File 'lib/active_support/multibyte/chars.rb', line 107
def respond_to?(method, include_private=false)
  super || @wrapped_string.respond_to?(method, include_private) || false
end
#reverse ⇒ Object
Reverses all characters in the string.
Example:
'Café'.mb_chars.reverse.to_s #=> 'éfaC'
# File 'lib/active_support/multibyte/chars.rb', line 322
def reverse
  chars(self.class.g_unpack(@wrapped_string).reverse.flatten.pack('U*'))
end
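The listing reverses grapheme clusters rather than individual codepoints, which keeps combining marks attached to their base character. The sketch below shows the difference using core Ruby's String#grapheme_clusters (Ruby 2.5 or later) as a stand-in for g_unpack.

```ruby
word = "noe\u0308l"  # "noël" with a decomposed ë (e + COMBINING DIAERESIS)

# Reversing codepoints moves the diaeresis onto the "l".
naive = word.codepoints.reverse.pack('U*')

# Reversing grapheme clusters keeps "e" and its diaeresis together.
aware = word.grapheme_clusters.reverse.join

p naive.codepoints.map { |cp| format('U+%04X', cp) }
# ["U+006C", "U+0308", "U+0065", "U+006F", "U+006E"]
p aware.codepoints.map { |cp| format('U+%04X', cp) }
# ["U+006C", "U+0065", "U+0308", "U+006F", "U+006E"]
```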
#rindex(needle, offset = nil) ⇒ Object
Returns the position of needle in the string, counting in codepoints, searching backward from offset or the end of the string. Returns nil if needle isn't found.
Example:
'Café périferôl'.mb_chars.rindex('é') #=> 6
'Café périferôl'.mb_chars.rindex(/\w/u) #=> 13
# File 'lib/active_support/multibyte/chars.rb', line 211
def rindex(needle, offset=nil)
  offset ||= length
  wrapped_offset = self.first(offset).wrapped_string.length
  index = @wrapped_string.rindex(needle, wrapped_offset)
  index ? (self.class.u_unpack(@wrapped_string.slice(0...index)).size) : nil
end
#rjust(integer, padstr = ' ') ⇒ Object
Works just like String#rjust, only integer specifies characters instead of bytes.
Example:
"¾ cup".mb_chars.rjust(8).to_s
#=> "   ¾ cup"
"¾ cup".mb_chars.rjust(8, " ").to_s # Use non-breaking whitespace
#=> "   ¾ cup"
# File 'lib/active_support/multibyte/chars.rb', line 267
def rjust(integer, padstr=' ')
  justify(integer, :right, padstr)
end
#rstrip ⇒ Object
Strips entire range of Unicode whitespace from the right of the string.
# File 'lib/active_support/multibyte/chars.rb', line 298
def rstrip
  chars(@wrapped_string.gsub(UNICODE_TRAILERS_PAT, ''))
end
#size ⇒ Object Also known as: length
Returns the number of codepoints in the string.
# File 'lib/active_support/multibyte/chars.rb', line 313
def size
  self.class.u_unpack(@wrapped_string).size
end
#slice(*args) ⇒ Object Also known as: []
Implements Unicode-aware slice with codepoints. Slicing on one point returns the codepoints for that character.
Example:
'こんにちは'.mb_chars.slice(2..3).to_s #=> "にち"
# File 'lib/active_support/multibyte/chars.rb', line 331
def slice(*args)
  if args.size > 2
    raise ArgumentError, "wrong number of arguments (#{args.size} for 1)" # Do as if we were native
  elsif (args.size == 2 && !(args.first.is_a?(Numeric) || args.first.is_a?(Regexp)))
    raise TypeError, "cannot convert #{args.first.class} into Integer" # Do as if we were native
  elsif (args.size == 2 && !args[1].is_a?(Numeric))
    raise TypeError, "cannot convert #{args[1].class} into Integer" # Do as if we were native
  elsif args[0].kind_of? Range
    cps = self.class.u_unpack(@wrapped_string).slice(*args)
    result = cps.nil? ? nil : cps.pack('U*')
  elsif args[0].kind_of? Regexp
    result = @wrapped_string.slice(*args)
  elsif args.size == 1 && args[0].kind_of?(Numeric)
    character = self.class.u_unpack(@wrapped_string)[args[0]]
    result = character.nil? ? nil : [character].pack('U')
  else
    result = self.class.u_unpack(@wrapped_string).slice(*args).pack('U*')
  end
  result.nil? ? nil : chars(result)
end
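The Range branch is the simplest to reproduce: unpack to codepoints, slice the array, and pack the result again. A minimal sketch with a hypothetical helper `char_slice`:

```ruby
# Slice a string by a character range via its codepoints.
# Returns nil when the range starts past the end of the string.
def char_slice(string, range)
  cps = string.unpack('U*')[range]
  cps && cps.pack('U*')
end

puts char_slice('こんにちは', 2..3)           # にち
puts char_slice('こんにちは', 10..12).inspect  # nil
```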
#slice!(*args) ⇒ Object
Like String#slice!, except instead of byte offsets you specify character offsets.
Example:
s = 'こんにちは'
s.mb_chars.slice!(2..3).to_s #=> "にち"
s #=> "こんは"
# File 'lib/active_support/multibyte/chars.rb', line 359
def slice!(*args)
  slice = self[*args]
  self[*args] = ''
  slice
end
#split(*args) ⇒ Object
Works just like String#split, with the exception that the items in the resulting list are Chars instances instead of String. This makes chaining methods easier.
Example:
'Café périferôl'.mb_chars.split(/é/).map { |part| part.upcase.to_s } #=> ["CAF", " P", "RIFERÔL"]
# File 'lib/active_support/multibyte/chars.rb', line 164
def split(*args)
  @wrapped_string.split(*args).map { |i| i.mb_chars }
end
#strip ⇒ Object
Strips entire range of Unicode whitespace from the right and left of the string.
# File 'lib/active_support/multibyte/chars.rb', line 308
def strip
  rstrip.lstrip
end
#tidy_bytes ⇒ Object
Replaces all ISO-8859-1 or CP1252 characters by their UTF-8 equivalent resulting in a valid UTF-8 string.
# File 'lib/active_support/multibyte/chars.rb', line 449
def tidy_bytes
  chars(self.class.tidy_bytes(@wrapped_string))
end
#upcase ⇒ Object
Convert characters in the string to uppercase.
Example:
'Laurent, où sont les tests?'.mb_chars.upcase.to_s #=> "LAURENT, OÙ SONT LES TESTS?"
# File 'lib/active_support/multibyte/chars.rb', line 377
def upcase
  apply_mapping :uppercase_mapping
end