Class: Mail::Multibyte::Chars
- Includes: Comparable
- Defined in: lib/mail/multibyte/chars.rb
Overview
Chars enables you to work transparently with UTF-8 encoding in the Ruby String class without having extensive knowledge about the encoding. A Chars object accepts a string upon initialization and proxies String methods in an encoding-safe manner. All the normal String methods are also implemented on the proxy.
String methods are proxied through the Chars object, and can be accessed through the mb_chars method. Methods which would normally return a String object now return a Chars object so that methods can be chained.
"The Perfect String ".mb_chars.downcase.strip.normalize # => "the perfect string"
Chars objects are interchangeable with String objects as long as no explicit class checks are made. If certain methods do explicitly check the class, call to_s before you pass Chars objects to them.
bad.explicit_checking_method "T".mb_chars.downcase.to_s
The default Chars implementation assumes that the encoding of the string is UTF-8. If you want to handle different encodings, you can write your own multibyte string handler and configure it through Mail::Multibyte.proxy_class.
class CharsForUTF32
  def size
    @wrapped_string.size / 4
  end

  # Matches the consumes? class method documented below
  def self.consumes?(string)
    string.length % 4 == 0
  end
end
Mail::Multibyte.proxy_class = CharsForUTF32
Instance Attribute Summary
- #wrapped_string ⇒ Object (also: #to_s, #to_str) readonly
  Returns the value of attribute wrapped_string.
Class Method Summary
- .consumes?(string) ⇒ Boolean
  Returns true when the proxy class can handle the string.
- .wants?(string) ⇒ Boolean
  Returns true if the Chars class can and should act as a proxy for the given string.
Instance Method Summary
- #+(other) ⇒ Object
  Returns a new Chars object containing the other object concatenated to the string.
- #<=>(other) ⇒ Object
  Returns -1, 0, or 1, depending on whether the Chars object sorts before, equal to, or after the object on the right side of the operation.
- #=~(other) ⇒ Object
  Like String#=~, only it returns the character offset (in codepoints) instead of the byte offset.
- #[]=(*args) ⇒ Object
  Like String#[]=, except instead of byte offsets you specify character offsets.
- #acts_like_string? ⇒ Boolean
  Enable more predictable duck-typing on String-like classes.
- #capitalize ⇒ Object
  Converts the first character to uppercase and the remainder to lowercase.
- #center(integer, padstr = ' ') ⇒ Object
  Works just like String#center, only integer specifies characters instead of bytes.
- #compose ⇒ Object
  Performs composition on all the characters.
- #decompose ⇒ Object
  Performs canonical decomposition on all the characters.
- #downcase ⇒ Object
  Converts characters in the string to lowercase.
- #g_length ⇒ Object
  Returns the number of grapheme clusters in the string.
- #include?(other) ⇒ Boolean
  Returns true if the wrapped string contains other.
- #index(needle, offset = 0) ⇒ Object
  Returns the position of needle in the string, counting in codepoints.
- #initialize(string) ⇒ Chars (constructor)
  Creates a new Chars instance wrapping the given string.
- #insert(offset, fragment) ⇒ Object
  Inserts the passed string at the specified codepoint offset.
- #limit(limit) ⇒ Object
  Limits the byte size of the string to a number of bytes without breaking characters.
- #ljust(integer, padstr = ' ') ⇒ Object
  Works just like String#ljust, only integer specifies characters instead of bytes.
- #lstrip ⇒ Object
  Strips the entire range of Unicode whitespace from the left of the string.
- #method_missing(method, *args, &block) ⇒ Object
  Forwards all undefined methods to the wrapped string.
- #normalize(form = nil) ⇒ Object
  Returns the KC normalization of the string by default.
- #ord ⇒ Object
  Returns the codepoint of the first character in the string.
- #respond_to?(method, include_private = false) ⇒ Boolean
  Returns true if obj responds to the given method.
- #reverse ⇒ Object
  Reverses all characters in the string.
- #rindex(needle, offset = nil) ⇒ Object
  Returns the position of needle in the string, counting in codepoints, searching backward from offset or the end of the string.
- #rjust(integer, padstr = ' ') ⇒ Object
  Works just like String#rjust, only integer specifies characters instead of bytes.
- #rstrip ⇒ Object
  Strips the entire range of Unicode whitespace from the right of the string.
- #size ⇒ Object (also: #length)
  Returns the number of codepoints in the string.
- #slice(*args) ⇒ Object (also: #[])
  Implements Unicode-aware slice with codepoints.
- #split(*args) ⇒ Object
  Works just like String#split, with the exception that the items in the resulting list are Chars instances instead of String.
- #strip ⇒ Object
  Strips the entire range of Unicode whitespace from the right and left of the string.
- #tidy_bytes(force = false) ⇒ Object
  Replaces all ISO-8859-1 or CP1252 characters by their UTF-8 equivalent, resulting in a valid UTF-8 string.
- #titleize ⇒ Object (also: #titlecase)
  Capitalizes the first letter of every word, when possible.
- #upcase ⇒ Object
  Converts characters in the string to uppercase.
Constructor Details
#initialize(string) ⇒ Chars
Creates a new Chars instance wrapping the given string. Unless the string is frozen, its encoding is forced to UTF-8.
# File 'lib/mail/multibyte/chars.rb', line 41
def initialize(string)
  @wrapped_string = string
  @wrapped_string.force_encoding(Encoding::UTF_8) unless @wrapped_string.frozen?
end
Dynamic Method Handling
This class handles dynamic methods through the method_missing method.
#method_missing(method, *args, &block) ⇒ Object
Forward all undefined methods to the wrapped string.
# File 'lib/mail/multibyte/chars.rb', line 52
def method_missing(method, *args, &block)
  if method.to_s =~ /!$/
    @wrapped_string.__send__(method, *args, &block)
    self
  else
    result = @wrapped_string.__send__(method, *args, &block)
    result.kind_of?(String) ? chars(result) : result
  end
end
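The forwarding pattern can be tried in isolation. The following is a minimal sketch, not the library's class: TinyChars is a hypothetical stand-in that re-wraps String results so that calls chain, and returns the proxy itself after bang methods. It also assumes respond_to_missing? is the preferred hook on current Ruby versions, where the original overrides respond_to? directly.

```ruby
# Hypothetical stand-in for illustration; not Mail::Multibyte::Chars.
class TinyChars
  attr_reader :wrapped_string
  alias to_s wrapped_string

  def initialize(string)
    @wrapped_string = string
  end

  # Forward methods the proxy does not define to the wrapped string.
  def method_missing(method, *args, &block)
    return super unless @wrapped_string.respond_to?(method)

    if method.to_s.end_with?('!')
      # Bang methods mutate the wrapped string in place; return the proxy.
      @wrapped_string.public_send(method, *args, &block)
      self
    else
      result = @wrapped_string.public_send(method, *args, &block)
      # Re-wrap String results so that further calls can be chained.
      result.is_a?(String) ? self.class.new(result) : result
    end
  end

  def respond_to_missing?(method, include_private = false)
    @wrapped_string.respond_to?(method, include_private) || super
  end
end

chained = TinyChars.new(+' The Perfect String ').downcase.strip
puts chained.to_s
```

Non-String results such as the Integer returned by length pass through unwrapped, which is why only String results are re-wrapped.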
Instance Attribute Details
#wrapped_string ⇒ Object (readonly) Also known as: to_s, to_str
Returns the value of attribute wrapped_string.
# File 'lib/mail/multibyte/chars.rb', line 35
def wrapped_string
  @wrapped_string
end
Class Method Details
.consumes?(string) ⇒ Boolean
Returns true when the proxy class can handle the string. Returns false otherwise.
# File 'lib/mail/multibyte/chars.rb', line 74
def self.consumes?(string)
  # Unpack is a little bit faster than regular expressions.
  string.unpack('U*')
  true
rescue ArgumentError
  false
end
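The same validity check can be run standalone. This sketch assumes a hypothetical helper name, utf8_consumable?, and relies on String#unpack raising ArgumentError when the 'U*' directive meets malformed UTF-8.

```ruby
# Hypothetical helper mirroring the unpack-based check shown above.
def utf8_consumable?(string)
  string.unpack('U*') # raises ArgumentError on malformed UTF-8
  true
rescue ArgumentError
  false
end

valid   = utf8_consumable?("Caf\u00E9") # well-formed UTF-8
invalid = utf8_consumable?("Caf\xE9")   # a lone ISO-8859-1 byte
puts valid, invalid
```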
.wants?(string) ⇒ Boolean
Returns true if the Chars class can and should act as a proxy for the given string. Returns false otherwise.
# File 'lib/mail/multibyte/chars.rb', line 98
def self.wants?(string)
  $KCODE == 'UTF8' && consumes?(string)
end
Instance Method Details
#+(other) ⇒ Object
Returns a new Chars object containing the other object concatenated to the string.
Example:
('Café'.mb_chars + ' périferôl').to_s # => "Café périferôl"
# File 'lib/mail/multibyte/chars.rb', line 106
def +(other)
  chars(@wrapped_string + other)
end
#<=>(other) ⇒ Object
Returns -1, 0, or 1, depending on whether the Chars object sorts before, equal to, or after the object on the right side of the operation. It accepts any object that implements to_s:
'é'.mb_chars <=> 'ü'.mb_chars # => -1
See String#<=> for more details.
# File 'lib/mail/multibyte/chars.rb', line 91
def <=>(other)
  @wrapped_string <=> other.to_s
end
#=~(other) ⇒ Object
Like String#=~, only it returns the character offset (in codepoints) instead of the byte offset.
Example:
'Café périferôl'.mb_chars =~ /ô/ # => 12
# File 'lib/mail/multibyte/chars.rb', line 114
def =~(other)
  translate_offset(@wrapped_string =~ other)
end
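The body of translate_offset is not shown on this page, so the following is only a sketch of the idea under stated assumptions: a byte offset becomes a codepoint offset by counting the codepoints that precede it. The helper name codepoint_offset is hypothetical.

```ruby
# Hypothetical helper: convert a byte offset into a codepoint offset.
def codepoint_offset(string, byte_offset)
  return nil if byte_offset.nil?

  # Count the codepoints contained in the bytes before the offset.
  string.byteslice(0, byte_offset).unpack('U*').length
end

text = "Caf\u00E9 p\u00E9rifer\u00F4l"
byte_offset = text.b.index("\u00F4".b) # search on raw bytes
char_offset = codepoint_offset(text, byte_offset)
puts "byte #{byte_offset} is character #{char_offset}"
```

The two accented characters before the match each occupy two bytes, which is why the byte offset is two higher than the character offset.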
#[]=(*args) ⇒ Object
Like String#[]=, except instead of byte offsets you specify character offsets.
Example:
s = "Müller"
s.mb_chars[2] = "e" # Replace character with offset 2
s
# => "Müeler"
s = "Müller"
s.mb_chars[1, 2] = "ö" # Replace 2 characters at character offset 1
s
# => "Möler"
# File 'lib/mail/multibyte/chars.rb', line 264
def []=(*args)
  replace_by = args.pop
  # Indexed replace with regular expressions already works
  if args.first.is_a?(Regexp)
    @wrapped_string[*args] = replace_by
  else
    result = Unicode.u_unpack(@wrapped_string)
    if args[0].is_a?(Fixnum)
      raise IndexError, "index #{args[0]} out of string" if args[0] >= result.length
      min = args[0]
      max = args[1].nil? ? min : (min + args[1] - 1)
      range = Range.new(min, max)
      replace_by = [replace_by].pack('U') if replace_by.is_a?(Fixnum)
    elsif args.first.is_a?(Range)
      raise RangeError, "#{args[0]} out of range" if args[0].min >= result.length
      range = args[0]
    else
      needle = args[0].to_s
      min = index(needle)
      max = min + Unicode.u_unpack(needle).length - 1
      range = Range.new(min, max)
    end
    result[range] = Unicode.u_unpack(replace_by)
    @wrapped_string.replace(result.pack('U*'))
  end
end
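The unpack, splice, and repack steps at the heart of this method can be sketched without the library. The helper below is hypothetical and handles only the offset-plus-length form, and it returns a new string instead of mutating the receiver.

```ruby
# Hypothetical helper: replace `length` characters starting at character
# `offset`, working on an array of codepoints instead of on bytes.
def replace_codepoints(string, offset, length, replacement)
  codepoints = string.unpack('U*')
  raise IndexError, "index #{offset} out of string" if offset >= codepoints.length

  codepoints[offset, length] = replacement.unpack('U*')
  codepoints.pack('U*') # pack with 'U*' yields a UTF-8 string
end

single = replace_codepoints("M\u00FCller", 2, 1, 'e')
double = replace_codepoints("M\u00FCller", 1, 2, "\u00F6")
puts single, double
```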
#acts_like_string? ⇒ Boolean
Enable more predictable duck-typing on String-like classes. See Object#acts_like?.
# File 'lib/mail/multibyte/chars.rb', line 69
def acts_like_string?
  true
end
#capitalize ⇒ Object
Converts the first character to uppercase and the remainder to lowercase.
Example:
'über'.mb_chars.capitalize.to_s # => "Über"
# File 'lib/mail/multibyte/chars.rb', line 357
def capitalize
  (slice(0) || chars('')).upcase + (slice(1..-1) || chars('')).downcase
end
#center(integer, padstr = ' ') ⇒ Object
Works just like String#center, only integer specifies characters instead of bytes.
Example:
"¾ cup".mb_chars.center(8).to_s
# => " ¾ cup  "
"¾ cup".mb_chars.center(8, " ").to_s # Use non-breaking whitespace
# => " ¾ cup  "
# File 'lib/mail/multibyte/chars.rb', line 232
def center(integer, padstr=' ')
  justify(integer, :center, padstr)
end
#compose ⇒ Object
Performs composition on all the characters.
Example:
'é'.length # => 3
'é'.mb_chars.compose.to_s.length # => 2
# File 'lib/mail/multibyte/chars.rb', line 395
def compose
  chars(Unicode.compose_codepoints(Unicode.u_unpack(@wrapped_string)).pack('U*'))
end
#decompose ⇒ Object
Performs canonical decomposition on all the characters.
Example:
'é'.length # => 2
'é'.mb_chars.decompose.to_s.length # => 3
# File 'lib/mail/multibyte/chars.rb', line 386
def decompose
  chars(Unicode.decompose_codepoints(:canonical, Unicode.u_unpack(@wrapped_string)).pack('U*'))
end
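Composition and decomposition are easier to follow with explicit codepoints. This sketch uses String#unicode_normalize from current Ruby versions as a stand-in for the library's own routines, which is an assumption about equivalence rather than a description of the implementation. The lengths 3 and 2 in the examples above appear to be byte counts, as String#length reported on Ruby 1.8.

```ruby
decomposed = "e\u0301"                          # "e" plus a combining acute accent
composed   = decomposed.unicode_normalize(:nfc) # canonical composition
round_trip = composed.unicode_normalize(:nfd)   # canonical decomposition

puts decomposed.bytesize # bytes in the decomposed form
puts composed.bytesize   # bytes in the composed form
```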
#downcase ⇒ Object
Convert characters in the string to lowercase.
Example:
'VĚDA A VÝZKUM'.mb_chars.downcase.to_s # => "věda a výzkum"
# File 'lib/mail/multibyte/chars.rb', line 349
def downcase
  chars(Unicode.apply_mapping(@wrapped_string), :lowercase_mapping)
end
#g_length ⇒ Object
Returns the number of grapheme clusters in the string.
Example:
'क्षि'.mb_chars.length # => 4
'क्षि'.mb_chars.g_length # => 3
# File 'lib/mail/multibyte/chars.rb', line 404
def g_length
  Unicode.g_unpack(@wrapped_string).length
end
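The difference between codepoints and grapheme clusters can be seen with a combining accent. This sketch uses String#grapheme_clusters from current Ruby versions, not the library's g_unpack, and cluster boundaries for some scripts depend on the Unicode version that Ruby ships with, so counts for the Devanagari example above may differ.

```ruby
text = "e\u0301x" # "e", a combining acute accent, then "x"

codepoint_count = text.codepoints.length         # counts every codepoint
grapheme_count  = text.grapheme_clusters.length  # counts user-perceived characters

puts "#{codepoint_count} codepoints, #{grapheme_count} grapheme clusters"
```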
#include?(other) ⇒ Boolean
Returns true if the wrapped string contains other. Returns false otherwise.
Example:
'Café'.mb_chars.include?('é') # => true
# File 'lib/mail/multibyte/chars.rb', line 138
def include?(other)
  # We have to redefine this method because Enumerable defines it.
  @wrapped_string.include?(other)
end
#index(needle, offset = 0) ⇒ Object
Returns the position of needle in the string, counting in codepoints. Returns nil if needle isn't found.
Example:
'Café périferôl'.mb_chars.index('ô') # => 12
'Café périferôl'.mb_chars.index(/\w/u) # => 0
# File 'lib/mail/multibyte/chars.rb', line 148
def index(needle, offset=0)
  wrapped_offset = first(offset).wrapped_string.length
  index = @wrapped_string.index(needle, wrapped_offset)
  index ? (Unicode.u_unpack(@wrapped_string.slice(0...index)).size) : nil
end
#insert(offset, fragment) ⇒ Object
Inserts the passed string at the specified codepoint offset.
Example:
'Café'.mb_chars.insert(4, ' périferôl').to_s # => "Café périferôl"
# File 'lib/mail/multibyte/chars.rb', line 122
def insert(offset, fragment)
  unpacked = Unicode.u_unpack(@wrapped_string)
  unless offset > unpacked.length
    @wrapped_string.replace(
      Unicode.u_unpack(@wrapped_string).insert(offset, *Unicode.u_unpack(fragment)).pack('U*')
    )
  else
    raise IndexError, "index #{offset} out of string"
  end
  self
end
#limit(limit) ⇒ Object
Limit the byte size of the string to a number of bytes without breaking characters. Usable when the storage for a string is limited for some reason.
Example:
s = 'こんにちは'
s.mb_chars.limit(7) # => "こん"
# File 'lib/mail/multibyte/chars.rb', line 333
def limit(limit)
  slice(0...translate_offset(limit))
end
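Truncating to a byte budget without splitting a character can be sketched as follows. The helper name limit_bytes is hypothetical, and the approach (cut at the byte limit, then back off until the result is valid UTF-8) is one way to get the documented behaviour, not necessarily how translate_offset does it. The input is assumed to be valid UTF-8.

```ruby
# Hypothetical helper: keep at most max_bytes bytes, never ending mid-character.
def limit_bytes(string, max_bytes)
  truncated = string.byteslice(0, max_bytes)
  # Drop trailing bytes until no partial character remains.
  truncated = truncated.byteslice(0, truncated.bytesize - 1) until truncated.valid_encoding?
  truncated
end

greeting = "\u3053\u3093\u306B\u3061\u306F" # "こんにちは", three bytes per character
limited  = limit_bytes(greeting, 7)
puts limited, limited.bytesize
```

Seven bytes would end one byte into the third character, so the result keeps the first two characters and uses six bytes.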
#ljust(integer, padstr = ' ') ⇒ Object
Works just like String#ljust, only integer specifies characters instead of bytes.
Example:
"¾ cup".mb_chars.ljust(8).to_s
# => "¾ cup   "
"¾ cup".mb_chars.ljust(8, " ").to_s # Use non-breaking whitespace
# => "¾ cup   "
# File 'lib/mail/multibyte/chars.rb', line 219
def ljust(integer, padstr=' ')
  justify(integer, :left, padstr)
end
#lstrip ⇒ Object
Strips entire range of Unicode whitespace from the left of the string.
# File 'lib/mail/multibyte/chars.rb', line 180
def lstrip
  chars(@wrapped_string.gsub(Unicode::LEADERS_PAT, ''))
end
#normalize(form = nil) ⇒ Object
Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.
- form: The form you want to normalize in. Should be one of the following: :c, :kc, :d, or :kd. Default is Mail::Multibyte::Unicode.default_normalization_form.
# File 'lib/mail/multibyte/chars.rb', line 377
def normalize(form = nil)
  chars(Unicode.normalize(@wrapped_string, form))
end
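To see what the four forms do, the sketch below maps them onto String#unicode_normalize from current Ruby versions. The mapping table, the helper name, and the :kc default are assumptions made for illustration; the library uses its own Unicode module and a configurable default.

```ruby
# Assumed correspondence between the short form names and Unicode's names.
NORMALIZATION_FORMS = { c: :nfc, kc: :nfkc, d: :nfd, kd: :nfkd }.freeze

# Hypothetical helper; defaults to :kc as the summary above describes.
def normalize_sketch(string, form = :kc)
  string.unicode_normalize(NORMALIZATION_FORMS.fetch(form))
end

ligature = "\uFB01" # the single-codepoint "fi" ligature
puts normalize_sketch(ligature)            # compatibility form expands it
puts normalize_sketch(ligature, :c).length # canonical form leaves it alone
```

Compatibility forms (:kc, :kd) fold look-alike characters such as ligatures into plain letters, which is the property that makes them attractive for storage and validation.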
#ord ⇒ Object
Returns the codepoint of the first character in the string.
Example:
'こんにちは'.mb_chars.ord # => 12371
# File 'lib/mail/multibyte/chars.rb', line 193
def ord
  Unicode.u_unpack(@wrapped_string)[0]
end
#respond_to?(method, include_private = false) ⇒ Boolean
Returns true if obj responds to the given method. Private methods are included in the search only if the optional second parameter evaluates to true.
# File 'lib/mail/multibyte/chars.rb', line 64
def respond_to?(method, include_private=false)
  super || @wrapped_string.respond_to?(method, include_private) || false
end
#reverse ⇒ Object
Reverses all characters in the string.
Example:
'Café'.mb_chars.reverse.to_s # => 'éfaC'
# File 'lib/mail/multibyte/chars.rb', line 295
def reverse
  chars(Unicode.g_unpack(@wrapped_string).reverse.flatten.pack('U*'))
end
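The listing reverses grapheme clusters, not individual codepoints, which matters as soon as combining marks are involved. A minimal sketch of the difference, using String#grapheme_clusters from current Ruby versions in place of the library's g_unpack:

```ruby
word = "Cafe\u0301" # "Café" spelled with a combining acute accent

by_codepoint = word.reverse                         # accent ends up first, detached
by_grapheme  = word.grapheme_clusters.reverse.join  # accent stays on its "e"

puts by_codepoint.codepoints.inspect
puts by_grapheme
```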
#rindex(needle, offset = nil) ⇒ Object
Returns the position of needle in the string, counting in codepoints, searching backward from offset or the end of the string. Returns nil if needle isn't found.
Example:
'Café périferôl'.mb_chars.rindex('é') # => 6
'Café périferôl'.mb_chars.rindex(/\w/u) # => 13
# File 'lib/mail/multibyte/chars.rb', line 161
def rindex(needle, offset=nil)
  offset ||= length
  wrapped_offset = first(offset).wrapped_string.length
  index = @wrapped_string.rindex(needle, wrapped_offset)
  index ? (Unicode.u_unpack(@wrapped_string.slice(0...index)).size) : nil
end
#rjust(integer, padstr = ' ') ⇒ Object
Works just like String#rjust, only integer specifies characters instead of bytes.
Example:
"¾ cup".mb_chars.rjust(8).to_s
# => "   ¾ cup"
"¾ cup".mb_chars.rjust(8, " ").to_s # Use non-breaking whitespace
# => "   ¾ cup"
# File 'lib/mail/multibyte/chars.rb', line 206
def rjust(integer, padstr=' ')
  justify(integer, :right, padstr)
end
#rstrip ⇒ Object
Strips entire range of Unicode whitespace from the right of the string.
# File 'lib/mail/multibyte/chars.rb', line 175
def rstrip
  chars(@wrapped_string.gsub(Unicode::TRAILERS_PAT, ''))
end
#size ⇒ Object Also known as: length
Returns the number of codepoints in the string.
# File 'lib/mail/multibyte/chars.rb', line 169
def size
  Unicode.u_unpack(@wrapped_string).size
end
#slice(*args) ⇒ Object Also known as: []
Implements Unicode-aware slice with codepoints. Slicing on one point returns the codepoints for that character.
Example:
'こんにちは'.mb_chars.slice(2..3).to_s # => "にち"
# File 'lib/mail/multibyte/chars.rb', line 304
def slice(*args)
  if args.size > 2
    raise ArgumentError, "wrong number of arguments (#{args.size} for 1)" # Do as if we were native
  elsif (args.size == 2 && !(args.first.is_a?(Numeric) || args.first.is_a?(Regexp)))
    raise TypeError, "cannot convert #{args.first.class} into Integer" # Do as if we were native
  elsif (args.size == 2 && !args[1].is_a?(Numeric))
    raise TypeError, "cannot convert #{args[1].class} into Integer" # Do as if we were native
  elsif args[0].kind_of? Range
    cps = Unicode.u_unpack(@wrapped_string).slice(*args)
    result = cps.nil? ? nil : cps.pack('U*')
  elsif args[0].kind_of? Regexp
    result = @wrapped_string.slice(*args)
  elsif args.size == 1 && args[0].kind_of?(Numeric)
    character = Unicode.u_unpack(@wrapped_string)[args[0]]
    result = character && [character].pack('U')
  else
    cps = Unicode.u_unpack(@wrapped_string).slice(*args)
    result = cps && cps.pack('U*')
  end
  result && chars(result)
end
#split(*args) ⇒ Object
Works just like String#split, with the exception that the items in the resulting list are Chars instances instead of String. This makes chaining methods easier.
Example:
'Café périferôl'.mb_chars.split(/é/).map { |part| part.upcase.to_s } # => ["CAF", " P", "RIFERÔL"]
# File 'lib/mail/multibyte/chars.rb', line 247
def split(*args)
  @wrapped_string.split(*args).map { |i| i.mb_chars }
end
#strip ⇒ Object
Strips entire range of Unicode whitespace from the right and left of the string.
# File 'lib/mail/multibyte/chars.rb', line 185
def strip
  rstrip.lstrip
end
#tidy_bytes(force = false) ⇒ Object
Replaces all ISO-8859-1 or CP1252 characters by their UTF-8 equivalent, resulting in a valid UTF-8 string.
Passing true will forcibly tidy all bytes, assuming that the string's encoding is entirely CP1252 or ISO-8859-1.
# File 'lib/mail/multibyte/chars.rb', line 411
def tidy_bytes(force = false)
  chars(Unicode.tidy_bytes(@wrapped_string, force))
end
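The non-forced behaviour can be approximated on current Ruby versions with String#scrub, re-reading only the invalid bytes as CP1252. This is a rough sketch under that assumption; the helper name is hypothetical and the library's Unicode.tidy_bytes may treat edge cases differently.

```ruby
# Hypothetical helper: keep valid UTF-8 as-is, reinterpret stray bytes as CP1252.
def tidy_bytes_sketch(string)
  string.dup.force_encoding(Encoding::UTF_8).scrub do |bad_bytes|
    # Bytes that CP1252 leaves undefined are replaced, not raised on.
    bad_bytes.encode(Encoding::UTF_8, Encoding::Windows_1252,
                     invalid: :replace, undef: :replace)
  end
end

tidied = tidy_bytes_sketch("Caf\xE9") # lone ISO-8859-1 byte for "é"
puts tidied, tidied.valid_encoding?
```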
#titleize ⇒ Object Also known as: titlecase
Capitalizes the first letter of every word, when possible.
Example:
"ÉL QUE SE ENTERÓ".mb_chars.titleize # => "Él Que Se Enteró"
"日本語".mb_chars.titleize # => "日本語"
# File 'lib/mail/multibyte/chars.rb', line 366
def titleize
  chars(downcase.to_s.gsub(/\b('?[\S])/u) { Unicode.apply_mapping $1, :uppercase_mapping })
end
#upcase ⇒ Object
Convert characters in the string to uppercase.
Example:
'Laurent, où sont les tests ?'.mb_chars.upcase.to_s # => "LAURENT, OÙ SONT LES TESTS ?"
# File 'lib/mail/multibyte/chars.rb', line 341
def upcase
  chars(Unicode.apply_mapping(@wrapped_string), :uppercase_mapping)
end