Module: Babosa::UTF8::UTF8Proxy
- Included in: ActiveSupportProxy, DumbProxy, JavaProxy, UnicodeProxy
- Defined in: lib/babosa/utf8/proxy.rb
Overview
A UTF-8 proxy for Babosa can be any object which responds to the methods in this module. The following proxies are provided by Babosa: ActiveSupportProxy, DumbProxy, JavaProxy, and UnicodeProxy.
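To illustrate what "responds to the methods in this module" means, here is a minimal sketch of a custom proxy that relies only on Ruby's core String methods (Ruby 2.4 or later). The module name PlainRubyProxy is hypothetical and not part of Babosa, and its tidy_bytes is a deliberate simplification that substitutes "?" for invalid bytes instead of guessing their encoding.

```ruby
# Hypothetical proxy for illustration only; not shipped with Babosa.
module PlainRubyProxy
  extend self

  # String#downcase performs Unicode-aware case mapping since Ruby 2.4.
  def downcase(string)
    string.downcase
  end

  # String#upcase performs Unicode-aware case mapping since Ruby 2.4.
  def upcase(string)
    string.upcase
  end

  # String#unicode_normalize(:nfc) returns the NFC form of the string.
  def normalize_utf8(string)
    string.unicode_normalize(:nfc)
  end

  # Simplification (assumption): replace each invalid byte sequence with "?"
  # rather than reinterpreting it as CP-1252 or ISO-8859-1.
  def tidy_bytes(string)
    string.scrub("?")
  end
end
```

Because the object only needs to respond to these four methods, a plain module with `extend self` is enough for the sketch; a class instance would work equally well.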
Constant Summary
- CP1252 =
    {
      128 => [226, 130, 172], 129 => nil, 130 => [226, 128, 154], 131 => [198, 146],
      132 => [226, 128, 158], 133 => [226, 128, 166], 134 => [226, 128, 160], 135 => [226, 128, 161],
      136 => [203, 134], 137 => [226, 128, 176], 138 => [197, 160], 139 => [226, 128, 185],
      140 => [197, 146], 141 => nil, 142 => [197, 189], 143 => nil,
      144 => nil, 145 => [226, 128, 152], 146 => [226, 128, 153], 147 => [226, 128, 156],
      148 => [226, 128, 157], 149 => [226, 128, 162], 150 => [226, 128, 147], 151 => [226, 128, 148],
      152 => [203, 156], 153 => [226, 132, 162], 154 => [197, 161], 155 => [226, 128, 186],
      156 => [197, 147], 157 => nil, 158 => [197, 190], 159 => [197, 184]
    }
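The keys of this table are byte values in the range 128..159, and each value is the UTF-8 byte sequence of the character that Windows-1252 assigns to that byte; nil marks bytes that Windows-1252 leaves unassigned. The sketch below copies three entries to show the mapping. The names CP1252_EXCERPT and cp1252_entry_to_s are hypothetical and exist only for this example.

```ruby
# Three entries copied from the table above so the sketch runs on its own.
CP1252_EXCERPT = {
  128 => [226, 130, 172], # byte 0x80
  129 => nil,             # byte 0x81 is unassigned in Windows-1252
  153 => [226, 132, 162]  # byte 0x99
}

# Hypothetical helper: turn a table entry into a UTF-8 string, or nil.
def cp1252_entry_to_s(entry)
  entry && entry.pack("C*").force_encoding("UTF-8")
end

euro      = cp1252_entry_to_s(CP1252_EXCERPT[128]) # EURO SIGN, U+20AC
trademark = cp1252_entry_to_s(CP1252_EXCERPT[153]) # TRADE MARK SIGN, U+2122
undefined = cp1252_entry_to_s(CP1252_EXCERPT[129]) # nil

# Cross-check against Ruby's own Windows-1252 transcoder.
via_ruby = [128].pack("C").force_encoding("Windows-1252").encode("UTF-8")
```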
Instance Method Summary
- #downcase(string) ⇒ Object
  This is a stub for a method that should return a Unicode-aware downcased version of the given string.
- #normalize_utf8(string) ⇒ Object
  This is a stub for a method that should return the Unicode NFC normalization of the given string.
- #tidy_bytes(string) ⇒ Object
  Attempt to replace invalid UTF-8 bytes with valid ones.
- #upcase(string) ⇒ Object
  This is a stub for a method that should return a Unicode-aware upcased version of the given string.
Instance Method Details
#downcase(string) ⇒ Object
This is a stub for a method that should return a Unicode-aware downcased version of the given string.
    # File 'lib/babosa/utf8/proxy.rb', line 49
    def downcase(string)
      raise NotImplementedError
    end
#normalize_utf8(string) ⇒ Object
This is a stub for a method that should return the Unicode NFC normalization of the given string.
    # File 'lib/babosa/utf8/proxy.rb', line 61
    def normalize_utf8(string)
      raise NotImplementedError
    end
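As a small illustration of what an implementation of normalize_utf8 is expected to produce, the sketch below uses Ruby's built-in String#unicode_normalize to compose a base letter and a combining accent into a single code point. Using this core method is an assumption made for the example; the proxies listed above may obtain NFC by other means.

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points.
decomposed = "e\u0301"

# NFC composes the pair into U+00E9 LATIN SMALL LETTER E WITH ACUTE.
composed = decomposed.unicode_normalize(:nfc)

decomposed_length = decomposed.length # two code points before normalization
composed_length   = composed.length   # one code point after normalization
```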
#tidy_bytes(string) ⇒ Object
Attempt to replace invalid UTF-8 bytes with valid ones. This method naively assumes that any invalid UTF-8 bytes are either Windows CP-1252 or ISO-8859-1. In practice this is a reasonable assumption, but it may not always work.
    # File 'lib/babosa/utf8/proxy.rb', line 69
    def tidy_bytes(string)
      bytes = string.unpack("C*")
      conts_expected = 0
      last_lead = 0

      bytes.each_index do |i|
        byte = bytes[i]
        is_ascii = byte < 128
        is_cont = byte > 127 && byte < 192
        is_lead = byte > 191 && byte < 245
        is_unused = byte > 240
        is_restricted = byte > 244

        # Impossible or highly unlikely byte? Clean it.
        if is_unused || is_restricted
          bytes[i] = tidy_byte(byte)
        elsif is_cont
          # Not expecting continuation byte? Clean up. Otherwise, now expect one less.
          conts_expected == 0 ? bytes[i] = tidy_byte(byte) : conts_expected -= 1
        else
          if conts_expected > 0
            # Expected continuation, but got ASCII or leading? Clean backwards up to
            # the leading byte.
            (1..(i - last_lead)).each {|j| bytes[i - j] = tidy_byte(bytes[i - j])}
            conts_expected = 0
          end
          if is_lead
            # Final byte is leading? Clean it.
            if i == bytes.length - 1
              bytes[i] = tidy_byte(bytes.last)
            else
              # Valid leading byte? Expect continuations determined by position of
              # first zero bit, with max of 3.
              conts_expected = byte < 224 ? 1 : byte < 240 ? 2 : 3
              last_lead = i
            end
          end
        end
      end
      bytes.empty? ? "" : bytes.flatten.compact.pack("C*").unpack("U*").pack("U*")
    end
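The listing calls a tidy_byte helper that this page does not show. The sketch below is an assumption about how such a helper could work, based on the stated CP-1252 / ISO-8859-1 fallback: bytes below 160 are looked up in a CP-1252 table, and the remaining bytes are re-encoded as the two-byte UTF-8 form of the matching ISO-8859-1 character. All names ending in _sketch, and CP1252_SAMPLE, are hypothetical. The second method is a deliberate simplification that treats every non-ASCII byte as invalid, so it only suits input that is purely CP-1252 or ISO-8859-1; it does not reproduce the lead/continuation tracking shown above.

```ruby
# Two entries copied from the CP1252 constant so the sketch is self-contained.
CP1252_SAMPLE = { 128 => [226, 130, 172], 129 => nil }

# Assumed behaviour of a tidy_byte-style helper (not taken from this page).
def tidy_byte_sketch(byte, table = CP1252_SAMPLE)
  return byte if byte < 128 # ASCII passes through unchanged
  if byte < 160
    table[byte]             # CP-1252 range; nil for unassigned bytes
  elsif byte < 192
    [194, byte]             # U+00A0..U+00BF encode as 0xC2, byte
  else
    [195, byte - 64]        # U+00C0..U+00FF encode as 0xC3, byte - 0x40
  end
end

# Simplified pipeline: every non-ASCII byte is treated as a stray
# CP-1252 / ISO-8859-1 byte (assumption), then the same flatten, compact,
# pack and unpack round trip as the final line of tidy_bytes is applied.
def tidy_lone_bytes_sketch(string)
  bytes = string.unpack("C*").map { |b| tidy_byte_sketch(b) }
  bytes.flatten.compact.pack("C*").unpack("U*").pack("U*")
end

cafe    = tidy_lone_bytes_sketch("caf\xE9") # ISO-8859-1 byte 0xE9
euro    = tidy_lone_bytes_sketch("\x80")    # CP-1252 byte 0x80
dropped = tidy_lone_bytes_sketch("a\x81b")  # 0x81 is unassigned and removed
```

The `pack("C*").unpack("U*").pack("U*")` round trip first rebuilds a raw byte string, then decodes it as UTF-8 code points, and finally re-encodes those code points so the result is tagged as a UTF-8 string.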
#upcase(string) ⇒ Object
This is a stub for a method that should return a Unicode-aware upcased version of the given string.
    # File 'lib/babosa/utf8/proxy.rb', line 55
    def upcase(string)
      raise NotImplementedError
    end