Module: Punycode
- Includes: Status
- Defined in: lib/punycode.rb
Overview
punycode4r
usage
simple usage
require 'punycode'
utf8_string = "\346\226\207\345\255\227\345\210\227"
punycode_string = Punycode.encode(utf8_string)
p punycode_string #=> "1br58tspi"
p(Punycode.decode(punycode_string) == utf8_string) #=> true
IDN (Internationalized Domain Name)
When you use Punycode for IDN, you must apply NAMEPREP (RFC 3491) before calling Punycode.encode, and add the ACE prefix (defined in RFC 3490) to the result of Punycode.encode.
This library supports punycode only. NAMEPREP requires other libraries.
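As a rough illustration of the ordering described above, the following sketch adds the ACE prefix "xn--" to an encoded label. The helper name to_ace_label and its encoder parameter are hypothetical and not part of this library; the encoder is passed in as a callable so the snippet runs without the gem, and NAMEPREP is assumed to have been applied by the caller.

```ruby
ACE_PREFIX = "xn--" # ACE prefix defined in RFC 3490

# Hypothetical helper (not part of punycode4r).
# label   : one domain label, assumed to be NAMEPREP-processed already
# encoder : any callable mapping a UTF-8 string to its Punycode form,
#           e.g. Punycode.method(:encode) when the library is loaded
def to_ace_label(label, encoder)
  # Pure-ASCII labels need no Punycode and are returned unchanged.
  return label if label.ascii_only?
  ACE_PREFIX + encoder.call(label)
end

# Stand-in encoder so the example runs without the library. It only knows
# the single string from the "simple usage" example.
stub_encoder = ->(s) { { "\u6587\u5B57\u5217" => "1br58tspi" }.fetch(s) }

puts to_ace_label("example", stub_encoder)
puts to_ace_label("\u6587\u5B57\u5217", stub_encoder)
```

With the library loaded, Punycode.method(:encode) could presumably be passed in place of the stub.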
Defined Under Namespace
Modules: Status
Constant Summary
*** Bootstring parameters for Punycode ***
- BASE = 36
- TMIN = 1
- TMAX = 26
- SKEW = 38
- DAMP = 700
- INITIAL_BIAS = 72
- INITIAL_N = 0x80
- DELIMITER = 0x2D
- MAXINT = 1 << 64
  maxint is the maximum value of a punycode_uint variable.
- UNICODE_MAX_LENGTH = 256
- ACE_MAX_LENGTH = 256
- PRINT_ASCII =
  The following string is used to convert printable characters between ASCII and the native charset:
  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n" \
  "\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n" \
  " !\"\#$%&'()*+,-./" \
  "0123456789:;<=>?" \
  "@ABCDEFGHIJKLMNO" \
  "PQRSTUVWXYZ[\\]^_" \
  "`abcdefghijklmno" \
  "pqrstuvwxyz{|}~\n"
Class Method Summary
- .adapt(delta, numpoints, firsttime) ⇒ Object
  Bias adaptation function.
- .basic(cp) ⇒ Object
  basic(cp) tests whether cp is a basic code point.
- .decode(punycode, case_flags = []) ⇒ Object
- .decode_digit(cp) ⇒ Object
  decode_digit(cp) returns the numeric value of a basic code point (for use in representing integers) in the range 0 to base-1, or base if cp does not represent a value.
- .delim(cp) ⇒ Object
  delim(cp) tests whether cp is a delimiter.
- .encode(unicode_string, case_flags = nil, print_ascii_only = false) ⇒ Object
- .encode_basic(bcp, flag) ⇒ Object
  encode_basic(bcp,flag) forces a basic code point to lowercase if flag is false or nil, uppercase otherwise, and returns the resulting code point.
- .encode_digit(d, flag) ⇒ Object
  encode_digit(d,flag) returns the basic code point whose value (when used for representing integers) is d, which needs to be in the range 0 to base-1.
- .flagged(bcp) ⇒ Object
  flagged(bcp) tests whether a basic code point is flagged (uppercase).
- .punycode_decode(input_length, input, output_length, output, case_flags) ⇒ Object
  punycode_decode() converts Punycode to Unicode.
- .punycode_encode(input_length, input, case_flags, output_length, output) ⇒ Object
  punycode_encode() converts Unicode to Punycode.
Class Method Details
.adapt(delta, numpoints, firsttime) ⇒ Object
*** Bias adaptation function ***
# File 'lib/punycode.rb', line 109
def adapt(delta, numpoints, firsttime)
  delta = firsttime ? delta / DAMP : delta >> 1
  # delta >> 1 is a faster way of doing delta / 2
  delta += delta / numpoints

  k = 0
  while delta > ((BASE - TMIN) * TMAX) / 2
    delta /= BASE - TMIN
    k += BASE
  end

  k + (BASE - TMIN + 1) * delta / (delta + SKEW)
end
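To make the arithmetic concrete, the sketch below restates adapt with the Bootstring parameters from the Constant Summary so that it runs on its own, then evaluates it for a few inputs. The inputs are illustrative: 745 with 6 code points is the delta that arises when encoding "bücher", and the next two arise when encoding the string from the "simple usage" example.

```ruby
# Bootstring parameters, copied from the Constant Summary.
BASE = 36
TMIN = 1
TMAX = 26
SKEW = 38
DAMP = 700

# Standalone restatement of the bias adaptation function.
def adapt(delta, numpoints, firsttime)
  # The first delta is damped heavily; later deltas are halved.
  delta = firsttime ? delta / DAMP : delta / 2
  delta += delta / numpoints

  # Scale delta down until it fits the range covered by one digit position.
  k = 0
  while delta > ((BASE - TMIN) * TMAX) / 2
    delta /= BASE - TMIN
    k += BASE
  end

  k + (BASE - TMIN + 1) * delta / (delta + SKEW)
end

p adapt(745, 6, true)     # => 0
p adapt(20887, 1, true)   # => 21
p adapt(4735, 2, false)   # => 62
```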
.basic(cp) ⇒ Object
basic(cp) tests whether cp is a basic code point:
# File 'lib/punycode.rb', line 50
def basic(cp)
  cp < 0x80
end
.decode(punycode, case_flags = []) ⇒ Object
# File 'lib/punycode.rb', line 389
def decode(punycode, case_flags=[])
  input = []
  output = []

  if ACE_MAX_LENGTH*2 < punycode.size
    raise PunycodeBigOutput
  end
  punycode.each_byte do |c|
    unless c >= 0 && c <= 127
      raise PunycodeBadInput
    end
    input.push(c)
  end

  output_length = [UNICODE_MAX_LENGTH]
  Punycode.punycode_decode(input.length, input, output_length, output, case_flags)
  output.pack('U*')
end
.decode_digit(cp) ⇒ Object
decode_digit(cp) returns the numeric value of a basic code point (for use in representing integers) in the range 0 to base-1, or base if cp does not represent a value.
# File 'lib/punycode.rb', line 62
def decode_digit(cp)
  cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 :
    cp - 97 < 26 ? cp - 97 : BASE
end
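The chained comparisons in this listing follow the C sample code in RFC 3492, where cp is unsigned and a subtraction such as cp - 48 wraps around for small code points. Ruby integers do not wrap, so in Ruby cp - 48 < 10 also holds for every code point below 48 (for example "-"). The sketch below is an alternative with explicit ranges that returns BASE for every non-digit; the name decode_digit_checked is hypothetical and not part of the library.

```ruby
BASE = 36

# Hypothetical range-checked variant of decode_digit (not part of the library).
def decode_digit_checked(cp)
  case cp
  when 48..57  then cp - 22  # "0".."9" map to 26..35
  when 65..90  then cp - 65  # "A".."Z" map to 0..25
  when 97..122 then cp - 97  # "a".."z" map to 0..25
  else BASE                  # anything else is not a digit
  end
end

p decode_digit_checked("a".ord)  # => 0
p decode_digit_checked("9".ord)  # => 35
p decode_digit_checked("-".ord)  # => 36
```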
.delim(cp) ⇒ Object
delim(cp) tests whether cp is a delimiter:
# File 'lib/punycode.rb', line 55
def delim(cp)
  cp == DELIMITER
end
.encode(unicode_string, case_flags = nil, print_ascii_only = false) ⇒ Object
# File 'lib/punycode.rb', line 367
def encode(unicode_string, case_flags=nil, print_ascii_only=false)
  input = unicode_string.unpack('U*')
  output = [0] * (ACE_MAX_LENGTH+1)
  output_length = [ACE_MAX_LENGTH]

  punycode_encode(input.size, input, case_flags, output_length, output)

  outlen = output_length[0]
  outlen.times do |j|
    c = output[j]
    unless c >= 0 && c <= 127
      raise Error, "assertion error: invalid output char"
    end
    unless PRINT_ASCII[c]
      raise PunycodeBadInput
    end
    output[j] = PRINT_ASCII[c] if print_ascii_only
  end

  output[0..outlen].map{|x|x.chr}.join('').sub(/\0+\z/, '')
end
.encode_basic(bcp, flag) ⇒ Object
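encode_basic(bcp,flag) forces a basic code point to lowercase if flag is false or nil, uppercase otherwise, and returns the resulting code point. The code point is unchanged if it is caseless. The behavior is undefined if bcp is not a basic code point. (The zero/nonzero wording of the original C comments does not carry over: this Ruby code tests flag for truthiness, so the integer 0 counts as set.)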
# File 'lib/punycode.rb', line 90
def encode_basic(bcp, flag)
  # bcp -= (bcp - 97 < 26) << 5;
  if (0...26) === (bcp - 97)
    bcp -= 1 << 5
  end
  # return bcp + ((!flag && (bcp - 65 < 26)) << 5);
  if !flag and (0...26) === (bcp - 65)
    bcp += 1 << 5
  end
  bcp
end
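A short standalone restatement may help show how the flag behaves. It condenses the two steps of the listing (force to uppercase, then back to lowercase unless flagged) and is meant as an illustration rather than a replacement; note in particular that the integer 0 is truthy in Ruby.

```ruby
# Condensed restatement of encode_basic for experimentation.
def encode_basic(bcp, flag)
  bcp -= 32 if (0...26) === (bcp - 97)           # "a".."z" become "A".."Z"
  bcp += 32 if !flag && (0...26) === (bcp - 65)  # back to lowercase unless flagged
  bcp
end

p encode_basic("a".ord, true).chr   # => "A"
p encode_basic("A".ord, false).chr  # => "a"
p encode_basic("7".ord, true).chr   # => "7"
p encode_basic("a".ord, 0).chr      # => "A" (0 is truthy in Ruby)
```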
.encode_digit(d, flag) ⇒ Object
encode_digit(d,flag) returns the basic code point whose value (when used for representing integers) is d, which needs to be in the range 0 to base-1. The lowercase form is used unless flag is set (anything other than false or nil), in which case the uppercase form is used. The behavior is undefined if flag is set and digit d has no uppercase form.
# File 'lib/punycode.rb', line 72
def encode_digit(d, flag)
  return d + 22 + 75 * ((d < 26) ? 1 : 0) - ((flag ? 1 : 0) << 5)
  # 0..25 map to ASCII a..z or A..Z
  # 26..35 map to ASCII 0..9
end
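The mapping is easier to see when the whole digit alphabet is printed. The following standalone copy of the expression lists all 36 digits in lowercase and the 26 letter digits in uppercase.

```ruby
# Standalone copy of the encode_digit expression.
def encode_digit(d, flag)
  d + 22 + 75 * (d < 26 ? 1 : 0) - ((flag ? 1 : 0) << 5)
end

# Digits 0..25 are letters, digits 26..35 are the ASCII digits 0..9.
puts (0..35).map { |d| encode_digit(d, false).chr }.join
# prints "abcdefghijklmnopqrstuvwxyz0123456789"

# Only the letter digits have an uppercase form.
puts (0..25).map { |d| encode_digit(d, true).chr }.join
# prints "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
```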
.flagged(bcp) ⇒ Object
flagged(bcp) tests whether a basic code point is flagged (uppercase). The behavior is undefined if bcp is not a basic code point.
# File 'lib/punycode.rb', line 81
def flagged(bcp)
  (0...26) === (bcp - 65)
end
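As a small illustration, the same test can be mapped over the bytes of an ASCII string to obtain one case flag per character. The sample string is arbitrary.

```ruby
# Standalone copy of flagged: true only for "A".."Z" (code points 65..90).
def flagged(bcp)
  (0...26) === (bcp - 65)
end

flags = "Ab-C".bytes.map { |b| flagged(b) }
p flags  # => [true, false, false, true]
```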
.punycode_decode(input_length, input, output_length, output, case_flags) ⇒ Object
punycode_decode() converts Punycode to Unicode. The input is an array of ASCII code points, and the output is an array of Unicode code points. input_length is the number of code points in the input. output_length is an in/out argument, passed as a one-element array: the caller passes in the maximum number of code points that it can receive, and on successful return it contains the actual number of code points output.
The case_flags array receives one flag per output code point, or it can be nil if the case information is not needed. A set flag suggests that the corresponding Unicode character be forced to uppercase by the caller (if possible), while a cleared flag suggests that it be forced to lowercase (if possible). ASCII code points are output already in the proper case, but their flags are set appropriately so that applying the flags would be harmless.
On success the method returns PunycodeSuccess. On failure it raises PunycodeBadInput, PunycodeBigOutput, or PunycodeOverflow, and output_length, output, and case_flags might contain garbage. On success, the decoder never needs to write an output_length greater than input_length, because of how the encoding is defined.
# File 'lib/punycode.rb', line 270
def punycode_decode(input_length, input, output_length, output, case_flags)

  # Initialize the state:

  n = INITIAL_N

  out = i = 0
  max_out = output_length[0]
  bias = INITIAL_BIAS

  # Handle the basic code points: Let b be the number of input code
  # points before the last delimiter, or 0 if there is none, then
  # copy the first b code points to the output.

  b = 0
  input_length.times do |j|
    b = j if delim(input[j])
  end
  raise PunycodeBigOutput if b > max_out

  b.times do |j|
    case_flags[out] = flagged(input[j]) if case_flags
    raise PunycodeBadInput unless basic(input[j])
    output[out] = input[j]
    out+=1
  end

  # Main decoding loop: Start just after the last delimiter if any
  # basic code points were copied; start at the beginning otherwise.

  in_ = b > 0 ? b + 1 : 0
  while in_ < input_length

    # in_ is the index of the next character to be consumed, and
    # out is the number of code points in the output array.

    # Decode a generalized variable-length integer into delta,
    # which gets added to i. The overflow checking is easier
    # if we increase i as we go, then subtract off its starting
    # value at the end to obtain delta.

    oldi = i; w = 1; k = BASE
    while true
      raise PunycodeBadInput if in_ >= input_length
      digit = decode_digit(input[in_])
      in_+=1
      raise PunycodeBadInput if digit >= BASE
      raise PunycodeOverflow if digit > (MAXINT - i) / w
      i += digit * w
      t = if k <= bias # + TMIN # +TMIN not needed
            TMIN
          elsif k >= bias + TMAX
            TMAX
          else
            k - bias
          end
      break if digit < t
      raise PunycodeOverflow if w > MAXINT / (BASE - t)
      w *= BASE - t
      k += BASE
    end

    bias = adapt(i - oldi, out + 1, oldi == 0)

    # i was supposed to wrap around from out+1 to 0,
    # incrementing n each time, so we'll fix that now:

    raise PunycodeOverflow if i / (out + 1) > MAXINT - n
    n += i / (out + 1)
    i %= out + 1

    # Insert n at position i of the output:

    # not needed for Punycode:
    # raise PUNYCODE_INVALID_INPUT if decode_digit(n) <= base
    raise PunycodeBigOutput if out >= max_out

    if case_flags
      #memmove(case_flags + i + 1, case_flags + i, out - i)
      case_flags[i + 1, out - i] = case_flags[i, out - i]

      # Case of last character determines uppercase flag:
      case_flags[i] = flagged(input[in_ - 1])
    end

    #memmove(output + i + 1, output + i, (out - i) * sizeof *output)
    output[i + 1, out - i] = output[i, out - i]
    output[i] = n
    i+=1

    out+=1
  end

  output_length[0] = out
  return PunycodeSuccess
end
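The inner loop of the decoder reads one generalized variable-length integer. The sketch below isolates that step under simplifying assumptions: it takes the digits as a string, starts from zero instead of adding to a running i, and omits the overflow checks. The helper names digit_value, threshold, and decode_varint are hypothetical and not part of the library. The sample input "kva" is the digit run of "bcher-kva", the Punycode form of "bücher".

```ruby
BASE = 36
TMIN = 1
TMAX = 26

# Value of one Punycode digit character ("a".."z" => 0..25, "0".."9" => 26..35).
def digit_value(ch)
  code = ch.ord
  return code - 97 if code.between?(97, 122)
  return code - 65 if code.between?(65, 90)
  return code - 22 if code.between?(48, 57)
  raise ArgumentError, "not a Punycode digit: #{ch.inspect}"
end

# Threshold for digit position k, clamped to TMIN..TMAX as in the listing.
def threshold(k, bias)
  (k - bias).clamp(TMIN, TMAX)
end

# Decodes one variable-length integer from the start of digits.
def decode_varint(digits, bias)
  value = 0
  w = 1
  k = BASE
  digits.each_char do |ch|
    d = digit_value(ch)
    value += d * w
    t = threshold(k, bias)
    return value if d < t  # a digit below the threshold ends the integer
    w *= BASE - t
    k += BASE
  end
  raise ArgumentError, "input ended in the middle of an integer"
end

delta = decode_varint("kva", 72)
p delta  # => 745

# With 5 basic code points ("bcher") already output, the decoder turns the
# delta into a code point increment and an insertion position:
n_step, pos = delta.divmod(5 + 1)
p [0x80 + n_step, pos]  # => [252, 1], i.e. U+00FC inserted at index 1
```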
.punycode_encode(input_length, input, case_flags, output_length, output) ⇒ Object
punycode_encode() converts Unicode to Punycode. The input is an array of Unicode code points (not code units; surrogate pairs are not allowed), and the output is an array of ASCII code points. The output is not null-terminated; it contains zeros if and only if the input contains zeros. input_length is the number of code points in the input. output_length is an in/out argument, passed as a one-element array: the caller passes in the maximum number of code points that it can receive, and on successful return it contains the number of code points actually output.
The case_flags array holds input_length boolean values, where a set flag suggests that the corresponding Unicode character be forced to uppercase after being decoded (if possible), and a cleared flag suggests that it be forced to lowercase (if possible). ASCII code points are encoded literally, except that ASCII letters are forced to uppercase or lowercase according to the corresponding uppercase flags. If case_flags is nil then ASCII letters are left as they are, and other code points are treated as if their uppercase flags were cleared.
On success the method returns PunycodeSuccess. On failure it raises PunycodeBigOutput or PunycodeOverflow, and output_length and output might contain garbage.
# File 'lib/punycode.rb', line 149
def punycode_encode(input_length, input, case_flags, output_length, output)

  # Initialize the state:

  n = INITIAL_N
  delta = out = 0
  max_out = output_length[0]
  bias = INITIAL_BIAS

  # Handle the basic code points:
  input_length.times do |j|
    if basic(input[j])
      raise PunycodeBigOutput if max_out - out < 2
      output[out] = if case_flags
                      encode_basic(input[j], case_flags[j])
                    else
                      input[j]
                    end
      out+=1
    # elsif (input[j] < n)
    #   raise PunycodeBadInput
    # (not needed for Punycode with unsigned code points)
    end
  end

  h = b = out

  # h is the number of code points that have been handled, b is the
  # number of basic code points, and out is the number of characters
  # that have been output.

  if b > 0
    output[out] = DELIMITER
    out+=1
  end

  # Main encoding loop:

  while h < input_length

    # All non-basic code points < n have been
    # handled already. Find the next larger one:

    m = MAXINT
    input_length.times do |j|
      # next if basic(input[j])
      # (not needed for Punycode)
      m = input[j] if (n...m) === input[j]
    end

    # Increase delta enough to advance the decoder's
    # <n,i> state to <m,0>, but guard against overflow:

    raise PunycodeOverflow if m - n > (MAXINT - delta) / (h + 1)
    delta += (m - n) * (h + 1)
    n = m

    input_length.times do |j|
      # Punycode does not need to check whether input[j] is basic:
      if input[j] < n # || basic(input[j])
        delta+=1
        raise PunycodeOverflow if delta == 0
      end

      if input[j] == n
        # Represent delta as a generalized variable-length integer:

        q = delta; k = BASE
        while true
          raise PunycodeBigOutput if out >= max_out
          t = if k <= bias # + TMIN # +TMIN not needed
                TMIN
              elsif k >= bias + TMAX
                TMAX
              else
                k - bias
              end
          break if q < t
          output[out] = encode_digit(t + (q - t) % (BASE - t), false)
          out+=1
          q = (q - t) / (BASE - t)
          k += BASE
        end

        output[out] = encode_digit(q, case_flags && case_flags[j])
        out+=1
        bias = adapt(delta, h + 1, h == b)
        delta = 0
        h+=1
      end
    end

    delta+=1; n+=1
  end

  output_length[0] = out
  return PunycodeSuccess
end
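The step that represents delta as a generalized variable-length integer can be tried in isolation. The sketch below assumes lowercase output and no output-length limit; the helper names threshold, digit_char, and encode_varint are hypothetical and not part of the library. The sample values are the deltas and biases that occur when encoding "bücher" and the string from the "simple usage" example.

```ruby
BASE = 36
TMIN = 1
TMAX = 26

# Threshold for digit position k, clamped to TMIN..TMAX as in the listing.
def threshold(k, bias)
  (k - bias).clamp(TMIN, TMAX)
end

# Lowercase digit character for a value in 0..35.
def digit_char(d)
  (d < 26 ? d + 97 : d + 22).chr
end

# Encodes one delta as a run of Punycode digits for the given bias.
def encode_varint(delta, bias)
  out = String.new
  q = delta
  k = BASE
  loop do
    t = threshold(k, bias)
    break if q < t  # the remaining value fits in the final digit
    out << digit_char(t + (q - t) % (BASE - t))
    q = (q - t) / (BASE - t)
    k += BASE
  end
  out << digit_char(q)
end

puts encode_varint(745, 72)    # prints "kva", as in "bcher-kva"
puts encode_varint(20887, 72)  # prints "1br"
puts encode_varint(4735, 21)   # prints "58t"
puts encode_varint(7823, 62)   # prints "spi"
```

Concatenating the last three results gives "1br58tspi", the value shown in the "simple usage" example; the biases 21 and 62 are what adapt returns after the first and second delta.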