Module: Babosa::UTF8::UTF8Proxy

Included in:
ActiveSupportProxy, DumbProxy, JavaProxy, UnicodeProxy
Defined in:
lib/babosa/utf8/proxy.rb

Overview

A UTF-8 proxy for Babosa can be any object which responds to the methods in this module. The following proxies are provided by Babosa: ActiveSupportProxy, DumbProxy, JavaProxy, and UnicodeProxy.
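As an illustration of the duck-typed contract described above, here is a minimal sketch of a custom proxy built only on Ruby's standard String methods. The name StdlibProxySketch is hypothetical and not part of Babosa. The sketch assumes Ruby 2.4 or later, where String#downcase and String#upcase apply Unicode case mapping. It omits tidy_bytes, which a real proxy would normally get by including this module.

```ruby
# Hypothetical proxy object; the module name is an assumption for illustration.
# It responds to the three stub methods documented on this page.
module StdlibProxySketch
  extend self

  # Unicode-aware downcasing (Ruby 2.4+ handles non-ASCII letters).
  def downcase(string)
    string.downcase
  end

  # Unicode-aware upcasing (Ruby 2.4+ handles non-ASCII letters).
  def upcase(string)
    string.upcase
  end

  # NFC normalization via the standard library (Ruby 2.2+).
  def normalize_utf8(string)
    string.unicode_normalize(:nfc)
  end
end
```

Because the contract is just "responds to these methods", a plain module with singleton methods is enough for the sketch; no inheritance is required.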

Constant Summary

CP1252 =
{
  128 => [226, 130, 172],
  129 => nil,
  130 => [226, 128, 154],
  131 => [198, 146],
  132 => [226, 128, 158],
  133 => [226, 128, 166],
  134 => [226, 128, 160],
  135 => [226, 128, 161],
  136 => [203, 134],
  137 => [226, 128, 176],
  138 => [197, 160],
  139 => [226, 128, 185],
  140 => [197, 146],
  141 => nil,
  142 => [197, 189],
  143 => nil,
  144 => nil,
  145 => [226, 128, 152],
  146 => [226, 128, 153],
  147 => [226, 128, 156],
  148 => [226, 128, 157],
  149 => [226, 128, 162],
  150 => [226, 128, 147],
  151 => [226, 128, 148],
  152 => [203, 156],
  153 => [226, 132, 162],
  154 => [197, 161],
  155 => [226, 128, 186],
  156 => [197, 147],
  157 => nil,
  158 => [197, 190],
  159 => [197, 184]
}
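Each key in this table is a byte in the range 128..159, and each value is the UTF-8 byte sequence for the character that Windows CP-1252 assigns to that byte (nil where CP-1252 leaves the byte undefined). The tidy_bytes method below calls a tidy_byte helper that is not shown on this page. The following is a sketch of how such a helper might use the table, under the assumption that bytes from 160 upward are read as ISO-8859-1. The names CP1252_EXCERPT, tidy_byte_sketch, and bytes_to_utf8 are hypothetical.

```ruby
# Hypothetical three-entry excerpt of the CP1252 table above.
CP1252_EXCERPT = {
  128 => [226, 130, 172], # U+20AC euro sign
  129 => nil,             # undefined in CP-1252
  147 => [226, 128, 156]  # U+201C left double quotation mark
}.freeze

# Assumed behavior, not confirmed by this page: map one suspicious byte
# to the UTF-8 bytes of the character it most likely stood for.
def tidy_byte_sketch(byte)
  if byte < 128
    [byte]                # ASCII is already valid UTF-8
  elsif byte < 160
    CP1252_EXCERPT[byte]  # CP-1252 range: table lookup, may be nil
  elsif byte < 192
    [194, byte]           # U+00A0..U+00BF encode as 0xC2 followed by the byte
  else
    [195, byte - 64]      # U+00C0..U+00FF encode as 0xC3 followed by byte - 64
  end
end

# Helper for inspecting the result as a UTF-8 string.
def bytes_to_utf8(bytes)
  bytes.pack("C*").force_encoding(Encoding::UTF_8)
end
```

For example, bytes_to_utf8(tidy_byte_sketch(233)) yields "é", because the ISO-8859-1 byte 233 is U+00E9. The nil entries explain why tidy_bytes calls compact before repacking: undefined bytes are simply dropped.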

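Instance Method Summary

  • #downcase(string) ⇒ Object: Stub for a Unicode-aware downcase.
  • #normalize_utf8(string) ⇒ Object: Stub for Unicode NFC normalization.
  • #tidy_bytes(string) ⇒ Object: Attempt to replace invalid UTF-8 bytes with valid ones.
  • #upcase(string) ⇒ Object: Stub for a Unicode-aware upcase.
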
Instance Method Details

#downcase(string) ⇒ Object

This is a stub for a method that should return a Unicode-aware downcased version of the given string.

Raises:

  • (NotImplementedError)


# File 'lib/babosa/utf8/proxy.rb', line 49

def downcase(string)
  raise NotImplementedError
end

#normalize_utf8(string) ⇒ Object

This is a stub for a method that should return the Unicode NFC normalization of the given string.

Raises:

  • (NotImplementedError)


# File 'lib/babosa/utf8/proxy.rb', line 61

def normalize_utf8(string)
  raise NotImplementedError
end
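To make the NFC requirement concrete, the sketch below uses Ruby's built-in String#unicode_normalize to compose a base letter and a combining accent into a single code point. The helper name nfc is hypothetical. A concrete proxy may normalize by other means.

```ruby
# Hypothetical helper: NFC-normalize a string with the standard library.
def nfc(string)
  string.unicode_normalize(:nfc)
end

# "e" followed by U+0301 (combining acute accent) is two code points.
decomposed = "e\u0301"

# After NFC normalization it becomes the single code point U+00E9.
composed = nfc(decomposed)

puts decomposed.length
puts composed.length
```

Normalizing before comparing or slugging strings matters because the composed and decomposed forms render identically but are not equal byte for byte.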

#tidy_bytes(string) ⇒ Object

Attempt to replace invalid UTF-8 bytes with valid ones. This method naively assumes that any invalid UTF-8 bytes are either Windows CP-1252 or ISO-8859-1. In practice this is a reasonable assumption, but it will not always produce the intended text.



# File 'lib/babosa/utf8/proxy.rb', line 69

def tidy_bytes(string)
  bytes = string.unpack("C*")
  conts_expected = 0
  last_lead = 0

  bytes.each_index do |i|
    byte          = bytes[i]
    is_ascii      = byte < 128
    is_cont       = byte > 127 && byte < 192
    is_lead       = byte > 191 && byte < 245
    is_unused     = byte > 240
    is_restricted = byte > 244

    # Impossible or highly unlikely byte? Clean it.
    if is_unused || is_restricted
      bytes[i] = tidy_byte(byte)
    elsif is_cont
      # Not expecting continuation byte? Clean up. Otherwise, now expect one less.
      conts_expected == 0 ? bytes[i] = tidy_byte(byte) : conts_expected -= 1
    else
      if conts_expected > 0
        # Expected continuation, but got ASCII or leading? Clean backwards up to
        # the leading byte.
        (1..(i - last_lead)).each {|j| bytes[i - j] = tidy_byte(bytes[i - j])}
        conts_expected = 0
      end
      if is_lead
        # Final byte is leading? Clean it.
        if i == bytes.length - 1
          bytes[i] = tidy_byte(bytes.last)
        else
          # Valid leading byte? Expect continuations determined by position of
          # first zero bit, with max of 3.
          conts_expected = byte < 224 ? 1 : byte < 240 ? 2 : 3
          last_lead = i
        end
      end
    end
  end
  bytes.empty? ? "" : bytes.flatten.compact.pack("C*").unpack("U*").pack("U*")
end
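For comparison, the same underlying assumption (invalid bytes are really CP-1252) can be expressed much more coarsely with Ruby's built-in transcoding. This is a different technique from the byte-by-byte scan above, not a reimplementation of it: it reinterprets the whole string as Windows-1252 whenever any part is invalid UTF-8, so it will mangle strings that mix valid multibyte UTF-8 with stray legacy bytes. The name tidy_bytes_sketch is hypothetical.

```ruby
# Hypothetical, whole-string alternative to the per-byte scan above.
def tidy_bytes_sketch(string)
  utf8 = string.dup.force_encoding(Encoding::UTF_8)
  # Already valid UTF-8? Return it unchanged.
  return utf8 if utf8.valid_encoding?

  # Otherwise treat every byte as Windows-1252 and transcode to UTF-8,
  # dropping the few bytes that CP-1252 leaves undefined.
  string.dup
        .force_encoding(Encoding::Windows_1252)
        .encode(Encoding::UTF_8, undef: :replace, replace: "")
end
```

The per-byte approach in tidy_bytes exists precisely to handle the mixed case that this shortcut gets wrong, which is why it tracks leading and continuation bytes individually.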

#upcase(string) ⇒ Object

This is a stub for a method that should return a Unicode-aware upcased version of the given string.

Raises:

  • (NotImplementedError)


# File 'lib/babosa/utf8/proxy.rb', line 55

def upcase(string)
  raise NotImplementedError
end