Module: Buildbox::UTF8

Defined in:
lib/buildbox/utf8.rb

Class Method Summary collapse

Class Method Details

.clean(text) ⇒ Object

Replace or delete invalid UTF-8 characters from text, which is assumed to be in UTF-8.

The text is expected to come from external to Integrity sources such as commit messages or build output.

On ruby 1.9, invalid UTF-8 characters are replaced with question marks. On ruby 1.8, if iconv extension is present, invalid UTF-8 characters are removed. On ruby 1.8, if iconv extension is not present, the string is unmodified.



13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
# File 'lib/buildbox/utf8.rb', line 13

def self.clean(text)
  # http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/
  # http://stackoverflow.com/questions/9126782/how-to-change-deprecated-iconv-to-stringencode-for-invalid-utf8-correction
  if text.respond_to?(:encoding)
    # ruby 1.9
    text = text.force_encoding('utf-8').encode(intermediate_encoding, :invalid => :replace, :replace => '?').encode('utf-8')
  else
    # ruby 1.8
    # As no encoding checks are done, any string will be accepted.
    # But delete invalid utf-8 characters anyway for consistency with 1.9.
    iconv, iconv_fallback = clean_utf8_iconv
    if iconv
      begin
        output = iconv.iconv(text)
      rescue Iconv::IllegalSequence
        output = iconv_fallback.iconv(text)
      end
    end
  end
  text
end

.clean_utf8_iconvObject



50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
# File 'lib/buildbox/utf8.rb', line 50

def self.clean_utf8_iconv
  unless @iconv_loaded
    begin
      require 'iconv'
    rescue LoadError
      @iconv = nil
    else
      @iconv = Iconv.new('utf-8//translit//ignore', 'utf-8')
      # On some systems (Linux appears to be vulnerable, FreeBSD not)
      # iconv chokes on invalid utf-8 with //translit//ignore.
      @iconv_fallback = Iconv.new('utf-8//ignore', 'utf-8')
    end
    @iconv_loaded = true
  end
  [@iconv, @iconv_fallback]
end

.intermediate_encodingObject

Apparently utf-16 is not available everywhere, in particular not on travis. Try to find a usable encoding.

Raises:

  • (CannotFindEncoding)


37
38
39
40
41
42
43
44
45
46
47
48
# File 'lib/buildbox/utf8.rb', line 37

def self.intermediate_encoding
  map = {}
  Encoding.list.each do |encoding|
    map[encoding.name.downcase] = true
  end
  %w(utf-16 utf-16be utf-16le utf-7 utf-32 utf-32le utf-32be).each do |candidate|
    if map[candidate]
      return candidate
    end
  end
  raise CannotFindEncoding, 'Cannot find an intermediate encoding for conversion to UTF-8'
end