Class: CMess::GuessEncoding::Automatic

Inherits:
Object
Extended by:
Forwardable
Includes:
Encoding
Defined in:
lib/cmess/guess_encoding/automatic.rb

Overview

Tries to detect the encoding of a given input by applying several heuristics in turn to determine the most likely candidate. If no heuristic matches, falls back to Encoding::UNKNOWN.

If a BOM is found, it determines the encoding directly, unless the caller asks to ignore it (see the ignore_bom argument of #guess).

For supported encodings see EncodingGuessers and BOMGuessers.
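
To illustrate the BOM shortcut mentioned above, here is a minimal stand-alone sketch. It does not use the library's BOMGuessers; the table and the helper name `sniff_bom` are assumptions made for this example, and the set of BOMs the library actually supports may differ.

```ruby
require 'stringio'

# Hypothetical BOM table (standard Unicode signatures). Longer signatures
# come first so that UTF-32LE is not mistaken for UTF-16LE.
BOM_SIGNATURES = {
  'UTF-32BE' => [0x00, 0x00, 0xFE, 0xFF],
  'UTF-32LE' => [0xFF, 0xFE, 0x00, 0x00],
  'UTF-8'    => [0xEF, 0xBB, 0xBF],
  'UTF-16BE' => [0xFE, 0xFF],
  'UTF-16LE' => [0xFF, 0xFE]
}.freeze

# Returns the encoding name implied by a leading BOM, or nil if there is none.
def sniff_bom(io)
  head = io.read(4).to_s.bytes  # read returns nil on empty input
  io.rewind                     # leave the stream where we found it
  BOM_SIGNATURES.each { |name, sig| return name if head.first(sig.size) == sig }
  nil
end

sniff_bom(StringIO.new("\xEF\xBB\xBFhello".b))  # input with a UTF-8 BOM
sniff_bom(StringIO.new('plain ascii'))           # input without a BOM
```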

Defined Under Namespace

Modules: BOMGuessers, EncodingGuessers

Constant Summary

TEST_ENCODINGS =

Single-byte encodings to test statistically using TEST_CHARS.

[
  MACINTOSH,
  ISO_8859_1,
  ISO_8859_2,
  ISO_8859_3,
  ISO_8859_4,
  ISO_8859_5,
  ISO_8859_6,
  ISO_8859_7,
  ISO_8859_8,
  ISO_8859_9,
  ISO_8859_10,
  ISO_8859_11,
  ISO_8859_13,
  ISO_8859_14,
  ISO_8859_15,
  ISO_8859_16,
  CP1252,
  CP850,
  MS_ANSI
]
CHARS_TO_TEST =

Certain (non-ASCII) chars to test for in TEST_ENCODINGS.

(
  '€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂ' <<
  'ÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'
).chars.to_a
TEST_CHARS =

Maps each of the TEST_ENCODINGS to the byte values of CHARS_TO_TEST as encoded in that encoding.

Hash.new { |h, k|
  e, f = self[k], UTF_8
  TEST_ENCODINGS << e unless TEST_ENCODINGS.include?(e)
  h[e] = CHARS_TO_TEST.flat_map { |c| c.encode(e, f).unpack('C') }
}.update(SafeYAML.load_file(File.join(CMess::DATA_DIR, 'test_chars.yaml')))
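
The default block above turns each UTF-8 test character into its byte value in the target encoding. The following sketch shows the same idea in isolation. The sample characters and the helper `bytes_for` are hypothetical, and skipping unrepresentable characters is an assumption of this sketch rather than documented library behaviour.

```ruby
# Hypothetical sample; the library's CHARS_TO_TEST list is much longer.
SAMPLE_CHARS = ["\u00E4", "\u00DF", "\u20AC"].freeze  # ä, ß, €

# Encodes each UTF-8 character into +encoding+ and collects the resulting
# byte values. Characters the target encoding cannot represent are skipped.
def bytes_for(chars, encoding)
  chars.flat_map { |c|
    begin
      c.encode(encoding, 'UTF-8').unpack('C')
    rescue Encoding::UndefinedConversionError
      []
    end
  }
end

bytes_for(SAMPLE_CHARS, 'ISO-8859-1')    # the euro sign does not exist in Latin-1
bytes_for(SAMPLE_CHARS, 'ISO-8859-15')   # the euro sign sits at 0xA4
bytes_for(SAMPLE_CHARS, 'Windows-1252')  # the euro sign sits at 0x80
```

The same character maps to different byte values in different single-byte encodings, which is what makes a statistical comparison possible.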
TEST_THRESHOLD_DIRECT =

Relative count of TEST_CHARS must exceed this threshold to yield a direct match.

0.1
TEST_THRESHOLD_APPROX =

Relative count of TEST_CHARS must exceed this threshold to yield an approximate match.

0.0004
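
As a rough illustration of how two such thresholds could be applied, the sketch below computes the share of test bytes in a chunk of data and classifies it. The helper `match_strength` and the exact definition of the ratio are assumptions; the library's guessers may count differently.

```ruby
TEST_THRESHOLD_DIRECT = 0.1
TEST_THRESHOLD_APPROX = 0.0004

# Classifies how strongly +data+ matches a set of test byte values.
# Returns :direct, :approx, or nil when neither threshold is exceeded.
def match_strength(data, test_bytes)
  total = data.bytesize
  return nil if total.zero?

  hits  = data.each_byte.count { |b| test_bytes.include?(b) }
  ratio = hits.to_f / total  # relative count of test bytes (assumed definition)

  if    ratio > TEST_THRESHOLD_DIRECT then :direct
  elsif ratio > TEST_THRESHOLD_APPROX then :approx
  end
end

match_strength("Gr\xFC\xDFe".b, [0xFC, 0xDF])  # 2 of 5 bytes are test bytes
```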
GUESS_METHOD_RE =

Pattern for method names in EncodingGuessers and BOMGuessers.

%r{\A((?:bom_)?encoding)_\d+_(.+)\z}
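
The pattern captures the guesser kind (`encoding` or `bom_encoding`) and the trailing label, while the numeric part in between is only matched, not captured. The sketch below exercises it; the method names and the helper `parse_guess_method` are hypothetical and chosen only to show what does and does not match.

```ruby
GUESS_METHOD_RE = %r{\A((?:bom_)?encoding)_\d+_(.+)\z}

# Splits a guesser method name into [kind, label], or returns nil
# if the name does not follow the expected scheme.
def parse_guess_method(name)
  m = GUESS_METHOD_RE.match(name.to_s)
  m && [m[1], m[2]]
end

# Hypothetical names, not taken from the library.
%w[encoding_01_ASCII bom_encoding_02_UTF_8 guess].each { |name|
  puts "#{name}: #{parse_guess_method(name).inspect}"
}
```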

Methods included from Encoding

#[], #all_encodings

Constructor Details

#initialize(input, chunk_size = nil) ⇒ Automatic

Returns a new instance of Automatic.



# File 'lib/cmess/guess_encoding/automatic.rb', line 147

def initialize(input, chunk_size = nil)
  @input = case input
    when IO     then input
    when String then StringIO.new(input)
    else raise ArgumentError,
      "don't know how to handle input of type #{input.class}"
  end

  @chunk_size = chunk_size
end
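
One consequence of the case expression above is worth spelling out: a StringIO is not an IO, so passing one is rejected, while a plain String is wrapped in a StringIO internally. The sketch below reproduces the check as a free-standing helper; the name `wrap_input` is hypothetical.

```ruby
require 'stringio'

# Mirrors the input check in #initialize as a stand-alone helper.
def wrap_input(input)
  case input
  when IO     then input
  when String then StringIO.new(input)
  else raise ArgumentError,
    "don't know how to handle input of type #{input.class}"
  end
end

wrap_input('some text')    # wrapped in a StringIO
wrap_input($stdin)         # IO objects pass through unchanged

begin
  wrap_input(StringIO.new('some text'))
rescue ArgumentError => e
  puts e.message           # StringIO does not inherit from IO
end
```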

Class Attribute Details

.bom_guessers ⇒ Object (readonly)

Returns the value of attribute bom_guessers.



# File 'lib/cmess/guess_encoding/automatic.rb', line 112

def bom_guessers
  @bom_guessers
end

.encoding_guessers ⇒ Object (readonly)

Returns the value of attribute encoding_guessers.



# File 'lib/cmess/guess_encoding/automatic.rb', line 112

def encoding_guessers
  @encoding_guessers
end

.supported_boms ⇒ Object (readonly)

Returns the value of attribute supported_boms.



# File 'lib/cmess/guess_encoding/automatic.rb', line 112

def supported_boms
  @supported_boms
end

.supported_encodings ⇒ Object (readonly)

Returns the value of attribute supported_encodings.



# File 'lib/cmess/guess_encoding/automatic.rb', line 112

def supported_encodings
  @supported_encodings
end

Instance Attribute Details

#byte_count ⇒ Object (readonly)

Returns the value of attribute byte_count.



# File 'lib/cmess/guess_encoding/automatic.rb', line 158

def byte_count
  @byte_count
end

#byte_total ⇒ Object (readonly)

Returns the value of attribute byte_total.



# File 'lib/cmess/guess_encoding/automatic.rb', line 158

def byte_total
  @byte_total
end

#chunk_size ⇒ Object (readonly)

Returns the value of attribute chunk_size.



# File 'lib/cmess/guess_encoding/automatic.rb', line 158

def chunk_size
  @chunk_size
end

#first_byte ⇒ Object (readonly)

Returns the value of attribute first_byte.



# File 'lib/cmess/guess_encoding/automatic.rb', line 158

def first_byte
  @first_byte
end

#input ⇒ Object (readonly)

Returns the value of attribute input.



# File 'lib/cmess/guess_encoding/automatic.rb', line 158

def input
  @input
end

Class Method Details

.guess(input, chunk_size = nil, ignore_bom = false) ⇒ Object

Convenience method: creates a new instance for input and calls #guess on it.



# File 'lib/cmess/guess_encoding/automatic.rb', line 115

def guess(input, chunk_size = nil, ignore_bom = false)
  new(input, chunk_size).guess(ignore_bom)
end
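
A call might look like the sketch below. The require path `cmess/guess_encoding` is an assumption based on the file location shown above, and the wrapper `guess_encoding_of` is hypothetical. The wrapper returns nil when the gem cannot be loaded, so the snippet runs either way.

```ruby
# Assumption: the gem is installed and loadable under this path.
begin
  require 'cmess/guess_encoding'
  CMESS_AVAILABLE = true
rescue LoadError
  CMESS_AVAILABLE = false
end

# Guesses the encoding of +string+ using the class method documented above,
# or returns nil if the library is not available.
def guess_encoding_of(string, ignore_bom = false)
  return nil unless CMESS_AVAILABLE
  CMess::GuessEncoding::Automatic.guess(string, nil, ignore_bom)
end

puts guess_encoding_of('hello world').inspect
```

Passing nil for chunk_size keeps the default from the signature; the third argument is handed through to #guess as ignore_bom.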

Instance Method Details

#bom ⇒ Object



# File 'lib/cmess/guess_encoding/automatic.rb', line 174

def bom
  @bom ||= check_bom
end

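#guess(ignore_bom = false) ⇒ Object

Returns the encoding indicated by the BOM if one is present and ignore_bom is false. Otherwise keeps reading input, evaluates each encoding guesser in turn, and returns the first supported encoding that one of them yields, or UNKNOWN if none matches.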



# File 'lib/cmess/guess_encoding/automatic.rb', line 160

def guess(ignore_bom = false)
  return bom if bom && !ignore_bom

  while read
    encoding_guessers.each { |block|
      if encoding = instance_eval(&block) and supported_encoding?(encoding)
        return encoding
      end
    }
  end

  UNKNOWN
end
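
The control flow of #guess (read a chunk, try each guesser block in the context of the instance, return on the first hit, otherwise fall back to UNKNOWN) can be sketched without the library. Everything in the block below is hypothetical: the class `MiniGuesser`, its two toy guessers, and the default chunk size. Unlike a real detector, it decides on the first chunk that any guesser accepts.

```ruby
require 'stringio'

class MiniGuesser
  UNKNOWN = 'UNKNOWN'

  # Toy guessers, tried in order. Each runs via instance_eval, so it can
  # see @chunk, and returns an encoding name or nil.
  GUESSERS = [
    proc { 'ASCII' if @chunk.bytes.all? { |b| b < 0x80 } },
    proc { 'UTF-8' if @chunk.dup.force_encoding('UTF-8').valid_encoding? }
  ].freeze

  def initialize(input, chunk_size = 1024)
    @input, @chunk_size = StringIO.new(input), chunk_size
  end

  def guess
    # read(n) with a positive n returns nil at end of input, ending the loop.
    while (@chunk = @input.read(@chunk_size))
      GUESSERS.each { |block|
        encoding = instance_eval(&block)
        return encoding if encoding
      }
    end

    UNKNOWN
  end
end

puts MiniGuesser.new('hello').guess
puts MiniGuesser.new("gr\u00FC\u00DFe").guess
puts MiniGuesser.new("\xFF\xFE\xFA".b).guess
```

Plain procs are used rather than lambdas because instance_eval passes the receiver to the block as an argument, which a zero-argument lambda would reject.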