Class: CMess::GuessEncoding::Automatic
- Inherits:
- Object
- Extended by:
- Forwardable
- Includes:
- Encoding
- Defined in:
- lib/cmess/guess_encoding/automatic.rb
Overview
Tries to detect the encoding of a given input by applying several heuristics to determine the most likely candidate. If no heuristic matches, falls back to Encoding::UNKNOWN.
If a BOM is found, it may determine the encoding directly.
For supported encodings see EncodingGuessers and BOMGuessers.
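As a rough illustration of the BOM shortcut mentioned above, the following standalone sketch compares the leading bytes of an input against a few well-known byte order marks. The `BOMS` table and the `sniff_bom` helper are hypothetical names for this example and are not part of CMess; the real checks live in BOMGuessers and may cover a different set of encodings.

```ruby
require 'stringio'

# Hypothetical table of well-known byte order marks (not CMess's own list).
# Longer marks come first so that UTF-32LE is not mistaken for UTF-16LE.
BOMS = {
  'UTF-32BE' => [0x00, 0x00, 0xFE, 0xFF],
  'UTF-32LE' => [0xFF, 0xFE, 0x00, 0x00],
  'UTF-8'    => [0xEF, 0xBB, 0xBF],
  'UTF-16BE' => [0xFE, 0xFF],
  'UTF-16LE' => [0xFF, 0xFE]
}.freeze

# Returns the name of the encoding whose BOM starts the input, or nil.
def sniff_bom(io)
  head = io.read(4).to_s.bytes # read returns nil at EOF, hence to_s
  io.rewind                    # leave the input positioned at the start
  BOMS.each { |name, bom| return name if head.first(bom.size) == bom }
  nil
end

p sniff_bom(StringIO.new("\xEF\xBB\xBFhello".b))
p sniff_bom(StringIO.new('plain ascii'))
```

A nil result here corresponds to the case where no BOM is found and the statistical heuristics have to decide.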
Defined Under Namespace
Modules: BOMGuessers, EncodingGuessers
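Method names in these two modules follow the pattern GUESS_METHOD_RE from the constant summary. The short sketch below shows how such a name splits into its kind and its target; the method names used are made up for illustration and may not exist in CMess.

```ruby
# Pattern as documented in the constant summary.
GUESS_METHOD_RE = %r{\A((?:bom_)?encoding)_\d+_(.+)\z}

# Hypothetical method names, for illustration only.
%w[encoding_01_ASCII bom_encoding_02_UTF_8 guess].each do |name|
  if (m = GUESS_METHOD_RE.match(name))
    puts "#{name}: kind=#{m[1]}, target=#{m[2]}"
  else
    puts "#{name}: no match"
  end
end
```

The numeric part is required by the pattern but not captured, so only the prefix and the trailing name are available to the caller.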
Constant Summary
- TEST_ENCODINGS =
Single-byte encodings to test statistically by TEST_CHARS.
[ MACINTOSH, ISO_8859_1, ISO_8859_2, ISO_8859_3, ISO_8859_4, ISO_8859_5, ISO_8859_6, ISO_8859_7, ISO_8859_8, ISO_8859_9, ISO_8859_10, ISO_8859_11, ISO_8859_13, ISO_8859_14, ISO_8859_15, ISO_8859_16, CP1252, CP850, MS_ANSI ]
- CHARS_TO_TEST =
Certain (non-ASCII) chars to test for in TEST_ENCODINGS.
( '€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂ' << 'ÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ' ).chars.to_a
- TEST_CHARS =
Map TEST_ENCODINGS to respectively encoded CHARS_TO_TEST.
Hash.new { |h, k|
  e, f = self[k], UTF_8
  TEST_ENCODINGS << e unless TEST_ENCODINGS.include?(e)
  h[e] = CHARS_TO_TEST.flat_map { |c| c.encode(e, f).unpack('C') }
}.update(SafeYAML.load_file(File.join(CMess::DATA_DIR, 'test_chars.yaml')))
- TEST_THRESHOLD_DIRECT =
Relative count of TEST_CHARS must exceed this threshold to yield a direct match.
0.1
- TEST_THRESHOLD_APPROX =
Relative count of TEST_CHARS must exceed this threshold to yield an approximate match.
0.0004
- GUESS_METHOD_RE =
Pattern for method names in EncodingGuessers and BOMGuessers.
%r{\A((?:bom_)?encoding)_\d+_(.+)\z}
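The statistical test behind TEST_CHARS and the two thresholds can be pictured roughly as follows. This is a simplified standalone sketch, not CMess's actual implementation: `SAMPLE_CHARS`, `test_bytes_for` and `classify` are hypothetical names, only five characters are tested, and the way CMess combines the counts internally is an assumption.

```ruby
# Thresholds as documented in the constant summary.
TEST_THRESHOLD_DIRECT = 0.1
TEST_THRESHOLD_APPROX = 0.0004

# Hypothetical sample of test characters: ä ö ü ß é
SAMPLE_CHARS = "\u00E4\u00F6\u00FC\u00DF\u00E9".chars.freeze

# Byte values of the sample characters in a single-byte encoding,
# built the same way TEST_CHARS encodes CHARS_TO_TEST.
def test_bytes_for(encoding)
  SAMPLE_CHARS.flat_map { |c| c.encode(encoding, 'UTF-8').unpack('C') }
end

# Relative count of test bytes in the data, compared to both thresholds.
def classify(data, encoding)
  wanted = test_bytes_for(encoding)
  ratio  = data.bytes.count { |b| wanted.include?(b) } / data.bytesize.to_f

  if ratio > TEST_THRESHOLD_DIRECT
    :direct
  elsif ratio > TEST_THRESHOLD_APPROX
    :approximate
  else
    :none
  end
end

latin1 = "Gr\u00F6\u00DFe f\u00FCr K\u00E4se".encode('ISO-8859-1')
p test_bytes_for('ISO-8859-1')
p classify(latin1, 'ISO-8859-1')
p classify('plain ascii text', 'ISO-8859-1')
```

The large gap between the two thresholds means that a handful of matching bytes in a long text is enough for an approximate match, while a direct match needs more than one test byte in ten.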
Class Attribute Summary
-
.bom_guessers ⇒ Object
readonly
Returns the value of attribute bom_guessers.
-
.encoding_guessers ⇒ Object
readonly
Returns the value of attribute encoding_guessers.
-
.supported_boms ⇒ Object
readonly
Returns the value of attribute supported_boms.
-
.supported_encodings ⇒ Object
readonly
Returns the value of attribute supported_encodings.
Instance Attribute Summary
-
#byte_count ⇒ Object
readonly
Returns the value of attribute byte_count.
-
#byte_total ⇒ Object
readonly
Returns the value of attribute byte_total.
-
#chunk_size ⇒ Object
readonly
Returns the value of attribute chunk_size.
-
#first_byte ⇒ Object
readonly
Returns the value of attribute first_byte.
-
#input ⇒ Object
readonly
Returns the value of attribute input.
Class Method Summary
- .guess(input, chunk_size = nil, ignore_bom = false) ⇒ Object
Instance Method Summary
- #bom ⇒ Object
- #guess(ignore_bom = false) ⇒ Object
-
#initialize(input, chunk_size = nil) ⇒ Automatic
constructor
A new instance of Automatic.
Methods included from Encoding
Constructor Details
#initialize(input, chunk_size = nil) ⇒ Automatic
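Returns a new instance of Automatic. The input must be an IO or a String (a String is wrapped in a StringIO); any other type raises an ArgumentError.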
# File 'lib/cmess/guess_encoding/automatic.rb', line 147
def initialize(input, chunk_size = nil)
  @input = case input
    when IO     then input
    when String then StringIO.new(input)
    else raise ArgumentError,
      "don't know how to handle input of type #{input.class}"
  end

  @chunk_size = chunk_size
end
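The constructor's input handling can be tried in isolation. The sketch below copies only the `case` dispatch into a hypothetical `wrap_input` helper (not part of CMess) to show which arguments are accepted and which raise.

```ruby
require 'stringio'

# Hypothetical stand-in for the input normalization in #initialize.
def wrap_input(input)
  case input
  when IO     then input
  when String then StringIO.new(input)
  else raise ArgumentError,
    "don't know how to handle input of type #{input.class}"
  end
end

puts wrap_input('some text').class      # a String is wrapped in a StringIO
puts wrap_input($stdin).equal?($stdin)  # an IO is passed through unchanged

begin
  wrap_input(42)
rescue ArgumentError => e
  puts e.message
end
```

Note that StringIO is not a subclass of IO in Ruby, so with this dispatch a StringIO passed in directly would reach the `else` branch and raise; passing the underlying String avoids that.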
Class Attribute Details
.bom_guessers ⇒ Object (readonly)
Returns the value of attribute bom_guessers.
# File 'lib/cmess/guess_encoding/automatic.rb', line 112
def bom_guessers
  @bom_guessers
end
.encoding_guessers ⇒ Object (readonly)
Returns the value of attribute encoding_guessers.
# File 'lib/cmess/guess_encoding/automatic.rb', line 112
def encoding_guessers
  @encoding_guessers
end
.supported_boms ⇒ Object (readonly)
Returns the value of attribute supported_boms.
# File 'lib/cmess/guess_encoding/automatic.rb', line 112
def supported_boms
  @supported_boms
end
.supported_encodings ⇒ Object (readonly)
Returns the value of attribute supported_encodings.
# File 'lib/cmess/guess_encoding/automatic.rb', line 112
def supported_encodings
  @supported_encodings
end
Instance Attribute Details
#byte_count ⇒ Object (readonly)
Returns the value of attribute byte_count.
# File 'lib/cmess/guess_encoding/automatic.rb', line 158
def byte_count
  @byte_count
end
#byte_total ⇒ Object (readonly)
Returns the value of attribute byte_total.
# File 'lib/cmess/guess_encoding/automatic.rb', line 158
def byte_total
  @byte_total
end
#chunk_size ⇒ Object (readonly)
Returns the value of attribute chunk_size.
# File 'lib/cmess/guess_encoding/automatic.rb', line 158
def chunk_size
  @chunk_size
end
#first_byte ⇒ Object (readonly)
Returns the value of attribute first_byte.
# File 'lib/cmess/guess_encoding/automatic.rb', line 158
def first_byte
  @first_byte
end
#input ⇒ Object (readonly)
Returns the value of attribute input.
# File 'lib/cmess/guess_encoding/automatic.rb', line 158
def input
  @input
end
Class Method Details
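.guess(input, chunk_size = nil, ignore_bom = false) ⇒ Object
Convenience method: creates a new instance for input and chunk_size and returns the result of its #guess.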
# File 'lib/cmess/guess_encoding/automatic.rb', line 115
def guess(input, chunk_size = nil, ignore_bom = false)
  new(input, chunk_size).guess(ignore_bom)
end
Instance Method Details
#bom ⇒ Object
# File 'lib/cmess/guess_encoding/automatic.rb', line 174
def bom
  @bom ||= check_bom
end
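One property of the `||=` idiom used in #bom is that only truthy results are cached. The sketch below uses a hypothetical `BomCache` class whose `check_bom` merely counts its calls; it is not CMess code, and whether repeated checks matter in CMess depends on what its own check_bom does, which is not shown here.

```ruby
# Hypothetical class mirroring the memoization in #bom.
class BomCache
  attr_reader :calls

  def initialize(result)
    @result = result
    @calls  = 0
  end

  def bom
    @bom ||= check_bom
  end

  private

  # Stand-in for the real check: counts calls, returns the preset result.
  def check_bom
    @calls += 1
    @result
  end
end

found = BomCache.new('UTF-8')
2.times { found.bom }
puts found.calls    # the check ran once; the truthy result was cached

missing = BomCache.new(nil)
2.times { missing.bom }
puts missing.calls  # the check ran on every call; nil is not cached
```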
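#guess(ignore_bom = false) ⇒ Object
Returns the result of #bom if a BOM was found, unless ignore_bom is true. Otherwise reads the input repeatedly and evaluates each encoding guesser in turn, returning the first supported encoding found, or UNKNOWN if none matches.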
# File 'lib/cmess/guess_encoding/automatic.rb', line 160
def guess(ignore_bom = false)
  return bom if bom && !ignore_bom

  while read
    encoding_guessers.each { |block|
      if encoding = instance_eval(&block) and supported_encoding?(encoding)
        return encoding
      end
    }
  end

  UNKNOWN
end
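The control flow of the guess loop can be modelled in a few lines. The following is a simplified, hypothetical `MiniGuesser` class: its two guessers, its supported list and its chunked read are stand-ins chosen for this sketch and do not reflect CMess internals such as `read`, `encoding_guessers` or `supported_encoding?`.

```ruby
require 'stringio'

# Simplified, hypothetical model of the guess loop (not CMess code).
class MiniGuesser
  UNKNOWN   = 'UNKNOWN'
  SUPPORTED = %w[ASCII UTF-8].freeze

  # Each guesser is a block evaluated in the context of the instance,
  # so it can see @chunk; it returns an encoding name or nil.
  GUESSERS = [
    proc { 'ASCII' if @chunk.bytes.all? { |b| b < 0x80 } },
    proc { 'UTF-8' if @chunk.dup.force_encoding('UTF-8').valid_encoding? }
  ].freeze

  # Class-level shortcut, analogous to the documented .guess.
  def self.guess(input, chunk_size = 1024)
    new(input, chunk_size).guess
  end

  def initialize(input, chunk_size = 1024)
    @input, @chunk_size = StringIO.new(input), chunk_size
  end

  def guess
    # Read chunk by chunk; the first guesser that yields a supported
    # encoding wins. (A chunk boundary may split a multi-byte character;
    # this sketch ignores that.)
    while (@chunk = @input.read(@chunk_size))
      GUESSERS.each do |block|
        encoding = instance_eval(&block)
        return encoding if encoding && SUPPORTED.include?(encoding)
      end
    end

    UNKNOWN
  end
end

puts MiniGuesser.guess('hello')
puts MiniGuesser.guess("gr\u00FC\u00DFe")
puts MiniGuesser.guess("\xFF\xFE".b)
```

Because the loop returns on the first match, the order of the guessers determines which encoding wins when several would accept the same chunk.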