Class: PDF::Reader::Encoding
- Inherits: Object
- Hierarchy: Object > PDF::Reader::Encoding
- Defined in: lib/pdf/reader/encoding.rb
Overview
Utility class for working with string encodings in PDF files. It is mostly used to convert strings in the various PDF-dialect encodings into UTF-8.
Constant Summary
- CONTROL_CHARS = [0,1,2,3,4,5,6,7,8,11,12,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31] (:nodoc:)
- UNKNOWN_CHAR = 0x25AF (▯)
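As a quick check on the two constants, the sketch below copies their documented values into a hypothetical module (ConstantsSketch is not part of the library) so that it runs without the gem installed. It shows which codes below 32 are left out of CONTROL_CHARS and what UNKNOWN_CHAR looks like once packed as UTF-8.

```ruby
# Hypothetical module holding local copies of the documented values.
module ConstantsSketch
  CONTROL_CHARS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 14, 15, 16, 17, 18, 19,
                   20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31].freeze
  UNKNOWN_CHAR = 0x25AF

  # Codes below 32 that are NOT listed as control characters:
  # tab (9), line feed (10) and carriage return (13).
  def self.kept_low_codes
    (0..31).to_a - CONTROL_CHARS
  end

  # Packing the codepoint with "U*" yields the box glyph as a UTF-8 string.
  def self.unknown_glyph
    [UNKNOWN_CHAR].pack("U*")
  end
end

puts ConstantsSketch.kept_low_codes.inspect
puts ConstantsSketch.unknown_glyph
```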
Instance Attribute Summary
- #unpack ⇒ Object (readonly)
  Returns the value of attribute unpack.
Instance Method Summary
- #differences ⇒ Object
- #differences=(diff) ⇒ Object
  Set the differences table for this encoding.
- #initialize(enc) ⇒ Encoding (constructor)
  A new instance of Encoding.
- #int_to_name(glyph_code) ⇒ Object
  Convert an integer glyph code into an Adobe glyph name.
- #int_to_utf8_string(glyph_code) ⇒ Object
- #to_utf8(str) ⇒ Object
  Convert the specified string to UTF-8.
Constructor Details
#initialize(enc) ⇒ Encoding
Returns a new instance of Encoding.
    # File 'lib/pdf/reader/encoding.rb', line 39
    def initialize(enc)
      @mapping = {} # maps from character codes to Unicode codepoints
      # also maps control and invalid chars to UNKNOWN_CHAR
      @string_cache = {} # maps from character codes to UTF-8 strings.

      if enc.kind_of?(Hash)
        self.differences = enc[:Differences] if enc[:Differences]
        enc = enc[:Encoding] || enc[:BaseEncoding]
      elsif enc != nil
        enc = enc.to_sym
      else
        enc = nil
      end

      @enc_name = enc
      @unpack = get_unpack(enc)
      @map_file = get_mapping_file(enc)

      load_mapping(@map_file) if @map_file
      add_control_chars_to_mapping
    end
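The constructor accepts a Hash, a name, or nil. The sketch below isolates only the name-resolution branch of the listing so the three cases can be tried without the gem. The helper name normalize_encoding_name is hypothetical and does not exist in the library.

```ruby
# Hypothetical helper mirroring the first branch of #initialize.
def normalize_encoding_name(enc)
  if enc.kind_of?(Hash)
    # A Hash names its encoding under :Encoding, else :BaseEncoding.
    enc[:Encoding] || enc[:BaseEncoding]
  elsif !enc.nil?
    # Anything else that is not nil is converted to a Symbol.
    enc.to_sym
  end
  # nil falls through and stays nil.
end

p normalize_encoding_name(:WinAnsiEncoding)
p normalize_encoding_name("MacRomanEncoding")
p normalize_encoding_name({ BaseEncoding: :StandardEncoding, Differences: [25, :A] })
p normalize_encoding_name(nil)
```

Note that in the Hash case the value is used as found, without a to_sym call, which matches the listing.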
Instance Attribute Details
#unpack ⇒ Object (readonly)
Returns the value of attribute unpack.
    # File 'lib/pdf/reader/encoding.rb', line 37
    def unpack
      @unpack
    end
Instance Method Details
#differences ⇒ Object
    # File 'lib/pdf/reader/encoding.rb', line 87
    def differences
      # this method is only used by the spec tests
      @differences ||= {}
    end
#differences=(diff) ⇒ Object
Sets the differences table for this encoding. The argument should be an array in the following format:
[25, :A, 26, :B]
The array alternates between a decimal byte number and the glyph name to map to that byte.
To save space, the following array is also valid and equivalent to the previous one, because consecutive glyph names are assigned to consecutive bytes:
[25, :A, :B]
    # File 'lib/pdf/reader/encoding.rb', line 70
    def differences=(diff)
      raise ArgumentError, "diff must be an array" unless diff.kind_of?(Array)

      @differences = {}
      byte = 0
      diff.each do |val|
        if val.kind_of?(Numeric)
          byte = val.to_i
        else
          @differences[byte] = val
          @mapping[byte] = names_to_unicode[val]
          byte += 1
        end
      end
      @differences
    end
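To see why the long and short array forms are equivalent, the sketch below reproduces the expansion loop as a standalone function. The name expand_differences is hypothetical, and the sketch leaves out the @mapping update, which depends on the library's glyph-name table.

```ruby
# Hypothetical standalone version of the loop in #differences=.
def expand_differences(diff)
  raise ArgumentError, "diff must be an array" unless diff.kind_of?(Array)

  table = {}
  byte = 0
  diff.each do |val|
    if val.kind_of?(Numeric)
      byte = val.to_i    # a number resets the current byte position
    else
      table[byte] = val  # a glyph name is assigned to the current byte
      byte += 1          # and the position advances by one
    end
  end
  table
end

p expand_differences([25, :A, 26, :B])
p expand_differences([25, :A, :B])
```

Both calls produce the same table, mapping byte 25 to :A and byte 26 to :B.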
#int_to_name(glyph_code) ⇒ Object
Converts an integer glyph code into an Adobe glyph name.
int_to_name(65)
=> :A
The standard character encodings are defined at the bottom of lib/pdf/reader/encoding.rb.
    # File 'lib/pdf/reader/encoding.rb', line 122
    def int_to_name(glyph_code)
      if @enc_name == :"Identity-H" || @enc_name == :"Identity-V"
        nil
      elsif @enc_name == :MacRomanEncoding
        MAC_ROMAN_ENCODING_TO_NAME[glyph_code]
      elsif @enc_name == :WinAnsiEncoding
        WIN_ANSI_ENCODING_TO_NAME[glyph_code]
      elsif @differences
        @differences[glyph_code]
      elsif @enc_name == :StandardEncoding
        STANDARD_ENCODING_TO_NAME[glyph_code]
      else
        raise "#{@enc_name} does not have an int_to_name mapping"
      end
    end
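The order of the branches in the listing matters: the MacRoman and WinAnsi tables are consulted before @differences. The sketch below illustrates that ordering with a reduced set of branches. Both sketch_int_to_name and the two-entry WIN_ANSI_SAMPLE table are hypothetical stand-ins, not the library's method or its full constant.

```ruby
# Hypothetical two-entry stand-in for the full WinAnsi name table.
WIN_ANSI_SAMPLE = { 65 => :A, 66 => :B }.freeze

# Reduced version of the lookup order used by #int_to_name.
def sketch_int_to_name(enc_name, differences, glyph_code)
  if enc_name == :"Identity-H" || enc_name == :"Identity-V"
    nil                            # identity encodings have no glyph names
  elsif enc_name == :WinAnsiEncoding
    WIN_ANSI_SAMPLE[glyph_code]    # built-in table is checked first
  elsif differences
    differences[glyph_code]        # then the differences table
  else
    raise "#{enc_name} does not have an int_to_name mapping"
  end
end

p sketch_int_to_name(:WinAnsiEncoding, { 65 => :X }, 65)
p sketch_int_to_name(:CustomEncoding, { 25 => :A }, 25)
p sketch_int_to_name(:"Identity-H", nil, 65)
```

In the first call the differences entry for 65 is ignored, because the WinAnsi branch is reached first.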
#int_to_utf8_string(glyph_code) ⇒ Object
    # File 'lib/pdf/reader/encoding.rb', line 111
    def int_to_utf8_string(glyph_code)
      @string_cache[glyph_code] ||= internal_int_to_utf8_string(glyph_code)
    end
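The method memoizes each conversion in @string_cache using the ||= idiom. A minimal sketch of that pattern follows. The CachedConverterSketch class and its convert method are hypothetical, and convert simply treats the glyph code as a Unicode codepoint, which is an assumption made to keep the example standalone.

```ruby
# Hypothetical class demonstrating the ||= hash-cache pattern.
class CachedConverterSketch
  attr_reader :conversions

  def initialize
    @string_cache = {}  # maps glyph codes to UTF-8 strings
    @conversions = 0    # counts how often the slow path runs
  end

  def int_to_utf8_string(glyph_code)
    @string_cache[glyph_code] ||= convert(glyph_code)
  end

  private

  # Stand-in for the library's private conversion (assumption:
  # the glyph code is already a Unicode codepoint).
  def convert(glyph_code)
    @conversions += 1
    [glyph_code].pack("U*")
  end
end

converter = CachedConverterSketch.new
2.times { converter.int_to_utf8_string(65) }
puts converter.conversions
```

The second lookup for the same code is served from the hash, so the conversion runs only once.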
#to_utf8(str) ⇒ Object
Converts the specified string to UTF-8 in the following steps:
- unpack raw bytes into codepoints
- replace any that have entries in the differences table with a glyph name
- convert codepoints from source encoding to Unicode codepoints
- convert any glyph names to Unicode codepoints
- replace characters that didn't convert to Unicode nicely with something valid
- pack the final array of Unicode codepoints into a UTF-8 string
- mark the string as UTF-8 if we're running on an M17N-aware VM
    # File 'lib/pdf/reader/encoding.rb', line 103
    def to_utf8(str)
      if utf8_conversion_impossible?
        little_boxes(str.unpack(unpack).size)
      else
        convert_to_utf8(str)
      end
    end
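The conversion steps listed for this method can be sketched for a single-byte encoding as follows. This is an illustration under stated assumptions, not the library's private convert_to_utf8: the function sketch_to_utf8, the constant SKETCH_UNKNOWN_CHAR, and the three lookup tables passed as arguments are all hypothetical.

```ruby
SKETCH_UNKNOWN_CHAR = 0x25AF # same value as the documented UNKNOWN_CHAR

# mapping:     byte value => Unicode codepoint
# differences: byte value => glyph name
# names:       glyph name => Unicode codepoint
def sketch_to_utf8(str, mapping, differences = {}, names = {})
  codepoints = str.unpack("C*").map do |byte|   # unpack raw bytes
    name = differences[byte]                    # differences entry, if any
    codepoint = name ? names[name] : mapping[byte]
    codepoint || SKETCH_UNKNOWN_CHAR            # fall back to the box glyph
  end
  # The "U*" directive packs codepoints and tags the result as UTF-8.
  codepoints.pack("U*")
end

mapping = { 0x41 => 0x41, 0x80 => 0x20AC }
puts sketch_to_utf8("A\x80".b, mapping)
puts sketch_to_utf8("A".b, mapping, { 0x41 => :B }, { B: 0x42 })
```

On current Ruby versions the string returned by pack("U*") already carries the UTF-8 encoding, so the final marking step needs no separate call in this sketch.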