Class: PDF::Reader::ObjectHash

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/pdf/reader/object_hash.rb

Overview

Provides low-level access to the objects in a PDF file via a hash-like object.

A PDF file can be viewed as a large hash map. It is a series of objects stored at precise byte offsets, and a table that maps object IDs to byte offsets. Given an object ID, looking up an object is an O(1) operation.
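To make the offset-table idea concrete, here is a minimal sketch using a toy in-memory "file". The data layout and the `lookup` lambda are hypothetical and are not real PDF syntax or part of PDF::Reader; they only illustrate why a lookup by ID needs one table probe and one seek.

```ruby
require "stringio"

# Hypothetical toy data: three "objects" stored back to back.
# This is not real PDF syntax, only an illustration of offset lookup.
objects = { 1 => "first object", 2 => "second object", 3 => "third object" }

data = String.new
offsets = {}
objects.each do |id, body|
  offsets[id] = data.bytesize   # record the byte offset where this object starts
  data << body << "\n"
end

io = StringIO.new(data)

# Look up an object by ID: one hash lookup, one seek, one read.
lookup = lambda do |id|
  io.seek(offsets.fetch(id))
  io.gets.chomp
end

puts lookup.call(2)
```

The cost of `lookup` does not depend on how many objects the table holds, which is the property the paragraph above describes.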

Each PDF object can be mapped to a Ruby object, so passing an object ID to the [] method returns a Ruby representation of that object.

The class behaves much like a standard Ruby hash, including the use of the Enumerable mixin. The key difference is no []= method - the hash is read only.
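The read-only, Enumerable-backed design can be sketched with a small stand-in class. `ReadOnlyTable` is hypothetical and not part of PDF::Reader; it only shows how defining `[]` and `each` (and no `[]=`) gives hash-like reads without any way to write.

```ruby
# Hypothetical stand-in class, not part of PDF::Reader.
class ReadOnlyTable
  include Enumerable

  def initialize(pairs)
    @pairs = pairs.dup.freeze
  end

  # Read access only; no []= is defined.
  def [](key)
    @pairs[key]
  end

  # Enumerable only needs #each; map, select, count, etc. build on it.
  def each(&block)
    @pairs.each(&block)
  end
end

table = ReadOnlyTable.new({ 1 => :Catalog, 2 => :Pages, 3 => :Page })

table[2]                       # lookup works like a Hash
table.map { |id, _type| id }   # Enumerable method built on #each
table.respond_to?(:[]=)        # no writer is available
```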

Basic Usage

h = PDF::Reader::ObjectHash.new("somefile.pdf")
h[1]
=> 3469

h[PDF::Reader::Reference.new(1,0)]
=> 3469

Direct Known Subclasses

Hash

Constant Summary

CACHEABLE_TYPES =
[:Catalog, :Page, :Pages]

Constructor Details

#initialize(input) ⇒ ObjectHash

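Creates a new ObjectHash object. input can be an IO-like object (anything that responds to seek and read) or a string naming an existing file. Any other input, including a string of raw PDF data, raises ArgumentError in the listing below.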



# File 'lib/pdf/reader/object_hash.rb', line 39

def initialize(input)
  if input.respond_to?(:seek) && input.respond_to?(:read)
    @io = input
  elsif File.file?(input.to_s)
    if File.respond_to?(:binread)
      input = File.binread(input.to_s)
    else
      input = File.read(input.to_s)
    end
    @io = StringIO.new(input)
  else
    raise ArgumentError, "input must be an IO-like object or a filename"
  end
  @pdf_version = read_version
  @xref        = PDF::Reader::XRef.new(@io)
  @trailer     = @xref.trailer
  @cache       = PDF::Reader::ObjectCache.new

  if trailer[:Encrypt]
    raise ::PDF::Reader::UnsupportedFeatureError, 'PDF::Reader cannot read encrypted PDF files'
  end
end

Instance Attribute Details

#default ⇒ Object

Returns the value of attribute default.



# File 'lib/pdf/reader/object_hash.rb', line 33

def default
  @default
end

#pdf_version ⇒ Object (readonly)

Returns the value of attribute pdf_version.



# File 'lib/pdf/reader/object_hash.rb', line 34

def pdf_version
  @pdf_version
end

#trailer ⇒ Object (readonly)

Returns the value of attribute trailer.



# File 'lib/pdf/reader/object_hash.rb', line 34

def trailer
  @trailer
end

Instance Method Details

#[](key) ⇒ Object

Access an object from the PDF. key can be an int or a PDF::Reader::Reference object.

If an int is used, the object with that ID and a generation number of 0 will be returned.

If a PDF::Reader::Reference object is used the exact ID and generation number can be specified.



# File 'lib/pdf/reader/object_hash.rb', line 85

def [](key)
  return default if key.to_i <= 0
  begin
    unless key.kind_of?(PDF::Reader::Reference)
      key = PDF::Reader::Reference.new(key.to_i, 0)
    end
    if @cache.has_key?(key)
      @cache[key]
    elsif xref[key].is_a?(Fixnum)
      buf = new_buffer(xref[key])
      @cache[key] = Parser.new(buf, self).object(key.id, key.gen)
    elsif xref[key].is_a?(PDF::Reader::Reference)
      container_key = xref[key]
      object_streams[container_key] ||= PDF::Reader::ObjectStream.new(object(container_key))
      @cache[key] = object_streams[container_key][key.id]
    end
  rescue InvalidObjectError
    return default
  end
end
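The key handling at the top of this method can be illustrated on its own. In the sketch below, `Ref` is a hypothetical stand-in for PDF::Reader::Reference and `normalize_key` is a made-up helper; only the normalization step is shown, not the cache or xref lookup.

```ruby
# Hypothetical stand-in for PDF::Reader::Reference: an object ID plus a
# generation number. to_i returns the ID so integer checks work on both.
Ref = Struct.new(:id, :gen) do
  def to_i
    id
  end
end

# Normalize a key the way the method above does: non-positive IDs are
# rejected (nil here), bare integers get generation 0, refs pass through.
def normalize_key(key)
  return nil if key.to_i <= 0
  key.is_a?(Ref) ? key : Ref.new(key.to_i, 0)
end

normalize_key(7)              # integer becomes a generation-0 ref
normalize_key(Ref.new(7, 2))  # an explicit ref keeps its generation
normalize_key(0)              # non-positive IDs are not valid keys
```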

#cacheable?(obj) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/pdf/reader/object_hash.rb', line 106

def cacheable?(obj)
  obj.is_a?(Hash) && CACHEABLE_TYPES.include?(obj[:Type])
end

#each(&block) ⇒ Object Also known as: each_pair

Iterates over each key/value pair, just like a Ruby hash.



# File 'lib/pdf/reader/object_hash.rb', line 143

def each(&block)
  @xref.each do |ref|
    yield ref, self[ref]
  end
end

#each_key(&block) ⇒ Object

Iterates over each key, just like a Ruby hash.



# File 'lib/pdf/reader/object_hash.rb', line 152

def each_key(&block)
  each do |id, obj|
    yield id
  end
end

#each_value(&block) ⇒ Object

Iterates over each value, just like a Ruby hash.



# File 'lib/pdf/reader/object_hash.rb', line 160

def each_value(&block)
  each do |id, obj|
    yield obj
  end
end

#empty? ⇒ Boolean

Returns true if there are no objects in this file.

Returns:

  • (Boolean)


# File 'lib/pdf/reader/object_hash.rb', line 175

def empty?
  size == 0 ? true : false
end

#fetch(key, local_default = nil) ⇒ Object

Access an object from the PDF. key can be an int or a PDF::Reader::Reference object.

If an int is used, the object with that ID and a generation number of 0 will be returned.

If a PDF::Reader::Reference object is used the exact ID and generation number can be specified.

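local_default is the object that will be returned if the requested key doesn’t exist. In the listing below, IndexError is raised only when no default is given and the key is not a positive integer; a missing positive key with no default returns nil.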



# File 'lib/pdf/reader/object_hash.rb', line 130

def fetch(key, local_default = nil)
  obj = self[key]
  if obj
    return obj
  elsif local_default
    return local_default
  else
    raise IndexError, "#{key} is invalid" if key.to_i <= 0
  end
end
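The three outcomes of this method are easier to see against a toy table. `ToyTable` below is hypothetical, with a plain Hash standing in for the PDF object lookup; its `fetch` follows the same branching as the listing above.

```ruby
# Hypothetical stand-in: a plain Hash plays the role of the object table.
class ToyTable
  def initialize(objects)
    @objects = objects
  end

  def [](key)
    return nil if key.to_i <= 0
    @objects[key.to_i]
  end

  # Same branching as the listing above.
  def fetch(key, local_default = nil)
    obj = self[key]
    if obj
      obj
    elsif local_default
      local_default
    elsif key.to_i <= 0
      raise IndexError, "#{key} is invalid"
    end
  end
end

table = ToyTable.new({ 1 => "catalog" })

table.fetch(1)              # existing key: the stored object
table.fetch(9, "fallback")  # missing key with a default: the default
table.fetch(9)              # missing positive key, no default: nil

begin
  table.fetch(0)            # non-positive key, no default: IndexError
rescue IndexError => e
  message = e.message
end
```

Note that this differs from Ruby's own Hash#fetch, which raises for any missing key when no default is supplied.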

#has_key?(check_key) ⇒ Boolean Also known as: include?, key?, member?, value?

Returns true if the specified key exists in the file. check_key can be an integer or a PDF::Reader::Reference.

Returns:

  • (Boolean)


# File 'lib/pdf/reader/object_hash.rb', line 182

def has_key?(check_key)
  # TODO update from O(n) to O(1)
  each_key do |key|
    if check_key.kind_of?(PDF::Reader::Reference)
      return true if check_key == key
    else
      return true if check_key.to_i == key.id
    end
  end
  return false
end
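The TODO in the listing asks for a constant-time check. One possible approach, shown here only as a sketch and not as what PDF::Reader does, is to build sets of references and IDs once and probe them. `Ref` and `KeyIndex` are hypothetical names.

```ruby
require "set"

# Hypothetical stand-in for PDF::Reader::Reference.
Ref = Struct.new(:id, :gen)

# Build both lookup sets once, so each later check is a hash probe
# instead of a scan over every key.
class KeyIndex
  def initialize(refs)
    @refs = refs.to_set
    @ids  = refs.map(&:id).to_set
  end

  def has_key?(check_key)
    if check_key.is_a?(Ref)
      @refs.include?(check_key)      # match on ID and generation
    else
      @ids.include?(check_key.to_i)  # match on ID alone
    end
  end
end

index = KeyIndex.new([Ref.new(1, 0), Ref.new(2, 0), Ref.new(5, 1)])

index.has_key?(5)              # matches on ID alone
index.has_key?(Ref.new(5, 1))  # matches ID and generation
index.has_key?(Ref.new(5, 0))  # wrong generation, no match
```

The trade-off is extra memory for the two sets and the need to keep them in sync with the underlying table.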

#has_value?(value) ⇒ Boolean

Returns true if the specified value exists in the file.

Returns:

  • (Boolean)


# File 'lib/pdf/reader/object_hash.rb', line 199

def has_value?(value)
  # TODO update from O(n) to O(1)
  each_value do |obj|
    return true if obj == value
  end
  return false
end

#keys ⇒ Object

Returns an array of all keys in the file.



# File 'lib/pdf/reader/object_hash.rb', line 214

def keys
  ret = []
  each_key { |k| ret << k }
  ret
end

#obj_type(ref) ⇒ Object

Returns the type of the object a ref points to, as a Symbol naming its Ruby class, or nil if the lookup raises an error.



# File 'lib/pdf/reader/object_hash.rb', line 63

def obj_type(ref)
  self[ref].class.to_s.to_sym
rescue
  nil
end

#object(key) ⇒ Object Also known as: deref

If key is a PDF::Reader::Reference object, lookup the corresponding object in the PDF and return it. Otherwise return key untouched.



# File 'lib/pdf/reader/object_hash.rb', line 113

def object(key)
  key.is_a?(PDF::Reader::Reference) ? self[key] : key
end
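Values inside a PDF dictionary may be stored directly or as a reference to another object, so a pass-through dereference lets callers treat both cases the same way. The sketch below uses a hypothetical `Ref` struct and a plain Hash (`OBJECTS`) in place of the real classes.

```ruby
# Hypothetical stand-ins for PDF::Reader::Reference and the object table.
Ref = Struct.new(:id, :gen)
OBJECTS = { Ref.new(4, 0) => { Type: :Page } }

# References are looked up; any other value passes through unchanged.
def deref(key)
  key.is_a?(Ref) ? OBJECTS[key] : key
end

deref(Ref.new(4, 0))   # resolved to the stored dictionary
deref(42)              # not a reference, returned as-is
deref("literal")       # likewise
```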

#page_references ⇒ Object

Returns an array of PDF::Reader::References. Each reference in the array points to a Page object, one for each page in the PDF. The first reference is page 1, the second is page 2, and so on.

Useful for apps that want to extract data from specific pages.



# File 'lib/pdf/reader/object_hash.rb', line 250

def page_references
  root  = fetch(trailer[:Root])
  @page_references ||= get_page_objects(root[:Pages]).flatten
end
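The listing relies on get_page_objects, whose body is not shown here. As an illustration only, the sketch below walks a toy page tree in which Pages nodes list their children under :Kids and leaves have :Type :Page. `Ref`, `OBJECTS` and `collect_pages` are hypothetical and are not the library's implementation.

```ruby
# Hypothetical stand-in for PDF::Reader::Reference.
Ref = Struct.new(:id, :gen)

# Toy page tree: a root Pages node with one leaf and one nested Pages node.
OBJECTS = {
  Ref.new(2, 0) => { Type: :Pages, Kids: [Ref.new(3, 0), Ref.new(4, 0)] },
  Ref.new(3, 0) => { Type: :Page },
  Ref.new(4, 0) => { Type: :Pages, Kids: [Ref.new(5, 0), Ref.new(6, 0)] },
  Ref.new(5, 0) => { Type: :Page },
  Ref.new(6, 0) => { Type: :Page }
}

# Hypothetical walker: returns leaf Page references in document order.
def collect_pages(ref)
  node = OBJECTS.fetch(ref)
  if node[:Type] == :Pages
    node[:Kids].flat_map { |kid| collect_pages(kid) }
  else
    [ref]
  end
end

pages = collect_pages(Ref.new(2, 0))
pages.map(&:id)   # IDs of the leaf pages, in page order
```

Because the result is ordered, the reference for page n is simply the element at index n - 1.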

#size ⇒ Object Also known as: length

Returns the number of objects in the file. An object with multiple generations is counted once.



# File 'lib/pdf/reader/object_hash.rb', line 168

def size
  xref.size
end

#stream?(ref) ⇒ Boolean

Returns true if the supplied reference points to an object with a stream.

Returns:

  • (Boolean)


# File 'lib/pdf/reader/object_hash.rb', line 70

def stream?(ref)
  self[ref].class == PDF::Reader::Stream
rescue
  false
end

#to_a ⇒ Object

Returns an array of arrays. Each sub-array contains a key/value pair.



# File 'lib/pdf/reader/object_hash.rb', line 236

def to_a
  ret = []
  each do |id, obj|
    ret << [id, obj]
  end
  ret
end

#to_s ⇒ Object



# File 'lib/pdf/reader/object_hash.rb', line 208

def to_s
  "<PDF::Reader::ObjectHash size: #{self.size}>"
end

#values ⇒ Object

Returns an array of all values in the file.



# File 'lib/pdf/reader/object_hash.rb', line 222

def values
  ret = []
  each_value { |v| ret << v }
  ret
end

#values_at(*ids) ⇒ Object

Returns an array of the values for the specified keys, in the order the keys were given.



# File 'lib/pdf/reader/object_hash.rb', line 230

def values_at(*ids)
  ids.map { |id| self[id] }
end