Module: HexaPDF::Task::Optimize

Defined in:
lib/hexapdf/task/optimize.rb

Overview

Task for optimizing the PDF document.

For a list of optimization methods this task can perform have a look at the ::call method.

Defined Under Namespace

Classes: SerializationProcessor

Class Method Summary collapse

Class Method Details

.call(doc, compact: false, object_streams: :preserve, xref_streams: :preserve, compress_pages: false, prune_page_resources: false) ⇒ Object

Optimizes the PDF document.

The field entries that are optional and set to their default value are always deleted. Additional optimization methods are performed depending on the values of the following arguments:

compact

Compacts the object space by merging the revisions and then deleting null and unused values if set to true.

object_streams

Specifies if and how object streams should be used: For :preserve, existing object streams are preserved; for :generate objects are packed into object streams as much as possible; and for :delete existing object streams are deleted.

xref_streams

Specifies if cross-reference streams should be used. Can be :preserve (no modifications), :generate (use cross-reference streams) or :delete (remove cross-reference streams).

If object_streams is set to :generate, this option is implicitly changed to :generate.

compress_pages

Compresses the content streams of all pages if set to true. Note that this can take a very long time because each content stream has to be unfiltered, parsed, serialized and then filtered again.

prune_page_resources

Removes all unused XObjects from the resources dictionaries of all pages. It is recommended to also set the compact argument because otherwise the unused XObjects won’t be deleted from the document.

This is sometimes necessary after importing pages from other PDF files that use a single resources dictionary for all pages.



85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
# File 'lib/hexapdf/task/optimize.rb', line 85

def self.call(doc, compact: false, object_streams: :preserve, xref_streams: :preserve,
              compress_pages: false, prune_page_resources: false)
  used_refs = compress_pages(doc) if compress_pages
  prune_page_resources(doc, used_refs) if prune_page_resources

  if compact
    compact(doc, object_streams, xref_streams)
  elsif object_streams != :preserve
    process_object_streams(doc, object_streams, xref_streams)
  elsif xref_streams != :preserve
    process_xref_streams(doc, xref_streams)
  else
    doc.each(&method(:delete_fields_with_defaults))
  end
end

.compact(doc, object_streams, xref_streams) ⇒ Object

Compacts the document by merging all revisions into one, deleting null and unused entries and renumbering the objects.

For the meaning of the other arguments see ::call.



105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
# File 'lib/hexapdf/task/optimize.rb', line 105

def self.compact(doc, object_streams, xref_streams)
  doc.revisions.merge
  unused = Set.new(doc.task(:dereference))
  rev = doc.revisions.add

  oid = 1
  doc.revisions.all[0].each do |obj|
    if obj.null? || unused.include?(obj) || (obj.type == :ObjStm) ||
        (obj.type == :XRef && xref_streams != :preserve)
      obj.data.value = nil
      next
    end

    delete_fields_with_defaults(obj)
    obj.oid = oid
    obj.gen = 0
    rev.add(obj)
    oid += 1
  end
  doc.revisions.all.delete_at(0)

  if object_streams == :generate
    process_object_streams(doc, :generate, xref_streams)
  elsif xref_streams == :generate
    doc.add({}, type: Type::XRefStream)
  end
end

.compress_pages(doc) ⇒ Object

Compresses the contents of all pages by parsing and then serializing again. The HexaPDF serializer is already optimized for small output size so nothing else needs to be done.

Returns a hash of the form key=>true where the keys are the used XObjects (for use with #prune_page_resources).



234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
# File 'lib/hexapdf/task/optimize.rb', line 234

def self.compress_pages(doc)
  used_refs = {}
  doc.pages.each do |page|
    processor = SerializationProcessor.new do |error_message|
      doc.config['parser.on_correctable_error'].call(doc, error_message, 0) &&
        raise(HexaPDF::Error, error_message)
    end
    HexaPDF::Content::Parser.parse(page.contents, processor)
    page.contents = processor.result
    page[:Contents].set_filter(:FlateDecode)
    xobjects = page.resources[:XObject]
    processor.used_references.each {|ref| used_refs[xobjects[ref]] = true } if xobjects
  end
  used_refs
end

.delete_fields_with_defaults(obj) ⇒ Object

Deletes field entries (except for /Type) of the object that are optional and currently set to their default value.



219
220
221
222
223
224
225
226
227
# File 'lib/hexapdf/task/optimize.rb', line 219

def self.delete_fields_with_defaults(obj)
  return unless obj.kind_of?(HexaPDF::Dictionary) && !obj.null?
  obj.each do |name, value|
    if name != :Type && (field = obj.class.field(name)) && !field.required? &&
       field.default? && value == field.default
      obj.delete(name)
    end
  end
end

.process_object_streams(doc, method, xref_streams) ⇒ Object

Processes the object streams in each revision according to method: For :preserve, nothing is done, for :delete all object streams are deleted and for :generate objects are packed into object streams as much as possible.



136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
# File 'lib/hexapdf/task/optimize.rb', line 136

def self.process_object_streams(doc, method, xref_streams)
  case method
  when :delete
    doc.revisions.each do |rev|
      xref_stream = false
      objects_to_delete = []
      rev.each do |obj|
        case obj.type
        when :ObjStm
          objects_to_delete << obj
        when :XRef
          xref_stream = true
          objects_to_delete << obj if xref_streams == :delete
        else
          delete_fields_with_defaults(obj)
        end
      end
      objects_to_delete.each {|obj| rev.delete(obj) }
      if xref_streams == :generate && !xref_stream
        rev.add(doc.wrap({}, type: Type::XRefStream, oid: doc.revisions.next_oid))
      end
    end
  when :generate
    doc.revisions.each do |rev|
      xref_stream = false
      count = 0
      objstms = [doc.wrap({}, type: Type::ObjectStream)]
      old_objstms = []
      rev.each do |obj|
        case obj.type
        when :XRef
          xref_stream = true
        when :ObjStm
          old_objstms << obj
        end
        delete_fields_with_defaults(obj)

        next if obj.respond_to?(:stream)

        objstms[-1].add_object(obj)
        count += 1
        if count == 200
          objstms << doc.wrap({}, type: Type::ObjectStream)
          count = 0
        end
      end
      old_objstms.each {|objstm| rev.delete(objstm) }
      objstms.each do |objstm|
        objstm.data.oid = doc.revisions.next_oid
        rev.add(objstm)
      end
      rev.add(doc.wrap({}, type: Type::XRefStream, oid: doc.revisions.next_oid)) unless xref_stream
    end
  end
end

.process_xref_streams(doc, method) ⇒ Object

Processes the cross-reference streams in each revision according to method: For :preserve, nothing is done, for :delete all cross-reference streams are deleted and for :generate cross-reference streams are added.



195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
# File 'lib/hexapdf/task/optimize.rb', line 195

def self.process_xref_streams(doc, method)
  case method
  when :delete
    doc.each do |obj, rev|
      if obj.type == :XRef
        rev.delete(obj)
      else
        delete_fields_with_defaults(obj)
      end
    end
  when :generate
    doc.revisions.each do |rev|
      xref_stream = false
      rev.each do |obj|
        xref_stream = true if obj.type == :XRef
        delete_fields_with_defaults(obj)
      end
      rev.add(doc.wrap({}, type: Type::XRefStream, oid: doc.revisions.next_oid)) unless xref_stream
    end
  end
end

.prune_page_resources(doc, used_refs) ⇒ Object

Deletes all XObject entries from the resources dictionaries of all pages whose names do not match the keys in used_refs.



252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
# File 'lib/hexapdf/task/optimize.rb', line 252

def self.prune_page_resources(doc, used_refs)
  unless used_refs
    used_refs = {}
    doc.pages.each do |page|
      next unless (xobjects = page.resources[:XObject])
      HexaPDF::Content::Parser.parse(page.contents) do |op, operands|
        used_refs[xobjects[operands[0]]] = true if op == :Do
      end
    end
  end

  doc.pages.each do |page|
    next unless (xobjects = page.resources[:XObject])
    xobjects.each do |key, obj|
      next if used_refs[obj]
      xobjects.delete(key)
    end
  end
end