Module: Mindee::PDF::PDFCompressor

Defined in:
lib/mindee/pdf/pdf_compressor.rb

Overview

Module that compresses PDFs by rasterizing each page into a JPEG image and rebuilding the document from the compressed pages.

Class Method Details

.compress_pdf(pdf_data, quality: 85, force_source_text_compression: false, disable_source_text: true) ⇒ Object

Compresses each page of a provided PDF stream. If source text is detected and force_source_text_compression isn't set, compression is skipped and the input is returned unchanged.

Parameters:

  • pdf_data (StringIO)

    Stream representation of the PDF to compress.

  • quality (Integer) (defaults to: 85)

    Compression quality (70-100 for most JPG images in the test dataset).

  • force_source_text_compression (Boolean) (defaults to: false)

    If true, compresses the PDF even when source text is detected. Re-writing the detected text is experimental.

  • disable_source_text (Boolean) (defaults to: true)

    If true, doesn't re-apply the source text to the compressed output PDF.



# File 'lib/mindee/pdf/pdf_compressor.rb', line 14

def self.compress_pdf(pdf_data, quality: 85, force_source_text_compression: false, disable_source_text: true)
  if PDFTools.source_text?(pdf_data)
    if force_source_text_compression
      if disable_source_text
        puts "\e[33m[WARNING] Re-writing PDF source-text is an EXPERIMENTAL feature.\e[0m"
      else
        puts "\e[33m[WARNING] Source-file contains text, but disable_source_text flag is ignored. " \
             "Resulting file will not contain any embedded text.\e[0m"
      end
    else
      puts "\e[33m[WARNING] Source-text detected in input PDF. Aborting operation.\e[0m"
      return pdf_data
    end
  end

  pdf_data.rewind
  pdf = Origami::PDF.read(pdf_data)
  pages = process_pdf_pages(pdf, quality)

  output_pdf = create_output_pdf(pages, disable_source_text, pdf_data)

  output_stream = StringIO.new
  output_pdf.save(output_stream)
  output_stream
end
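
The guard clause at the top of compress_pdf can be read as a small decision table. The sketch below is a stand-alone illustration and not part of the library: `compression_action` and the symbols it returns are hypothetical names, and it only mirrors the control flow of the listing above without touching any PDF.

```ruby
# Hypothetical helper (not part of Mindee): reports which path a call to
# compress_pdf would take for a given combination of flags.
def compression_action(has_source_text:, force_source_text_compression: false, disable_source_text: true)
  # No embedded text detected: the pages are rasterized and compressed.
  return :compress unless has_source_text
  # Embedded text without the force flag: the input is returned untouched.
  return :skip unless force_source_text_compression

  # Forced compression: text is re-injected only when disable_source_text
  # is false (create_output_pdf calls inject_text in that case).
  disable_source_text ? :compress_without_text : :compress_and_reinject_text
end

compression_action(has_source_text: true)
# => :skip
compression_action(has_source_text: true, force_source_text_compression: true, disable_source_text: false)
# => :compress_and_reinject_text
```

With the default arguments, a PDF that already contains text is therefore never modified.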

.create_output_pdf(pages, disable_source_text, pdf_data) ⇒ Origami::PDF

Creates the output PDF with processed pages.

Parameters:

  • pages (Array)

    Processed pages.

  • disable_source_text (Boolean)

    Whether to disable source text.

  • pdf_data (StringIO)

    Original PDF data.

Returns:

  • (Origami::PDF)

    Output PDF object.



# File 'lib/mindee/pdf/pdf_compressor.rb', line 55

def self.create_output_pdf(pages, disable_source_text, pdf_data)
  output_pdf = Origami::PDF.new
  # NOTE: Page order and XObject handling require adjustment due to origami adding the last page first.
  pages.rotate!(1) if pages.count >= 2

  inject_text(pdf_data, pages) unless disable_source_text

  pages.each { |page| output_pdf.append_page(page) }

  output_pdf
end
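
To see what the `pages.rotate!(1)` call does to the page order before the pages are appended, here is a minimal sketch using plain strings in place of Origami::Page objects. The helper name `reorder_for_append` is hypothetical, and the sketch returns a rotated copy instead of mutating its argument.

```ruby
# Hypothetical helper: applies the same guard and rotation as
# create_output_pdf, on a copy of the array.
def reorder_for_append(pages)
  # Array#rotate(1) moves the first element to the end of the array.
  pages.count >= 2 ? pages.rotate(1) : pages.dup
end

reorder_for_append(%w[page1 page2 page3])
# => ["page2", "page3", "page1"]
reorder_for_append(%w[page1])
# => ["page1"]
```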

.inject_text(pdf_data, pages) ⇒ Object

Extracts text from a source text PDF, and injects it into a newly-created one.

Parameters:

  • pdf_data (StringIO)

    Stream representation of the PDF.

  • pages (Array<Origami::Page>)

    Array of pages containing the rasterized version of the initial pages.



# File 'lib/mindee/pdf/pdf_compressor.rb', line 70

def self.inject_text(pdf_data, pages)
  reader = PDFReader::Reader.new(pdf_data)

  reader.pages.each_with_index do |original_page, index|
    break if index >= pages.length

    receiver = PDFReader::Reader::PageTextReceiver.new
    original_page.walk(receiver)

    receiver.runs.each do |text_run|
      x = text_run.origin.x
      y = text_run.origin.y
      text = text_run.text
      font_size = text_run.font_size

      content_stream = Origami::Stream.new
      content_stream.dictionary[:Filter] = :FlateDecode
      content_stream.data = "BT\n/F1 #{font_size} Tf\n#{x} #{y} Td\n(#{text}) Tj\nET\n"

      pages[index].Contents.data += content_stream.data
    end
  end
end
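
The listing above interpolates each text run directly into a PDF literal string, `(#{text}) Tj`. In PDF syntax, backslashes and unbalanced parentheses inside a literal string must be escaped, so a run such as `Total (EUR` could yield an invalid content stream. The sketch below shows one way to build the same operators with escaping applied. It is an illustration only: `escape_pdf_literal` and `text_operators` are hypothetical helpers, not part of Mindee or Origami.

```ruby
# Hypothetical helper: escapes the characters that are special inside a
# PDF literal string (backslash and both parentheses).
def escape_pdf_literal(text)
  text.gsub(/[\\()]/) { |char| "\\#{char}" }
end

# Hypothetical helper: builds the same operator sequence as inject_text,
# with the text escaped first. /F1 is the font name used in the listing.
def text_operators(x, y, font_size, text)
  "BT\n/F1 #{font_size} Tf\n#{x} #{y} Td\n(#{escape_pdf_literal(text)}) Tj\nET\n"
end

puts text_operators(10, 20, 12, 'Hi (there)')
```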

.process_pdf_page(page_stream, page_index, image_quality, media_box) ⇒ Origami::Page

Takes in a page stream, rasterizes it into a JPEG image, and applies the result onto a new Origami PDF page.

Parameters:

  • page_stream (StringIO)

    Stream representation of a single page from the initial PDF.

  • page_index (Integer)

    Index of the current page. Used to name the page's image XObject ("X1" for the first page, "X2" for the second, and so on).

  • image_quality (Integer)

    Quality to apply to the rasterized page.

  • media_box (Array<Integer>, nil)

    Extracted media box from the page. Can be nil.

Returns:

  • (Origami::Page)

    New page containing the compressed image.


# File 'lib/mindee/pdf/pdf_compressor.rb', line 100

def self.process_pdf_page(page_stream, page_index, image_quality, media_box)
  new_page = Origami::Page.new
  compressed_image = Mindee::Image::ImageUtils.pdf_to_magick_image(page_stream, image_quality)
  width, height = Mindee::Image::ImageUtils.calculate_dimensions_from_media_box(compressed_image, media_box)

  compressed_xobject = PDF::PDFTools.create_xobject(compressed_image)
  PDF::PDFTools.set_xobject_properties(compressed_xobject, compressed_image)

  xobject_name = "X#{page_index + 1}"
  PDF::PDFTools.add_content_to_page(new_page, xobject_name, width, height)
  new_page.add_xobject(compressed_xobject, xobject_name)

  PDF::PDFTools.set_page_dimensions(new_page, width, height)
  new_page
end
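
The media_box argument may be nil, so the page dimensions need a fallback. The implementation of `calculate_dimensions_from_media_box` is not shown on this page. The sketch below is therefore only one plausible approach, under the assumption that a media box is the usual PDF rectangle `[llx, lly, urx, ury]` and that the rasterized image size is used when no box is available. `ImageSize` and `dimensions_from_media_box` are hypothetical names.

```ruby
# Hypothetical stand-in for the rasterized image; only its size matters here.
ImageSize = Struct.new(:width, :height)

# Sketch (an assumption, not the library's code): derive the page width and
# height from a media box, falling back to the image size without one.
def dimensions_from_media_box(image, media_box)
  return [image.width, image.height] if media_box.nil? || media_box.size < 4

  llx, lly, urx, ury = media_box.first(4).map(&:to_f)
  [(urx - llx).abs, (ury - lly).abs]
end

image = ImageSize.new(800, 600)
dimensions_from_media_box(image, [0, 0, 612, 792])
# => [612.0, 792.0]
dimensions_from_media_box(image, nil)
# => [800, 600]
```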

.process_pdf_pages(pdf, quality) ⇒ Array<Origami::Page>

Processes all pages in the PDF.

Parameters:

  • pdf (Origami::PDF)

    The Origami PDF object to process.

  • quality (Integer)

    Compression quality.

Returns:

  • (Array<Origami::Page>)

    Processed pages.



# File 'lib/mindee/pdf/pdf_compressor.rb', line 44

def self.process_pdf_pages(pdf, quality)
  pdf.pages.map.with_index do |page, index|
    process_pdf_page(Mindee::PDF::PdfProcessor.get_page(pdf, index), index, quality, page[:MediaBox])
  end
end