Class: DerivativeRodeo::Services::PdfSplitter::PagesSummary

Inherits:
Struct
  • Object
Defined in:
lib/derivative_rodeo/services/pdf_splitter/pages_summary.rb

Overview

A simple data structure that summarizes the image properties of the given path.

Constant Summary

COL_WIDTH = 3
    class constant column numbers
COL_HEIGHT = 4
COL_COLOR_DESC = 5
COL_CHANNELS = 6
COL_BITS = 7
COL_XPPI = 12
    only poppler 0.25+ has this column in output
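To make the column numbers concrete, here is a minimal sketch that splits the sample row from the `pdfimages -list` example on this page and reads each cell by index. The constants are redefined locally only so the snippet runs on its own; the `summary` hash is illustrative and not part of the class.

```ruby
# Local copies of the column indexes, so this sketch is self-contained.
COL_WIDTH = 3
COL_HEIGHT = 4
COL_COLOR_DESC = 5
COL_CHANNELS = 6
COL_BITS = 7
COL_XPPI = 12

# One data row of `pdfimages -list` output (poppler 0.25+ layout).
row = '   1     0 image    2475   413  rgb     3   8  jpeg   no        10  0   300   300 21.8K 0.7%'

# String#split with no arguments splits on runs of whitespace and ignores
# leading whitespace, so index 0 is the page number.
cells = row.split

summary = {
  width: cells[COL_WIDTH].to_i,
  height: cells[COL_HEIGHT].to_i,
  color_description: cells[COL_COLOR_DESC],
  channels: cells[COL_CHANNELS].to_i,
  bits_per_channel: cells[COL_BITS].to_i,
  pixels_per_inch: cells[COL_XPPI].to_i
}
```

Note that the "object ID" header covers two cells (10 and 11), which is why `x-ppi` lands at index 12.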

Instance Attribute Details

#bits_per_channel ⇒ Object

Also known as: bits

Returns the value of attribute bits_per_channel

Returns:

  • (Object)

    the current value of bits_per_channel



# File 'lib/derivative_rodeo/services/pdf_splitter/pages_summary.rb', line 9

def bits_per_channel
  @bits_per_channel
end

#channels ⇒ Object

Returns the value of attribute channels

Returns:

  • (Object)

    the current value of channels



# File 'lib/derivative_rodeo/services/pdf_splitter/pages_summary.rb', line 9

def channels
  @channels
end

#color_description ⇒ Object

Returns the value of attribute color_description

Returns:

  • (Object)

    the current value of color_description



# File 'lib/derivative_rodeo/services/pdf_splitter/pages_summary.rb', line 9

def color_description
  @color_description
end

#height ⇒ Object

Returns the value of attribute height

Returns:

  • (Object)

    the current value of height



# File 'lib/derivative_rodeo/services/pdf_splitter/pages_summary.rb', line 9

def height
  @height
end

#page_count ⇒ Object

Returns the value of attribute page_count

Returns:

  • (Object)

    the current value of page_count



# File 'lib/derivative_rodeo/services/pdf_splitter/pages_summary.rb', line 9

def page_count
  @page_count
end

#path ⇒ Object

Returns the value of attribute path

Returns:

  • (Object)

    the current value of path



# File 'lib/derivative_rodeo/services/pdf_splitter/pages_summary.rb', line 9

def path
  @path
end

#pixels_per_inch ⇒ Object

Also known as: ppi

Returns the value of attribute pixels_per_inch

Returns:

  • (Object)

    the current value of pixels_per_inch



# File 'lib/derivative_rodeo/services/pdf_splitter/pages_summary.rb', line 9

def pixels_per_inch
  @pixels_per_inch
end

#width ⇒ Object

Returns the value of attribute width

Returns:

  • (Object)

    the current value of width



# File 'lib/derivative_rodeo/services/pdf_splitter/pages_summary.rb', line 9

def width
  @width
end

Class Method Details

.extract_from(path:) ⇒ DerivativeRodeo::Services::PdfSplitter::PagesSummary

Note:

Uses the pdfimages command from poppler 0.19+ to extract image listing metadata from PDF files, though the implementation is optimized for poppler 0.25 or later.

Note:

For dpi extraction, falls back to calculating with MiniMagick, if necessary.
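The fallback arithmetic can be sketched on its own. This assumes, as the variable names in the source suggest, that the width MiniMagick reports for the PDF is measured in points (1 point = 1/72 inch). The helper name `estimate_ppi` and the 594-point page width are hypothetical, chosen only for illustration.

```ruby
# Estimate pixels per inch from the widest embedded image (in pixels) and the
# page width (assumed to be in PDF points, 72 points per inch).
# `estimate_ppi` is a hypothetical helper, not part of the gem.
def estimate_ppi(width_px:, width_points:)
  # With Integer arguments, `/` is already integer division; `to_i` only
  # matters if a Float is passed in.
  (72 * width_px / width_points).to_i
end

ppi = estimate_ppi(width_px: 2475, width_points: 594)
```

Here 72 * 2475 = 178200, and 178200 / 594 = 300, which matches the x-ppi value that newer poppler versions report directly for the sample image.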

Responsible for determining the image properties of the PDF.

The first two lines of the pdfimages -list output are tabular header information:

Examples:

Output from PDF Images


bash-5.1$ pdfimages -list fmc_color.pdf  | head -5
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    2475   413  rgb     3   8  jpeg   no        10  0   300   300 21.8K 0.7%
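A listing like the one above is reduced to a single summary by skipping the two header lines and keeping the largest value seen in each numeric column. The following is a simplified sketch of that reduction over a hard-coded listing, without shelling out to pdfimages. The second data row, the `COLUMNS` hash, and the `summarize` helper are hypothetical and exist only for this illustration.

```ruby
# Sample listing; the second data row is invented for illustration.
LISTING = <<~TEXT
  page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
  --------------------------------------------------------------------------------------------
     1     0 image    2475   413  rgb     3   8  jpeg   no        10  0   300   300 21.8K 0.7%
     2     1 image    1200  1600  gray    1   8  jpeg   no        14  0   150   150 40.1K 2.1%
TEXT

# Summary key => zero-based column index in a data row.
COLUMNS = { width: 3, height: 4, channels: 6, bits_per_channel: 7, pixels_per_inch: 12 }.freeze
COLOR_COLUMN = 5

def summarize(listing)
  totals = { page_count: 0, color_description: 'gray' }
  COLUMNS.each_key { |key| totals[key] = 0 }

  listing.each_line.with_index do |line, index|
    next if index <= 1 # skip the two header lines

    cells = line.split
    totals[:page_count] += 1
    # Any non-gray image marks the whole document as rgb.
    totals[:color_description] = 'rgb' if cells[COLOR_COLUMN] != 'gray'
    # Keep the maximum seen so far for every numeric column.
    COLUMNS.each { |key, col| totals[key] = [totals[key], cells[col].to_i].max }
  end

  totals
end

totals = summarize(LISTING)
```

One consequence worth noting: `page_count` is incremented once per listed image row, so it matches the number of pages only when each page contains exactly one image.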

Parameters:

  • path (String)

Returns:

  • (DerivativeRodeo::Services::PdfSplitter::PagesSummary)


# File 'lib/derivative_rodeo/services/pdf_splitter/pages_summary.rb', line 72

def PagesSummary.extract_from(path:)
  # NOTE: https://github.com/scientist-softserv/iiif_print/pull/223/files for piping warnings
  # to /dev/null
  command = format('pdfimages -list %<path>s 2>/dev/null', path: path)

  page_count = 0
  color_description = 'gray'
  width = 0
  height = 0
  channels = 0
  bits_per_channel = 0
  pixels_per_inch = 0
  Open3.popen3(command) do |_stdin, stdout, _stderr, _wait_thr|
    stdout.read.split("\n").each_with_index do |line, index|
      # Skip the two header lines (see the above example)
      next if index <= 1

      page_count += 1
      cells = line.gsub(/\s+/m, ' ').strip.split(' ')

      color_description = 'rgb' if cells[COL_COLOR_DESC] != 'gray'
      width = cells[COL_WIDTH].to_i if cells[COL_WIDTH].to_i > width
      height = cells[COL_HEIGHT].to_i if cells[COL_HEIGHT].to_i > height
      channels = cells[COL_CHANNELS].to_i if cells[COL_CHANNELS].to_i > channels
      bits_per_channel = cells[COL_BITS].to_i if cells[COL_BITS].to_i > bits_per_channel

      # In the case of poppler version < 0.25, we will have no more than 12 columns.  As such,
      # we need to do some alternative magic to calculate this.
      if page_count == 1 && cells.size <= 12
        pdf = MiniMagick::Image.open(path)
        width_points = pdf.width
        width_px = width
        pixels_per_inch = (72 * width_px / width_points).to_i
      elsif cells[COL_XPPI].to_i > pixels_per_inch
        pixels_per_inch = cells[COL_XPPI].to_i
      end
      # By the magic of nil#to_i, if we don't have more than 12 columns, we've already set
      # the pixels_per_inch and the elsif branch won't do much of anything.
    end
  end

  new(
    path: path,
    page_count: page_count,
    pixels_per_inch: pixels_per_inch,
    width: width,
    height: height,
    color_description: color_description,
    channels: channels,
    bits_per_channel: bits_per_channel
  )
end
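The method above interpolates `path` into the shell command as-is, so a path containing spaces or shell metacharacters would be split or interpreted by the shell. One possible hardening, shown here as a sketch rather than as the library's behavior, is to escape the path with Ruby's standard `Shellwords.escape`. The helper name `pdfimages_command` is hypothetical.

```ruby
require 'shellwords'

# Build the pdfimages command with the path escaped for the shell.
# `pdfimages_command` is a hypothetical helper, not part of the gem.
def pdfimages_command(path)
  format('pdfimages -list %<path>s 2>/dev/null', path: Shellwords.escape(path))
end

plain = pdfimages_command('fmc_color.pdf')
spaced = pdfimages_command('scans/my file.pdf')
```

A path made up only of shell-safe characters is left unchanged, while the space in the second path is backslash-escaped so the shell passes it to pdfimages as a single argument.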

Instance Method Details

#color ⇒ Array<String, Integer, Integer>

Returns:

  • (Array<String, Integer, Integer>)


# File 'lib/derivative_rodeo/services/pdf_splitter/pages_summary.rb', line 24

def color
  [color_description, channels, bits_per_channel]
end

#valid? ⇒ Boolean

If the underlying extraction couldn’t set the various properties, we likely have an invalid PDF.

Returns:

  • (Boolean)


# File 'lib/derivative_rodeo/services/pdf_splitter/pages_summary.rb', line 32

def valid?
  # Read the struct's own attributes; no pdf_pages_summary method is
  # documented on this class, so calling it here would raise a NameError.
  return false if color_description.nil?
  return false if channels.nil?
  return false if bits_per_channel.nil?
  return false if height.nil?
  return false if page_count.to_i.zero?

  true
end
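A short usage sketch of `#color` and `#valid?` follows. It uses a hypothetical stand-in struct, `SketchSummary`, so it runs without the gem. The stand-in assumes keyword initialization (as the `new(path: ..., ...)` call in `.extract_from` suggests) and writes the validity checks against the struct's own readers.

```ruby
# Hypothetical stand-in for PagesSummary; the real class is defined in the gem.
SketchSummary = Struct.new(
  :path, :page_count, :width, :height, :pixels_per_inch,
  :color_description, :channels, :bits_per_channel,
  keyword_init: true
) do
  def color
    [color_description, channels, bits_per_channel]
  end

  def valid?
    return false if color_description.nil?
    return false if channels.nil?
    return false if bits_per_channel.nil?
    return false if height.nil?
    return false if page_count.to_i.zero?

    true
  end
end

# Values taken from the sample pdfimages row on this page.
good = SketchSummary.new(
  path: 'fmc_color.pdf', page_count: 1, width: 2475, height: 413,
  pixels_per_inch: 300, color_description: 'rgb', channels: 3, bits_per_channel: 8
)

# The starting values .extract_from uses when no image rows are found.
empty = SketchSummary.new(
  path: 'broken.pdf', page_count: 0, width: 0, height: 0,
  pixels_per_inch: 0, color_description: 'gray', channels: 0, bits_per_channel: 0
)

good_color = good.color
good_valid = good.valid?
empty_valid = empty.valid?
```

Because `.extract_from` starts from non-nil defaults, a summary built from an unreadable PDF fails validation through the `page_count` check rather than through any of the `nil?` checks.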