Class: DerivativeRodeo::Generators::PdfSplitGenerator

Inherits:
BaseGenerator
  • Object
Includes:
CopyFileConcern
Defined in:
lib/derivative_rodeo/generators/pdf_split_generator.rb

Overview

This class splits each given PDF (see BaseGenerator#input_files) into one image per page (see #with_each_requisite_location_and_tmp_file_path). We need to ensure that each of those image files exists in S3 or file storage, and then enqueue those files for processing.

Instance Attribute Summary

Attributes inherited from BaseGenerator

#input_uris, #output_extension, #output_location_template, #preprocessed_location_template

Class Method Summary

Instance Method Summary

Methods included from CopyFileConcern

#build_step, #copy

Methods inherited from BaseGenerator

#build_step, #destination, #generated_files, #generated_uris, #initialize, #input_files, #run, #valid_instantiation?

Constructor Details

This class inherits a constructor from DerivativeRodeo::Generators::BaseGenerator

Class Method Details

.filename_for_a_derived_page_from_a_pdf?(filename:, extension: nil) ⇒ TrueClass, FalseClass

A helper method for downstream implementations to ask if this file is perhaps split from a PDF.

Parameters:

  • filename (String)
  • extension (String) (defaults to: nil)

    the extension (either with or without the leading period); if none is provided use the extension of the given :filename.

Returns:

  • (TrueClass)

    when the file name likely represents a file split from a PDF.

  • (FalseClass)

    when the file name does not, by convention, represent a file split from a PDF.




# File 'lib/derivative_rodeo/generators/pdf_split_generator.rb', line 31

def self.filename_for_a_derived_page_from_a_pdf?(filename:, extension: nil)
  extension ||= File.extname(filename)

  # Strip the leading period from the extension.
  extension = extension[1..-1] if extension.start_with?('.')
  regexp = %r{--page-\d+\.#{extension}$}
  !!regexp.match(filename)
end
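
The naming convention can be tried out without loading the gem. The following is a minimal sketch that copies the matching logic above into a standalone helper; the name derived_page_filename? is hypothetical and exists only for this illustration.

```ruby
# Standalone copy of the matching logic shown above, under a hypothetical
# name, so the example runs without the derivative_rodeo gem.
def derived_page_filename?(filename:, extension: nil)
  extension ||= File.extname(filename)
  # Strip the leading period from the extension.
  extension = extension[1..-1] if extension.start_with?('.')
  !!%r{--page-\d+\.#{extension}$}.match(filename)
end

derived_page_filename?(filename: "123.ARCHIVAL--page-1.tiff")                     # => true
derived_page_filename?(filename: "123.ARCHIVAL.pdf")                              # => false
derived_page_filename?(filename: "123.ARCHIVAL--page-1.tiff", extension: ".tiff") # => true
derived_page_filename?(filename: "123.ARCHIVAL--page-1.tiff", extension: "jpg")   # => false
```

The extension may be given with or without its leading period; when it is omitted, the extension of the file name itself is used.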

Instance Method Details

#derive_preprocessed_template_from(input_location:, preprocessed_location_template:) ⇒ Object

We’re working with an input location whose file name is “123.ARCHIVAL--page-1.tiff”. The :preprocessed_location_template, due to constraints, likely ends with the original PDF’s filename (e.g. “123.ARCHIVAL.pdf”).

Because the template has no concept of a page number, we introduce this kludge: keep the template’s directory and replace its file name with the input location’s file name.



# File 'lib/derivative_rodeo/generators/pdf_split_generator.rb', line 140

def derive_preprocessed_template_from(input_location:, preprocessed_location_template:)
  File.join(File.dirname(preprocessed_location_template), input_location.file_name)
end
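
To make the kludge concrete, here is a small sketch that runs the same File.join and File.dirname call outside the class. The Struct stand-in for the input location and the /tmp/preprocessed path are assumptions made up for this example; real templates may look different.

```ruby
# Same body as the method above, defined at top level for illustration.
def derive_preprocessed_template_from(input_location:, preprocessed_location_template:)
  File.join(File.dirname(preprocessed_location_template), input_location.file_name)
end

# Hypothetical stand-in for a storage location; only #file_name is needed here.
page_location = Struct.new(:file_name).new("123.ARCHIVAL--page-1.tiff")

derive_preprocessed_template_from(
  input_location: page_location,
  preprocessed_location_template: "/tmp/preprocessed/123.ARCHIVAL.pdf"
)
# => "/tmp/preprocessed/123.ARCHIVAL--page-1.tiff"
```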

#existing_page_locations(input_location:) ⇒ Enumerable<StorageLocations::BaseLocation>

Note:

This method is related to BaseGenerator#destination.

Note:

The tail_regexp corresponds to the file names produced by #image_file_basename_template.

We want to check the output location and pre-processed location for the existence of already split pages. This method checks both places.

Parameters:

  • input_location

Returns:

  • (Enumerable<StorageLocations::BaseLocation>)



# File 'lib/derivative_rodeo/generators/pdf_split_generator.rb', line 70

def existing_page_locations(input_location:)
  # See image_file_basename_template
  tail_regexp = %r{#{input_location.file_basename}--page-\d+\.#{output_extension}$}

  output_locations = input_location.derived_file_from(template: output_location_template).matching_locations_in_file_dir(tail_regexp: tail_regexp)
  return output_locations if output_locations.count.positive?

  return [] if preprocessed_location_template.blank?

  input_location.derived_file_from(template: preprocessed_location_template).matching_locations_in_file_dir(tail_regexp: tail_regexp)
end
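
The lookup order (output location first, then the pre-processed location) and the effect of tail_regexp can be sketched with plain path strings. This is an illustration only: existing_page_paths is a hypothetical helper, and arrays of strings stand in for the storage locations and their directory listings.

```ruby
# Hypothetical helper: plain path strings stand in for StorageLocations objects.
def existing_page_paths(basename:, extension:, output_paths:, preprocessed_paths: nil)
  # Same shape as the tail_regexp built in the method above.
  tail_regexp = %r{#{basename}--page-\d+\.#{extension}$}

  # Prefer pages already present at the output location.
  found = output_paths.grep(tail_regexp)
  return found unless found.empty?

  # No pre-processed location configured: nothing else to check.
  return [] if preprocessed_paths.nil?

  preprocessed_paths.grep(tail_regexp)
end

existing_page_paths(
  basename: "123.ARCHIVAL",
  extension: "tiff",
  output_paths: ["out/123.ARCHIVAL.pdf"],
  preprocessed_paths: [
    "pre/123.ARCHIVAL--page-1.tiff",
    "pre/123.ARCHIVAL--page-2.tiff",
    "pre/456.ARCHIVAL--page-1.tiff"
  ]
)
# => ["pre/123.ARCHIVAL--page-1.tiff", "pre/123.ARCHIVAL--page-2.tiff"]
```

Note that the base name is interpolated into the regular expression as-is, so a period in it (as in “123.ARCHIVAL”) matches any character. Wrapping the base name in Regexp.escape would make the match literal, if that ever matters for a given set of file names.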

#image_file_basename_template(basename:) ⇒ String

Note:

This must include “%d” in the returned value, as that is how Ghostscript will assign the page number.

Note:

I have extracted this function to make the expected location of each split image abundantly clear. Further, there is an interaction between this template and #existing_page_locations, whose tail_regexp matches the generated file names.

Parameters:

  • basename (String)

    The given PDF file’s base name (e.g. “hello.pdf” would have a base name of “hello”).

Returns:

  • (String)

    A template for the filenames of the images produced by Ghostscript.




# File 'lib/derivative_rodeo/generators/pdf_split_generator.rb', line 54

def image_file_basename_template(basename:)
  "#{basename}--page-%d.#{output_extension}"
end
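
As a rough illustration of what the “%d” placeholder produces, the sketch below fills the template with Ruby’s Kernel#format. This only emulates the substitution; in the generator it is Ghostscript that replaces “%d” with the page number. The output_extension keyword is added here as an assumption so the method runs outside the class, where it would normally come from the generator instance.

```ruby
# Standalone variant of the method above; output_extension is passed in
# explicitly because there is no generator instance in this sketch.
def image_file_basename_template(basename:, output_extension:)
  "#{basename}--page-%d.#{output_extension}"
end

template = image_file_basename_template(basename: "hello", output_extension: "tiff")
# => "hello--page-%d.tiff"

# Emulate the page-number substitution for pages 1 through 3.
(1..3).map { |page_number| format(template, page_number) }
# => ["hello--page-1.tiff", "hello--page-2.tiff", "hello--page-3.tiff"]
```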

#with_each_requisite_location_and_tmp_file_path(splitter: Services::PdfSplitter) {|image_location, image_path| ... } ⇒ Object

Note:

This function makes a concession; namely, if it encounters any #existing_page_locations, it will use all of that result as the entire set of pages. We could make this smarter, but at the moment we’re deferring on that.

Take the given PDF(s) and split each into one image per page. Remember that the URL should account for the page number.

When we have two PDFs (10 pages and 20 pages respectively), we will have 30 requisite files; the files must have URLs that associate with their respective parent PDFs.

Parameters:

  • splitter (#call) (defaults to: Services::PdfSplitter)

Yield Parameters:

  • image_location
  • image_path




# File 'lib/derivative_rodeo/generators/pdf_split_generator.rb', line 104

def with_each_requisite_location_and_tmp_file_path(splitter: Services::PdfSplitter)
  input_files.each do |input_location|
    input_location.with_existing_tmp_path do |input_tmp_file_path|
      existing_locations = existing_page_locations(input_location: input_location)

      if existing_locations.count.positive?
        logger.info("#{self.class}##{__method__} found #{existing_locations.count} file(s) at existing split location for #{input_location.file_uri.inspect}.")
        existing_locations.each_with_index do |location, index|
          logger.info("#{self.class}##{__method__} found ##{index} split file #{location.file_path.inspect} for #{input_location.file_uri.inspect}.")
          yield(location, location.file_path)
        end
      else
        logger.info("#{self.class}##{__method__} did not find at existing location split files for #{input_location.file_uri.inspect}.  Proceeding with #{splitter}.call")
        # We're going to need to create the files and "cast" them to locations.
        splitter.call(
          input_tmp_file_path,
          image_extension: output_extension,
          image_file_basename_template: image_file_basename_template(basename: input_location.file_basename)
        ).each_with_index do |image_path, index|
          logger.info("#{self.class}##{__method__} generated (via #{splitter}.call) ##{index} split file #{image_path.inspect} for #{input_location.file_uri.inspect}.")
          image_location = StorageLocations::FileLocation.new("file://#{image_path}")
          yield(image_location, image_path)
        end
      end
    end
  end
end
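
The control flow above, including the concession about existing pages, can be sketched without the gem. Everything in this example is a simplification made up for illustration: each_page_path is a hypothetical helper, plain path strings replace storage locations, a Hash replaces the #existing_page_locations lookup, and a lambda replaces Services::PdfSplitter (it only needs to respond to call with the same arguments).

```ruby
# Hypothetical, simplified version of the loop above using plain strings.
def each_page_path(pdf_paths:, existing_pages:, splitter:, output_extension: "tiff")
  pdf_paths.each do |pdf_path|
    basename = File.basename(pdf_path, ".pdf")
    found = existing_pages.fetch(pdf_path, [])

    pages =
      if found.any?
        # Concession: whatever already exists is treated as the full page set.
        found
      else
        splitter.call(
          pdf_path,
          image_extension: output_extension,
          image_file_basename_template: "#{basename}--page-%d.#{output_extension}"
        )
      end

    pages.each { |page_path| yield(pdf_path, page_path) }
  end
end

# Fake splitter: pretends every PDF has two pages and returns their paths.
fake_splitter = lambda do |pdf_path, image_extension:, image_file_basename_template:|
  (1..2).map { |n| File.join(File.dirname(pdf_path), format(image_file_basename_template, n)) }
end

results = []
each_page_path(
  pdf_paths: ["/tmp/a.pdf", "/tmp/b.pdf"],
  existing_pages: { "/tmp/a.pdf" => ["/out/a--page-1.tiff"] },
  splitter: fake_splitter
) { |pdf_path, page_path| results << [File.basename(pdf_path), File.basename(page_path)] }

results
# => [["a.pdf", "a--page-1.tiff"], ["b.pdf", "b--page-1.tiff"], ["b.pdf", "b--page-2.tiff"]]
```

In this run a.pdf yields only its one pre-existing page and the splitter is never called for it, while b.pdf is split. Each yielded page stays paired with its parent PDF, which is the association the description above asks for.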