Class: DerivativeRodeo::Generators::PdfSplitGenerator
- Inherits: BaseGenerator
  - Object
  - BaseGenerator
  - DerivativeRodeo::Generators::PdfSplitGenerator
- Includes: CopyFileConcern
- Defined in: lib/derivative_rodeo/generators/pdf_split_generator.rb
Overview
This class is responsible for splitting each given PDF (e.g. BaseGenerator#input_files) into one image per page (e.g. #with_each_requisite_location_and_tmp_file_path). We need to ensure that we have each of those image files in S3/file storage then enqueue those files for processing.
Instance Attribute Summary
Attributes inherited from BaseGenerator
#input_uris, #output_extension, #output_location_template, #preprocessed_location_template
Class Method Summary
- .filename_for_a_derived_page_from_a_pdf?(filename:, extension: nil) ⇒ TrueClass, FalseClass
A helper method for downstream implementations to ask if this file is perhaps split from a PDF.
Instance Method Summary
- #derive_preprocessed_template_from(input_location:, preprocessed_location_template:) ⇒ Object
We’re working with an input location whose file name is “123.ARCHIVAL--page-1.tiff”. The :preprocessed_location_template, due to constraints, likely ends with the original PDF’s filename (e.g. “123.ARCHIVAL.pdf”).
- #existing_page_locations(input_location:) ⇒ Enumerable<StorageLocations::BaseLocation>
We want to check the output location and pre-processed location for the existence of already split pages.
- #image_file_basename_template(basename:) ⇒ String
This must include “%d” in the returned value, as that is how Ghostscript will assign the page number.
- #with_each_requisite_location_and_tmp_file_path(splitter: Services::PdfSplitter) {|image_location, image_path| ... } ⇒ Object
Take the given PDF(s) and split each into one image per page.
Methods included from CopyFileConcern
Methods inherited from BaseGenerator
#build_step, #destination, #generated_files, #generated_uris, #initialize, #input_files, #run, #valid_instantiation?
Constructor Details
This class inherits a constructor from DerivativeRodeo::Generators::BaseGenerator
Class Method Details
.filename_for_a_derived_page_from_a_pdf?(filename:, extension: nil) ⇒ TrueClass, FalseClass
A helper method for downstream implementations to ask if this file is perhaps split from a PDF.
    # File 'lib/derivative_rodeo/generators/pdf_split_generator.rb', line 31
    def self.filename_for_a_derived_page_from_a_pdf?(filename:, extension: nil)
      extension ||= File.extname(filename)
      # Strip the leading period from the extension.
      extension = extension[1..-1] if extension.start_with?('.')
      regexp = %r{--page-\d+\.#{extension}$}
      !!regexp.match(filename)
    end
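To illustrate what the helper accepts and rejects, here is a minimal sketch. It assumes a standalone copy of the method body shown above (the name `derived_page_from_pdf?` is hypothetical) so that it runs without loading the gem; the filenames are made up for illustration.

```ruby
# Standalone copy of the documented logic; the method name is hypothetical.
def derived_page_from_pdf?(filename:, extension: nil)
  extension ||= File.extname(filename)
  # Strip the leading period so ".tiff" and "tiff" behave the same.
  extension = extension[1..-1] if extension.start_with?('.')
  regexp = %r{--page-\d+\.#{extension}$}
  !!regexp.match(filename)
end

# A page image named by the "--page-N" convention is recognized.
puts derived_page_from_pdf?(filename: "123.ARCHIVAL--page-1.tiff")                    # true
# The original PDF lacks the "--page-N" segment.
puts derived_page_from_pdf?(filename: "123.ARCHIVAL.pdf")                             # false
# An explicit extension that does not match the file is rejected.
puts derived_page_from_pdf?(filename: "123.ARCHIVAL--page-1.tiff", extension: "jpg")  # false
```

Because the check is purely name-based, it answers "perhaps split from a PDF" rather than proving it; any file that happens to follow the naming convention will match.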
Instance Method Details
#derive_preprocessed_template_from(input_location:, preprocessed_location_template:) ⇒ Object
We’re working with an input location whose file name is “123.ARCHIVAL--page-1.tiff”. The :preprocessed_location_template, due to constraints, likely ends with the original PDF’s filename (e.g. “123.ARCHIVAL.pdf”).
Since the template has no concept of a page number, we introduce this kludge: swap the template’s trailing filename for the page image’s file name.
    # File 'lib/derivative_rodeo/generators/pdf_split_generator.rb', line 140
    def derive_preprocessed_template_from(input_location:, preprocessed_location_template:)
      File.join(File.dirname(preprocessed_location_template), input_location.file_name)
    end
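The path rewrite can be sketched in isolation. This example assumes a hypothetical `PageLocation` struct standing in for a storage location (only `file_name` is modeled) and a made-up template string; the method name is also hypothetical, though its body mirrors the one documented above.

```ruby
# Hypothetical stand-in for a storage location; only #file_name is modeled.
PageLocation = Struct.new(:file_name)

# Same body as the documented method, under a hypothetical name.
def derive_preprocessed_template_sketch(input_location:, preprocessed_location_template:)
  File.join(File.dirname(preprocessed_location_template), input_location.file_name)
end

page = PageLocation.new("123.ARCHIVAL--page-1.tiff")
# Made-up template that ends with the original PDF's filename.
template = "s3://example-bucket/preprocessed/123.ARCHIVAL.pdf"

puts derive_preprocessed_template_sketch(input_location: page, preprocessed_location_template: template)
# prints s3://example-bucket/preprocessed/123.ARCHIVAL--page-1.tiff
```

The directory portion of the template is kept and only the trailing filename is replaced, which is why the result points at the page image rather than the parent PDF.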
#existing_page_locations(input_location:) ⇒ Enumerable<StorageLocations::BaseLocation>
We want to check the output location and pre-processed location for the existence of already split pages. This method checks both places, output location first.
This method is related to BaseGenerator#destination.
The tail_regexp corresponds to the filename produced by #image_file_basename_template.
    # File 'lib/derivative_rodeo/generators/pdf_split_generator.rb', line 70
    def existing_page_locations(input_location:)
      # See image_file_basename_template
      tail_regexp = %r{#{input_location.file_basename}--page-\d+\.#{output_extension}$}

      output_locations = input_location.derived_file_from(template: output_location_template).matching_locations_in_file_dir(tail_regexp: tail_regexp)
      return output_locations if output_locations.count.positive?

      return [] if preprocessed_location_template.blank?

      input_location.derived_file_from(template: preprocessed_location_template).matching_locations_in_file_dir(tail_regexp: tail_regexp)
    end
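The lookup order above (output location first, pre-processed location as a fallback) can be modeled with plain arrays of filenames. This is a simplified sketch, not the gem’s implementation: `existing_pages_sketch` and its parameters are hypothetical, and directory listings are replaced by in-memory arrays.

```ruby
# Simplified model: directory listings are plain arrays of filenames.
def existing_pages_sketch(output_dir_files:, preprocessed_dir_files:, file_basename:, output_extension:)
  # Same shape as the tail_regexp in the documented method.
  tail_regexp = %r{#{file_basename}--page-\d+\.#{output_extension}$}

  from_output = output_dir_files.grep(tail_regexp)
  return from_output if from_output.any?

  # Models the "no pre-processed template configured" early return.
  return [] if preprocessed_dir_files.nil? || preprocessed_dir_files.empty?

  preprocessed_dir_files.grep(tail_regexp)
end

pages = existing_pages_sketch(
  output_dir_files: ["123.ARCHIVAL.pdf"], # no split pages here yet
  preprocessed_dir_files: ["123.ARCHIVAL--page-1.tiff", "123.ARCHIVAL--page-2.tiff", "999.OTHER--page-1.tiff"],
  file_basename: "123.ARCHIVAL",
  output_extension: "tiff"
)
puts pages.inspect
# prints ["123.ARCHIVAL--page-1.tiff", "123.ARCHIVAL--page-2.tiff"]
```

One observation from the regexp as written: the basename is interpolated without escaping, so a “.” in the basename matches any character. Whether that matters in practice depends on the filenames in use.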
#image_file_basename_template(basename:) ⇒ String
This must include “%d” in the returned value, as that is how Ghostscript will assign the page number.
I have extracted this function to make the expected location of each split image abundantly clear. Further, there is an interaction between this method and #existing_page_locations, whose tail_regexp must match the names this template produces.
    # File 'lib/derivative_rodeo/generators/pdf_split_generator.rb', line 54
    def image_file_basename_template(basename:)
      "#{basename}--page-%d.#{output_extension}"
    end
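To show why the “%d” placeholder matters, the sketch below fills the template with Ruby’s `Kernel#format`, which is used here only to imitate printf-style substitution; it is not how the gem invokes Ghostscript. The method name and the explicit `output_extension:` argument are hypothetical (in the class, `output_extension` is an inherited attribute), and the page numbers are chosen for illustration.

```ruby
# Hypothetical standalone variant; output_extension is passed in explicitly.
def image_file_basename_template_sketch(basename:, output_extension:)
  "#{basename}--page-%d.#{output_extension}"
end

template = image_file_basename_template_sketch(basename: "123.ARCHIVAL", output_extension: "tiff")
puts template
# prints 123.ARCHIVAL--page-%d.tiff

# Imitate the page-number substitution for three illustrative pages.
page_names = (1..3).map { |page_number| format(template, page_number) }
puts page_names.inspect
# prints ["123.ARCHIVAL--page-1.tiff", "123.ARCHIVAL--page-2.tiff", "123.ARCHIVAL--page-3.tiff"]

# The generated names are exactly what a "--page-\d+" tail regexp looks for.
tail_regexp = %r{123.ARCHIVAL--page-\d+\.tiff$}
puts page_names.all? { |name| name.match?(tail_regexp) }
# prints true
```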
#with_each_requisite_location_and_tmp_file_path(splitter: Services::PdfSplitter) {|image_location, image_path| ... } ⇒ Object
Take the given PDF(s) and split each into one image per page. Remember that the URL should account for the page number.
When we have two PDFs (10 pages and 20 pages respectively), we will have 30 requisite files; the files must have URLs that associate with their respective parent PDFs.
This function makes a concession: if it finds any #existing_page_locations, it treats that result as the complete set of pages. We could make this smarter, but at the moment we’re deferring on that.
    # File 'lib/derivative_rodeo/generators/pdf_split_generator.rb', line 104
    def with_each_requisite_location_and_tmp_file_path(splitter: Services::PdfSplitter)
      input_files.each do |input_location|
        input_location.with_existing_tmp_path do |input_tmp_file_path|
          existing_locations = existing_page_locations(input_location: input_location)

          if existing_locations.count.positive?
            logger.info("#{self.class}##{__method__} found #{existing_locations.count} file(s) at existing split location for #{input_location.file_uri.inspect}.")
            existing_locations.each_with_index do |location, index|
              logger.info("#{self.class}##{__method__} found ##{index} split file #{location.file_path.inspect} for #{input_location.file_uri.inspect}.")
              yield(location, location.file_path)
            end
          else
            logger.info("#{self.class}##{__method__} did not find at existing location split files for #{input_location.file_uri.inspect}. Proceeding with #{splitter}.call")
            # We're going to need to create the files and "cast" them to locations.
            splitter.call(
              input_tmp_file_path,
              image_extension: output_extension,
              image_file_basename_template: image_file_basename_template(basename: input_location.file_basename)
            ).each_with_index do |image_path, index|
              logger.info("#{self.class}##{__method__} generated (via #{splitter}.call) ##{index} split file #{image_path.inspect} for #{input_location.file_uri.inspect}.")
              image_location = StorageLocations::FileLocation.new("file://#{image_path}")
              yield(image_location, image_path)
            end
          end
        end
      end
    end
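The control flow, including the concession about existing pages, can be sketched with stand-ins. Everything named here is hypothetical: `each_page_path_sketch` is a simplified model rather than the gem’s method, the splitter is a lambda that fabricates page paths instead of calling Ghostscript, and the page counts reuse the 10-page and 20-page example from the description.

```ruby
# Simplified model of the loop above: prefer existing pages, else call the splitter.
def each_page_path_sketch(pdf_paths:, existing_pages_for:, splitter:)
  pdf_paths.each do |pdf_path|
    existing = existing_pages_for.call(pdf_path)
    pages = existing.any? ? existing : splitter.call(pdf_path)
    pages.each { |page_path| yield(pdf_path, page_path) }
  end
end

# Made-up page counts for two PDFs.
PAGE_COUNTS = { "/tmp/a.pdf" => 10, "/tmp/b.pdf" => 20 }.freeze

# Fake splitter: fabricates one path per page; no files are created.
fake_splitter = lambda do |pdf_path|
  base = File.basename(pdf_path, ".pdf")
  (1..PAGE_COUNTS.fetch(pdf_path)).map { |n| "/tmp/#{base}--page-#{n}.tiff" }
end

# Case 1: nothing split yet, so every page comes from the splitter.
fresh = 0
each_page_path_sketch(pdf_paths: PAGE_COUNTS.keys, existing_pages_for: ->(_pdf) { [] }, splitter: fake_splitter) { fresh += 1 }
puts fresh # prints 30

# Case 2: a single existing page for a.pdf is taken as that PDF's full set.
one_existing = ->(pdf) { pdf == "/tmp/a.pdf" ? ["/out/a--page-1.tiff"] : [] }
partial = 0
each_page_path_sketch(pdf_paths: PAGE_COUNTS.keys, existing_pages_for: one_existing, splitter: fake_splitter) { partial += 1 }
puts partial # prints 21
```

The second case shows the trade-off named in the description: a partially populated split location short-circuits the splitter, so the remaining pages of that PDF are not regenerated.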