Class: DerivativeRodeo::Generators::HocrGenerator
- Inherits: BaseGenerator
- Hierarchy: Object, BaseGenerator, DerivativeRodeo::Generators::HocrGenerator
- Defined in:
- lib/derivative_rodeo/generators/hocr_generator.rb
Overview
Responsible for finding or creating a hocr file (or configured :output_suffix) using tesseract. Will create and store a monochrome derivative if one is not found.
From `tesseract -h`:
Usage:
tesseract --help | --help-extra | --version
tesseract --list-langs
tesseract imagename outputbase [options...] [configfile...]
Defined Under Namespace
Modules: RequiresExistingFile
Class Attributes
-
#additional_tessearct_options ⇒ Object
Additional options to send to the tesseract command; default `nil`.
-
#command_environment_variables ⇒ Object
Environment variables for the tesseract command, as a space-separated string of KEY=value pairs; default `"OMP_THREAD_LIMIT=1"`.
-
#output_suffix ⇒ Object
The tesseract command's output base; default `:hocr`.
Attributes inherited from BaseGenerator
#input_uris, #output_extension, #output_location_template, #preprocessed_location_template
Instance Method Summary
-
#build_step(output_location:, input_tmp_file_path:) ⇒ StorageLocations::BaseLocation
Run tesseract on the monochrome file and store the resulting output in the configured output_extension (default `hocr`).
- #run_tesseract(in_path, out_path) ⇒ Object
-
#tesseractify(input_tmp_file_path, output_location) ⇒ Object
private
Call `tesseract` on the monochrome file and store the resulting hocr in the tmp_path.
-
#with_each_requisite_location_and_tmp_file_path(builder: MonochromeGenerator) {|file, tmp_path| ... } ⇒ Object
When generating a hocr file from an image, we’ve found the best results are when we’re processing a monochrome image.
Methods inherited from BaseGenerator
#derive_preprocessed_template_from, #destination, #generated_files, #generated_uris, #initialize, #input_files, #run, #valid_instantiation?
Constructor Details
This class inherits a constructor from DerivativeRodeo::Generators::BaseGenerator
Instance Attribute Details
#additional_tessearct_options ⇒ Object
Additional options to send to the tesseract command; default `nil`.
# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 33
class_attribute :additional_tessearct_options, default: nil
#command_environment_variables ⇒ Object
Environment variables for the tesseract command; default `"OMP_THREAD_LIMIT=1"`. Should be a space-separated string of KEY=value pairs.
# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 28
class_attribute :command_environment_variables, default: "OMP_THREAD_LIMIT=1"
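The space-separated KEY=value format can be illustrated with a small sketch. The helper name `parse_env_pairs` and the `TESSDATA_PREFIX` value are assumptions made for illustration and are not part of DerivativeRodeo; the generator itself simply prepends the string to the shell command.

```ruby
# Hypothetical helper (not part of DerivativeRodeo): turn a space-separated
# "KEY=value KEY2=value2" string into a Hash. Assumes values contain no spaces
# and every pair contains an "=".
def parse_env_pairs(string)
  string.to_s.split(" ").to_h { |pair| pair.split("=", 2) }
end

# TESSDATA_PREFIX and its path are example values only.
env_string = "OMP_THREAD_LIMIT=1 TESSDATA_PREFIX=/usr/share/tessdata"
env = parse_env_pairs(env_string)

# Prefix form: the variables precede the command on the same shell line.
prefixed = "#{env_string} tesseract in.tiff out hocr"
```

A Hash like `env` could be passed as the first argument to `Kernel#system` instead of being prefixed to the command string, which avoids relying on shell parsing of the assignments.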
#output_suffix ⇒ Object
The tesseract command's output base; default `:hocr`.
# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 37
class_attribute :output_suffix, default: :hocr
Instance Method Details
#build_step(output_location:, input_tmp_file_path:) ⇒ StorageLocations::BaseLocation
Run tesseract on the monochrome file and store the resulting output in the configured output_extension (default `hocr`).
# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 52
def build_step(output_location:, input_tmp_file_path:, **)
  tesseractify(input_tmp_file_path, output_location)
end
#run_tesseract(in_path, out_path) ⇒ Object
# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 95
def run_tesseract(in_path, out_path)
  # we pull the extension off the output path, because tesseract will add it back
  cmd = ""
  cmd += command_environment_variables + " " if command_environment_variables.present?
  # TODO: The line of code could mean we had a file with multiple periods and we'd just
  # replace the first one. Should we instead prefer the following:
  #
  # `out_path.split(".")[0..-2].join('.') + ".#{output_extension}"`
  output_to_path = out_path.sub('.' + output_extension, '')
  cmd += "tesseract #{in_path} #{output_to_path}"
  cmd += " #{additional_tessearct_options}" if additional_tessearct_options.present?
  cmd += " #{output_suffix}"
  # TODO: capture output in case of exceptions; perhaps delegate that to the #run method.
  run(cmd)
end
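The command assembly above, and the multiple-period concern raised in its TODO, can be sketched in isolation. `build_tesseract_command` is a hypothetical helper written for this illustration, not a DerivativeRodeo method, and it uses plain `empty?` checks as an approximation of ActiveSupport's `present?`.

```ruby
# Hypothetical stand-in for the string building in #run_tesseract.
# Not part of DerivativeRodeo; parameter names are illustrative.
def build_tesseract_command(in_path:, out_path:, output_extension: "hocr",
                            output_suffix: :hocr, env: "OMP_THREAD_LIMIT=1",
                            additional_options: nil)
  # tesseract appends the extension itself, so strip it from the output base.
  output_base = out_path.sub(".#{output_extension}", "")

  parts = []
  parts << env unless env.to_s.empty?
  parts << "tesseract #{in_path} #{output_base}"
  parts << additional_options unless additional_options.to_s.empty?
  parts << output_suffix.to_s
  parts.join(" ")
end

cmd = build_tesseract_command(in_path: "/tmp/page-mono.tiff", out_path: "/tmp/page.hocr")

# String#sub removes the first match, which is not necessarily the trailing extension.
tricky = "/tmp/scan.hocr.d/page.hocr"
first_match_stripped = tricky.sub(".hocr", "")    # strips the directory's ".hocr"
suffix_stripped = tricky.delete_suffix(".hocr")   # strips only the trailing ".hocr"
```

`String#delete_suffix` (Ruby 2.5 and later) is one possible way to address the TODO, since it only touches the end of the string; whether it suits the library is a decision for its maintainers.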
#tesseractify(input_tmp_file_path, output_location) ⇒ Object
This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.
Call `tesseract` on the monochrome file and store the resulting hocr in the tmp_path.
# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 86
def tesseractify(input_tmp_file_path, output_location)
  output_location.with_new_tmp_path do |out_tmp_path|
    run_tesseract(input_tmp_file_path, out_tmp_path)
  end
end
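The block-based temporary path pattern used here can be sketched with a stand-in class. `FakeLocation` is hypothetical, and its behavior (yield a temporary path, then capture whatever the block wrote there) is an assumption about how such a location might work; the real StorageLocations::BaseLocation may differ.

```ruby
require "tmpdir"

# Hypothetical stand-in for a storage location; not the DerivativeRodeo class.
class FakeLocation
  attr_reader :written_content

  def initialize(file_name)
    @file_name = file_name
  end

  # Yields a path inside a fresh temporary directory, then records what the
  # block wrote there before the directory is cleaned up.
  def with_new_tmp_path
    Dir.mktmpdir do |dir|
      tmp_path = File.join(dir, @file_name)
      yield(tmp_path)
      @written_content = File.exist?(tmp_path) ? File.read(tmp_path) : nil
    end
  end
end

location = FakeLocation.new("page.hocr")
location.with_new_tmp_path do |out_tmp_path|
  # A real generator would shell out to tesseract here; this writes a placeholder.
  File.write(out_tmp_path, "<html>placeholder hOCR</html>")
end
```

The appeal of this shape is that the caller never manages the temporary file's lifetime: the location decides where the path lives and what happens to the result once the block returns.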
#with_each_requisite_location_and_tmp_file_path(builder: MonochromeGenerator) {|file, tmp_path| ... } ⇒ Object
When generating a hocr file from an image, we’ve found the best results are when we’re processing a monochrome image. As such, this generator will auto-convert a given image to monochrome.
# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 67
def with_each_requisite_location_and_tmp_file_path(builder: MonochromeGenerator)
  mono_location_template = Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from(template: output_location_template)
  requisite_files ||= builder.new(input_uris: input_uris, output_location_template: mono_location_template).generated_files
  requisite_files.each do |input_location|
    input_location.with_existing_tmp_path do |tmp_file_path|
      yield(input_location, tmp_file_path)
    end
  end
end
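The control flow of this method (build the prerequisite files, then yield each one with a local path) can be sketched with stand-ins. Every class and name below is hypothetical, including the `.mono.tiff` naming and the `/tmp` paths; none of it reflects the real MonochromeGenerator or storage location behavior.

```ruby
# Hypothetical stand-in for a storage location.
StubLocation = Struct.new(:uri) do
  # Pretend the file has been copied locally and yield its temporary path.
  def with_existing_tmp_path
    yield("/tmp/#{File.basename(uri)}")
  end
end

# Hypothetical stand-in for the prerequisite builder.
class StubMonochromeBuilder
  def initialize(input_uris:)
    @input_uris = input_uris
  end

  # One "monochrome" location per input; the naming scheme is made up.
  def generated_files
    @input_uris.map { |uri| StubLocation.new(uri.sub(/\.\w+\z/, ".mono.tiff")) }
  end
end

# Mirrors the shape of the documented method: the builder is injectable,
# and the caller's block receives each location with its local path.
def each_requisite(input_uris:, builder: StubMonochromeBuilder)
  builder.new(input_uris: input_uris).generated_files.each do |location|
    location.with_existing_tmp_path { |tmp_path| yield(location, tmp_path) }
  end
end

seen = []
each_requisite(input_uris: ["file:///data/page-1.png"]) do |location, tmp_path|
  seen << [location.uri, tmp_path]
end
```

Because the builder is a keyword argument, a caller could swap in a different prerequisite generator without changing the iteration logic.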