Class: DerivativeRodeo::Generators::HocrGenerator

Inherits:
BaseGenerator
  • Object
Defined in:
lib/derivative_rodeo/generators/hocr_generator.rb

Overview

Responsible for finding or creating a hocr file (or the output for the configured :output_suffix) using tesseract. If a monochrome derivative of the input is not found, the generator creates and stores one first.

From `tesseract -h`:

Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]
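
The last usage line maps onto this generator's attributes: the options slot is filled from additional_tessearct_options and the configfile slot from output_suffix. The following is a minimal sketch of that mapping in plain Ruby. The helper name `tesseract_command` and the file names are hypothetical and are not part of DerivativeRodeo; `-l` and `--dpi` are ordinary tesseract options used only for illustration.

```ruby
# Hypothetical helper; it mirrors the shape of the usage line above and is
# not part of DerivativeRodeo.
def tesseract_command(imagename:, outputbase:, options: nil, configfile: :hocr)
  parts = ["tesseract", imagename, outputbase]
  # Options are skipped when nil or blank, as the generator does by default.
  parts << options unless options.nil? || options.strip.empty?
  # The configfile (here "hocr") goes last and selects the output format.
  parts << configfile.to_s
  parts.join(" ")
end

default_cmd = tesseract_command(imagename: "page.mono.tiff", outputbase: "page")
custom_cmd = tesseract_command(
  imagename: "page.mono.tiff",
  outputbase: "page",
  options: "-l eng --dpi 300"
)

puts default_cmd
puts custom_cmd
```

Because tesseract appends the extension itself, the outputbase is given without one.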

Defined Under Namespace

Modules: RequiresExistingFile

Class Attributes

Attributes inherited from BaseGenerator

#input_uris, #output_extension, #output_location_template, #preprocessed_location_template

Instance Method Summary

Methods inherited from BaseGenerator

#derive_preprocessed_template_from, #destination, #generated_files, #generated_uris, #initialize, #input_files, #run, #valid_instantiation?

Constructor Details

This class inherits a constructor from DerivativeRodeo::Generators::BaseGenerator

Instance Attribute Details

#additional_tessearct_options ⇒ Object

Additional options to send to the tesseract command; default `nil`.



# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 33

class_attribute :additional_tessearct_options, default: nil

#command_environment_variables ⇒ Object

Environment variables for the tesseract command; default `"OMP_THREAD_LIMIT=1"`. Should be a space-separated string of KEY=value pairs.

Examples:

# this works for space_stone aws lambda
DerivativeRodeo::Generators::HocrGenerator.command_environment_variables =
  'OMP_THREAD_LIMIT=1 TESSDATA_PREFIX=/opt/share/tessdata LD_LIBRARY_PATH=/opt/lib PATH=/opt/bin:$PATH'


# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 28

class_attribute :command_environment_variables, default: "OMP_THREAD_LIMIT=1"
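
The attribute is a plain string that gets prepended to the tesseract command. As a rough sketch of that behavior in plain Ruby (the helper name `with_env_prefix` is hypothetical, and the nil/blank check stands in for ActiveSupport's `present?`, which the generator itself uses):

```ruby
# Hypothetical helper, not part of DerivativeRodeo.
def with_env_prefix(env, command)
  # A nil or blank setting leaves the command untouched.
  return command if env.nil? || env.strip.empty?

  "#{env} #{command}"
end

prefixed = with_env_prefix("OMP_THREAD_LIMIT=1", "tesseract in.tiff out hocr")
unprefixed = with_env_prefix(nil, "tesseract in.tiff out hocr")

puts prefixed
puts unprefixed
```

Setting the attribute to `nil` therefore removes the default `OMP_THREAD_LIMIT=1` prefix as well.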

#output_suffix ⇒ Object

The trailing configfile argument passed to the tesseract command, which selects the output format; default `:hocr`.



# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 37

class_attribute :output_suffix, default: :hocr

Instance Method Details

#build_step(output_location:, input_tmp_file_path:) ⇒ StorageLocations::BaseLocation

Run tesseract on the monochrome file and store the resulting output with the configured output_extension (default ‘hocr’).

Parameters:

Returns:

See Also:

  • #requisite_files


# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 52

def build_step(output_location:, input_tmp_file_path:, **)
  tesseractify(input_tmp_file_path, output_location)
end

#run_tesseract(in_path, out_path) ⇒ Object

Parameters:

  • in_path (String)

    the source of the file

  • out_path (String)


# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 95

def run_tesseract(in_path, out_path)
  # we pull the extension off the output path, because tesseract will add it back
  cmd = ""
  cmd += command_environment_variables + " " if command_environment_variables.present?
  # TODO: The line of code could mean we had a file with multiple periods and we'd just
  # replace the first one.  Should we instead prefer the following:
  #
  # `out_path.split(".")[0..-2].join('.') + ".#{output_extension}"`
  output_to_path = out_path.sub('.' + output_extension, '')
  cmd += "tesseract #{in_path} #{output_to_path}"
  cmd += " #{additional_tessearct_options}" if additional_tessearct_options.present?
  cmd += " #{output_suffix}"

  # TODO: capture output in case of exceptions; perhaps delegate that to the #run method.
  run(cmd)
end
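
The TODO in the source above concerns paths in which the extension text appears more than once. The sketch below compares three ways of stripping the extension in plain Ruby; the sample path is made up for illustration. The split/join variant here omits the trailing `+ ".#{output_extension}"` from the TODO, on the assumption that the goal is the extension-less base that tesseract expects. `String#delete_suffix` (Ruby 2.5+) is shown as one further option and is not what the library uses.

```ruby
out_path = "/tmp/scan.hocr.page-1.hocr"
extension = "hocr"

# Current approach: String#sub removes only the FIRST match, which here is
# in the middle of the path rather than at the end.
first_match = out_path.sub("." + extension, "")

# The TODO's alternative: drop whatever follows the last period.
split_join = out_path.split(".")[0..-2].join(".")

# Anchored alternative: remove the extension only when it ends the string.
anchored = out_path.delete_suffix("." + extension)

puts first_match # /tmp/scan.page-1.hocr
puts split_join  # /tmp/scan.hocr.page-1
puts anchored    # /tmp/scan.hocr.page-1

# split/join removes the last segment even when it is not the extension.
no_extension = "/tmp/v1.2/scan"
puts no_extension.split(".")[0..-2].join(".") # /tmp/v1
puts no_extension.delete_suffix("." + extension) # /tmp/v1.2/scan
```

For the common case of a single trailing `.hocr`, all three give the same result; they differ only on paths like the ones above.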

#tesseractify(input_tmp_file_path, output_location) ⇒ Object

This method is part of a private API. You should avoid using this method if possible, as it may be removed or be changed in the future.

Call `tesseract` on the monochrome file and store the resulting hocr in the tmp_path.

Parameters:



# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 86

def tesseractify(input_tmp_file_path, output_location)
  output_location.with_new_tmp_path do |out_tmp_path|
    run_tesseract(input_tmp_file_path, out_tmp_path)
  end
end

#with_each_requisite_location_and_tmp_file_path(builder: MonochromeGenerator) {|file, tmp_path| ... } ⇒ Object

When generating a hocr file from an image, we’ve found the best results are when we’re processing a monochrome image. As such, this generator will auto-convert a given image to monochrome.

Parameters:

Yield Parameters:

See Also:



# File 'lib/derivative_rodeo/generators/hocr_generator.rb', line 67

def with_each_requisite_location_and_tmp_file_path(builder: MonochromeGenerator)
  mono_location_template = Services::ConvertUriViaTemplateService.coerce_pre_requisite_template_from(template: output_location_template)

  requisite_files ||= builder.new(input_uris: input_uris, output_location_template: mono_location_template).generated_files
  requisite_files.each do |input_location|
    input_location.with_existing_tmp_path do |tmp_file_path|
      yield(input_location, tmp_file_path)
    end
  end
end
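
The method above nests two block-yielding calls: it iterates over the requisite locations, and for each one it opens a temporary local path before yielding both to the caller. The following sketch reproduces that control flow with stand-ins. `FakeLocation` and `with_each_location_and_tmp_path` are hypothetical names, and the real storage location classes have a richer interface than shown here.

```ruby
require "tempfile"

# Hypothetical stand-in for a storage location; not a DerivativeRodeo class.
class FakeLocation
  attr_reader :name

  def initialize(name)
    @name = name
  end

  # Yields the path of a temporary file that exists only during the block.
  def with_existing_tmp_path
    Tempfile.create(["mono", File.extname(name)]) do |file|
      yield file.path
    end
  end
end

# Same shape as the generator method: iterate, open a tmp path, yield both.
def with_each_location_and_tmp_path(locations)
  locations.each do |location|
    location.with_existing_tmp_path do |tmp_path|
      yield(location, tmp_path)
    end
  end
end

seen = []
locations = [FakeLocation.new("a.mono.tiff"), FakeLocation.new("b.mono.tiff")]
with_each_location_and_tmp_path(locations) do |location, tmp_path|
  # Inside the block the temporary file is guaranteed to exist.
  seen << [location.name, File.exist?(tmp_path), File.extname(tmp_path)]
end

seen.each { |entry| puts entry.inspect }
```

In this sketch the temporary file is removed once the inner block returns, so any work on it has to happen inside the block.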