Class: OCRSDK::PDF

Inherits:
Image
Defined in:
lib/ocrsdk/pdf.rb

Constant Summary

Constants included from Verifiers::Profile

Verifiers::Profile::PROFILES

Constants included from Verifiers::Format

Verifiers::Format::INPUT_FORMATS, Verifiers::Format::OUTPUT_FORMATS

Constants included from Verifiers::Language

Verifiers::Language::LANGUAGES

Instance Method Summary

Methods inherited from Image

#as_pdf, #as_pdf_sync, #as_text, #as_text_sync, #initialize

Methods included from Verifiers::Profile

#profile_to_s, #supported_profile?

Methods included from Verifiers::Format

#format_to_s, #supported_input_format?, #supported_output_format?

Methods included from Verifiers::Language

#language_to_s, #language_to_sym, #languages_to_s, #supported_language?

Methods inherited from AbstractEntity

#initialize

Constructor Details

This class inherits a constructor from OCRSDK::Image

Instance Method Details

#recognizeable? ⇒ Boolean

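The criteria for deciding which PDFs need recognition are not firmly established. The current heuristic is that if

images * 20 > length of text

then the document might need recognition. The assumption is that a scanned document may still carry a small amount of real text, such as a title, page numbers, or credits, alongside its images. The method also returns true when the extracted text contains fewer than 10 distinct characters, which covers documents that are nominally searchable but were recognized incorrectly. If the PDF is malformed or uses an unsupported feature, it returns false.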
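The heuristic can be tried on plain numbers without opening a PDF. The following is a minimal sketch, not part of OCRSDK: the helper name `needs_recognition?` and its keyword arguments are hypothetical, and the thresholds (20 and 10) are copied from the method source shown on this page.

```ruby
# Hypothetical helper that mirrors the decision rule of #recognizeable?.
# images:         number of image XObjects found in the document
# text_length:    total length of the extracted text
# distinct_chars: number of distinct characters in the extracted text
def needs_recognition?(images:, text_length:, distinct_chars:)
  images * 20 > text_length || distinct_chars < 10
end

# Three scanned pages with only page numbers as text: 3 * 20 > 40.
scanned = needs_recognition?(images: 3, text_length: 40, distinct_chars: 25)

# A text-heavy report with one logo: 1 * 20 is far below 5000.
report = needs_recognition?(images: 1, text_length: 5000, distinct_chars: 60)

# No images, but the text layer consists of only 4 distinct characters.
garbled = needs_recognition?(images: 0, text_length: 300, distinct_chars: 4)
```

Under these assumptions, `scanned` and `garbled` are true and `report` is false. The factor of 20 effectively allows about 20 characters of incidental text per image before the document is treated as already searchable.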

Returns:

  • (Boolean)


# File 'lib/ocrsdk/pdf.rb', line 9

def recognizeable?
  reader = PDF::Reader.new @image_path

  images = 0
  text   = 0
  chars  = Set.new
  reader.pages.each do |page|
    text   += page.text.length
    chars  += page.text.split('').map(&:ord).uniq
    images += page.xobjects.map {|k, v| v.hash[:Subtype]}.count(:Image)
  end

  # Also flag documents whose text has very few distinct characters:
  # they may be "searchable" but incorrectly recognized.
  images * 20 > text || chars.length < 10
rescue PDF::Reader::MalformedPDFError, PDF::Reader::UnsupportedFeatureError
  false
end