Class: OCRSDK::PDF
- Inherits: Image
- Inheritance chain: Object > AbstractEntity > Image > OCRSDK::PDF
- Defined in: lib/ocrsdk/pdf.rb
Constant Summary
Constants included from Verifiers::Profile
Constants included from Verifiers::Format
Verifiers::Format::INPUT_FORMATS, Verifiers::Format::OUTPUT_FORMATS
Constants included from Verifiers::Language
Verifiers::Language::LANGUAGES
Instance Method Summary
#recognizeable? ⇒ Boolean
We’re on shaky ground regarding which kinds of PDFs should be recognized and which shouldn’t.
Methods inherited from Image
#as_pdf, #as_pdf_sync, #as_text, #as_text_sync, #initialize
Methods included from Verifiers::Profile
#profile_to_s, #supported_profile?
Methods included from Verifiers::Format
#format_to_s, #supported_input_format?, #supported_output_format?
Methods included from Verifiers::Language
#language_to_s, #language_to_sym, #languages_to_s, #supported_language?
Methods inherited from AbstractEntity
Constructor Details
This class inherits a constructor from OCRSDK::Image
Instance Method Details
#recognizeable? ⇒ Boolean
We’re on shaky ground regarding which kinds of PDFs should be recognized and which shouldn’t. Currently, if
images * 20 > length of text
then the document might need recognition.
The assumption is that there might be a title, page numbers, or credits along with the images.
To allow for a title page, the first page is skipped when the document has more than one page, which should not affect documents that do not need recognition.
A document whose text contains fewer than 10 distinct characters is also flagged, to catch "searchable" PDFs whose text layer was recognized incorrectly.
# File 'lib/ocrsdk/pdf.rb', line 15

def recognizeable?
  reader = PDF::Reader.new @image_path
  images = 0
  text = 0
  chars = Set.new
  start = reader.pages.length > 1 ? 1 : 0
  reader.pages[start..-1].each do |page|
    text += page.text.length
    chars += page.text.split('').map(&:ord).uniq
    images += page.xobjects.map {|k, v| v.hash[:Subtype]}.count(:Image)
  end

  # count number of distinct characters
  # in case of "searchable", but incorrectly recognized document
  images * 20 > text || chars.length < 10
rescue PDF::Reader::MalformedPDFError, PDF::Reader::UnsupportedFeatureError
  false
end
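The heuristic above can be tried out without a real PDF. The following is a minimal sketch, not part of OCRSDK: `PageStats`, `needs_recognition?`, and the keyword arguments `image_weight` and `min_distinct_chars` are hypothetical names introduced here for illustration, and each page is assumed to be reducible to its extracted text plus a count of embedded images.

```ruby
require 'set'

# Hypothetical stand-in for a PDF page: the extracted text and the
# number of image objects found on that page.
PageStats = Struct.new(:text, :image_count)

# Applies the same decision rule as #recognizeable? to plain data.
def needs_recognition?(pages, image_weight: 20, min_distinct_chars: 10)
  # Skip the first page (a likely title page) when there is more than one.
  body = pages.length > 1 ? pages[1..-1] : pages

  text_length = body.sum { |page| page.text.length }
  images      = body.sum(&:image_count)
  # Collect the distinct characters seen across all counted pages.
  chars       = body.each_with_object(Set.new) { |page, set| set.merge(page.text.chars) }

  images * image_weight > text_length || chars.length < min_distinct_chars
end

scanned = [PageStats.new("Title", 0), PageStats.new("Page 2", 1)]
textual = [PageStats.new("Title", 0),
           PageStats.new("The quick brown fox jumps over the lazy dog. " * 3, 1)]

puts needs_recognition?(scanned)
puts needs_recognition?(textual)
```

In this sketch the `scanned` document has one image and only six characters of text after the title page is skipped, so the image term dominates. The `textual` document has far more text than `20 * images` and plenty of distinct characters, so neither condition fires.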