Class: OCRSDK::PDF
- Inherits: Image
- Inheritance chain: Object, AbstractEntity, Image, OCRSDK::PDF
- Defined in: lib/ocrsdk/pdf.rb
Constant Summary
Constants included from Verifiers::Profile
Constants included from Verifiers::Format
Verifiers::Format::INPUT_FORMATS, Verifiers::Format::OUTPUT_FORMATS
Constants included from Verifiers::Language
Verifiers::Language::LANGUAGES
Instance Method Summary
- #recognizeable? ⇒ Boolean
We’re on shaky ground regarding which PDFs should be recognized and which shouldn’t.
Methods inherited from Image
#as_pdf, #as_pdf_sync, #as_text, #as_text_sync, #initialize
Methods included from Verifiers::Profile
#profile_to_s, #supported_profile?
Methods included from Verifiers::Format
#format_to_s, #supported_input_format?, #supported_output_format?
Methods included from Verifiers::Language
#language_to_s, #language_to_sym, #languages_to_s, #supported_language?
Methods inherited from AbstractEntity
Constructor Details
This class inherits a constructor from OCRSDK::Image
Instance Method Details
#recognizeable? ⇒ Boolean
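We’re on shaky ground regarding which PDFs should be recognized and which shouldn’t. Currently, a document is flagged as possibly needing recognition if
images * 20 > length of text
where images is the number of image XObjects and length of text is the character count, both summed over all pages. The assumption is that a scanned document may still carry a title, page numbers, or credits as text alongside its images. The method also returns true when the text contains fewer than 10 distinct characters, which catches "searchable" documents whose text layer was recognized incorrectly. Malformed PDFs and PDFs that use unsupported features return false.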
# File 'lib/ocrsdk/pdf.rb', line 9

def recognizeable?
  reader = PDF::Reader.new @image_path
  images = 0
  text = 0
  chars = Set.new
  reader.pages.each do |page|
    text += page.text.length
    chars += page.text.split('').map(&:ord).uniq
    images += page.xobjects.map {|k, v| v.hash[:Subtype]}.count(:Image)
  end

  # count number of distinct characters
  # in case of "searchable", but incorrectly recognized document
  images * 20 > text || chars.length < 10
rescue PDF::Reader::MalformedPDFError, PDF::Reader::UnsupportedFeatureError
  false
end
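To see how the decision rule behaves on concrete numbers, the following is a minimal sketch that applies the same two checks to plain per-page data instead of a PDF::Reader object. The helper name needs_recognition?, its keyword arguments, and the page hash layout are hypothetical and are not part of OCRSDK; only the thresholds (20 and 10) come from the method above.

```ruby
require 'set'

# Hypothetical helper, not part of OCRSDK: mirrors the rule used by
# #recognizeable? but takes an array of hashes such as
# { text: "...", images: 2 } instead of reading a PDF.
def needs_recognition?(pages, image_weight: 20, min_distinct_chars: 10)
  images = 0
  text   = 0
  chars  = Set.new

  pages.each do |page|
    text   += page[:text].length
    chars.merge(page[:text].chars.map(&:ord)) # collect distinct code points
    images += page[:images]
  end

  # Either the images outweigh the text, or the text layer has too few
  # distinct characters to be trusted.
  images * image_weight > text || chars.length < min_distinct_chars
end

# One image and only a page label as text: 1 * 20 > 6.
scanned = [{ text: "Page 1", images: 1 }]

# One image but 900 characters of varied text: neither check fires.
born_digital = [{ text: "The quick brown fox jumps over the lazy dog. " * 20, images: 1 }]

# Plenty of text, but only one distinct character: second check fires.
garbled = [{ text: "l" * 500, images: 1 }]

puts needs_recognition?(scanned)      # => true
puts needs_recognition?(born_digital) # => false
puts needs_recognition?(garbled)      # => true
```

Note that with these rules an input with no pages at all is also reported as needing recognition, because zero distinct characters is below the threshold of 10.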