Module: Docsplit

Extended by:
TransparentPDFs
Defined in:
lib/docsplit.rb,
lib/docsplit/command_line.rb,
lib/docsplit/text_cleaner.rb,
lib/docsplit/info_extractor.rb,
lib/docsplit/page_extractor.rb,
lib/docsplit/text_extractor.rb,
lib/docsplit/image_extractor.rb,
lib/docsplit/transparent_pdfs.rb

Overview

The Docsplit module delegates to the Java PDF extractors.

Defined Under Namespace

Modules: TransparentPDFs
Classes: CommandLine, ExtractionFailed, ImageExtractor, InfoExtractor, PageExtractor, TextCleaner, TextExtractor

Constant Summary

VERSION =
# Keep in sync with gemspec.
'0.6.3'
ROOT =
File.expand_path(File.dirname(__FILE__) + '/..')
CLASSPATH =
"#{ROOT}/build#{File::PATH_SEPARATOR}#{ROOT}/vendor/'*'"
LOGGING =
"-Djava.util.logging.config.file=#{ROOT}/vendor/logging.properties"
HEADLESS =
"-Djava.awt.headless=true"
OFFICE =
RUBY_PLATFORM.match(/darwin/i) ? '' : "-Doffice.home=#{office}"
METADATA_KEYS =
[:author, :date, :creator, :keywords, :producer, :subject, :title, :length]
GM_FORMATS =
["image/gif", "image/jpeg", "image/png", "image/x-ms-bmp", "image/svg+xml", "image/tiff", "image/x-portable-bitmap", "application/postscript", "image/x-portable-pixmap"]
DEPENDENCIES =
{:java => false, :gm => false, :pdftotext => false, :pdftk => false, :tesseract => false}
ESCAPE =
lambda {|x| Shellwords.shellescape(x) }
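
ESCAPE wraps Shellwords.shellescape so that file names can be interpolated into shell commands without being split or interpreted by the shell. The following is a minimal sketch of its effect using only the Ruby standard library; the file names are made up for illustration and are not part of Docsplit.

```ruby
require 'shellwords'

# Same definition as the ESCAPE constant listed above.
ESCAPE = lambda { |x| Shellwords.shellescape(x) }

# Hypothetical file names containing characters the shell treats specially.
names = ['report.pdf', 'annual report.pdf', 'a;b.pdf']

# A lambda responds to to_proc, so it can be passed to map with &.
escaped = names.map(&ESCAPE)

# Print each original name next to its escaped form.
names.zip(escaped).each do |original, safe|
  puts "#{original.inspect} -> #{safe}"
end
```

Passing the lambda with `&ESCAPE` is the same pattern extract_pdf uses to escape several values in one step.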

Class Method Summary

Methods included from TransparentPDFs

ensure_pdfs

Class Method Details

.clean_text(text) ⇒ Object

Utility method to strip garbage characters from OCR’d text.



# File 'lib/docsplit.rb', line 92

def self.clean_text(text)
  TextCleaner.new.clean(text)
end

.extract_images(pdfs, opts = {}) ⇒ Object

Use the ExtractImages Java class to rasterize each page of a PDF into its own image.



# File 'lib/docsplit.rb', line 55

def self.extract_images(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
  ImageExtractor.new.extract(pdfs, opts)
end

.extract_pages(pdfs, opts = {}) ⇒ Object

Use the ExtractPages Java class to burst a PDF into single pages.



# File 'lib/docsplit.rb', line 43

def self.extract_pages(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  PageExtractor.new.extract(pdfs, opts)
end

.extract_pdf(docs, opts = {}) ⇒ Object

Use JODConverter to convert the documents to PDFs. If a document is in an image format, use GraphicsMagick to convert it to a PDF instead.



# File 'lib/docsplit.rb', line 63

def self.extract_pdf(docs, opts={})
  out = opts[:output] || '.'
  FileUtils.mkdir_p out unless File.exist?(out)
  [docs].flatten.each do |doc|
    ext = File.extname(doc)
    basename = File.basename(doc, ext)
    escaped_doc, escaped_out, escaped_basename = [doc, out, basename].map(&ESCAPE)

    if GM_FORMATS.include?(`file -b --mime #{ESCAPE[doc]}`.strip.split(/[:;]\s+/)[0])
      `gm convert #{escaped_doc} #{escaped_out}/#{escaped_basename}.pdf`
    else
      options = "-jar #{ROOT}/vendor/jodconverter/jodconverter-core-3.0-beta-4.jar -r #{ROOT}/vendor/conf/document-formats.js"
      run "#{options} #{escaped_doc} #{escaped_out}/#{escaped_basename}.pdf", [], {}
    end
  end
end
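
Two steps in extract_pdf can be tried in isolation: picking the MIME type out of `file -b --mime` output, and building the shell-escaped output path. The sketch below reproduces both with the standard library only. The helper names `mime_type` and `pdf_target` and the sample inputs are hypothetical and do not exist in Docsplit; the sample string assumes `file` prints output of the form `type; charset=...`.

```ruby
require 'shellwords'

# Hypothetical helper: take the MIME type from output shaped like
# "image/png; charset=binary", using the same split as extract_pdf.
def mime_type(file_output)
  file_output.strip.split(/[:;]\s+/)[0]
end

# Hypothetical helper: build "<out>/<basename>.pdf" with the directory
# and the base name shell-escaped separately, as extract_pdf does.
def pdf_target(doc, out)
  ext      = File.extname(doc)
  basename = File.basename(doc, ext)
  "#{Shellwords.shellescape(out)}/#{Shellwords.shellescape(basename)}.pdf"
end

# [docs].flatten accepts either a single path or an array of paths.
docs = 'docs/My Report.doc'
[docs].flatten.each do |doc|
  puts pdf_target(doc, 'out')
end

puts mime_type("image/png; charset=binary\n")
```

Escaping the directory and the base name separately keeps the `/` separator and the `.pdf` suffix outside the escaped parts, so the resulting string can be interpolated directly into the `gm convert` or JODConverter command line.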

.extract_text(pdfs, opts = {}) ⇒ Object

Use the ExtractText Java class to write out all embedded text.



# File 'lib/docsplit.rb', line 49

def self.extract_text(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  TextExtractor.new.extract(pdfs, opts)
end