Module: Docsplit

Extended by: TransparentPDFs

Defined in: lib/docsplit.rb,
lib/docsplit/command_line.rb,
lib/docsplit/text_cleaner.rb,
lib/docsplit/info_extractor.rb,
lib/docsplit/page_extractor.rb,
lib/docsplit/text_extractor.rb,
lib/docsplit/image_extractor.rb,
lib/docsplit/transparent_pdfs.rb

Overview

The Docsplit module delegates to the Java PDF extractors.

Defined Under Namespace

Modules: TransparentPDFs

Classes: CommandLine, ExtractionFailed, ImageExtractor, InfoExtractor, PageExtractor, TextCleaner, TextExtractor

Constant Summary

VERSION =

Keep in sync with gemspec.

'0.6.3'

ROOT =

File.expand_path(File.dirname(__FILE__) + '/..')

CLASSPATH =

"#{ROOT}/build#{File::PATH_SEPARATOR}#{ROOT}/vendor/'*'"

LOGGING =

"-Djava.util.logging.config.file=#{ROOT}/vendor/logging.properties"

HEADLESS =

"-Djava.awt.headless=true"

OFFICE =

RUBY_PLATFORM.match(/darwin/i) ? '' : "-Doffice.home=#{office}"

METADATA_KEYS =

[:author, :date, :creator, :keywords, :producer, :subject, :title, :length]

GM_FORMATS =

["image/gif", "image/jpeg", "image/png", "image/x-ms-bmp", "image/svg+xml", "image/tiff", "image/x-portable-bitmap", "application/postscript", "image/x-portable-pixmap"]

DEPENDENCIES =

{:java => false, :gm => false, :pdftotext => false, :pdftk => false, :tesseract => false}

ESCAPE =

lambda {|x| Shellwords.shellescape(x) }
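
ESCAPE wraps `Shellwords.shellescape` in a lambda so that it can be handed to `map` with `&`. The following is a minimal sketch of that pattern, not Docsplit's own code: the local `escape` lambda mirrors the constant instead of loading the gem, and the file and directory names are hypothetical.

```ruby
require 'shellwords'

# Local stand-in for Docsplit::ESCAPE (same body as the constant above).
escape = lambda { |x| Shellwords.shellescape(x) }

# Hypothetical inputs containing spaces, which a shell would split on.
doc = "my report.doc"
out = "out dir"

# Passing the lambda with & escapes every element in one pass.
escaped_doc, escaped_out = [doc, out].map(&escape)

puts escaped_doc  # my\ report.doc
puts escaped_out  # out\ dir
```

Escaping each path before interpolating it into a command string keeps spaces and shell metacharacters in file names from being interpreted by the shell.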

Class Method Summary

.clean_text(text) ⇒ Object

Utility method to clean garbage characters out of OCR’d text.

.extract_images(pdfs, opts = {}) ⇒ Object

Use the ExtractImages Java class to rasterize each page of a PDF into an image.

.extract_pages(pdfs, opts = {}) ⇒ Object

Use the ExtractPages Java class to burst a PDF into single pages.

.extract_pdf(docs, opts = {}) ⇒ Object

Use JODConverter to extract the documents as PDFs.

.extract_text(pdfs, opts = {}) ⇒ Object

Use the ExtractText Java class to write out all embedded text.

Methods included from TransparentPDFs

ensure_pdfs

Class Method Details

.clean_text(text) ⇒ `Object`

Utility method to clean garbage characters out of OCR’d text.

# File 'lib/docsplit.rb', line 92

def self.clean_text(text)
  TextCleaner.new.clean(text)
end

.extract_images(pdfs, opts = {}) ⇒ `Object`

Use the ExtractImages Java class to rasterize each page of a PDF into an image.

# File 'lib/docsplit.rb', line 55

def self.extract_images(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
  ImageExtractor.new.extract(pdfs, opts)
end

.extract_pages(pdfs, opts = {}) ⇒ `Object`

Use the ExtractPages Java class to burst a PDF into single pages.

# File 'lib/docsplit.rb', line 43

def self.extract_pages(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  PageExtractor.new.extract(pdfs, opts)
end

.extract_pdf(docs, opts = {}) ⇒ `Object`

Use JODConverter to extract the documents as PDFs. If the document is in an image format, use GraphicsMagick to extract the PDF.

# File 'lib/docsplit.rb', line 63

def self.extract_pdf(docs, opts={})
  out = opts[:output] || '.'
  FileUtils.mkdir_p out unless File.exist?(out)
  [docs].flatten.each do |doc|
    ext = File.extname(doc)
    basename = File.basename(doc, ext)
    escaped_doc, escaped_out, escaped_basename = [doc, out, basename].map(&ESCAPE)

    if GM_FORMATS.include?(`file -b --mime #{ESCAPE[doc]}`.strip.split(/[:;]\s+/)[0])
      `gm convert #{escaped_doc} #{escaped_out}/#{escaped_basename}.pdf`
    else
      options = "-jar #{ROOT}/vendor/jodconverter/jodconverter-core-3.0-beta-4.jar -r #{ROOT}/vendor/conf/document-formats.js"
      run "#{options} #{escaped_doc} #{escaped_out}/#{escaped_basename}.pdf", [], {}
    end
  end
end
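
The branch in `extract_pdf` depends on two small string operations: parsing the MIME type out of the `file -b --mime` output, and deriving the output name from the input path. The sketch below reproduces both in isolation. The sample `file` outputs, the input path, and the output directory are assumptions for illustration (real `file` output varies by platform), and `gm_formats` is only a subset of GM_FORMATS.

```ruby
# Sample outputs of `file -b --mime`; exact text varies by platform (assumed here).
samples = [
  "image/png; charset=binary\n",
  "application/msword; charset=binary\n",
  "text/plain; charset=us-ascii\n"
]

# A subset of GM_FORMATS, copied from the constant summary.
gm_formats = ["image/gif", "image/jpeg", "image/png", "image/tiff"]

# Same parse as extract_pdf: trim whitespace, keep the part before "; charset=...".
mime_types = samples.map { |raw| raw.strip.split(/[:;]\s+/)[0] }

# Image formats go to GraphicsMagick; everything else goes to JODConverter.
routes = mime_types.map do |mime|
  gm_formats.include?(mime) ? :graphicsmagick : :jodconverter
end

# Output naming: swap the source extension for .pdf inside the output directory.
doc      = "reports/summary.docx"   # hypothetical input path
out      = "out"                    # hypothetical output directory
ext      = File.extname(doc)        # ".docx"
basename = File.basename(doc, ext)  # "summary"
target   = "#{out}/#{basename}.pdf"

mime_types.zip(routes).each { |mime, route| puts "#{mime} -> #{route}" }
puts target
```

Note that the output name keeps only the base name of the input, so two inputs with the same base name in different directories would map to the same target file.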

.extract_text(pdfs, opts = {}) ⇒ `Object`

Use the ExtractText Java class to write out all embedded text.

# File 'lib/docsplit.rb', line 49

def self.extract_text(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  TextExtractor.new.extract(pdfs, opts)
end