Module: Docsplit

Extended by:
TransparentPDFs
Defined in:
lib/docsplit.rb,
lib/docsplit/command_line.rb,
lib/docsplit/text_cleaner.rb,
lib/docsplit/pdf_extractor.rb,
lib/docsplit/info_extractor.rb,
lib/docsplit/page_extractor.rb,
lib/docsplit/text_extractor.rb,
lib/docsplit/image_extractor.rb,
lib/docsplit/transparent_pdfs.rb

Overview

The Docsplit module delegates to the Java PDF extractors.

Defined Under Namespace

Modules: TransparentPDFs
Classes: CommandLine, ExtractionFailed, ImageExtractor, InfoExtractor, PageExtractor, PdfExtractor, TextCleaner, TextExtractor

Constant Summary

VERSION =
'0.8.0' # Keep in sync with gemspec.
ESCAPE =
lambda {|x| Shellwords.shellescape(x) }
ROOT =
File.expand_path(File.dirname(__FILE__) + '/..')
ESCAPED_ROOT =
ESCAPE[ROOT]
METADATA_KEYS =
[:author, :date, :creator, :keywords, :producer, :subject, :title, :length]
GM_FORMATS =
["image/gif", "image/jpeg", "image/png", "image/x-ms-bmp", "image/svg+xml", "image/tiff", "image/x-portable-bitmap", "application/postscript", "image/x-portable-pixmap"]
DEPENDENCIES =
{:java => false, :gm => false, :pdftotext => false, :pdftk => false, :pdftailor => false, :tesseract => false, :osd => false}
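
To illustrate what ESCAPE is for, the sketch below builds a shell command from a file name that contains spaces and parentheses. The lambda mirrors the ESCAPE definition above; the `build_command` helper, the file names, and the `pdftotext` invocation are hypothetical and are not taken from Docsplit's source.

```ruby
require 'shellwords'

# Same shape as the ESCAPE constant listed above.
escape = lambda { |x| Shellwords.shellescape(x) }

# Hypothetical helper: joins a program name with shell-escaped arguments.
def build_command(program, args, escape)
  ([program] + args.map { |a| escape[a] }).join(' ')
end

command = build_command('pdftotext', ['my report (final).pdf', 'out.txt'], escape)
puts command

# Splitting the command the way a POSIX shell would recovers the original arguments.
round_trip = Shellwords.split(command)
puts round_trip.inspect
```

Escaping each argument separately means a file name with spaces or shell metacharacters is still passed to the external tool as a single argument.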

Class Method Summary

Methods included from TransparentPDFs

ensure_pdfs, is_pdf?

Class Method Details

.clean_text(text) ⇒ Object

Utility method that strips garbage characters from OCR’d text.



# File 'lib/docsplit.rb', line 85

def self.clean_text(text)
  TextCleaner.new.clean(text)
end

.extract_images(pdfs, opts = {}) ⇒ Object

Use the ExtractImages Java class to rasterize each page of a PDF into an image.



# File 'lib/docsplit.rb', line 56

def self.extract_images(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
  ImageExtractor.new.extract(pdfs, opts)
end
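
The method above normalizes the :pages option only when the caller supplies it. The body of normalize_value is not shown on this page, so the sketch below substitutes a hypothetical `to_page_strings` helper that assumes normalization means turning every page specification into a string. `prepare_opts` is also a hypothetical name.

```ruby
# Hypothetical normalizer (an assumption, not Docsplit's normalize_value):
# converts a page spec, or each element of an array of specs, to a string.
def to_page_strings(value)
  value.is_a?(Array) ? value.map { |v| to_page_strings(v) } : value.to_s
end

# Same conditional pattern as extract_images: touch :pages only if it was given.
def prepare_opts(opts = {})
  opts = opts.dup # work on a copy so the caller's hash is left alone
  opts[:pages] = to_page_strings(opts[:pages]) if opts[:pages]
  opts
end

with_pages = prepare_opts(pages: [1, '3-5'])
without_pages = prepare_opts(size: '700x')
p with_pages
p without_pages
```

Unlike the original method, which assigns into the hash it receives, this sketch duplicates the options first. That is a design choice for the example only.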

.extract_info(pdfs, opts = {}) ⇒ Object

Extract metadata from the given PDFs by delegating to InfoExtractor#extract_all.



# File 'lib/docsplit.rb', line 79

def self.extract_info(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  InfoExtractor.new.extract_all(pdfs, opts)
end

.extract_pages(pdfs, opts = {}) ⇒ Object

Use the ExtractPages Java class to burst a PDF into single pages.



# File 'lib/docsplit.rb', line 44

def self.extract_pages(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  PageExtractor.new.extract(pdfs, opts)
end

.extract_pdf(docs, opts = {}) ⇒ Object

Use JODConverter to convert the documents to PDF. If a document is in an image format, use GraphicsMagick to produce the PDF instead.



# File 'lib/docsplit.rb', line 64

def self.extract_pdf(docs, opts={})
  PdfExtractor.new.extract(docs, opts)
end
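
The description above implies a choice between two converters based on the input format. A minimal sketch of that decision follows, using the MIME types from the GM_FORMATS constant. The `converter_for` helper and its symbol return values are hypothetical, and how PdfExtractor actually detects a document's type is not shown on this page.

```ruby
# MIME types copied from the GM_FORMATS constant listed on this page.
GM_MIME_TYPES = [
  'image/gif', 'image/jpeg', 'image/png', 'image/x-ms-bmp', 'image/svg+xml',
  'image/tiff', 'image/x-portable-bitmap', 'application/postscript',
  'image/x-portable-pixmap'
].freeze

# Hypothetical helper: image formats go to GraphicsMagick,
# everything else goes to JODConverter.
def converter_for(mime_type)
  GM_MIME_TYPES.include?(mime_type) ? :graphicsmagick : :jodconverter
end

puts converter_for('image/png')
puts converter_for('application/msword')
```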

.extract_text(pdfs, opts = {}) ⇒ Object

Use the ExtractText Java class to write out all embedded text.



# File 'lib/docsplit.rb', line 50

def self.extract_text(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  TextExtractor.new.extract(pdfs, opts)
end
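
Most of the class methods above share one shape: coerce the input with ensure_pdfs, then hand off to a freshly built extractor object. The sketch below reproduces that shape with stand-ins so it runs without Docsplit installed. `MiniSplit`, `StubExtractor`, and this simplified `ensure_pdfs` are all hypothetical; the real ensure_pdfs comes from TransparentPDFs and its behavior is not shown on this page.

```ruby
# Hypothetical stand-in for an extractor class such as PageExtractor.
class StubExtractor
  # Returns a description of the work instead of touching any files.
  def extract(pdfs, opts)
    pdfs.map { |pdf| "#{pdf} -> #{opts.fetch(:output, '.')}" }
  end
end

module MiniSplit
  # Simplified stand-in: only wraps a single path in an array and flattens.
  def self.ensure_pdfs(docs)
    Array(docs).flatten
  end

  # Same two-step shape as the class methods documented above.
  def self.extract_pages(pdfs, opts = {})
    pdfs = ensure_pdfs(pdfs)
    StubExtractor.new.extract(pdfs, opts)
  end
end

single = MiniSplit.extract_pages('a.pdf', output: 'pages')
several = MiniSplit.extract_pages(['a.pdf', 'b.pdf'])
p single
p several
```

Keeping the module methods this thin leaves the real work in the extractor classes, so the module acts as a stable entry point.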