Module: Docsplit
- Extended by:
- TransparentPDFs
- Defined in:
- lib/docsplit.rb,
lib/docsplit/command_line.rb,
lib/docsplit/text_cleaner.rb,
lib/docsplit/info_extractor.rb,
lib/docsplit/page_extractor.rb,
lib/docsplit/text_extractor.rb,
lib/docsplit/image_extractor.rb,
lib/docsplit/transparent_pdfs.rb
Overview
The Docsplit module delegates to the Java PDF extractors.
Defined Under Namespace
Modules: TransparentPDFs
Classes: CommandLine, ExtractionFailed, ImageExtractor, InfoExtractor, PageExtractor, TextCleaner, TextExtractor
Constant Summary
- VERSION =
Keep in sync with gemspec.
'0.6.3'
- ROOT =
File.expand_path(File.dirname(__FILE__) + '/..')
- CLASSPATH =
"#{ROOT}/build#{File::PATH_SEPARATOR}#{ROOT}/vendor/'*'"
- LOGGING =
"-Djava.util.logging.config.file=#{ROOT}/vendor/logging.properties"
- HEADLESS =
"-Djava.awt.headless=true"
- OFFICE =
RUBY_PLATFORM.match(/darwin/i) ? '' : "-Doffice.home=#{office}"
- METADATA_KEYS =
[:author, :date, :creator, :keywords, :producer, :subject, :title, :length]
- GM_FORMATS =
["image/gif", "image/jpeg", "image/png", "image/x-ms-bmp", "image/svg+xml", "image/tiff", "image/x-portable-bitmap", "application/postscript", "image/x-portable-pixmap"]
- DEPENDENCIES =
{:java => false, :gm => false, :pdftotext => false, :pdftk => false, :tesseract => false}
- ESCAPE =
lambda {|x| Shellwords.shellescape(x) }
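The CLASSPATH and ESCAPE constants above can be tried in isolation. The following is a minimal sketch, not the library's own code: the ROOT value, the file names, and the output directory are hypothetical placeholders chosen for illustration. It shows how File::PATH_SEPARATOR joins the two classpath entries and how mapping paths through ESCAPE makes them safe to interpolate into a shell command.

```ruby
require 'shellwords'

# Hypothetical install location; the real ROOT is derived from __FILE__.
ROOT = '/opt/docsplit'

# Same shape as the documented constant. The quoted '*' is presumably there
# so the shell passes the wildcard through to Java unexpanded.
CLASSPATH = "#{ROOT}/build#{File::PATH_SEPARATOR}#{ROOT}/vendor/'*'"

# Same definition as the documented ESCAPE constant.
ESCAPE = lambda { |x| Shellwords.shellescape(x) }

# Hypothetical input document and output directory containing shell-hostile characters.
doc = 'annual report (final).doc'
out = 'out dir'
basename = File.basename(doc, File.extname(doc))

# Escape every piece before it is interpolated into a command string.
escaped_doc, escaped_out, escaped_basename = [doc, out, basename].map(&ESCAPE)
target = "#{escaped_out}/#{escaped_basename}.pdf"

puts CLASSPATH
puts escaped_doc
puts target
```

Because ESCAPE is a lambda, it can be passed to `map` with `&ESCAPE` or called directly as `ESCAPE[doc]`; both forms appear in the source listing for `.extract_pdf`.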
Class Method Summary
-
.clean_text(text) ⇒ Object
Utility method to clean OCR’d text with garbage characters.
-
.extract_images(pdfs, opts = {}) ⇒ Object
Use the ExtractImages Java class to rasterize each page of a PDF into an image.
-
.extract_pages(pdfs, opts = {}) ⇒ Object
Use the ExtractPages Java class to burst a PDF into single pages.
-
.extract_pdf(docs, opts = {}) ⇒ Object
Use JODConverter to extract the documents as PDFs.
-
.extract_text(pdfs, opts = {}) ⇒ Object
Use the ExtractText Java class to write out all embedded text.
Methods included from TransparentPDFs
Class Method Details
.clean_text(text) ⇒ Object
Utility method to clean OCR’d text with garbage characters.
# File 'lib/docsplit.rb', line 92
def self.clean_text(text)
  TextCleaner.new.clean(text)
end
.extract_images(pdfs, opts = {}) ⇒ Object
Use the ExtractImages Java class to rasterize each page of a PDF into an image.
# File 'lib/docsplit.rb', line 55
def self.extract_images(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
  ImageExtractor.new.extract(pdfs, opts)
end
.extract_pages(pdfs, opts = {}) ⇒ Object
Use the ExtractPages Java class to burst a PDF into single pages.
# File 'lib/docsplit.rb', line 43
def self.extract_pages(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  PageExtractor.new.extract(pdfs, opts)
end
.extract_pdf(docs, opts = {}) ⇒ Object
Use JODConverter to extract the documents as PDFs. If the document is in an image format, use GraphicsMagick to convert it to a PDF instead.
# File 'lib/docsplit.rb', line 63
def self.extract_pdf(docs, opts={})
  out = opts[:output] || '.'
  FileUtils.mkdir_p out unless File.exists?(out)
  [docs].flatten.each do |doc|
    ext = File.extname(doc)
    basename = File.basename(doc, ext)
    escaped_doc, escaped_out, escaped_basename = [doc, out, basename].map(&ESCAPE)
    if GM_FORMATS.include?(`file -b --mime #{ESCAPE[doc]}`.strip.split(/[:;]\s+/)[0])
      `gm convert #{escaped_doc} #{escaped_out}/#{escaped_basename}.pdf`
    else
      options = "-jar #{ROOT}/vendor/jodconverter/jodconverter-core-3.0-beta-4.jar -r #{ROOT}/vendor/conf/document-formats.js"
      run "#{options} #{escaped_doc} #{escaped_out}/#{escaped_basename}.pdf", [], {}
    end
  end
end
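The branch in `.extract_pdf` hinges on parsing the output of `file -b --mime` and checking the result against GM_FORMATS. The sketch below isolates that decision so it can be run without shelling out. The helper names `mime_type_from` and `converter_for` are hypothetical and not part of Docsplit, and the sample strings are assumed examples of what `file` typically prints.

```ruby
# Copied from the documented GM_FORMATS constant.
GM_FORMATS = ["image/gif", "image/jpeg", "image/png", "image/x-ms-bmp",
              "image/svg+xml", "image/tiff", "image/x-portable-bitmap",
              "application/postscript", "image/x-portable-pixmap"]

# Hypothetical helper: reduce raw `file -b --mime` output to the bare MIME
# type, using the same strip/split expression as the source listing.
def mime_type_from(file_output)
  file_output.strip.split(/[:;]\s+/)[0]
end

# Hypothetical helper: name the converter the listing would pick.
def converter_for(file_output)
  GM_FORMATS.include?(mime_type_from(file_output)) ? :graphicsmagick : :jodconverter
end

# Assumed sample outputs; real output depends on the installed `file` version.
puts mime_type_from("image/png; charset=binary\n")          # prints "image/png"
puts converter_for("image/png; charset=binary\n")           # prints "graphicsmagick"
puts converter_for("application/msword; charset=binary\n")  # prints "jodconverter"
```

Anything whose MIME type is not in GM_FORMATS, including output that cannot be parsed, falls through to the JODConverter branch.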
.extract_text(pdfs, opts = {}) ⇒ Object
Use the ExtractText Java class to write out all embedded text.
# File 'lib/docsplit.rb', line 49
def self.extract_text(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  TextExtractor.new.extract(pdfs, opts)
end
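Every class method above follows the same shape: normalize the input, instantiate an extractor, and hand over the work. The following sketch illustrates that delegation pattern only. `MiniSplit` and `RecordingExtractor` are hypothetical stand-ins invented for this example, and the `[pdfs].flatten` normalization is borrowed from `.extract_pdf` rather than from `ensure_pdfs`, whose behavior is not shown on this page.

```ruby
# Hypothetical stand-in for an extractor class such as TextExtractor.
# It records what it was asked to do instead of touching any files.
class RecordingExtractor
  def extract(pdfs, opts)
    out = opts[:output] || '.'
    pdfs.map { |pdf| [pdf, out] }
  end
end

# Hypothetical facade mirroring the shape of the Docsplit class methods.
module MiniSplit
  def self.extract_text(pdfs, opts = {})
    # Accept either a single path or an array of paths.
    pdfs = [pdfs].flatten
    RecordingExtractor.new.extract(pdfs, opts)
  end
end

p MiniSplit.extract_text('a.pdf', output: 'out')
p MiniSplit.extract_text(%w[a.pdf b.pdf])
```

Keeping the module methods this thin means each extractor can be exercised on its own, while callers only need the module-level entry points.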