Module: Docsplit
- Extended by:
- TransparentPDFs
- Defined in:
- lib/docsplit.rb,
lib/docsplit/command_line.rb,
lib/docsplit/text_cleaner.rb,
lib/docsplit/pdf_extractor.rb,
lib/docsplit/info_extractor.rb,
lib/docsplit/page_extractor.rb,
lib/docsplit/text_extractor.rb,
lib/docsplit/image_extractor.rb,
lib/docsplit/transparent_pdfs.rb
Overview
The Docsplit module delegates to the Java PDF extractors.
Defined Under Namespace
Modules: TransparentPDFs
Classes: CommandLine, ExtractionFailed, ImageExtractor, InfoExtractor, PageExtractor, PdfExtractor, TextCleaner, TextExtractor
Constant Summary
- VERSION =
Keep in sync with gemspec.
'0.8.0'
- ESCAPE =
lambda {|x| Shellwords.shellescape(x) }
- ROOT =
File.expand_path(File.dirname(__FILE__) + '/..')
- ESCAPED_ROOT =
ESCAPE[ROOT]
- METADATA_KEYS =
[:author, :date, :creator, :keywords, :producer, :subject, :title, :length]
- GM_FORMATS =
["image/gif", "image/jpeg", "image/png", "image/x-ms-bmp", "image/svg+xml", "image/tiff", "image/x-portable-bitmap", "application/postscript", "image/x-portable-pixmap"]
- DEPENDENCIES =
{:java => false, :gm => false, :pdftotext => false, :pdftk => false, :pdftailor => false, :tesseract => false, :osd => false}
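As a rough illustration of why ESCAPE exists, the sketch below builds a shell command string with every argument passed through a lambda of the same shape. The helper name `build_command` and the file names are hypothetical and do not come from Docsplit; only `Shellwords.shellescape` is taken from the constant definition above.

```ruby
require 'shellwords'

# Hypothetical helper (not part of Docsplit): joins a program name and its
# arguments into one shell-safe command string.
def build_command(program, *args)
  # Same shape as the ESCAPE constant: a lambda wrapping Shellwords.shellescape.
  escape = lambda { |x| Shellwords.shellescape(x) }
  ([program] + args.map(&escape)).join(' ')
end

# Spaces and quotes in file names are backslash-escaped.
puts build_command('pdftotext', 'annual report.pdf', "it's.txt")
# prints: pdftotext annual\ report.pdf it\'s.txt
```

ESCAPED_ROOT applies the same escaping to ROOT once, presumably so the gem's install path can be interpolated into commands even when it contains spaces.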
Class Method Summary
-
.clean_text(text) ⇒ Object
Utility method to clean OCR’d text with garbage characters.
-
.extract_images(pdfs, opts = {}) ⇒ Object
Use the ExtractImages Java class to rasterize each page of a PDF into an image.
-
.extract_info(pdfs, opts = {}) ⇒ Object
-
.extract_pages(pdfs, opts = {}) ⇒ Object
Use the ExtractPages Java class to burst a PDF into single pages.
-
.extract_pdf(docs, opts = {}) ⇒ Object
Use JODConverter to extract the documents as PDFs.
-
.extract_text(pdfs, opts = {}) ⇒ Object
Use the ExtractText Java class to write out all embedded text.
Methods included from TransparentPDFs
Class Method Details
.clean_text(text) ⇒ Object
Utility method to clean OCR’d text with garbage characters.
# File 'lib/docsplit.rb', line 85
def self.clean_text(text)
  TextCleaner.new.clean(text)
end
.extract_images(pdfs, opts = {}) ⇒ Object
Use the ExtractImages Java class to rasterize each page of a PDF into an image.
# File 'lib/docsplit.rb', line 56
def self.extract_images(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
  ImageExtractor.new.extract(pdfs, opts)
end
.extract_info(pdfs, opts = {}) ⇒ Object
# File 'lib/docsplit.rb', line 79
def self.extract_info(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  InfoExtractor.new.extract_all(pdfs, opts)
end
.extract_pages(pdfs, opts = {}) ⇒ Object
Use the ExtractPages Java class to burst a PDF into single pages.
# File 'lib/docsplit.rb', line 44
def self.extract_pages(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  PageExtractor.new.extract(pdfs, opts)
end
.extract_pdf(docs, opts = {}) ⇒ Object
Use JODConverter to extract the documents as PDFs. If the document is in an image format, use GraphicsMagick to extract the PDF.
# File 'lib/docsplit.rb', line 64
def self.extract_pdf(docs, opts={})
  PdfExtractor.new.extract(docs, opts)
end
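The choice between the two conversion routes described above can be pictured as a membership test against GM_FORMATS. The following is only a sketch under that assumption: `gm_formats` and `converter_for` are hypothetical names, and this page does not show how PdfExtractor actually detects a document's MIME type or selects a converter.

```ruby
# MIME types copied from the GM_FORMATS constant listed on this page.
def gm_formats
  ["image/gif", "image/jpeg", "image/png", "image/x-ms-bmp",
   "image/svg+xml", "image/tiff", "image/x-portable-bitmap",
   "application/postscript", "image/x-portable-pixmap"]
end

# Hypothetical helper (not part of Docsplit): image formats go to
# GraphicsMagick, everything else to JODConverter.
def converter_for(mime_type)
  gm_formats.include?(mime_type) ? :graphicsmagick : :jodconverter
end

puts converter_for('image/png')          # prints: graphicsmagick
puts converter_for('application/msword') # prints: jodconverter
```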
.extract_text(pdfs, opts = {}) ⇒ Object
Use the ExtractText Java class to write out all embedded text.
# File 'lib/docsplit.rb', line 50
def self.extract_text(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  TextExtractor.new.extract(pdfs, opts)
end
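A minimal usage sketch tying the class methods together might look like the following. The wrapper `process_pdf` and the directory layout are hypothetical. The `:output` option is taken from Docsplit's general usage documentation rather than from this page, and passing a Range for `:pages` is an assumption, since the accepted forms are decided by `normalize_value`, which is not shown here. The `docsplit` argument is injected so the sketch can be exercised without the gem or its external dependencies installed.

```ruby
# Hypothetical wrapper (not part of Docsplit). `docsplit` is any object that
# responds to the class methods documented above; in practice that would be
# the Docsplit module itself after `require 'docsplit'`.
def process_pdf(pdf_path, out_dir, docsplit)
  # Burst the PDF into single-page PDFs.
  docsplit.extract_pages(pdf_path, :output => File.join(out_dir, 'pages'))
  # Write out the embedded text.
  docsplit.extract_text(pdf_path, :output => File.join(out_dir, 'text'))
  # Rasterize the first three pages only (Range form of :pages is assumed).
  docsplit.extract_images(pdf_path, :pages => 1..3,
                          :output => File.join(out_dir, 'images'))
end

# With the gem installed, a call could look like:
#   require 'docsplit'
#   process_pdf('report.pdf', 'out', Docsplit)
```

Injecting the module keeps the wrapper easy to test with a recording stand-in, at the cost of one extra argument.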