Module: Docsplit
- Extended by:
- TransparentPDFs
- Defined in:
- lib/docsplit.rb,
lib/docsplit/command_line.rb,
lib/docsplit/text_cleaner.rb,
lib/docsplit/info_extractor.rb,
lib/docsplit/page_extractor.rb,
lib/docsplit/text_extractor.rb,
lib/docsplit/image_extractor.rb,
lib/docsplit/transparent_pdfs.rb
Overview
The Docsplit module delegates to the Java PDF extractors.
Defined Under Namespace
Modules: TransparentPDFs
Classes: CommandLine, ExtractionFailed, ImageExtractor, InfoExtractor, PageExtractor, TextCleaner, TextExtractor
Constant Summary
- VERSION =
Keep in sync with gemspec.
'0.5.0'
- ROOT =
File.expand_path(File.dirname(__FILE__) + '/..')
- CLASSPATH =
"#{ROOT}/build#{File::PATH_SEPARATOR}#{ROOT}/vendor/'*'"
- LOGGING =
"-Djava.util.logging.config.file=#{ROOT}/vendor/logging.properties"
- HEADLESS =
"-Djava.awt.headless=true"
- OFFICE =
RUBY_PLATFORM.match(/darwin/i) ? '' : '-Doffice.home=/usr/lib/openoffice'
- METADATA_KEYS =
[:author, :date, :creator, :keywords, :producer, :subject, :title, :length]
- DEPENDENCIES =
{:java => false, :gm => false, :pdftotext => false, :pdftk => false, :tesseract => false}
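The constants above read as the pieces of a JVM invocation: two `-D` system properties, an optional OpenOffice home, and a classpath. The sketch below shows one way they could be joined into a command string. This page does not show how Docsplit actually assembles its command line, so the `java_command` helper, the `JavaFlagsSketch` module, and the `/opt/docsplit` root are all assumptions made for illustration.

```ruby
# Sketch only: mirrors the constants above under a hypothetical install root.
module JavaFlagsSketch
  ROOT      = '/opt/docsplit' # hypothetical path, not the gem's real location
  CLASSPATH = "#{ROOT}/build#{File::PATH_SEPARATOR}#{ROOT}/vendor/'*'"
  LOGGING   = "-Djava.util.logging.config.file=#{ROOT}/vendor/logging.properties"
  HEADLESS  = "-Djava.awt.headless=true"
  OFFICE    = RUBY_PLATFORM.match(/darwin/i) ? '' : '-Doffice.home=/usr/lib/openoffice'

  # Hypothetical helper: joins the JVM flags, the classpath, and a command.
  # Empty flags (OFFICE on macOS) are dropped so no double spaces appear.
  def self.java_command(command)
    flags = [HEADLESS, LOGGING, OFFICE].reject { |flag| flag.empty? }
    (['java'] + flags + ['-cp', CLASSPATH, command]).join(' ')
  end
end

puts JavaFlagsSketch.java_command('ExtractText')
```

`File::PATH_SEPARATOR` keeps the classpath portable (`:` on Unix-like systems, `;` on Windows), and the quoted `'*'` stops the shell from expanding the wildcard before the JVM sees it.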
Class Method Summary
- .clean_text(text) ⇒ Object
Utility method to clean OCR’d text with garbage characters.
- .extract_images(pdfs, opts = {}) ⇒ Object
Use the ExtractImages Java class to rasterize a PDF into each page’s image.
- .extract_pages(pdfs, opts = {}) ⇒ Object
Use the ExtractPages Java class to burst a PDF into single pages.
- .extract_pdf(docs, opts = {}) ⇒ Object
Use JODConverter to extract the documents as PDFs.
- .extract_text(pdfs, opts = {}) ⇒ Object
Use the ExtractText Java class to write out all embedded text.
Methods included from TransparentPDFs
Class Method Details
.clean_text(text) ⇒ Object
Utility method to clean OCR’d text with garbage characters.
# File 'lib/docsplit.rb', line 76
def self.clean_text(text)
  TextCleaner.new.clean(text)
end
.extract_images(pdfs, opts = {}) ⇒ Object
Use the ExtractImages Java class to rasterize a PDF into each page’s image.
# File 'lib/docsplit.rb', line 49
def self.extract_images(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
  ImageExtractor.new.extract(pdfs, opts)
end
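The `:pages` option is normalized only when the caller supplies it. The body of `normalize_value` is not shown on this page, so the sketch below assumes it turns an array of page numbers into a comma-separated string and stringifies anything else. Both `normalize_pages` and `prepare_image_opts` are hypothetical names used to illustrate the guard.

```ruby
# Hypothetical stand-in for normalize_value, whose body is not shown here.
# Assumption: arrays become "1,2,5"; any other value is converted with to_s.
def normalize_pages(value)
  value.respond_to?(:join) ? value.join(',') : value.to_s
end

# Mirrors the guard in extract_images: only touch :pages when it is given.
def prepare_image_opts(opts = {})
  opts = opts.dup # copy so this sketch does not mutate the caller's hash
  opts[:pages] = normalize_pages(opts[:pages]) if opts[:pages]
  opts
end

p prepare_image_opts({ :pages => [1, 2, 5] })
p prepare_image_opts({})
```

Note that the method shown above assigns into `opts` directly, so a caller's hash is modified in place; the copy here is a choice made for the sketch only.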
.extract_pages(pdfs, opts = {}) ⇒ Object
Use the ExtractPages Java class to burst a PDF into single pages.
# File 'lib/docsplit.rb', line 37
def self.extract_pages(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  PageExtractor.new.extract(pdfs, opts)
end
.extract_pdf(docs, opts = {}) ⇒ Object
Use JODConverter to extract the documents as PDFs.
# File 'lib/docsplit.rb', line 56
def self.extract_pdf(docs, opts={})
  [docs].flatten.each do |doc|
    basename = File.basename(doc, File.extname(doc))
    options = "-jar #{ROOT}/vendor/jodconverter/jodconverter-core-3.0-beta-3.jar -r #{ROOT}/vendor/conf/document-formats.js"
    run "#{options} \"#{doc}\" \"#{opts[:output] || '.'}/#{basename}.pdf\"", [], {}
  end
end
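The output file name is derived from the input: the extension is stripped, `.pdf` is appended, and the file is placed under `opts[:output]` or the current directory. The sketch below isolates that path logic without invoking Java. The helper names `pdf_output_path` and `planned_outputs` are hypothetical and not part of Docsplit.

```ruby
# Hypothetical helper reproducing the path expression used in extract_pdf.
def pdf_output_path(doc, opts = {})
  basename = File.basename(doc, File.extname(doc)) # "a/b/notes.odt" -> "notes"
  "#{opts[:output] || '.'}/#{basename}.pdf"
end

# [docs].flatten accepts either a single path or an array of paths.
def planned_outputs(docs, opts = {})
  [docs].flatten.map { |doc| pdf_output_path(doc, opts) }
end

p planned_outputs('reports/summary.doc')
p planned_outputs(['a.doc', 'b.rtf'], { :output => 'pdfs' })
```

Because only the basename is kept, two inputs with the same file name in different directories would map to the same output path.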
.extract_text(pdfs, opts = {}) ⇒ Object
Use the ExtractText Java class to write out all embedded text.
# File 'lib/docsplit.rb', line 43
def self.extract_text(pdfs, opts={})
  pdfs = ensure_pdfs(pdfs)
  TextExtractor.new.extract(pdfs, opts)
end
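Since the extractors share the `(pdfs, opts = {})` signature, a caller can chain them over the same input. The sketch below is a hypothetical wrapper, `split_document`, that runs page and text extraction with a shared `:output` option. It assumes the extractors honor `:output` the way `extract_pdf` does, which this page does not confirm. A recording stand-in replaces the real module so the example runs without Java or the gem installed; with the gem loaded, `Docsplit` itself could be passed as `backend`.

```ruby
# Hypothetical wrapper: bursts a document into pages, then extracts its text.
# backend must respond to extract_pages and extract_text as documented above.
def split_document(backend, path, output_dir)
  opts = { :output => output_dir } # assumed to be honored by both extractors
  backend.extract_pages(path, opts)
  backend.extract_text(path, opts)
  output_dir
end

# Recording stand-in so the sketch runs without Java or the gem installed.
class RecordingBackend
  attr_reader :calls

  def initialize
    @calls = []
  end

  def extract_pages(pdfs, opts = {})
    @calls << [:extract_pages, pdfs, opts]
  end

  def extract_text(pdfs, opts = {})
    @calls << [:extract_text, pdfs, opts]
  end
end

backend = RecordingBackend.new
split_document(backend, 'report.pdf', 'out')
p backend.calls.map(&:first)
```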