Class: Slaw::Extract::Extractor
- Inherits:
- Object
- Slaw::Extract::Extractor
- Includes:
- Logging
- Defined in:
- lib/slaw/extract/extractor.rb
Overview
Routines for extracting and cleaning up content from other formats, such as PDF.
You may need to set the location of the `pdftotext` binary.
On Mac OS X, use `brew install xpdf` or download it from www.foolabs.com/xpdf/download.html
On Heroku, you’ll need to do some hoop jumping; see theprogrammingbutler.com/blog/archives/2011/07/28/running-pdftotext-on-heroku/
Constant Summary
- @@pdftotext_path = "pdftotext"
Class Method Summary
- .pdftotext_path ⇒ Object
  Get location of the pdftotext executable for all instances.
- .pdftotext_path=(val) ⇒ Object
  Set location of the pdftotext executable for all instances.
Instance Method Summary
- #extract_from_file(filename) ⇒ String
  Extract text from a file.
- #extract_from_pdf(filename) ⇒ String
  Extract text from a PDF.
- #extract_from_text(filename) ⇒ Object
- #extract_via_tika(filename) ⇒ Object
  Extract text from filename by sending it to Apache Tika (tika.apache.org/).
- #get_mimetype(filename) ⇒ Object
- #pdf_to_text_cmd(filename) ⇒ Array<String>
  Build a command for the external PDF-to-text utility.
- #remove_pdf_password(filename) ⇒ Object
Methods included from Logging
Class Method Details
.pdftotext_path ⇒ Object
Get location of the pdftotext executable for all instances.
    # File 'lib/slaw/extract/extractor.rb', line 115
    def self.pdftotext_path
      @@pdftotext_path
    end
.pdftotext_path=(val) ⇒ Object
Set location of the pdftotext executable for all instances.
    # File 'lib/slaw/extract/extractor.rb', line 120
    def self.pdftotext_path=(val)
      @@pdftotext_path = val
    end
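Because the path lives in a class variable, assigning it once changes the command built by every instance, including instances created earlier. A minimal sketch of that behaviour, using a hypothetical stand-in class `DemoExtractor` (not the real Slaw class) and an example path that is only an assumption:

```ruby
# Hypothetical stand-in that mirrors the accessor pattern shown above.
class DemoExtractor
  @@pdftotext_path = "pdftotext"

  def self.pdftotext_path
    @@pdftotext_path
  end

  def self.pdftotext_path=(val)
    @@pdftotext_path = val
  end

  # Same shape as pdf_to_text_cmd: executable, options, input file, "-" for stdout.
  def pdf_to_text_cmd(filename)
    [DemoExtractor.pdftotext_path, "-enc", "UTF-8", filename, "-"]
  end
end

first = DemoExtractor.new
second = DemoExtractor.new

# The path below is only an example location; adjust it for your system.
DemoExtractor.pdftotext_path = "/usr/local/bin/pdftotext"

first_cmd = first.pdf_to_text_cmd("act.pdf")
second_cmd = second.pdf_to_text_cmd("act.pdf")
```

With the real class, the equivalent assignment is `Slaw::Extract::Extractor.pdftotext_path = ...`, made once at start-up before any extraction runs.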
Instance Method Details
#extract_from_file(filename) ⇒ String
Extract text from a file.
    # File 'lib/slaw/extract/extractor.rb', line 25
    def extract_from_file(filename)
      mimetype = get_mimetype(filename)

      case mimetype && mimetype.type
      when 'application/pdf'
        extract_from_pdf(filename)
      when 'text/plain', nil
        # An undetected type (nil) is treated as plain text.
        extract_from_text(filename)
      else
        text = extract_via_tika(filename)
        # Check for nil first so that empty? is never called on nil.
        if text.nil? || text.empty?
          raise ArgumentError.new("Unsupported file type #{mimetype || 'unknown'}")
        end
        text
      end
    end
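The dispatch above can be read as a small decision table. The sketch below isolates it in a hypothetical helper, `extraction_strategy`, which takes a MIME type string (or nil) and returns the name of the method that would be called; it is an illustration, not part of the Slaw API:

```ruby
# Hypothetical helper: maps a MIME type string (or nil) to the name of the
# extraction method that extract_from_file would call.
def extraction_strategy(mimetype)
  case mimetype
  when 'application/pdf'
    :extract_from_pdf
  when 'text/plain', nil
    :extract_from_text
  else
    :extract_via_tika
  end
end

pdf_choice = extraction_strategy('application/pdf')
unknown_choice = extraction_strategy(nil)
docx_choice = extraction_strategy(
  'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
)
```

Note that a file whose type cannot be detected is read as plain text, while any other detected type goes to Tika and raises ArgumentError only if Tika returns nothing.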
#extract_from_pdf(filename) ⇒ String
Extract text from a PDF. Returns nil if pdftotext fails, including when it still fails after one attempt to remove a password.
    # File 'lib/slaw/extract/extractor.rb', line 47
    def extract_from_pdf(filename)
      retried = false

      loop do
        cmd = pdf_to_text_cmd(filename)
        logger.info("Executing: #{cmd}")
        stdout, status = Open3.capture2(*cmd)

        case status.exitstatus
        when 0
          return stdout
        when 3
          # Exit status 3 is a PDF permissions error: strip the password and retry once.
          return nil if retried
          retried = true
          self.remove_pdf_password(filename)
        else
          return nil
        end
      end
    end
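The retry logic is easier to follow with the external tools replaced by callables. The following is a minimal sketch under that assumption: `extract_with_one_retry`, `run`, and `repair` are hypothetical names, with `run` standing in for the pdftotext call and `repair` for the password removal step.

```ruby
# Hypothetical helper: `run` returns [output, exit_status]; `repair` tries to
# fix the input (for example, by stripping a password) before the single retry.
def extract_with_one_retry(run:, repair:)
  retried = false
  loop do
    output, status = run.call
    case status
    when 0
      return output
    when 3
      return nil if retried # already repaired once; give up
      retried = true
      repair.call
    else
      return nil
    end
  end
end

# Simulate a tool that fails with status 3 until the repair step has run.
repaired = false
attempts = 0
run = lambda do
  attempts += 1
  repaired ? ["extracted text", 0] : ["", 3]
end
repair = -> { repaired = true }

result = extract_with_one_retry(run: run, repair: repair)

# A tool that keeps failing with status 3 is retried only once.
stuck_attempts = 0
stuck = extract_with_one_retry(
  run: -> { stuck_attempts += 1; ["", 3] },
  repair: -> {}
)
```

The `retried` flag is what bounds the loop: a second status 3 returns nil instead of repairing again.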
#extract_from_text(filename) ⇒ Object
    # File 'lib/slaw/extract/extractor.rb', line 77
    def extract_from_text(filename)
      File.read(filename)
    end
#extract_via_tika(filename) ⇒ Object
Extract text from filename by sending it to Apache Tika (tika.apache.org/).
    # File 'lib/slaw/extract/extractor.rb', line 83
    def extract_via_tika(filename)
      # the Yomu gem falls over when trying to write large amounts of data
      # to the JVM stdin, so we manually call java ourselves, relying on yomu
      # to supply the gem
      require 'slaw/extract/yomu_patch'
      logger.info("Using Tika to get text from #{filename}. You need a JVM installed for this.")

      text = Yomu.text_from_file(filename)
      logger.info("Tika returned #{text.length} bytes")
      text
    end
#get_mimetype(filename) ⇒ Object
    # File 'lib/slaw/extract/extractor.rb', line 109
    def get_mimetype(filename)
      File.open(filename) { |f| MimeMagic.by_magic(f) } \
        || MimeMagic.by_path(filename)
    end
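The method checks the file's content first and falls back to its name only when the content is not recognised. The sketch below illustrates that ordering without MimeMagic; `sniff_mimetype` is a hypothetical, much-reduced detector that knows only the PDF header and two extensions.

```ruby
require 'tempfile'

# Hypothetical, much-reduced detector: recognises only PDF by content and
# .pdf/.txt by extension. MimeMagic covers far more types than this.
def sniff_mimetype(filename)
  by_content = File.open(filename, 'rb') do |f|
    'application/pdf' if f.read(5) == '%PDF-'
  end
  by_extension = { '.pdf' => 'application/pdf', '.txt' => 'text/plain' }
  by_content || by_extension[File.extname(filename).downcase]
end

# A PDF header wins even when the extension is misleading.
disguised = Tempfile.new(['scan', '.txt'])
disguised.write("%PDF-1.4\n")
disguised.close
disguised_type = sniff_mimetype(disguised.path)

# Without a recognisable header, the extension decides.
notes = Tempfile.new(['notes', '.txt'])
notes.write("Chapter 1\n")
notes.close
notes_type = sniff_mimetype(notes.path)

# Neither check matches, so the result is nil.
mystery = Tempfile.new(['blob', '.bin'])
mystery.write("\x00\x01\x02")
mystery.close
mystery_type = sniff_mimetype(mystery.path)

[disguised, notes, mystery].each(&:unlink)
```

A nil result matters to callers: `extract_from_file` treats it the same as `text/plain`.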
#pdf_to_text_cmd(filename) ⇒ Array<String>
Build a command for the external PDF-to-text utility.
    # File 'lib/slaw/extract/extractor.rb', line 73
    def pdf_to_text_cmd(filename)
      [Extractor.pdftotext_path, "-enc", "UTF-8", filename, "-"]
    end
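Returning the command as an array, rather than a single string, means `Open3.capture2(*cmd)` passes each element to the program as one argument without going through a shell, so filenames with spaces or parentheses need no quoting. A small sketch of this, assuming a Ruby child process as a stand-in for `pdftotext` and a made-up filename:

```ruby
require 'open3'
require 'rbconfig'

# Stand-in for pdftotext: a Ruby child process that echoes its first argument.
# RbConfig.ruby is the path of the running interpreter.
awkward_name = "annual report (final).pdf"
cmd = [RbConfig.ruby, "-e", "print ARGV[0]", awkward_name]

# Splatting the array passes each element as one argument, with no shell involved.
stdout, status = Open3.capture2(*cmd)
```

Here `stdout` holds the filename exactly as given, and `status.exitstatus` is what `extract_from_pdf` inspects to decide whether to retry.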
#remove_pdf_password(filename) ⇒ Object
    # File 'lib/slaw/extract/extractor.rb', line 95
    def remove_pdf_password(filename)
      file = Tempfile.new('steno')
      begin
        logger.info("Trying to remove password from #{filename}")
        # Build the argument list directly so that paths containing spaces survive intact.
        cmd = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
               "-sOutputFile=#{file.path}", "-c", ".setpdfwrite", "-f", filename]
        logger.info("Executing: #{cmd}")
        Open3.capture2(*cmd)
        # Replace the original file with the rewritten copy.
        FileUtils.move(file.path, filename)
      ensure
        file.close
        file.unlink
      end
    end
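The method follows a write-to-temp-then-move pattern: the rewritten copy goes to a Tempfile, which then replaces the original. A minimal sketch of the same pattern, with a hypothetical helper `rewrite_in_place` and a block standing in for the Ghostscript step:

```ruby
require 'tempfile'
require 'fileutils'

# Hypothetical helper: writes a transformed copy to a temp file, then moves
# it over the original, as remove_pdf_password does with Ghostscript output.
def rewrite_in_place(filename)
  tmp = Tempfile.new('rewrite')
  begin
    tmp.write(yield(File.read(filename)))
    tmp.close
    FileUtils.move(tmp.path, filename)
  ensure
    # Safe even after the move: closing twice is a no-op and unlink
    # ignores a path that no longer exists.
    tmp.close
    tmp.unlink
  end
end

source = Tempfile.new(['doc', '.txt'])
source.write("draft")
source.close

rewrite_in_place(source.path) { |text| text.upcase }
rewritten = File.read(source.path)
source.unlink
```

As in the original method, the input file is overwritten, so callers who need the untouched original should work on a copy.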