Class: Slaw::Extract::Extractor

Inherits:
Object
Includes:
Logging
Defined in:
lib/slaw/extract/extractor.rb

Overview

Routines for extracting and cleaning up content from other formats, such as PDF.

You may need to set the location of the `pdftotext` binary.

On Mac OS X, use `brew install xpdf` or download it from www.foolabs.com/xpdf/download.html

On Heroku, you'll need to do some hoop jumping; see theprogrammingbutler.com/blog/archives/2011/07/28/running-pdftotext-on-heroku/

Constant Summary

@@pdftotext_path = "pdftotext"

Class Method Summary

Instance Method Summary

Methods included from Logging

#logger

Class Method Details

.pdftotext_path ⇒ Object

Get location of the pdftotext executable for all instances.



# File 'lib/slaw/extract/extractor.rb', line 115

def self.pdftotext_path
  @@pdftotext_path
end

.pdftotext_path=(val) ⇒ Object

Set location of the pdftotext executable for all instances.



# File 'lib/slaw/extract/extractor.rb', line 120

def self.pdftotext_path=(val)
  @@pdftotext_path = val
end
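
Because the path lives in a class variable, setting it once applies to every extractor instance. The following is a minimal sketch of that behaviour using a stand-in class: `DemoExtractor` is a hypothetical name that only mirrors the accessors and command builder documented on this page, and the path is an example value, not a recommended location.

```ruby
# Stand-in for Slaw::Extract::Extractor, reduced to the path accessors
# and the command builder. DemoExtractor is a hypothetical name.
class DemoExtractor
  @@pdftotext_path = "pdftotext"

  def self.pdftotext_path
    @@pdftotext_path
  end

  def self.pdftotext_path=(val)
    @@pdftotext_path = val
  end

  def pdf_to_text_cmd(filename)
    [DemoExtractor.pdftotext_path, "-enc", "UTF-8", filename, "-"]
  end
end

first = DemoExtractor.new
second = DemoExtractor.new

# Example path only; point this at wherever pdftotext lives on your system.
DemoExtractor.pdftotext_path = "/usr/local/bin/pdftotext"

# Both instances now build commands with the new path.
puts first.pdf_to_text_cmd("act.pdf").first
puts second.pdf_to_text_cmd("act.pdf").first
```

With the real class, the equivalent call would presumably be `Slaw::Extract::Extractor.pdftotext_path = ...`, made once at start-up before any extraction runs.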

Instance Method Details

#extract_from_file(filename) ⇒ String

Extract text from a file.

Parameters:

  • filename (String)

    filename to extract from

Returns:

  • (String)

    extracted text



# File 'lib/slaw/extract/extractor.rb', line 25

def extract_from_file(filename)
  mimetype = get_mimetype(filename)

  case mimetype && mimetype.type
  when 'application/pdf'
    extract_from_pdf(filename)
  when 'text/plain', nil
    extract_from_text(filename)
  else
    text = extract_via_tika(filename)
    if text.nil? || text.empty?
      raise ArgumentError.new("Unsupported file type #{mimetype || 'unknown'}")
    end
    text
  end
end
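
The dispatch in `extract_from_file` can be read as a three-way choice: PDFs go to `pdftotext`, plain text (or an undetectable type) is read directly, and everything else falls through to Tika. The sketch below isolates that decision. `strategy_for` and `blank_result?` are hypothetical helpers introduced here for illustration; they are not part of the class.

```ruby
# Hypothetical helper: maps a MIME type string (or nil) to the name of
# the extraction strategy that extract_from_file would choose.
def strategy_for(mime_type)
  case mime_type
  when 'application/pdf' then :pdf
  when 'text/plain', nil then :text
  else :tika
  end
end

# Hypothetical helper: nil-safe emptiness check. nil must be tested
# first, because calling empty? on nil raises NoMethodError.
def blank_result?(text)
  text.nil? || text.empty?
end

puts strategy_for('application/pdf')
puts strategy_for(nil)
puts strategy_for('application/msword')
puts blank_result?(nil)
```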

#extract_from_pdf(filename) ⇒ String

Extract text from a PDF.

Parameters:

  • filename (String)

    filename to extract from

Returns:

  • (String, nil)

    extracted text, or nil if extraction fails



# File 'lib/slaw/extract/extractor.rb', line 47

def extract_from_pdf(filename)
  retried = false

  loop do
    cmd = pdf_to_text_cmd(filename)
    logger.info("Executing: #{cmd}")
    stdout, status = Open3.capture2(*cmd)

    case status.exitstatus
    when 0
      return stdout
    when 3
      return nil if retried
      retried = true
      self.remove_pdf_password(filename)
    else
      return nil
    end
  end
end
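
The control flow above is a retry-once loop: a zero exit status returns the text, exit status 3 triggers a single attempt to strip the password before trying again, and any other status gives up. The sketch below reproduces that flow with injected callables so it can run without `pdftotext` or Ghostscript. `extract_with_one_retry`, `convert` and `unlock` are hypothetical names standing in for the real method, the `pdftotext` call and `remove_pdf_password`.

```ruby
# Hypothetical sketch of the retry-once flow. `convert` must return
# [stdout, exit_code]; `unlock` is called at most once.
def extract_with_one_retry(convert, unlock)
  retried = false

  loop do
    stdout, code = convert.call

    case code
    when 0
      return stdout
    when 3
      # Exit status 3 is treated as "needs unlocking": try once, then give up.
      return nil if retried
      retried = true
      unlock.call
    else
      return nil
    end
  end
end

# Simulate a PDF that fails with status 3 until it has been unlocked.
locked = true
convert = -> { locked ? ["", 3] : ["some text", 0] }
unlock = -> { locked = false }

result = extract_with_one_retry(convert, unlock)
puts result.inspect
```

Keeping the retry flag outside the loop guarantees termination: a file that still fails with status 3 after unlocking returns nil instead of looping forever.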

#extract_from_text(filename) ⇒ Object



# File 'lib/slaw/extract/extractor.rb', line 77

def extract_from_text(filename)
  File.read(filename)
end

#extract_via_tika(filename) ⇒ Object

Extract text from filename by sending it to Apache Tika (tika.apache.org/).



# File 'lib/slaw/extract/extractor.rb', line 83

def extract_via_tika(filename)
  # The Yomu gem falls over when trying to write large amounts of data
  # to the JVM's stdin, so we manually call java ourselves, relying on
  # Yomu to supply the Tika jar.
  require 'slaw/extract/yomu_patch'
  logger.info("Using Tika to get text from #{filename}. You need a JVM installed for this.")

  text = Yomu.text_from_file(filename)
  logger.info("Tika returned #{text.length} bytes")
  text
end

#get_mimetype(filename) ⇒ Object



# File 'lib/slaw/extract/extractor.rb', line 109

def get_mimetype(filename)
  File.open(filename) { |f| MimeMagic.by_magic(f) } \
    || MimeMagic.by_path(filename)
end

#pdf_to_text_cmd(filename) ⇒ Array<String>

Build a command for the external PDF-to-text utility.

Parameters:

  • filename (String)

    the pdf file

Returns:

  • (Array<String>)

    command and params to execute



# File 'lib/slaw/extract/extractor.rb', line 73

def pdf_to_text_cmd(filename)
  [Extractor.pdftotext_path, "-enc", "UTF-8", filename, "-"]
end
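
Returning the command as an array rather than a single string matters when it is splatted into `Open3.capture2`: each element is passed to the program as one argument with no shell in between, so filenames containing spaces survive intact. The sketch below shows this with a stand-in command (the current Ruby interpreter echoing its first argument) so that it runs without `pdftotext` installed; the filename is an example value.

```ruby
require 'open3'
require 'rbconfig'

# A filename containing a space, which a shell would split in two.
filename = "my act.pdf"

# Stand-in command: the current Ruby interpreter printing its first argument.
cmd = [RbConfig.ruby, "-e", "print ARGV[0]", filename]

# Splatting the array passes each element as a separate argument,
# bypassing shell word-splitting.
stdout, status = Open3.capture2(*cmd)

puts stdout
puts status.success?
```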

#remove_pdf_password(filename) ⇒ Object



# File 'lib/slaw/extract/extractor.rb', line 95

def remove_pdf_password(filename)
  file = Tempfile.new('steno')
  begin
    logger.info("Trying to remove password from #{filename}")
    # Build the command as an array so paths containing spaces stay intact.
    cmd = ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite", "-sOutputFile=#{file.path}", "-c", ".setpdfwrite", "-f", filename]
    logger.info("Executing: #{cmd}")
    Open3.capture2(*cmd)
    FileUtils.move(file.path, filename)
  ensure
    file.close
    file.unlink
  end
end