Class: Unbreakable::Scraper

Inherits:
Object
Extended by:
Forwardable
Defined in:
lib/unbreakable/scraper.rb

Overview

You may implement a scraper by subclassing this class:

require 'open-uri'
class MyScraper < Unbreakable::Scraper
  # Stores the contents of +http://www.example.com/+ in +index.html+.
  def retrieve(args)
    store(:path => 'index.html'){ URI.open('http://www.example.com/').read }
  end

  # Processes +index.html+.
  def process(args)
    fetch('index.html').process(:transform).apply
  end

  # Alternatively, you can just set the files to fetch, which will be
  # processed using a +:transform+ processor which you must implement.
  def processable
    ['index.html']
  end
end

To configure:

scraper.configure do |c|
  c.datastore = MyDataStore.new       # default Unbreakable::DataStorage::FileDataStore.new(scraper)
  c.log = Logger.new('/path/to/file') # default Logger.new(STDOUT)
  c.datastore.store_meta = true       # default false
end

Subclasses must implement the following instance methods:

  • retrieve

  • process or processable

Constant Summary

@@commands = []

Constructor Details

#initialize ⇒ Scraper

Initializes a Dragonfly app for storage and processing.



# File 'lib/unbreakable/scraper.rb', line 51

def initialize
  @app = Dragonfly[SecureRandom.hex.to_sym]
  # defaults to Logger.new('/var/tmp/dragonfly.log')
  @app.log = Logger.new(STDOUT)
  # defaults to Dragonfly::DataStorage::FileDataStore.new
  @app.datastore = Unbreakable::DataStorage::FileDataStore.new(self)
  # defaults to '/var/tmp/dragonfly'
  @app.datastore.root_path = '/var/tmp/unbreakable'
  # defaults to true
  @app.datastore.store_meta = false
end

Instance Method Details

#general_options ⇒ Object

This method is abstract.

Override to add general options to the option parser.

def general_options
  @opts.on('--echo ARG', 'Write a string to standard output') do |x|
    puts x
  end
end


# File 'lib/unbreakable/scraper.rb', line 109

def general_options; end

#opts ⇒ OptionParser

Returns an option parser.

Returns:

  • (OptionParser)

    an option parser



# File 'lib/unbreakable/scraper.rb', line 65

def opts
  if @opts.nil?
    @opts = OptionParser.new
    @opts.banner = <<-eos
usage: #{@opts.program_name} [options] <command> [<args>]

The most commonly used commands are:
retrieve  Cache remote files to the datastore for later processing
process   Process cached files into machine-readable data
config    Print the current configuration
    eos

    @opts.separator ''
    @opts.separator 'Specific options:'
    specific_options
    extract_configuration @app

    @opts.separator ''
    @opts.separator 'General options:'
    general_options
    @opts.on_tail('-h', '--help', 'Display this screen') do
      puts @opts
      exit
    end
  end
  @opts
end
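
The banner, tail option, and argument handling above can be tried in isolation with Ruby's standard OptionParser. The following is a minimal sketch that does not use Unbreakable itself; the program name `myscraper` and the `--dry-run` flag are assumptions made up for illustration.

```ruby
require 'optparse'

# Build a parser shaped like the one returned by #opts.
# 'myscraper' and --dry-run are hypothetical, chosen only for this sketch.
parser = OptionParser.new
parser.program_name = 'myscraper'
parser.banner = "usage: #{parser.program_name} [options] <command> [<args>]"

options = {}
parser.separator ''
parser.separator 'General options:'
parser.on('--dry-run', 'Hypothetical flag used only in this sketch') do
  options[:dry_run] = true
end
# on_tail keeps --help at the end of the help text.
parser.on_tail('-h', '--help', 'Display this screen') do
  puts parser
  exit
end

# parse! removes recognized options in place, leaving the command
# and its arguments behind for the caller to dispatch on.
args = ['--dry-run', 'retrieve', 'index.html']
parser.parse!(args)
command = args.shift
```

After `parse!`, only the command and its arguments remain in `args`, which is why #run can take the command with `args.shift`.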

#parse(temp_object_or_uid, encoding = 'utf-8') ⇒ Object

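Parses a JSON, HTML, XML, or YAML file, choosing the parser from the file extension. A file with any other extension is returned as a raw string.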

Parameters:

  • temp_object_or_uid (String, Dragonfly::TempObject)

    a TempObject or record ID

  • encoding (String) (defaults to: 'utf-8')

    a file encoding

Returns:

  • the parsed content, either a Ruby object or a Nokogiri document

Raises:

  • (LoadError)

    if the nokogiri gem is unavailable for parsing an HTML or XML file



# File 'lib/unbreakable/scraper.rb', line 152

def parse(temp_object_or_uid, encoding = 'utf-8')
  temp_object = temp_object_or_uid.is_a?(Dragonfly::TempObject) ? temp_object_or_uid : fetch(temp_object_or_uid)
  string = temp_object.data
  case File.extname temp_object.path
  when '.json'
    begin
      require 'yajl'
      Yajl::Parser.parse string
    rescue LoadError
      require 'json'
      JSON.parse string
    end
  when '.html'
    require 'nokogiri'
    Nokogiri::HTML string, nil, encoding
  when '.xml'
    require 'nokogiri'
    Nokogiri::XML string, nil, encoding
  when '.yml', '.yaml'
    require 'yaml'
    YAML.load string
  else
    string
  end
end
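
The extension-based dispatch above can be sketched with the standard library alone. This is an illustration, not the library's code: `parse_by_extension` is a hypothetical helper, it takes a path and a string instead of a Dragonfly TempObject, and it omits the HTML and XML branches so that it runs without nokogiri.

```ruby
require 'json'
require 'yaml'

# Hypothetical stand-in for #parse: picks a parser from the file extension.
def parse_by_extension(path, string)
  case File.extname(path)
  when '.json'
    JSON.parse(string)
  when '.yml', '.yaml'
    YAML.load(string)
  else
    # Unrecognized extensions fall through and return the raw string.
    string
  end
end

json = parse_by_extension('data.json', '{"name": "example"}')
yaml = parse_by_extension('data.yml', "name: example\n")
text = parse_by_extension('notes.txt', 'plain text')
```

Both the JSON and YAML inputs produce a Ruby Hash, while the `.txt` input comes back unchanged.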

#process(args) ⇒ Object

Processes cached files into machine-readable data.

Parameters:

  • args (Array)

    command-line arguments



# File 'lib/unbreakable/scraper.rb', line 186

def process(args)
  processable.each do |record|
    fetch(record).process(:transform, :args => args).apply
  end
end

#processable ⇒ Array<String>

Returns a list of record IDs to process.

Returns:

  • (Array<String>)

    a list of record IDs to process

Raises:

  • (NotImplementedError)


# File 'lib/unbreakable/scraper.rb', line 194

def processable
  raise NotImplementedError
end
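
The default #process iterates over whatever #processable returns, so a subclass that overrides only #processable gets processing for free. The sketch below illustrates that contract with hypothetical stand-in classes (`BaseScraper`, `ListScraper`) and a placeholder `transform` method; it does not use Dragonfly or Unbreakable.

```ruby
# Hypothetical stand-ins illustrating the default #process contract.
class BaseScraper
  # Default behaviour: run every processable record through transform.
  def process(args)
    processable.map { |record| transform(record, args) }
  end

  # Subclasses must override this to list the records to process.
  def processable
    raise NotImplementedError
  end

  # Placeholder for the :transform processor; an assumption of this sketch.
  def transform(record, args)
    "#{record} processed with #{args.inspect}"
  end
end

class ListScraper < BaseScraper
  def processable
    ['index.html', 'about.html']
  end
end

results = ListScraper.new.process(['--verbose'])
```

Calling `process` on the base class directly raises NotImplementedError, which mirrors why a subclass must supply either #process or #processable.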

#retrieve(args) ⇒ Object

Caches remote files to the datastore for later processing.

Parameters:

  • args (Array)

    command-line arguments

Raises:

  • (NotImplementedError)


# File 'lib/unbreakable/scraper.rb', line 180

def retrieve(args)
  raise NotImplementedError
end

#run(args) ⇒ Object

Note:

Only call this method once per scraper instance.

Runs the command. Most often run from a command-line script as:

scraper.run(ARGV)

Parameters:

  • args (Array)

    command-line arguments



# File 'lib/unbreakable/scraper.rb', line 117

def run(args)
  opts.parse!(args)
  command = args.shift
  case command
  when 'retrieve'
    retrieve(args)
  when 'process'
    process(args)
  when 'config'
    print_configuration @app
  when nil
    puts opts
  else
    # Allow subclasses to add more commands.
    if self.commands.include? command.to_sym
      send command, args
    else
      opts.abort "'#{command}' is not a #{opts.program_name} command. See '#{opts.program_name} --help'."
    end
  end
end
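
The dispatch in #run can be sketched without the library. `TinyScraper` below is a hypothetical stand-in that does not inherit from Unbreakable::Scraper; its `commands` method and `export` command are assumptions about how a subclass might register an extra command, and it raises ArgumentError where the real method calls `opts.abort`.

```ruby
# Hypothetical stand-in for the command dispatch in #run.
class TinyScraper
  attr_reader :calls

  def initialize
    @calls = []
  end

  # Extra commands a subclass might add (an assumption of this sketch).
  def commands
    [:export]
  end

  def retrieve(args)
    @calls << [:retrieve, args]
  end

  def process(args)
    @calls << [:process, args]
  end

  def export(args)
    @calls << [:export, args]
  end

  def run(args)
    command = args.shift
    case command
    when 'retrieve' then retrieve(args)
    when 'process'  then process(args)
    when nil        then @calls << [:usage, []]
    else
      # Only dispatch to methods that were explicitly registered.
      if commands.include?(command.to_sym)
        send(command, args)
      else
        raise ArgumentError, "'#{command}' is not a known command"
      end
    end
  end
end

scraper = TinyScraper.new
scraper.run(['retrieve', 'a.html'])
scraper.run(['export'])
```

Checking the command against a registered list before calling `send` keeps arbitrary method names on the command line from being invoked.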

#specific_options ⇒ Object

This method is abstract.

Override to add specific options to the option parser.

def specific_options
  @opts.on('--echo ARG', 'Write a string to standard output') do |x|
    puts x
  end
end


# File 'lib/unbreakable/scraper.rb', line 100

def specific_options; end
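
Both option hooks are empty by default and are called while the parser is being built, so an override only needs to register options on `@opts`. The sketch below shows that pattern with hypothetical stand-in classes (`BaseCli`, `EchoCli`) built on the standard OptionParser; it stores the echoed value instead of printing it so the result can be inspected.

```ruby
require 'optparse'

# Hypothetical stand-ins showing how the two hooks feed one parser.
class BaseCli
  def opts
    if @opts.nil?
      # @opts is assigned before the hooks run so overrides can use it.
      @opts = OptionParser.new
      @opts.separator 'Specific options:'
      specific_options
      @opts.separator 'General options:'
      general_options
    end
    @opts
  end

  # Hooks are no-ops unless a subclass overrides them.
  def specific_options; end
  def general_options; end
end

class EchoCli < BaseCli
  attr_reader :echoed

  def general_options
    @opts.on('--echo ARG', 'Write a string to standard output') do |x|
      @echoed = x
    end
  end
end

cli = EchoCli.new
cli.opts.parse!(['--echo', 'hi'])
help = cli.opts.help
```

Because the hooks run between the separators, each option appears in the help text under the heading of the hook that registered it.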

#store(opts = {}, &block) ⇒ Object

Stores a record in the datastore.

Parameters:

  • opts (Hash) (defaults to: {})

    options to pass to the datastore

  • block (Proc)

    a block that returns the contents of the file



# File 'lib/unbreakable/scraper.rb', line 142

def store(opts = {}, &block)
  datastore.defer_store(opts, &block)
end