Class: Unbreakable::Scraper

Inherits:
Object
Extended by:
Forwardable
Defined in:
lib/unbreakable/scraper.rb

Overview

You may implement a scraper by subclassing this class:

require 'open-uri'
class MyScraper < Unbreakable::Scraper
  # Stores the contents of +http://www.example.com/+ in +index.html+.
  def retrieve(args)
    # On Ruby 2.5 and later, open-uri is called through URI.open.
    store(:path => 'index.html'){ URI.open('http://www.example.com/').read }
  end

  # Processes +index.html+.
  def process(args)
    fetch('index.html').process(:transform).apply
  end

  # Alternatively, you can just set the files to fetch, which will be
  # processed using a +:transform+ processor which you must implement.
  def processable
    ['index.html']
  end
end

To configure:

scraper.configure do |c|
  c.datastore = MyDataStore.new       # default Unbreakable::DataStorage::FileDataStore.new(scraper)
  c.log = Logger.new('/path/to/file') # default Logger.new(STDOUT)
  c.datastore.store_meta = true       # default false
end

Subclasses must implement the following instance methods:

  • retrieve

  • process or processable (the default process handles each record ID returned by processable)
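
The contract above can be tried out without the gem installed. The following is a minimal sketch that uses a hypothetical stand-in base class, SketchScraper, with a placeholder handle method; both names are assumptions made for illustration. It mirrors only the documented behaviour (retrieve and processable raise NotImplementedError, and process iterates over processable) and is not the library's implementation.

```ruby
# Hypothetical stand-in for Unbreakable::Scraper; mirrors only the documented contract.
class SketchScraper
  def retrieve(args)
    raise NotImplementedError
  end

  def processable
    raise NotImplementedError
  end

  # Default behaviour: handle every record ID listed by +processable+.
  def process(args)
    processable.map { |record| handle(record, args) }
  end

  private

  # Placeholder for the real fetch-and-transform step (an assumption).
  def handle(record, args)
    "processed #{record}"
  end
end

# A subclass only needs +retrieve+ and +processable+ to satisfy the contract.
class IndexScraper < SketchScraper
  def retrieve(args)
    :retrieved
  end

  def processable
    ['index.html']
  end
end
```

Overriding processable is enough when every record goes through the same processor; override process instead when records need different handling.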


Constructor Details

#initialize ⇒ Scraper

Initializes a Dragonfly app for storage and processing.



# File 'lib/unbreakable/scraper.rb', line 48

def initialize
  @app = Dragonfly[SecureRandom.hex.to_sym]
  # defaults to Logger.new('/var/tmp/dragonfly.log')
  @app.log = Logger.new(STDOUT)
  # defaults to Dragonfly::DataStorage::FileDataStore.new
  @app.datastore = Unbreakable::DataStorage::FileDataStore.new(self)
  # defaults to '/var/tmp/dragonfly'
  @app.datastore.root_path = '/var/tmp/unbreakable'
  # defaults to true
  @app.datastore.store_meta = false
end

Instance Method Details

#opts ⇒ OptionParser

Returns an option parser.

Returns:

  • (OptionParser)

    an option parser



# File 'lib/unbreakable/scraper.rb', line 62

def opts
  if @opts.nil?
    @opts = OptionParser.new
    @opts.banner = <<-eos
usage: #{@opts.program_name} [options] <command> [<args>]

The most commonly used commands are:
retrieve  Cache remote files to the datastore for later processing
process   Process cached files into machine-readable data
config    Print the current configuration
    eos

    @opts.separator ''
    @opts.separator 'Specific options:'
    extract_configuration @app

    @opts.separator ''
    @opts.separator 'General options:'
    @opts.on_tail('-h', '--help', 'Display this screen') do
      puts @opts
      exit
    end
  end
  @opts
end
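
The banner, separator, and on_tail calls used above all belong to Ruby's standard OptionParser. As a minimal sketch of how they combine, the snippet below builds a comparable parser outside the scraper; build_parser is a hypothetical helper name, and the scraper-specific extract_configuration step is left out.

```ruby
require 'optparse'

# Builds a parser with a usage banner and a trailing --help switch.
# +build_parser+ is a hypothetical helper, not part of Unbreakable.
def build_parser
  parser = OptionParser.new
  parser.banner = "usage: #{parser.program_name} [options] <command> [<args>]"
  parser.separator ''
  parser.separator 'General options:'
  # on_tail keeps this switch at the end of the generated help text.
  parser.on_tail('-h', '--help', 'Display this screen') do
    puts parser
    exit
  end
  parser
end

parser = build_parser
args = ['retrieve', 'http://www.example.com/']
# parse! removes recognised switches in place and leaves the other words in args.
parser.parse!(args)
```

Because no switch appears in args here, parse! leaves both words in place for the caller to interpret as a command and its arguments.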

#parse(temp_object_or_uid, encoding = 'utf-8') ⇒ Object

Parses a JSON, HTML, XML, or YAML file. A file with any other extension is returned as an unparsed string.

Parameters:

  • temp_object_or_uid (String, Dragonfly::TempObject)

    a TempObject or record ID

  • encoding (String) (defaults to: 'utf-8')

    a file encoding

Returns:

  • (Object) the parsed content, either a Ruby or a Nokogiri type

Raises:

  • (LoadError)

    if the nokogiri gem is unavailable for parsing an HTML or XML file



# File 'lib/unbreakable/scraper.rb', line 124

def parse(temp_object_or_uid, encoding = 'utf-8')
  temp_object = temp_object_or_uid.is_a?(Dragonfly::TempObject) ? temp_object_or_uid : fetch(temp_object_or_uid)
  string = temp_object.data
  case File.extname temp_object.path
  when '.json'
    begin
      require 'yajl'
      Yajl::Parser.parse string
    rescue LoadError
      require 'json'
      JSON.parse string
    end
  when '.html'
    require 'nokogiri'
    Nokogiri::HTML string, nil, encoding
  when '.xml'
    require 'nokogiri'
    Nokogiri::XML string, nil, encoding
  when '.yml', '.yaml'
    require 'yaml'
    YAML.load string
  else
    string
  end
end
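
The extension-based dispatch above can be illustrated with the standard library alone. This is a minimal sketch, not the method itself: parse_by_extension is a hypothetical name, it takes a path and a string rather than a TempObject, it omits the HTML and XML branches (which need nokogiri) and the yajl fallback, and it swaps in YAML.safe_load where the source uses YAML.load.

```ruby
require 'json'
require 'yaml'

# Picks a parser from the file extension; unknown extensions return the raw string.
# +parse_by_extension+ is a hypothetical helper, not part of Unbreakable.
def parse_by_extension(path, string)
  case File.extname(path)
  when '.json'
    JSON.parse(string)
  when '.yml', '.yaml'
    # safe_load is used here instead of the source's YAML.load.
    YAML.safe_load(string)
  else
    string
  end
end

parse_by_extension('data.json', '{"a": 1}') # a Hash
parse_by_extension('data.yml', "a: 1\n")    # a Hash
parse_by_extension('notes.txt', 'hello')    # the String, unchanged
```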

#process(args) ⇒ Object

Processes cached files into machine-readable data.

Parameters:

  • args (Array)

    command-line arguments



# File 'lib/unbreakable/scraper.rb', line 158

def process(args)
  processable.each do |record|
    fetch(record).process(:transform, :args => args).apply
  end
end

#processable ⇒ Array<String>

Returns a list of record IDs to process.

Returns:

  • (Array<String>)

    a list of record IDs to process

Raises:

  • (NotImplementedError)


# File 'lib/unbreakable/scraper.rb', line 166

def processable
  raise NotImplementedError
end

#retrieve(args) ⇒ Object

Caches remote files to the datastore for later processing.

Parameters:

  • args (Array)

    command-line arguments

Raises:

  • (NotImplementedError)


# File 'lib/unbreakable/scraper.rb', line 152

def retrieve(args)
  raise NotImplementedError
end

#run(args) ⇒ Object

Note:

Only call this method once per scraper instance.

Runs the command. Most often run from a command-line script as:

scraper.run(ARGV)

Parameters:

  • args (Array)

    command-line arguments



# File 'lib/unbreakable/scraper.rb', line 94

def run(args)
  opts.parse!(args)
  command = args.shift
  case command
  when 'retrieve'
    retrieve(args)
  when 'process'
    process(args)
  when 'config'
    print_configuration @app
  when nil
    puts opts
  else
    opts.abort "'#{command}' is not a #{opts.program_name} command. See '#{opts.program_name} --help'."
  end
end
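
The dispatch in run follows a common pattern: strip the switches, take the first remaining word as the command, and hand the remaining words to the handler. The sketch below reproduces that flow with a hypothetical SketchRunner class that returns symbols instead of calling retrieve, process, or print_configuration, so it can be run without the gem.

```ruby
require 'optparse'

# Hypothetical stand-in that mirrors the command dispatch in +run+.
class SketchRunner
  def initialize
    @opts = OptionParser.new
  end

  def run(args)
    @opts.parse!(args)   # strips recognised switches from args in place
    command = args.shift # the first remaining word is the command
    case command
    when 'retrieve' then [:retrieve, args]
    when 'process'  then [:process, args]
    when 'config'   then [:config, args]
    when nil        then [:usage, args]
    else                 [:unknown, command]
    end
  end
end
```

Both parse! and shift modify the array they are given, so pass a copy (for example ARGV.dup) if the original arguments are needed afterwards.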

#store(opts = {}, &block) ⇒ Object

Stores a record in the datastore.

Parameters:

  • opts (Hash) (defaults to: {})

    options to pass to the datastore

  • block (Proc)

    a block that returns the contents of the file



# File 'lib/unbreakable/scraper.rb', line 114

def store(opts = {}, &block)
  datastore.defer_store(opts, &block)
end