Class: Unbreakable::Scraper
- Inherits: Object
  - Object
  - Unbreakable::Scraper
- Extended by: Forwardable
- Defined in: lib/unbreakable/scraper.rb
Overview
You may implement a scraper by subclassing this class:
  require 'open-uri'

  class MyScraper < Unbreakable::Scraper
    # Stores the contents of +http://www.example.com/+ in +index.html+.
    def retrieve(args)
      store(:path => 'index.html'){ open('http://www.example.com/').read }
    end

    # Processes +index.html+.
    def process(args)
      fetch('index.html').process(:transform).apply
    end

    # Alternatively, you can just set the files to fetch, which will be
    # processed using a +:transform+ processor which you must implement.
    def processable
      ['index.html']
    end
  end
To configure:
  scraper.configure do |c|
    c.datastore = MyDataStore.new # default Unbreakable::DataStorage::FileDataStore.new(scraper)
    c.log = Logger.new('/path/to/file') # default Logger.new(STDOUT)
    c.datastore. = true # default false
  end
The following instance methods must be implemented in subclasses:
- retrieve
- process or processable
Constant Summary
- @@commands = []
Instance Method Summary
- #general_options ⇒ Object (abstract)
  Override to add general options to the option parser.
- #initialize ⇒ Scraper (constructor)
  Initializes a Dragonfly app for storage and processing.
- #opts ⇒ OptionParser
  Returns an option parser.
- #parse(temp_object_or_uid, encoding = 'utf-8') ⇒ Object
  Parses a JSON, HTML, XML, or YAML file.
- #process(args) ⇒ Object
  Processes cached files into machine-readable data.
- #processable ⇒ Array<String>
  Returns a list of record IDs to process.
- #retrieve(args) ⇒ Object
  Caches remote files to the datastore for later processing.
- #run(args) ⇒ Object
  Runs the command.
- #specific_options ⇒ Object (abstract)
  Override to add specific options to the option parser.
- #store(opts = {}, &block) ⇒ Object
  Stores a record in the datastore.
Constructor Details
#initialize ⇒ Scraper
Initializes a Dragonfly app for storage and processing.
  # File 'lib/unbreakable/scraper.rb', line 51
  def initialize
    @app = Dragonfly[SecureRandom.hex.to_sym]
    # defaults to Logger.new('/var/tmp/dragonfly.log')
    @app.log = Logger.new(STDOUT)
    # defaults to Dragonfly::DataStorage::FileDataStore.new
    @app.datastore = Unbreakable::DataStorage::FileDataStore.new(self)
    # defaults to '/var/tmp/dragonfly'
    @app.datastore.root_path = '/var/tmp/unbreakable'
    # defaults to true
    @app.datastore. = false
  end
Instance Method Details
#general_options ⇒ Object
Override to add general options to the option parser.
  def general_options
    @opts.on('--echo ARG', 'Write a string to standard output') do |x|
      puts x
    end
  end

  # File 'lib/unbreakable/scraper.rb', line 109
  def ; end
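The option definition in the example above can be tried on its own with Ruby's standard OptionParser. The sketch below is only an illustration of how such an option behaves when arguments are parsed; the local names `opts`, `echoed`, and `args` are assumptions made for this example and are not part of Unbreakable.

```ruby
require 'optparse'

opts = OptionParser.new
echoed = []

# Same option as in the example above; the block receives the argument
# that follows --echo on the command line.
opts.on('--echo ARG', 'Write a string to standard output') do |x|
  echoed << x
  puts x
end

args = ['--echo', 'hello', 'retrieve']

# parse! removes recognized options (and their arguments) from args,
# leaving only the non-option words such as a command name.
opts.parse!(args)
```

After `parse!` returns, `args` holds only the words that were not consumed as options, which is what lets a command name follow the options on the command line.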
#opts ⇒ OptionParser
Returns an option parser.
  # File 'lib/unbreakable/scraper.rb', line 65
  def opts
    if @opts.nil?
      @opts = OptionParser.new
      @opts. = <<-eos
  usage: #{@opts.program_name} [options] <command> [<args>]

  The most commonly used commands are:
    retrieve    Cache remote files to the datastore for later processing
    process     Process cached files into machine-readable data
    config      Print the current configuration
      eos

      @opts.separator ''
      @opts.separator 'Specific options:'
      extract_configuration @app

      @opts.separator ''
      @opts.separator 'General options:'

      @opts.on_tail('-h', '--help', 'Display this screen') do
        puts @opts
        exit
      end
    end
    @opts
  end
#parse(temp_object_or_uid, encoding = 'utf-8') ⇒ Object
Parses a JSON, HTML, XML, or YAML file.
  # File 'lib/unbreakable/scraper.rb', line 152
  def parse(temp_object_or_uid, encoding = 'utf-8')
    temp_object = temp_object_or_uid.is_a?(Dragonfly::TempObject) ? temp_object_or_uid : fetch(temp_object_or_uid)
    string = temp_object.data
    case File.extname temp_object.path
    when '.json'
      begin
        require 'yajl'
        Yajl::Parser.parse string
      rescue LoadError
        require 'json'
        JSON.parse string
      end
    when '.html'
      require 'nokogiri'
      Nokogiri::HTML string, nil, encoding
    when '.xml'
      require 'nokogiri'
      Nokogiri::XML string, nil, encoding
    when '.yml', '.yaml'
      require 'yaml'
      YAML.load string
    else
      string
    end
  end
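The extension-based dispatch in #parse can be illustrated without Dragonfly. The sketch below is a simplified stand-in, not the library's method: `parse_by_extension` is a hypothetical helper that takes a plain path and string rather than a temp object, covers only the JSON and YAML branches using the standard library, and uses `YAML.safe_load` where the source uses `YAML.load`.

```ruby
require 'json'
require 'yaml'

# Hypothetical helper, not part of Unbreakable: picks a parser from the
# file extension, as #parse does, but works on a plain path and string.
def parse_by_extension(path, string)
  case File.extname(path)
  when '.json'
    JSON.parse(string)
  when '.yml', '.yaml'
    # safe_load is swapped in for YAML.load in this sketch.
    YAML.safe_load(string)
  else
    # Unrecognized extensions return the raw string unchanged.
    string
  end
end

parse_by_extension('data.json', '{"a": 1}')    # parsed as JSON
parse_by_extension('data.yml', "a: 1\n")       # parsed as YAML
parse_by_extension('notes.txt', 'plain text')  # returned as-is
```

As in the source, anything that is not a recognized extension falls through to the `else` branch, so callers always get a value back.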
#process(args) ⇒ Object
Processes cached files into machine-readable data.
  # File 'lib/unbreakable/scraper.rb', line 186
  def process(args)
    processable.each do |record|
      fetch(record).process(:transform, :args => args).apply
    end
  end
#processable ⇒ Array<String>
Returns a list of record IDs to process.
  # File 'lib/unbreakable/scraper.rb', line 194
  def processable
    raise NotImplementedError
  end
#retrieve(args) ⇒ Object
Caches remote files to the datastore for later processing.
  # File 'lib/unbreakable/scraper.rb', line 180
  def retrieve(args)
    raise NotImplementedError
  end
#run(args) ⇒ Object
Only call this method once per scraper instance.
Runs the command. Most often run from a command-line script as:
  scraper.run(ARGV)

  # File 'lib/unbreakable/scraper.rb', line 117
  def run(args)
    opts.parse!(args)
    command = args.shift
    case command
    when 'retrieve'
      retrieve(args)
    when 'process'
      process(args)
    when 'config'
      print_configuration @app
    when nil
      puts opts
    else
      # Allow subclasses to add more commands.
      if self.commands.include? command.to_sym
        send command, args
      else
        opts.abort "'#{command}' is not a #{opts.program_name} command. See '#{opts.program_name} --help'."
      end
    end
  end
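The dispatch pattern in #run (parse options, shift off the command, route to a method) can be sketched in isolation. The class below is a hypothetical stand-in named `TinyRunner`, not the library's scraper: it records calls instead of retrieving or processing anything, and it raises `ArgumentError` for unknown commands where the source calls `opts.abort`.

```ruby
require 'optparse'

# Hypothetical stand-in for a scraper; not part of Unbreakable.
class TinyRunner
  attr_reader :calls

  def initialize
    @calls = []
    @opts = OptionParser.new
  end

  # Records the call instead of caching remote files.
  def retrieve(args)
    @calls << [:retrieve, args]
  end

  # Records the call instead of processing cached files.
  def process(args)
    @calls << [:process, args]
  end

  def run(args)
    # Strip recognized options, then treat the first remaining word
    # as the command and the rest as its arguments.
    @opts.parse!(args)
    command = args.shift
    case command
    when 'retrieve' then retrieve(args)
    when 'process'  then process(args)
    when nil        then puts @opts
    else raise ArgumentError, "'#{command}' is not a known command"
    end
  end
end

runner = TinyRunner.new
runner.run(%w[retrieve http://www.example.com/])
```

Note that both `parse!` and `shift` modify the array passed in, so a caller that needs the original arguments afterwards should pass a copy.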
#specific_options ⇒ Object
Override to add specific options to the option parser.
  def specific_options
    @opts.on('--echo ARG', 'Write a string to standard output') do |x|
      puts x
    end
  end

  # File 'lib/unbreakable/scraper.rb', line 100
  def ; end
#store(opts = {}, &block) ⇒ Object
Stores a record in the datastore.
  # File 'lib/unbreakable/scraper.rb', line 142
  def store(opts = {}, &block)
    datastore.defer_store(opts, &block)
  end