Module: Unbreakable
Defined in:
lib/unbreakable.rb,
lib/unbreakable/scraper.rb,
lib/unbreakable/version.rb,
lib/unbreakable/observers/log.rb,
lib/unbreakable/decorators/timeout.rb,
lib/unbreakable/observers/observer.rb,
lib/unbreakable/processors/transform.rb,
lib/unbreakable/data_storage/file_data_store.rb
Overview
When using this gem, you start by defining a Scraper with methods for retrieving and processing data. Retrieved data is kept in a datastore from DataStorage; the gem currently provides only a FileDataStore. You may enhance a datastore with Decorators and Observers: for example, a Timeout decorator that retries on timeout with exponential backoff, and a Log observer that logs retrieval progress. You must also define a Processor to turn the raw data into machine-readable data.
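To make "retry on timeout with exponential backoff" concrete, here is a minimal standalone sketch of the general technique. It is not the gem's Timeout decorator: `with_backoff`, its keyword arguments, and the injectable `sleeper` are hypothetical names chosen for illustration, and only Ruby's standard `timeout` library is assumed.

```ruby
require 'timeout'

# Hypothetical helper, NOT part of Unbreakable: runs the block and, on
# Timeout::Error, waits base_delay * 2**(attempt - 1) seconds before retrying.
# After max_attempts failed attempts the error is re-raised.
def with_backoff(max_attempts: 4, base_delay: 0.5, sleeper: Kernel.method(:sleep))
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue Timeout::Error
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Usage: simulate two timeouts, then succeed. The sleeper records the
# delays instead of actually sleeping, so the example runs instantly.
delays = []
result = with_backoff(base_delay: 1, sleeper: ->(seconds) { delays << seconds }) do |attempt|
  raise Timeout::Error, 'simulated timeout' if attempt < 3
  "fetched on attempt #{attempt}"
end
```

Doubling the delay after each failure gives a slow remote server progressively more room to recover, while the attempt cap keeps a permanently unreachable host from stalling the run.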
A simple skeleton scraper:
require 'unbreakable'

class MyScraper < Unbreakable::Scraper
  def retrieve(args)
    # download all the documents
  end

  def processable
    # return a list of documents to process
  end
end

class MyProcessor < Unbreakable::Processors::Transform
  def perform
    # return the transformed record as a hash, array, etc.
  end

  def persist(arg)
    # store the hash/array/etc. in Mongo, MySQL, YAML, etc.
  end
end

scraper = MyScraper.new
scraper.processor.register MyProcessor
scraper.configure do |c|
  # configure the scraper
end
scraper.run(ARGV)
Every scraper script can run as a command-line script. Try it!
$ ruby myscraper.rb
usage: irb [options] <command> [<args>]

The most commonly used commands are:
  retrieve    Cache remote files to the datastore for later processing
  process     Process cached files into machine-readable data
  config      Print the current configuration

Specific options:
  --root_path ARG                  default "/var/tmp/unbreakable"
  --[no-]store_meta                default true
  --cache_duration ARG             default 31536000
  --fallback_mime_type ARG         default "application/octet-stream"
  --secret ARG                     default "secret yo"
  --[no-]trust_file_extensions     default true

General options:
  -h, --help                       Display this screen
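The usage output does not state a unit for `cache_duration`. Assuming the default of 31536000 is a number of seconds (an assumption, not confirmed by this page), a quick check shows it corresponds to 365 days:

```ruby
# Assumption: cache_duration is expressed in seconds.
SECONDS_PER_DAY = 24 * 60 * 60      # 86_400
default_cache_duration = 31_536_000 # default shown in the usage output

days = default_cache_duration / SECONDS_PER_DAY
remainder = default_cache_duration % SECONDS_PER_DAY
puts "#{days} days, #{remainder} seconds left over"
```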
Defined Under Namespace
Modules: DataStorage, Decorators, Observers, Processors
Classes: InvalidRemoteFile, Scraper, UnbreakableError
Constant Summary
VERSION = "0.0.6"