Extraloop Redis Storage

Description

Persistence layer for the ExtraLoop data extraction toolkit. This module is implemented as a wrapper around Ohm, an object-hash mapping library which makes easy storing structured data into Redis. Includes a convinent command line tool that allows to list, filter, and delete harvested datasets, as well as exporting them on local files or remote data stores (i.e Google Fusion tables).

Installation

gem install extraloop-redis-storage

Usage

Extraloop’s Redis storage module decorates ExtraLoop::ScraperBase and ExtraLoop::IterativeScraper instances with the set_storage method: a helper method that allows to specify how the scraped data should be stored.

require "extraloop/redis-storage"

class AmazonReview < ExtraLoop::Storage::Record
  attribute :title
  attribute :rank
  attribute :date

  def validate
    assert (0..5).include?(rank.to_i), "Rank not in range"
  end
end

scraper = AmazonReviewScraper.new("0262560992").
  .set_storage(AmazonReview, "Amazon reviews of 'The Little Schemer'")
  .run()

At each scraper run, the ExtraLoop storage module internally instantiates a session (see ExtraLoop::Storage::ScrapingSession) and associates the extracted records to it. The ‘AmazonReview` records just created, can now be accessed by calling the `#records` metod on scraper session object.

reviews = scraper.session.records

#set_storage

The set_storage method accepts the following arguments:

  • model A Ruby constant or a symbol specifying the model to be used for storing the extracted data. If a symbol is passed, it is assumed that a model does not exist and the storage module dynamically generates one by subclassing ExtraLoop::Storage::Record.

  • session_title A human readable title for the extracted dataset (optional).

Command line interface

Once installed, the gem will also add to your system path the extraloop executable: a command line interface to the datasets harvested through ExtraLoop. A list of datasets can be obtained by running:

extraloop datastore list

This will generate a table like the following one:

 id | title                              | model           | records
--------------------------------------------------------------------
 48 | 1330106699 GoogleNewsStory Dataset | GoogleNewsStory | 110    
 49 | 1330106948 AmazonReview Dataset    | AmazonReview    | 0      
 51 | 1330107087 GoogleNewsStory Dataset | GoogleNewsStory | 110    
 52 | 1330111630 AmazonReview Dataset    | AmazonReview    | 10

Datasets can be removed using the delete subcommand:

extraloop datastore delete [id]

Where id is either a single scraping session id, or a session id range (e.g. 48..52).

From the Redis datastore, ExtraLoop datasets can be exported to disk as CSV, JSON, or YAML documents:

extraloop datastore export 51..52 -f csv

Similarly, stored datasets can be uploaded to a remote datastore:

extraloop datastore push 51..48 fusion_tables -c google_username:password

While Google’s Fusion Tables is currently the only one implemented, support for pushing dataset to other remote datastores (e.g. couchDB, cartoDB, and CKAN Webstore) will be added soon.