A utility to archive webpages through time.

Takes snapshots and makes incremental backups of webpage assets so you can follow how the pages evolve through time.

Assets are stored in a git repository to simplify incremental storage and allow easy retrieval
Snapshots and thumbnails are stored in a plain directory so they can easily be served by a webserver
The list of webpages and the archive instances are stored in an SQL database
Some caching data is stored in the same database

Required tools:

An SQL database supported by Sequel
Git
GraphicsMagick
PhantomJS [code.google.com/p/phantomjs]

Installation

Install the required tools
Install the gem
All configuration items have default values; have a look below if you want to customize them (the default database configuration requires the sqlite3 gem)
Use it: all the required files and the database structure will be created on the first call

API

The public API is provided by WebpageArchivist::WebpageArchivist, for example:

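require 'webpage-archivist'

# Create the archivist
archivist = WebpageArchivist::WebpageArchivist.new
# Register a page to follow, giving its URL and a name
webpage = archivist.add_webpage('http://www.nytimes.com/', 'The New York Times')
# Fetch the pages whose ids are given
archivist.fetch_webpages [webpage.id]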

Models are available in the lib/webpage-archivist/models.rb file; have a look at the Sequel API if you want to query them.

Configuration

Basic configuration is done through environment variables:

DATABASE_URL : database URL, defaults to sqlite://#{Dir.pwd}/webpage-archivist.sqlite3. The syntax is the one Sequel uses for connection URLs; remember to add the required database gem
ARCHIVIST_ASSETS_PATH : path to store the assets, defaults to ./archivist_assets
ARCHIVIST_SNAPSHOTS_PATH : path to store the snapshots and thumbnails, defaults to ./archivist_snapshots
ARCHIVIST_MAX_RUNNING_REQUESTS : number of element requests running in parallel (not so important since requests are run using EventMachine), defaults to 20
PHANTOMJS_PATH : path to the PhantomJS executable if it isn’t in the path
GRAPHICS_MAGICK_PATH : path to the GraphicsMagick executable if it isn’t in the path
BACKGROUND_THREAD_POOL_SIZE : EventMachine pool size for background tasks like taking the snapshots, defaults to 20
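
As an illustration only, here is a minimal sketch of how the defaults listed above could be resolved from the environment. The archivist_settings helper and its hash keys are hypothetical and are not part of the gem; the variable names and default values are the ones documented above.

```ruby
# Hypothetical helper, not part of webpage-archivist: it only mirrors the
# environment variables and defaults documented in this README.
def archivist_settings(env = ENV)
  {
    # The block form computes the default only when the variable is missing
    database_url: env.fetch('DATABASE_URL') { "sqlite://#{Dir.pwd}/webpage-archivist.sqlite3" },
    assets_path: env.fetch('ARCHIVIST_ASSETS_PATH', './archivist_assets'),
    snapshots_path: env.fetch('ARCHIVIST_SNAPSHOTS_PATH', './archivist_snapshots'),
    # Environment values are strings, so numeric settings are converted explicitly
    max_running_requests: Integer(env.fetch('ARCHIVIST_MAX_RUNNING_REQUESTS', '20')),
    background_thread_pool_size: Integer(env.fetch('BACKGROUND_THREAD_POOL_SIZE', '20'))
  }
end

# Example: override a single value, keep every other default
settings = archivist_settings('ARCHIVIST_ASSETS_PATH' => '/var/archives/assets')
puts settings[:assets_path]
puts settings[:max_running_requests]
```

In practice you would simply export the variables (or set them in ENV) before the gem is required, presumably because the configuration is read when the library loads.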

Configuration for snapshotting is done through the WebpageArchivist::Snapshoter class.

To enable debug logging, use:

WebpageArchivist.log = true

Connect to the database / run migrations

The database connection is available as WebpageArchivist::DATABASE. If you want to run your own migrations, use:

require 'webpage-archivist/migrations'
WebpageArchivist::Migrations.migration 'create table foo' do
  WebpageArchivist::DATABASE.create_table :foos do
    primary_key :id
    # ...
  end
end

WebpageArchivist::Migrations.new.run

This way, your migrations will be run when the corresponding class is loaded.

Released under the MIT license