A utility to archive webpages over time.
Takes snapshots and makes incremental backups of webpage assets so you can follow the pages’ evolution over time.
- Assets are stored in a git repository to simplify incremental storage and retrieval
- Snapshots and thumbnails are stored in a plain directory so they can easily be served by a web server
- The list of webpages and archive instances is stored in an SQL database
- Some caching data is stored in the same database
Required tools:
- An SQL database supported by Sequel
- PhantomJS [code.google.com/p/phantomjs]
Installation
- Install the required tools
- Install the gem
- All configuration items have default values; have a look below if you want to customize them (the default database configuration requires the sqlite3 gem)
- Use it! All the required files and the database structure will be created on the first call
API
The public API is provided by WebpageArchivist::WebpageArchivist, for example:
require 'webpage-archivist'
archivist = WebpageArchivist::WebpageArchivist.new
webpage = archivist.add_webpage('http://www.nytimes.com/', 'The New York Times')
archivist.fetch_webpages [webpage.id]
Models are available in the lib/webpage-archivist/models.rb file; have a look at the Sequel API if you want to query them.
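The two calls shown in the example above can be combined when several pages need to be registered at once. The sketch below is only an illustration: `register_webpages` is a hypothetical helper that is not part of the gem, and it assumes only what the example shows, namely that `add_webpage(url, name)` returns an object responding to `id`.

```ruby
# Hypothetical helper (not part of the gem): registers several pages and
# returns their ids, ready to be passed to fetch_webpages.
# `archivist` is any object responding to add_webpage(url, name);
# `pages` is a hash mapping each URL to its display name.
def register_webpages(archivist, pages)
  pages.map do |url, name|
    archivist.add_webpage(url, name).id
  end
end
```

With the gem loaded, this would presumably be used as `ids = register_webpages(archivist, 'http://www.nytimes.com/' => 'The New York Times')` followed by `archivist.fetch_webpages ids`.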
Configuration
Basic configuration is done through environment variables:
- DATABASE_URL: database URL, defaults to sqlite://#{Dir.pwd}/webpage-archivist.sqlite3; the syntax is described in the Sequel documentation, remember to add the required database gem
- ARCHIVIST_ASSETS_PATH: path to store the assets, defaults to ./archivist_assets
- ARCHIVIST_SNAPSHOTS_PATH: path to store the thumbnails, defaults to ./archivist_snapshots
- ARCHIVIST_MAX_RUNNING_REQUESTS: number of element requests running in parallel (not so important since requests are run using EventMachine), defaults to 20
- PHANTOMJS_PATH: path to the PhantomJS executable if it isn’t in the path
- GRAPHICS_MAGICK_PATH: path to the GraphicsMagick executable if it isn’t in the path
- BACKGROUND_THREAD_POOL_SIZE: EventMachine pool size for background tasks like taking the snapshots, defaults to 20
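To make the defaults listed above concrete, here is a minimal sketch of how such variables could be resolved. It is not the gem's actual code: `archivist_settings` and the hash keys are hypothetical names chosen for illustration, and only the variable names and default values come from the list above.

```ruby
# Illustrative only: resolves the documented environment variables from an
# environment-like hash, falling back to the documented defaults.
# `archivist_settings` is a hypothetical helper, not part of the gem.
def archivist_settings(env = ENV)
  {
    database_url: env.fetch('DATABASE_URL') { "sqlite://#{Dir.pwd}/webpage-archivist.sqlite3" },
    assets_path: env.fetch('ARCHIVIST_ASSETS_PATH', './archivist_assets'),
    snapshots_path: env.fetch('ARCHIVIST_SNAPSHOTS_PATH', './archivist_snapshots'),
    max_running_requests: Integer(env.fetch('ARCHIVIST_MAX_RUNNING_REQUESTS', '20')),
    # nil means the executable is expected to be found in the PATH
    phantomjs_path: env['PHANTOMJS_PATH'],
    graphics_magick_path: env['GRAPHICS_MAGICK_PATH'],
    background_thread_pool_size: Integer(env.fetch('BACKGROUND_THREAD_POOL_SIZE', '20'))
  }
end
```

Calling `archivist_settings` with no argument reads the real `ENV`; passing a plain hash makes it easy to check which values would apply for a given environment.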
Configuration for snapshotting is done through the WebpageArchivist::Snapshoter class.
To enable debugging use:
WebpageArchivist.log = true
Connect to the database / run migrations
The database connection is available as WebpageArchivist::DATABASE.
If you want to run your own migrations, use:
require 'webpage-archivist/migrations'
WebpageArchivist::Migrations.migration 'create table foo' do
  WebpageArchivist::DATABASE.create_table :foos do
    primary_key :id
    # ...
  end
end
WebpageArchivist::Migrations.new.run
This way your migrations will be run when the corresponding class is loaded.