Scrapes
Scrapes is a framework for crawling and scraping multi-page web sites.
Unlike other scraping frameworks, Scrapes is designed to work with “dirty” web sites, that is, sites that were never designed to have their data extracted programmatically.
It includes features for both the initial development of a scraper and the continued maintenance of that scraper. These features include:

- Rule-based selection and extraction of data, using CSS selectors or pseudo-XPath expressions
- A caching system, so that during development you don’t have to repeatedly download pages from a web server while you experiment with your selectors and extractors
- A validation system that helps detect web site changes that would otherwise silently invalidate your extraction rules
- Support for initiating a session with the web server, and passing session cookies back to it
- When all else fails, you can run a web page through the xsltproc XSLT processor to generate an XML document that can then be fed to your rule-based parser
- A useful set of post-processing methods, such as normalize_name (see the sketch after this list)
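As a rough, illustrative sketch only (the actual normalize_name shipped with Scrapes may behave differently), a name-normalizing post-processor does cleanup along these lines:

# Illustrative only; not the actual Scrapes implementation.
# Turns "SMITH, john q." into "John Q. Smith".
def normalize_name(name)
  last, first = name.split(',', 2).map {|part| part.strip }
  [first, last].compact.join(' ').split(/\s+/).map {|w| w.capitalize }.join(' ')
end

normalize_name('SMITH, john q.') # => "John Q. Smith"
normalize_name('ada LOVELACE')   # => "Ada Lovelace"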
Installing Scrapes
gem install scrapes --include-dependencies
Dependencies
- Hpricot: code.whytheluckystiff.net/hpricot/wiki/AnHpricotShowcase
- Rextra: rubyforge.org/projects/rextra2/
Quick Start
You start by writing a class for parsing a single page:
# process the Google.com index.html page
class GoogleMain < Scrapes::Page
  # make sure that the :about_link rule matched the web page
  validates_presence_of(:about_link)

  # extract the link to the about page
  rule(:about_link, 'a[@href*="about"]', '@href', 1)
end

# process the Google.com about page
class GoogleAbout < Scrapes::Page
  # ensure that the :title rule below matched the web page
  validates_presence_of(:title)

  # extract the text inside the <title></title> tag
  rule(:title, 'title', 'text()', 1)
end
Then you start a scraping session and use those classes to process the web site:
Scrapes::Session.start do |session|
  session.page(GoogleMain, 'http://google.com') do |main_page|
    session.page(GoogleAbout, main_page.about_link) do |about_page|
      puts about_page.title + ': ' + session.absolute_uri(main_page.about_link)
    end
  end
end
On my machine, this code produces:
About Google: http://www.google.com/intl/en/about.html
For more information, please review the following classes:
- Scrapes::Session
- Scrapes::Page
- Scrapes::RuleParser
- Scrapes::Hpricot::Extractors
Development Tips
Add something like this to your .irbrc:
require 'rubygems'
require 'yaml'
require 'open-uri'
require 'hpricot'
require 'scrapes'
# quick helper: fetch a URL and parse the HTML with Hpricot
def h(url) Hpricot(open(url)) end
Then use it like this in irb to understand how Hpricot selectors work:
doc = h 'http://www.foobar.com/'
links = doc.search('table/a[@href]') # for example
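While you’re exploring, a few more standard Hpricot idioms (plain Hpricot, nothing Scrapes-specific) are handy in irb:

(doc/'a')                      # doc/'...' is shorthand for doc.search
doc.search('//a[@href]').each do |a|
  puts a['href']               # read an attribute off an element
end
doc.at('title').inner_text     # at returns only the first match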
To understand the text extractors (see Scrapes::Hpricot::Extractors):

texts(links)
word(links.first) # etc.
Converting normal XPath to Hpricot XPath, sort of:

There are various Firefox add-ons, for example, that display the XPath to a selected node. Hpricot, however, uses a different syntax (code.whytheluckystiff.net/hpricot/wiki/SupportedXpathExpressions). The following method is a first try at the conversion:
def xpath_to_hpricot(path)
  # drop html/tbody elements and the empty string left by the leading slash
  # (e.empty? works in plain Ruby; no ActiveSupport blank? needed)
  path.split('/').reject {|e| e =~ /^(html|tbody)$/ or e.empty? }.map do |e|
    # rewrite [n] as :eq(n), shifting the 1-based XPath index to 0-based
    res = e.sub(/\[/, ':eq(').sub(/\]/, ')')
    res.sub(/\d+/, (/(\d+)/.match(res).to_s.to_i - 1).to_s)
  end.join('//')
end
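For example, given a made-up XPath of the kind a Firefox add-on might report:

xpath_to_hpricot('/html/body/table[2]/tbody/tr[1]/td/a')
# => "body//table:eq(1)//tr:eq(0)//td//a"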
Hpricot Bugs

- A selector that names an attribute without the leading ‘@’ (for example, a[href]) will hang, while the same selector with it (a[@href]) won’t. Just make sure you have the ‘@’ in front of the attribute name.
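In practice, that means always writing the attribute form with the ‘@’:

doc.search('a[@href]')  # safe: attribute name prefixed with '@'
# doc.search('a[href]') # avoid: the '@'-less form can hang Hpricot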
Credits
- Peter Jones, author and maintainer
- Michael Garriss, author and maintainer
- Bob Showalter, continuous improvements and maintenance
- Assaf Arkin, rule inspiration from trac.labnotes.org/cgi-bin/trac.cgi/wiki/Ruby/MicroformatParser