Scrapes
Scrapes is a framework for crawling and scraping multi-page web sites.
Unlike other scraping frameworks, Scrapes is designed to work with “dirty” web sites, that is, sites that were never designed to have their data extracted programmatically.
It includes features for both the initial development of a scraper and the continued maintenance of that scraper. These features include:

- Rule-based selection and extraction of data, using CSS selectors or pseudo-XPath expressions
- A caching system, so that during development you don’t have to repeatedly download pages from a web server while you experiment with your selectors and extractors
- A validation system that helps detect web site changes that would otherwise silently invalidate your extraction rules
- Support for initiating a session with the web server, and passing session cookies back to it
- When all else fails, you can run a web page through the xsltproc XSLT processor to generate an XML document that can then be fed to your rule-based parser
- A useful set of post-processing methods, such as normalize_name (see the sketch after this list)
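As a rough, illustrative sketch only (the actual normalize_name shipped with Scrapes may behave differently), a name-normalizing post-processor does cleanup along these lines:

# Illustrative only; not the actual Scrapes implementation.
# Turns "SMITH, john q." into "John Q. Smith".
def normalize_name(name)
  last, first = name.split(',', 2).map {|part| part.strip }
  [first, last].compact.join(' ').split(/\s+/).map {|w| w.capitalize }.join(' ')
end

normalize_name('SMITH, john q.') # => "John Q. Smith"
normalize_name('ada LOVELACE')   # => "Ada Lovelace"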
Installing Scrapes
gem install scrapes --include-dependencies
Dependencies
- Hpricot: code.whytheluckystiff.net/hpricot/wiki/AnHpricotShowcase
- Rextra: rubyforge.org/projects/rextra2/
Quick Start
You start by writing a class for parsing a single page:
# process the Google.com index.html page
class GoogleMain < Scrapes::Page
  # make sure that the :about_link rule matched the web page
  validates_presence_of(:about_link)

  # extract the link to the about page
  rule(:about_link, 'a[@href*="about"]', '@href', 1)
end

# process the Google.com about page
class GoogleAbout < Scrapes::Page
  # ensure that the :title rule below matched the web page
  validates_presence_of(:title)

  # extract the text inside the <title></title> tag
  rule(:title, 'title', 'text()', 1)
end
Then you start a scraping session and use those classes to process the web site:
Scrapes::Session.start do |session|
  session.page(GoogleMain, 'http://google.com') do |main_page|
    session.page(GoogleAbout, main_page.about_link) do |about_page|
      puts about_page.title + ': ' + session.absolute_uri(main_page.about_link)
    end
  end
end
On my machine, this code produces:
About Google: http://www.google.com/intl/en/about.html
For more information, please review the following classes:
- Scrapes::Session
- Scrapes::Page
- Scrapes::RuleParser
- Scrapes::Hpricot::Extractors
Development Tips
Add something like this to your .irbrc:
require 'rubygems'
require 'yaml'
require 'open-uri'
require 'hpricot'
require 'scrapes'
# quick helper: fetch a URL and parse the HTML with Hpricot
def h(url) Hpricot(open(url)) end
Then use it like this in irb to understand how Hpricot selectors work:
doc = h 'http://www.foobar.com/'
links = doc.search('table/a[@href]') # for example
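While you’re exploring, a few more standard Hpricot idioms (plain Hpricot, nothing Scrapes-specific) are handy in irb:

(doc/'a')                      # doc/'...' is shorthand for doc.search
doc.search('//a[@href]').each do |a|
  puts a['href']               # read an attribute off an element
end
doc.at('title').inner_text     # at returns only the first match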
To understand the text extractors (see Scrapes::Hpricot::Extractors):

texts(links)
word(links.first) # etc.
Converting normal XPath to Hpricot XPath, sort of:

There are various Firefox add-ons, for example, that display the XPath to a selected node. Hpricot, however, uses a different syntax (code.whytheluckystiff.net/hpricot/wiki/SupportedXpathExpressions). The following method is a first try at the conversion:
def xpath_to_hpricot(path)
  # drop html/tbody elements and the empty string left by the leading slash
  # (e.empty? works in plain Ruby; no ActiveSupport blank? needed)
  path.split('/').reject {|e| e =~ /^(html|tbody)$/ or e.empty? }.map do |e|
    # rewrite [n] as :eq(n), shifting the 1-based XPath index to 0-based
    res = e.sub(/\[/, ':eq(').sub(/\]/, ')')
    res.sub(/\d+/, (/(\d+)/.match(res).to_s.to_i - 1).to_s)
  end.join('//')
end
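For example, given a made-up XPath of the kind a Firefox add-on might report:

xpath_to_hpricot('/html/body/table[2]/tbody/tr[1]/td/a')
# => "body//table:eq(1)//tr:eq(0)//td//a"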
Hpricot Bugs

- A selector that names an attribute without the leading ‘@’ (for example, a[href]) will hang, while the same selector with it (a[@href]) won’t. Just make sure you have the ‘@’ in front of the attribute name.
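In practice, that means always writing the attribute form with the ‘@’:

doc.search('a[@href]')  # safe: attribute name prefixed with '@'
# doc.search('a[href]') # avoid: the '@'-less form can hang Hpricot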
Credits
- Peter Jones, author and maintainer
- Michael Garriss, author and maintainer
- Bob Showalter, continuous improvements and maintenance
- Assaf Arkin, rule inspiration from trac.labnotes.org/cgi-bin/trac.cgi/wiki/Ruby/MicroformatParser