Scrapouille
Scrapouille is a declarative XPath driven HTML scraper with an interactive mode as a bonus
Why XPath ? XPath is powerful enough to get any data on a HTML document (see http://www.w3schools.com/xpath/xpath_axes.asp)
Scrapouille run XPath queries using the nokogiri gem
Install
gem install 'scrapouille'
Test
rake
Usage
Interactive mode
From the command line you can interact with a remote web page as if it was local
$ scrapouille http://tennis.com/player.html # launch scrapouille on the command line with a provided URI
> //div[@class='player-name']/h1/child::text() # You will get a prompt. Enter a xpath query
Richard Gasquest # Get the result string
>
Behind the scene - during the session - the remote web page is stored in a Tempfile
for fast xpath interaction
You can also directly interact with a local file
$ scrapouille /Users/simon/web/player.html # launch scrapouille on the command line with a provided filepath
> //div[@class='player-name']/h1/child::text() # enter your xpath query
Richard Gasquest # Get the result String
>
Scraping programatically
Define a scraper
scraper = Scrapouille.new do
scrap 'fullname', at: "//div[@class='player-name']/h1/child::text()"
scrap 'image_url', at: "//div[@id='basic']//img/attribute::src"
scrap 'rank', at: "//div[@class='position']/text()" do |c|
Integer(c.sub('#', ''))
end
end
Use the scraper instance on an URI (as defined by open-uri
: filepath, http, ...)
results = scraper.scrap!('http://tennis-player.com/richard-gasquet')
results['fullname'] # => 'Richard Gasquest'
You can also run your scraper using a local HTML filepath for testing purposes
scraper.scrap!(File.join('..', 'player.html'))