Unbreakable
Unbreakable is a Ruby gem that abstracts and bulletproofs common web scraping tasks. It forces a separation of concerns for maximum flexibility. Loose coupling allows for easier modification and re-use of component parts.
Installation
gem install unbreakable
What's the problem?
A common web scraping project involves four steps. As an illustrative example, we'll find the language with the most articles on Wikipedia, using standard command-line tools:
Retrieve some raw HTML
# Download the list of Wikipedias
curl -s -o in.html http://s23.org/wikistats/wikipedias_html
Process the raw HTML into a machine-readable format
# Extract the language with the most articles
grep '><td class="number">1<' in.html | sed 's/.*e">\([^<]*\).*/\1/' > out.txt
Release the data to the community through an API and/or as a download
# Upload the machine-readable data to a public server
curl http://pastie.org/pastes -F "paste[parser_id]=6" -F "paste[authorization]=burger" \
  -F "paste[body]=`cat out.txt`" -s -o /dev/null -L -w "%{url_effective}"
Use the data as you like
echo "The most popular language is `curl -s http://pastie.org/pastes/2487244/download`."
In most web scraping projects, at least one step is tightly coupled to another, making modification or re-use of individual steps by the community difficult. It is especially common for authors to tailor the workflow to their specific use of the data. The coupling produces esoteric code, with the domain logic of the author's use case slipping into the otherwise generic code for retrieving and processing data. Because the scrapers are embedded in a larger project, they are often undiscoverable.
Furthermore, the way the first two steps store data may be incompatible with some environments. If the processor code stores data in a database, but you prefer flat files for your use case, you may face a lengthy refactoring.
What's the solution?
Web scraping projects should write standalone downloaders, processors, APIs and apps.
Retrieving should be separate from processing, if only to avoid hammering remote servers while developing or tweaking a processor. This separation also allows the community to develop multiple processors of the same raw data without duplication of effort.
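As a rough illustration of this separation, here is a minimal Ruby sketch. It is not Unbreakable's API: the class names `Downloader` and `TopLanguageProcessor`, the injectable `fetcher`, and the sample HTML row are all hypothetical. The downloader caches the raw response on disk, so the processor can be re-run and tweaked without touching the remote server again.

```ruby
require "fileutils"
require "net/http"
require "tmpdir"

# Hypothetical downloader: fetches a URL once and caches the raw body on disk.
class Downloader
  # fetcher is any callable that takes a URL string and returns the body.
  def initialize(cache_dir, fetcher = ->(url) { Net::HTTP.get(URI(url)) })
    @cache_dir = cache_dir
    @fetcher = fetcher
    FileUtils.mkdir_p(cache_dir)
  end

  # Returns the path of the cached file, hitting the network only on a miss.
  def download(name, url)
    path = File.join(@cache_dir, name)
    File.write(path, @fetcher.call(url)) unless File.exist?(path)
    path
  end
end

# Hypothetical processor: reads only the cached file, never the network.
class TopLanguageProcessor
  # Mirrors the grep/sed pipeline: find the rank-1 row, then capture the
  # text after the last 'e">' on that line.
  def process(path)
    row = File.read(path).lines.find { |line| line.include?('><td class="number">1<') }
    row && row[/.*e">([^<]*)/, 1]
  end
end

# Usage with a fake fetcher (assumed sample row, not real Wikistats markup).
Dir.mktmpdir do |dir|
  sample = %(<tr><td class="number">1</td><td><a class="language">English</a></td></tr>\n)
  downloader = Downloader.new(dir, ->(_url) { sample })
  path = downloader.download("in.html", "http://example.com/")
  puts TopLanguageProcessor.new.process(path)
end
```

Because the processor depends only on a file path, a second processor for the same raw data could be written without any change to the downloader.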
Standalone components are easier for the community to discover, modify and re-use, as they do not need to concern themselves with the other parts of the workflow or expose themselves to the use case of the original author.
The code for retrieving and processing data should delegate the persistence of data to a storage layer. The community can then develop various, swappable storage adapters and will not be bound to any single solution.
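One way to picture a swappable storage layer is the sketch below. It assumes a hypothetical two-method adapter interface (`write` and `read`); the names `MemoryStore`, `FileStore` and `Processor` are invented for illustration and are not taken from Unbreakable's documentation.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical adapter that keeps everything in a Hash.
class MemoryStore
  def initialize
    @data = {}
  end

  def write(key, value)
    @data[key] = value
  end

  def read(key)
    @data[key]
  end
end

# Hypothetical adapter that keeps each key as a flat file in one directory.
# Keys are assumed to be plain file names without path separators.
class FileStore
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  def write(key, value)
    File.write(File.join(@dir, key), value)
  end

  def read(key)
    path = File.join(@dir, key)
    File.exist?(path) ? File.read(path) : nil
  end
end

# The processor knows only the adapter interface, not where data lives.
class Processor
  def initialize(store)
    @store = store
  end

  # Reads raw data, transforms it with the given block, stores the result.
  def run(raw_key, out_key)
    @store.write(out_key, yield(@store.read(raw_key)))
  end
end

# Usage: the same processing code runs unchanged against either adapter.
Dir.mktmpdir do |dir|
  [MemoryStore.new, FileStore.new(dir)].each do |store|
    store.write("in.html", "english")
    Processor.new(store).run("in.html", "out.txt") { |raw| raw.capitalize }
    puts "#{store.class}: #{store.read("out.txt")}"
  end
end
```

Switching from flat files to a database would then mean writing one more adapter with the same two methods, rather than refactoring the processor.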
Unbreakable helps you write standalone downloaders and processors, and provides an extensible persistence layer.
Getting started
For now, the best way to learn how to use this gem is to read the documentation.
rake yard
open doc/index.html
Bugs? Questions?
This gem's main repository is on GitHub: http://github.com/opennorth/unbreakable, where your contributions, forks, bug reports, feature requests, and feedback are welcome.
Copyright (c) 2011 Open North Inc., released under the MIT license