Unbreakable

Unbreakable is a Ruby gem that abstracts and bulletproofs common web scraping tasks. It forces a separation of concerns for maximum flexibility. Loose coupling allows for easier modification and re-use of component parts.

Installation

gem install unbreakable

What's the problem?

A common web scraping project involves four steps. As an illustrative example, we'll find the language with the most articles on Wikipedia using standard command-line tools:

  1. Retrieve some raw HTML

    # Download the list of Wikipedias
    curl -s -o in.html http://s23.org/wikistats/wikipedias_html
    
  2. Process the raw HTML into a machine-readable format

    # Extract the language with the most articles
    grep '><td class="number">1<' in.html | sed 's/.*e">\([^<]*\).*/\1/' > out.txt
    
  3. Release the data to the community through an API and/or as a download

    # Upload the machine-readable data to a public server
    curl http://pastie.org/pastes -F "paste[parser_id]=6" -F "paste[authorization]=burger" \
      -F "paste[body]=`cat out.txt`" -s -o /dev/null -L -w "%{url_effective}"
    
  4. Use the data as you like

    echo "The most popular language is `curl -s http://pastie.org/pastes/2487244/download`."
    

In most web scraping projects, at least one step is tightly coupled to another, making modification or re-use of individual steps by the community difficult. It is especially common for authors to tailor the workflow to their specific use of the data. The coupling produces esoteric code, with the domain logic of the author's use case slipping into the otherwise generic code for retrieving and processing data. Because the scrapers are embedded in a larger project, they are often undiscoverable.

Furthermore, the way the first two steps store data may be incompatible with some environments. If the processor code stores data in a database, but you prefer flat files for your use case, you may face a lengthy refactor.

What's the solution?

Web scraping projects should be built from standalone downloaders, processors, APIs and apps.

Retrieving should be separate from processing, if only to avoid hammering remote servers while developing or tweaking a processor. This separation also allows the community to develop multiple processors of the same raw data without duplication of effort.
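As a rough sketch of this separation (not Unbreakable's actual API), the downloader below caches raw HTML on disk and the processor reads only from that cache, so re-running the processor never touches the remote server. The class names `CachingDownloader` and `TopLanguageProcessor`, the `fetcher:` option, and the table markup matched by the regular expression are all assumptions made for illustration.

```ruby
require 'fileutils'
require 'net/http'
require 'uri'

# Hypothetical downloader: fetches a URL at most once and caches the raw body.
class CachingDownloader
  # fetcher is injectable so the class can be exercised without network access.
  def initialize(cache_dir, fetcher: ->(url) { Net::HTTP.get(URI(url)) })
    @cache_dir = cache_dir
    @fetcher = fetcher
    FileUtils.mkdir_p(cache_dir)
  end

  # Returns the path of the cached file, downloading only if it is missing.
  def download(url, name)
    path = File.join(@cache_dir, name)
    File.write(path, @fetcher.call(url)) unless File.exist?(path)
    path
  end
end

# Hypothetical processor: works on cached files only and never hits the network.
class TopLanguageProcessor
  # Assumed markup: <td class="number">1</td><td class="language">English</td>
  TOP_ROW = %r{<td class="number">1</td>\s*<td class="language">([^<]*)<}

  # Returns the language in the first-ranked row, or nil if no row matches.
  def process(path)
    File.read(path)[TOP_ROW, 1]
  end
end

# Usage (performs a real HTTP request on the first run only):
#   path = CachingDownloader.new('cache').download(
#     'http://s23.org/wikistats/wikipedias_html', 'in.html')
#   puts TopLanguageProcessor.new.process(path)
```

Because the two classes share nothing but a file path, a second processor could be written against the same cached file without modifying the downloader.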

Standalone components are easier for the community to discover, modify and re-use, because anyone working on one component does not need to deal with the other parts of the workflow or with the original author's use case.

The code for retrieving and processing data should delegate the persistence of data to a storage layer. The community can then develop various, swappable storage adapters and will not be bound to any single solution.
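One way to picture such a storage layer is sketched below. It is a minimal illustration and does not reflect Unbreakable's own adapter interface: `MemoryStore`, `FileStore`, `TitleProcessor` and the `read`/`write` method names are hypothetical. The processor depends only on those two methods, so either adapter can be swapped in.

```ruby
require 'fileutils'

# Hypothetical adapter: keeps everything in a Hash (handy for tests).
class MemoryStore
  def initialize
    @data = {}
  end

  def write(key, value)
    @data[key] = value
  end

  def read(key)
    @data[key]
  end
end

# Hypothetical adapter: one flat file per key inside a directory.
class FileStore
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  def write(key, value)
    File.write(File.join(@dir, key), value)
  end

  # Returns nil for a missing key, matching MemoryStore.
  def read(key)
    path = File.join(@dir, key)
    File.exist?(path) ? File.read(path) : nil
  end
end

# Hypothetical processor: knows nothing about where the data lives.
class TitleProcessor
  def initialize(store)
    @store = store
  end

  # Extracts the <title> text from the raw HTML and stores it under out_key.
  def process(raw_key, out_key)
    title = @store.read(raw_key)[%r{<title>([^<]*)</title>}i, 1]
    @store.write(out_key, title.to_s)
  end
end

# Usage: TitleProcessor.new(FileStore.new('data')).process('in.html', 'out.txt')
```

Switching from flat files to a database would then mean writing one more adapter with the same two methods, leaving the processor untouched.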

Unbreakable helps you write standalone downloaders and processors, and provides an extensible persistence layer.

Getting started

For now, the best way to learn how to use this gem is to read the documentation.

rake yard
open doc/index.html

Bugs? Questions?

This gem's main repository is on GitHub: http://github.com/opennorth/unbreakable, where your contributions, forks, bug reports, feature requests, and feedback are greatly welcomed.

Copyright (c) 2011 Open North Inc., released under the MIT license