Ariel release 0.1.0

About - Ariel: A Ruby Information Extraction Library

Ariel is a library that allows you to extract information from semi-structured documents (such as websites). It is different to existing tools because rather than expecting the developer to write rules to extract the desired information, Ariel will use a small number of labeled examples to generate and learn effective extraction rules. It is developed by Alex Bradbury and released under the MIT license. Ariel was started as a Google Summer of Code project mentored by Austin Ziegler in 2006.

Install

gem install ariel

Announcement

I’m happy to announce the release of Ariel 0.1.0, the result of my Summer of Code work. This release should be easy to use, very functional, and hopefully useful - so it’s worth trying out. I’ve put a lot of effort into writing clear and straightforward documentation to get you started, so take a look at the docs available at ariel.rubyforge.org. In particular, flick through the tutorial and quick start guide, and see the FAQ for a vim snippet to make labeling examples a little easier. If you’re interested, you may also want to take a look at the theory page, where I’ve made a good start on describing the method Ariel uses to learn extraction rules. If you have any problems or find any bugs, just send me an email or add them to the issue tracker (see link below). Enjoy.

Quickstart/Basic usage

  • require 'ariel'

  • Define a structure for the information you wish to extract:

    structure = Ariel::Node::Structure.new do |r|
      r.item :title
      r.item :body
      r.list :comments do |c|
        c.list_item :comment do |d|
          d.item :author
          d.item :body
        end
      end
    end
    
  • Collect a few examples of the sort of document you wish to extract information from (pages from the same website for instance).

  • Label each example with tags such as <l:title>, <l:comment> and so on in the relevant places.

  • Ariel.learn structure, labeled_file1, labeled_file2, labeled_file3

  • Find the documents you want to extract information from.

  • extractions = Ariel.extract structure, unlabeled_file1, unlabeled_file2

  • extractions.search('comments/*/body').each {|e| puts e.extracted_text} => "Great stuff, loving it", "I love life", ...

  • extractions.at('comments/34') => nil (there is no 34th comment; #at returns the first result rather than an array of matches).
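
To make the labeling step more concrete, here is a rough sketch of what a hand-labeled example might look like for the structure defined above. The HTML page is invented, the closing-tag form (</l:title>) is an assumption, and labeled_spans is a hypothetical helper written for this illustration; it is not part of Ariel and does not reflect how Ariel reads labels internally. It only pulls out the text between matching label tags, so you can sanity-check your labels before learning.

```ruby
# Illustrative only: a made-up page labeled by hand. The closing-tag form
# (</l:title>) and the HTML itself are assumptions for this sketch.
LABELED_EXAMPLE = <<~HTML
  <html><body>
  <h1><l:title>My first post</l:title></h1>
  <div class="post"><l:body>Hello world.</l:body></div>
  <l:comments>
  <l:comment><p><b><l:author>Ann</l:author></b> <l:body>Great stuff, loving it</l:body></p></l:comment>
  <l:comment><p><b><l:author>Bob</l:author></b> <l:body>I love life</l:body></p></l:comment>
  </l:comments>
  </body></html>
HTML

# Hypothetical helper (not part of Ariel): returns the text enclosed by each
# <l:name>...</l:name> pair. It ignores nesting, so :body matches both the
# post body and the comment bodies.
def labeled_spans(text, name)
  tag = Regexp.escape(name.to_s)
  text.scan(%r{<l:#{tag}>(.*?)</l:#{tag}>}m).flatten
end

puts labeled_spans(LABELED_EXAMPLE, :title).inspect
puts labeled_spans(LABELED_EXAMPLE, :author).inspect
puts labeled_spans(LABELED_EXAMPLE, :body).inspect
```

Each labeled example like this would then be passed to Ariel.learn along with the structure, as in the steps above.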

Credits

Ariel is developed by Alex Bradbury as a Google Summer of Code project, mentored by Austin Ziegler.

SVN Repository: rubyforge.org/projects/ariel
Issue tracker: code.google.com/p/ariel/issues/
Documentation/homepage: ariel.rubyforge.org
RDoc: ariel.rubyforge.org/rdoc/