pismo (Web page content analyzer and metadata extractor)

STATUS:

pismo is a VERY NEW project developed for use on coder.io/ - my forthcoming developer news aggregator. pismo is FAR FROM COMPLETE. If you’re brave, you can have a PLAY with it as the examples below and those in the test suite/corpus do work - all tests pass.

The prime missing features so far are the “external attributes” - where calls are made to external services like Delicious, Yahoo, Bing, etc, for getting third party data about documents. The structures are there but I’m still deciding how best to integrate these ideas.

DESCRIPTION:

Pismo extracts metadata and machine-usable data from otherwise unstructured HTML documents, including titles, body text, graphics, date, and keywords.

For example, if you have a blog post HTML file, Pismo should, in theory, be able to extract the title, the actual “content”, images relating to the content, look up Delicious tags, and analyze for keywords.

Pismo only understands English. Je suis desolé.

SYNOPSIS:

  • Basic demo:

    require 'open-uri'
    require 'pismo'
    doc = Pismo::Document.new(open('http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html'))
    doc.title     # => "Cramp: Asychronous Event-Driven Ruby Web App Framework"
    doc.author    # => "Peter Cooper"
    doc.lede      # => "Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals (and the Rails core team). It's built around Ruby's EventMachine library and was designed to use event-driven I/O throughout - making it ideal for situations where you need to handle a large number of open connections (such as Comet systems or streaming APIs.)"
    doc.keywords  # => [["cramp", 7], ["controllers", 3], ["app", 3], ["basic", 2], ..., ... ]
    

COMMAND LINE TOOL:

A command line tool called “pismo” is included so that you can get metadata about a page from the command line. This is great for testing, or perhaps calling it from a non Ruby script. The output is currently in YAML.

  • Usage:

    ./bin/pismo http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html title lede author datetime
    
  • Output:

    --- 
    :url: http://www.rubyinside.com/cramp-asychronous-event-driven-ruby-web-app-framework-2928.html
    :title: "Cramp: Asychronous Event-Driven Ruby Web App Framework"
    :lede: Cramp (GitHub repo)is a new, asynchronous evented Web app framework by Pratik Naik of 37signals
    :author: Peter Cooper
    :datetime: 2010-01-07 12:00:00 +00:00
    

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Commit, do not mess with Rakefile, version, or history.

  • Send me a pull request. I may or may not accept it.

COPYRIGHT AND LICENSE

Apache 2.0 License - See LICENSE for details.

All except lib/pismo/readability.rb is Copyright © 2009, 2010 Peter Cooper lib/pismo/readability.rb is Copyright © 2009, 2010 Arc90 Inc, starrhorne, and iterationlabs

The readability stuff was ganked from github.com/iterationlabs/ruby-readability