Version: 1.06

  1. September, 2003

This is a Ruby library for building trees representing HTML structure.

See the file INSTALL for installation instructions.

Copyright © 2003, Johannes Brodwall <[email protected]> Copyright © 2002, Ned Konz <[email protected]>

License: Ruby’s

See rubyforge.org/projects/ruby-htmltools for the most recent version.

This project includes SGML-parser, ported from Python by Takahiro Maebashi <[email protected]> (see: www.jin.gr.jp/~nahi/Ruby/html-parser/README.html)

PREREQUISITES

Ruby 1.8


Test::Unit


The tests run using Test::Unit. Test::Unit is part of the standard Ruby install as of 1.8


REXML


XPath support requires REXML. REXML is part of the standard Ruby install as of 1.8

CHANGES


Changes from 1.09:


  • Some minor bugfixes

  • SGMLParser.src_range makes it very easy to write applications which parse HTML files into components and manipulate the corresponding source code without altering it. (by Philip Dorrell)


Changes from 1.08:


  • Fixed xpath script and added tests

  • Fixed bug #681 (xhtml)

  • Added GemSpec


Changes from 1.07:


  • Fixed tc_xpath test_match_all after it was broken by upgrade of REXML.

  • Refactored utility code for printing node paths into rexml-nodepath.rb


Changes from 1.06:


  • Included stuff that I had forgot to package into the tarball.


Changes from 1.05:


  • Updated everything to work with Ruby 1.8.


Changes from 1.04:


  • Made sure that unknown entities and characters are not discarded, in both html/tree.rb and html/xmltree.rb

  • Added handling of DOCTYPE to html/xmltree.rb


Changes from 1.03:


  • Added HTMLTree::XMLParser, which makes a REXML document from the given HTML.

  • Changed HTMLTree::Element::print_on() to write()

  • Made it so that a string or IO can be passed to HTMLTree::Element::dump()

  • Made it so that a string or IO can be passed to HTMLTree::Element::write()


Changes from 1.02:


  • added XPath and XML conversion (needs REXML)

  • Wrapped all code in namespaces. The following class names have changed:

    – in html/element.rb HTMLDocument => HTMLTree::Document HTMLElement => HTMLTree::Element HTMLData => HTMLTree::Data HTMLComment => HTMLTree::Comment HTMLSpecial => HTMLTree::Special

    – in html/tags.rb HTMLTag => HTML::Tag HTMLBlockTag => HTML::BlockTag HTMLInlineTag => HTML::InlineTag HTMLBlockOrInlineTag => HTML::BlockOrInlineTag HTMLEmptyTag => HTML::EmptyTag

    – in html/tree.rb HTMLTreeParser => HTMLTree::Parser

    – in html/stparser.rb StackingParser => HTML::StackingParser

  • added HTMLTree::Element.root()


Changes from 1.01:


  • documented change to sgml-parser.

  • added bin/ebaySearch.rb example


Changes from 1.0:


  • attributes now maintain their order. Though this probably isn’t strictly necessary under HTML, it may make it easier to compare document versions.

  • the generated tree now has a top-level node for the document itself, so the DTD can be stored. THIS WILL REQUIRE CODE CHANGES if you have code that assumes that the root node is always <html>. To find the <html> node, you can use the new methods HTMLTreeParser#html() or HTMLDocument#html_node():

    html = parser.html()
    

    Or, querying the tree:

    html = parser.tree.html_node()
    
  • comments are stored in the tree

  • added HTMLElement#print_on() to print a (sub)tree to an IO stream

vim: ts=2 sw=2 et