Ever wanted to parse XML, but hated all the hassle?
Rusty is here to help. Lets start with a small example:
require "rusty"
# A simple RSS parser
module SimpleRSS
extend Rusty::RuleSet
helper Rusty::Helpers::Text
on "*" do end
on "rss channel *" do rss[node.name] = text(node) end
on "rss channel item" do rss.items << item end
on "rss channel item *" do item[node.name] = text(node) end
end
doc = Nokogiri.XML File.read("stressfaktor.xml")
data = SimpleRSS.transform! doc
p data.rss.to_ruby
Interested? Read on.
Transforming nodes
Each XML and HTML document, after being parsed by Nokogiri, is represented as a tree of nodes. A transformation would visit all the nodes in the input document and do something with the data in it. A trivial 1:1 transformation would recreate the tree with the same data. This is obviously not what you want; what you want is probably to build a different tree, with some information, and/or to do something else entirely.
rusty is here to help you.
- It let you define procedures to run on nodes, specified by CSS selectors, and
- it provides a simple name lookup when in fact creating a data structure.
And this works miracles.
Defining callbacks
Rusty knows two different kind of callbacks. The on
callback, which is run before processing a node's children, and the after
callback, which is run once all children have been visited.
module SimpleRSS
extend Rusty::RuleSet
on "rss channel item" do puts "Hu! An item node" end
on "rss channel item *" do puts "A child of an item" end
after "rss channel item" do puts "Now I have seen all of the item's children" end
end
There is an additional way to define a callback, which makes some sense if you need both an "on" and an "after" callback for the same nodes, and probably want to share some information between these:
module SimpleRSS
extend Rusty::RuleSet
on "rss channel item" do
start = Time.now
callback do
puts "Parsing the item took #{Time.now - start} secs."
end
end
end
Defining callbacks
Rusty knows two different kind of callbacks. The on
callback, which is run before processing a node's children, and the after
callback, which is run once all children have been visited.
module SimpleRSS
extend Rusty::RuleSet
on "rss channel item" do puts "Hu! An item node" end
on "rss channel item *" do puts "A child of an item" end
after "rss channel item" do puts "Now I have seen all of the item's children" end
end
There is an additional way to define a callback, which makes some sense if you need both an "on" and an "after" callback for the same nodes, and probably want to share some information between these:
module SimpleRSS
extend Rusty::RuleSet
on "rss channel item" do
puts "Hu! An item node"
callback do
puts "Now I have seen all of the item's children" end
end
end
end
after
and callback
callbacks can coexist.
Creating output data
One case to parse XML is to recreate some kind of data structure which resembles some or all of the XML's input. To support this mode of operation rusty "mirrors" input nodes with output data nodes. To further help you rusty comes with a nimble name lookup scheme in its callbacks. Whenever you use an undeclared name in a callback, rusty goes up to the parent of the document to find a node with a matching name:
module SimpleRSS
extend Rusty::RuleSet
on "rss" do
rss.item_count = 0
callback do
puts "There are #{rss.item_count} items in the input"
end
end
on "rss channel item" do
rss.count += 1
end
end
What happens with the resulting data is up to you. By default rusty throws away all resulting data except what belongs to the top node of the document. In the above example SimpleRSS.transform! would return a hash
{ count => <some_number> }
If you want to keep a node's data you must put it somewhere, as in the following example:
on "rss channel item" do rss.items << item end
Node names
What is a matching name? While XML documents may come with names that might make sense, HTML usually does not. After all, a <div>
is a <div>
no matter what.
For that reason rusty matches both node names and node classes when looking up a node by name. (And yes, that means a node might have multiple names.) And as of yet node names that are not valid ruby identifiers cannot be used in the callback block.
There is one special name, document
, which refers to the top-most node.
Rusty data nodes
A Rusty data node (of type Rusty::DX), is a mongrel of a Hash
, an Array
, and nil
. Unless set to something - i.e. as long as being nil
- it might turn into an Array or a Hash-like structure, depending on what you do to them.
The following makes rss
a hash:
rss.key?(:foo)
rss.item_count = 0 # Hash entries are automatically created
while the following makes it an array
rss << 1
rss[5] = 25
To get back a stupid ruby object use the .to_ruby
method, i.e.
rss.to_ruby # => [ 1, nil, nil, nil, nil, 5 ]
..or something else?
Of course you are free to do whatever. After all, each callback is just a piece of ruby code.
module SimpleRSS
extend Rusty::RuleSet
helper Rusty::Helpers::Text
on "rss channel item *" do item[node.name] = text(node) end
after "rss channel item" do puts "Found an item: #{item.to_ruby}" end
end
The *
callback
You will usually see a rule like this:
on "*" do end
The "*" selector has a very low weight, meaning it matches all nodes that are not matched by any other rule. This is done to prevent rusty from warning about nodes without a matching rule.
During development you should not use a rule like that. Add it only after you feel confident you get all the data you need form the input.
Speeding up
Especially when parsing HTML you might find a number of nodes that belong to a subtree in the document which is completely irrelevant. For example, a page like http://www.google.com/movies contains tons of UI elements, which - assuming you would be interested in theater schedules - is just irrelevant. By skipping the entire subtree you might gain some speed when parsing the input:
on "#navbar, #left_nav" do
skip!
end
Helpers and the callback scope
Note that callbacks get a special scope. This scope - a Rusty::CallbackBinding - is responsible for looking up names up the node tree. The only value defined there - apart from things like object_id
, class
, etc. is node
, which refers to the input node.
If you need special functionality you should define helper methods and modules, as in the following example:
module SimpleRSS
extend Rusty::RuleSet
helper Rusty::Helpers::Text
helper do
def a_helper_method(*args)
end
end
on "rss channel item *" do a_helper_method 1, 2, 3 end
end
Rusty comes with the Rusty::Helpers::Text module, which provides a single helper method, text
, which returns a node's text after cleaning it up.
That is all.
Rusty does have a number of shortcomings.
- It does not support namespaces,
- it's CSS selector matching could be faster,
- the selector weighting could be more correct,
Don't hesitate to fork away and send pull requests!