Module: MicroformatParser
- Included in: Microformats, Microformats::HCalendar
- Defined in: lib/uformatparser.rb
Overview
Turns any class that includes this module into a microformat parser.
The Basics
To create a microformat parser, include this module in a class and use the rule method to define parsing rules for that class. Call parse to parse the content; it returns a new instance of the class holding all the values extracted during parsing. You can parse a document or an element.
For example:
class Microformats
  include MicroformatParser

  class HCalendar
    include MicroformatParser

    # Extract ISO date/time
    extractor :dt_extractor do |node|
      value = node.attributes['title'] if node.name == 'abbr'
      value = text(node) unless value
      value ? Time.parse(value) : nil
    end

    rule_1 :dtstart, nil, :dt_extractor
    rule_1 :dtend, nil, :dt_extractor
    rule_1 :summary, nil, :text
    rule_1 :description, nil, :xml
    rule_1 :url, nil, "a@href"
  end

  rule :tags, "a[rel~=tag]", "text()"
  rule :events, ".vevent", HCalendar
end
content = Microformats.parse(doc)
puts content.
puts content.events
Defined Under Namespace
Classes: Extractor, InvalidExtractorException, InvalidRuleException, InvalidSelectorException, Rule, Selector
Class Method Summary
- .text(node) ⇒ Object
  Returns the text value of a node.
- .xml(node) ⇒ Object
  Returns the XML value of a node (the node itself).
Instance Method Summary
- #default_extractor ⇒ Object
  Returns the default extractor.
- #extractor(name, extractor = nil, &proc) ⇒ Object
  Creates a new extractor.
- #parse(node, context = nil, rules = nil) ⇒ Object
  Called to parse a node.
- #rule(name, selector = nil, extractor = nil, limit = -1, &proc) ⇒ Object
  Creates a new rule.
- #rule_1(name, selector = nil, extractor = nil, &proc) ⇒ Object
  Creates a new rule that extracts at most one value.
- #rules ⇒ Object
  Returns all the rules for this class.
- #selector(name, selector = nil, &proc) ⇒ Object
  Creates a new selector.
Class Method Details
.text(node) ⇒ Object
Returns the text value of a node.
# File 'lib/uformatparser.rb', line 330
def text(node)
  value = ''
  for child in node.children
    if child.instance_of? REXML::Text
      value += child.value
    elsif child.instance_of? REXML::Element
      value += text(child)
    end
  end
  value
end
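To see what this recursion produces, the sketch below copies the same logic into a standalone function and runs it on a small REXML fragment. The name text_of and the sample markup are made up for this illustration and are not part of the library.

```ruby
require 'rexml/document'

# Standalone copy of the recursion in the listing above. text_of is a
# hypothetical name used only in this sketch.
def text_of(node)
  value = ''
  node.children.each do |child|
    if child.instance_of?(REXML::Text)
      value += child.value        # plain text node
    elsif child.instance_of?(REXML::Element)
      value += text_of(child)     # descend into nested elements
    end
  end
  value
end

fragment = REXML::Document.new('<p>Lunch with <b>Bob</b> at noon</p>').root
puts text_of(fragment)
```

Nested markup is flattened: only the character data of the element and its descendants is kept, in document order.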
.xml(node) ⇒ Object
Returns the XML value of a node (the node itself).
# File 'lib/uformatparser.rb', line 343
def xml(node)
  node
end
Instance Method Details
#default_extractor ⇒ Object
Returns the default extractor.
# File 'lib/uformatparser.rb', line 271
def default_extractor()
  return DEFAULT_EXTRACTOR
end
#extractor(name, extractor = nil, &proc) ⇒ Object
Creates a new extractor.
There are two ways to create an extractor:
* extractor name, statement
* extractor name { block }
The name argument (string or symbol) specifies the extractor name, defining a class method with that name that can be used to extract the value of a node.
The extractor can be an expression (string) or a block that accepts a single argument (element) and returns the extracted value, or nil.
For example:
selector(:select_link) { |node| node.name == 'a' }
extractor(:extract_link) { |node| node.attributes['href'] }
rule :links, :select_link, :extract_link
The expression takes the form of:
extractor := extract (|extract)*
extract := element | @attribute | element@attribute | method()
If multiple extracts are specified, the first extracted value is used.
If an element is specified, the text value is extracted only if the selected node is an element of that type. If an attribute is specified, the extracted value is the attribute’s value. If both element and attribute are used, the attribute value is extracted only if the selected node is an element of that type.
If a method is specified, that method is called for the node. There are two methods available in any class: text and xml.
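The expression rules above can be made concrete with a small evaluator. This is only a sketch of the described behaviour, not the library's Extractor class: extract_with and text_of are hypothetical helpers, and the handling of unknown tokens (they are skipped) is an assumption.

```ruby
require 'rexml/document'

# Hypothetical helper: concatenated text of a node and its descendants.
def text_of(node)
  node.children.reduce('') do |acc, child|
    case child
    when REXML::Text then acc + child.value
    when REXML::Element then acc + text_of(child)
    else acc
    end
  end
end

# Hypothetical evaluator for expressions such as "abbr@title|a@href|text()".
# Extracts are tried left to right; the first non-nil value wins.
def extract_with(expression, node)
  expression.split('|').map(&:strip).each do |extract|
    value =
      case extract
      when 'text()' then text_of(node)
      when 'xml()' then node
      when /\A(\w*)@([\w-]+)\z/
        element, attribute = $1, $2
        # Bare "@attr" applies to any element; "el@attr" only to <el>.
        node.attributes[attribute] if element.empty? || node.name == element
      when /\A\w+\z/
        # Bare element name: text value only if the node is that element.
        text_of(node) if node.name == extract
      end
    return value if value
  end
  nil
end

abbr = REXML::Document.new('<abbr title="2006-01-01">Jan 1</abbr>').root
link = REXML::Document.new('<a href="http://example.com/">Example</a>').root
span = REXML::Document.new('<span>Plain</span>').root

default = 'abbr@title|a@href|text()'
puts extract_with(default, abbr)
puts extract_with(default, link)
puts extract_with(default, span)
```

With this expression an abbr yields its title, a link yields its href, and any other element falls through to its text.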
# File 'lib/uformatparser.rb', line 251
def extractor(name, extractor = nil, &proc)
  raise InvalidExtractorException, "First argument (rule name) is required" unless name
  extractor = case extractor
  when NilClass
    # Absent extractor: either block if provided, otherwise default extractor
    proc ? proc : default_extractor
  when String
    # Extractor expression
    Extractor.new(extractor)
  else
    raise InvalidExtractorException, "Invalid extractor type: must be a string, parser class, block or nil"
  end
  # Create a class method using the extractor name that calls the
  # extractor's extract method.
  class << self
    self
  end.instance_eval { define_method(name) { |node| extractor.call(node) } }
end
#parse(node, context = nil, rules = nil) ⇒ Object
Called to parse a node.
The node may be an element (REXML::Element) or a document (REXML::Document).
For example:
class ParseLinks
  include MicroformatParser
  rule :links, "a", "@href"
  rule :ids, "a[@id]", "a@id"
end
parsed = ParseLinks.parse(doc)
puts parsed.links
puts parsed.ids
# File 'lib/uformatparser.rb', line 290
def parse(node, context = nil, rules = nil)
  # Create a new object unless one is provided. This method can be
  # called on the class (creating a new instance) or on an object (recursive)
  context = self.new() unless context
  # Obtain the rules for this class unless provided by caller.
  rules = self.rules unless rules
  # Rules are reduced during processing. If a rule matches a node, that rule
  # is not applied to any child nodes (structured rules will process child nodes
  # directly). However, other rules are allowed to process the child nodes.
  # Removing a rule modifies the ruleset, requiring it to be cloned.
  less_rules = nil
  # We must have rules and the node must be an element/document
  if rules and node.kind_of? REXML::Element
    # Iterate over all the rules and process them. Remove any matching rules
    # from this ruleset -- the new ruleset will be used on child nodes.
    rules.each_with_index do |rule, index|
      if rule and rule.process(node, context)
        less_rules = rules.clone unless less_rules
        less_rules[index] = nil
      end
    end
    rules = less_rules if less_rules
    node.elements.each { |child| parse(child, context, less_rules) }
  end
  context
end
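The comments in the listing describe a rule-reduction scheme: once a rule matches a node, it is withheld from that node's descendants while the other rules keep going. The sketch below illustrates that idea on a toy tree. Node, walk and the rule pairs are hypothetical and deliberately simpler than the library's Rule objects.

```ruby
# Toy tree node for this sketch only (the library works on REXML elements).
Node = Struct.new(:name, :children)

# rules is an array of [label, predicate] pairs. A rule that matches a node
# is recorded and then dropped from the set passed on to that node's children.
def walk(node, rules, hits = [])
  remaining = rules.reject do |label, matches|
    matched = matches.call(node)
    hits << [label, node.name] if matched
    matched
  end
  node.children.each { |child| walk(child, remaining, hits) }
  hits
end

tree = Node.new('event', [
  Node.new('title', []),
  Node.new('event', [])   # nested event: the :events rule no longer applies
])

rules = [
  [:events, ->(n) { n.name == 'event' }],
  [:titles, ->(n) { n.name == 'title' }]
]

p walk(tree, rules)
```

Because reject returns a new array, sibling subtrees never see each other's reductions, which is the same reason the listing clones the ruleset before modifying it.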
#rule(name, selector = nil, extractor = nil, limit = -1, &proc) ⇒ Object
Create a new rule.
There are two ways to define a rule:
* rule name, selector?, extractor?, limit?
* rule name, limit? { block }
The name argument specifies an instance variable that holds the value (or values) extracted from processing this rule. It can be a string or a symbol. An attribute accessor is created with that name.
The selector argument identifies all nodes that match the rule. It can be a CSS-style selector (string) or a method/proc. A symbol specifies a method to use from this class. The method/proc receives a single argument with the node and must return true/false.
If selector is absent, the default selector will match any element with a class of the same name as the name argument. For example:
rule :dtstart
Matches all elements with the class dtstart.
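The source listing for rule builds this default selector from a word-boundary regular expression over the class attribute. The sketch below reproduces just that test in isolation; matches_class? is a hypothetical helper name, and the sample class strings are made up.

```ruby
# Hypothetical helper mirroring the default selector's test: the rule name
# must appear in the class attribute delimited by word boundaries.
def matches_class?(class_attribute, rule_name)
  pattern = Regexp.new("\\b#{rule_name}\\b")
  pattern.match?(class_attribute.to_s)
end

puts matches_class?('vevent dtstart', :dtstart)  # one of several classes
puts matches_class?('dtstarted', :dtstart)       # longer word, no boundary
puts matches_class?('dtstart-local', :dtstart)   # hyphen counts as a boundary
puts matches_class?(nil, :dtstart)               # element without a class
```

Note that a hyphen is a word boundary, so a class such as dtstart-local also satisfies the pattern; a CSS-style selector such as ".dtstart" may be preferable where that matters.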
The extractor argument specifies how to extract a value from a selected node. It can be a list of extract rules (string), a method/proc, or a class. A symbol specifies a method to use from this class. The method/proc receives a single argument with the node and returns the extracted value, or nil.
If the extractor is a class, it references a microformat parser which is then called to parse the content of a matching element.
If extractor is absent, the default extractor is used:
abbr@title|a@href|text()
The limit argument specifies the cardinality of the rule’s value:
  0   The rule is never applied
  1   The rule is applied once, the first extracted value is set
  -1  The rule is applied multiple times, extracted values are set in an array
  n   The rule is applied up to n times, extracted values are set in an array
In the second form, a block is specified instead of the selector/extractor. The block is called with a node and returns the extracted value, or nil.
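The effect of limit on the stored value can be sketched as a plain function over the values a rule would extract, in document order. collect_with_limit is a hypothetical helper, and returning nil for a limit of 0 is an assumption (the text only says the rule is never applied).

```ruby
# Hypothetical helper: what ends up in the rule's attribute for a given limit.
def collect_with_limit(candidates, limit)
  return nil if limit.zero?               # rule never applied (assumed: unset)
  return candidates.first if limit == 1   # single value, not wrapped in an array
  limit.negative? ? candidates.dup : candidates.first(limit)
end

found = %w[first second third]

p collect_with_limit(found, 0)
p collect_with_limit(found, 1)
p collect_with_limit(found, -1)
p collect_with_limit(found, 2)
```

The practical consequence is that rule_1 attributes hold a bare value, while every other non-zero limit yields an array, even when only one value was found.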
# File 'lib/uformatparser.rb', line 95
def rule(name, selector = nil, extractor = nil, limit = -1, &proc)
  raise InvalidRuleException, "First argument (rule name) is required" unless name
  if proc
    # The rule processing is taken from the block, everything else must be nil
    raise InvalidRuleException, "Can't specify selector/extractor in combination with proc" if selector or extractor
    rule = Rule.new(name, nil, proc, limit)
  else
    # Determine the selector.
    selector = case selector
    when NilClass
      # Absent selector: create a selector that matches element with the same
      # class as the rule name
      match = Regexp.new("\\b#{name.to_s}\\b")
      proc { |node| node.attributes['class'] =~ match }
    when String
      # CSS-style selector
      Selector.create(selector)
    when Proc, Method
      # Use as is
      selector
    when Symbol
      # Find named method and use that as the selector
      # Since the instance method is unbound, we bind it to this class
      selector = method(selector)
      raise InvalidSelectorException, "Method #{name.to_s} is not a valid selector" unless selector
      selector
    else
      raise InvalidSelectorException, "Invalid selector type: must be a string, symbol, proc/method or nil"
    end
    # Determine the extractor
    extractor = case extractor
    when NilClass
      # Absent extractor: either block if provided, otherwise default extractor
      default_extractor
    when String
      # Extractor expression
      Extractor.new(self, extractor)
    when Proc, Method
      # Use as is
      extractor
    when Symbol
      # Find named method and use that as the extractor
      # Since the instance method is unbound, we bind it to this class
      extractor = method(extractor)
      raise InvalidExtractorException, "Method #{name.to_s} is not a valid extractor" unless extractor
      extractor
    when Class
      # Extractor is a class, generally another ruleset, so we call
      # its parse method (must exist).
      begin
        extractor.method(:parse)
      rescue NameError=>error
        raise InvalidExtractorException, "Extractor class must implement the method parse", error.backtrace
      end
      extractor
    else
      raise InvalidExtractorException, "Invalid extractor type: must be a string, parser class, block or nil"
    end
    # Create a new rule, to invoke its process method
    rule = Rule.new(name, selector, extractor, limit)
  end
  # Create an accessor for an attribute with the same name as the rule
  # The accessor will hold the rule value
  attr name, true
  # Add this rule to class's ruleset
  self.rules << rule
end
#rule_1(name, selector = nil, extractor = nil, &proc) ⇒ Object
Create a new rule that extracts at most one value.
Same as calling rule with limit=1.
# File 'lib/uformatparser.rb', line 170
def rule_1(name, selector = nil, extractor = nil, &proc)
  # Rule with limit of one value
  rule(name, selector, extractor, 1, &proc)
end
#rules ⇒ Object
Returns all the rules for this class.
Returns an array of rules defined with rule.
You can use this method to inspect rules, add/remove rules, etc. Rules are processed in the order in which they are added.
# File 'lib/uformatparser.rb', line 323
def rules
  rules = @microparser_rules
  @microparser_rules = rules = Array.new() unless rules
  rules
end
#selector(name, selector = nil, &proc) ⇒ Object
Creates a new selector.
There are two ways to create a selector:
* selector name, statement
* selector name { block }
The name argument (a string or symbol) specifies the selector name, defining a class method with that name that can be used to identify matching elements.
The selector can be a CSS-style selector (string) or a block that accepts a single argument (element) and returns true or false.
For example:
selector(:select_link) { |node| node.name == 'a' }
extractor(:extract_link) { |node| node.attributes['href'] }
rule :links, :select_link, :extract_link
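Both selector and extractor work by turning a named block into a class method that a later rule can refer to by symbol. The sketch below shows that mechanism on its own, using Ruby's define_singleton_method rather than the library's code. NamedBlocks, named and LinkRules are hypothetical names, and plain hashes stand in for REXML nodes.

```ruby
# Hypothetical mixin: registers a block as a class-level method under a name.
module NamedBlocks
  def named(name, &block)
    raise ArgumentError, 'a block is required' unless block
    define_singleton_method(name) { |node| block.call(node) }
  end
end

class LinkRules
  extend NamedBlocks

  # Parentheses keep the braces attached to the method call, not the symbol.
  named(:select_link)  { |node| node[:name] == 'a' }
  named(:extract_link) { |node| node[:href] }
end

anchor = { name: 'a', href: 'http://example.com/' }

puts LinkRules.select_link(anchor)
puts LinkRules.extract_link(anchor)

# A symbol can later be resolved back to a callable, much as rule does
# with method(selector).
selector = LinkRules.method(:select_link)
puts selector.call({ name: 'p' })
```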
193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 |
# File 'lib/uformatparser.rb', line 193 def selector(name, selector = nil, &proc) raise InvalidSelectorException, "First argument (rule name) is required" unless name selector = case selector when NilClass # Absent selector: either block is provided, or we create a selector # that matches element with the same class as the selector name if proc proc else match = Regexp.new("\\b#{name.to_s}\\b") proc { |node| node.attributes['class'] =~ match } end when String # CSS-style selector Selector.create(selector) else raise InvalidSelectorException, "Invalid selector type: must be a string, block or nil" end # Create a class method using the selector name that calls the # selector's match method. class << self self end.instance_eval { define_method(name) { |node| selector.call(node) } } end |