Module: MicroformatParser

Included in:
Microformats, Microformats::HCalendar
Defined in:
lib/uformatparser.rb

Overview

Turns any class that includes this module into a microformat parser.

The Basics

To create a microformat parser, include this module in a class and use the rule method to define parsing rules for that class. Call parse on the class to parse the content; it returns a new instance of the class holding all the values extracted during parsing. You can parse a document or an element.

For example:

class Microformats
  include MicroformatParser

  class HCalendar
    include MicroformatParser

    # Extract ISO date/time
    extractor :dt_extractor do |node|
      value = node.attributes['title'] if node.name == 'abbr'
      value = text(node) unless value
      value ? Time.parse(value) : nil
    end

    rule_1 :dtstart, nil, :dt_extractor
    rule_1 :dtend, nil, :dt_extractor
    rule_1 :summary, nil, :text
    rule_1 :description, nil, :xml
    rule_1 :url, nil, "a@href"
  end

  rule :tags, "a[rel~=tag]", "text()"
  rule :events, ".vevent", HCalendar
end

# doc is a REXML::Document (or REXML::Element) obtained elsewhere
content = Microformats.parse(doc)
puts content.tags
puts content.events

Defined Under Namespace

Classes: Extractor, InvalidExtractorException, InvalidRuleException, InvalidSelectorException, Rule, Selector

Class Method Details

.text(node) ⇒ Object

Returns the text value of a node.



# File 'lib/uformatparser.rb', line 330

def text(node)
    value = ''
    for child in node.children
        if child.instance_of? REXML::Text
            value += child.value
        elsif child.instance_of? REXML::Element
            value += text(child)
        end
    end
    value
end
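
The recursive walk above can be tried outside the module. The following is a minimal sketch, assuming REXML is installed; node_text is a hypothetical standalone name, not part of the library, and the sample markup is invented for illustration.

```ruby
require 'rexml/document'

# Standalone copy of the recursive text walk shown above.
# node_text is a hypothetical name used so the sketch does not
# depend on MicroformatParser being loaded.
def node_text(node)
  value = +''
  node.children.each do |child|
    if child.instance_of?(REXML::Text)
      value << child.value            # plain text node
    elsif child.instance_of?(REXML::Element)
      value << node_text(child)       # descend into nested elements
    end
    # any other node type (comments, instructions) is skipped
  end
  value
end

doc = REXML::Document.new('<p class="summary">Lunch with <b>Ann</b> and <i>Bob</i></p>')
summary = node_text(doc.root)
puts summary
```

Nested markup is flattened: the text of the b and i elements is concatenated in document order with the surrounding text.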

.xml(node) ⇒ Object

Returns the XML value of a node (the node itself).



# File 'lib/uformatparser.rb', line 343

def xml(node)
    node
end

Instance Method Details

#default_extractor ⇒ Object

Returns the default extractor.



# File 'lib/uformatparser.rb', line 271

def default_extractor()
    return DEFAULT_EXTRACTOR
end

#extractor(name, extractor = nil, &proc) ⇒ Object

Creates a new extractor.

There are two ways to create an extractor:

* extractor name, statement
* extractor name { block }

The name argument (string or symbol) specifies the extractor name, defining a class method with that name that can be used to extract the value of a node.

The extractor can be an expression (string) or a block that accepts a single argument (element) and returns the extracted value, or nil.

For example:

selector(:select_link) { |node| node.name == 'a' }
extractor(:extract_link) { |node| node.attributes['href'] }
rule :links, :select_link, :extract_link

The expression takes the form of:

extractor := extract (|extract)*
extract   := element | @attribute | element@attribute | method()

If multiple extracts are specified, the first extracted value is used.

If an element is specified, the text value is extracted only if the selected node is an element of that type. If an attribute is specified, the extracted value is the attribute’s value. If both element and attribute are used, the attribute value is extracted only if the selected node is an element of that type.

If a method is specified, that method is called for the node. There are two methods available in any class: text and xml.
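
The expression rules above can be illustrated with a small evaluator. This is only a sketch of the documented semantics, not the library's Extractor class: Elem, extract, and the sample values are hypothetical, and only the text() method is modelled.

```ruby
# Hypothetical stand-in for an element: tag name, attribute hash, text.
Elem = Struct.new(:name, :attributes, :text)

# Evaluate an expression such as "abbr@title|a@href|text()" against a
# node and return the first extract that yields a value.
def extract(expression, node)
  expression.split('|').each do |part|
    value =
      case part.strip
      when /\A(\w+)\(\)\z/        # method(): only text() is modelled here
        $1 == 'text' ? node.text : nil
      when /\A(\w*)@([\w-]+)\z/   # @attribute or element@attribute
        node.attributes[$2] if $1.empty? || $1 == node.name
      when /\A(\w+)\z/            # element: text only for that element type
        node.text if $1 == node.name
      end
    return value unless value.nil?
  end
  nil
end

abbr = Elem.new('abbr', { 'title' => '2006-05-01T10:00' }, 'May 1st')
link = Elem.new('a', { 'href' => 'http://example.com/' }, 'Details')
span = Elem.new('span', {}, 'Plain text')

default = 'abbr@title|a@href|text()'
puts extract(default, abbr)
puts extract(default, link)
puts extract(default, span)
```

Each node falls through the alternatives until one applies: the abbr yields its title, the link its href, and the span its text.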



# File 'lib/uformatparser.rb', line 251

def extractor(name, extractor = nil, &proc)
    raise InvalidExtractorException, "First argument (rule name) is required" unless name
    extractor = case extractor
    when NilClass
        # Absent extractor: either block if provided, otherwise default extractor
        proc ? proc : default_extractor
    when String
        # Extractor expression
        Extractor.new(extractor)
    else
        raise InvalidExtractorException, "Invalid extractor type: must be a string, parser class, block or nil"
    end
    # Create a class method using the extractor name that calls the
    # extractor's extract method.
    class << self
        self
    end.instance_eval { define_method(name) { |node| extractor.call(node) } }
end

#parse(node, context = nil, rules = nil) ⇒ Object

Called to parse a node.

The node may be an element (REXML::Element) or a document (REXML::Document).

For example:

class ParseLinks
  include MicroformatParser

  rule :links, "a", "@href"
  rule :ids, "a[@id]", "a@id"
end

parsed = ParseLinks.parse(doc)
puts parsed.links
puts parsed.ids


# File 'lib/uformatparser.rb', line 290

def parse(node, context = nil, rules = nil)
    # Create a new object unless one is provided. This method can be
    # called on the class (creating a new instance) or on an object (recursive)
    context = self.new() unless context
    # Obtain the rules for this class unless provided by caller.
    rules = self.rules unless rules
    # Rules are reduced during processing. If a rule matches a node, that rule
    # is not applied to any child nodes (structured rules will process child nodes
    # directly). However, other rules are allowed to process the child nodes.
    # Removing a rule modifies the ruleset, requiring it to be cloned.
    less_rules = nil
    # We must have rules and the node must be an element/document
    if rules and node.kind_of? REXML::Element
        # Iterate over all the rules and process them. Remove any matching rules
        # from this ruleset -- the new ruleset will be used on child nodes.
        rules.each_with_index do |rule, index|
            if rule and rule.process(node, context)
                less_rules = rules.clone unless less_rules
                less_rules[index] = nil
            end
        end
        rules = less_rules if less_rules
        node.elements.each { |child| parse(child, context, less_rules) }
    end
    context
end
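
The rule reduction described in the comments above can be sketched in isolation. This is a simplified model under stated assumptions, not the library's code: TreeNode, name_rule, and walk are hypothetical, rules are plain lambdas, and the sketch always hands the current (possibly reduced) rule set to child nodes.

```ruby
# Hypothetical stand-in for an element tree.
TreeNode = Struct.new(:name, :children)

# Build a rule that matches nodes by name and records each match.
def name_rule(wanted)
  lambda do |node, log|
    return false unless node.name == wanted
    log << wanted
    true
  end
end

# Walk the tree. A rule that matches a node is blanked out in a copy
# of the rule set, so it is not applied to that node's descendants,
# while sibling subtrees still see the untouched set.
def walk(node, rules, log)
  remaining = nil
  rules.each_with_index do |rule, index|
    next unless rule && rule.call(node, log)
    remaining ||= rules.dup   # clone on first match only
    remaining[index] = nil
  end
  rules = remaining if remaining
  node.children.each { |child| walk(child, rules, log) }
  log
end

tree = TreeNode.new('div', [
  TreeNode.new('event', [TreeNode.new('event', []), TreeNode.new('tag', [])]),
  TreeNode.new('event', [])
])

log = walk(tree, [name_rule('event'), name_rule('tag')], [])
puts log.inspect
```

The nested event is skipped because the event rule was already consumed by its parent, but the tag rule still fires inside that subtree and the sibling event is matched normally.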

#rule(name, selector = nil, extractor = nil, limit = -1, &proc) ⇒ Object

Create a new rule.

There are two ways to define a rule:

* rule name, selector?, extractor?, limit?
* rule name, limit? { block }

The name argument specifies an instance variable that holds the value (or values) extracted from processing this rule. It can be a string or a symbol. An attribute accessor is created with that name.

The selector argument identifies all nodes that match the rule. It can be a CSS-style selector (string) or a method/proc. A symbol specifies a method to use from this class. The method/proc receives a single argument with the node and must return true/false.

If selector is absent, the default selector will match any element with a class of the same name as the name argument. For example:

rule :dtstart

Matches all elements with the class dtstart.
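
The default selector is a word-boundary regular expression applied to the class attribute. The sketch below reproduces that match on plain strings; default_selector is a hypothetical helper and the class values are invented for illustration.

```ruby
# Build a matcher for a rule name using the same kind of pattern as the
# default selector: the name surrounded by word boundaries.
def default_selector(name)
  match = Regexp.new("\\b#{name}\\b")
  # class_attr is the element's class attribute, or nil when absent
  ->(class_attr) { !class_attr.nil? && class_attr.match?(match) }
end

dtstart = default_selector(:dtstart)

puts dtstart.call('vevent dtstart')  # name appears as one of several classes
puts dtstart.call('dtstartish')      # name is only a prefix of a longer word
puts dtstart.call('x-dtstart')       # "-" is a non-word character
puts dtstart.call(nil)               # element without a class attribute
```

Because a hyphen counts as a word boundary, a class such as x-dtstart also satisfies this pattern, which is worth keeping in mind when class names share a suffix.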

The extractor argument specifies how to extract a value from a selected node. It can be a list of extract rules (string), a method/proc, or a class. A symbol specifies a method to use from this class. The method/proc receives a single argument with the node and returns the extracted value, or nil.

If the extractor is a class, it references a microformat parser which is then called to parse the content of a matching element.

If extractor is absent, the default extractor is used:

abbr@title|a@href|text()

The limit argument specifies the cardinality of the rule’s value:

 0  The rule is never applied
 1  The rule is applied once, the first extracted value is set
-1  The rule is applied multiple times, extracted values are set in an array
 n  The rule is applied up to n times, extracted values are set in an array
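
The table can be read as a function from the limit to the shape of the stored value. The following is a sketch under assumptions: apply_limit is a hypothetical helper, and returning nil for a limit of 0 is a guess, since the source does not show what the attribute holds in that case.

```ruby
# Given every value a rule could extract, in document order, return
# what the rule's attribute would hold for the given limit.
def apply_limit(values, limit)
  case limit
  when 0  then nil                  # never applied (nil is an assumption)
  when 1  then values.first         # a single value, not an array
  when -1 then values.dup           # unbounded: every value, in an array
  else values.first(limit)          # at most n values, in an array
  end
end

found = %w[first second third]

puts apply_limit(found, 1).inspect
puts apply_limit(found, 2).inspect
puts apply_limit(found, -1).inspect
```

The practical difference is between a limit of 1, where the accessor returns the value itself, and any other non-zero limit, where it returns an array.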

In the second form, a block is specified instead of the selector/extractor. The block is called with a node and returns the extracted value, or nil.



# File 'lib/uformatparser.rb', line 95

def rule(name, selector = nil, extractor = nil, limit = -1, &proc)
    raise InvalidRuleException, "First argument (rule name) is required" unless name
    if proc
        # The rule processing is taken from the block, everything else must be nil
        raise InvalidRuleException, "Can't specify selector/extractor in combination with proc" if selector or extractor
        rule = Rule.new(name, nil, proc, limit)
    else
        # Determine the selector.
        selector = case selector
        when NilClass
            # Absent selector: create a selector that matches element with the same
            # class as the rule name
            match = Regexp.new("\\b#{name.to_s}\\b")
            proc { |node| node.attributes['class'] =~ match }
        when String
            # CSS-style selector
            Selector.create(selector)
        when Proc, Method
            # Use as is
            selector
        when Symbol
            # Find named method and use that as the selector
            # Since the instance method is unbound, we bind it to this class
            selector = method(selector)
            raise InvalidSelectorException, "Method #{name.to_s} is not a valid selector" unless selector
            selector
        else
            raise InvalidSelectorException, "Invalid selector type: must be a string, symbol, proc/method or nil"
        end

        # Determine the extractor
        extractor = case extractor
        when NilClass
            # Absent extractor: either block if provided, otherwise default extractor
            default_extractor
        when String
            # Extractor expression
            Extractor.new(self, extractor)
        when Proc, Method
            # Use as is
            extractor
        when Symbol
            # Find named method and use that as the extractor
            # Since the instance method is unbound, we bind it to this class
            extractor = method(extractor)
            raise InvalidExtractorException, "Method #{name.to_s} is not a valid extractor" unless extractor
            extractor
        when Class
            # Extractor is a class, generally another ruleset, so we call
            # its parse method (must exist).
            begin
                extractor.method(:parse)
            rescue NameError=>error
                raise InvalidExtractorException, "Extractor class must implement the method parse", error.backtrace
            end
            extractor
        else
            raise InvalidExtractorException, "Invalid extractor type: must be a string, parser class, block or nil"
        end

        # Create a new rule, to invoke its process method
        rule = Rule.new(name, selector, extractor, limit)
    end

    # Create an accessor for an attribute with the same name as the rule
    # The accessor will hold the rule value
    attr name, true
    # Add this rule to class's ruleset
    self.rules << rule
end

#rule_1(name, selector = nil, extractor = nil, &proc) ⇒ Object

Create a new rule that extracts at most one value.

Same as calling rule with a limit of 1.



# File 'lib/uformatparser.rb', line 170

def rule_1(name, selector = nil, extractor = nil, &proc)
    # Rule with limit of one value
    rule(name, selector, extractor, 1, &proc)
end

#rules ⇒ Object

Returns all the rules for this class.

Returns an array of rules defined with rule.

You can use this method to inspect rules, add/remove rules, etc. Rules are processed in the order in which they are added.



# File 'lib/uformatparser.rb', line 323

def rules
    rules = @microparser_rules
    @microparser_rules = rules = Array.new() unless rules
    rules
end

#selector(name, selector = nil, &proc) ⇒ Object

Creates a new selector.

There are two ways to create a selector:

* selector name, statement
* selector name { block }

The name argument (a string or symbol) specifies the selector name, defining a class method with that name that can be used to identify matching elements.

The selector can be a CSS-style selector (string) or a block that accepts a single argument (element) and returns true or false.

For example:

selector(:select_link) { |node| node.name == 'a' }
extractor(:extract_link) { |node| node.attributes['href'] }
rule :links, :select_link, :extract_link


# File 'lib/uformatparser.rb', line 193

def selector(name, selector = nil, &proc)
    raise InvalidSelectorException, "First argument (rule name) is required" unless name
    selector = case selector
    when NilClass
        # Absent selector: either block is provided, or we create a selector
        # that matches element with the same class as the selector name
        if proc
            proc
        else
            match = Regexp.new("\\b#{name.to_s}\\b")
            proc { |node| node.attributes['class'] =~ match }
        end
    when String
        # CSS-style selector
        Selector.create(selector)
    else
        raise InvalidSelectorException, "Invalid selector type: must be a string, block or nil"
    end
    # Create a class method using the selector name that calls the
    # selector's match method.
    class << self
        self
    end.instance_eval { define_method(name) { |node| selector.call(node) } }
end