Module: Scrapes::RuleParser

Included in:
Page
Defined in:
lib/scrapes/rule_parser.rb

Overview

The methods defined here are available at the class scope level of a Scrapes::Page subclass. For example:

class Foobar < Scrapes::Page
  rule :foo, 'foo'
  rule_1 :bar, 'bar', 'text()'
end

Using rule

Using rule_1

Using selector

Using extractor


Defined Under Namespace

Classes: Extractor, InvalidRuleException, Rule

Instance Method Summary

Instance Method Details

#extractor(name, extract = nil, &block) ⇒ Object

Creates a standalone extractor that can later be used in a rule. Example:

class Foobar < Scrapes::Page
  extractor :mailto_extract do |elem|
    elem.attributes['href'].sub(/mailto:/,'') # remove the mailto: string
  end
  rule :emails, 'a[@href^="mailto:"]', :mailto_extract
end

name

the name later used to invoke this extractor

extract

the extractor to use, String or NilClass

block

a block extractor, must not be defined if extract is non-nil

A block extractor is yielded each object that matched the rule’s selector.

Extractors passed to rule or rule_1 are interpreted based on the class of the extractor, as follows:

NilClass

The result of the selector is returned unchanged. Thus foo.my_rule just returns the selector results for the :my_rule rule.

Symbol

A custom extractor is used. See the example at the top of this method's documentation.

Class

A nested class of the given name is used as a new inner-parser. An instance of that class is returned from each invocation of the extractor. Example:

class Outer < Scrapes::Page
  class Inner < Scrapes::Page
   rule_1 :bold_text, 'b', 'text()'
   rule_1 :img_src, 'img[@src]', '@src'
  end
  rule :items, 'tr', Inner
end

Now calling my_page.items returns an Array of Inner objects, each of which parses out the bold text and image source of one table row in the document.

String

Two patterns:

@foobar

extracts the contents of the attribute named ‘foobar’

foobar()

invoke the foobar builtin extractor, see Scrapes::Hpricot::Extractors
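
The class-based dispatch described above can be pictured with a small, self-contained sketch. It does not use Scrapes or Hpricot: ToyNode, Inner, CUSTOM_EXTRACTORS, BUILTIN_EXTRACTORS and resolve_extractor are hypothetical names invented for illustration, and the real library may resolve extractors differently.

```ruby
# Toy model only. None of these names exist in Scrapes or Hpricot.
ToyNode = Struct.new(:attributes, :text)   # stand-in for a matched element
Inner   = Struct.new(:node)                # stand-in for a nested Page subclass

# Hypothetical registry of named block extractors (the Symbol case).
CUSTOM_EXTRACTORS = {
  mailto_extract: ->(node) { node.attributes['href'].sub(/mailto:/, '') }
}.freeze

# Hypothetical stand-ins for builtin extractors such as text().
BUILTIN_EXTRACTORS = {
  'text' => ->(node) { node.text }
}.freeze

# Returns a callable that turns one matched node into a value,
# choosing the behaviour from the class of +extract+.
def resolve_extractor(extract)
  case extract
  when nil    then ->(node) { node }                  # NilClass: pass the match through
  when Symbol then CUSTOM_EXTRACTORS.fetch(extract)   # Symbol: named custom extractor
  when Class  then ->(node) { extract.new(node) }     # Class: wrap the match in an inner object
  when String
    if extract.start_with?('@')                       # "@foobar": read an attribute
      attribute = extract[1..-1]
      ->(node) { node.attributes[attribute] }
    elsif (builtin = extract[/\A(\w+)\(\)\z/, 1])     # "foobar()": builtin extractor
      BUILTIN_EXTRACTORS.fetch(builtin)
    else
      raise ArgumentError, "unrecognised extractor string: #{extract.inspect}"
    end
  else
    raise ArgumentError, "unsupported extractor: #{extract.inspect}"
  end
end

LINK = ToyNode.new({ 'href' => 'mailto:a@example.com' }, 'Mail me')
```

With these definitions, resolve_extractor('@href').call(LINK) reads the href attribute, while resolve_extractor(:mailto_extract).call(LINK) strips the mailto: prefix first.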



# File 'lib/scrapes/rule_parser.rb', line 150

def extractor(name, extract = nil, &block)
  tor '@extractor', name, extract, &block
end

#parse(node, context = nil, rules = nil) ⇒ Object

:nodoc:



# File 'lib/scrapes/rule_parser.rb', line 155

def parse(node, context = nil, rules = nil) # :nodoc:
  context = self.new() unless context
  rules   = self.rules unless rules
  if rules
    rules.each_with_index do |rule, index|
      if rule and rule.process(node, context)
        less_rules = rules.clone unless less_rules
        less_rules[index] = nil
      end
    end
  end
  context
end

#rule(name, select = '', extract = nil, limit = -1, &block) ⇒ Object

name

the name later used to invoke this rule

select

the selector to use, String or Symbol

extract

the extractor to use, String, Symbol, or Class. See RuleParser#extractor

limit

the limit of nodes to send to extractor

block

a block extractor, must not be defined if extract is non-nil

Example:

class Foobar < Scrapes::Page
  rule :foo, 'foo'
end

Later it’s used as an instance method on the Scrapes::Page objects like this:

foobar.foo.each do |foo|
  example.attr << foo
end


# File 'lib/scrapes/rule_parser.rb', line 66

def rule(name, select = '', extract = nil, limit = -1, &block)
  raise InvalidRuleException, "First argument (rule name) is required" unless name
  attr name, true
  self.rules << Rule.new(name, selector(nil,select), extractor(nil,extract,&block), limit)
end

#rule_1(name, selector = '', extractor = nil, &block) ⇒ Object

Almost the same as rule, except that it forces limit to be 1. The other difference is that RuleParser#rule returns a collection of matches (an Array, even when there is only one match), whereas RuleParser#rule_1 returns just the match itself.

name

the name later used to invoke this rule

selector

the selector to use, String or Symbol

extractor

the extractor to use, String, Symbol, or Class

block

a block extractor, must not be defined if extract is non-nil

Example:

class Foobar < Scrapes::Page
  rule_1 :bar, 'tr'
end

Later it’s used as an instance method on the Scrapes::Page objects like this:

example.attr = foobar.bar
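
The difference in return shape between rule and rule_1 can be sketched with a toy model. ToyRules and ToyPage are hypothetical and work from a canned table of matches instead of a parsed document. The sketch assumes that a limit of 1 yields the bare match and that a negative limit means no limit; how Scrapes implements this internally is not shown on this page.

```ruby
# Toy model only: ToyRules and ToyPage are hypothetical, not Scrapes classes.
module ToyRules
  # Defines a reader named +name+ that looks matches up in a canned table.
  # Assumption: a limit of 1 yields the bare match, any other limit an Array.
  def rule(name, limit = -1)
    define_method(name) do
      found = @matches.fetch(name, [])
      found = found.first(limit) if limit >= 0
      limit == 1 ? found.first : found
    end
  end

  # Same as rule, but with the limit fixed at 1.
  def rule_1(name)
    rule(name, 1)
  end
end

class ToyPage
  extend ToyRules   # makes rule and rule_1 callable at class scope

  rule   :foo
  rule_1 :bar

  def initialize(matches)
    @matches = matches
  end
end

PAGE = ToyPage.new({ foo: ['only one'], bar: ['first row', 'second row'] })
```

Here PAGE.foo is still an Array although it holds a single match, while PAGE.bar is the first match on its own, so callers of a rule_1 reader do not need to index into a collection.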


# File 'lib/scrapes/rule_parser.rb', line 86

def rule_1(name, selector = '', extractor = nil, &block)
  rule(name, selector, extractor, 1, &block)
end

#rules ⇒ Object

:nodoc:



# File 'lib/scrapes/rule_parser.rb', line 170

def rules() # :nodoc:
  @microparser_rules ||= []
end

#selector(name, select = nil, &block) ⇒ Object

Creates a standalone selector that can later be used in a rule. Example:

class Foobar < Scrapes::Page
  selector :foo_select, 'table'
  rule_1 :bar, :foo_select # a Symbol triggers use of the selector
end

name

the name later used to invoke this selector

select

the selector to use, String or NilClass

block

a block selector, must not be defined if select is non-nil

A block selector is yielded the Hpricot doc object just once. The collection it returns is iterated over and each match is passed to the extractor. Example:

class Foobar < Scrapes::Page
  selector :foo_select_2 do |hpricot_doc|
    hpricot_doc.search('table')
  end
  rule_1 :bar, :foo_select_2 # a Symbol triggers use of the selector
end
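
The flow just described (selector called once with the document, each match handed to the extractor) can be sketched without Hpricot. Everything here is hypothetical: apply_rule, DOC, TABLE_SELECTOR and ID_EXTRACTOR are illustrative names, the document is a plain Hash, and treating a negative limit as "no limit" is an assumption.

```ruby
# Toy model only: these names are hypothetical, not part of Scrapes.
def apply_rule(doc, selector, extractor, limit = -1)
  matches = selector.call(doc)                    # selector sees the whole doc, once
  matches = matches.first(limit) if limit >= 0    # assumption: negative means no limit
  matches.map { |match| extractor.call(match) }   # extractor sees one match at a time
end

# A plain Hash stands in for the parsed document.
DOC = { tables: ['<table id="a">', '<table id="b">', '<table id="c">'] }.freeze

SELECTOR_CALLS = []   # records every selector invocation
TABLE_SELECTOR = lambda do |doc|
  SELECTOR_CALLS << doc
  doc[:tables]
end

# Pulls the id attribute value out of one matched string.
ID_EXTRACTOR = ->(table) { table[/id="(\w+)"/, 1] }

ALL_IDS   = apply_rule(DOC, TABLE_SELECTOR, ID_EXTRACTOR)
FIRST_TWO = apply_rule(DOC, TABLE_SELECTOR, ID_EXTRACTOR, 2)
```

After both calls SELECTOR_CALLS holds two entries, one per apply_rule call, even though the extractor ran five times in total.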

String selectors passed to rule or rule_1 are interpreted as Hpricot search strings. See code.whytheluckystiff.net/hpricot/wiki/AnHpricotShowcase



# File 'lib/scrapes/rule_parser.rb', line 109

def selector(name, select = nil, &block)
  tor '@selector', name, select, &block
end