Module: Scrapes::RuleParser

Included in: Page

Defined in: lib/scrapes/rule_parser.rb

Overview

The methods defined here are available at the class scope level of a Scrapes::Page subclass. For example:

class Foobar < Scrapes::Page
  rule :foo, 'foo'
  rule_1 :bar, 'bar', 'text()'
end


Using `rule`

Using `rule_1`

Using `selector`

Using `extractor`

Defined Under Namespace

Classes: Extractor, InvalidRuleException, Rule

Instance Method Summary

#extractor(name, extract = nil, &block) ⇒ Object

Creates a standalone extractor that can later be used in a rule.
#parse(node, context = nil, rules = nil) ⇒ Object

:nodoc:.
#rule(name, select = '', extract = nil, limit = -1, &block) ⇒ Object

name: the name later used to invoke this rule
select: the selector to use, String or Symbol
extract: the extractor to use, String, Symbol, or Class.
#rule_1(name, selector = '', extractor = nil, &block) ⇒ Object

Almost the same as rule except forces limit to be 1.
#rules ⇒ Object

:nodoc:.
#selector(name, select = nil, &block) ⇒ Object

Creates a standalone selector that can later be used in a rule.

Instance Method Details

#extractor(name, extract = nil, &block) ⇒ `Object`

Creates a standalone extractor that can later be used in a rule. Example:

class Foobar < Scrapes::Page
  extractor :mailto_extract do |elem|
    elem.attributes['href'].sub(/mailto:/,'') # remove the mailto: string
  end
  rule :emails, 'a[@href^="mailto:"]', :mailto_extract
end

name: the name later used to invoke this extractor
extract: the extractor to use, String or NilClass
block: a block extractor, must not be defined if extract is non-nil

A block extractor is yielded each object that matched the rule’s selector.

Extractors passed to rule or rule_1 are interpreted based on the class of the extractor, as follows:

NilClass

The result of the selector is just re-returned. Thus foo.my_rule would just return the selector results defined on the :my_rule rule.

Symbol

A custom extractor is used. See the example at the top of this method’s documentation.

Class

A nested class of the given name is used as a new inner-parser. An instance of that class is returned from each invocation of the extractor. Example:

class Outer < Scrapes::Page
  class Inner < Scrapes::Page
   rule_1 :bold_text, 'b', 'text()'
   rule_1 :img_src, 'img[@src]', '@src'
  end
  rule :items, 'tr', Inner
end

Now calling my_page.items (where my_page is an Outer instance) returns an Array of Inner objects, each of which separately parses out the bold text and image source of one table row in the document.

String

Two patterns:

@foobar: extract the contents of an attribute named ‘foobar’
foobar(): invoke the foobar builtin extractor, see Scrapes::Hpricot::Extractors
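
The four cases above amount to a dispatch on the extractor's class. The following is a minimal, self-contained sketch of that idea and not the Scrapes implementation: Node, Wrapper, NAMED, BUILTINS, and apply_extractor are all hypothetical names, and the Class case is simplified to wrapping the node rather than running a nested parser.

```ruby
# Hypothetical stand-in for a matched element (not an Hpricot object).
Node = Struct.new(:text, :attributes)

# Hypothetical registry of named extractors, looked up by Symbol.
NAMED = { shout: ->(node) { node.text.upcase } }

# Hypothetical registry of builtin extractors, looked up by "name()".
BUILTINS = { 'text' => ->(node) { node.text } }

# Hypothetical inner "parser": here it only wraps the node it was given.
class Wrapper
  attr_reader :node

  def initialize(node)
    @node = node
  end
end

# Interprets an extractor according to its class.
def apply_extractor(extractor, node)
  case extractor
  when nil             then node                                   # NilClass: re-return the match
  when Symbol          then NAMED.fetch(extractor).call(node)      # Symbol: named extractor
  when Class           then extractor.new(node)                    # Class: one instance per match
  when /\A@(.+)\z/     then node.attributes[Regexp.last_match(1)]  # "@attr": attribute value
  when /\A(\w+)\(\)\z/ then BUILTINS.fetch(Regexp.last_match(1)).call(node) # "name()": builtin
  else raise ArgumentError, "unknown extractor: #{extractor.inspect}"
  end
end

node = Node.new('hello', { 'src' => 'a.png' })
puts apply_extractor('@src', node)
puts apply_extractor(:shout, node)
```

Checking Symbol and Class before the two String patterns keeps the regular expressions from ever being matched against non-String extractors.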


# File 'lib/scrapes/rule_parser.rb', line 150

def extractor(name, extract = nil, &block)
  tor '@extractor', name, extract, &block
end

#parse(node, context = nil, rules = nil) ⇒ `Object`

:nodoc:

# File 'lib/scrapes/rule_parser.rb', line 155

def parse(node, context = nil, rules = nil) # :nodoc:
  context = self.new() unless context                               
  rules   = self.rules unless rules
  if rules
    rules.each_with_index do |rule, index|
      if rule and rule.process(node, context)
        less_rules = rules.clone unless less_rules
        less_rules[index] = nil
      end
    end
  end
  context
end

#rule(name, select = '', extract = nil, limit = -1, &block) ⇒ `Object`

name: the name later used to invoke this rule
select: the selector to use, String or Symbol
extract: the extractor to use, String, Symbol, or Class. See RuleParser#extractor
limit: the limit of nodes to send to extractor
block: a block extractor, must not be defined if extract is non-nil

Example:

class Foobar < Scrapes::Page
  rule :foo, 'foo'
end

Later it’s used as an instance method on the Scrapes::Page objects like this:

foobar.foo.each do |foo|
  example.attr << foo
end

Raises:

(InvalidRuleException)

# File 'lib/scrapes/rule_parser.rb', line 66

def rule(name, select = '', extract = nil, limit = -1, &block)
  raise InvalidRuleException, "First argument (rule name) is required" unless name
  attr name, true
  self.rules << Rule.new(name, selector(nil,select), extractor(nil,extract,&block), limit)
end
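
The listing above does two things at class-definition time: it declares an accessor named after the rule and appends a rule record to a per-class list. The sketch below reproduces that pattern in isolation, under stated assumptions: MiniRuleDsl and MiniPage are hypothetical names, a plain Hash stands in for Rule, and attr_accessor is used in place of the older `attr name, true` form.

```ruby
# Hypothetical module illustrating a class-level rule DSL.
module MiniRuleDsl
  # Per-class list of registered rules, created lazily.
  def rules
    @rules ||= []
  end

  # Registers a rule and declares a reader/writer named after it.
  def rule(name, select = '', extract = nil, limit = -1)
    raise ArgumentError, 'First argument (rule name) is required' unless name

    attr_accessor name
    rules << { name: name, select: select, extract: extract, limit: limit }
  end
end

# Hypothetical page class; `extend` makes the DSL available at class scope.
class MiniPage
  extend MiniRuleDsl

  rule :title, 'h1'
  rule :links, 'a', '@href'
end

page = MiniPage.new
page.title = 'Example'
puts page.title
puts MiniPage.rules.map { |r| r[:name] }.inspect
```

Because the rule list lives in a class-level instance variable, each class that extends the module keeps its own list.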

#rule_1(name, selector = '', extractor = nil, &block) ⇒ `Object`

Almost the same as rule, except that it forces limit to be 1. The other difference is that RuleParser#rule returns a collection of matches (an Array, even when there is only one match), whereas RuleParser#rule_1 returns just the match itself.

name: the name later used to invoke this rule
selector: the selector to use, String or Symbol
extractor: the extractor to use, String, Symbol, or Class
block: a block extractor, must not be defined if extract is non-nil

Example:

class Foobar < Scrapes::Page
  rule_1 :bar, 'tr'
end

Later it’s used as an instance method on the Scrapes::Page objects like this:

example.attr = foobar.bar
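
The difference in return shape between rule and rule_1 can be illustrated with a small helper. This is a sketch under the assumption that a limit of 1 unwraps the result and any other limit returns an Array; collect_matches is a hypothetical name and is not part of Scrapes.

```ruby
# Hypothetical helper: applies a limit to a list of matches.
# A negative limit means "no limit"; a limit of 1 returns the bare match.
def collect_matches(matches, limit = -1)
  return matches.dup if limit.negative?

  limited = matches.first(limit)
  limit == 1 ? limited.first : limited
end

rows = %w[row1 row2 row3]
puts collect_matches(rows).inspect     # rule-style: every match, as an Array
puts collect_matches(rows, 1).inspect  # rule_1-style: the first match only
```

With this shape, code that consumes a rule iterates over the result, while code that consumes a rule_1 uses the value directly and should expect nil when nothing matched.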


# File 'lib/scrapes/rule_parser.rb', line 86

def rule_1(name, selector = '', extractor = nil, &block)
  rule(name, selector, extractor, 1, &block)
end

#rules ⇒ `Object`

:nodoc:


# File 'lib/scrapes/rule_parser.rb', line 170

def rules() # :nodoc:
  @microparser_rules ||= []
end

#selector(name, select = nil, &block) ⇒ `Object`

Creates a standalone selector that can later be used in a rule. Example:

class Foobar < Scrapes::Page
  selector :foo_select, 'table'
  rule_1 :bar, :foo_select # a Symbol triggers use of the selector
end

name: the name later used to invoke this selector
select: the selector to use, String or NilClass
block: a block selector, must not be defined if select is non-nil

A block selector is yielded the Hpricot doc object just once. The collection it returns is iterated over and each match is passed to the extractor. Example:

class Foobar < Scrapes::Page
  selector :foo_select_2 do |hpricot_doc|
    hpricot_doc.search('table')
  end
  rule_1 :bar, :foo_select_2 # a Symbol triggers use of the selector
end

String selectors passed to rule or rule_1 are interpreted as Hpricot search strings. See code.whytheluckystiff.net/hpricot/wiki/AnHpricotShowcase
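
The flow described for block selectors (the selector runs once against the document, then each match goes to the extractor) can be sketched without Hpricot. Everything here is hypothetical: run_rule is an illustrative name, the document is a plain Hash, and both the selector and the extractor are lambdas.

```ruby
# Hypothetical pipeline: one selector call, then one extractor call per match.
def run_rule(doc, selector, extractor)
  matches = selector.call(doc)                  # called once with the whole document
  matches.map { |match| extractor.call(match) } # called once per match
end

doc = { tables: %w[first second] }              # stand-in for a parsed document
selector  = ->(d) { d[:tables] }
extractor = ->(match) { match.upcase }

puts run_rule(doc, selector, extractor).inspect
```

Keeping selection and extraction as separate callables is what allows a named selector to be reused across several rules with different extractors.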


# File 'lib/scrapes/rule_parser.rb', line 109

def selector(name, select = nil, &block)
  tor '@selector', name, select, &block
end
