Module: Scrapes::RuleParser
- Included in:
- Page
- Defined in:
- lib/scrapes/rule_parser.rb
Overview
Defined Under Namespace
Classes: Extractor, InvalidRuleException, Rule
Instance Method Summary
-
#extractor(name, extract = nil, &block) ⇒ Object
Creates a standalone extractor that can later be used in a rule.
-
#parse(node, context = nil, rules = nil) ⇒ Object
:nodoc:.
-
#rule(name, select = '', extract = nil, limit = -1, &block) ⇒ Object
- name
-
the name later used to invoke this rule
- select
-
the selector to use, String or Symbol
- extract
-
the extractor to use, String, Symbol, or Class.
-
#rule_1(name, selector = '', extractor = nil, &block) ⇒ Object
Almost the same as rule except forces limit to be 1.
-
#rules ⇒ Object
:nodoc:.
-
#selector(name, select = nil, &block) ⇒ Object
Creates a standalone selector that can later be used in a rule.
Instance Method Details
#extractor(name, extract = nil, &block) ⇒ Object
Creates a standalone extractor that can later be used in a rule. Example:
class Foobar < Scrapes::Page
  extractor :mailto_extract do |elem|
    elem.attributes['href'].sub(/mailto:/,'') # remove the mailto: string
  end
  rule :emails, 'a[@href^="mailto:"]', :mailto_extract
end
- name
-
the name later used to invoke this extractor
- extract
-
the extractor to use, String or NilClass
- block
-
a block extractor, must not be defined if extract is non-nil
A block extractor is yielded each object that matched the rule's selector.
Extractors passed to rule or rule_1 are interpreted based on the class of the extractor, as follows:
NilClass
The result of the selector is just re-returned. Thus foo.my_rule would just return the selector results defined on the :my_rule rule.
Symbol
A custom extractor is used. See the example at the top of this method's documentation.
Class
A nested class of the given name is used as a new inner-parser. An instance of that class is returned from each invocation of the extractor. Example:
class Outer < Scrapes::Page
  class Inner < Scrapes::Page
    rule_1 :bold_text, 'b', 'text()'
    rule_1 :img_src, 'img[@src]', '@src'
  end
  rule :items, 'tr', Inner
end
Now calling my_page.items returns an Array of Inner objects, each of which separately parses out the bold text and image source of one table row in the document.
String
Two patterns:
- @foobar
-
extract the contents of an attribute named ‘foobar’
- foobar()
-
invoke the foobar builtin extractor, see Scrapes::Hpricot::Extractors
# File 'lib/scrapes/rule_parser.rb', line 150
def extractor(name, extract = nil, &block)
  tor '@extractor', name, extract, &block
end
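The class-based interpretation of extractors described for this method can be pictured as a dispatch on the class of the extractor argument. The following is only a minimal sketch in plain Ruby, not the library's implementation: FakeElem, BUILTINS, CUSTOM, and resolve_extractor are all hypothetical names invented for the illustration, and no Hpricot objects are involved.

```ruby
# Hypothetical stand-in for a matched element: a hash of attributes plus text.
FakeElem = Struct.new(:attributes, :text)

# Hypothetical table of "builtin" extractors (NOT Scrapes::Hpricot::Extractors).
BUILTINS = {
  'text' => ->(elem) { elem.text }
}

# Hypothetical registry of named block extractors.
CUSTOM = {
  mailto_extract: ->(elem) { elem.attributes['href'].sub(/mailto:/, '') }
}

# Turn an extractor spec into a callable, dispatching on the spec's class.
def resolve_extractor(spec)
  case spec
  when nil
    ->(elem) { elem }                     # NilClass: re-return the selector result
  when Symbol
    CUSTOM.fetch(spec)                    # Symbol: named custom extractor
  when Class
    ->(elem) { spec.new(elem) }           # Class: wrap each match in an inner object
  when /\A@(.+)\z/
    attr = Regexp.last_match(1)
    ->(elem) { elem.attributes[attr] }    # "@foobar": attribute contents
  when /\A(\w+)\(\)\z/
    BUILTINS.fetch(Regexp.last_match(1))  # "foobar()": builtin extractor
  else
    raise ArgumentError, "unsupported extractor: #{spec.inspect}"
  end
end

link = FakeElem.new({ 'href' => 'mailto:a@example.com' }, 'Mail me')
resolve_extractor(:mailto_extract).call(link) # custom extractor strips "mailto:"
resolve_extractor('@href').call(link)         # attribute pattern
resolve_extractor('text()').call(link)        # builtin pattern
resolve_extractor(nil).call(link)             # element returned unchanged
```

The ordering of the branches matters in this sketch: Symbol is tested before the regular expressions because a Regexp would otherwise also match a Symbol's name.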
#parse(node, context = nil, rules = nil) ⇒ Object
:nodoc:
# File 'lib/scrapes/rule_parser.rb', line 155
def parse(node, context = nil, rules = nil) # :nodoc:
  context = self.new() unless context
  rules = self.rules unless rules
  if rules
    rules.each_with_index do |rule, index|
      if rule and rule.process(node, context)
        less_rules = rules.clone unless less_rules
        less_rules[index] = nil
      end
    end
  end
  context
end
#rule(name, select = '', extract = nil, limit = -1, &block) ⇒ Object
- name
-
the name later used to invoke this rule
- select
-
the selector to use, String or Symbol
- extract
-
the extractor to use, String, Symbol, or Class. See RuleParser#extractor
- limit
-
the limit of nodes to send to extractor
- block
-
a block extractor, must not be defined if extract is non-nil
Example:
class Foobar < Scrapes::Page
  rule :foo, 'foo'
end
Later it’s used as an instance method on the Scrapes::Page objects like this:
.foo.each do |foo|
example.attr << foo
end
# File 'lib/scrapes/rule_parser.rb', line 66
def rule(name, select = '', extract = nil, limit = -1, &block)
  raise InvalidRuleException, "First argument (rule name) is required" unless name
  attr name, true
  self.rules << Rule.new(name, selector(nil,select), extractor(nil,extract,&block), limit)
end
#rule_1(name, selector = '', extractor = nil, &block) ⇒ Object
Almost the same as rule, except that it forces limit to be 1. The other difference is that RuleParser#rule returns a collection of matches (an Array, even when it holds only one element), whereas RuleParser#rule_1 returns just the match.
- name
-
the name later used to invoke this rule
- selector
-
the selector to use, String or Symbol
- extractor
-
the extractor to use, String, Symbol, or Class
- block
-
a block extractor, must not be defined if extract is non-nil
Example:
class Foobar < Scrapes::Page
  rule_1 :bar, 'tr'
end
Later it’s used as an instance method on the Scrapes::Page objects like this:
example.attr = .
# File 'lib/scrapes/rule_parser.rb', line 86
def rule_1(name, selector = '', extractor = nil, &block)
  rule(name, selector, extractor, 1, &block)
end
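The difference in return shape between rule and rule_1 can be illustrated without the library. This is a sketch under assumptions: limit_matches, rule_result, and rule_1_result are hypothetical helpers, and treating the default limit of -1 as "no limit" is an assumption about the intent of that default, not something the source listing confirms.

```ruby
# Hypothetical helper: cap a list of matches at `limit`.
# ASSUMPTION: a negative limit (the documented default is -1) means "no limit".
def limit_matches(matches, limit)
  limit < 0 ? matches : matches.first(limit)
end

# rule-style result: always a collection, even when only one node is kept.
def rule_result(matches, limit = -1)
  limit_matches(matches, limit)
end

# rule_1-style result: the single match itself.
# In this sketch (not necessarily in Scrapes) that is nil when nothing matched.
def rule_1_result(matches)
  limit_matches(matches, 1).first
end

rows = ['<tr>a</tr>', '<tr>b</tr>']
rule_result(rows)     # every match, as an Array
rule_result(rows, 1)  # still an Array, now of size 1
rule_1_result(rows)   # the first match, unwrapped
```

Callers of a rule_1 accessor therefore work with the match directly, while callers of a rule accessor iterate or index into the returned Array.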
#rules ⇒ Object
:nodoc:
# File 'lib/scrapes/rule_parser.rb', line 170
def rules() # :nodoc:
  @microparser_rules ||= []
end
#selector(name, select = nil, &block) ⇒ Object
Creates a standalone selector that can later be used in a rule. Example:
class Foobar < Scrapes::Page
  selector :foo_select, 'table'
  rule_1 :bar, :foo_select # a Symbol triggers use of the selector
end
- name
-
the name later used to invoke this selector
- select
-
the selector to use, String or NilClass
- block
-
a block selector, must not be defined if select is non-nil
A block selector is yielded the Hpricot doc object just once. The collection it returns is iterated over and each match is passed to the extractor. Example:
class Foobar < Scrapes::Page
  selector :foo_select_2 do |hpricot_doc|
    hpricot_doc.search('table') # use the block parameter, not an undefined `doc`
  end
  rule_1 :bar, :foo_select_2 # a Symbol triggers use of the selector
end
String selectors passed to rule or rule_1 are interpreted as Hpricot search strings. See code.whytheluckystiff.net/hpricot/wiki/AnHpricotShowcase
# File 'lib/scrapes/rule_parser.rb', line 109
def selector(name, select = nil, &block)
  tor '@selector', name, select, &block
end
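The three ways of supplying a selector (a search String, a Symbol naming a standalone selector, or a block that receives the document once) can be sketched in plain Ruby. Everything here is hypothetical and exists only for illustration: FakeDoc stands in for an Hpricot document, and SELECTORS, define_selector, and resolve_selector are invented names, not part of Scrapes.

```ruby
# Hypothetical stand-in for an Hpricot document: `search` looks up stored
# nodes by query string instead of running a real query.
class FakeDoc
  def initialize(nodes)
    @nodes = nodes
  end

  def search(query)
    @nodes.fetch(query, [])
  end
end

# Hypothetical registry of named (standalone) selectors.
SELECTORS = {}

# Register a selector from either a search string or a block, never both.
def define_selector(name, select = nil, &block)
  raise ArgumentError, 'give a search string or a block, not both' if select && block
  SELECTORS[name] = block || ->(doc) { doc.search(select) }
end

# Resolve what a rule was given: a Symbol names a registered selector,
# a String is used directly as a search string.
def resolve_selector(spec)
  case spec
  when Symbol then SELECTORS.fetch(spec)
  when String then ->(doc) { doc.search(spec) }
  else raise ArgumentError, "unsupported selector: #{spec.inspect}"
  end
end

doc = FakeDoc.new({ 'table' => [:table_1, :table_2] })
define_selector(:foo_select, 'table')
define_selector(:foo_select_2) { |d| d.search('table') }

# Each resolved selector is called once with the document; the collection it
# returns is what would then be iterated and handed to the extractor.
resolve_selector(:foo_select).call(doc)
resolve_selector(:foo_select_2).call(doc)
resolve_selector('table').call(doc)
```

All three calls yield the same collection in this sketch, which mirrors the point that a Symbol is only an indirection to a previously defined selector.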