Class: Grubby::Scraper
- Inherits: Object
  - Object
  - Grubby::Scraper
- Defined in: lib/grubby/scraper.rb
Direct Known Subclasses
Defined Under Namespace
Classes: Error
Instance Attribute Summary
- #errors ⇒ Hash{Symbol => StandardError} (readonly)
  Collected errors raised during #initialize by Scraper.scrapes blocks, indexed by field name.
- #source ⇒ Object (readonly)
  The object being scraped.
Class Method Summary
- .each(start, agent = $grubby, next_method: :next) {|scraper| ... } ⇒ void
  Iterates a series of pages, starting at start.
- .fields ⇒ Array<Symbol>
  Fields defined via Scraper.scrapes.
- .scrape(url, agent = $grubby) ⇒ Grubby::Scraper
  Instantiates the Scraper class with the resource indicated by url.
- .scrapes(field, **options, &block) ⇒ void
  Defines an attribute reader method named by field.
Instance Method Summary
- #[](field) ⇒ Object
  Returns the scraped value named by field.
- #initialize(source) ⇒ Scraper (constructor)
  A new instance of Scraper.
- #to_h ⇒ Hash{Symbol => Object}
  Returns all scraped values as a Hash.
Constructor Details
#initialize(source) ⇒ Scraper
Returns a new instance of Scraper.
# File 'lib/grubby/scraper.rb', line 228
def initialize(source)
  @source = source
  @scraped = {}
  @errors = {}

  self.class.fields.each do |field|
    begin
      self.send(field)
    rescue FieldScrapeFailedError
    end
  end

  raise Error.new(self) unless @errors.empty?
end
Instance Attribute Details
#errors ⇒ Hash{Symbol => StandardError} (readonly)
Collected errors raised during #initialize by scrapes blocks, indexed by field name. This Hash will be empty if #initialize did not raise a Grubby::Scraper::Error.
# File 'lib/grubby/scraper.rb', line 223
def errors
  @errors
end
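The error-collection behavior described above can be sketched without the gem. The following is a minimal stand-in, assuming a hypothetical `CollectingScraper` class with a hypothetical `EXTRACTORS` table in place of real scrapes blocks; it only illustrates the pattern of gathering per-field errors and raising once at the end of the constructor, and is not the library's implementation.

```ruby
# Hypothetical stand-in for the error-collecting constructor (not the gem's code).
class CollectingScraper
  # Raised once after all fields were attempted; keeps a reference to the scraper.
  class Error < StandardError
    attr_reader :scraper

    def initialize(scraper)
      @scraper = scraper
      super("failed to scrape: #{scraper.errors.keys.join(', ')}")
    end
  end

  # Hypothetical field table: field name => lambda extracting from the source.
  EXTRACTORS = {
    title: ->(src) { src.fetch(:title) },
    price: ->(src) { src.fetch(:price) },
  }.freeze

  attr_reader :source, :errors

  def initialize(source)
    @source = source
    @scraped = {}
    @errors = {}
    EXTRACTORS.each do |field, extractor|
      begin
        @scraped[field] = extractor.call(source)
      rescue KeyError => e
        # Record the failure and keep going so every field is attempted.
        @errors[field] = e
      end
    end
    raise Error.new(self) unless @errors.empty?
  end
end

ok = CollectingScraper.new({ title: "Widget", price: 5 })

failure = begin
  CollectingScraper.new({ title: "Widget" })
  nil
rescue CollectingScraper::Error => e
  e
end
```

In this sketch `ok.errors` is empty, while `failure.scraper.errors` holds a single entry under `:price`, mirroring how the errors Hash is only populated when the constructor raises.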
#source ⇒ Object (readonly)
The object being scraped. Typically an instance of a Mechanize pluggable parser such as Mechanize::Page.
# File 'lib/grubby/scraper.rb', line 216
def source
  @source
end
Class Method Details
.each(start, agent = $grubby, next_method: :next) {|scraper| ... } ⇒ void
This method returns an undefined value.
Iterates a series of pages, starting at start. The Scraper class is instantiated with each page, and each Scraper instance is passed to the given block. Subsequent pages in the series are determined by invoking the next_method method on each Scraper instance.
Iteration stops when the next_method method returns falsy. If the next_method method returns a String or URI, that value will be treated as the URL of the next page. Otherwise that value will be treated as the page itself.
# File 'lib/grubby/scraper.rb', line 196
def self.each(start, agent = $grubby, next_method: :next)
  unless self.method_defined?(next_method)
    raise NoMethodError.new(nil, next_method), "#{self} does not define `#{next_method}`"
  end

  return to_enum(:each, start, agent, next_method: next_method) unless block_given?

  current = start
  while current
    current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
    scraper = self.new(current)
    yield scraper
    current = scraper.send(next_method)
  end
end
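To make the pagination loop concrete, here is a self-contained sketch that runs without the gem or a network. `PAGES`, `FakeAgent`, and `PagedScraper` are hypothetical names invented for illustration; the agent only mimics a `get` call, pages are plain Hashes, and the `method_defined?` guard and Enumerator fallback from the listing are left out for brevity.

```ruby
require "uri"

# Hypothetical in-memory "site": URL string => page (a plain Hash here).
PAGES = {
  "http://example.com/1" => { items: [1, 2], next_url: "http://example.com/2" },
  "http://example.com/2" => { items: [3], next_url: nil },
}.freeze

# Hypothetical agent; only mimics the `get` call used by the loop.
class FakeAgent
  def get(url)
    PAGES.fetch(url.to_s)
  end
end

class PagedScraper
  attr_reader :source

  def initialize(source)
    @source = source
  end

  def items
    source[:items]
  end

  # Returns the URL of the next page, or nil to stop iteration.
  def next
    source[:next_url]
  end

  # Simplified version of the loop: fetch when given a URL, otherwise
  # treat the value as the page itself.
  def self.each(start, agent, next_method: :next)
    current = start
    while current
      current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
      scraper = new(current)
      yield scraper
      current = scraper.send(next_method)
    end
  end
end

collected = []
PagedScraper.each("http://example.com/1", FakeAgent.new) do |scraper|
  collected.concat(scraper.items)
end
```

After the loop, `collected` contains the items of both pages in order; iteration ends because `next` returns nil on the second page.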
.fields ⇒ Array<Symbol>
Fields defined via scrapes.
# File 'lib/grubby/scraper.rb', line 105
def self.fields
  @fields ||= self == Grubby::Scraper ? [] : self.superclass.fields.dup
end
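The listing suggests that each subclass starts from a copy of its superclass's field list. A minimal sketch of that inheritance pattern follows, assuming hypothetical classes `BaseScraper`, `ParentScraper`, and `ChildScraper`, with a stripped-down `scrapes` that only registers the field name.

```ruby
# Hypothetical base class reproducing only the field-registry pattern.
class BaseScraper
  # Each class lazily copies its superclass's list the first time it is asked.
  def self.fields
    @fields ||= self == BaseScraper ? [] : superclass.fields.dup
  end

  # Simplified: registers the name only; no reader method is defined here.
  def self.scrapes(field)
    (fields << field.to_sym).uniq!
  end
end

class ParentScraper < BaseScraper
  scrapes :title
end

class ChildScraper < ParentScraper
  scrapes :price
end

parent_fields = ParentScraper.fields # => [:title]
child_fields = ChildScraper.fields   # => [:title, :price]
```

Because the list is duplicated rather than shared, adding `:price` to the child leaves the parent's list untouched.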
.scrape(url, agent = $grubby) ⇒ Grubby::Scraper
Instantiates the Scraper class with the resource indicated by url. This method acts as a default factory method, and provides a standard interface for overrides.
# File 'lib/grubby/scraper.rb', line 139
def self.scrape(url, agent = $grubby)
  self.new(agent.get(url))
end
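One way a subclass might use this factory as an override point is sketched below. Everything here is hypothetical: `StubAgent` stands in for a real agent, `BasicScraper` mirrors only the factory method, and `ItemScraper` is an invented subclass that normalizes a bare ID into a URL before delegating to `super`.

```ruby
# Hypothetical agent returning canned pages keyed by URL.
class StubAgent
  def initialize(pages)
    @pages = pages
  end

  def get(url)
    @pages.fetch(url)
  end
end

# Hypothetical base class mirroring only the default factory method.
class BasicScraper
  attr_reader :source

  def initialize(source)
    @source = source
  end

  def self.scrape(url, agent)
    new(agent.get(url))
  end
end

# Override: accept either a full URL or a bare item ID.
class ItemScraper < BasicScraper
  def self.scrape(id_or_url, agent)
    url = id_or_url.to_s.start_with?("http") ? id_or_url : "http://example.com/items/#{id_or_url}"
    super(url, agent)
  end
end

agent = StubAgent.new("http://example.com/items/42" => { name: "Widget" })
by_id = ItemScraper.scrape(42, agent)
by_url = ItemScraper.scrape("http://example.com/items/42", agent)
```

Callers keep the same `scrape(url, agent)` shape in both cases, which is the benefit of routing construction through a single factory method.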
.scrapes(field, **options, &block) ⇒ void
This method returns an undefined value.
Defines an attribute reader method named by field. During #initialize, the given block is called, and the attribute is set to the block’s return value.
By default, raises an exception if the block’s return value is nil. To prevent this behavior, set the :optional option to true. Alternatively, the block can be conditionally evaluated, based on another method’s return value, using the :if or :unless options.
# File 'lib/grubby/scraper.rb', line 68
def self.scrapes(field, **options, &block)
  field = field.to_sym
  (self.fields << field).uniq!

  define_method(field) do
    raise "#{self.class}#initialize does not invoke `super`" unless defined?(@scraped)

    if !@scraped.key?(field) && !@errors.key?(field)
      begin
        skip = (options[:if] && !self.send(options[:if])) ||
          (options[:unless] && self.send(options[:unless]))

        if skip
          @scraped[field] = nil
        else
          @scraped[field] = instance_eval(&block)
          if @scraped[field].nil?
            raise FieldValueRequiredError.new(field) unless options[:optional]
            $log.debug("#{self.class}##{field} is nil")
          end
        end
      rescue RuntimeError, IndexError => e
        @errors[field] = e
      end
    end

    if @errors.key?(field)
      raise FieldScrapeFailedError.new(field, @errors[field])
    else
      @scraped[field]
    end
  end
end
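The :optional and :if options can be illustrated with a reduced, gem-free sketch. `FieldScraper` and `ProductScraper` below are hypothetical; unlike the listing, this version evaluates fields lazily on first access rather than during #initialize, raises a plain RuntimeError for a missing required value, and does not collect errors.

```ruby
# Hypothetical reduced version of a `scrapes`-style field DSL.
class FieldScraper
  attr_reader :source

  def initialize(source)
    @source = source
    @scraped = {}
  end

  def self.scrapes(field, **options, &block)
    field = field.to_sym
    define_method(field) do
      unless @scraped.key?(field)
        # Skip evaluation when the :if guard is falsy or the :unless guard is truthy.
        skip = (options[:if] && !send(options[:if])) ||
               (options[:unless] && send(options[:unless]))
        value = skip ? nil : instance_eval(&block)
        if value.nil? && !skip && !options[:optional]
          raise "#{self.class}##{field} is required but was nil"
        end
        @scraped[field] = value
      end
      @scraped[field]
    end
  end
end

class ProductScraper < FieldScraper
  scrapes(:name) { source[:name] }
  scrapes(:subtitle, optional: true) { source[:subtitle] }
  scrapes(:sale_price, if: :on_sale?) { source[:sale_price] }

  def on_sale?
    source[:on_sale] == true
  end
end

product = ProductScraper.new({ name: "Widget", on_sale: false, sale_price: 3 })
```

Here `product.name` returns "Widget", `product.subtitle` returns nil because the field is optional, and `product.sale_price` returns nil because the `on_sale?` guard is false and the block is never evaluated. Calling `name` on a scraper whose source lacks `:name` raises.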
Instance Method Details
#[](field) ⇒ Object
Returns the scraped value named by field.
# File 'lib/grubby/scraper.rb', line 249
def [](field)
  @scraped.fetch(field.to_sym)
end
#to_h ⇒ Hash{Symbol => Object}
Returns all scraped values as a Hash.
# File 'lib/grubby/scraper.rb', line 256
def to_h
  @scraped.dup
end
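Two consequences of the #[] and #to_h listings are worth seeing in isolation: `fetch` raises for unknown fields instead of returning nil, and `dup` hands back a copy rather than the internal Hash. The sketch below uses a hypothetical `ValueHolder` class that wraps a Hash the same way; it is not the gem's class.

```ruby
# Hypothetical holder reproducing only the two accessor methods.
class ValueHolder
  def initialize(values)
    @scraped = values
  end

  # Accepts a String or Symbol; raises KeyError for unknown fields.
  def [](field)
    @scraped.fetch(field.to_sym)
  end

  # Returns a shallow copy, so callers cannot alter the stored values.
  def to_h
    @scraped.dup
  end
end

holder = ValueHolder.new({ title: "Widget" })
by_string = holder["title"]

missing = begin
  holder[:missing]
  false
rescue KeyError
  true
end

copy = holder.to_h
copy[:title] = "Changed"
```

After this runs, `by_string` is "Widget", `missing` is true, and `holder[:title]` is still "Widget" despite the change made to `copy`.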