Method: Grubby::Scraper.each

Defined in:
lib/grubby/scraper.rb

.each(start, agent = $grubby, next_method: :next) {|scraper| ... } ⇒ void

This method returns an undefined value.

Iterates over a series of pages, starting at start. The Scraper class is instantiated with each page, and each Scraper instance is passed to the given block. Each subsequent page is determined by calling the method named by next_method on the current Scraper instance.

Iteration stops when the method named by next_method returns a falsy value (nil or false). If it returns a String or URI, that value is treated as the URL of the next page and is fetched with agent. Otherwise, the value is treated as the page itself.
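The iteration protocol described above can be sketched without any network access. The following is a minimal illustration, not Grubby's implementation: FakePage, FakeAgent, and SketchScraper are hypothetical stand-ins invented for this example, and the agent simply looks pages up in a Hash instead of issuing HTTP requests.

```ruby
require "uri"

# Hypothetical stand-in for a fetched page; NOT a Mechanize class.
FakePage = Struct.new(:url, :next_url)

# Hypothetical stand-in for the agent: looks pages up in a Hash
# instead of making HTTP requests.
class FakeAgent
  def initialize(pages)
    @pages = pages
  end

  def get(url)
    @pages.fetch(url.to_s)
  end
end

# Hypothetical scraper following the protocol described above.
class SketchScraper
  attr_reader :page

  def initialize(page)
    @page = page
  end

  # Returns the next URL as a String, or nil to stop iteration.
  def next
    page.next_url
  end

  def self.each(start, agent, next_method: :next)
    current = start
    while current
      # Strings and URIs are fetched; any other value is used as the page itself.
      current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
      scraper = new(current)
      yield scraper
      current = scraper.public_send(next_method)
    end
  end
end

agent = FakeAgent.new({
  "p1" => FakePage.new("p1", "p2"),
  "p2" => FakePage.new("p2", "p3"),
  "p3" => FakePage.new("p3", nil),
})

# Start from a URL-like String.
visited = []
SketchScraper.each("p1", agent) { |scraper| visited << scraper.page.url }

# Start from a page object; it is used directly rather than fetched.
from_page = []
SketchScraper.each(FakePage.new("p2", "p3"), agent) { |scraper| from_page << scraper.page.url }
```

In this sketch the loop ends when next returns nil on the last page, which mirrors the falsy-return rule stated above.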

Examples:

Iterate from page object

class PostsIndexScraper < Grubby::PageScraper
  def next
    page.link_with(text: "Next >")&.click
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1") do |scraper|
  scraper.page.uri.query  # == "page=1", "page=2", "page=3", ...
end

Iterate from URI

class PostsIndexScraper < Grubby::PageScraper
  def next
    page.link_with(text: "Next >")&.to_absolute_uri
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1") do |scraper|
  scraper.page.uri.query  # == "page=1", "page=2", "page=3", ...
end

Specifying the iteration method

class PostsIndexScraper < Grubby::PageScraper
  scrapes(:next_uri, optional: true) do
    page.link_with(text: "Next >")&.to_absolute_uri
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1", next_method: :next_uri) do |scraper|
  scraper.page.uri.query  # == "page=1", "page=2", "page=3", ...
end

Parameters:

  • start (String, URI, Mechanize::Page, Mechanize::File)
  • agent (Mechanize) (defaults to: $grubby)
  • next_method (Symbol) (defaults to: :next)

Yield Parameters:

  • scraper (Grubby::Scraper)

Raises:

  • (NoMethodError)

    if the Scraper class does not define the method indicated by next_method

  • (Grubby::Scraper::Error)

    if any scrapes blocks fail
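The NoMethodError condition can be illustrated with a small guard, assuming the check is made against the class's instance methods before any page is visited. GuardSketch and check_next_method! are hypothetical names used only for this sketch, and the way the exception is constructed here is a simplification rather than the library's exact code.

```ruby
# Hypothetical class used only to illustrate the guard; not part of Grubby.
class GuardSketch
  def next_page; end

  def self.check_next_method!(next_method)
    # method_defined? matches public and protected instance methods.
    unless method_defined?(next_method)
      raise NoMethodError.new("#{self} does not define `#{next_method}`", next_method)
    end
    true
  end
end

# Passes, because GuardSketch defines next_page.
ok = GuardSketch.check_next_method!(:next_page)

# Fails, because GuardSketch has no instance method called next.
missing = nil
message = nil
begin
  GuardSketch.check_next_method!(:next)
rescue NoMethodError => e
  missing = e.name
  message = e.message
end
```

Because the guard runs before iteration starts, a misspelled next_method surfaces immediately instead of after the first page has been processed.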



# File 'lib/grubby/scraper.rb', line 196

def self.each(start, agent = $grubby, next_method: :next)
  unless self.method_defined?(next_method)
    raise NoMethodError.new(nil, next_method), "#{self} does not define `#{next_method}`"
  end

  return to_enum(:each, start, agent, next_method: next_method) unless block_given?

  current = start
  while current
    current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
    scraper = self.new(current)
    yield scraper
    current = scraper.send(next_method)
  end
end
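The source listing above returns to_enum(...) when no block is given, so the same iteration can be consumed as an Enumerator. The sketch below shows that pattern in isolation. CountingScraper is a hypothetical stand-in whose "pages" are plain Integers, so no agent or HTTP access is involved.

```ruby
# Hypothetical stand-in; pages are plain Integers and nothing is fetched.
class CountingScraper
  attr_reader :page

  def initialize(page)
    @page = page
  end

  # The next "page" is the next integer; iteration stops after page 5.
  def next
    page < 5 ? page + 1 : nil
  end

  def self.each(start, next_method: :next)
    # Without a block, return an Enumerator over the same iteration.
    return to_enum(:each, start, next_method: next_method) unless block_given?

    current = start
    while current
      scraper = new(current)
      yield scraper
      current = scraper.public_send(next_method)
    end
  end
end

# Take only the first two scrapers; later pages are never visited.
first_two = CountingScraper.each(1).first(2).map(&:page)

# Collect every page in the series.
all_pages = CountingScraper.each(1).map(&:page)
```

The Enumerator form is convenient when only part of a series is needed, since methods such as first stop the underlying loop early.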