Class: Grubby::Scraper

Inherits:

Object

Defined in: lib/grubby/scraper.rb

Direct Known Subclasses

JsonScraper, PageScraper

Defined Under Namespace

Classes: Error

Instance Attribute Summary

#errors ⇒ Hash{Symbol => StandardError} readonly

Collected errors raised during #initialize by Scraper.scrapes blocks, indexed by field name.
#source ⇒ Object readonly

The object being scraped.

Class Method Summary

.each(start, agent = $grubby, next_method: :next) {|scraper| ... } ⇒ void

Iterates a series of pages, starting at start.
.fields ⇒ Array<Symbol>

Fields defined via Scraper.scrapes.
.scrape(url, agent = $grubby) ⇒ Grubby::Scraper

Instantiates the Scraper class with the resource indicated by url.
.scrapes(field, **options, &block) ⇒ void

Defines an attribute reader method named by field.

Instance Method Summary

#[](field) ⇒ Object

Returns the scraped value named by field.
#initialize(source) ⇒ Scraper constructor

A new instance of Scraper.
#to_h ⇒ Hash{Symbol => Object}

Returns all scraped values as a Hash.

Constructor Details

#initialize(source) ⇒ `Scraper`

Returns a new instance of Scraper.

Parameters:

source (Object)

Raises:

(Grubby::Scraper::Error) —

if any scrapes blocks fail

# File 'lib/grubby/scraper.rb', line 228

def initialize(source)
  @source = source
  @scraped = {}
  @errors = {}

  self.class.fields.each do |field|
    begin
      self.send(field)
    rescue FieldScrapeFailedError
    end
  end

  raise Error.new(self) unless @errors.empty?
end

Instance Attribute Details

#errors ⇒ `Hash{Symbol => StandardError}` (readonly)

Collected errors raised during #initialize by scrapes blocks, indexed by field name. This Hash will be empty if #initialize did not raise a Grubby::Scraper::Error.

Returns:

(Hash{Symbol => StandardError})


# File 'lib/grubby/scraper.rb', line 223

def errors
  @errors
end

#source ⇒ `Object` (readonly)

The object being scraped. Typically an instance of a Mechanize pluggable parser such as Mechanize::Page.

Returns:

(Object)


# File 'lib/grubby/scraper.rb', line 216

def source
  @source
end

Class Method Details

.each(start, agent = $grubby, next_method: :next) {|scraper| ... } ⇒ `void`

This method returns an undefined value.

Iterates a series of pages, starting at start. The Scraper class is instantiated with each page, and each Scraper instance is passed to the given block. Subsequent pages in the series are determined by invoking the next_method method on each Scraper instance.

Iteration stops when the next_method method returns a falsy value (nil or false). If it returns a String or URI, that value is treated as the URL of the next page and is fetched with agent. Otherwise the value is treated as the page itself.

Examples:

Iterate from page object

class PostsIndexScraper < Grubby::PageScraper
  def next
    page.link_with(text: "Next >")&.click
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1") do |scraper|
  scraper.page.uri.query  # == "page=1", "page=2", "page=3", ...
end

Iterate from URI

class PostsIndexScraper < Grubby::PageScraper
  def next
    page.link_with(text: "Next >")&.to_absolute_uri
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1") do |scraper|
  scraper.page.uri.query  # == "page=1", "page=2", "page=3", ...
end

Specifying the iteration method

class PostsIndexScraper < Grubby::PageScraper
  scrapes(:next_uri, optional: true) do
    page.link_with(text: "Next >")&.to_absolute_uri
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1", next_method: :next_uri) do |scraper|
  scraper.page.uri.query  # == "page=1", "page=2", "page=3", ...
end

Parameters:

start (String, URI, Mechanize::Page, Mechanize::File)
agent (Mechanize) (defaults to: $grubby)
next_method (Symbol) (defaults to: :next)

Yield Parameters:

scraper (Grubby::Scraper)

Raises:

(NoMethodError) —

if the Scraper class does not define the method indicated by next_method
(Grubby::Scraper::Error) —

if any scrapes blocks fail

# File 'lib/grubby/scraper.rb', line 196

def self.each(start, agent = $grubby, next_method: :next)
  unless self.method_defined?(next_method)
    raise NoMethodError.new(nil, next_method), "#{self} does not define `#{next_method}`"
  end

  return to_enum(:each, start, agent, next_method: next_method) unless block_given?

  current = start
  while current
    current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
    scraper = self.new(current)
    yield scraper
    current = scraper.send(next_method)
  end
end
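
The loop above can be traced without network access. The following is a minimal sketch that copies the iteration logic into a stand-in class; `FakeAgent`, `PageSketch`, and the Hash used as a "page" are hypothetical and are not part of Grubby or Mechanize. Judging from the source listing, calling the method without a block returns an Enumerator, which the last line relies on.

```ruby
require "uri"

# Hypothetical agent: `get` wraps the URL in a Hash standing in for a fetched page.
class FakeAgent
  def get(url)
    { url: url.to_s }
  end
end

# Stand-in for a Scraper subclass; only the iteration logic is modeled.
class PageSketch
  attr_reader :page

  def initialize(page)
    @page = page
  end

  # Returns the next page's URL as a String, or nil once page 3 is reached.
  def next
    n = page[:url][/page=(\d+)/, 1].to_i
    "https://example.com/posts?page=#{n + 1}" if n < 3
  end

  # Same loop as the source above, minus the method_defined? check.
  def self.each(start, agent, next_method: :next)
    return to_enum(:each, start, agent, next_method: next_method) unless block_given?

    current = start
    while current
      # Strings and URIs are fetched; anything else is used as the page itself.
      current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
      scraper = new(current)
      yield scraper
      current = scraper.send(next_method)
    end
  end
end

# No block given, so an Enumerator is returned and can be chained with map.
urls = PageSketch.each("https://example.com/posts?page=1", FakeAgent.new).map { |s| s.page[:url] }
```

Here `urls` holds the three page URLs in order, and iteration ends when `next` returns nil.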

.fields ⇒ `Array<Symbol>`

Fields defined via scrapes.

Returns:

(Array<Symbol>)

# File 'lib/grubby/scraper.rb', line 105

def self.fields
  @fields ||= self == Grubby::Scraper ? [] : self.superclass.fields.dup
end
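
The memoized `dup` in the source above suggests that a subclass starts from a copy of its parent's field list and then extends it independently. A minimal sketch of that behavior, using hypothetical stand-in classes (`BaseSketch`, `ParentSketch`, `ChildSketch`) rather than Grubby itself:

```ruby
# Stand-in for Grubby::Scraper; mirrors only the `fields` logic shown above.
class BaseSketch
  def self.fields
    @fields ||= self == BaseSketch ? [] : superclass.fields.dup
  end

  # Simplified: records the field name only; no reader method is defined.
  def self.scrapes(field)
    (fields << field.to_sym).uniq!
  end
end

class ParentSketch < BaseSketch
  scrapes :title
end

class ChildSketch < ParentSketch
  scrapes :author
end

ParentSketch.fields  # [:title]
ChildSketch.fields   # [:title, :author]
```

Because the copy is taken the first time a subclass's list is requested, fields added to the parent after that point would not appear in the subclass.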

.scrape(url, agent = $grubby) ⇒ `Grubby::Scraper`

Instantiates the Scraper class with the resource indicated by url. This method acts as a default factory method, and provides a standard interface for overrides.

Examples:

Default factory method

class PostPageScraper < Grubby::PageScraper
  # ...
end

PostPageScraper.scrape("https://example.com/posts/42")
  # == PostPageScraper.new($grubby.get("https://example.com/posts/42"))

Override factory method

class PostApiScraper < Grubby::JsonScraper
  # ...

  def self.scrape(url, agent = $grubby)
    api_url = url.to_s.sub(%r"//example.com/(.+)", '//api.example.com/\1.json')
    super(api_url, agent)
  end
end

PostApiScraper.scrape("https://example.com/posts/42")
  # == PostApiScraper.new($grubby.get("https://api.example.com/posts/42.json"))

Parameters:

url (String, URI)
agent (Mechanize) (defaults to: $grubby)

Returns:

(Grubby::Scraper)

Raises:

(Grubby::Scraper::Error) —

if any scrapes blocks fail

# File 'lib/grubby/scraper.rb', line 139

def self.scrape(url, agent = $grubby)
  self.new(agent.get(url))
end

.scrapes(field, **options, &block) ⇒ `void`

This method returns an undefined value.

Defines an attribute reader method named by field. During #initialize, the given block is called, and the attribute is set to the block’s return value.

By default, raises an exception if the block’s return value is nil. To prevent this behavior, set the :optional option to true. Alternatively, the block can be conditionally evaluated, based on another method’s return value, using the :if or :unless options.

Examples:

Default behavior

class GreetingScraper < Grubby::Scraper
  scrapes(:name) do
    source[/Hello (\w+)/, 1]
  end
end

scraper = GreetingScraper.new("Hello World!")
scraper.name  # == "World"

scraper = GreetingScraper.new("Hello!")  # raises Grubby::Scraper::Error

Optional scraped value

class GreetingScraper < Grubby::Scraper
  scrapes(:name, optional: true) do
    source[/Hello (\w+)/, 1]
  end
end

scraper = GreetingScraper.new("Hello World!")
scraper.name  # == "World"

scraper = GreetingScraper.new("Hello!")
scraper.name  # == nil

Conditional scraped value

class GreetingScraper < Grubby::Scraper
  def hello?
    source.start_with?("Hello")
  end

  scrapes(:name, if: :hello?) do
    source[/Hello (\w+)/, 1]
  end
end

scraper = GreetingScraper.new("Hello World!")
scraper.name  # == "World"

scraper = GreetingScraper.new("Hello!")  # raises Grubby::Scraper::Error

scraper = GreetingScraper.new("How are you?")
scraper.name  # == nil

Parameters:

field (Symbol, String)
options (Hash)

Options Hash (**options):

:optional (Boolean) — default: false —

Whether the block should be allowed to return a nil value
:if (Symbol) — default: nil —

Name of predicate method that determines if the block should be evaluated
:unless (Symbol) — default: nil —

Name of predicate method that determines if the block should not be evaluated

Yield Returns:

(Object)

# File 'lib/grubby/scraper.rb', line 68

def self.scrapes(field, **options, &block)
  field = field.to_sym
  (self.fields << field).uniq!

  define_method(field) do
    raise "#{self.class}#initialize does not invoke `super`" unless defined?(@scraped)

    if !@scraped.key?(field) && !@errors.key?(field)
      begin
        skip = (options[:if] && !self.send(options[:if])) ||
          (options[:unless] && self.send(options[:unless]))

        if skip
          @scraped[field] = nil
        else
          @scraped[field] = instance_eval(&block)
          if @scraped[field].nil?
            raise FieldValueRequiredError.new(field) unless options[:optional]
            $log.debug("#{self.class}##{field} is nil")
          end
        end
      rescue RuntimeError, IndexError => e
        @errors[field] = e
      end
    end

    if @errors.key?(field)
      raise FieldScrapeFailedError.new(field, @errors[field])
    else
      @scraped[field]
    end
  end
end
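
The documented examples cover `:optional` and `:if`. In the source above, `:unless` goes through the same skip logic with the predicate inverted. A minimal sketch of that path, assuming a simplified stand-in class: `MiniScraper`, `GreetingSketch`, and `farewell?` are hypothetical names, and error collection and memoization are left out.

```ruby
# Simplified stand-in for Grubby::Scraper: models only :optional, :if, and :unless.
class MiniScraper
  attr_reader :source

  def initialize(source)
    @source = source
  end

  def self.scrapes(field, **options, &block)
    define_method(field) do
      # Same skip condition as in the source listing above.
      skip = (options[:if] && !send(options[:if])) ||
        (options[:unless] && send(options[:unless]))
      next nil if skip

      value = instance_eval(&block)
      raise "#{field} is required" if value.nil? && !options[:optional]
      value
    end
  end
end

class GreetingSketch < MiniScraper
  def farewell?
    source.start_with?("Goodbye")
  end

  # The block is skipped for farewells; otherwise a non-nil value is required.
  scrapes(:name, unless: :farewell?) do
    source[/Hello (\w+)/, 1]
  end
end

GreetingSketch.new("Hello World!").name  # "World"
GreetingSketch.new("Goodbye!").name      # nil, the block never runs
```

Unlike this stand-in, the real method evaluates fields during #initialize and collects failures in #errors.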

Instance Method Details

#[](field) ⇒ `Object`

Returns the scraped value named by field.

Parameters:

field (Symbol, String)

Returns:

(Object)

Raises:

(KeyError) —

if field is not the name of a scraped field

# File 'lib/grubby/scraper.rb', line 249

def [](field)
  @scraped.fetch(field.to_sym)
end
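
Because the source above delegates to `Hash#fetch`, an unknown field name raises `KeyError` instead of returning nil, while a known field whose value is nil is returned normally. A short sketch using a plain Hash as a stand-in for the scraper's internal values (the `scraped` variable and its contents are illustrative only):

```ruby
# Plain Hash standing in for a scraper's internal @scraped values.
scraped = { name: "World", nickname: nil }

scraped.fetch("name".to_sym)  # "World"; String names work because of to_sym
scraped.fetch(:nickname)      # nil; the key exists, so no error is raised

begin
  scraped.fetch(:missing)
rescue KeyError => e
  message = e.message         # the message names the missing key
end
```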

#to_h ⇒ `Hash{Symbol => Object}`

Returns all scraped values as a Hash.

Returns:

(Hash{Symbol => Object})

# File 'lib/grubby/scraper.rb', line 256

def to_h
  @scraped.dup
end
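
Since the source above returns `@scraped.dup`, the result is a shallow copy: adding or removing keys on it leaves the scraper untouched, but the values are the same objects. A small sketch with a plain Hash standing in for the internal values (`scraped` and `copy` are illustrative names):

```ruby
# Plain Hash standing in for a scraper's internal @scraped values.
scraped = { title: "Post".dup }

copy = scraped.dup       # what to_h would return
copy[:extra] = 1         # affects only the copy
copy[:title] << "!"      # mutates the String shared by both hashes

scraped.key?(:extra)     # false
scraped[:title]          # "Post!"
```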
