Class: Grubby::Scraper

Inherits:
Object
Defined in:
lib/grubby/scraper.rb

Direct Known Subclasses

JsonScraper, PageScraper

Defined Under Namespace

Classes: Error

Constructor Details

#initialize(source) ⇒ Scraper

Returns a new instance of Scraper.

Parameters:

  • source

Raises:

  • (Grubby::Scraper::Error)

    if any scrapes blocks fail


# File 'lib/grubby/scraper.rb', line 228

def initialize(source)
  @source = source
  @scraped = {}
  @errors = {}

  self.class.fields.each do |field|
    begin
      self.send(field)
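    rescue FieldScrapeFailedError
      # The field method has already recorded the failure in @errors;
      # continue so that every field is attempted.
    end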
  end

  raise Error.new(self) unless @errors.empty?
end
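
The constructor evaluates every field before raising, so one Grubby::Scraper::Error can report all failed fields at once. The following is a minimal stand-alone sketch of that collect-then-raise pattern, not Grubby's implementation. `MiniError`, `MiniScraper`, and the `FIELDS` table are hypothetical names used only for illustration.

```ruby
# Hypothetical stand-in for Grubby::Scraper::Error; it only carries the scraper.
class MiniError < StandardError
  attr_reader :scraper

  def initialize(scraper)
    @scraper = scraper
    super("failed to scrape: #{scraper.errors.keys.join(', ')}")
  end
end

# Hypothetical stand-in that mimics the collect-then-raise flow of #initialize.
class MiniScraper
  # Field name => extraction lambda (illustrative only).
  FIELDS = {
    title:  ->(src) { src.fetch(:title) },
    author: ->(src) { src.fetch(:author) },
    date:   ->(src) { src.fetch(:date) },
  }

  attr_reader :source, :errors

  def initialize(source)
    @source = source
    @scraped = {}
    @errors = {}

    # Try every field; remember failures instead of stopping at the first one.
    FIELDS.each do |field, extract|
      begin
        @scraped[field] = extract.call(source)
      rescue IndexError => e
        @errors[field] = e
      end
    end

    raise MiniError.new(self) unless @errors.empty?
  end
end

begin
  MiniScraper.new({ title: "Hello" })
rescue MiniError => e
  failed_fields = e.scraper.errors.keys
end
```

Here both missing fields end up in `failed_fields`, even though the first failure happened before the last field was tried.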

Instance Attribute Details

#errors ⇒ Hash{Symbol => StandardError} (readonly)

Collected errors raised during #initialize by scrapes blocks, indexed by field name. This Hash will be empty if #initialize did not raise a Grubby::Scraper::Error.

Returns:

  • (Hash{Symbol => StandardError})


# File 'lib/grubby/scraper.rb', line 223

def errors
  @errors
end

#source ⇒ Object (readonly)

The object being scraped. Typically an instance of a Mechanize pluggable parser such as Mechanize::Page.

Returns:

  • (Object)


# File 'lib/grubby/scraper.rb', line 216

def source
  @source
end

Class Method Details

.each(start, agent = $grubby, next_method: :next) {|scraper| ... } ⇒ void

This method returns an undefined value.

Iterates a series of pages, starting at start. The Scraper class is instantiated with each page, and each Scraper instance is passed to the given block. Subsequent pages in the series are determined by invoking the next_method method on each Scraper instance.

Iteration stops when the method named by next_method returns a falsy value. If it returns a String or URI, that value is treated as the URL of the next page and is fetched with agent. Any other value is treated as the page itself. When no block is given, an Enumerator is returned instead.

Examples:

Iterate from page object

class PostsIndexScraper < Grubby::PageScraper
  def next
    page.link_with(text: "Next >")&.click
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1") do |scraper|
  scraper.page.uri.query  # == "page=1", "page=2", "page=3", ...
end

Iterate from URI

class PostsIndexScraper < Grubby::PageScraper
  def next
    page.link_with(text: "Next >")&.to_absolute_uri
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1") do |scraper|
  scraper.page.uri.query  # == "page=1", "page=2", "page=3", ...
end

Specifying the iteration method

class PostsIndexScraper < Grubby::PageScraper
  scrapes(:next_uri, optional: true) do
    page.link_with(text: "Next >")&.to_absolute_uri
  end
end

PostsIndexScraper.each("https://example.com/posts?page=1", next_method: :next_uri) do |scraper|
  scraper.page.uri.query  # == "page=1", "page=2", "page=3", ...
end

Parameters:

  • start (String, URI, Mechanize::Page, Mechanize::File)
  • agent (Mechanize) (defaults to: $grubby)
  • next_method (Symbol) (defaults to: :next)

Yield Parameters:

  • scraper (Grubby::Scraper)

Raises:

  • (NoMethodError)

    if the Scraper class does not define the method indicated by next_method

  • (Grubby::Scraper::Error)

    if any scrapes blocks fail



# File 'lib/grubby/scraper.rb', line 196

def self.each(start, agent = $grubby, next_method: :next)
  unless self.method_defined?(next_method)
    raise NoMethodError.new(nil, next_method), "#{self} does not define `#{next_method}`"
  end

  return to_enum(:each, start, agent, next_method: next_method) unless block_given?

  current = start
  while current
    current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
    scraper = self.new(current)
    yield scraper
    current = scraper.send(next_method)
  end
end
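
The loop in the listing above can be exercised without network access. The sketch below reproduces the same control flow in plain Ruby against an in-memory agent. `FakePage`, `FakeAgent`, and `each_page` are hypothetical names, not part of Grubby or Mechanize, and the sketch yields pages directly instead of wrapping them in a Scraper.

```ruby
require "uri"

# Hypothetical page: a URL plus the URL of the following page (nil on the last one).
FakePage = Struct.new(:url, :next_url)

# Hypothetical agent that "fetches" pages from a Hash instead of the network.
class FakeAgent
  def initialize(pages)
    @pages = pages
  end

  def get(url)
    @pages.fetch(url.to_s)
  end
end

# Same control flow as the listing: Strings and URIs are fetched, any other
# truthy value is used as the page itself, and a falsy value ends the loop.
def each_page(start, agent)
  current = start
  while current
    current = agent.get(current) if current.is_a?(String) || current.is_a?(URI)
    yield current
    current = current.next_url
  end
end

first_url  = "https://example.com/posts?page=1"
second_url = "https://example.com/posts?page=2"

agent = FakeAgent.new({
  first_url  => FakePage.new(first_url, second_url),
  second_url => FakePage.new(second_url, nil),
})

visited = []
each_page(first_url, agent) { |page| visited << page.url }
```

Starting from a `URI` object or from a `FakePage` instance takes the other branches of the same loop.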

.fields ⇒ Array<Symbol>

Names of the fields defined via scrapes, including fields inherited from superclasses.

Returns:

  • (Array<Symbol>)


# File 'lib/grubby/scraper.rb', line 105

def self.fields
  @fields ||= self == Grubby::Scraper ? [] : self.superclass.fields.dup
end
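
The memoization in the listing gives each class its own Array, seeded with a copy of its superclass's fields. A minimal plain-Ruby sketch of that behavior follows. `Base`, `Parent`, `Child`, `Sibling`, and the `field` helper are hypothetical stand-ins; `field` only registers a name and does not define a reader as scrapes does.

```ruby
# Hypothetical base class that copies the memoization shown in the listing.
class Base
  def self.fields
    @fields ||= self == Base ? [] : superclass.fields.dup
  end

  # Simplified registration: records the name only.
  def self.field(name)
    (fields << name.to_sym).uniq!
  end
end

class Parent < Base
  field :title
end

class Child < Parent
  field :author
end

class Sibling < Parent
  field :title # registering an inherited name again changes nothing
end

parent_fields  = Parent.fields
child_fields   = Child.fields
sibling_fields = Sibling.fields
```

Because the copy is taken the first time a subclass calls `fields`, names added to a superclass after that point would not appear in the subclass's list in this sketch.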

.scrape(url, agent = $grubby) ⇒ Grubby::Scraper

Instantiates the Scraper class with the resource indicated by url. This method acts as a default factory method, and provides a standard interface for overrides.

Examples:

Default factory method

class PostPageScraper < Grubby::PageScraper
  # ...
end

PostPageScraper.scrape("https://example.com/posts/42")
  # == PostPageScraper.new($grubby.get("https://example.com/posts/42"))

Override factory method

class PostApiScraper < Grubby::JsonScraper
  # ...

  def self.scrape(url, agent = $grubby)
    api_url = url.to_s.sub(%r"//example.com/(.+)", '//api.example.com/\1.json')
    super(api_url, agent)
  end
end

PostApiScraper.scrape("https://example.com/posts/42")
  # == PostApiScraper.new($grubby.get("https://api.example.com/posts/42.json"))

Parameters:

  • url (String, URI)
  • agent (Mechanize) (defaults to: $grubby)

Returns:

  • (Grubby::Scraper)

Raises:

  • (Grubby::Scraper::Error)

    if any scrapes blocks fail


# File 'lib/grubby/scraper.rb', line 139

def self.scrape(url, agent = $grubby)
  self.new(agent.get(url))
end
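
The factory override pattern can be tried offline with a stub agent. The sketch below assumes nothing about Grubby beyond the one-line listing above. `StubAgent`, `SketchScraper`, and `ApiSketchScraper` are hypothetical names, and the agent argument is required here rather than defaulting to `$grubby`.

```ruby
# Hypothetical agent: returns a marker String instead of performing a request.
class StubAgent
  def get(url)
    "fetched:#{url}"
  end
end

# Hypothetical stand-in for the base class and its default factory method.
class SketchScraper
  attr_reader :source

  def initialize(source)
    @source = source
  end

  def self.scrape(url, agent)
    new(agent.get(url))
  end
end

# Override that rewrites the public URL to an API URL before delegating.
class ApiSketchScraper < SketchScraper
  def self.scrape(url, agent)
    api_url = url.to_s.sub(%r"//example.com/(.+)", '//api.example.com/\1.json')
    super(api_url, agent)
  end
end

scraper = ApiSketchScraper.scrape("https://example.com/posts/42", StubAgent.new)
fetched = scraper.source
```

Callers keep passing the public URL, and the subclass decides which resource is actually requested.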

.scrapes(field, **options, &block) ⇒ void

This method returns an undefined value.

Defines an attribute reader method named by field. During #initialize, the given block is called, and the attribute is set to the block’s return value.

By default, raises an exception if the block’s return value is nil. To prevent this behavior, set the :optional option to true. Alternatively, the block can be conditionally evaluated, based on another method’s return value, using the :if or :unless options.

Examples:

Default behavior

class GreetingScraper < Grubby::Scraper
  scrapes(:name) do
    source[/Hello (\w+)/, 1]
  end
end

scraper = GreetingScraper.new("Hello World!")
scraper.name  # == "World"

scraper = GreetingScraper.new("Hello!")  # raises Grubby::Scraper::Error

Optional scraped value

class GreetingScraper < Grubby::Scraper
  scrapes(:name, optional: true) do
    source[/Hello (\w+)/, 1]
  end
end

scraper = GreetingScraper.new("Hello World!")
scraper.name  # == "World"

scraper = GreetingScraper.new("Hello!")
scraper.name  # == nil

Conditional scraped value

class GreetingScraper < Grubby::Scraper
  def hello?
    source.start_with?("Hello ")
  end

  scrapes(:name, if: :hello?) do
    source[/Hello (\w+)/, 1]
  end
end

scraper = GreetingScraper.new("Hello World!")
scraper.name  # == "World"

scraper = GreetingScraper.new("Hello!")  # raises Grubby::Scraper::Error

scraper = GreetingScraper.new("How are you?")
scraper.name  # == nil

Parameters:

  • field (Symbol, String)
  • options (Hash)

Options Hash (**options):

  • :optional (Boolean) — default: false

    Whether the block should be allowed to return a nil value

  • :if (Symbol) — default: nil

    Name of predicate method that determines if the block should be evaluated

  • :unless (Symbol) — default: nil

    Name of predicate method that determines if the block should not be evaluated

Yield Returns:

  • (Object)


# File 'lib/grubby/scraper.rb', line 68

def self.scrapes(field, **options, &block)
  field = field.to_sym
  (self.fields << field).uniq!

  define_method(field) do
    raise "#{self.class}#initialize does not invoke `super`" unless defined?(@scraped)

    if !@scraped.key?(field) && !@errors.key?(field)
      begin
        skip = (options[:if] && !self.send(options[:if])) ||
          (options[:unless] && self.send(options[:unless]))

        if skip
          @scraped[field] = nil
        else
          @scraped[field] = instance_eval(&block)
          if @scraped[field].nil?
            raise FieldValueRequiredError.new(field) unless options[:optional]
            $log.debug("#{self.class}##{field} is nil")
          end
        end
      rescue RuntimeError, IndexError => e
        @errors[field] = e
      end
    end

    if @errors.key?(field)
      raise FieldScrapeFailedError.new(field, @errors[field])
    else
      @scraped[field]
    end
  end
end
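
The examples above cover :optional and :if but not :unless, and they do not show that a field's block runs at most once per instance. The sketch below illustrates both with a simplified stand-in for the generated reader. `FieldSketch`, `GreetingSketch`, `anonymous?`, and `evaluations` are hypothetical names, and the stand-in omits the error collection and the eager evaluation that the real #initialize performs.

```ruby
# Hypothetical, simplified stand-in for the reader that scrapes defines.
class FieldSketch
  attr_reader :source

  def initialize(source)
    @source = source
    @scraped = {}
  end

  def self.scrapes(field, **options, &block)
    field = field.to_sym

    define_method(field) do
      unless @scraped.key?(field)
        # Same skip rule as the listing: :if must hold and :unless must not.
        skip = (options[:if] && !send(options[:if])) ||
          (options[:unless] && send(options[:unless]))

        value = skip ? nil : instance_eval(&block)
        raise "#{field} is required" if value.nil? && !skip && !options[:optional]
        @scraped[field] = value
      end

      @scraped[field]
    end
  end
end

class GreetingSketch < FieldSketch
  attr_reader :evaluations

  def anonymous?
    !source.start_with?("Hello ")
  end

  scrapes(:name, unless: :anonymous?) do
    @evaluations = @evaluations.to_i + 1 # count how often the block runs
    source[/Hello (\w+)/, 1]
  end
end

named = GreetingSketch.new("Hello World!")
first_read  = named.name
second_read = named.name # served from the cache, block is not re-run

anonymous = GreetingSketch.new("How are you?")
skipped_value = anonymous.name # block is skipped, value is nil
```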

Instance Method Details

#[](field) ⇒ Object

Returns the scraped value named by field.

Parameters:

  • field (Symbol, String)

Returns:

  • (Object)

Raises:

  • (KeyError)

    if field is not a valid name



# File 'lib/grubby/scraper.rb', line 249

def [](field)
  @scraped.fetch(field.to_sym)
end
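
Since the listing calls to_sym and then Hash#fetch, String and Symbol names are interchangeable, and an unknown name raises instead of returning nil. A small plain-Ruby sketch of that lookup, using a local Hash as a stand-in for the scraper's internal state:

```ruby
# Stand-in for the scraper's internal @scraped Hash.
scraped = { name: "World" }

by_string = scraped.fetch("name".to_sym)
by_symbol = scraped.fetch(:name)

begin
  scraped.fetch(:missing)
rescue KeyError => e
  missing_error = e
end
```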

#to_h ⇒ Hash{Symbol => Object}

Returns all scraped values as a Hash.

Returns:

  • (Hash{Symbol => Object})


# File 'lib/grubby/scraper.rb', line 256

def to_h
  @scraped.dup
end
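
The listing returns a dup of the internal Hash, which is a shallow copy. The plain-Ruby sketch below shows what that implies for callers, again using a local Hash as a stand-in for the scraper's internal state:

```ruby
# Stand-in for the scraper's internal @scraped Hash.
scraped = { name: "World", tags: ["a"] }
copy = scraped.dup

copy[:name] = "Changed" # reassigning a key affects only the copy
copy[:tags] << "b"      # the Array is shared, so this is visible in both
```

Reassigning keys on the returned Hash is safe, while mutating a scraped value in place would also change what the scraper holds.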