Class: ScraperUtils::HostThrottler

Inherits:
Object
  • Object
show all
Defined in:
lib/scraper_utils/host_throttler.rb

Overview

Tracks per-host next-allowed-request time so that time spent parsing and saving records counts toward the crawl delay rather than being added on top of it.

Usage:

throttler = HostThrottler.new(crawl_delay: 1.0, max_load: 50.0)
throttler.before_request(hostname)   # sleep until ready
# ... make request ...
throttler.after_request(hostname)    # record timing, schedule next slot
throttler.after_request(hostname, overloaded: true)  # double delay + 5s

Constant Summary collapse

MAX_DELAY =
120.0

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(crawl_delay: 0.0, max_load: nil) ⇒ HostThrottler

Returns a new instance of HostThrottler.

Parameters:

  • crawl_delay (Float) (defaults to: 0.0)

    minimum seconds between requests per host

  • max_load (Float) (defaults to: nil)

    target server load percentage (10..100); 50 means response_time == pause_time



20
21
22
23
24
25
26
# File 'lib/scraper_utils/host_throttler.rb', line 20

def initialize(crawl_delay: 0.0, max_load: nil)
  @crawl_delay = crawl_delay.to_f
  # Clamp between 10 (delay 9x response) and 100 (no extra delay)
  @max_load = max_load ? max_load.to_f.clamp(10.0, 100.0) : nil
  @next_request_at = {}   # hostname => Time
  @request_started_at = {} # hostname => Time
end

Class Method Details

.overload_error?(error) ⇒ Boolean

Duck-type check for HTTP overload errors across Mechanize, HTTParty, etc.

Parameters:

  • error (Exception)

Returns:

  • (Boolean)


77
78
79
80
81
82
83
84
# File 'lib/scraper_utils/host_throttler.rb', line 77

def self.overload_error?(error)
  code = if error.respond_to?(:response) && error.response.respond_to?(:code)
           error.response.code.to_i          # HTTParty style
         elsif error.respond_to?(:response_code)
           error.response_code.to_i          # Mechanize style
         end
  [429, 500, 503].include?(code)
end

Instance Method Details

#after_request(hostname, overloaded: false) ⇒ void

This method returns an undefined value.

Calculate and store the next allowed request time for this host.

Parameters:

  • hostname (String)
  • overloaded (Boolean) (defaults to: false)

    true when the server signalled overload (HTTP 429/500/503); doubles the normal delay and adds 5 seconds.



50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
# File 'lib/scraper_utils/host_throttler.rb', line 50

def after_request(hostname, overloaded: false)
  started = @request_started_at[hostname] || Time.now
  response_time = Time.now - started

  delay = @crawl_delay
  if @max_load
    delay += (100.0 - @max_load) * response_time / @max_load
  end

  if overloaded
    delay = delay + response_time * 2 + 5.0
  end

  delay = delay.round(3).clamp(0.0, MAX_DELAY)
  @next_request_at[hostname] = Time.now + delay

  if DebugUtils.basic?
    msg = "HostThrottler: #{hostname} response=#{response_time.round(3)}s"
    msg += " OVERLOADED" if overloaded
    msg += ", Will delay #{delay}s before next request"
    LogUtils.log(msg)
  end
end

#before_request(hostname) ⇒ void

This method returns an undefined value.

Sleep until this host’s throttle window has elapsed. Records when the request actually started.

Parameters:

  • hostname (String)


36
37
38
39
40
41
42
43
# File 'lib/scraper_utils/host_throttler.rb', line 36

def before_request(hostname)
  target = @next_request_at[hostname]
  if target
    remaining = target - Time.now
    sleep(remaining) if remaining > 0
  end
  @request_started_at[hostname] = Time.now
end

#will_pause_till(hostname) ⇒ Object



28
29
30
# File 'lib/scraper_utils/host_throttler.rb', line 28

def will_pause_till(hostname)
  @next_request_at[hostname]
end