Class: ScraperUtils::HostThrottler
- Inherits:
-
Object
- Object
- ScraperUtils::HostThrottler
- Defined in:
- lib/scraper_utils/host_throttler.rb
Overview
Tracks per-host next-allowed-request time so that time spent parsing and saving records counts toward the crawl delay rather than being added on top of it.
Usage:
throttler = HostThrottler.new(crawl_delay: 1.0, max_load: 50.0)
throttler.before_request(hostname) # sleep until ready
# ... make request ...
throttler.after_request(hostname) # record timing, schedule next slot
throttler.after_request(hostname, overloaded: true) # double delay + 5s
Constant Summary collapse
- MAX_DELAY =
120.0
Class Method Summary collapse
-
.overload_error?(error) ⇒ Boolean
Duck-type check for HTTP overload errors across Mechanize, HTTParty, etc.
Instance Method Summary collapse
-
#after_request(hostname, overloaded: false) ⇒ void
Calculate and store the next allowed request time for this host.
-
#before_request(hostname) ⇒ void
Sleep until this host’s throttle window has elapsed.
-
#initialize(crawl_delay: 0.0, max_load: nil) ⇒ HostThrottler
constructor
A new instance of HostThrottler.
- #will_pause_till(hostname) ⇒ Object
Constructor Details
#initialize(crawl_delay: 0.0, max_load: nil) ⇒ HostThrottler
Returns a new instance of HostThrottler.
20 21 22 23 24 25 26 |
# File 'lib/scraper_utils/host_throttler.rb', line 20 def initialize(crawl_delay: 0.0, max_load: nil) @crawl_delay = crawl_delay.to_f # Clamp between 10 (delay 9x response) and 100 (no extra delay) @max_load = max_load ? max_load.to_f.clamp(10.0, 100.0) : nil @next_request_at = {} # hostname => Time @request_started_at = {} # hostname => Time end |
Class Method Details
.overload_error?(error) ⇒ Boolean
Duck-type check for HTTP overload errors across Mechanize, HTTParty, etc.
77 78 79 80 81 82 83 84 |
# File 'lib/scraper_utils/host_throttler.rb', line 77 def self.overload_error?(error) code = if error.respond_to?(:response) && error.response.respond_to?(:code) error.response.code.to_i # HTTParty style elsif error.respond_to?(:response_code) error.response_code.to_i # Mechanize style end [429, 500, 503].include?(code) end |
Instance Method Details
#after_request(hostname, overloaded: false) ⇒ void
This method returns an undefined value.
Calculate and store the next allowed request time for this host.
50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 |
# File 'lib/scraper_utils/host_throttler.rb', line 50 def after_request(hostname, overloaded: false) started = @request_started_at[hostname] || Time.now response_time = Time.now - started delay = @crawl_delay if @max_load delay += (100.0 - @max_load) * response_time / @max_load end if overloaded delay = delay + response_time * 2 + 5.0 end delay = delay.round(3).clamp(0.0, MAX_DELAY) @next_request_at[hostname] = Time.now + delay if DebugUtils.basic? msg = "HostThrottler: #{hostname} response=#{response_time.round(3)}s" msg += " OVERLOADED" if overloaded msg += ", Will delay #{delay}s before next request" LogUtils.log(msg) end end |
#before_request(hostname) ⇒ void
This method returns an undefined value.
Sleep until this host’s throttle window has elapsed. Records when the request actually started.
36 37 38 39 40 41 42 43 |
# File 'lib/scraper_utils/host_throttler.rb', line 36 def before_request(hostname) target = @next_request_at[hostname] if target remaining = target - Time.now sleep(remaining) if remaining > 0 end @request_started_at[hostname] = Time.now end |
#will_pause_till(hostname) ⇒ Object
28 29 30 |
# File 'lib/scraper_utils/host_throttler.rb', line 28 def will_pause_till(hostname) @next_request_at[hostname] end |