Class: ScraperUtils::MechanizeUtils::AgentConfig

Inherits:
Object
Defined in:
lib/scraper_utils/mechanize_utils/agent_config.rb

Overview

Configuration for a Mechanize agent with sensible defaults and configurable settings. Supports global configuration through AgentConfig.configure and per-instance overrides.

Examples:

Setting global defaults

ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
  config.default_timeout = 500
end

Creating an instance with defaults

config = ScraperUtils::MechanizeUtils::AgentConfig.new

Overriding specific settings

config = ScraperUtils::MechanizeUtils::AgentConfig.new(
  timeout: 120,
)

Constant Summary

DEFAULT_TIMEOUT = 60
DEFAULT_CRAWL_DELAY = 0.5
DEFAULT_MAX_LOAD = 50.0

Constructor Details

#initialize(timeout: nil, compliant_mode: nil, max_load: nil, crawl_delay: nil, disable_ssl_certificate_check: nil, australian_proxy: nil, user_agent: nil) ⇒ AgentConfig

Creates a Mechanize agent configuration. Any argument left as nil falls back to the class-level default, which can be changed via configure.

Parameters:

  • timeout (Integer, nil) (defaults to: nil)

    Timeout for agent connections (default: 60)

  • disable_ssl_certificate_check (Boolean, nil) (defaults to: nil)

    Skip SSL verification (default: false)

  • australian_proxy (Boolean, nil) (defaults to: nil)

    Use proxy if available (default: false)

  • user_agent (String, nil) (defaults to: nil)

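    Mechanize user agent string. Note that the constructor body shown below later reassigns @user_agent from MORPH_USER_AGENT (with TODAY replaced by the current date) or a built-in ScraperUtils string, so in this listing neither this argument nor default_user_agent reaches the final value.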

  • crawl_delay (Float, nil) (defaults to: nil)

    Crawl delay between requests in seconds (default: 0.5)

  • max_load (Float, nil) (defaults to: nil)

    Max load presented to an external server as a percentage, clamped to the range 10 to 100 (default: 50.0)

  • compliant_mode (Boolean, nil) (defaults to: nil)

    Accepted by the signature but not referenced in the constructor body shown below



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 87

def initialize(timeout: nil,
               compliant_mode: nil,
               max_load: nil,
               crawl_delay: nil,
               disable_ssl_certificate_check: nil,
               australian_proxy: nil,
               user_agent: nil)
  @timeout = timeout.nil? ? self.class.default_timeout : timeout
  @user_agent = user_agent.nil? ? self.class.default_user_agent : user_agent

  @disable_ssl_certificate_check = if disable_ssl_certificate_check.nil?
                                     self.class.default_disable_ssl_certificate_check
                                   else
                                     disable_ssl_certificate_check
                                   end
  @australian_proxy = if australian_proxy.nil?
                        self.class.default_australian_proxy
                      else
                        australian_proxy
                      end
  @crawl_delay = crawl_delay.nil? ? self.class.default_crawl_delay : crawl_delay.to_f
  # Clamp between 10 (delay 9 x response) and 100 (no delay)
  @max_load = (max_load.nil? ? self.class.default_max_load : max_load).to_f.clamp(10.0, 100.0)
  @throttler = HostThrottler.new(crawl_delay: @crawl_delay, max_load: @max_load)

  # Validate proxy URL format if proxy will be used
  @australian_proxy &&= !ScraperUtils.australian_proxy.to_s.empty?
  if @australian_proxy
    uri = begin
            URI.parse(ScraperUtils.australian_proxy.to_s)
          rescue URI::InvalidURIError => e
            raise URI::InvalidURIError, "Invalid proxy URL format: #{e}"
          end
    unless uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
      raise URI::InvalidURIError, "Proxy URL must start with http:// or https://"
    end
    unless !uri.host.to_s.empty? && uri.port&.positive?
      raise URI::InvalidURIError, "Proxy URL must include host and port"
    end
  end

  today = Date.today.strftime("%Y-%m-%d")
  @user_agent = ENV.fetch("MORPH_USER_AGENT", nil)&.sub("TODAY", today)
  version = ScraperUtils::VERSION
  @user_agent ||= "Mozilla/5.0 (compatible; ScraperUtils/#{version} #{today}; +https://github.com/ianheggie-oaf/scraper_utils)"

  display_options
end
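
The proxy checks in the constructor can be tried in isolation. The following is a minimal sketch, assuming a hypothetical helper named validate_proxy_url that takes the URL as an argument instead of reading ScraperUtils.australian_proxy; it mirrors the three checks in the listing above and uses only Ruby's standard uri library.

```ruby
require "uri"

# Hypothetical helper (not part of ScraperUtils): applies the same three
# checks as the constructor and returns the parsed URI on success.
def validate_proxy_url(proxy_url)
  uri = begin
    URI.parse(proxy_url.to_s)
  rescue URI::InvalidURIError => e
    raise URI::InvalidURIError, "Invalid proxy URL format: #{e}"
  end
  # URI::HTTPS is a subclass of URI::HTTP; both are listed to match the listing.
  unless uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
    raise URI::InvalidURIError, "Proxy URL must start with http:// or https://"
  end
  unless !uri.host.to_s.empty? && uri.port&.positive?
    raise URI::InvalidURIError, "Proxy URL must include host and port"
  end
  uri
end

# Example inputs are made up for illustration.
candidates = [
  "http://proxy.example.com:8080",
  "socks5://proxy.example.com:1080",
  "not a url"
]
candidates.each do |candidate|
  begin
    uri = validate_proxy_url(candidate)
    puts "#{candidate}: ok (#{uri.host}:#{uri.port})"
  rescue URI::InvalidURIError => e
    puts "#{candidate}: #{e.message}"
  end
end
```

Because all three failures raise URI::InvalidURIError, a caller only needs one rescue clause to handle a misconfigured proxy.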

Class Attribute Details

.default_australian_proxy ⇒ Boolean

Returns Default flag for Australian proxy preference.

Returns:

  • (Boolean)

    Default flag for Australian proxy preference



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 38

def default_australian_proxy
  @default_australian_proxy
end

.default_crawl_delay ⇒ Float?

Returns the default crawl delay between requests, in seconds.

Returns:

  • (Float, nil)

    Default crawl delay between requests, in seconds



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 44

def default_crawl_delay
  @default_crawl_delay
end

.default_disable_ssl_certificate_check ⇒ Boolean

Returns Default setting for SSL certificate verification.

Returns:

  • (Boolean)

    Default setting for SSL certificate verification



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 35

def default_disable_ssl_certificate_check
  @default_disable_ssl_certificate_check
end

.default_max_load ⇒ Float?

A value of 50 results in a pause the same length as the response time (i.e. 50% of the total time is spent on the response and 50% pausing).

Returns:

  • (Float, nil)

    Default maximum load presented to an external server, as a percentage



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 48

def default_max_load
  @default_max_load
end
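
To make the percentage concrete, here is a small arithmetic sketch. The formula is an assumption inferred from the documented data points (10 gives a delay of 9 times the response, 50 gives a pause equal to the response, 100 gives no delay); the actual HostThrottler implementation is not shown on this page and may differ. The helper name pause_for is hypothetical.

```ruby
# Assumed relationship: the response should take up max_load percent of the
# total time, so pause = response_time * (100 - max_load) / max_load.
def pause_for(response_time, max_load)
  clamped = max_load.to_f.clamp(10.0, 100.0) # same clamp as the constructor
  response_time * (100.0 - clamped) / clamped
end

[10, 50, 100, 5].each do |max_load|
  pause = pause_for(2.0, max_load)
  puts format("max_load=%-3d pause after a 2.0s response: %.1fs", max_load, pause)
end
```

Under this assumption, values below 10 behave like 10 because of the clamp, which caps the pause at nine times the response time.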

.default_timeout ⇒ Integer

Returns Default timeout in seconds for agent connections.

Returns:

  • (Integer)

    Default timeout in seconds for agent connections



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 32

def default_timeout
  @default_timeout
end

.default_user_agent ⇒ String?

Returns Default Mechanize user agent.

Returns:

  • (String, nil)

    Default Mechanize user agent



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 41

def default_user_agent
  @default_user_agent
end

Instance Attribute Details

#crawl_delay ⇒ Object (readonly)

Give access for testing



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 80

def crawl_delay
  @crawl_delay
end

#max_load ⇒ Object (readonly)

Give access for testing



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 80

def max_load
  @max_load
end

#throttler ⇒ Object (readonly)

Give access for testing



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 80

def throttler
  @throttler
end

#user_agent ⇒ String (readonly)

Returns User agent string.

Returns:

  • (String)

    User agent string



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 77

def user_agent
  @user_agent
end
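
The constructor builds the user agent string by replacing the literal text TODAY in MORPH_USER_AGENT with the current date, or by falling back to a ScraperUtils identifier. The following sketch isolates that step, assuming a hypothetical helper build_user_agent whose template argument stands in for the environment variable and whose version argument stands in for ScraperUtils::VERSION.

```ruby
require "date"

# Hypothetical helper (not part of ScraperUtils). `template` may be nil,
# in which case the ScraperUtils-style fallback string is returned.
def build_user_agent(template, version:, today: Date.today)
  stamp = today.strftime("%Y-%m-%d")
  # String#sub replaces only the first occurrence of "TODAY".
  custom = template&.sub("TODAY", stamp)
  custom || "Mozilla/5.0 (compatible; ScraperUtils/#{version} #{stamp}; " \
            "+https://github.com/ianheggie-oaf/scraper_utils)"
end

# The version and template values here are placeholders for illustration.
puts build_user_agent("ExampleBot/1.0 (TODAY)", version: "0.0.0")
puts build_user_agent(nil, version: "0.0.0")
```

Passing the date in as a keyword argument keeps the helper deterministic in tests, which the constructor's direct call to Date.today does not allow.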

Class Method Details

.configure {|self| ... } ⇒ void

This method returns an undefined value.

Configure default settings for all AgentConfig instances

Examples:

AgentConfig.configure do |config|
  config.default_timeout = 300
end

Yields:

  • (self)

    Yields self for configuration



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 57

def configure
  yield self if block_given?
end
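
Because configure yields the class itself, the block assigns class-level defaults that later instances pick up whenever an argument is nil. The following self-contained sketch shows the pattern with a hypothetical stand-in class, SketchConfig, that models only the timeout setting; it is not the library's code.

```ruby
# Hypothetical stand-in for AgentConfig; only the timeout setting is modelled.
class SketchConfig
  DEFAULT_TIMEOUT = 60

  class << self
    attr_accessor :default_timeout

    # Yields the class itself so the block can assign class-level defaults.
    def configure
      yield self if block_given?
    end

    def reset_defaults!
      @default_timeout = DEFAULT_TIMEOUT
    end
  end

  reset_defaults! # establish the defaults when the class is loaded

  attr_reader :timeout

  # nil means "use whatever the class-level default currently is".
  def initialize(timeout: nil)
    @timeout = timeout.nil? ? self.class.default_timeout : timeout
  end
end

SketchConfig.configure { |config| config.default_timeout = 300 }
with_default = SketchConfig.new                 # picks up the configured default
with_override = SketchConfig.new(timeout: 120)  # per-instance value wins
puts with_default.timeout
puts with_override.timeout
```

The default is read at construction time, so instances created before a configure call keep the value they were built with.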

.reset_defaults! ⇒ void

This method returns an undefined value.

Reset all configuration options to their default values



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 63

def reset_defaults!
  @default_timeout = ENV.fetch('MORPH_CLIENT_TIMEOUT', DEFAULT_TIMEOUT).to_i # 60
  @default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
  @default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
  @default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
  @default_crawl_delay = ENV.fetch('MORPH_CLIENT_CRAWL_DELAY', DEFAULT_CRAWL_DELAY)
  @default_max_load = ENV.fetch('MORPH_MAX_LOAD', DEFAULT_MAX_LOAD)
end
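
Two details of the listing above are easy to miss: the boolean flags turn on for any non-empty value (including "false" or "0"), and ENV.fetch returns a String when the variable is set but the Float constant when it is not. The sketch below illustrates both with a hypothetical helper, defaults_from, that takes a plain Hash in place of ENV. Its explicit to_f on the crawl delay is an addition for illustration, not something the listing does.

```ruby
DEFAULT_TIMEOUT = 60
DEFAULT_CRAWL_DELAY = 0.5

# Hypothetical helper (not part of ScraperUtils). `env` is any Hash-like
# object, so a plain Hash can stand in for ENV when experimenting.
def defaults_from(env)
  {
    timeout: env.fetch("MORPH_CLIENT_TIMEOUT", DEFAULT_TIMEOUT).to_i,
    # Any non-empty value switches the flag on, including "false" or "0".
    use_proxy: !env.fetch("MORPH_USE_PROXY", nil).to_s.empty?,
    # Environment values are Strings; convert explicitly to get a Float.
    crawl_delay: env.fetch("MORPH_CLIENT_CRAWL_DELAY", DEFAULT_CRAWL_DELAY).to_f
  }
end

p defaults_from({})
p defaults_from(
  "MORPH_CLIENT_TIMEOUT" => "90",
  "MORPH_USE_PROXY" => "false",
  "MORPH_CLIENT_CRAWL_DELAY" => "1.5"
)
```

To turn a flag off in this scheme, the variable has to be unset or empty rather than set to a falsy-looking word.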

Instance Method Details

#configure_agent(agent) ⇒ void

This method returns an undefined value.

Configures a Mechanize agent with these settings

Parameters:

  • agent (Mechanize)

    The agent to configure



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 139

def configure_agent(agent)
  agent.verify_mode = OpenSSL::SSL::VERIFY_NONE if @disable_ssl_certificate_check

  if @timeout
    agent.open_timeout = @timeout
    agent.read_timeout = @timeout
  end
  agent.user_agent = user_agent
  agent.request_headers ||= {}
  agent.request_headers["Accept"] =
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  agent.request_headers["Upgrade-Insecure-Requests"] = "1"
  if @australian_proxy
    agent.agent.set_proxy(ScraperUtils.australian_proxy)
    agent.request_headers["Accept-Language"] = "en-AU,en-US;q=0.9,en;q=0.8"
    verify_proxy_works(agent)
  end

  agent.pre_connect_hooks << method(:pre_connect_hook)
  agent.post_connect_hooks << method(:post_connect_hook)
  agent.error_hooks << method(:error_hook) if agent.respond_to?(:error_hooks)
end
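
The last three lines of configure_agent register bound Method objects as hooks, so the agent can call back into the config's private methods later. The sketch below shows that mechanism with stand-ins only: FakeAgent and HookOwner are hypothetical, and the hook's argument list is simplified for illustration rather than copied from Mechanize.

```ruby
# Stand-in for a Mechanize agent: just the one hook list this sketch needs.
FakeAgent = Struct.new(:pre_connect_hooks) do
  # Calls every registered hook, as an agent would before sending a request.
  def connect(request)
    pre_connect_hooks.each { |hook| hook.call(self, request) }
  end
end

# Stand-in for the config object that owns the hook method.
class HookOwner
  attr_reader :seen

  def initialize
    @seen = []
  end

  def attach(agent)
    # method(:name) returns a bound Method object; it works for private
    # methods and keeps access to this instance's state (@seen).
    agent.pre_connect_hooks << method(:pre_connect_hook)
  end

  private

  def pre_connect_hook(_agent, request)
    @seen << request
  end
end

owner = HookOwner.new
agent = FakeAgent.new([])
owner.attach(agent)
agent.connect("GET /first")
agent.connect("GET /second")
p owner.seen
```

Registering a Method object rather than a block keeps the hook logic in a named method that can be tested directly.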