Class: ScraperUtils::MechanizeUtils::AgentConfig

Inherits:
Object
Defined in:
lib/scraper_utils/mechanize_utils/agent_config.rb

Overview

Configuration for a Mechanize agent with sensible defaults and configurable settings. Supports global configuration through AgentConfig.configure and per-instance overrides.

Examples:

Setting global defaults

ScraperUtils::MechanizeUtils::AgentConfig.configure do |config|
  config.default_timeout = 500
end

Creating an instance with defaults

config = ScraperUtils::MechanizeUtils::AgentConfig.new

Overriding specific settings

config = ScraperUtils::MechanizeUtils::AgentConfig.new(
  timeout: 120,
)
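
The precedence shown in these examples (an explicit argument wins, otherwise the class-level default applies) can be sketched with a small stand-in class. `SketchConfig` is hypothetical and is not part of ScraperUtils; it only mirrors the nil-fallback pattern visible in the constructor source further down.

```ruby
# Hypothetical stand-in that mirrors the "class default, instance override" pattern.
class SketchConfig
  DEFAULT_TIMEOUT = 60

  class << self
    attr_accessor :default_timeout

    # Yield the class itself so callers can assign class-level defaults.
    def configure
      yield self if block_given?
    end

    # Restore the built-in default.
    def reset_defaults!
      @default_timeout = DEFAULT_TIMEOUT
    end
  end

  attr_reader :timeout

  def initialize(timeout: nil)
    # nil means "not specified", so fall back to the class-level default.
    @timeout = timeout.nil? ? self.class.default_timeout : timeout
  end
end

SketchConfig.reset_defaults!
SketchConfig.configure { |config| config.default_timeout = 500 }

SketchConfig.new.timeout               # picks up the configured default
SketchConfig.new(timeout: 120).timeout # explicit argument takes precedence
```

Because the defaults live on the class, a change made through `configure` affects every instance created afterwards; instances created earlier keep the value they captured.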

Constant Summary

DEFAULT_TIMEOUT = 60
DEFAULT_CRAWL_DELAY = 0.5
DEFAULT_MAX_LOAD = 50.0

Constructor Details

#initialize(timeout: nil, compliant_mode: nil, max_load: nil, crawl_delay: nil, disable_ssl_certificate_check: nil, australian_proxy: nil, user_agent: nil) ⇒ AgentConfig

Creates a Mechanize agent configuration. Any argument left as nil falls back to the corresponding class-level default, which can be changed via configure.

Parameters:

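  • timeout (Integer, nil) (defaults to: nil)

    Timeout in seconds for agent connections (default: 60)

  • max_load (Float, nil) (defaults to: nil)

    Max load presented to the external server as a percentage, clamped to the range 10.0 to 100.0 (default: 50.0)

  • crawl_delay (Float, nil) (defaults to: nil)

    Crawl delay between requests in seconds (default: 0.5)

  • disable_ssl_certificate_check (Boolean, nil) (defaults to: nil)

    Skip SSL verification (default: false)

  • australian_proxy (Boolean, nil) (defaults to: nil)

    Use the Australian proxy if one is configured (default: false)

  • user_agent (String, nil) (defaults to: nil)

    User agent string for the Mechanize agent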



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 87

def initialize(timeout: nil,
               compliant_mode: nil,
               max_load: nil,
               crawl_delay: nil,
               disable_ssl_certificate_check: nil,
               australian_proxy: nil,
               user_agent: nil)
  @timeout = timeout.nil? ? self.class.default_timeout : timeout
  @user_agent = user_agent.nil? ? self.class.default_user_agent : user_agent

  @disable_ssl_certificate_check = if disable_ssl_certificate_check.nil?
                                     self.class.default_disable_ssl_certificate_check
                                   else
                                     disable_ssl_certificate_check
                                   end
  @australian_proxy = if australian_proxy.nil?
                        self.class.default_australian_proxy
                      else
                        australian_proxy
                      end
  @crawl_delay = crawl_delay.nil? ? self.class.default_crawl_delay : crawl_delay.to_f
  # Clamp between 10 (delay 9 x response) and 100 (no delay)
  @max_load = (max_load.nil? ? self.class.default_max_load : max_load).to_f.clamp(10.0, 100.0)

  # Validate proxy URL format if proxy will be used
  @australian_proxy &&= !ScraperUtils.australian_proxy.to_s.empty?
  if @australian_proxy
    uri = begin
            URI.parse(ScraperUtils.australian_proxy.to_s)
          rescue URI::InvalidURIError => e
            raise URI::InvalidURIError, "Invalid proxy URL format: #{e}"
          end
    unless uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
      raise URI::InvalidURIError, "Proxy URL must start with http:// or https://"
    end
    unless !uri.host.to_s.empty? && uri.port&.positive?
      raise URI::InvalidURIError, "Proxy URL must include host and port"
    end
  end

  today = Date.today.strftime("%Y-%m-%d")
  @user_agent = ENV.fetch("MORPH_USER_AGENT", nil)&.sub("TODAY", today)
  version = ScraperUtils::VERSION
  @user_agent ||= "Mozilla/5.0 (compatible; ScraperUtils/#{version} #{today}; +https://github.com/ianheggie-oaf/scraper_utils)"

  display_options
end
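
The proxy checks in the constructor can be tried in isolation. The sketch below pulls them into a hypothetical helper, `validate_proxy_url`, which is not part of ScraperUtils; it assumes the same three rules as the source above (parseable URI, http or https scheme, host and port present).

```ruby
require "uri"

# Hypothetical helper mirroring the constructor's proxy URL checks.
# Returns the parsed URI, or raises URI::InvalidURIError.
def validate_proxy_url(url)
  uri = begin
    URI.parse(url.to_s)
  rescue URI::InvalidURIError => e
    raise URI::InvalidURIError, "Invalid proxy URL format: #{e}"
  end
  # URI::HTTPS is a subclass of URI::HTTP, so this single check accepts both schemes.
  unless uri.is_a?(URI::HTTP)
    raise URI::InvalidURIError, "Proxy URL must start with http:// or https://"
  end
  if uri.host.to_s.empty? || !uri.port&.positive?
    raise URI::InvalidURIError, "Proxy URL must include host and port"
  end

  uri
end

validate_proxy_url("http://proxy.example.com:8080") # accepted
```

A URL with another scheme, such as `socks5://proxy.example.com:1080`, parses as a generic URI and is rejected by the scheme check. When the port is omitted, Ruby's URI classes fill in the scheme's default port (80 or 443), so the port check passes for such URLs.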

Class Attribute Details

.default_australian_proxy ⇒ Boolean

Returns the default flag for Australian proxy preference.

Returns:

  • (Boolean)

    Default flag for Australian proxy preference



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 37

def default_australian_proxy
  @default_australian_proxy
end

.default_crawl_delay ⇒ Float?

Returns the default crawl delay between requests, in seconds.

Returns:

  • (Float, nil)

    Default Crawl delay between requests in seconds



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 43

def default_crawl_delay
  @default_crawl_delay
end

.default_disable_ssl_certificate_check ⇒ Boolean

Returns the default setting for SSL certificate verification.

Returns:

  • (Boolean)

    Default setting for SSL certificate verification



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 34

def default_disable_ssl_certificate_check
  @default_disable_ssl_certificate_check
end

.default_max_load ⇒ Float?

A value of 50 results in a pause the same length as the response (i.e. 50% of total time is spent on the response and 50% pausing).

Returns:

  • (Float, nil)

    Default max load presented to an external server, as a percentage



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 47

def default_max_load
  @default_max_load
end
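
The description of max load, together with the constructor comment "Clamp between 10 (delay 9 x response) and 100 (no delay)", suggests the relationship pause = response_time × (100 − max_load) / max_load. The sketch below assumes that formula; `pause_for` is a hypothetical helper, and the library's actual delay calculation is not shown on this page.

```ruby
# Hypothetical helper: pause needed so that the response accounts for
# `max_load` percent of the total (response + pause) time.
def pause_for(response_time, max_load)
  percent = max_load.to_f.clamp(10.0, 100.0) # same bounds the constructor applies
  response_time * (100.0 - percent) / percent
end

pause_for(2.0, 50)  # pause equals the response time
pause_for(1.0, 10)  # pause is nine times the response time
pause_for(1.0, 100) # no pause
```

Under this assumption, values below 10 are treated as 10, so the pause never exceeds nine times the response time.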

.default_timeout ⇒ Integer

Returns the default timeout in seconds for agent connections.

Returns:

  • (Integer)

    Default timeout in seconds for agent connections



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 31

def default_timeout
  @default_timeout
end

.default_user_agent ⇒ String?

Returns the default Mechanize user agent.

Returns:

  • (String, nil)

    Default Mechanize user agent



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 40

def default_user_agent
  @default_user_agent
end

Instance Attribute Details

#crawl_delay ⇒ Object (readonly)

Exposed for testing.



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 80

def crawl_delay
  @crawl_delay
end

#max_load ⇒ Object (readonly)

Exposed for testing.



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 80

def max_load
  @max_load
end

#user_agent ⇒ String (readonly)

Returns the user agent string.

Returns:

  • (String)

    User agent string



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 76

def user_agent
  @user_agent
end
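
The constructor source shows how this string is assembled: a MORPH_USER_AGENT value has its first "TODAY" placeholder replaced with the current date, and a ScraperUtils string is used when that variable is unset. The sketch below mirrors those two steps in a hypothetical helper, `build_user_agent`; the `template` argument stands in for the environment variable and `version` for ScraperUtils::VERSION.

```ruby
require "date"

# Hypothetical helper mirroring how the constructor builds the user agent string.
# `template` stands in for ENV["MORPH_USER_AGENT"] (nil when unset).
def build_user_agent(template, version:, today: Date.today)
  stamp = today.strftime("%Y-%m-%d")
  # String#sub replaces only the first occurrence of "TODAY".
  custom = template&.sub("TODAY", stamp)
  custom || "Mozilla/5.0 (compatible; ScraperUtils/#{version} #{stamp}; " \
            "+https://github.com/ianheggie-oaf/scraper_utils)"
end

build_user_agent("MyBot/1.0 TODAY", version: "0.0.0", today: Date.new(2025, 1, 2))
```

The names `MyBot/1.0` and the version `0.0.0` are placeholders chosen for illustration only.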

Class Method Details

.configure {|self| ... } ⇒ void

This method returns an undefined value.

Configure default settings for all AgentConfig instances

Examples:

AgentConfig.configure do |config|
  config.default_timeout = 300
end

Yields:

  • (self)

    Yields self for configuration



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 56

def configure
  yield self if block_given?
end

.reset_defaults! ⇒ void

This method returns an undefined value.

Reset all configuration options to their default values



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 62

def reset_defaults!
  @default_timeout = ENV.fetch('MORPH_CLIENT_TIMEOUT', DEFAULT_TIMEOUT).to_i # 60
  @default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
  @default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
  @default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
  # ENV values are Strings, so convert to Float to match the documented types
  @default_crawl_delay = ENV.fetch('MORPH_CLIENT_CRAWL_DELAY', DEFAULT_CRAWL_DELAY).to_f # 0.5
  @default_max_load = ENV.fetch('MORPH_MAX_LOAD', DEFAULT_MAX_LOAD).to_f # 50.0
end
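
Two details of the environment handling above are easy to miss: the boolean flags are true for any non-empty value (including the string "false"), and numeric values arrive as Strings that need converting. The sketch below isolates both expressions in hypothetical helpers, `env_flag` and `env_integer`, which accept a plain Hash in place of ENV so the behaviour can be checked without touching the real environment.

```ruby
# Hypothetical helper mirroring the flag expression used in reset_defaults!.
# `env` defaults to ENV, but any object that responds to #fetch(key, default) works.
def env_flag(name, env = ENV)
  # Unset and empty values are false; every other value is true.
  !env.fetch(name, nil).to_s.empty?
end

# Hypothetical helper mirroring the timeout expression: fetch with a default, then to_i.
def env_integer(name, default, env = ENV)
  env.fetch(name, default).to_i
end

env_flag("MORPH_USE_PROXY", { "MORPH_USE_PROXY" => "false" }) # true: the value is non-empty
env_integer("MORPH_CLIENT_TIMEOUT", 60, {})                   # falls back to 60
```

To turn a flag off, the variable therefore has to be unset or empty rather than set to "false" or "0".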

Instance Method Details

#configure_agent(agent) ⇒ void

This method returns an undefined value.

Configures a Mechanize agent with these settings

Parameters:

  • agent (Mechanize)

    The agent to configure



# File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 138

def configure_agent(agent)
  agent.verify_mode = OpenSSL::SSL::VERIFY_NONE if @disable_ssl_certificate_check

  if @timeout
    agent.open_timeout = @timeout
    agent.read_timeout = @timeout
  end
  agent.user_agent = user_agent
  agent.request_headers ||= {}
  agent.request_headers["Accept"] =
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  agent.request_headers["Upgrade-Insecure-Requests"] = "1"
  if @australian_proxy
    agent.agent.set_proxy(ScraperUtils.australian_proxy)
    agent.request_headers["Accept-Language"] = "en-AU,en-US;q=0.9,en;q=0.8"
    verify_proxy_works(agent)
  end

  agent.pre_connect_hooks << method(:pre_connect_hook)
  agent.post_connect_hooks << method(:post_connect_hook)
end
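
The timeout and header steps of configure_agent can be exercised without a network connection by using a stand-in object. In the sketch below, `FakeAgent` and `apply_basic_settings` are hypothetical and cover only the attributes they name; SSL verification, the proxy itself, and the connect hooks are left out.

```ruby
# Hypothetical stand-in for a Mechanize agent, exposing only the attributes used here.
FakeAgent = Struct.new(:open_timeout, :read_timeout, :user_agent, :request_headers)

# Mirrors the timeout, user agent, and header steps of configure_agent.
def apply_basic_settings(agent, timeout:, user_agent:, australian_proxy: false)
  if timeout
    # One timeout value is applied to both opening the connection and reading from it.
    agent.open_timeout = timeout
    agent.read_timeout = timeout
  end
  agent.user_agent = user_agent
  agent.request_headers ||= {}
  agent.request_headers["Accept"] =
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  agent.request_headers["Upgrade-Insecure-Requests"] = "1"
  # The Accept-Language header is only added on the proxy path.
  agent.request_headers["Accept-Language"] = "en-AU,en-US;q=0.9,en;q=0.8" if australian_proxy
  agent
end

agent = apply_basic_settings(FakeAgent.new, timeout: 60, user_agent: "SketchBot/1.0")
agent.request_headers.keys # "Accept" and "Upgrade-Insecure-Requests"
```

With the library itself, the equivalent call is `config.configure_agent(agent)` on an AgentConfig instance, passing a Mechanize agent as described in the parameter list above.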