Class: ScraperUtils::MechanizeUtils::AgentConfig
- Inherits: Object
- Defined in: lib/scraper_utils/mechanize_utils/agent_config.rb
Overview
Configuration for a Mechanize agent with sensible defaults and configurable settings. Supports global configuration through AgentConfig.configure and per-instance overrides.
Constant Summary
- DEFAULT_TIMEOUT = 60
- DEFAULT_CRAWL_DELAY = 0.5
- DEFAULT_MAX_LOAD = 50.0
Class Attribute Summary
- .default_australian_proxy ⇒ Boolean
  Default flag for Australian proxy preference.
- .default_crawl_delay ⇒ Float?
  Default crawl delay between requests, in seconds.
- .default_disable_ssl_certificate_check ⇒ Boolean
  Default setting for SSL certificate verification.
- .default_max_load ⇒ Float?
  Default maximum load, as a percentage of total time. A value of 50 results in a pause the same length as the response (i.e. 50% of the total time is the response and 50% is pausing).
- .default_timeout ⇒ Integer
  Default timeout in seconds for agent connections.
- .default_user_agent ⇒ String?
  Default Mechanize user agent.
Instance Attribute Summary
- #crawl_delay ⇒ Object (readonly)
  Crawl delay in seconds. Exposed for testing.
- #max_load ⇒ Object (readonly)
  Maximum load percentage. Exposed for testing.
- #user_agent ⇒ String (readonly)
  User agent string.
Class Method Summary
- .configure {|self| ... } ⇒ void
  Configure default settings for all AgentConfig instances.
- .reset_defaults! ⇒ void
  Reset all configuration options to their default values.
Instance Method Summary
- #configure_agent(agent) ⇒ void
  Configures a Mechanize agent with these settings.
- #initialize(timeout: nil, compliant_mode: nil, max_load: nil, crawl_delay: nil, disable_ssl_certificate_check: nil, australian_proxy: nil, user_agent: nil) ⇒ AgentConfig (constructor)
  Creates a Mechanize agent configuration with sensible defaults that can be overridden via configure.
Constructor Details
#initialize(timeout: nil, compliant_mode: nil, max_load: nil, crawl_delay: nil, disable_ssl_certificate_check: nil, australian_proxy: nil, user_agent: nil) ⇒ AgentConfig
Creates a Mechanize agent configuration with sensible defaults that can be overridden via configure.
    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 87

    def initialize(timeout: nil,
                   compliant_mode: nil,
                   max_load: nil,
                   crawl_delay: nil,
                   disable_ssl_certificate_check: nil,
                   australian_proxy: nil,
                   user_agent: nil)
      @timeout = timeout.nil? ? self.class.default_timeout : timeout
      @user_agent = user_agent.nil? ? self.class.default_user_agent : user_agent
      @disable_ssl_certificate_check = if disable_ssl_certificate_check.nil?
                                         self.class.default_disable_ssl_certificate_check
                                       else
                                         disable_ssl_certificate_check
                                       end
      @australian_proxy = if australian_proxy.nil?
                            self.class.default_australian_proxy
                          else
                            australian_proxy
                          end
      @crawl_delay = crawl_delay.nil? ? self.class.default_crawl_delay : crawl_delay.to_f
      # Clamp between 10 (delay 9 x response) and 100 (no delay)
      @max_load = (max_load.nil? ? self.class.default_max_load : max_load).to_f.clamp(10.0, 100.0)

      # Validate proxy URL format if proxy will be used
      @australian_proxy &&= !ScraperUtils.australian_proxy.to_s.empty?
      if @australian_proxy
        uri = begin
          URI.parse(ScraperUtils.australian_proxy.to_s)
        rescue URI::InvalidURIError => e
          raise URI::InvalidURIError, "Invalid proxy URL format: #{e}"
        end
        unless uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
          raise URI::InvalidURIError, "Proxy URL must start with http:// or https://"
        end
        unless !uri.host.to_s.empty? && uri.port&.positive?
          raise URI::InvalidURIError, "Proxy URL must include host and port"
        end
      end

      today = Date.today.strftime("%Y-%m-%d")
      @user_agent = ENV.fetch("MORPH_USER_AGENT", nil)&.sub("TODAY", today)
      version = ScraperUtils::VERSION
      @user_agent ||= "Mozilla/5.0 (compatible; ScraperUtils/#{version} #{today}; +https://github.com/ianheggie-oaf/scraper_utils)"
    end
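The proxy checks in the constructor can be tried in isolation. The following is a minimal sketch that lifts the same three checks into a standalone method; the name `validate_proxy_url` is hypothetical and is not part of ScraperUtils.

```ruby
require "uri"

# Hypothetical helper (not part of ScraperUtils): applies the same three
# checks the constructor performs on ScraperUtils.australian_proxy.
def validate_proxy_url(proxy_url)
  uri = begin
    URI.parse(proxy_url.to_s)
  rescue URI::InvalidURIError => e
    raise URI::InvalidURIError, "Invalid proxy URL format: #{e}"
  end
  # URI::HTTPS is a subclass of URI::HTTP, so both schemes pass this check.
  unless uri.is_a?(URI::HTTP) || uri.is_a?(URI::HTTPS)
    raise URI::InvalidURIError, "Proxy URL must start with http:// or https://"
  end
  unless !uri.host.to_s.empty? && uri.port&.positive?
    raise URI::InvalidURIError, "Proxy URL must include host and port"
  end
  uri
end

# The host name below is a placeholder.
proxy = validate_proxy_url("http://user:secret@proxy.example.com:8080")
puts "#{proxy.host}:#{proxy.port}"
```

Because Ruby's URI library fills in the scheme's default port, a URL such as `http://proxy.example.com` with no explicit port still satisfies the port check.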
Class Attribute Details
.default_australian_proxy ⇒ Boolean
Returns the default flag for Australian proxy preference.

    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 37

    def default_australian_proxy
      @default_australian_proxy
    end
.default_crawl_delay ⇒ Float?
Returns the default crawl delay between requests, in seconds.

    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 43

    def default_crawl_delay
      @default_crawl_delay
    end
.default_disable_ssl_certificate_check ⇒ Boolean
Returns the default setting for SSL certificate verification.

    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 34

    def default_disable_ssl_certificate_check
      @default_disable_ssl_certificate_check
    end
.default_max_load ⇒ Float?
Returns the default maximum load, as a percentage of total time. A value of 50 results in a pause the same length as the response (i.e. 50% of the total time is the response and 50% is pausing).

    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 47

    def default_max_load
      @default_max_load
    end
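The relationship between max_load and the pause length can be written as a one-line formula. The code that actually applies the pause is not shown on this page, so the sketch below is inferred from the documented data points only (50 gives a pause equal to the response time, 10 gives nine times the response time, 100 gives no pause). The method name `pause_for` is hypothetical.

```ruby
# Hypothetical helper (not part of ScraperUtils): the pause, in seconds, that
# keeps the response time at max_load percent of response-plus-pause time.
# Assumes the same 10..100 clamp the constructor applies to max_load.
def pause_for(response_time, max_load)
  percent = max_load.to_f.clamp(10.0, 100.0)
  response_time * (100.0 - percent) / percent
end

puts pause_for(2.0, 50)   # pause equals the response time
puts pause_for(1.0, 10)   # pause is nine times the response time
puts pause_for(3.0, 100)  # no pause
```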
.default_timeout ⇒ Integer
Returns the default timeout in seconds for agent connections.

    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 31

    def default_timeout
      @default_timeout
    end
.default_user_agent ⇒ String?
Returns the default Mechanize user agent.

    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 40

    def default_user_agent
      @default_user_agent
    end
Instance Attribute Details
#crawl_delay ⇒ Object (readonly)
Crawl delay in seconds. Exposed for testing.

    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 80

    def crawl_delay
      @crawl_delay
    end
#max_load ⇒ Object (readonly)
Maximum load percentage. Exposed for testing.

    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 80

    def max_load
      @max_load
    end
#user_agent ⇒ String (readonly)
Returns the user agent string.

    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 76

    def user_agent
      @user_agent
    end
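How the user agent string is assembled follows from the last lines of the constructor: a MORPH_USER_AGENT value has the literal text TODAY replaced with the current date, and a ScraperUtils identification string is used when the variable is unset. The sketch below reproduces that logic as a standalone method; `resolve_user_agent` and its parameters are hypothetical names, and the version string is a placeholder.

```ruby
require "date"

# Hypothetical helper (not part of ScraperUtils): mirrors the user agent
# resolution at the end of AgentConfig#initialize.
def resolve_user_agent(env_value, version:, today: Date.today)
  stamp = today.strftime("%Y-%m-%d")
  # Only the first occurrence of TODAY is replaced, as with String#sub.
  env_value&.sub("TODAY", stamp) ||
    "Mozilla/5.0 (compatible; ScraperUtils/#{version} #{stamp}; " \
    "+https://github.com/ianheggie-oaf/scraper_utils)"
end

puts resolve_user_agent("MyScraper/1.0 TODAY", version: "0.0.0")
puts resolve_user_agent(nil, version: "0.0.0")
```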
Class Method Details
.configure {|self| ... } ⇒ void
This method returns an undefined value.
Configure default settings for all AgentConfig instances.

    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 56

    def configure
      yield self if block_given?
    end
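The combination of class-level defaults and per-instance overrides can be illustrated with a small stand-in class. `MiniConfig` below is hypothetical and only mirrors the pattern documented here (a configure block that yields the class, and a constructor in which nil means "use the default"); with the gem loaded, the same calls would presumably be made on ScraperUtils::MechanizeUtils::AgentConfig.

```ruby
# Hypothetical stand-in that mirrors the documented pattern; it is not the
# real AgentConfig class.
class MiniConfig
  DEFAULT_TIMEOUT = 60

  class << self
    attr_accessor :default_timeout

    # Yield the class itself so callers can assign class-level defaults.
    def configure
      yield self if block_given?
    end

    def reset_defaults!
      @default_timeout = DEFAULT_TIMEOUT
    end
  end

  reset_defaults!

  attr_reader :timeout

  # A nil argument means "fall back to the class-level default".
  def initialize(timeout: nil)
    @timeout = timeout.nil? ? self.class.default_timeout : timeout
  end
end

MiniConfig.configure { |config| config.default_timeout = 90 }
global = MiniConfig.new              # picks up the configured default
custom = MiniConfig.new(timeout: 5)  # per-instance override wins
MiniConfig.reset_defaults!           # back to DEFAULT_TIMEOUT for new instances
```

Defaults are read when an instance is created, so instances built before a later configure or reset_defaults! call keep the values they started with.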
.reset_defaults! ⇒ void
This method returns an undefined value.
Reset all configuration options to their default values.

    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 62

    def reset_defaults!
      @default_timeout = ENV.fetch('MORPH_CLIENT_TIMEOUT', DEFAULT_TIMEOUT).to_i # 60
      @default_disable_ssl_certificate_check = !ENV.fetch('MORPH_DISABLE_SSL_CHECK', nil).to_s.empty? # false
      @default_australian_proxy = !ENV.fetch('MORPH_USE_PROXY', nil).to_s.empty? # false
      @default_user_agent = ENV.fetch('MORPH_USER_AGENT', nil) # Uses Mechanize user agent
      @default_crawl_delay = ENV.fetch('MORPH_CLIENT_CRAWL_DELAY', DEFAULT_CRAWL_DELAY)
      @default_max_load = ENV.fetch('MORPH_MAX_LOAD', DEFAULT_MAX_LOAD)
    end
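The environment lookups above can be explored without touching the real environment. The sketch below applies the same fetch-with-fallback expressions to a plain Hash; `defaults_from` is a hypothetical name, and the `.to_f` conversions on the last two entries are an addition of this sketch rather than something reset_defaults! does.

```ruby
# Hypothetical helper (not part of ScraperUtils): evaluates the same lookups
# as reset_defaults! against any Hash-like object, such as ENV.
def defaults_from(env)
  {
    timeout: env.fetch("MORPH_CLIENT_TIMEOUT", 60).to_i,
    # The two flags are true for ANY non-empty value, including "0" or "false".
    disable_ssl_certificate_check: !env.fetch("MORPH_DISABLE_SSL_CHECK", nil).to_s.empty?,
    australian_proxy: !env.fetch("MORPH_USE_PROXY", nil).to_s.empty?,
    # Assumption of this sketch: convert to Float here for easier comparison.
    crawl_delay: env.fetch("MORPH_CLIENT_CRAWL_DELAY", 0.5).to_f,
    max_load: env.fetch("MORPH_MAX_LOAD", 50.0).to_f
  }
end

p defaults_from({})
p defaults_from("MORPH_CLIENT_TIMEOUT" => "90", "MORPH_USE_PROXY" => "false")
```

Because the flags only test for a non-empty value, setting MORPH_USE_PROXY=false still enables the proxy preference; leave the variable unset or empty to disable it.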
Instance Method Details
#configure_agent(agent) ⇒ void
This method returns an undefined value.
Configures a Mechanize agent with these settings.

    # File 'lib/scraper_utils/mechanize_utils/agent_config.rb', line 138

    def configure_agent(agent)
      agent.verify_mode = OpenSSL::SSL::VERIFY_NONE if @disable_ssl_certificate_check

      if @timeout
        agent.open_timeout = @timeout
        agent.read_timeout = @timeout
      end
      agent.user_agent = user_agent
      agent.request_headers ||= {}
      agent.request_headers["Accept"] =
        "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
      agent.request_headers["Upgrade-Insecure-Requests"] = "1"
      if @australian_proxy
        agent.agent.set_proxy(ScraperUtils.australian_proxy)
        agent.request_headers["Accept-Language"] = "en-AU,en-US;q=0.9,en;q=0.8"
        verify_proxy_works(agent)
      end

      agent.pre_connect_hooks << method(:pre_connect_hook)
      agent.post_connect_hooks << method(:post_connect_hook)
    end