Class: Spidr::Agent
- Inherits: Object
- Includes: Settings::UserAgent
- Defined in:
  lib/spidr/agent.rb,
  lib/spidr/agent/events.rb,
  lib/spidr/agent/robots.rb,
  lib/spidr/agent/actions.rb,
  lib/spidr/agent/filters.rb,
  lib/spidr/agent/sanitizers.rb
Defined Under Namespace
Modules: Actions
Instance Attribute Summary collapse
-
#authorized ⇒ AuthStore
HTTP Authentication credentials.
-
#cookies ⇒ CookieJar
readonly
Cached cookies.
-
#default_headers ⇒ Hash{String => String}
readonly
HTTP Headers to use for every request.
-
#delay ⇒ Integer
Delay in between fetching pages.
-
#failures ⇒ Set<URI::HTTP>
List of unreachable URLs.
-
#history ⇒ Set<URI::HTTP>
(also: #visited_urls)
History containing visited URLs.
-
#host_header ⇒ String
HTTP Host header to use.
-
#host_headers ⇒ Hash{String,Regexp => String}
readonly
HTTP Host headers to use for specific hosts.
-
#levels ⇒ Hash{URI::HTTP => Integer}
readonly
The visited URLs and their depth within a site.
-
#limit ⇒ Integer
readonly
Maximum number of pages to visit.
-
#max_depth ⇒ Integer
readonly
Maximum depth.
-
#queue ⇒ Array<URI::HTTP>
(also: #pending_urls)
Queue of URLs to visit.
-
#referer ⇒ String
Referer to use.
-
#schemes ⇒ Object
List of acceptable URL schemes to follow.
-
#sessions ⇒ SessionCache
readonly
The session cache.
-
#strip_fragments ⇒ Object
Specifies whether the Agent will strip URI fragments.
-
#strip_query ⇒ Object
Specifies whether the Agent will strip URI queries.
Attributes included from Settings::UserAgent
Class Method Summary collapse
-
.default_schemes ⇒ Array<String>
protected
Determines the default URI schemes to follow.
-
.domain(name, **kwargs) {|agent| ... } ⇒ Agent
Creates a new agent and spiders the entire domain.
-
.host(name, **kwargs) {|agent| ... } ⇒ Agent
Creates a new agent and spiders the given host.
-
.site(url, **kwargs) {|agent| ... } ⇒ Agent
Creates a new agent and spiders the web-site located at the given URL.
-
.start_at(url, **kwargs) {|agent| ... } ⇒ Agent
Creates a new agent and begin spidering at the given URL.
Instance Method Summary collapse
-
#all_headers {|headers| ... } ⇒ Object
Pass the headers from every response the agent receives to a given block.
-
#clear ⇒ Object
Clears the history of the agent.
-
#continue! {|page| ... } ⇒ Object
Continue spidering.
-
#dequeue ⇒ URI::HTTP
protected
Dequeues a URL that will later be visited.
-
#enqueue(url, level = 0) ⇒ Boolean
Enqueues a given URL for visiting, only if it passes all of the agent's rules for visiting a given URL.
-
#every_atom_doc {|doc| ... } ⇒ Object
Pass every Atom document that the agent parses to a given block.
-
#every_atom_page {|feed| ... } ⇒ Object
Pass every Atom feed that the agent visits to a given block.
-
#every_bad_request_page {|page| ... } ⇒ Object
Pass every Bad Request page that the agent visits to a given block.
-
#every_css_page {|page| ... } ⇒ Object
Pass every CSS page that the agent visits to a given block.
-
#every_doc {|doc| ... } ⇒ Object
Pass every HTML or XML document that the agent parses to a given block.
-
#every_failed_url {|url| ... } ⇒ Object
Pass each URL that could not be requested to the given block.
-
#every_forbidden_page {|page| ... } ⇒ Object
Pass every Forbidden page that the agent visits to a given block.
-
#every_html_doc {|doc| ... } ⇒ Object
Pass every HTML document that the agent parses to a given block.
-
#every_html_page {|page| ... } ⇒ Object
Pass every HTML page that the agent visits to a given block.
-
#every_internal_server_error_page {|page| ... } ⇒ Object
Pass every Internal Server Error page that the agent visits to a given block.
-
#every_javascript_page {|page| ... } ⇒ Object
Pass every JavaScript page that the agent visits to a given block.
-
#every_link {|origin, dest| ... } ⇒ Object
Passes every origin and destination URI of each link to a given block.
-
#every_missing_page {|page| ... } ⇒ Object
Pass every Missing page that the agent visits to a given block.
-
#every_ms_word_page {|page| ... } ⇒ Object
Pass every MS Word page that the agent visits to a given block.
-
#every_ok_page {|page| ... } ⇒ Object
Pass every OK page that the agent visits to a given block.
-
#every_page {|page| ... } ⇒ Object
Pass every page that the agent visits to a given block.
-
#every_pdf_page {|page| ... } ⇒ Object
Pass every PDF page that the agent visits to a given block.
-
#every_redirect_page {|page| ... } ⇒ Object
Pass every Redirect page that the agent visits to a given block.
-
#every_rss_doc {|doc| ... } ⇒ Object
Pass every RSS document that the agent parses to a given block.
-
#every_rss_page {|feed| ... } ⇒ Object
Pass every RSS feed that the agent visits to a given block.
-
#every_timedout_page {|page| ... } ⇒ Object
Pass every Timeout page that the agent visits to a given block.
-
#every_txt_page {|page| ... } ⇒ Object
Pass every Plain Text page that the agent visits to a given block.
-
#every_unauthorized_page {|page| ... } ⇒ Object
Pass every Unauthorized page that the agent visits to a given block.
-
#every_url {|url| ... } ⇒ Object
Pass each URL from each page visited to the given block.
-
#every_url_like(pattern) {|url| ... } ⇒ Object
Pass every URL that the agent visits, and matches a given pattern, to a given block.
-
#every_xml_doc {|doc| ... } ⇒ Object
Pass every XML document that the agent parses to a given block.
-
#every_xml_page {|page| ... } ⇒ Object
Pass every XML page that the agent visits to a given block.
-
#every_xsl_doc {|doc| ... } ⇒ Object
Pass every XML Stylesheet (XSL) that the agent parses to a given block.
-
#every_xsl_page {|page| ... } ⇒ Object
Pass every XML Stylesheet (XSL) page that the agent visits to a given block.
-
#every_zip_page {|page| ... } ⇒ Object
Pass every ZIP page that the agent visits to a given block.
-
#failed(url) ⇒ Object
protected
Adds a given URL to the failures list.
-
#failed?(url) ⇒ Boolean
Determines whether a given URL could not be visited.
-
#get_page(url) {|page| ... } ⇒ Page?
Requests and creates a new Page object from a given URL.
-
#ignore_exts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match URI path extensions to not visit.
-
#ignore_exts_like(pattern = nil) {|ext| ... } ⇒ Object
Adds a given pattern to the #ignore_exts.
-
#ignore_hosts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match host-names to not visit.
-
#ignore_hosts_like(pattern = nil) {|host| ... } ⇒ Object
Adds a given pattern to the #ignore_hosts.
-
#ignore_links ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match links to not visit.
-
#ignore_links_like(pattern = nil) {|link| ... } ⇒ Object
Adds a given pattern to the #ignore_links.
-
#ignore_ports ⇒ Array<Integer, Regexp, Proc>
Specifies the patterns that match ports to not visit.
-
#ignore_ports_like(pattern = nil) {|port| ... } ⇒ Object
Adds a given pattern to the #ignore_ports.
-
#ignore_urls ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match URLs to not visit.
-
#ignore_urls_like(pattern = nil) {|url| ... } ⇒ Object
Adds a given pattern to the #ignore_urls.
-
#initialize(host_header: nil, host_headers: {}, default_headers: {}, user_agent: Spidr.user_agent, referer: nil, proxy: Spidr.proxy, open_timeout: Spidr.open_timeout, ssl_timeout: Spidr.ssl_timeout, read_timeout: Spidr.read_timeout, continue_timeout: Spidr.continue_timeout, keep_alive_timeout: Spidr.keep_alive_timeout, delay: 0, limit: nil, max_depth: nil, queue: nil, history: nil, strip_fragments: true, strip_query: false, schemes: self.class.default_schemes, host: nil, hosts: nil, ignore_hosts: nil, ports: nil, ignore_ports: nil, links: nil, ignore_links: nil, urls: nil, ignore_urls: nil, exts: nil, ignore_exts: nil, robots: Spidr.robots?) {|agent| ... } ⇒ Agent
constructor
Creates a new Agent object.
- #initialize_actions ⇒ Object protected
- #initialize_events ⇒ Object protected
-
#initialize_filters(schemes: self.class.default_schemes, host: nil, hosts: nil, ignore_hosts: nil, ports: nil, ignore_ports: nil, links: nil, ignore_links: nil, urls: nil, ignore_urls: nil, exts: nil, ignore_exts: nil) ⇒ Object
protected
Initializes filtering rules.
-
#initialize_robots ⇒ Object
Initializes the robots filter.
-
#initialize_sanitizers(strip_fragments: true, strip_query: false) ⇒ Object
protected
Initializes the Sanitizer rules.
-
#limit_reached? ⇒ Boolean
protected
Determines if the maximum limit has been reached.
-
#pause! ⇒ Object
Pauses the agent, causing spidering to temporarily stop.
-
#pause=(state) ⇒ Object
Sets the pause state of the agent.
-
#paused? ⇒ Boolean
Determines whether the agent is paused.
-
#post_page(url, post_data = '') {|page| ... } ⇒ Page?
Posts supplied form data and creates a new Page object from a given URL.
-
#prepare_request(url) {|request| ... } ⇒ Object
protected
Normalizes the request path and grabs a session to handle page get and post requests.
-
#prepare_request_headers(url) ⇒ Hash{String => String}
protected
Prepares request headers for the given URL.
-
#proxy ⇒ Proxy
The proxy information the agent uses.
-
#proxy=(new_proxy) ⇒ Proxy
Sets the proxy information that the agent uses.
-
#queued?(url) ⇒ Boolean
Determines whether a given URL has been enqueued.
-
#robot_allowed?(url) ⇒ Boolean
Determines whether a URL is allowed by the robot policy.
-
#run {|page| ... } ⇒ Object
Start spidering until the queue becomes empty or the agent is paused.
-
#running? ⇒ Boolean
Determines if the agent is running.
-
#sanitize_url(url) ⇒ URI::HTTP, URI::HTTPS
Sanitizes a URL based on filtering options.
-
#skip_link! ⇒ Object
Causes the agent to skip the link being enqueued.
-
#skip_page! ⇒ Object
Causes the agent to skip the page being visited.
-
#start_at(url) {|page| ... } ⇒ Object
Start spidering at a given URL.
-
#to_hash ⇒ Hash
Converts the agent into a Hash.
- #urls_like(pattern, &block) ⇒ Object
-
#visit?(url) ⇒ Boolean
protected
Determines if a given URL should be visited.
-
#visit_ext?(path) ⇒ Boolean
protected
Determines if a given URI path extension should be visited.
-
#visit_exts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the URI path extensions to visit.
-
#visit_exts_like(pattern = nil) {|ext| ... } ⇒ Object
Adds a given pattern to the #visit_exts.
-
#visit_host?(host) ⇒ Boolean
protected
Determines if a given host-name should be visited.
-
#visit_hosts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match host-names to visit.
-
#visit_hosts_like(pattern = nil) {|host| ... } ⇒ Object
Adds a given pattern to the #visit_hosts.
-
#visit_link?(link) ⇒ Boolean
protected
Determines if a given link should be visited.
-
#visit_links ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the links to visit.
-
#visit_links_like(pattern = nil) {|link| ... } ⇒ Object
Adds a given pattern to the #visit_links.
-
#visit_page(url) {|page| ... } ⇒ Page?
Visits a given URL, and enqueues the links recovered from the URL to be visited later.
-
#visit_port?(port) ⇒ Boolean
protected
Determines if a given port should be visited.
-
#visit_ports ⇒ Array<Integer, Regexp, Proc>
Specifies the patterns that match the ports to visit.
-
#visit_ports_like(pattern = nil) {|port| ... } ⇒ Object
Adds a given pattern to the #visit_ports.
-
#visit_scheme?(scheme) ⇒ Boolean
protected
Determines if a given URI scheme should be visited.
-
#visit_url?(link) ⇒ Boolean
protected
Determines if a given URL should be visited.
-
#visit_urls ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the URLs to visit.
-
#visit_urls_like(pattern = nil) {|url| ... } ⇒ Object
Adds a given pattern to the #visit_urls.
-
#visited?(url) ⇒ Boolean
Determines whether a URL was visited or not.
-
#visited_hosts ⇒ Array<String>
Specifies all hosts that were visited.
-
#visited_links ⇒ Array<String>
Specifies the links which have been visited.
Constructor Details
#initialize(host_header: nil, host_headers: {}, default_headers: {}, user_agent: Spidr.user_agent, referer: nil, proxy: Spidr.proxy, open_timeout: Spidr.open_timeout, ssl_timeout: Spidr.ssl_timeout, read_timeout: Spidr.read_timeout, continue_timeout: Spidr.continue_timeout, keep_alive_timeout: Spidr.keep_alive_timeout, delay: 0, limit: nil, max_depth: nil, queue: nil, history: nil, strip_fragments: true, strip_query: false, schemes: self.class.default_schemes, host: nil, hosts: nil, ignore_hosts: nil, ports: nil, ignore_ports: nil, links: nil, ignore_links: nil, urls: nil, ignore_urls: nil, exts: nil, ignore_exts: nil, robots: Spidr.robots?) {|agent| ... } ⇒ Agent
Creates a new Agent object.
# File 'lib/spidr/agent.rb', line 214

def initialize(# header keyword arguments
               host_header:     nil,
               host_headers:    {},
               default_headers: {},
               user_agent:      Spidr.user_agent,
               referer:         nil,
               # session cache keyword arguments
               proxy:              Spidr.proxy,
               open_timeout:       Spidr.open_timeout,
               ssl_timeout:        Spidr.ssl_timeout,
               read_timeout:       Spidr.read_timeout,
               continue_timeout:   Spidr.continue_timeout,
               keep_alive_timeout: Spidr.keep_alive_timeout,
               # spidering controls keyword arguments
               delay:     0,
               limit:     nil,
               max_depth: nil,
               # history keyword arguments
               queue:   nil,
               history: nil,
               # sanitizer keyword arguments
               strip_fragments: true,
               strip_query:     false,
               # filtering keyword arguments
               schemes:      self.class.default_schemes,
               host:         nil,
               hosts:        nil,
               ignore_hosts: nil,
               ports:        nil,
               ignore_ports: nil,
               links:        nil,
               ignore_links: nil,
               urls:         nil,
               ignore_urls:  nil,
               exts:         nil,
               ignore_exts:  nil,
               # robots keyword arguments
               robots: Spidr.robots?)
  @host_header  = host_header
  @host_headers = host_headers

  @default_headers = default_headers

  @user_agent = user_agent
  @referer    = referer

  @sessions = SessionCache.new(
    proxy:              proxy,
    open_timeout:       open_timeout,
    ssl_timeout:        ssl_timeout,
    read_timeout:       read_timeout,
    continue_timeout:   continue_timeout,
    keep_alive_timeout: keep_alive_timeout
  )
  @cookies    = CookieJar.new
  @authorized = AuthStore.new

  @running = false
  @delay   = delay

  @history  = Set[]
  @failures = Set[]
  @queue    = []

  @limit     = limit
  @levels    = Hash.new(0)
  @max_depth = max_depth

  self.queue   = queue   if queue
  self.history = history if history

  initialize_sanitizers(
    strip_fragments: strip_fragments,
    strip_query:     strip_query
  )

  initialize_filters(
    schemes:      schemes,
    host:         host,
    hosts:        hosts,
    ignore_hosts: ignore_hosts,
    ports:        ports,
    ignore_ports: ignore_ports,
    links:        links,
    ignore_links: ignore_links,
    urls:         urls,
    ignore_urls:  ignore_urls,
    exts:         exts,
    ignore_exts:  ignore_exts
  )

  initialize_actions
  initialize_events
  initialize_robots if robots

  yield self if block_given?
end
Instance Attribute Details
#authorized ⇒ AuthStore
HTTP Authentication credentials
# File 'lib/spidr/agent.rb', line 44

def authorized
  @authorized
end
#cookies ⇒ CookieJar (readonly)
Cached cookies
# File 'lib/spidr/agent.rb', line 81

def cookies
  @cookies
end
#default_headers ⇒ Hash{String => String} (readonly)
HTTP Headers to use for every request
# File 'lib/spidr/agent.rb', line 39

def default_headers
  @default_headers
end
#delay ⇒ Integer
Delay in between fetching pages
# File 'lib/spidr/agent.rb', line 54

def delay
  @delay
end
#failures ⇒ Set<URI::HTTP>
List of unreachable URLs
# File 'lib/spidr/agent.rb', line 64

def failures
  @failures
end
#history ⇒ Set<URI::HTTP> Also known as: visited_urls
History containing visited URLs
# File 'lib/spidr/agent.rb', line 59

def history
  @history
end
#host_header ⇒ String
HTTP Host header to use.
# File 'lib/spidr/agent.rb', line 27

def host_header
  @host_header
end
#host_headers ⇒ Hash{String,Regexp => String} (readonly)
HTTP Host headers to use for specific hosts.
# File 'lib/spidr/agent.rb', line 32

def host_headers
  @host_headers
end
#levels ⇒ Hash{URI::HTTP => Integer} (readonly)
The visited URLs and their depth within a site
# File 'lib/spidr/agent.rb', line 96

def levels
  @levels
end
#limit ⇒ Integer (readonly)
Maximum number of pages to visit.
# File 'lib/spidr/agent.rb', line 86

def limit
  @limit
end
#max_depth ⇒ Integer (readonly)
Maximum depth
# File 'lib/spidr/agent.rb', line 91

def max_depth
  @max_depth
end
#queue ⇒ Array<URI::HTTP> Also known as: pending_urls
Queue of URLs to visit
# File 'lib/spidr/agent.rb', line 69

def queue
  @queue
end
#referer ⇒ String
Referer to use
# File 'lib/spidr/agent.rb', line 49

def referer
  @referer
end
#schemes ⇒ Object
List of acceptable URL schemes to follow
# File 'lib/spidr/agent/filters.rb', line 9

def schemes
  @schemes
end
#sessions ⇒ SessionCache (readonly)
The session cache
# File 'lib/spidr/agent.rb', line 76

def sessions
  @sessions
end
#strip_fragments ⇒ Object
Specifies whether the Agent will strip URI fragments
# File 'lib/spidr/agent/sanitizers.rb', line 9

def strip_fragments
  @strip_fragments
end
#strip_query ⇒ Object
Specifies whether the Agent will strip URI queries
# File 'lib/spidr/agent/sanitizers.rb', line 12

def strip_query
  @strip_query
end
Class Method Details
.default_schemes ⇒ Array<String> (protected)
Determines the default URI schemes to follow.
# File 'lib/spidr/agent/filters.rb', line 429

def self.default_schemes
  schemes = ['http']

  begin
    require 'net/https'

    schemes << 'https'
  rescue Gem::LoadError => e
    raise(e)
  rescue ::LoadError
    warn "Warning: cannot load 'net/https', https support disabled"
  end

  return schemes
end
.domain(name, **kwargs) {|agent| ... } ⇒ Agent
Creates a new agent and spiders the entire domain.
# File 'lib/spidr/agent.rb', line 418

def self.domain(name,**kwargs,&block)
  agent = new(host: /(^|\.)#{Regexp.escape(name)}$/, **kwargs, &block)
  agent.start_at(URI::HTTP.build(host: name, path: '/'))
  return agent
end
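The `host:` regexp that `.domain` builds accepts the apex domain and any subdomain, while `Regexp.escape` and the anchors keep lookalike hosts out. A quick standalone check of that pattern (`example.com` here is a placeholder name, not from the source):

```ruby
# The same pattern .domain builds for host matching:
name    = 'example.com'
pattern = /(^|\.)#{Regexp.escape(name)}$/

# Matches the apex domain and any subdomain...
p pattern.match?('example.com')       # => true
p pattern.match?('www.example.com')   # => true
# ...but not lookalike hosts, thanks to the (^|\.) anchor and escaping:
p pattern.match?('badexample.com')    # => false
p pattern.match?('example.com.evil')  # => false
```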
.host(name, **kwargs) {|agent| ... } ⇒ Agent
Creates a new agent and spiders the given host.
# File 'lib/spidr/agent.rb', line 389

def self.host(name,**kwargs,&block)
  agent = new(host: name, **kwargs, &block)
  agent.start_at(URI::HTTP.build(host: name, path: '/'))
  return agent
end
.site(url, **kwargs) {|agent| ... } ⇒ Agent
Creates a new agent and spiders the web-site located at the given URL.
# File 'lib/spidr/agent.rb', line 360

def self.site(url,**kwargs,&block)
  url = URI(url)

  agent = new(host: url.host, **kwargs, &block)
  agent.start_at(url)
  return agent
end
.start_at(url, **kwargs) {|agent| ... } ⇒ Agent
Creates a new agent and begin spidering at the given URL.
# File 'lib/spidr/agent.rb', line 333

def self.start_at(url,**kwargs,&block)
  agent = new(**kwargs,&block)
  agent.start_at(url)
  return agent
end
Instance Method Details
#all_headers {|headers| ... } ⇒ Object
Pass the headers from every response the agent receives to a given block.
# File 'lib/spidr/agent/events.rb', line 70

def all_headers
  every_page { |page| yield page.headers }
end
#clear ⇒ Object
Clears the history of the agent.
# File 'lib/spidr/agent.rb', line 458

def clear
  @queue.clear
  @history.clear
  @failures.clear
  return self
end
#continue! {|page| ... } ⇒ Object
Continue spidering.
# File 'lib/spidr/agent/actions.rb', line 42

def continue!(&block)
  @paused = false
  return run(&block)
end
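Actions such as #skip_link! and #skip_page! work by raising control-flow exceptions, which are rescued in #enqueue (`Actions::Paused`, `Actions::SkipLink`, `Actions::Action`). A minimal pure-Ruby sketch of that pattern, with illustrative class and method names rather than the gem's actual definitions:

```ruby
# Sketch of exception-based control flow, mirroring Spidr's Actions.
# (Class and method names here are illustrative.)
class Action   < RuntimeError; end
class SkipLink < Action; end

def process_links(links)
  accepted = []
  links.each do |link|
    begin
      # a callback may veto a single link by raising SkipLink
      raise SkipLink if link.end_with?('.pdf')
      accepted << link
    rescue SkipLink
      next # skip just this link; spidering continues
    end
  end
  accepted
end

p process_links(['/a.html', '/b.pdf', '/c.html'])
# => ["/a.html", "/c.html"]
```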
#dequeue ⇒ URI::HTTP (protected)
Dequeues a URL that will later be visited.
# File 'lib/spidr/agent.rb', line 922

def dequeue
  @queue.shift
end
#enqueue(url, level = 0) ⇒ Boolean
Enqueues a given URL for visiting, only if it passes all of the agent's rules for visiting a given URL.
# File 'lib/spidr/agent.rb', line 658

def enqueue(url,level=0)
  url = sanitize_url(url)

  if (!queued?(url) && visit?(url))
    link = url.to_s

    begin
      @every_url_blocks.each { |url_block| url_block.call(url) }

      @every_url_like_blocks.each do |pattern,url_blocks|
        match = case pattern
                when Regexp
                  link =~ pattern
                else
                  (pattern == link) || (pattern == url)
                end

        if match
          url_blocks.each { |url_block| url_block.call(url) }
        end
      end
    rescue Actions::Paused => action
      raise(action)
    rescue Actions::SkipLink
      return false
    rescue Actions::Action
    end

    @queue << url
    @levels[url] = level
    return true
  end

  return false
end
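Inside #enqueue, every_url_like patterns are matched by type: a Regexp is tested against the URL's String form, while any other pattern is compared for equality against both the String and the URI. A standalone sketch of that case expression (the helper name is illustrative):

```ruby
require 'uri'

# How #enqueue decides whether an every_url_like pattern matches.
# (Helper name is illustrative; the logic mirrors the case expression above.)
def url_matches?(pattern, url)
  link = url.to_s

  case pattern
  when Regexp
    !!(link =~ pattern)
  else
    (pattern == link) || (pattern == url)
  end
end

url = URI('http://example.com/downloads/file.zip')

p url_matches?(/\.zip$/, url)                                 # => true
p url_matches?('http://example.com/downloads/file.zip', url)  # => true
p url_matches?(/\.exe$/, url)                                 # => false
```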
#every_atom_doc {|doc| ... } ⇒ Object
Pass every Atom document that the agent parses to a given block.
# File 'lib/spidr/agent/events.rb', line 389

def every_atom_doc
  every_page do |page|
    if (block_given? && page.atom?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end
#every_atom_page {|feed| ... } ⇒ Object
Pass every Atom feed that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 453

def every_atom_page
  every_page do |page|
    yield page if (block_given? && page.atom?)
  end
end
#every_bad_request_page {|page| ... } ⇒ Object
Pass every Bad Request page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 142

def every_bad_request_page
  every_page do |page|
    yield page if (block_given? && page.bad_request?)
  end
end
#every_css_page {|page| ... } ⇒ Object
Pass every CSS page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 423

def every_css_page
  every_page do |page|
    yield page if (block_given? && page.css?)
  end
end
#every_doc {|doc| ... } ⇒ Object
Pass every HTML or XML document that the agent parses to a given block.
# File 'lib/spidr/agent/events.rb', line 283

def every_doc
  every_page do |page|
    if block_given?
      if (doc = page.doc)
        yield doc
      end
    end
  end
end
#every_failed_url {|url| ... } ⇒ Object
Pass each URL that could not be requested to the given block.
# File 'lib/spidr/agent/events.rb', line 28

def every_failed_url(&block)
  @every_failed_url_blocks << block
  return self
end
#every_forbidden_page {|page| ... } ⇒ Object
Pass every Forbidden page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 172

def every_forbidden_page
  every_page do |page|
    yield page if (block_given? && page.forbidden?)
  end
end
#every_html_doc {|doc| ... } ⇒ Object
Pass every HTML document that the agent parses to a given block.
# File 'lib/spidr/agent/events.rb', line 304

def every_html_doc
  every_page do |page|
    if (block_given? && page.html?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end
#every_html_page {|page| ... } ⇒ Object
Pass every HTML page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 233

def every_html_page
  every_page do |page|
    yield page if (block_given? && page.html?)
  end
end
#every_internal_server_error_page {|page| ... } ⇒ Object
Pass every Internal Server Error page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 203

def every_internal_server_error_page
  every_page do |page|
    yield page if (block_given? && page.had_internal_server_error?)
  end
end
#every_javascript_page {|page| ... } ⇒ Object
Pass every JavaScript page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 408

def every_javascript_page
  every_page do |page|
    yield page if (block_given? && page.javascript?)
  end
end
#every_link {|origin, dest| ... } ⇒ Object
Passes every origin and destination URI of each link to a given block.
# File 'lib/spidr/agent/events.rb', line 518

def every_link(&block)
  @every_link_blocks << block
  return self
end
#every_missing_page {|page| ... } ⇒ Object
Pass every Missing page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 187

def every_missing_page
  every_page do |page|
    yield page if (block_given? && page.missing?)
  end
end
#every_ms_word_page {|page| ... } ⇒ Object
Pass every MS Word page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 468

def every_ms_word_page
  every_page do |page|
    yield page if (block_given? && page.ms_word?)
  end
end
#every_ok_page {|page| ... } ⇒ Object
Pass every OK page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 97

def every_ok_page
  every_page do |page|
    yield page if (block_given? && page.ok?)
  end
end
#every_page {|page| ... } ⇒ Object
Pass every page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 83

def every_page(&block)
  @every_page_blocks << block
  return self
end
#every_pdf_page {|page| ... } ⇒ Object
Pass every PDF page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 483

def every_pdf_page
  every_page do |page|
    yield page if (block_given? && page.pdf?)
  end
end
#every_redirect_page {|page| ... } ⇒ Object
Pass every Redirect page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 112

def every_redirect_page
  every_page do |page|
    yield page if (block_given? && page.redirect?)
  end
end
#every_rss_doc {|doc| ... } ⇒ Object
Pass every RSS document that the agent parses to a given block.
# File 'lib/spidr/agent/events.rb', line 368

def every_rss_doc
  every_page do |page|
    if (block_given? && page.rss?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end
#every_rss_page {|feed| ... } ⇒ Object
Pass every RSS feed that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 438

def every_rss_page
  every_page do |page|
    yield page if (block_given? && page.rss?)
  end
end
#every_timedout_page {|page| ... } ⇒ Object
Pass every Timeout page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 127

def every_timedout_page
  every_page do |page|
    yield page if (block_given? && page.timedout?)
  end
end
#every_txt_page {|page| ... } ⇒ Object
Pass every Plain Text page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 218

def every_txt_page
  every_page do |page|
    yield page if (block_given? && page.txt?)
  end
end
#every_unauthorized_page {|page| ... } ⇒ Object
Pass every Unauthorized page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 157

def every_unauthorized_page
  every_page do |page|
    yield page if (block_given? && page.unauthorized?)
  end
end
#every_url {|url| ... } ⇒ Object
Pass each URL from each page visited to the given block.
# File 'lib/spidr/agent/events.rb', line 14

def every_url(&block)
  @every_url_blocks << block
  return self
end
#every_url_like(pattern) {|url| ... } ⇒ Object
Pass every URL that the agent visits, and matches a given pattern, to a given block.
# File 'lib/spidr/agent/events.rb', line 48

def every_url_like(pattern,&block)
  @every_url_like_blocks[pattern] << block
  return self
end
#every_xml_doc {|doc| ... } ⇒ Object
Pass every XML document that the agent parses to a given block.
# File 'lib/spidr/agent/events.rb', line 325

def every_xml_doc
  every_page do |page|
    if (block_given? && page.xml?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end
#every_xml_page {|page| ... } ⇒ Object
Pass every XML page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 248

def every_xml_page
  every_page do |page|
    yield page if (block_given? && page.xml?)
  end
end
#every_xsl_doc {|doc| ... } ⇒ Object
Pass every XML Stylesheet (XSL) that the agent parses to a given block.
# File 'lib/spidr/agent/events.rb', line 347

def every_xsl_doc
  every_page do |page|
    if (block_given? && page.xsl?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end
#every_xsl_page {|page| ... } ⇒ Object
Pass every XML Stylesheet (XSL) page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 264

def every_xsl_page
  every_page do |page|
    yield page if (block_given? && page.xsl?)
  end
end
#every_zip_page {|page| ... } ⇒ Object
Pass every ZIP page that the agent visits to a given block.
# File 'lib/spidr/agent/events.rb', line 498

def every_zip_page
  every_page do |page|
    yield page if (block_given? && page.zip?)
  end
end
#failed(url) ⇒ Object (protected)
Adds a given URL to the failures list.
# File 'lib/spidr/agent.rb', line 963

def failed(url)
  @failures << url
  @every_failed_url_blocks.each { |fail_block| fail_block.call(url) }
  return true
end
#failed?(url) ⇒ Boolean
Determines whether a given URL could not be visited.
# File 'lib/spidr/agent.rb', line 607

def failed?(url)
  @failures.include?(URI(url))
end
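#failed? passes its argument through `URI()` before the Set lookup, so a String and an equivalent `URI::HTTP` compare as the same member. A standalone sketch of that behavior (the URLs are placeholders):

```ruby
require 'set'
require 'uri'

# failures is a Set<URI::HTTP>; URI() lets callers pass a String or a URI.
failures = Set[]
failures << URI('http://example.com/broken')

# Mimicking #failed?:
failed = ->(url) { failures.include?(URI(url)) }

p failed.call('http://example.com/broken')       # => true
p failed.call(URI('http://example.com/broken'))  # => true
p failed.call('http://example.com/ok')           # => false
```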
#get_page(url) {|page| ... } ⇒ Page?
Requests and creates a new Page object from a given URL.
# File 'lib/spidr/agent.rb', line 710

def get_page(url)
  url = URI(url)

  prepare_request(url) do |session,path,headers|
    new_page = Page.new(url,session.get(path,headers))

    # save any new cookies
    @cookies.from_page(new_page)

    yield new_page if block_given?
    return new_page
  end
end
#ignore_exts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match URI path extensions to not visit.
# File 'lib/spidr/agent/filters.rb', line 330

def ignore_exts
  @ext_rules.reject
end
#ignore_exts_like(pattern = nil) {|ext| ... } ⇒ Object
Adds a given pattern to the #ignore_exts.
# File 'lib/spidr/agent/filters.rb', line 346

def ignore_exts_like(pattern=nil,&block)
  if pattern
    ignore_exts << pattern
  elsif block
    ignore_exts << block
  end

  return self
end
#ignore_hosts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match host-names to not visit.
# File 'lib/spidr/agent/filters.rb', line 62

def ignore_hosts
  @host_rules.reject
end
#ignore_hosts_like(pattern = nil) {|host| ... } ⇒ Object
Adds a given pattern to the #ignore_hosts.
# File 'lib/spidr/agent/filters.rb', line 78

def ignore_hosts_like(pattern=nil,&block)
  if pattern
    ignore_hosts << pattern
  elsif block
    ignore_hosts << block
  end

  return self
end
#ignore_links ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match links to not visit.
# File 'lib/spidr/agent/filters.rb', line 194

def ignore_links
  @link_rules.reject
end
#ignore_links_like(pattern = nil) {|link| ... } ⇒ Object
Adds a given pattern to the #ignore_links.
# File 'lib/spidr/agent/filters.rb', line 210

def ignore_links_like(pattern=nil,&block)
  if pattern
    ignore_links << pattern
  elsif block
    ignore_links << block
  end

  return self
end
#ignore_ports ⇒ Array<Integer, Regexp, Proc>
Specifies the patterns that match ports to not visit.
# File 'lib/spidr/agent/filters.rb', line 126

def ignore_ports
  @port_rules.reject
end
#ignore_ports_like(pattern = nil) {|port| ... } ⇒ Object
Adds a given pattern to the #ignore_ports.
# File 'lib/spidr/agent/filters.rb', line 142

def ignore_ports_like(pattern=nil,&block)
  if pattern
    ignore_ports << pattern
  elsif block
    ignore_ports << block
  end

  return self
end
#ignore_urls ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match URLs to not visit.
# File 'lib/spidr/agent/filters.rb', line 264

def ignore_urls
  @url_rules.reject
end
#ignore_urls_like(pattern = nil) {|url| ... } ⇒ Object
Adds a given pattern to the #ignore_urls.
# File 'lib/spidr/agent/filters.rb', line 282

def ignore_urls_like(pattern=nil,&block)
  if pattern
    ignore_urls << pattern
  elsif block
    ignore_urls << block
  end

  return self
end
#initialize_actions ⇒ Object (protected)
# File 'lib/spidr/agent/actions.rb', line 101

def initialize_actions
  @paused = false
end
#initialize_events ⇒ Object (protected)
# File 'lib/spidr/agent/events.rb', line 525

def initialize_events
  @every_url_blocks        = []
  @every_failed_url_blocks = []
  @every_url_like_blocks   = Hash.new { |hash,key| hash[key] = [] }

  @every_page_blocks = []
  @every_link_blocks = []
end
#initialize_filters(schemes: self.class.default_schemes, host: nil, hosts: nil, ignore_hosts: nil, ports: nil, ignore_ports: nil, links: nil, ignore_links: nil, urls: nil, ignore_urls: nil, exts: nil, ignore_exts: nil) ⇒ Object (protected)
Initializes filtering rules.
# File 'lib/spidr/agent/filters.rb', line 398

def initialize_filters(schemes: self.class.default_schemes,
                       host:  nil,
                       hosts: nil,
                       ignore_hosts: nil,
                       ports: nil,
                       ignore_ports: nil,
                       links: nil,
                       ignore_links: nil,
                       urls: nil,
                       ignore_urls: nil,
                       exts: nil,
                       ignore_exts: nil)
  @schemes = schemes.map(&:to_s)

  @host_rules = Rules.new(accept: hosts, reject: ignore_hosts)
  @port_rules = Rules.new(accept: ports, reject: ignore_ports)
  @link_rules = Rules.new(accept: links, reject: ignore_links)
  @url_rules  = Rules.new(accept: urls,  reject: ignore_urls)
  @ext_rules  = Rules.new(accept: exts,  reject: ignore_exts)

  visit_hosts_like(host) if host
end
#initialize_robots ⇒ Object
Initializes the robots filter.
# File 'lib/spidr/agent/robots.rb', line 13

def initialize_robots
  unless Object.const_defined?(:Robots)
    raise(ArgumentError,":robots option given but unable to require 'robots' gem")
  end

  @robots = Robots.new(@user_agent)
end
#initialize_sanitizers(strip_fragments: true, strip_query: false) ⇒ Object (protected)
Initializes the Sanitizer rules.
# File 'lib/spidr/agent/sanitizers.rb', line 47

def initialize_sanitizers(strip_fragments: true, strip_query: false)
  @strip_fragments = strip_fragments
  @strip_query     = strip_query
end
#limit_reached? ⇒ Boolean (protected)
Determines if the maximum limit has been reached.
# File 'lib/spidr/agent.rb', line 933

def limit_reached?
  @limit && @history.length >= @limit
end
#pause! ⇒ Object
Pauses the agent, causing spidering to temporarily stop.
# File 'lib/spidr/agent/actions.rb', line 63

def pause!
  @paused = true
  raise(Actions::Paused)
end
#pause=(state) ⇒ Object
Sets the pause state of the agent.
# File 'lib/spidr/agent/actions.rb', line 53

def pause=(state)
  @paused = state
end
#paused? ⇒ Boolean
Determines whether the agent is paused.
# File 'lib/spidr/agent/actions.rb', line 74

def paused?
  @paused == true
end
#post_page(url, post_data = '') {|page| ... } ⇒ Page?
Posts supplied form data and creates a new Page object from a given URL.
# File 'lib/spidr/agent.rb', line 745

def post_page(url,post_data='')
  url = URI(url)

  prepare_request(url) do |session,path,headers|
    new_page = Page.new(url,session.post(path,post_data,headers))

    # save any new cookies
    @cookies.from_page(new_page)

    yield new_page if block_given?
    return new_page
  end
end
#prepare_request(url) {|request| ... } ⇒ Object (protected)
Normalizes the request path and grabs a session to handle page get and post requests.
# File 'lib/spidr/agent.rb', line 885

def prepare_request(url,&block)
  path = unless url.path.empty?
           url.path
         else
           '/'
         end

  # append the URL query to the path
  path += "?#{url.query}" if url.query

  headers = prepare_request_headers(url)

  begin
    sleep(@delay) if @delay > 0

    yield @sessions[url], path, headers
  rescue SystemCallError,
         Timeout::Error,
         SocketError,
         IOError,
         OpenSSL::SSL::SSLError,
         Net::HTTPBadResponse,
         Zlib::Error
    @sessions.kill!(url)

    failed(url)
    return nil
  end
end
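The path normalization at the top of `prepare_request` can be sketched on its own with the stdlib `URI` class: an empty path becomes `'/'`, and the query string is re-appended. The `request_path` helper below is illustrative, not part of the Agent API:

```ruby
require 'uri'

# Sketch of the request-path normalization performed before each
# request: an empty URI path becomes '/', and any query string is
# appended back onto the path.
def request_path(url)
  path  = url.path.empty? ? '/' : url.path
  path += "?#{url.query}" if url.query
  path
end

request_path(URI('http://example.com'))        # => "/"
request_path(URI('http://example.com/a?b=1'))  # => "/a?b=1"
```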
#prepare_request_headers(url) ⇒ Hash{String => String} (protected)
Prepares request headers for the given URL.
# File 'lib/spidr/agent.rb', line 836

def prepare_request_headers(url)
  # set any additional HTTP headers
  headers = @default_headers.dup

  unless @host_headers.empty?
    @host_headers.each do |name,header|
      if url.host.match(name)
        headers['Host'] = header
        break
      end
    end
  end

  headers['Host'] ||= @host_header if @host_header
  headers['User-Agent'] = @user_agent if @user_agent
  headers['Referer'] = @referer if @referer

  if (authorization = @authorized.for_url(url))
    headers['Authorization'] = "Basic #{authorization}"
  end

  if (header_cookies = @cookies.for_host(url.host))
    headers['Cookie'] = header_cookies
  end

  return headers
end
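The per-host `Host` header lookup above iterates `@host_headers` and uses the first name (String or Regexp) that matches the URL's host. A stdlib-only sketch of that resolution, with illustrative data in place of live Agent state (`host_header_for` is a hypothetical helper):

```ruby
# Sketch of per-host `Host` header resolution: keys may be Strings or
# Regexps, and the first key matching the URL's host wins.
host_headers = {
  /\.internal$/     => 'app.internal',
  'cdn.example.com' => 'origin.example.com'
}

def host_header_for(host_headers, host)
  host_headers.each do |name, header|
    # String#match accepts both String and Regexp patterns
    return header if host.match(name)
  end
  nil
end

host_header_for(host_headers, 'db.internal')      # => "app.internal"
host_header_for(host_headers, 'cdn.example.com')  # => "origin.example.com"
```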
#proxy ⇒ Proxy
The proxy information the agent uses.
# File 'lib/spidr/agent.rb', line 434

def proxy
  @sessions.proxy
end
#proxy=(new_proxy) ⇒ Proxy
Sets the proxy information that the agent uses.
# File 'lib/spidr/agent.rb', line 451

def proxy=(new_proxy)
  @sessions.proxy = new_proxy
end
#queued?(url) ⇒ Boolean
Determines whether a given URL has been enqueued.
# File 'lib/spidr/agent.rb', line 644

def queued?(url)
  @queue.include?(url)
end
#robot_allowed?(url) ⇒ Boolean
Determines whether a URL is allowed by the robot policy.
# File 'lib/spidr/agent/robots.rb', line 30

def robot_allowed?(url)
  if @robots
    @robots.allowed?(url)
  else
    true
  end
end
#run {|page| ... } ⇒ Object
Start spidering until the queue becomes empty or the agent is paused.
# File 'lib/spidr/agent.rb', line 492

def run(&block)
  @running = true

  until (@queue.empty? || paused? || limit_reached?)
    begin
      visit_page(dequeue,&block)
    rescue Actions::Paused
      return self
    rescue Actions::Action
    end
  end

  @running = false
  @sessions.clear
  return self
end
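The run loop's control flow hinges on two exception classes: `Actions::Paused` aborts the loop (preserving the remaining queue), while other `Actions::Action` exceptions are swallowed so spidering continues. A self-contained sketch of that pattern, using stand-in exception classes and a plain array queue rather than the real Agent internals:

```ruby
# Stand-in exception classes mirroring Actions::Action / Actions::Paused.
class Action < RuntimeError; end
class Paused < Action; end

queue   = %w[a b c d]
visited = []

begin
  until queue.empty?
    url = queue.shift
    raise Paused if url == 'c'   # e.g. a callback called pause!
    visited << url
  end
rescue Paused
  # the loop exits; remaining queue entries stay pending
end

visited  # => ["a", "b"]
queue    # => ["d"]
```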
#running? ⇒ Boolean
Determines if the agent is running.
# File 'lib/spidr/agent.rb', line 515

def running?
  @running == true
end
#sanitize_url(url) ⇒ URI::HTTP, URI::HTTPS
Sanitizes a URL based on filtering options.
# File 'lib/spidr/agent/sanitizers.rb', line 25

def sanitize_url(url)
  url = URI(url)

  url.fragment = nil if @strip_fragments
  url.query    = nil if @strip_query

  return url
end
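Since this method only uses the stdlib `URI` accessors, its behavior can be demonstrated standalone. A sketch with the sanitizer flags as keyword arguments (defaults matching `initialize_sanitizers`: fragments stripped, queries kept); the `sanitize` helper is illustrative:

```ruby
require 'uri'

# Sketch of sanitize_url's behavior, with the sanitizer settings as
# keyword arguments (defaults: strip_fragments: true, strip_query: false).
def sanitize(url, strip_fragments: true, strip_query: false)
  url = URI(url)
  url.fragment = nil if strip_fragments
  url.query    = nil if strip_query
  url
end

sanitize('http://example.com/page?q=1#top').to_s
# => "http://example.com/page?q=1"
sanitize('http://example.com/page?q=1#top', strip_query: true).to_s
# => "http://example.com/page"
```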
#skip_link! ⇒ Object
Causes the agent to skip the link being enqueued.
# File 'lib/spidr/agent/actions.rb', line 85

def skip_link!
  raise(Actions::SkipLink)
end
#skip_page! ⇒ Object
Causes the agent to skip the page being visited.
# File 'lib/spidr/agent/actions.rb', line 95

def skip_page!
  raise(Actions::SkipPage)
end
#start_at(url) {|page| ... } ⇒ Object
Start spidering at a given URL.
# File 'lib/spidr/agent.rb', line 477

def start_at(url,&block)
  enqueue(url)
  return run(&block)
end
#to_hash ⇒ Hash
Converts the agent into a Hash.
# File 'lib/spidr/agent.rb', line 819

def to_hash
  {history: @history, queue: @queue}
end
#urls_like(pattern, &block) ⇒ Object
# File 'lib/spidr/agent/events.rb', line 56

def urls_like(pattern,&block)
  every_url_like(pattern,&block)
end
#visit?(url) ⇒ Boolean (protected)
Determines if a given URL should be visited.
# File 'lib/spidr/agent.rb', line 946

def visit?(url)
  !visited?(url) &&
   visit_scheme?(url.scheme) &&
   visit_host?(url.host) &&
   visit_port?(url.port) &&
   visit_link?(url.to_s) &&
   visit_url?(url) &&
   visit_ext?(url.path) &&
   robot_allowed?(url.to_s)
end
#visit_ext?(path) ⇒ Boolean (protected)
Determines if a given URI path extension should be visited.
# File 'lib/spidr/agent/filters.rb', line 525

def visit_ext?(path)
  @ext_rules.accept?(File.extname(path)[1..-1])
end
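The extension passed to the rules is derived with `File.extname` and the leading dot removed; a path without an extension yields `""`, and `""[1..-1]` is `nil`. A standalone sketch of that derivation (`ext_of` is an illustrative helper, not part of the API):

```ruby
# Sketch of how the extension is derived before it is checked against
# the extension rules: File.extname with the leading dot removed.
def ext_of(path)
  File.extname(path)[1..-1]
end

ext_of('/docs/index.html')   # => "html"
ext_of('/downloads/app.exe') # => "exe"
ext_of('/about')             # => nil (no extension; nil reaches the rules)
```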
#visit_exts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the URI path extensions to visit.
# File 'lib/spidr/agent/filters.rb', line 298

def visit_exts
  @ext_rules.accept
end
#visit_exts_like(pattern = nil) {|ext| ... } ⇒ Object
Adds a given pattern to the #visit_exts.
# File 'lib/spidr/agent/filters.rb', line 314

def visit_exts_like(pattern=nil,&block)
  if pattern
    visit_exts << pattern
  elsif block
    visit_exts << block
  end

  return self
end
#visit_host?(host) ⇒ Boolean (protected)
Determines if a given host-name should be visited.
# File 'lib/spidr/agent/filters.rb', line 471

def visit_host?(host)
  @host_rules.accept?(host)
end
#visit_hosts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match host-names to visit.
# File 'lib/spidr/agent/filters.rb', line 30

def visit_hosts
  @host_rules.accept
end
#visit_hosts_like(pattern = nil) {|host| ... } ⇒ Object
Adds a given pattern to the #visit_hosts.
# File 'lib/spidr/agent/filters.rb', line 46

def visit_hosts_like(pattern=nil,&block)
  if pattern
    visit_hosts << pattern
  elsif block
    visit_hosts << block
  end

  return self
end
#visit_link?(link) ⇒ Boolean (protected)
Determines if a given link should be visited.
# File 'lib/spidr/agent/filters.rb', line 497

def visit_link?(link)
  @link_rules.accept?(link)
end
#visit_links ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the links to visit.
# File 'lib/spidr/agent/filters.rb', line 160

def visit_links
  @link_rules.accept
end
#visit_links_like(pattern = nil) {|link| ... } ⇒ Object
Adds a given pattern to the #visit_links.
# File 'lib/spidr/agent/filters.rb', line 178

def visit_links_like(pattern=nil,&block)
  if pattern
    visit_links << pattern
  elsif block
    visit_links << block
  end

  return self
end
#visit_page(url) {|page| ... } ⇒ Page?
Visits a given URL, and enqueues the links recovered from the URL to be visited later.
# File 'lib/spidr/agent.rb', line 776

def visit_page(url)
  url = sanitize_url(url)

  get_page(url) do |page|
    @history << page.url

    begin
      @every_page_blocks.each { |page_block| page_block.call(page) }

      yield page if block_given?
    rescue Actions::Paused => action
      raise(action)
    rescue Actions::SkipPage
      return nil
    rescue Actions::Action
    end

    page.each_url do |next_url|
      begin
        @every_link_blocks.each do |link_block|
          link_block.call(page.url,next_url)
        end
      rescue Actions::Paused => action
        raise(action)
      rescue Actions::SkipLink
        next
      rescue Actions::Action
      end

      if (@max_depth.nil? || @max_depth > @levels[url])
        enqueue(next_url,@levels[url] + 1)
      end
    end
  end
end
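The depth check that gates enqueueing can be isolated: a link found on a page at depth `levels[url]` is only enqueued while that depth is below `max_depth`. A sketch with plain local data in place of the Agent's `@levels` and `@max_depth` (`enqueue_next?` is an illustrative helper):

```ruby
# Sketch of the depth gate in visit_page: links from a page at depth
# levels[url] are only enqueued while max_depth has not been reached.
levels    = { 'http://example.com/' => 0 }
max_depth = 2

def enqueue_next?(levels, max_depth, url)
  max_depth.nil? || max_depth > levels[url]
end

enqueue_next?(levels, max_depth, 'http://example.com/')      # => true

levels['http://example.com/deep'] = 2
enqueue_next?(levels, max_depth, 'http://example.com/deep')  # => false
```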
#visit_port?(port) ⇒ Boolean (protected)
Determines if a given port should be visited.
# File 'lib/spidr/agent/filters.rb', line 484

def visit_port?(port)
  @port_rules.accept?(port)
end
#visit_ports ⇒ Array<Integer, Regexp, Proc>
Specifies the patterns that match the ports to visit.
# File 'lib/spidr/agent/filters.rb', line 94

def visit_ports
  @port_rules.accept
end
#visit_ports_like(pattern = nil) {|port| ... } ⇒ Object
Adds a given pattern to the #visit_ports.
# File 'lib/spidr/agent/filters.rb', line 110

def visit_ports_like(pattern=nil,&block)
  if pattern
    visit_ports << pattern
  elsif block
    visit_ports << block
  end

  return self
end
#visit_scheme?(scheme) ⇒ Boolean (protected)
Determines if a given URI scheme should be visited.
# File 'lib/spidr/agent/filters.rb', line 454

def visit_scheme?(scheme)
  if scheme
    @schemes.include?(scheme)
  else
    true
  end
end
#visit_url?(link) ⇒ Boolean (protected)
Determines if a given URL should be visited.
# File 'lib/spidr/agent/filters.rb', line 512

def visit_url?(link)
  @url_rules.accept?(link)
end
#visit_urls ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the URLs to visit.
# File 'lib/spidr/agent/filters.rb', line 228

def visit_urls
  @url_rules.accept
end
#visit_urls_like(pattern = nil) {|url| ... } ⇒ Object
Adds a given pattern to the #visit_urls.
# File 'lib/spidr/agent/filters.rb', line 246

def visit_urls_like(pattern=nil,&block)
  if pattern
    visit_urls << pattern
  elsif block
    visit_urls << block
  end

  return self
end
#visited?(url) ⇒ Boolean
Determines whether a URL was visited or not.
# File 'lib/spidr/agent.rb', line 572

def visited?(url)
  @history.include?(URI(url))
end
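The `URI(url)` wrapping matters because the history set stores `URI` objects, so a raw String would never match on membership. A standalone sketch of that distinction using the stdlib `Set` and `URI` (the `history` variable is illustrative, not live Agent state):

```ruby
require 'uri'
require 'set'

# Sketch of why visited? parses its argument with URI(): the history
# set holds URI objects, so a String must be parsed before the
# membership test can succeed.
history = Set[URI('http://example.com/a')]

history.include?('http://example.com/a')       # => false (String != URI)
history.include?(URI('http://example.com/a'))  # => true
```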
#visited_hosts ⇒ Array<String>
Specifies all hosts that were visited.
# File 'lib/spidr/agent.rb', line 559

def visited_hosts
  visited_urls.map(&:host).uniq
end
#visited_links ⇒ Array<String>
Specifies the links which have been visited.
# File 'lib/spidr/agent.rb', line 549

def visited_links
  @history.map(&:to_s)
end