Class: Spidr::Agent

Inherits: Object

Includes: Settings::UserAgent

Defined in: lib/spidr/agent.rb,
lib/spidr/agent/events.rb,
lib/spidr/agent/robots.rb,
lib/spidr/agent/actions.rb,
lib/spidr/agent/filters.rb,
lib/spidr/agent/sanitizers.rb

Defined Under Namespace

Modules: Actions

Instance Attribute Summary

#authorized ⇒ AuthStore
HTTP Authentication credentials.
#cookies ⇒ CookieJar readonly
Cached cookies.
#default_headers ⇒ Hash{String => String} readonly
HTTP Headers to use for every request.
#delay ⇒ Integer
Delay in between fetching pages.
#failures ⇒ Set<URI::HTTP>
List of unreachable URLs.
#history ⇒ Set<URI::HTTP> (also: #visited_urls)
History containing visited URLs.
#host_header ⇒ String
HTTP Host Header to use.
#host_headers ⇒ Hash{String,Regexp => String} readonly
HTTP Host Headers to use for specific hosts.
#levels ⇒ Hash{URI::HTTP => Integer} readonly
The visited URLs and their depth within a site.
#limit ⇒ Integer readonly
Maximum number of pages to visit.
#max_depth ⇒ Integer readonly
Maximum depth.
#queue ⇒ Array<URI::HTTP> (also: #pending_urls)
Queue of URLs to visit.
#referer ⇒ String
Referer to use.
#schemes ⇒ Object
List of acceptable URL schemes to follow.
#sessions ⇒ SessionCache readonly
The session cache.
#strip_fragments ⇒ Object
Specifies whether the Agent will strip URI fragments.
#strip_query ⇒ Object
Specifies whether the Agent will strip URI queries.

Attributes included from Settings::UserAgent

#user_agent

Class Method Summary

.default_schemes ⇒ Array<String> protected
Determines the default URI schemes to follow.
.domain(name, **kwargs) {|agent| ... } ⇒ Agent
Creates a new agent and spiders the entire domain.
.host(name, **kwargs) {|agent| ... } ⇒ Agent
Creates a new agent and spiders the given host.
.site(url, **kwargs) {|agent| ... } ⇒ Agent
Creates a new agent and spiders the web-site located at the given URL.
.start_at(url, **kwargs) {|agent| ... } ⇒ Agent
Creates a new agent and begins spidering at the given URL.

Instance Method Summary

#all_headers {|headers| ... } ⇒ Object
Pass the headers from every response the agent receives to a given block.
#clear ⇒ Object
Clears the queue, the history, and the list of failures of the agent.
#continue! {|page| ... } ⇒ Object
Continue spidering.
#dequeue ⇒ URI::HTTP protected
Dequeues a URL that will later be visited.
#enqueue(url, level = 0) ⇒ Boolean
Enqueues a given URL for visiting, only if it passes all of the agent's rules for visiting a given URL.
#every_atom_doc {|doc| ... } ⇒ Object
Pass every Atom document that the agent parses to a given block.
#every_atom_page {|feed| ... } ⇒ Object
Pass every Atom feed that the agent visits to a given block.
#every_bad_request_page {|page| ... } ⇒ Object
Pass every Bad Request page that the agent visits to a given block.
#every_css_page {|page| ... } ⇒ Object
Pass every CSS page that the agent visits to a given block.
#every_doc {|doc| ... } ⇒ Object
Pass every HTML or XML document that the agent parses to a given block.
#every_failed_url {|url| ... } ⇒ Object
Pass each URL that could not be requested to the given block.
#every_forbidden_page {|page| ... } ⇒ Object
Pass every Forbidden page that the agent visits to a given block.
#every_html_doc {|doc| ... } ⇒ Object
Pass every HTML document that the agent parses to a given block.
#every_html_page {|page| ... } ⇒ Object
Pass every HTML page that the agent visits to a given block.
#every_internal_server_error_page {|page| ... } ⇒ Object
Pass every Internal Server Error page that the agent visits to a given block.
#every_javascript_page {|page| ... } ⇒ Object
Pass every JavaScript page that the agent visits to a given block.
#every_link {|origin, dest| ... } ⇒ Object
Passes every origin and destination URI of each link to a given block.
#every_missing_page {|page| ... } ⇒ Object
Pass every Missing page that the agent visits to a given block.
#every_ms_word_page {|page| ... } ⇒ Object
Pass every MS Word page that the agent visits to a given block.
#every_ok_page {|page| ... } ⇒ Object
Pass every OK page that the agent visits to a given block.
#every_page {|page| ... } ⇒ Object
Pass every page that the agent visits to a given block.
#every_pdf_page {|page| ... } ⇒ Object
Pass every PDF page that the agent visits to a given block.
#every_redirect_page {|page| ... } ⇒ Object
Pass every Redirect page that the agent visits to a given block.
#every_rss_doc {|doc| ... } ⇒ Object
Pass every RSS document that the agent parses to a given block.
#every_rss_page {|feed| ... } ⇒ Object
Pass every RSS feed that the agent visits to a given block.
#every_timedout_page {|page| ... } ⇒ Object
Pass every Timeout page that the agent visits to a given block.
#every_txt_page {|page| ... } ⇒ Object
Pass every Plain Text page that the agent visits to a given block.
#every_unauthorized_page {|page| ... } ⇒ Object
Pass every Unauthorized page that the agent visits to a given block.
#every_url {|url| ... } ⇒ Object
Pass each URL from each page visited to the given block.
#every_url_like(pattern) {|url| ... } ⇒ Object
Pass every URL that the agent visits, and matches a given pattern, to a given block.
#every_xml_doc {|doc| ... } ⇒ Object
Pass every XML document that the agent parses to a given block.
#every_xml_page {|page| ... } ⇒ Object
Pass every XML page that the agent visits to a given block.
#every_xsl_doc {|doc| ... } ⇒ Object
Pass every XML Stylesheet (XSL) that the agent parses to a given block.
#every_xsl_page {|page| ... } ⇒ Object
Pass every XML Stylesheet (XSL) page that the agent visits to a given block.
#every_zip_page {|page| ... } ⇒ Object
Pass every ZIP page that the agent visits to a given block.
#failed(url) ⇒ Object protected
Adds a given URL to the failures list.
#failed?(url) ⇒ Boolean
Determines whether a given URL could not be visited.
#get_page(url) {|page| ... } ⇒ Page, nil
Requests and creates a new Page object from a given URL.
#ignore_exts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match URI path extensions to not visit.
#ignore_exts_like(pattern = nil) {|ext| ... } ⇒ Object
Adds a given pattern to the #ignore_exts.
#ignore_hosts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match host-names to not visit.
#ignore_hosts_like(pattern = nil) {|host| ... } ⇒ Object
Adds a given pattern to the #ignore_hosts.
#ignore_links ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match links to not visit.
#ignore_links_like(pattern = nil) {|link| ... } ⇒ Object
Adds a given pattern to the #ignore_links.
#ignore_ports ⇒ Array<Integer, Regexp, Proc>
Specifies the patterns that match ports to not visit.
#ignore_ports_like(pattern = nil) {|port| ... } ⇒ Object
Adds a given pattern to the #ignore_ports.
#ignore_urls ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match URLs to not visit.
#ignore_urls_like(pattern = nil) {|url| ... } ⇒ Object
Adds a given pattern to the #ignore_urls.
#initialize(host_header: nil, host_headers: {}, default_headers: {}, user_agent: Spidr.user_agent, referer: nil, proxy: Spidr.proxy, open_timeout: Spidr.open_timeout, ssl_timeout: Spidr.ssl_timeout, read_timeout: Spidr.read_timeout, continue_timeout: Spidr.continue_timeout, keep_alive_timeout: Spidr.keep_alive_timeout, delay: 0, limit: nil, max_depth: nil, queue: nil, history: nil, strip_fragments: true, strip_query: false, schemes: self.class.default_schemes, host: nil, hosts: nil, ignore_hosts: nil, ports: nil, ignore_ports: nil, links: nil, ignore_links: nil, urls: nil, ignore_urls: nil, exts: nil, ignore_exts: nil, robots: Spidr.robots?) {|agent| ... } ⇒ Agent constructor
Creates a new Agent object.
#initialize_actions ⇒ Object protected
#initialize_events ⇒ Object protected
#initialize_filters(schemes: self.class.default_schemes, host: nil, hosts: nil, ignore_hosts: nil, ports: nil, ignore_ports: nil, links: nil, ignore_links: nil, urls: nil, ignore_urls: nil, exts: nil, ignore_exts: nil) ⇒ Object protected
Initializes filtering rules.
#initialize_robots ⇒ Object
Initializes the robots filter.
#initialize_sanitizers(strip_fragments: true, strip_query: false) ⇒ Object protected
Initializes the Sanitizer rules.
#limit_reached? ⇒ Boolean protected
Determines if the maximum limit has been reached.
#pause! ⇒ Object
Pauses the agent, causing spidering to temporarily stop.
#pause=(state) ⇒ Object
Sets the pause state of the agent.
#paused? ⇒ Boolean
Determines whether the agent is paused.
#post_page(url, post_data = '') {|page| ... } ⇒ Page, nil
Posts supplied form data and creates a new Page object from a given URL.
#prepare_request(url) {|request| ... } ⇒ Object protected
Normalizes the request path and grabs a session to handle page get and post requests.
#prepare_request_headers(url) ⇒ Hash{String => String} protected
Prepares request headers for the given URL.
#proxy ⇒ Proxy
The proxy information the agent uses.
#proxy=(new_proxy) ⇒ Proxy
Sets the proxy information that the agent uses.
#queued?(url) ⇒ Boolean
Determines whether a given URL has been enqueued.
#robot_allowed?(url) ⇒ Boolean
Determines whether a URL is allowed by the robot policy.
#run {|page| ... } ⇒ Object
Start spidering until the queue becomes empty or the agent is paused.
#running? ⇒ Boolean
Determines if the agent is running.
#sanitize_url(url) ⇒ URI::HTTP, URI::HTTPS
Sanitizes a URL based on filtering options.
#skip_link! ⇒ Object
Causes the agent to skip the link being enqueued.
#skip_page! ⇒ Object
Causes the agent to skip the page being visited.
#start_at(url) {|page| ... } ⇒ Object
Start spidering at a given URL.
#to_hash ⇒ Hash
Converts the agent into a Hash.
#urls_like(pattern, &block) ⇒ Object
#visit?(url) ⇒ Boolean protected
Determines if a given URL should be visited.
#visit_ext?(path) ⇒ Boolean protected
Determines if a given URI path extension should be visited.
#visit_exts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the URI path extensions to visit.
#visit_exts_like(pattern = nil) {|ext| ... } ⇒ Object
Adds a given pattern to the #visit_exts.
#visit_host?(host) ⇒ Boolean protected
Determines if a given host-name should be visited.
#visit_hosts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match host-names to visit.
#visit_hosts_like(pattern = nil) {|host| ... } ⇒ Object
Adds a given pattern to the #visit_hosts.
#visit_link?(link) ⇒ Boolean protected
Determines if a given link should be visited.
#visit_links ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the links to visit.
#visit_links_like(pattern = nil) {|link| ... } ⇒ Object
Adds a given pattern to the #visit_links.
#visit_page(url) {|page| ... } ⇒ Page, nil
Visits a given URL, and enqueues the links recovered from the URL to be visited later.
#visit_port?(port) ⇒ Boolean protected
Determines if a given port should be visited.
#visit_ports ⇒ Array<Integer, Regexp, Proc>
Specifies the patterns that match the ports to visit.
#visit_ports_like(pattern = nil) {|port| ... } ⇒ Object
Adds a given pattern to the #visit_ports.
#visit_scheme?(scheme) ⇒ Boolean protected
Determines if a given URI scheme should be visited.
#visit_url?(link) ⇒ Boolean protected
Determines if a given URL should be visited.
#visit_urls ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the URLs to visit.
#visit_urls_like(pattern = nil) {|url| ... } ⇒ Object
Adds a given pattern to the #visit_urls.
#visited?(url) ⇒ Boolean
Determines whether a URL was visited or not.
#visited_hosts ⇒ Array<String>
Specifies all hosts that were visited.
#visited_links ⇒ Array<String>
Specifies the links which have been visited.

Constructor Details

#initialize(host_header: nil, host_headers: {}, default_headers: {}, user_agent: Spidr.user_agent, referer: nil, proxy: Spidr.proxy, open_timeout: Spidr.open_timeout, ssl_timeout: Spidr.ssl_timeout, read_timeout: Spidr.read_timeout, continue_timeout: Spidr.continue_timeout, keep_alive_timeout: Spidr.keep_alive_timeout, delay: 0, limit: nil, max_depth: nil, queue: nil, history: nil, strip_fragments: true, strip_query: false, schemes: self.class.default_schemes, host: nil, hosts: nil, ignore_hosts: nil, ports: nil, ignore_ports: nil, links: nil, ignore_links: nil, urls: nil, ignore_urls: nil, exts: nil, ignore_exts: nil, robots: Spidr.robots?) {|agent| ... } ⇒ `Agent`

Creates a new Agent object.

Parameters:

host_header (String, nil) (defaults to: nil) —
The HTTP Host header to use with each request.
host_headers (Hash{String,Regexp => String}) (defaults to: {}) —
The HTTP Host headers to use for specific hosts.
default_headers (Hash{String => String}) (defaults to: {}) —
Default headers to set for every request.
user_agent (String, nil) (defaults to: Spidr.user_agent) —
The User-Agent string to send with each request.
referer (String, nil) (defaults to: nil) —
The Referer URL to send with each request.
open_timeout (Integer, nil) (defaults to: Spidr.open_timeout) —
Optional open connection timeout.
read_timeout (Integer, nil) (defaults to: Spidr.read_timeout) —
Optional read timeout.
ssl_timeout (Integer, nil) (defaults to: Spidr.ssl_timeout) —
Optional SSL connection timeout.
continue_timeout (Integer, nil) (defaults to: Spidr.continue_timeout) —
Optional continue timeout.
keep_alive_timeout (Integer, nil) (defaults to: Spidr.keep_alive_timeout) —
Optional Keep-Alive timeout.
proxy (Spidr::Proxy, Hash, URI::HTTP, String, nil) (defaults to: Spidr.proxy) —
The proxy information to use.
delay (Integer) (defaults to: 0) —
The number of seconds to pause between each request.
limit (Integer, nil) (defaults to: nil) —
The maximum number of pages to visit.
max_depth (Integer, nil) (defaults to: nil) —
The maximum link depth to follow.
queue (Set, Array, nil) (defaults to: nil) —
The initial queue of URLs to visit.
history (Set, Array, nil) (defaults to: nil) —
The initial list of visited URLs.
strip_fragments (Boolean) (defaults to: true) —
Controls whether to strip the fragment components from the URLs.
strip_query (Boolean) (defaults to: false) —
Controls whether to strip the query components from the URLs.
schemes (Array<String>) (defaults to: self.class.default_schemes) —
The list of acceptable URI schemes to visit. The https scheme will be ignored if net/https cannot be loaded.
host (String) (defaults to: nil) —
The host-name to visit.
hosts (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the host-names to visit.
ignore_hosts (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the host-names to not visit.
ports (Array<Integer, Regexp, Proc>) (defaults to: nil) —
The patterns which match the ports to visit.
ignore_ports (Array<Integer, Regexp, Proc>) (defaults to: nil) —
The patterns which match the ports to not visit.
links (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the links to visit.
ignore_links (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the links to not visit.
urls (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the URLs to visit.
ignore_urls (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the URLs to not visit.
exts (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the URI path extensions to visit.
ignore_exts (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the URI path extensions to not visit.
robots (Boolean) (defaults to: Spidr.robots?) —
Specifies whether robots.txt should be honored.

Options Hash (proxy:):

:host (String) —
The host the proxy is running on.
:port (Integer) — default: 8080 —
The port the proxy is running on.
:user (String, nil) —
The user to authenticate as with the proxy.
:password (String, nil) —
The password to authenticate with.

Yields:

(agent) —
If a block is given, it will be passed the newly created agent for further configuration.

Yield Parameters:

agent (Agent) —
The newly created agent.

# File 'lib/spidr/agent.rb', line 214

def initialize(# header keyword arguments
               host_header:        nil,
               host_headers:       {},
               default_headers:    {},
               user_agent:         Spidr.user_agent,
               referer:            nil,
               # session cache keyword arguments
               proxy:              Spidr.proxy,
               open_timeout:       Spidr.open_timeout,
               ssl_timeout:        Spidr.ssl_timeout,
               read_timeout:       Spidr.read_timeout,
               continue_timeout:   Spidr.continue_timeout,
               keep_alive_timeout: Spidr.keep_alive_timeout,
               # spidering controls keyword arguments
               delay:     0,
               limit:     nil,
               max_depth: nil,
               # history keyword arguments
               queue:   nil,
               history: nil,
               # sanitizer keyword arguments
               strip_fragments: true,
               strip_query:     false,
               # filtering keyword arguments
               schemes:      self.class.default_schemes,
               host:         nil,
               hosts:        nil,
               ignore_hosts: nil,
               ports:        nil,
               ignore_ports: nil,
               links:        nil,
               ignore_links: nil,
               urls:         nil,
               ignore_urls:  nil,
               exts:         nil,
               ignore_exts:  nil,
               # robots keyword arguments
               robots:       Spidr.robots?)
  @host_header  = host_header
  @host_headers = host_headers

  @default_headers = default_headers

  @user_agent = user_agent
  @referer    = referer

  @sessions   = SessionCache.new(
    proxy:              proxy,
    open_timeout:       open_timeout,
    ssl_timeout:        ssl_timeout,
    read_timeout:       read_timeout,
    continue_timeout:   continue_timeout,
    keep_alive_timeout: keep_alive_timeout
  )
  @cookies    = CookieJar.new
  @authorized = AuthStore.new

  @running  = false
  @delay    = delay
  @history  = Set[]
  @failures = Set[]
  @queue    = []

  @limit     = limit
  @levels    = Hash.new(0)
  @max_depth = max_depth

  self.queue   = queue   if queue
  self.history = history if history

  initialize_sanitizers(
    strip_fragments: strip_fragments,
    strip_query:     strip_query
  )

  initialize_filters(
    schemes:      schemes,
    host:         host,
    hosts:        hosts,
    ignore_hosts: ignore_hosts,
    ports:        ports,
    ignore_ports: ignore_ports,
    links:        links,
    ignore_links: ignore_links,
    urls:         urls,
    ignore_urls:  ignore_urls,
    exts:         exts,
    ignore_exts:  ignore_exts
  )
  initialize_actions
  initialize_events

  initialize_robots if robots

  yield self if block_given?
end
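
The keyword-arguments-plus-block shape of the constructor can be illustrated without the library. This is a minimal sketch, not Spidr's implementation: `MiniAgent` and its two options are hypothetical stand-ins that only mirror the final `yield self if block_given?` step shown above.

```ruby
# Hypothetical stand-in for illustration; not part of Spidr.
class MiniAgent
  attr_accessor :delay
  attr_reader :limit

  def initialize(delay: 0, limit: nil)
    @delay = delay
    @limit = limit

    # Same final step as the constructor above: hand the new object
    # to the caller's block for further configuration.
    yield self if block_given?
  end
end

plain      = MiniAgent.new
configured = MiniAgent.new(limit: 10) { |agent| agent.delay = 2 }

puts plain.delay
puts configured.delay
```

Options passed as keywords are set before the block runs, so the block can inspect or override them.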

Instance Attribute Details

#authorized ⇒ `AuthStore`

HTTP Authentication credentials

Returns:

(AuthStore)


# File 'lib/spidr/agent.rb', line 44

def authorized
  @authorized
end

#cookies ⇒ `CookieJar` (readonly)

Cached cookies

Returns:

(CookieJar)


# File 'lib/spidr/agent.rb', line 81

def cookies
  @cookies
end

#default_headers ⇒ `Hash{String => String}` (readonly)

HTTP Headers to use for every request

Returns:

(Hash{String => String})

Since:

0.6.0


# File 'lib/spidr/agent.rb', line 39

def default_headers
  @default_headers
end

#delay ⇒ `Integer`

Delay in between fetching pages

Returns:

(Integer)


# File 'lib/spidr/agent.rb', line 54

def delay
  @delay
end

#failures ⇒ `Set<URI::HTTP>`

List of unreachable URLs

Returns:

(Set<URI::HTTP>)


# File 'lib/spidr/agent.rb', line 64

def failures
  @failures
end

#history ⇒ `Set<URI::HTTP>` Also known as: visited_urls

History containing visited URLs

Returns:

(Set<URI::HTTP>)


# File 'lib/spidr/agent.rb', line 59

def history
  @history
end

#host_header ⇒ `String`

HTTP Host Header to use

Returns:

(String)


# File 'lib/spidr/agent.rb', line 27

def host_header
  @host_header
end

#host_headers ⇒ `Hash{String,Regexp => String}` (readonly)

HTTP Host Headers to use for specific hosts

Returns:

(Hash{String,Regexp => String})


# File 'lib/spidr/agent.rb', line 32

def host_headers
  @host_headers
end

#levels ⇒ `Hash{URI::HTTP => Integer}` (readonly)

The visited URLs and their depth within a site

Returns:

(Hash{URI::HTTP => Integer})


# File 'lib/spidr/agent.rb', line 96

def levels
  @levels
end

#limit ⇒ `Integer` (readonly)

Maximum number of pages to visit.

Returns:

(Integer)


# File 'lib/spidr/agent.rb', line 86

def limit
  @limit
end

#max_depth ⇒ `Integer` (readonly)

Maximum depth

Returns:

(Integer)


# File 'lib/spidr/agent.rb', line 91

def max_depth
  @max_depth
end

#queue ⇒ `Array<URI::HTTP>` Also known as: pending_urls

Queue of URLs to visit

Returns:

(Array<URI::HTTP>)


# File 'lib/spidr/agent.rb', line 69

def queue
  @queue
end

#referer ⇒ `String`

Referer to use

Returns:

(String)


# File 'lib/spidr/agent.rb', line 49

def referer
  @referer
end

#schemes ⇒ `Object`

List of acceptable URL schemes to follow


# File 'lib/spidr/agent/filters.rb', line 9

def schemes
  @schemes
end

#sessions ⇒ `SessionCache` (readonly)

The session cache

Returns:

(SessionCache)

Since:

0.6.0


# File 'lib/spidr/agent.rb', line 76

def sessions
  @sessions
end

#strip_fragments ⇒ `Object`

Specifies whether the Agent will strip URI fragments


# File 'lib/spidr/agent/sanitizers.rb', line 9

def strip_fragments
  @strip_fragments
end

#strip_query ⇒ `Object`

Specifies whether the Agent will strip URI queries


# File 'lib/spidr/agent/sanitizers.rb', line 12

def strip_query
  @strip_query
end

Class Method Details

.default_schemes ⇒ `Array<String>` (protected)

Determines the default URI schemes to follow.

Returns:

(Array<String>) —
The default URI schemes to follow.

Since:

0.6.2

# File 'lib/spidr/agent/filters.rb', line 429

def self.default_schemes
  schemes = ['http']

  begin
    require 'net/https'

    schemes << 'https'
  rescue Gem::LoadError => e
    raise(e)
  rescue ::LoadError
    warn "Warning: cannot load 'net/https', https support disabled"
  end

  return schemes
end
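
The optional-dependency pattern used above can be tried in isolation. The sketch below is an illustration only: `schemes_for` is a hypothetical helper that takes the feature name as a parameter, so that both the success path and the fallback path can be exercised.

```ruby
# Hypothetical helper for illustration; not part of Spidr.
# Mirrors the rescue ordering above: Gem::LoadError is a subclass of
# LoadError, so it must be rescued first in order to be re-raised.
def schemes_for(feature)
  schemes = ['http']

  begin
    require feature

    schemes << 'https'
  rescue Gem::LoadError
    raise
  rescue ::LoadError
    warn "Warning: cannot load '#{feature}', https support disabled"
  end

  schemes
end

p schemes_for('uri')                    # a feature that can be loaded
p schemes_for('no_such_feature_hopefully') # a feature that cannot be loaded
```

A plain missing file falls through to the second `rescue` and only disables the extra scheme, while a gem activation conflict is treated as a real error.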

.domain(name, **kwargs) {|agent| ... } ⇒ `Agent`

Creates a new agent and spiders the entire domain.

Parameters:

name (String) —
The top-level domain to spider.
kwargs (Hash{Symbol => Object}) —
Additional keyword arguments. See #initialize.

Yields:

(agent) —
If a block is given, it will be passed the newly created agent before it begins spidering.

Yield Parameters:

agent (Agent) —
The newly created agent.

Returns:

(Agent) —
The created agent object.

See Also:

#initialize

Since:

0.7.0

# File 'lib/spidr/agent.rb', line 418

def self.domain(name,**kwargs,&block)
  agent = new(host: /(^|\.)#{Regexp.escape(name)}$/, **kwargs, &block)
  agent.start_at(URI::HTTP.build(host: name, path: '/'))
  return agent
end
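
The host pattern built by `.domain` can be checked on its own. In this sketch `domain_pattern` is a hypothetical helper that wraps the same regular expression as the source above; the host names are made-up examples.

```ruby
# Hypothetical helper for illustration; wraps the regular expression
# that .domain passes as the host: filter.
def domain_pattern(name)
  /(^|\.)#{Regexp.escape(name)}$/
end

pattern = domain_pattern('example.com')

%w[example.com www.example.com badexample.com example.com.evil.test].each do |host|
  puts "#{host}: #{pattern.match?(host)}"
end
```

The `(^|\.)` prefix accepts the bare domain and any sub-domain but rejects hosts that merely end in the same characters, and `Regexp.escape` keeps the dots in the name literal.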

.host(name, **kwargs) {|agent| ... } ⇒ `Agent`

Creates a new agent and spiders the given host.

Parameters:

name (String) —
The host-name to spider.
kwargs (Hash{Symbol => Object}) —
Additional keyword arguments. See #initialize.

Yields:

(agent) —
If a block is given, it will be passed the newly created agent before it begins spidering.

Yield Parameters:

agent (Agent) —
The newly created agent.

Returns:

(Agent) —
The created agent object.

See Also:

#initialize

# File 'lib/spidr/agent.rb', line 389

def self.host(name,**kwargs,&block)
  agent = new(host: name, **kwargs, &block)
  agent.start_at(URI::HTTP.build(host: name, path: '/'))
  return agent
end
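
The starting URL that `.host` derives from a host-name can be reproduced with the standard library alone. `root_url` below is a hypothetical helper used only for this illustration.

```ruby
require 'uri'

# Hypothetical helper for illustration; builds the same starting URL
# that .host passes to #start_at.
def root_url(name)
  URI::HTTP.build(host: name, path: '/')
end

url = root_url('www.example.com')

puts url
puts url.scheme
```

Because the URL is built with `URI::HTTP`, the starting point always uses the `http` scheme.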

.site(url, **kwargs) {|agent| ... } ⇒ `Agent`

Creates a new agent and spiders the web-site located at the given URL.

Parameters:

url (URI::HTTP, String) —
The web-site to spider.
kwargs (Hash{Symbol => Object}) —
Additional keyword arguments. See #initialize.

Yields:

(agent) —
If a block is given, it will be passed the newly created agent before it begins spidering.

Yield Parameters:

agent (Agent) —
The newly created agent.

Returns:

(Agent) —
The created agent object.

See Also:

#initialize

# File 'lib/spidr/agent.rb', line 360

def self.site(url,**kwargs,&block)
  url = URI(url)

  agent = new(host: url.host, **kwargs, &block)
  agent.start_at(url)
  return agent
end
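
`.site` accepts either a String or a URI because it coerces its argument with `URI()` before reading the host. A minimal sketch of that step, using a hypothetical helper named `site_host`:

```ruby
require 'uri'

# Hypothetical helper for illustration; shows how the host: filter
# value is derived from the url argument.
def site_host(url)
  url = URI(url) # parses a String, returns a URI object unchanged
  url.host
end

puts site_host('https://www.example.com/blog/')
puts site_host(URI('http://example.com/'))
```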

.start_at(url, **kwargs) {|agent| ... } ⇒ `Agent`

Creates a new agent and begin spidering at the given URL.

Parameters:

url (URI::HTTP, String) —
The URL to start spidering at.
kwargs (Hash{Symbol => Object}) —
Additional keyword arguments. See #initialize.

Yields:

(agent) —
If a block is given, it will be passed the newly created agent before it begins spidering.

Yield Parameters:

agent (Agent) —
The newly created agent.

Returns:

(Agent) —
The created agent object.

See Also:

#initialize

# File 'lib/spidr/agent.rb', line 333

def self.start_at(url,**kwargs,&block)
  agent = new(**kwargs,&block)
  agent.start_at(url)
  return agent
end

Instance Method Details

#all_headers {|headers| ... } ⇒ `Object`

Pass the headers from every response the agent receives to a given block.

Yields:

(headers) —
The block will be passed the headers of every response.

Yield Parameters:

headers (Hash) —
The headers from a response.


# File 'lib/spidr/agent/events.rb', line 70

def all_headers
  every_page { |page| yield page.headers }
end

#clear ⇒ `Object`

Clears the queue, the history, and the list of failures of the agent.

# File 'lib/spidr/agent.rb', line 458

def clear
  @queue.clear
  @history.clear
  @failures.clear
  return self
end

#continue! {|page| ... } ⇒ `Object`

Continue spidering.

Yields:

(page) —
If a block is given, it will be passed every page visited.

Yield Parameters:

page (Page) —
The page to be visited.

# File 'lib/spidr/agent/actions.rb', line 42

def continue!(&block)
  @paused = false
  return run(&block)
end

#dequeue ⇒ `URI::HTTP` (protected)

Dequeues a URL that will later be visited.

Returns:

(URI::HTTP) —
The URL that was at the front of the queue.


# File 'lib/spidr/agent.rb', line 922

def dequeue
  @queue.shift
end

#enqueue(url, level = 0) ⇒ `Boolean`

Enqueues a given URL for visiting, only if it passes all of the agent's rules for visiting a given URL.

Parameters:

url (URI::HTTP, String) —
The URL to enqueue for visiting.
level (Integer) (defaults to: 0) —
The depth to record for the URL in #levels.

Returns:

(Boolean) —
Specifies whether the URL was enqueued, or ignored.

# File 'lib/spidr/agent.rb', line 658

def enqueue(url,level=0)
  url = sanitize_url(url)

  if (!queued?(url) && visit?(url))
    link = url.to_s

    begin
      @every_url_blocks.each { |url_block| url_block.call(url) }

      @every_url_like_blocks.each do |pattern,url_blocks|
        match = case pattern
                when Regexp
                  link =~ pattern
                else
                  (pattern == link) || (pattern == url)
                end

        if match
          url_blocks.each { |url_block| url_block.call(url) }
        end
      end
    rescue Actions::Paused => action
      raise(action)
    rescue Actions::SkipLink
      return false
    rescue Actions::Action
    end

    @queue << url
    @levels[url] = level
    return true
  end

  return false
end
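
The bookkeeping performed by `#enqueue` (reject duplicates, apply the visit rules, then record the URL and its level) can be sketched without the library. `TinyQueue` below is a hypothetical stand-in, and its single-host check is an assumed simplification of the agent's real filtering rules. It also leaves out the URL callbacks and the `SkipLink` handling.

```ruby
require 'set'
require 'uri'

# Hypothetical stand-in for illustration; not part of Spidr.
class TinyQueue
  attr_reader :queue, :levels

  def initialize(allowed_host)
    @allowed_host = allowed_host
    @queue   = []
    @history = Set[]
    @levels  = Hash.new(0) # unknown URLs report depth 0
  end

  # Returns true if the URL was enqueued, false if it was ignored.
  def enqueue(url, level = 0)
    url = URI(url)

    # Already pending or already visited: ignore.
    return false if @queue.include?(url) || @history.include?(url)
    # Simplified visit rule: only one host is allowed.
    return false unless url.host == @allowed_host

    @queue << url
    @levels[url] = level
    true
  end
end

q = TinyQueue.new('example.com')
p q.enqueue('http://example.com/a', 1)
p q.enqueue('http://example.com/a')
p q.enqueue('http://other.test/')
```

URI objects compare by value, so the same address given as a String twice is recognised as a duplicate.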

#every_atom_doc {|doc| ... } ⇒ `Object`

Pass every Atom document that the agent parses to a given block.

Yields:

(doc) —
The block will be passed every Atom document parsed.

Yield Parameters:

doc (Nokogiri::XML::Document) —
A parsed XML document.

See Also:

http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Document.html

# File 'lib/spidr/agent/events.rb', line 389

def every_atom_doc
  every_page do |page|
    if (block_given? && page.atom?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end

#every_atom_page {|feed| ... } ⇒ `Object`

Pass every Atom feed that the agent visits to a given block.

Yields:

(feed) —
The block will be passed every Atom feed visited.

Yield Parameters:

feed (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 453

def every_atom_page
  every_page do |page|
    yield page if (block_given? && page.atom?)
  end
end

#every_bad_request_page {|page| ... } ⇒ `Object`

Pass every Bad Request page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every Bad Request page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 142

def every_bad_request_page
  every_page do |page|
    yield page if (block_given? && page.bad_request?)
  end
end

#every_css_page {|page| ... } ⇒ `Object`

Pass every CSS page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every CSS page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 423

def every_css_page
  every_page do |page|
    yield page if (block_given? && page.css?)
  end
end

#every_doc {|doc| ... } ⇒ `Object`

Pass every HTML or XML document that the agent parses to a given block.

Yields:

(doc) —
The block will be passed every HTML or XML document parsed.

Yield Parameters:

doc (Nokogiri::HTML::Document, Nokogiri::XML::Document) —
A parsed HTML or XML document.

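See Also:

http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Document.html
http://nokogiri.rubyforge.org/nokogiri/Nokogiri/HTML/Document.html
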
# File 'lib/spidr/agent/events.rb', line 283

def every_doc
  every_page do |page|
    if block_given?
      if (doc = page.doc)
        yield doc
      end
    end
  end
end

#every_failed_url {|url| ... } ⇒ `Object`

Pass each URL that could not be requested to the given block.

Yields:

(url) —
The block will be passed every URL that could not be requested.

Yield Parameters:

url (URI::HTTP) —
A failed URL.

# File 'lib/spidr/agent/events.rb', line 28

def every_failed_url(&block)
  @every_failed_url_blocks << block
  return self
end

#every_forbidden_page {|page| ... } ⇒ `Object`

Pass every Forbidden page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every Forbidden page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 172

def every_forbidden_page
  every_page do |page|
    yield page if (block_given? && page.forbidden?)
  end
end

#every_html_doc {|doc| ... } ⇒ `Object`

Pass every HTML document that the agent parses to a given block.

Yields:

(doc) —
The block will be passed every HTML document parsed.

Yield Parameters:

doc (Nokogiri::HTML::Document) —
A parsed HTML document.

See Also:

http://nokogiri.rubyforge.org/nokogiri/Nokogiri/HTML/Document.html

# File 'lib/spidr/agent/events.rb', line 304

def every_html_doc
  every_page do |page|
    if (block_given? && page.html?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end

#every_html_page {|page| ... } ⇒ `Object`

Pass every HTML page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every HTML page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 233

def every_html_page
  every_page do |page|
    yield page if (block_given? && page.html?)
  end
end

#every_internal_server_error_page {|page| ... } ⇒ `Object`

Pass every Internal Server Error page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every Internal Server Error page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 203

def every_internal_server_error_page
  every_page do |page|
    yield page if (block_given? && page.had_internal_server_error?)
  end
end

#every_javascript_page {|page| ... } ⇒ `Object`

Pass every JavaScript page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every JavaScript page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 408

def every_javascript_page
  every_page do |page|
    yield page if (block_given? && page.javascript?)
  end
end

#every_link {|origin, dest| ... } ⇒ `Object`

Passes every origin and destination URI of each link to a given block.

Yields:

(origin, dest) —
The block will be passed every origin and destination URI of each link.

Yield Parameters:

origin (URI::HTTP) —
The URI that a link originated from.
dest (URI::HTTP) —
The destination URI of a link.

# File 'lib/spidr/agent/events.rb', line 518

def every_link(&block)
  @every_link_blocks << block
  return self
end

#every_missing_page {|page| ... } ⇒ `Object`

Pass every Missing page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every Missing page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 187

def every_missing_page
  every_page do |page|
    yield page if (block_given? && page.missing?)
  end
end

#every_ms_word_page {|page| ... } ⇒ `Object`

Pass every MS Word page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every MS Word page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 468

def every_ms_word_page
  every_page do |page|
    yield page if (block_given? && page.ms_word?)
  end
end

#every_ok_page {|page| ... } ⇒ `Object`

Pass every OK page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every OK page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 97

def every_ok_page
  every_page do |page|
    yield page if (block_given? && page.ok?)
  end
end

#every_page {|page| ... } ⇒ `Object`

Pass every page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 83

def every_page(&block)
  @every_page_blocks << block
  return self
end
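
The registration pattern above (store the block, return self) and the filtered variants built on top of it can be sketched without any network access. `CallbackSketch`, its `visit` method, and the Hash-based pages are hypothetical stand-ins for illustration, not part of Spidr.

```ruby
# Hypothetical stand-in for the agent; only the callback plumbing is modelled.
class CallbackSketch
  def initialize
    @every_page_blocks = []
  end

  # Stores the block and returns self, so registrations can be chained.
  def every_page(&block)
    @every_page_blocks << block
    self
  end

  # Filtered variant: wraps the caller's block in an every_page callback.
  def every_ok_page
    every_page do |page|
      yield page if block_given? && page[:code] == 200
    end
  end

  # Simulates a visit by calling each registered block in order.
  def visit(page)
    @every_page_blocks.each { |page_block| page_block.call(page) }
  end
end

seen = []
ok   = []

agent = CallbackSketch.new
agent.every_page    { |page| seen << page[:url] }
agent.every_ok_page { |page| ok << page[:url] }

agent.visit({ url: 'http://example.com/',        code: 200 })
agent.visit({ url: 'http://example.com/missing', code: 404 })
```

Every filtered method in this section follows the same shape: it registers one more `every_page` callback, so callbacks fire in registration order.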

#every_pdf_page {|page| ... } ⇒ `Object`

Pass every PDF page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every PDF page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 483

def every_pdf_page
  every_page do |page|
    yield page if (block_given? && page.pdf?)
  end
end

#every_redirect_page {|page| ... } ⇒ `Object`

Pass every Redirect page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every Redirect page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 112

def every_redirect_page
  every_page do |page|
    yield page if (block_given? && page.redirect?)
  end
end

#every_rss_doc {|doc| ... } ⇒ `Object`

Pass every RSS document that the agent parses to a given block.

Yields:

(doc) —
The block will be passed every RSS document parsed.

Yield Parameters:

doc (Nokogiri::XML::Document) —
A parsed XML document.

See Also:

http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Document.html

# File 'lib/spidr/agent/events.rb', line 368

def every_rss_doc
  every_page do |page|
    if (block_given? && page.rss?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end

#every_rss_page {|feed| ... } ⇒ `Object`

Pass every RSS feed that the agent visits to a given block.

Yields:

(feed) —
The block will be passed every RSS feed visited.

Yield Parameters:

feed (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 438

def every_rss_page
  every_page do |page|
    yield page if (block_given? && page.rss?)
  end
end

#every_timedout_page {|page| ... } ⇒ `Object`

Pass every timed-out page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every timed-out page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 127

def every_timedout_page
  every_page do |page|
    yield page if (block_given? && page.timedout?)
  end
end

#every_txt_page {|page| ... } ⇒ `Object`

Pass every Plain Text page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every Plain Text page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 218

def every_txt_page
  every_page do |page|
    yield page if (block_given? && page.txt?)
  end
end

#every_unauthorized_page {|page| ... } ⇒ `Object`

Pass every Unauthorized page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every Unauthorized page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 157

def every_unauthorized_page
  every_page do |page|
    yield page if (block_given? && page.unauthorized?)
  end
end

#every_url {|url| ... } ⇒ `Object`

Pass each URL from each page visited to the given block.

Yields:

(url) —
The block will be passed every URL from every page visited.

Yield Parameters:

url (URI::HTTP) —
Each URL from each page visited.

# File 'lib/spidr/agent/events.rb', line 14

def every_url(&block)
  @every_url_blocks << block
  return self
end

#every_url_like(pattern) {|url| ... } ⇒ `Object`

Pass every URL that the agent visits and that matches a given pattern to a given block.

Parameters:

pattern (Regexp, String) —
The pattern to match URLs with.

Yields:

(url) —
The block will be passed every URL that matches the given pattern.

Yield Parameters:

url (URI::HTTP) —
A matching URL.

Since:

0.3.2

# File 'lib/spidr/agent/events.rb', line 48

def every_url_like(pattern,&block)
  @every_url_like_blocks[pattern] << block
  return self
end
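
The pattern-keyed storage used here relies on the default-block Hash created in `initialize_events`. The sketch below shows that storage in isolation. The `dispatch` lambda is an assumption made for illustration (Regexp keys tested with `match?`, other keys by equality); the code that actually invokes these blocks is not shown in this section.

```ruby
# Same default-block Hash as in initialize_events: a missing key is
# created with an empty Array on first access.
url_like_blocks = Hash.new { |hash, key| hash[key] = [] }

matched = []
url_like_blocks[/\.pdf\z/]                 << ->(url) { matched << [:pdf, url] }
url_like_blocks['http://example.com/feed'] << ->(url) { matched << [:feed, url] }

# Hypothetical dispatcher (an assumption, not Spidr's code): Regexp keys are
# tested with match?, any other key by equality.
dispatch = lambda do |url|
  url_like_blocks.each do |pattern, blocks|
    hit = pattern.is_a?(Regexp) ? pattern.match?(url) : pattern == url
    blocks.each { |block| block.call(url) } if hit
  end
end

dispatch.call('http://example.com/report.pdf')
dispatch.call('http://example.com/feed')
dispatch.call('http://example.com/index.html')
```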

#every_xml_doc {|doc| ... } ⇒ `Object`

Pass every XML document that the agent parses to a given block.

Yields:

(doc) —
The block will be passed every XML document parsed.

Yield Parameters:

doc (Nokogiri::XML::Document) —
A parsed XML document.

See Also:

http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Document.html

# File 'lib/spidr/agent/events.rb', line 325

def every_xml_doc
  every_page do |page|
    if (block_given? && page.xml?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end

#every_xml_page {|page| ... } ⇒ `Object`

Pass every XML page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every XML page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 248

def every_xml_page
  every_page do |page|
    yield page if (block_given? && page.xml?)
  end
end

#every_xsl_doc {|doc| ... } ⇒ `Object`

Pass every XML Stylesheet (XSL) that the agent parses to a given block.

Yields:

(doc) —
The block will be passed every XML Stylesheet (XSL) parsed.

Yield Parameters:

doc (Nokogiri::XML::Document) —
A parsed XML document.

See Also:

http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Document.html

# File 'lib/spidr/agent/events.rb', line 347

def every_xsl_doc
  every_page do |page|
    if (block_given? && page.xsl?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end

#every_xsl_page {|page| ... } ⇒ `Object`

Pass every XML Stylesheet (XSL) page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every XML Stylesheet (XSL) page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 264

def every_xsl_page
  every_page do |page|
    yield page if (block_given? && page.xsl?)
  end
end

#every_zip_page {|page| ... } ⇒ `Object`

Pass every ZIP page that the agent visits to a given block.

Yields:

(page) —
The block will be passed every ZIP page visited.

Yield Parameters:

page (Page) —
A visited page.

# File 'lib/spidr/agent/events.rb', line 498

def every_zip_page
  every_page do |page|
    yield page if (block_given? && page.zip?)
  end
end

#failed(url) ⇒ `Object` (protected)

Adds a given URL to the failures list.

Parameters:

url (URI::HTTP) —
The URL to add to the failures list.

# File 'lib/spidr/agent.rb', line 963

def failed(url)
  @failures << url
  @every_failed_url_blocks.each { |fail_block| fail_block.call(url) }
  return true
end

#failed?(url) ⇒ `Boolean`

Determines whether a given URL could not be visited.

Parameters:

url (URI::HTTP, String) —
The URL to check for failures.

Returns:

(Boolean) —
Specifies whether the given URL could not be visited.


# File 'lib/spidr/agent.rb', line 607

def failed?(url)
  @failures.include?(URI(url))
end
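
Because the lookup wraps its argument in `URI()`, a String and a URI object are treated alike. A minimal sketch of that coercion, using only the standard library and an illustrative `failures` set:

```ruby
require 'set'
require 'uri'

failures = Set.new
failures << URI('http://example.com/down')

# URI() parses a String and returns an existing URI object unchanged,
# so both argument types end up as the same kind of Set member.
from_string = failures.include?(URI('http://example.com/down'))
from_uri    = failures.include?(URI(URI('http://example.com/down')))
other       = failures.include?(URI('http://example.com/up'))

# Without the coercion a String never equals a URI member.
raw_string  = failures.include?('http://example.com/down')
```

Note that `#queued?`, as listed in this section, passes its argument straight to `@queue.include?`, so it does not get this String coercion.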

#get_page(url) {|page| ... } ⇒ `Page`, `nil`

Requests and creates a new Page object from a given URL.

Parameters:

url (URI::HTTP) —
The URL to request.

Yields:

(page) —
If a block is given, it will be passed the page that represents the response.

Yield Parameters:

page (Page) —
The page for the response.

Returns:

(Page, nil) —
The page for the response, or nil if the request failed.

# File 'lib/spidr/agent.rb', line 710

def get_page(url)
  url = URI(url)

  prepare_request(url) do |session,path,headers|
    new_page = Page.new(url,session.get(path,headers))

    # save any new cookies
    @cookies.from_page(new_page)

    yield new_page if block_given?
    return new_page
  end
end
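
The `return new_page` inside the block exits `get_page` itself, while a rescued error in `prepare_request` makes the whole call evaluate to nil. The sketch below reproduces only that control flow. `with_session_sketch` and `fetch_sketch` are hypothetical helpers, and `IOError` stands in for the full list of rescued errors.

```ruby
# Hypothetical helper mirroring the shape of prepare_request: yield to the
# caller, and turn a network-style error into nil.
def with_session_sketch
  yield :session
rescue IOError
  nil
end

# Mirrors get_page: `return` inside the block exits fetch_sketch itself.
def fetch_sketch(fail_request: false)
  with_session_sketch do |session|
    raise IOError, 'connection reset' if fail_request

    page = "page via #{session}"
    yield page if block_given?
    return page
  end
end

yielded = []
success = fetch_sketch { |page| yielded << page }
failure = fetch_sketch(fail_request: true) { |page| yielded << page }
```

On failure the caller's block is never invoked, which is why callers need to handle a nil return.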

#ignore_exts ⇒ `Array<String, Regexp, Proc>`

Specifies the patterns that match URI path extensions to not visit.

Returns:

(Array<String, Regexp, Proc>) —
The URI path extension patterns to not visit.


# File 'lib/spidr/agent/filters.rb', line 330

def ignore_exts
  @ext_rules.reject
end

#ignore_exts_like(pattern = nil) {|ext| ... } ⇒ `Object`

Adds a given pattern to the #ignore_exts.

Parameters:

pattern (String, Regexp) (defaults to: nil) —
The pattern to match URI path extensions with.

Yields:

(ext) —
If a block is given, it will be used to filter URI path extensions.

Yield Parameters:

ext (String) —
A URI path extension to reject or accept.

# File 'lib/spidr/agent/filters.rb', line 346

def ignore_exts_like(pattern=nil,&block)
  if pattern
    ignore_exts << pattern
  elsif block
    ignore_exts << block
  end

  return self
end

#ignore_hosts ⇒ `Array<String, Regexp, Proc>`

Specifies the patterns that match host-names to not visit.

Returns:

(Array<String, Regexp, Proc>) —
The host-name patterns to not visit.


# File 'lib/spidr/agent/filters.rb', line 62

def ignore_hosts
  @host_rules.reject
end

#ignore_hosts_like(pattern = nil) {|host| ... } ⇒ `Object`

Adds a given pattern to the #ignore_hosts.

Parameters:

pattern (String, Regexp) (defaults to: nil) —
The pattern to match host-names with.

Yields:

(host) —
If a block is given, it will be used to filter host-names.

Yield Parameters:

host (String) —
A host-name to reject or accept.

# File 'lib/spidr/agent/filters.rb', line 78

def ignore_hosts_like(pattern=nil,&block)
  if pattern
    ignore_hosts << pattern
  elsif block
    ignore_hosts << block
  end

  return self
end

#ignore_links ⇒ `Array<String, Regexp, Proc>`

Specifies the patterns that match links to not visit.

Returns:

(Array<String, Regexp, Proc>) —
The link patterns to not visit.


# File 'lib/spidr/agent/filters.rb', line 194

def ignore_links
  @link_rules.reject
end

#ignore_links_like(pattern = nil) {|link| ... } ⇒ `Object`

Adds a given pattern to the #ignore_links.

Parameters:

pattern (String, Regexp) (defaults to: nil) —
The pattern to match links with.

Yields:

(link) —
If a block is given, it will be used to filter links.

Yield Parameters:

link (String) —
A link to reject or accept.

# File 'lib/spidr/agent/filters.rb', line 210

def ignore_links_like(pattern=nil,&block)
  if pattern
    ignore_links << pattern
  elsif block
    ignore_links << block
  end

  return self
end

#ignore_ports ⇒ `Array<Integer, Regexp, Proc>`

Specifies the patterns that match ports to not visit.

Returns:

(Array<Integer, Regexp, Proc>) —
The port patterns to not visit.


# File 'lib/spidr/agent/filters.rb', line 126

def ignore_ports
  @port_rules.reject
end

#ignore_ports_like(pattern = nil) {|port| ... } ⇒ `Object`

Adds a given pattern to the #ignore_ports.

Parameters:

pattern (Integer, Regexp) (defaults to: nil) —
The pattern to match ports with.

Yields:

(port) —
If a block is given, it will be used to filter ports.

Yield Parameters:

port (Integer) —
A port to reject or accept.

# File 'lib/spidr/agent/filters.rb', line 142

def ignore_ports_like(pattern=nil,&block)
  if pattern
    ignore_ports << pattern
  elsif block
    ignore_ports << block
  end

  return self
end

#ignore_urls ⇒ `Array<String, Regexp, Proc>`

Specifies the patterns that match URLs to not visit.

Returns:

(Array<String, Regexp, Proc>) —
The URL patterns to not visit.

Since:

0.2.4


# File 'lib/spidr/agent/filters.rb', line 264

def ignore_urls
  @url_rules.reject
end

#ignore_urls_like(pattern = nil) {|url| ... } ⇒ `Object`

Adds a given pattern to the #ignore_urls.

Parameters:

pattern (String, Regexp) (defaults to: nil) —
The pattern to match URLs with.

Yields:

(url) —
If a block is given, it will be used to filter URLs.

Yield Parameters:

url (URI::HTTP, URI::HTTPS) —
A URL to reject or accept.

Since:

0.2.4

# File 'lib/spidr/agent/filters.rb', line 282

def ignore_urls_like(pattern=nil,&block)
  if pattern
    ignore_urls << pattern
  elsif block
    ignore_urls << block
  end

  return self
end

#initialize_actions ⇒ `Object` (protected)


# File 'lib/spidr/agent/actions.rb', line 101

def initialize_actions
  @paused = false
end

#initialize_events ⇒ `Object` (protected)

# File 'lib/spidr/agent/events.rb', line 525

def initialize_events
  @every_url_blocks        = []
  @every_failed_url_blocks = []
  @every_url_like_blocks   = Hash.new { |hash,key| hash[key] = [] }

  @every_page_blocks = []
  @every_link_blocks = []
end

#initialize_filters(schemes: self.class.default_schemes, host: nil, hosts: nil, ignore_hosts: nil, ports: nil, ignore_ports: nil, links: nil, ignore_links: nil, urls: nil, ignore_urls: nil, exts: nil, ignore_exts: nil) ⇒ `Object` (protected)

Initializes filtering rules.

Parameters:

schemes (Array<String>) (defaults to: self.class.default_schemes) —
The list of acceptable URI schemes to visit. The https scheme will be ignored if net/https cannot be loaded.
host (String) (defaults to: nil) —
The host-name to visit.
hosts (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the host-names to visit.
ignore_hosts (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the host-names to not visit.
ports (Array<Integer, Regexp, Proc>) (defaults to: nil) —
The patterns which match the ports to visit.
ignore_ports (Array<Integer, Regexp, Proc>) (defaults to: nil) —
The patterns which match the ports to not visit.
links (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the links to visit.
ignore_links (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the links to not visit.
urls (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the URLs to visit.
ignore_urls (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the URLs to not visit.
exts (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the URI path extensions to visit.
ignore_exts (Array<String, Regexp, Proc>) (defaults to: nil) —
The patterns which match the URI path extensions to not visit.

# File 'lib/spidr/agent/filters.rb', line 398

def initialize_filters(schemes:      self.class.default_schemes,
                       host:         nil,
                       hosts:        nil,
                       ignore_hosts: nil,
                       ports:        nil,
                       ignore_ports: nil,
                       links:        nil,
                       ignore_links: nil,
                       urls:         nil,
                       ignore_urls:  nil,
                       exts:         nil,
                       ignore_exts:  nil)
  @schemes = schemes.map(&:to_s)

  @host_rules = Rules.new(accept: hosts, reject: ignore_hosts)
  @port_rules = Rules.new(accept: ports, reject: ignore_ports)
  @link_rules = Rules.new(accept: links, reject: ignore_links)
  @url_rules  = Rules.new(accept: urls,  reject: ignore_urls)
  @ext_rules  = Rules.new(accept: exts,  reject: ignore_exts)

  visit_hosts_like(host) if host
end
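
Each filter is an accept/reject pair, and the `*_like` methods append a String, Regexp, or block to one side of that pair. The `Rules` class itself is not shown in this section, so the sketch below uses a hypothetical `RulesSketch` whose matching policy is an assumption made for illustration only.

```ruby
# Hypothetical stand-in for a rule set; the matching semantics below are
# assumptions made for illustration, not Spidr's Rules implementation.
class RulesSketch
  attr_reader :accept, :reject

  def initialize(accept: nil, reject: nil)
    @accept = Array(accept)
    @reject = Array(reject)
  end

  # Assumed policy: a non-empty accept list decides on its own; otherwise
  # anything not matched by the reject list passes.
  def accept?(data)
    return @accept.any? { |rule| rule_matches?(rule, data) } unless @accept.empty?

    @reject.none? { |rule| rule_matches?(rule, data) }
  end

  private

  def rule_matches?(rule, data)
    case rule
    when Proc   then rule.call(data) == true
    when Regexp then rule.match?(data.to_s)
    else             rule == data
    end
  end
end

host_rules = RulesSketch.new(reject: ['ads.example.com'])
host_rules.reject << /\Atracker\./                        # Regexp rule
host_rules.reject << ->(host) { host.end_with?('.invalid') } # block rule

port_rules = RulesSketch.new(accept: [80, 443])
```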

#initialize_robots ⇒ `Object`

Initializes the robots filter.

# File 'lib/spidr/agent/robots.rb', line 13

def initialize_robots
  unless Object.const_defined?(:Robots)
    raise(ArgumentError,":robots option given but unable to require 'robots' gem")
  end

  @robots = Robots.new(@user_agent)
end

#initialize_sanitizers(strip_fragments: true, strip_query: false) ⇒ `Object` (protected)

Initializes the Sanitizer rules.

Parameters:

strip_fragments (Boolean) (defaults to: true) —
Specifies whether or not to strip the fragment component from URLs.
strip_query (Boolean) (defaults to: false) —
Specifies whether or not to strip the query component from URLs.

Since:

0.2.2

# File 'lib/spidr/agent/sanitizers.rb', line 47

def initialize_sanitizers(strip_fragments: true, strip_query: false)
  @strip_fragments = strip_fragments
  @strip_query     = strip_query
end

#limit_reached? ⇒ `Boolean` (protected)

Determines if the maximum limit has been reached.

Returns:

(Boolean)

Since:

0.6.0


# File 'lib/spidr/agent.rb', line 933

def limit_reached?
  @limit && @history.length >= @limit
end

#pause! ⇒ `Object`

Pauses the agent, causing spidering to temporarily stop.

Raises:

(Paused) —
Indicates to the agent that it should pause spidering.

# File 'lib/spidr/agent/actions.rb', line 63

def pause!
  @paused = true
  raise(Actions::Paused)
end

#pause=(state) ⇒ `Object`

Sets the pause state of the agent.

Parameters:

state (Boolean) —
The new pause state of the agent.


# File 'lib/spidr/agent/actions.rb', line 53

def pause=(state)
  @paused = state
end

#paused? ⇒ `Boolean`

Determines whether the agent is paused.

Returns:

(Boolean) —
Specifies whether the agent is paused.


# File 'lib/spidr/agent/actions.rb', line 74

def paused?
  @paused == true
end

#post_page(url, post_data = '') {|page| ... } ⇒ `Page`, `nil`

Posts supplied form data and creates a new Page object from a given URL.

Parameters:

url (URI::HTTP) —
The URL to request.
post_data (String) (defaults to: '') —
The form data to send as the request body.

Yields:

(page) —
If a block is given, it will be passed the page that represents the response.

Yield Parameters:

page (Page) —
The page for the response.

Returns:

(Page, nil) —
The page for the response, or nil if the request failed.

Since:

0.2.2

# File 'lib/spidr/agent.rb', line 745

def post_page(url,post_data='')
  url = URI(url)

  prepare_request(url) do |session,path,headers|
    new_page = Page.new(url,session.post(path,post_data,headers))

    # save any new cookies
    @cookies.from_page(new_page)

    yield new_page if block_given?
    return new_page
  end
end

#prepare_request(url) {|session, path, headers| ... } ⇒ `Object` (protected)

Normalizes the request path and grabs a session to handle page get and post requests.

Parameters:

url (URI::HTTP) —
The URL to request.

Yields:

(session, path, headers) —
A block whose purpose is to make a page request.

Yield Parameters:

session (Net::HTTP) —
An HTTP session object.
path (String) —
The normalized request path, including any query string.
headers (Hash) —
A Hash of request header options.

Since:

0.2.2

# File 'lib/spidr/agent.rb', line 885

def prepare_request(url,&block)
  path = unless url.path.empty?
           url.path
         else
           '/'
         end

  # append the URL query to the path
  path += "?#{url.query}" if url.query

  headers = prepare_request_headers(url)

  begin
    sleep(@delay) if @delay > 0

    yield @sessions[url], path, headers
  rescue SystemCallError,
         Timeout::Error,
         SocketError,
         IOError,
         OpenSSL::SSL::SSLError,
         Net::HTTPBadResponse,
         Zlib::Error

    @sessions.kill!(url)

    failed(url)
    return nil
  end
end
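
The path normalization at the top of `prepare_request` can be checked in isolation with the standard `uri` library. `request_path_sketch` is a hypothetical helper that repeats only those steps.

```ruby
require 'uri'

# Hypothetical helper that repeats the path-building steps of prepare_request.
def request_path_sketch(url)
  url  = URI(url)
  path = url.path.empty? ? '/' : url.path

  # append the URL query to the path
  path += "?#{url.query}" if url.query
  path
end

root   = request_path_sketch('http://example.com')
search = request_path_sketch('http://example.com/search?q=ruby&page=2')
anchor = request_path_sketch('http://example.com/docs#intro')
```

The fragment is never sent: only the path and query take part in the request.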

#prepare_request_headers(url) ⇒ `Hash{String => String}` (protected)

Prepares request headers for the given URL.

Parameters:

url (URI::HTTP) —
The URL to prepare the request headers for.

Returns:

(Hash{String => String}) —
The prepared headers.

Since:

0.6.0

# File 'lib/spidr/agent.rb', line 836

def prepare_request_headers(url)
  # set any additional HTTP headers
  headers = @default_headers.dup

  unless @host_headers.empty?
    @host_headers.each do |name,header|
      if url.host.match(name)
        headers['Host'] = header
        break
      end
    end
  end

  headers['Host']     ||= @host_header if @host_header
  headers['User-Agent'] = @user_agent if @user_agent
  headers['Referer']    = @referer if @referer

  if (authorization = @authorized.for_url(url))
    headers['Authorization'] = "Basic #{authorization}"
  end

  if (header_cookies = @cookies.for_host(url.host))
    headers['Cookie'] = header_cookies
  end

  return headers
end
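
The Host header lookup uses `String#match`, so both String and Regexp keys in `host_headers` act as patterns, and the first matching entry wins. `host_header_sketch` and the sample rules below are hypothetical and cover only that part of the method.

```ruby
# Hypothetical helper repeating the Host-header lookup from
# prepare_request_headers: the first matching key wins.
def host_header_sketch(host, host_headers, fallback = nil)
  headers = {}

  host_headers.each do |name, header|
    if host.match(name)
      headers['Host'] = header
      break
    end
  end

  # The general host header only applies when no specific rule matched.
  headers['Host'] ||= fallback if fallback
  headers
end

rules = {
  'internal.example.com' => 'intranet',
  /\.example\.com\z/     => 'public'
}

exact    = host_header_sketch('internal.example.com', rules)
wildcard = host_header_sketch('www.example.com', rules)
default  = host_header_sketch('example.org', rules, 'fallback-host')
nothing  = host_header_sketch('example.org', rules)
```

Since `String#match` converts a String argument to a Regexp, a `.` in a String key matches any character. An anchored Regexp key is the stricter choice.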

#proxy ⇒ `Proxy`

The proxy information the agent uses.

Returns:

(Proxy) —
The proxy information.

See Also:

Settings::Proxy#proxy

Since:

0.2.2


# File 'lib/spidr/agent.rb', line 434

def proxy
  @sessions.proxy
end

#proxy=(new_proxy) ⇒ `Proxy`

Sets the proxy information that the agent uses.

Parameters:

new_proxy (Proxy, Hash, URI::HTTP, String, nil) —
The new proxy information.

Returns:

(Proxy) —
The new proxy information.

See Also:

Settings::Proxy#proxy=

Since:

0.2.2


# File 'lib/spidr/agent.rb', line 451

def proxy=(new_proxy)
  @sessions.proxy = new_proxy
end

#queued?(url) ⇒ `Boolean`

Determines whether a given URL has been enqueued.

Parameters:

url (URI::HTTP) —
The URL to search for in the queue.

Returns:

(Boolean) —
Specifies whether the given URL has been queued for visiting.


# File 'lib/spidr/agent.rb', line 644

def queued?(url)
  @queue.include?(url)
end

#robot_allowed?(url) ⇒ `Boolean`

Determines whether a URL is allowed by the robot policy.

Parameters:

url (URI::HTTP, String) —
The URL to check.

Returns:

(Boolean) —
Specifies whether a URL is allowed by the robot policy.

# File 'lib/spidr/agent/robots.rb', line 30

def robot_allowed?(url)
  if @robots
    @robots.allowed?(url)
  else
    true
  end
end

#run {|page| ... } ⇒ `Object`

Start spidering until the queue becomes empty or the agent is paused.

Yields:

(page) —
If a block is given, it will be passed every page visited.

Yield Parameters:

page (Page) —
A page which has been visited.

# File 'lib/spidr/agent.rb', line 492

def run(&block)
  @running = true

  until (@queue.empty? || paused? || limit_reached?)
    begin
      visit_page(dequeue,&block)
    rescue Actions::Paused
      return self
    rescue Actions::Action
    end
  end

  @running = false
  @sessions.clear
  return self
end
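
The loop above treats exceptions as control signals: `Paused` stops the run and leaves the queue intact, while any other `Action` is swallowed. The sketch below assumes that `Paused` and `SkipPage` subclass `Action`, as the rescue order suggests. `ActionsSketch` and `run_sketch` are hypothetical names.

```ruby
# Hypothetical exception hierarchy modelled on the names used above.
module ActionsSketch
  class Action < RuntimeError; end
  class Paused < Action; end
  class SkipPage < Action; end
end

# Drains the queue the way run does: Paused stops the loop and leaves the
# rest of the queue intact; any other Action is swallowed.
def run_sketch(queue, completed)
  until queue.empty?
    begin
      url = queue.shift
      yield url
      completed << url
    rescue ActionsSketch::Paused
      return :paused
    rescue ActionsSketch::Action
      # ignored, continue with the next URL
    end
  end

  :finished
end

queue     = %w[/a /b /c /d]
completed = []

state = run_sketch(queue, completed) do |url|
  raise ActionsSketch::SkipPage if url == '/b'
  raise ActionsSketch::Paused   if url == '/c'
end
```

The order of the rescue clauses matters: the more specific `Paused` has to come before its parent class, otherwise the pause would be swallowed too.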

#running? ⇒ `Boolean`

Determines if the agent is running.

Returns:

(Boolean) —
Specifies whether the agent is running or stopped.


# File 'lib/spidr/agent.rb', line 515

def running?
  @running == true
end

#sanitize_url(url) ⇒ `URI::HTTP`, `URI::HTTPS`

Sanitizes a URL based on filtering options.

Parameters:

url (URI::HTTP, URI::HTTPS, String) —
The URL to be sanitized.

Returns:

(URI::HTTP, URI::HTTPS) —
The new sanitized URL.

Since:

0.2.2

# File 'lib/spidr/agent/sanitizers.rb', line 25

def sanitize_url(url)
  url = URI(url)

  url.fragment = nil if @strip_fragments
  url.query    = nil if @strip_query

  return url
end
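
The effect of the two flags can be seen with the standard `uri` library alone. `sanitize_sketch` is a hypothetical stand-alone version that takes the flags as keyword arguments, with the defaults from `initialize_sanitizers`, instead of reading instance variables.

```ruby
require 'uri'

# Hypothetical stand-alone version of sanitize_url with the two flags
# passed as keyword arguments instead of read from instance variables.
def sanitize_sketch(url, strip_fragments: true, strip_query: false)
  url = URI(url)

  url.fragment = nil if strip_fragments
  url.query    = nil if strip_query

  url
end

default  = sanitize_sketch('http://example.com/page?id=1#top')
stripped = sanitize_sketch('http://example.com/page?id=1#top', strip_query: true)

# URI() returns an existing URI object unchanged, so it is modified in place.
original = URI('http://example.com/page#top')
sanitize_sketch(original)
```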

#skip_link! ⇒ `Object`

Causes the agent to skip the link being enqueued.

Raises:

(SkipLink) —
Indicates to the agent that the current link should be skipped and not enqueued or visited.


# File 'lib/spidr/agent/actions.rb', line 85

def skip_link!
  raise(Actions::SkipLink)
end

#skip_page! ⇒ `Object`

Causes the agent to skip the page being visited.

Raises:

(SkipPage) —
Indicates to the agent that the current page should be skipped.


# File 'lib/spidr/agent/actions.rb', line 95

def skip_page!
  raise(Actions::SkipPage)
end

#start_at(url) {|page| ... } ⇒ `Object`

Start spidering at a given URL.

Parameters:

url (URI::HTTP, String) —
The URL to start spidering at.

Yields:

(page) —
If a block is given, it will be passed every page visited.

Yield Parameters:

page (Page) —
A page which has been visited.

# File 'lib/spidr/agent.rb', line 477

def start_at(url,&block)
  enqueue(url)
  return run(&block)
end

#to_hash ⇒ `Hash`

Converts the agent into a Hash.

Returns:

(Hash) —
The agent represented as a Hash containing the history and the queue of the agent.


# File 'lib/spidr/agent.rb', line 819

def to_hash
  {history: @history, queue: @queue}
end

#urls_like(pattern, &block) ⇒ `Object`

See Also:

#every_url_like


# File 'lib/spidr/agent/events.rb', line 56

def urls_like(pattern,&block)
  every_url_like(pattern,&block)
end

#visit?(url) ⇒ `Boolean` (protected)

Determines if a given URL should be visited.

Parameters:

url (URI::HTTP) —
The URL in question.

Returns:

(Boolean) —
Specifies whether the given URL should be visited.

# File 'lib/spidr/agent.rb', line 946

def visit?(url)
  !visited?(url) &&
   visit_scheme?(url.scheme) &&
   visit_host?(url.host) &&
   visit_port?(url.port) &&
   visit_link?(url.to_s) &&
   visit_url?(url) &&
   visit_ext?(url.path) &&
   robot_allowed?(url.to_s)
end

#visit_ext?(path) ⇒ `Boolean` (protected)

Determines if a given URI path extension should be visited.

Parameters:

path (String) —
The path that contains the extension.

Returns:

(Boolean) —
Specifies whether the given URI path extension should be visited.


# File 'lib/spidr/agent/filters.rb', line 525

def visit_ext?(path)
  @ext_rules.accept?(File.extname(path)[1..-1])
end
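
The value handed to the extension rules is the extension without its leading dot, or nil when the path has none. A quick check of the expression on a few illustrative paths:

```ruby
# The expression from visit_ext?, applied to a few sample paths.
ext_of = ->(path) { File.extname(path)[1..-1] }

html    = ext_of.call('/docs/index.html')
archive = ext_of.call('/files/backup.tar.gz')
none    = ext_of.call('/docs/')
```

Only the last extension is considered, so a rule for `tar.gz` would have to be written against `gz`.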

#visit_exts ⇒ `Array<String, Regexp, Proc>`

Specifies the patterns that match the URI path extensions to visit.

Returns:

(Array<String, Regexp, Proc>) —
The URI path extension patterns to visit.


# File 'lib/spidr/agent/filters.rb', line 298

def visit_exts
  @ext_rules.accept
end

#visit_exts_like(pattern = nil) {|ext| ... } ⇒ `Object`

Adds a given pattern to the #visit_exts.

Parameters:

pattern (String, Regexp) (defaults to: nil) —
The pattern to match URI path extensions with.

Yields:

(ext) —
If a block is given, it will be used to filter URI path extensions.

Yield Parameters:

ext (String) —
A URI path extension to accept or reject.

# File 'lib/spidr/agent/filters.rb', line 314

def visit_exts_like(pattern=nil,&block)
  if pattern
    visit_exts << pattern
  elsif block
    visit_exts << block
  end

  return self
end

#visit_host?(host) ⇒ `Boolean` (protected)

Determines if a given host-name should be visited.

Parameters:

host (String) —
The host-name.

Returns:

(Boolean) —
Specifies whether the given host-name should be visited.


# File 'lib/spidr/agent/filters.rb', line 471

def visit_host?(host)
  @host_rules.accept?(host)
end

#visit_hosts ⇒ `Array<String, Regexp, Proc>`

Specifies the patterns that match host-names to visit.

Returns:

(Array<String, Regexp, Proc>) —
The host-name patterns to visit.


# File 'lib/spidr/agent/filters.rb', line 30

def visit_hosts
  @host_rules.accept
end

#visit_hosts_like(pattern = nil) {|host| ... } ⇒ `Object`

Adds a given pattern to the #visit_hosts.

Parameters:

pattern (String, Regexp) (defaults to: nil) —
The pattern to match host-names with.

Yields:

(host) —
If a block is given, it will be used to filter host-names.

Yield Parameters:

host (String) —
A host-name to accept or reject.

# File 'lib/spidr/agent/filters.rb', line 46

def visit_hosts_like(pattern=nil,&block)
  if pattern
    visit_hosts << pattern
  elsif block
    visit_hosts << block
  end

  return self
end

#visit_link?(link) ⇒ `Boolean` (protected)

Determines if a given link should be visited.

Parameters:

link (String) —
The link.

Returns:

(Boolean) —
Specifies whether the given link should be visited.


# File 'lib/spidr/agent/filters.rb', line 497

def visit_link?(link)
  @link_rules.accept?(link)
end

#visit_links ⇒ `Array<String, Regexp, Proc>`

Specifies the patterns that match the links to visit.

Returns:

(Array<String, Regexp, Proc>) —
The link patterns to visit.

Since:

0.2.4


# File 'lib/spidr/agent/filters.rb', line 160

def visit_links
  @link_rules.accept
end

#visit_links_like(pattern = nil) {|link| ... } ⇒ `Object`

Adds a given pattern to the #visit_links.

Parameters:

pattern (String, Regexp) (defaults to: nil) —
The pattern to match links with.

Yields:

(link) —
If a block is given, it will be used to filter links.

Yield Parameters:

link (String) —
A link to accept or reject.

Since:

0.2.4

# File 'lib/spidr/agent/filters.rb', line 178

def visit_links_like(pattern=nil,&block)
  if pattern
    visit_links << pattern
  elsif block
    visit_links << block
  end

  return self
end

#visit_page(url) {|page| ... } ⇒ `Page`, `nil`

Visits a given URL, and enqueues the links recovered from the URL to be visited later.

Parameters:

url (URI::HTTP, String) —
The URL to visit.

Yields:

(page) —
If a block is given, it will be passed the page which was visited.

Yield Parameters:

page (Page) —
The page which was visited.

Returns:

(Page, nil) —
The page that was visited. If nil is returned, either the request for the page failed, or the page was skipped.

# File 'lib/spidr/agent.rb', line 776

def visit_page(url)
  url = sanitize_url(url)

  get_page(url) do |page|
    @history << page.url

    begin
      @every_page_blocks.each { |page_block| page_block.call(page) }

      yield page if block_given?
    rescue Actions::Paused => action
      raise(action)
    rescue Actions::SkipPage
      return nil
    rescue Actions::Action
    end

    page.each_url do |next_url|
      begin
        @every_link_blocks.each do |link_block|
          link_block.call(page.url,next_url)
        end
      rescue Actions::Paused => action
        raise(action)
      rescue Actions::SkipLink
        next
      rescue Actions::Action
      end

      if (@max_depth.nil? || @max_depth > @levels[url])
        enqueue(next_url,@levels[url] + 1)
      end
    end
  end
end
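
The depth guard at the end of `visit_page` only enqueues links while the current page sits below `max_depth`. The sketch below runs that guard over an in-memory link graph. `LINK_GRAPH`, `crawl_sketch`, and the way depths are recorded on enqueue are assumptions for illustration, since `enqueue` is not shown in this section.

```ruby
# Hypothetical crawl over an in-memory link graph; no network access.
LINK_GRAPH = {
  '/'    => ['/a', '/b'],
  '/a'   => ['/a/1'],
  '/a/1' => ['/a/1/x'],
  '/b'   => []
}.freeze

def crawl_sketch(start, max_depth)
  levels  = Hash.new(0) # depth of each URL; the start URL is depth 0
  queue   = [start]
  history = []

  until queue.empty?
    url = queue.shift
    history << url

    LINK_GRAPH.fetch(url, []).each do |next_url|
      next if history.include?(next_url) || queue.include?(next_url)

      # Same guard as visit_page: only enqueue while below max_depth.
      if max_depth.nil? || max_depth > levels[url]
        levels[next_url] = levels[url] + 1
        queue << next_url
      end
    end
  end

  history
end

shallow   = crawl_sketch('/', 1)
unlimited = crawl_sketch('/', nil)
```

With a `max_depth` of 1, pages one link away from the start are visited but their own links are not followed.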

#visit_port?(port) ⇒ `Boolean` (protected)

Determines if a given port should be visited.

Parameters:

port (Integer) —
The port number.

Returns:

(Boolean) —
Specifies whether the given port should be visited.


# File 'lib/spidr/agent/filters.rb', line 484

def visit_port?(port)
  @port_rules.accept?(port)
end

#visit_ports ⇒ `Array<Integer, Regexp, Proc>`

Specifies the patterns that match the ports to visit.

Returns:

(Array<Integer, Regexp, Proc>) —
The port patterns to visit.


# File 'lib/spidr/agent/filters.rb', line 94

def visit_ports
  @port_rules.accept
end

#visit_ports_like(pattern = nil) {|port| ... } ⇒ `Object`

Adds a given pattern to the #visit_ports.

Parameters:

pattern (Integer, Regexp) (defaults to: nil) —
The pattern to match ports with.

Yields:

(port) —
If a block is given, it will be used to filter ports.

Yield Parameters:

port (Integer) —
A port to accept or reject.

# File 'lib/spidr/agent/filters.rb', line 110

def visit_ports_like(pattern=nil,&block)
  if pattern
    visit_ports << pattern
  elsif block
    visit_ports << block
  end

  return self
end

#visit_scheme?(scheme) ⇒ `Boolean` (protected)

Determines if a given URI scheme should be visited.

Parameters:

scheme (String) —
The URI scheme.

Returns:

(Boolean) —
Specifies whether the given scheme should be visited.

# File 'lib/spidr/agent/filters.rb', line 454

def visit_scheme?(scheme)
  if scheme
    @schemes.include?(scheme)
  else
    true
  end
end

#visit_url?(link) ⇒ `Boolean` (protected)

Determines if a given URL should be visited.

Parameters:

link (URI::HTTP, URI::HTTPS) —
The URL.

Returns:

(Boolean) —
Specifies whether the given URL should be visited.

Since:

0.2.4


# File 'lib/spidr/agent/filters.rb', line 512

def visit_url?(link)
  @url_rules.accept?(link)
end

#visit_urls ⇒ `Array<String, Regexp, Proc>`

Specifies the patterns that match the URLs to visit.

Returns:

(Array<String, Regexp, Proc>) —
The link patterns to visit.

Since:

0.2.4


# File 'lib/spidr/agent/filters.rb', line 228

def visit_urls
  @url_rules.accept
end

#visit_urls_like(pattern = nil) {|url| ... } ⇒ `Object`

Adds a given pattern to the #visit_urls.

Parameters:

pattern (String, Regexp) (defaults to: nil) —
The pattern to match URLs with.

Yields:

(url) —
If a block is given, it will be used to filter URLs.

Yield Parameters:

url (URI::HTTP, URI::HTTPS) —
A URL to accept or reject.

Since:

0.2.4

# File 'lib/spidr/agent/filters.rb', line 246

def visit_urls_like(pattern=nil,&block)
  if pattern
    visit_urls << pattern
  elsif block
    visit_urls << block
  end

  return self
end
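
Because the listing uses `if pattern ... elsif block`, a call that passes both a pattern and a block stores only the pattern, and the block is silently dropped. The sketch below reproduces just that branch in a hypothetical `UrlFilterSketch` class (not part of Spidr) to make the precedence visible.

```ruby
# Hypothetical stand-in that mirrors only the storage logic shown above.
class UrlFilterSketch
  attr_reader :visit_urls

  def initialize
    @visit_urls = []
  end

  def visit_urls_like(pattern = nil, &block)
    if pattern
      @visit_urls << pattern
    elsif block
      @visit_urls << block
    end

    self
  end
end

filter = UrlFilterSketch.new

# Both a pattern and a block: only the Regexp is stored.
filter.visit_urls_like(%r{/blog/}) { |url| url.path.end_with?('.html') }

# Block only: the block is stored as a Proc.
filter.visit_urls_like { |url| url.query.nil? }

p filter.visit_urls.map(&:class) # => [Regexp, Proc]
```

To apply both a pattern and a block, register them in two separate calls.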

#visited?(url) ⇒ `Boolean`

Determines whether a URL was visited or not.

Parameters:

url (URI::HTTP, String) —
The URL to search for.

Returns:

(Boolean) —
Specifies whether a URL was visited.


# File 'lib/spidr/agent.rb', line 572

def visited?(url)
  @history.include?(URI(url))
end
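
The `URI(url)` call is what lets the method take either a String or a URI object: Ruby's `Kernel#URI` parses a String and returns an existing URI unchanged. A minimal sketch using only the standard library, with a hand-built `history` Set standing in for the agent's (the URLs are made up for illustration):

```ruby
require 'set'
require 'uri'

# Stand-in for the agent's history: a Set of URI objects.
history = Set[URI('http://example.com/page')]

# Same lookup as the method above, with the history passed in explicitly.
def visited?(history, url)
  history.include?(URI(url))
end

puts visited?(history, 'http://example.com/page')      # String input
puts visited?(history, URI('http://example.com/page')) # URI input
puts visited?(history, 'http://example.com/other')     # not in the history
```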

#visited_hosts ⇒ `Array<String>`

Specifies all hosts that were visited.

Returns:

(Array<String>) —
The hosts which have been visited.


# File 'lib/spidr/agent.rb', line 559

def visited_hosts
  visited_urls.map(&:host).uniq
end

#visited_links ⇒ `Array<String>`

Specifies the links which have been visited.

Returns:

(Array<String>) —
The links which have been visited.


# File 'lib/spidr/agent.rb', line 549

def visited_links
  @history.map(&:to_s)
end
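
Both #visited_hosts and #visited_links are thin projections of the history Set. The following sketch applies the same two expressions to a hand-built Set of URIs; the URLs are illustrative and no agent is involved.

```ruby
require 'set'
require 'uri'

# Stand-in for the agent's history (also exposed as visited_urls).
history = Set[
  URI('http://example.com/'),
  URI('http://example.com/about'),
  URI('http://blog.example.com/')
]

# Same expressions as the two methods above.
visited_hosts = history.map(&:host).uniq
visited_links = history.map(&:to_s)

p visited_hosts # => ["example.com", "blog.example.com"]
p visited_links
```

The `uniq` call matters for hosts because several visited URLs usually share one host, whereas the links are already unique by virtue of being stored in a Set.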

#authorized ⇒ `AuthStore`

#cookies ⇒ `CookieJar` (readonly)

#default_headers ⇒ `Hash{String => String}` (readonly)

#delay ⇒ `Integer`

#failures ⇒ `Set<URI::HTTP>`

#history ⇒ `Set<URI::HTTP>` Also known as: visited_urls

#host_header ⇒ `String`

#host_headers ⇒ `Hash{String,Regexp => String}` (readonly)

#levels ⇒ `Hash{URI::HTTP => Integer}` (readonly)

#limit ⇒ `Integer` (readonly)

#max_depth ⇒ `Integer` (readonly)

#queue ⇒ `Array<URI::HTTP>` Also known as: pending_urls

#referer ⇒ `String`

#schemes ⇒ `Object`

#sessions ⇒ `SessionCache` (readonly)

#strip_fragments ⇒ `Object`

#strip_query ⇒ `Object`

.default_schemes ⇒ `Array<String>` (protected)

.domain(name, **kwargs) {|agent| ... } ⇒ `Agent`

.host(name, **kwargs) {|agent| ... } ⇒ `Agent`

.site(url, **kwargs) {|agent| ... } ⇒ `Agent`

.start_at(url, **kwargs) {|agent| ... } ⇒ `Agent`

#all_headers {|headers| ... } ⇒ `Object`

#clear ⇒ `Object`

#continue! {|page| ... } ⇒ `Object`

#dequeue ⇒ `URI::HTTP` (protected)

#enqueue(url, level = 0) ⇒ `Boolean`

#every_atom_doc {|doc| ... } ⇒ `Object`

#every_atom_page {|feed| ... } ⇒ `Object`

#every_bad_request_page {|page| ... } ⇒ `Object`

#every_css_page {|page| ... } ⇒ `Object`

#every_doc {|doc| ... } ⇒ `Object`

#every_failed_url {|url| ... } ⇒ `Object`

#every_forbidden_page {|page| ... } ⇒ `Object`

#every_html_doc {|doc| ... } ⇒ `Object`

#every_html_page {|page| ... } ⇒ `Object`

#every_internal_server_error_page {|page| ... } ⇒ `Object`

#every_javascript_page {|page| ... } ⇒ `Object`

#every_link {|origin, dest| ... } ⇒ `Object`

#every_missing_page {|page| ... } ⇒ `Object`

#every_ms_word_page {|page| ... } ⇒ `Object`

#every_ok_page {|page| ... } ⇒ `Object`

#every_page {|page| ... } ⇒ `Object`

#every_pdf_page {|page| ... } ⇒ `Object`

#every_redirect_page {|page| ... } ⇒ `Object`

#every_rss_doc {|doc| ... } ⇒ `Object`

#every_rss_page {|feed| ... } ⇒ `Object`

#every_timedout_page {|page| ... } ⇒ `Object`

#every_txt_page {|page| ... } ⇒ `Object`

#every_unauthorized_page {|page| ... } ⇒ `Object`

#every_url {|url| ... } ⇒ `Object`

#every_url_like(pattern) {|url| ... } ⇒ `Object`

#every_xml_doc {|doc| ... } ⇒ `Object`

#every_xml_page {|page| ... } ⇒ `Object`

#every_xsl_doc {|doc| ... } ⇒ `Object`

#every_xsl_page {|page| ... } ⇒ `Object`

#every_zip_page {|page| ... } ⇒ `Object`

#failed(url) ⇒ `Object` (protected)

#failed?(url) ⇒ `Boolean`

#get_page(url) {|page| ... } ⇒ `Page?`

#ignore_exts ⇒ `Array<String, Regexp, Proc>`

#ignore_exts_like(pattern = nil) {|ext| ... } ⇒ `Object`

#ignore_hosts ⇒ `Array<String, Regexp, Proc>`

#ignore_hosts_like(pattern = nil) {|host| ... } ⇒ `Object`

#ignore_links ⇒ `Array<String, Regexp, Proc>`

#ignore_links_like(pattern = nil) {|link| ... } ⇒ `Object`

#ignore_ports ⇒ `Array<Integer, Regexp, Proc>`

#ignore_ports_like(pattern = nil) {|port| ... } ⇒ `Object`

#ignore_urls ⇒ `Array<String, Regexp, Proc>`

#ignore_urls_like(pattern = nil) {|url| ... } ⇒ `Object`

#initialize_actions ⇒ `Object` (protected)

#initialize_events ⇒ `Object` (protected)

#initialize_filters(schemes: self.class.default_schemes, host: nil, hosts: nil, ignore_hosts: nil, ports: nil, ignore_ports: nil, links: nil, ignore_links: nil, urls: nil, ignore_urls: nil, exts: nil, ignore_exts: nil) ⇒ `Object` (protected)

#initialize_robots ⇒ `Object`

#initialize_sanitizers(strip_fragments: true, strip_query: false) ⇒ `Object` (protected)

#limit_reached? ⇒ `Boolean` (protected)

#pause! ⇒ `Object`

#pause=(state) ⇒ `Object`

#paused? ⇒ `Boolean`

#post_page(url, post_data = '') {|page| ... } ⇒ `Page?`