Class: Spidr::Agent

Inherits:
Object
Includes:
Settings::UserAgent
Defined in:
lib/spidr/agent.rb,
lib/spidr/agent/events.rb,
lib/spidr/agent/robots.rb,
lib/spidr/agent/actions.rb,
lib/spidr/agent/filters.rb,
lib/spidr/agent/sanitizers.rb

Defined Under Namespace

Modules: Actions

Instance Attribute Summary

Attributes included from Settings::UserAgent

#user_agent

Constructor Details

#initialize(host_header: nil, host_headers: {}, default_headers: {}, user_agent: Spidr.user_agent, referer: nil, proxy: Spidr.proxy, open_timeout: Spidr.open_timeout, ssl_timeout: Spidr.ssl_timeout, read_timeout: Spidr.read_timeout, continue_timeout: Spidr.continue_timeout, keep_alive_timeout: Spidr.keep_alive_timeout, delay: 0, limit: nil, max_depth: nil, queue: nil, history: nil, strip_fragments: true, strip_query: false, schemes: self.class.default_schemes, host: nil, hosts: nil, ignore_hosts: nil, ports: nil, ignore_ports: nil, links: nil, ignore_links: nil, urls: nil, ignore_urls: nil, exts: nil, ignore_exts: nil, robots: Spidr.robots?) {|agent| ... } ⇒ Agent

Creates a new Agent object.

Parameters:

  • host_header (String, nil) (defaults to: nil)

    The HTTP Host header to use with each request.

  • host_headers (Hash{String,Regexp => String}) (defaults to: {})

    The HTTP Host headers to use for specific hosts.

  • default_headers (Hash{String => String}) (defaults to: {})

    Default headers to set for every request.

  • user_agent (String, nil) (defaults to: Spidr.user_agent)

    The User-Agent string to send with each request.

  • referer (String, nil) (defaults to: nil)

    The Referer URL to send with each request.

  • open_timeout (Integer, nil) (defaults to: Spidr.open_timeout)

    Optional open connection timeout.

  • read_timeout (Integer, nil) (defaults to: Spidr.read_timeout)

    Optional read timeout.

  • ssl_timeout (Integer, nil) (defaults to: Spidr.ssl_timeout)

    Optional SSL connection timeout.

  • continue_timeout (Integer, nil) (defaults to: Spidr.continue_timeout)

    Optional continue timeout.

  • keep_alive_timeout (Integer, nil) (defaults to: Spidr.keep_alive_timeout)

    Optional Keep-Alive timeout.

  • proxy (Spidr::Proxy, Hash, URI::HTTP, String, nil) (defaults to: Spidr.proxy)

    The proxy information to use.

  • delay (Integer) (defaults to: 0)

    The number of seconds to pause between each request.

  • limit (Integer, nil) (defaults to: nil)

    The maximum number of pages to visit.

  • max_depth (Integer, nil) (defaults to: nil)

    The maximum link depth to follow.

  • queue (Set, Array, nil) (defaults to: nil)

    The initial queue of URLs to visit.

  • history (Set, Array, nil) (defaults to: nil)

    The initial list of visited URLs.

  • strip_fragments (Boolean) (defaults to: true)

    Controls whether to strip the fragment components from the URLs.

  • strip_query (Boolean) (defaults to: false)

    Controls whether to strip the query components from the URLs.

  • schemes (Array<String>) (defaults to: self.class.default_schemes)

    The list of acceptable URI schemes to visit. The https scheme will be ignored if net/https cannot be loaded.

  • host (String) (defaults to: nil)

    The host-name to visit.

  • hosts (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the host-names to visit.

  • ignore_hosts (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the host-names to not visit.

  • ports (Array<Integer, Regexp, Proc>) (defaults to: nil)

    The patterns which match the ports to visit.

  • ignore_ports (Array<Integer, Regexp, Proc>) (defaults to: nil)

    The patterns which match the ports to not visit.

  • links (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the links to visit.

  • ignore_links (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the links to not visit.

  • urls (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the URLs to visit.

  • ignore_urls (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the URLs to not visit.

  • exts (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the URI path extensions to visit.

  • ignore_exts (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the URI path extensions to not visit.

  • robots (Boolean) (defaults to: Spidr.robots?)

    Specifies whether robots.txt should be honored.

Options Hash (proxy:):

  • :host (String)

    The host the proxy is running on.

  • :port (Integer) — default: 8080

    The port the proxy is running on.

  • :user (String, nil)

    The user to authenticate as with the proxy.

  • :password (String, nil)

    The password to authenticate with.

Yields:

  • (agent)

    If a block is given, it will be passed the newly created agent for further configuration.

Yield Parameters:

  • agent (Agent)

    The newly created agent.



# File 'lib/spidr/agent.rb', line 214

def initialize(# header keyword arguments
               host_header:        nil,
               host_headers:       {},
               default_headers:    {},
               user_agent:         Spidr.user_agent,
               referer:            nil,
               # session cache keyword arguments
               proxy:              Spidr.proxy,
               open_timeout:       Spidr.open_timeout,
               ssl_timeout:        Spidr.ssl_timeout,
               read_timeout:       Spidr.read_timeout,
               continue_timeout:   Spidr.continue_timeout,
               keep_alive_timeout: Spidr.keep_alive_timeout,
               # spidering controls keyword arguments
               delay:     0,
               limit:     nil,
               max_depth: nil,
               # history keyword arguments
               queue:   nil,
               history: nil,
               # sanitizer keyword arguments
               strip_fragments: true,
               strip_query:     false,
               # filtering keyword arguments
               schemes:      self.class.default_schemes,
               host:         nil,
               hosts:        nil,
               ignore_hosts: nil,
               ports:        nil,
               ignore_ports: nil,
               links:        nil,
               ignore_links: nil,
               urls:         nil,
               ignore_urls:  nil,
               exts:         nil,
               ignore_exts:  nil,
               # robots keyword arguments
               robots:       Spidr.robots?)
  @host_header  = host_header
  @host_headers = host_headers

  @default_headers = default_headers

  @user_agent = user_agent
  @referer    = referer

  @sessions   = SessionCache.new(
    proxy:              proxy,
    open_timeout:       open_timeout,
    ssl_timeout:        ssl_timeout,
    read_timeout:       read_timeout,
    continue_timeout:   continue_timeout,
    keep_alive_timeout: keep_alive_timeout
  )
  @cookies    = CookieJar.new
  @authorized = AuthStore.new

  @running  = false
  @delay    = delay
  @history  = Set[]
  @failures = Set[]
  @queue    = []

  @limit     = limit
  @levels    = Hash.new(0)
  @max_depth = max_depth

  self.queue   = queue   if queue
  self.history = history if history

  initialize_sanitizers(
    strip_fragments: strip_fragments,
    strip_query:     strip_query
  )

  initialize_filters(
    schemes:      schemes,
    host:         host,
    hosts:        hosts,
    ignore_hosts: ignore_hosts,
    ports:        ports,
    ignore_ports: ignore_ports,
    links:        links,
    ignore_links: ignore_links,
    urls:         urls,
    ignore_urls:  ignore_urls,
    exts:         exts,
    ignore_exts:  ignore_exts
  )
  initialize_actions
  initialize_events

  initialize_robots if robots

  yield self if block_given?
end
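
The constructor above stores its keyword arguments and then yields the new object to an optional block. The stand-alone sketch below models only that keyword-plus-block pattern. `SketchAgent` and its three options are hypothetical stand-ins chosen for illustration; this is not Spidr's implementation.

```ruby
require 'set'

# Hypothetical stand-in that mirrors only the constructor pattern:
# keyword arguments with defaults, then an optional configuration block.
class SketchAgent
  attr_accessor :delay, :limit
  attr_reader :history

  def initialize(delay: 0, limit: nil, history: nil)
    @delay   = delay
    @limit   = limit
    @history = Set[]

    # Seed the history only when the caller supplied one.
    @history.merge(history) if history

    # Hand the new object to the caller for further configuration.
    yield self if block_given?
  end
end

agent = SketchAgent.new(limit: 10, history: ['http://example.com/']) do |a|
  a.delay = 2
end
```

Values set inside the block override the keyword defaults because the block runs last, which matches the position of `yield self` at the end of the listing.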

Instance Attribute Details

#authorized ⇒ AuthStore

HTTP Authentication credentials

Returns:

  • (AuthStore)

# File 'lib/spidr/agent.rb', line 44

def authorized
  @authorized
end

#cookies ⇒ CookieJar (readonly)

Cached cookies

Returns:

  • (CookieJar)

# File 'lib/spidr/agent.rb', line 81

def cookies
  @cookies
end

#default_headers ⇒ Hash{String => String} (readonly)

HTTP Headers to use for every request

Returns:

  • (Hash{String => String})

Since:

  • 0.6.0



# File 'lib/spidr/agent.rb', line 39

def default_headers
  @default_headers
end

#delay ⇒ Integer

Delay in between fetching pages

Returns:

  • (Integer)


# File 'lib/spidr/agent.rb', line 54

def delay
  @delay
end

#failures ⇒ Set<URI::HTTP>

List of unreachable URLs

Returns:

  • (Set<URI::HTTP>)


# File 'lib/spidr/agent.rb', line 64

def failures
  @failures
end

#history ⇒ Set<URI::HTTP> Also known as: visited_urls

History containing visited URLs

Returns:

  • (Set<URI::HTTP>)


# File 'lib/spidr/agent.rb', line 59

def history
  @history
end

#host_header ⇒ String

HTTP Host Header to use

Returns:

  • (String)


# File 'lib/spidr/agent.rb', line 27

def host_header
  @host_header
end

#host_headers ⇒ Hash{String,Regexp => String} (readonly)

HTTP Host Headers to use for specific hosts

Returns:

  • (Hash{String,Regexp => String})


# File 'lib/spidr/agent.rb', line 32

def host_headers
  @host_headers
end

#levels ⇒ Hash{URI::HTTP => Integer} (readonly)

The visited URLs and their depth within a site

Returns:

  • (Hash{URI::HTTP => Integer})


# File 'lib/spidr/agent.rb', line 96

def levels
  @levels
end

#limit ⇒ Integer (readonly)

Maximum number of pages to visit.

Returns:

  • (Integer)


# File 'lib/spidr/agent.rb', line 86

def limit
  @limit
end

#max_depth ⇒ Integer (readonly)

Maximum depth

Returns:

  • (Integer)


# File 'lib/spidr/agent.rb', line 91

def max_depth
  @max_depth
end

#queue ⇒ Array<URI::HTTP> Also known as: pending_urls

Queue of URLs to visit

Returns:

  • (Array<URI::HTTP>)


# File 'lib/spidr/agent.rb', line 69

def queue
  @queue
end

#referer ⇒ String

Referer to use

Returns:

  • (String)


# File 'lib/spidr/agent.rb', line 49

def referer
  @referer
end

#schemes ⇒ Object

List of acceptable URL schemes to follow



# File 'lib/spidr/agent/filters.rb', line 9

def schemes
  @schemes
end

#sessions ⇒ SessionCache (readonly)

The session cache

Returns:

  • (SessionCache)

Since:

  • 0.6.0



# File 'lib/spidr/agent.rb', line 76

def sessions
  @sessions
end

#strip_fragments ⇒ Object

Specifies whether the Agent will strip URI fragments



# File 'lib/spidr/agent/sanitizers.rb', line 9

def strip_fragments
  @strip_fragments
end

#strip_query ⇒ Object

Specifies whether the Agent will strip URI queries



# File 'lib/spidr/agent/sanitizers.rb', line 12

def strip_query
  @strip_query
end

Class Method Details

.default_schemes ⇒ Array<String> (protected)

Determines the default URI schemes to follow.

Returns:

  • (Array<String>)

    The default URI schemes to follow.

Since:

  • 0.6.2



# File 'lib/spidr/agent/filters.rb', line 429

def self.default_schemes
  schemes = ['http']

  begin
    require 'net/https'

    schemes << 'https'
  rescue Gem::LoadError => e
    raise(e)
  rescue ::LoadError
    warn "Warning: cannot load 'net/https', https support disabled"
  end

  return schemes
end
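
The listing above treats `net/https` as an optional dependency: a plain `LoadError` only disables https, while a `Gem::LoadError` is re-raised. A minimal sketch of the same rescue ordering follows. The helper name `schemes_with_optional` and its `library` parameter are hypothetical, added only so the fallback path can be exercised with a library name that does not exist.

```ruby
# Hypothetical helper: returns the schemes available when an optional
# library may or may not be loadable.
def schemes_with_optional(library)
  schemes = ['http']

  begin
    require library
    schemes << 'https'
  rescue Gem::LoadError
    # A failed gem activation is a real error, so let it propagate.
    # This clause must come first because Gem::LoadError is a LoadError.
    raise
  rescue ::LoadError
    warn "Warning: cannot load '#{library}', https support disabled"
  end

  schemes
end

available = schemes_with_optional('net/https')
fallback  = schemes_with_optional('spidr_sketch_missing_library')
```

With a library that cannot be found, the helper prints the warning to standard error and returns only `['http']`.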

.domain(name, **kwargs) {|agent| ... } ⇒ Agent

Creates a new agent and spiders the entire domain.

Parameters:

  • name (String)

    The top-level domain to spider.

  • kwargs (Hash{Symbol => Object})

    Additional keyword arguments. See #initialize.

Yields:

  • (agent)

    If a block is given, it will be passed the newly created agent before it begins spidering.

Yield Parameters:

  • agent (Agent)

    The newly created agent.

Returns:

  • (Agent)

    The created agent object.

Since:

  • 0.7.0



# File 'lib/spidr/agent.rb', line 418

def self.domain(name,**kwargs,&block)
  agent = new(host: /(^|\.)#{Regexp.escape(name)}$/, **kwargs, &block)
  agent.start_at(URI::HTTP.build(host: name, path: '/'))
  return agent
end
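
The host pattern built in the listing above accepts the bare domain and any of its subdomains, and the start URL is built with `URI::HTTP.build`. The sketch below reproduces both expressions using only the standard library. The domain `example.com` and the candidate host list are illustrative values, not taken from the source.

```ruby
require 'uri'

name = 'example.com' # illustrative domain

# The same pattern the listing passes as host:. The name must be
# preceded by the start of the string or a dot, and must end the host.
domain_pattern = /(^|\.)#{Regexp.escape(name)}$/

hosts = [
  'example.com',
  'www.example.com',
  'notexample.com',
  'example.com.evil.test'
]
matching = hosts.select { |host| host.match?(domain_pattern) }

# The start URL, built the same way as in the listing.
start_url = URI::HTTP.build(host: name, path: '/')
```

`Regexp.escape` matters here: without it, the dot in the domain name would match any character.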

.host(name, **kwargs) {|agent| ... } ⇒ Agent

Creates a new agent and spiders the given host.

Parameters:

  • name (String)

    The host-name to spider.

  • kwargs (Hash{Symbol => Object})

    Additional keyword arguments. See #initialize.

Yields:

  • (agent)

    If a block is given, it will be passed the newly created agent before it begins spidering.

Yield Parameters:

  • agent (Agent)

    The newly created agent.

Returns:

  • (Agent)

    The created agent object.

# File 'lib/spidr/agent.rb', line 389

def self.host(name,**kwargs,&block)
  agent = new(host: name, **kwargs, &block)
  agent.start_at(URI::HTTP.build(host: name, path: '/'))
  return agent
end

.site(url, **kwargs) {|agent| ... } ⇒ Agent

Creates a new agent and spiders the web-site located at the given URL.

Parameters:

  • url (URI::HTTP, String)

    The web-site to spider.

  • kwargs (Hash{Symbol => Object})

    Additional keyword arguments. See #initialize.

Yields:

  • (agent)

    If a block is given, it will be passed the newly created agent before it begins spidering.

Yield Parameters:

  • agent (Agent)

    The newly created agent.

Returns:

  • (Agent)

    The created agent object.

# File 'lib/spidr/agent.rb', line 360

def self.site(url,**kwargs,&block)
  url = URI(url)

  agent = new(host: url.host, **kwargs, &block)
  agent.start_at(url)
  return agent
end
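
Unlike `.host`, which always starts at the root path, `.site` starts at the URL it was given and derives the host filter from it. The sketch below shows only the two standard-library calls involved; the URL is an illustrative value.

```ruby
require 'uri'

# Kernel#URI accepts either a String or an existing URI object,
# which is why the url parameter may be given in both forms.
from_string = URI('http://example.com/blog/')
from_object = URI(from_string)

# The value the listing passes as host: to the constructor.
host_filter = from_string.host
```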

.start_at(url, **kwargs) {|agent| ... } ⇒ Agent

Creates a new agent and begins spidering at the given URL.

Parameters:

  • url (URI::HTTP, String)

    The URL to start spidering at.

  • kwargs (Hash{Symbol => Object})

    Additional keyword arguments. See #initialize.

Yields:

  • (agent)

    If a block is given, it will be passed the newly created agent before it begins spidering.

Yield Parameters:

  • agent (Agent)

    The newly created agent.

Returns:

  • (Agent)

    The created agent object.

# File 'lib/spidr/agent.rb', line 333

def self.start_at(url,**kwargs,&block)
  agent = new(**kwargs,&block)
  agent.start_at(url)
  return agent
end

Instance Method Details

#all_headers {|headers| ... } ⇒ Object

Pass the headers from every response the agent receives to a given block.

Yields:

  • (headers)

    The block will be passed the headers of every response.

Yield Parameters:

  • headers (Hash)

    The headers from a response.



# File 'lib/spidr/agent/events.rb', line 70

def all_headers
  every_page { |page| yield page.headers }
end

#clear ⇒ Object

Clears the queue, history, and failures of the agent.



# File 'lib/spidr/agent.rb', line 458

def clear
  @queue.clear
  @history.clear
  @failures.clear
  return self
end

#continue! {|page| ... } ⇒ Object

Continue spidering.

Yields:

  • (page)

    If a block is given, it will be passed every page visited.

Yield Parameters:

  • page (Page)

    The page to be visited.



# File 'lib/spidr/agent/actions.rb', line 42

def continue!(&block)
  @paused = false
  return run(&block)
end

#dequeue ⇒ URI::HTTP (protected)

Dequeues a URL that will later be visited.

Returns:

  • (URI::HTTP)

    The URL that was at the front of the queue.



# File 'lib/spidr/agent.rb', line 922

def dequeue
  @queue.shift
end
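
Because the queue is a plain Array that is appended to with `<<` and read with `shift`, URLs leave it in first-in, first-out order. A minimal sketch of that behavior, using plain Strings instead of `URI::HTTP` objects for brevity (an assumption made only for this example):

```ruby
queue  = []
levels = Hash.new(0) # unknown URLs report depth 0, as with Hash.new(0) in #initialize

# Enqueueing appends to the back and records the depth.
[['http://example.com/', 0], ['http://example.com/about', 1]].each do |url, level|
  queue << url
  levels[url] = level
end

# What #dequeue does: remove and return the front element.
first = queue.shift
```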

#enqueue(url, level = 0) ⇒ Boolean

Enqueues a given URL for visiting, but only if it is not already queued and passes all of the agent's rules for visiting a URL.

Parameters:

  • url (URI::HTTP, String)

    The URL to enqueue for visiting.

  • level (Integer) (defaults to: 0)

    The link depth to record for the URL.

Returns:

  • (Boolean)

    Specifies whether the URL was enqueued, or ignored.



# File 'lib/spidr/agent.rb', line 658

def enqueue(url,level=0)
  url = sanitize_url(url)

  if (!queued?(url) && visit?(url))
    link = url.to_s

    begin
      @every_url_blocks.each { |url_block| url_block.call(url) }

      @every_url_like_blocks.each do |pattern,url_blocks|
        match = case pattern
                when Regexp
                  link =~ pattern
                else
                  (pattern == link) || (pattern == url)
                end

        if match
          url_blocks.each { |url_block| url_block.call(url) }
        end
      end
    rescue Actions::Paused => action
      raise(action)
    rescue Actions::SkipLink
      return false
    rescue Actions::Action
    end

    @queue << url
    @levels[url] = level
    return true
  end

  return false
end
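
The listing above matches each registered pattern against the URL (a Regexp against the string form, anything else by equality) and lets a callback veto the URL by raising a skip exception. The stand-alone sketch below models just that control flow. `SketchSkipLink` and `run_url_callbacks` are hypothetical names standing in for `Actions::SkipLink` and the body of `#enqueue`; they are not part of Spidr.

```ruby
require 'uri'

# Hypothetical stand-in for the Actions::SkipLink exception.
class SketchSkipLink < StandardError; end

# Returns true when the URL survives every matching callback,
# false when a callback asked to skip it.
def run_url_callbacks(url, callbacks_by_pattern)
  link = url.to_s

  callbacks_by_pattern.each do |pattern, callbacks|
    matched = case pattern
              when Regexp then link =~ pattern
              else (pattern == link) || (pattern == url)
              end

    callbacks.each { |callback| callback.call(url) } if matched
  end

  true
rescue SketchSkipLink
  false
end

seen = []
callbacks = {
  /\.pdf$/               => [->(_url) { raise SketchSkipLink }],
  'http://example.com/a' => [->(url) { seen << url }]
}

kept    = run_url_callbacks(URI('http://example.com/a'), callbacks)
skipped = run_url_callbacks(URI('http://example.com/doc.pdf'), callbacks)
```

Using an exception for the veto lets a callback abort from any depth without every caller having to check a return value.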

#every_atom_doc {|doc| ... } ⇒ Object

Pass every Atom document that the agent parses to a given block.

Yields:

  • (doc)

    The block will be passed every Atom document parsed.

Yield Parameters:

  • doc (Nokogiri::XML::Document)

    A parsed XML document.

# File 'lib/spidr/agent/events.rb', line 389

def every_atom_doc
  every_page do |page|
    if (block_given? && page.atom?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end

#every_atom_page {|feed| ... } ⇒ Object

Pass every Atom feed that the agent visits to a given block.

Yields:

  • (feed)

    The block will be passed every Atom feed visited.

Yield Parameters:

  • feed (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 453

def every_atom_page
  every_page do |page|
    yield page if (block_given? && page.atom?)
  end
end

#every_bad_request_page {|page| ... } ⇒ Object

Pass every Bad Request page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every Bad Request page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 142

def every_bad_request_page
  every_page do |page|
    yield page if (block_given? && page.bad_request?)
  end
end

#every_css_page {|page| ... } ⇒ Object

Pass every CSS page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every CSS page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 423

def every_css_page
  every_page do |page|
    yield page if (block_given? && page.css?)
  end
end

#every_doc {|doc| ... } ⇒ Object

Pass every HTML or XML document that the agent parses to a given block.

Yields:

  • (doc)

    The block will be passed every HTML or XML document parsed.

Yield Parameters:

  • doc (Nokogiri::HTML::Document, Nokogiri::XML::Document)

    A parsed HTML or XML document.

# File 'lib/spidr/agent/events.rb', line 283

def every_doc
  every_page do |page|
    if block_given?
      if (doc = page.doc)
        yield doc
      end
    end
  end
end

#every_failed_url {|url| ... } ⇒ Object

Pass each URL that could not be requested to the given block.

Yields:

  • (url)

    The block will be passed every URL that could not be requested.

Yield Parameters:

  • url (URI::HTTP)

    A failed URL.



# File 'lib/spidr/agent/events.rb', line 28

def every_failed_url(&block)
  @every_failed_url_blocks << block
  return self
end

#every_forbidden_page {|page| ... } ⇒ Object

Pass every Forbidden page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every Forbidden page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 172

def every_forbidden_page
  every_page do |page|
    yield page if (block_given? && page.forbidden?)
  end
end

#every_html_doc {|doc| ... } ⇒ Object

Pass every HTML document that the agent parses to a given block.

Yields:

  • (doc)

    The block will be passed every HTML document parsed.

Yield Parameters:

  • doc (Nokogiri::HTML::Document)

    A parsed HTML document.

# File 'lib/spidr/agent/events.rb', line 304

def every_html_doc
  every_page do |page|
    if (block_given? && page.html?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end

#every_html_page {|page| ... } ⇒ Object

Pass every HTML page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every HTML page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 233

def every_html_page
  every_page do |page|
    yield page if (block_given? && page.html?)
  end
end

#every_internal_server_error_page {|page| ... } ⇒ Object

Pass every Internal Server Error page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every Internal Server Error page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 203

def every_internal_server_error_page
  every_page do |page|
    yield page if (block_given? && page.had_internal_server_error?)
  end
end

#every_javascript_page {|page| ... } ⇒ Object

Pass every JavaScript page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every JavaScript page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 408

def every_javascript_page
  every_page do |page|
    yield page if (block_given? && page.javascript?)
  end
end

#every_link {|origin, dest| ... } ⇒ Object

Passes every origin and destination URI of each link to a given block.

Yields:

  • (origin, dest)

    The block will be passed every origin and destination URI of each link.

Yield Parameters:

  • origin (URI::HTTP)

    The URI that a link originated from.

  • dest (URI::HTTP)

    The destination URI of a link.



# File 'lib/spidr/agent/events.rb', line 518

def every_link(&block)
  @every_link_blocks << block
  return self
end

#every_missing_page {|page| ... } ⇒ Object

Pass every Missing page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every Missing page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 187

def every_missing_page
  every_page do |page|
    yield page if (block_given? && page.missing?)
  end
end

#every_ms_word_page {|page| ... } ⇒ Object

Pass every MS Word page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every MS Word page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 468

def every_ms_word_page
  every_page do |page|
    yield page if (block_given? && page.ms_word?)
  end
end

#every_ok_page {|page| ... } ⇒ Object

Pass every OK page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every OK page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 97

def every_ok_page
  every_page do |page|
    yield page if (block_given? && page.ok?)
  end
end

#every_page {|page| ... } ⇒ Object

Pass every page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 83

def every_page(&block)
  @every_page_blocks << block
  return self
end
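
`#every_page` is the primitive that the filtered `every_*_page` and `every_*_doc` methods build on: each one registers an `every_page` callback guarded by a predicate on the page. The sketch below models that registration pattern in isolation. `SketchPage`, `SketchEvents`, and the `visit` method are hypothetical stand-ins; in particular, how the real agent invokes the stored blocks is not shown in this section and is assumed here.

```ruby
# Hypothetical page with just enough behavior for the guard.
SketchPage = Struct.new(:url, :content_type) do
  def html?
    content_type == 'text/html'
  end
end

class SketchEvents
  def initialize
    @every_page_blocks = []
  end

  # Store the block and return self so calls can be chained.
  def every_page(&block)
    @every_page_blocks << block
    self
  end

  # A filtered event is an every_page callback with a guard.
  def every_html_page
    every_page do |page|
      yield page if block_given? && page.html?
    end
  end

  # Assumed dispatch: invoke every stored callback for a visited page.
  def visit(page)
    @every_page_blocks.each { |block| block.call(page) }
  end
end

all  = []
html = []
events = SketchEvents.new
events.every_page { |page| all << page.url }.every_html_page { |page| html << page.url }

events.visit(SketchPage.new('/index.html', 'text/html'))
events.visit(SketchPage.new('/style.css', 'text/css'))
```

Returning `self` from the registration method is what allows the chained call style used in the last lines.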

#every_pdf_page {|page| ... } ⇒ Object

Pass every PDF page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every PDF page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 483

def every_pdf_page
  every_page do |page|
    yield page if (block_given? && page.pdf?)
  end
end

#every_redirect_page {|page| ... } ⇒ Object

Pass every Redirect page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every Redirect page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 112

def every_redirect_page
  every_page do |page|
    yield page if (block_given? && page.redirect?)
  end
end

#every_rss_doc {|doc| ... } ⇒ Object

Pass every RSS document that the agent parses to a given block.

Yields:

  • (doc)

    The block will be passed every RSS document parsed.

Yield Parameters:

  • doc (Nokogiri::XML::Document)

    A parsed XML document.

# File 'lib/spidr/agent/events.rb', line 368

def every_rss_doc
  every_page do |page|
    if (block_given? && page.rss?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end

#every_rss_page {|feed| ... } ⇒ Object

Pass every RSS feed that the agent visits to a given block.

Yields:

  • (feed)

    The block will be passed every RSS feed visited.

Yield Parameters:

  • feed (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 438

def every_rss_page
  every_page do |page|
    yield page if (block_given? && page.rss?)
  end
end

#every_timedout_page {|page| ... } ⇒ Object

Pass every Timeout page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every Timeout page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 127

def every_timedout_page
  every_page do |page|
    yield page if (block_given? && page.timedout?)
  end
end

#every_txt_page {|page| ... } ⇒ Object

Pass every Plain Text page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every Plain Text page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 218

def every_txt_page
  every_page do |page|
    yield page if (block_given? && page.txt?)
  end
end

#every_unauthorized_page {|page| ... } ⇒ Object

Pass every Unauthorized page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every Unauthorized page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 157

def every_unauthorized_page
  every_page do |page|
    yield page if (block_given? && page.unauthorized?)
  end
end

#every_url {|url| ... } ⇒ Object

Pass each URL from each page visited to the given block.

Yields:

  • (url)

    The block will be passed every URL from every page visited.

Yield Parameters:

  • url (URI::HTTP)

    Each URL from each page visited.



# File 'lib/spidr/agent/events.rb', line 14

def every_url(&block)
  @every_url_blocks << block
  return self
end

#every_url_like(pattern) {|url| ... } ⇒ Object

Pass every URL that the agent visits and that matches a given pattern to a given block.

Parameters:

  • pattern (Regexp, String)

    The pattern to match URLs with.

Yields:

  • (url)

    The block will be passed every URL that matches the given pattern.

Yield Parameters:

  • url (URI::HTTP)

    A matching URL.

Since:

  • 0.3.2



# File 'lib/spidr/agent/events.rb', line 48

def every_url_like(pattern,&block)
  @every_url_like_blocks[pattern] << block
  return self
end
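
The expression `@every_url_like_blocks[pattern] << block` only works on first use if looking up an unknown pattern yields an Array. This section does not show how that hash is initialized, so the sketch below assumes a Hash with a default block, which is one common way to get that behavior. The patterns and callbacks are illustrative.

```ruby
# Assumption: a Hash whose default block creates and stores an empty Array.
registry = Hash.new { |hash, pattern| hash[pattern] = [] }

# Two callbacks under an equal Regexp share one entry.
registry[/\/blog\//] << ->(url) { "blog: #{url}" }
registry[/\/blog\//] << ->(url) { "again: #{url}" }

# A String pattern gets its own entry.
registry['http://example.com/'] << ->(url) { "home: #{url}" }

blog_callbacks = registry[/\/blog\//]
results = blog_callbacks.map { |callback| callback.call('http://example.com/blog/1') }
```

Regexp objects with the same source and options are equal as hash keys, so registering the same pattern twice appends to one list rather than creating a second entry.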

#every_xml_doc {|doc| ... } ⇒ Object

Pass every XML document that the agent parses to a given block.

Yields:

  • (doc)

    The block will be passed every XML document parsed.

Yield Parameters:

  • doc (Nokogiri::XML::Document)

    A parsed XML document.

# File 'lib/spidr/agent/events.rb', line 325

def every_xml_doc
  every_page do |page|
    if (block_given? && page.xml?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end

#every_xml_page {|page| ... } ⇒ Object

Pass every XML page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every XML page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 248

def every_xml_page
  every_page do |page|
    yield page if (block_given? && page.xml?)
  end
end

#every_xsl_doc {|doc| ... } ⇒ Object

Pass every XML Stylesheet (XSL) that the agent parses to a given block.

Yields:

  • (doc)

    The block will be passed every XML Stylesheet (XSL) parsed.

Yield Parameters:

  • doc (Nokogiri::XML::Document)

    A parsed XML document.

# File 'lib/spidr/agent/events.rb', line 347

def every_xsl_doc
  every_page do |page|
    if (block_given? && page.xsl?)
      if (doc = page.doc)
        yield doc
      end
    end
  end
end

#every_xsl_page {|page| ... } ⇒ Object

Pass every XML Stylesheet (XSL) page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every XML Stylesheet (XSL) page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 264

def every_xsl_page
  every_page do |page|
    yield page if (block_given? && page.xsl?)
  end
end

#every_zip_page {|page| ... } ⇒ Object

Pass every ZIP page that the agent visits to a given block.

Yields:

  • (page)

    The block will be passed every ZIP page visited.

Yield Parameters:

  • page (Page)

    A visited page.



# File 'lib/spidr/agent/events.rb', line 498

def every_zip_page
  every_page do |page|
    yield page if (block_given? && page.zip?)
  end
end

#failed(url) ⇒ Object (protected)

Adds a given URL to the failures list.

Parameters:

  • url (URI::HTTP)

    The URL to add to the failures list.



# File 'lib/spidr/agent.rb', line 963

def failed(url)
  @failures << url
  @every_failed_url_blocks.each { |fail_block| fail_block.call(url) }
  return true
end

#failed?(url) ⇒ Boolean

Determines whether a given URL could not be visited.

Parameters:

  • url (URI::HTTP, String)

    The URL to check for failures.

Returns:

  • (Boolean)

    Specifies whether the given URL was unable to be visited.



# File 'lib/spidr/agent.rb', line 607

def failed?(url)
  @failures.include?(URI(url))
end
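
`#failed?` accepts a String or a URI because it normalizes its argument with `URI(url)` before the Set lookup, and URI objects compare by value. A stand-alone sketch of that lookup follows; the helper name `failed_in?` and the URLs are hypothetical.

```ruby
require 'set'
require 'uri'

failures = Set[]
failures << URI('http://example.com/broken')

# Hypothetical helper mirroring the one-line body of #failed?.
def failed_in?(failures, url)
  failures.include?(URI(url))
end

by_string = failed_in?(failures, 'http://example.com/broken')
by_object = failed_in?(failures, URI('http://example.com/broken'))
other     = failed_in?(failures, 'http://example.com/ok')
```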

#get_page(url) {|page| ... } ⇒ Page?

Requests and creates a new Page object from a given URL.

Parameters:

  • url (URI::HTTP)

    The URL to request.

Yields:

  • (page)

    If a block is given, it will be passed the page that represents the response.

Yield Parameters:

  • page (Page)

    The page for the response.

Returns:

  • (Page, nil)

    The page for the response, or nil if the request failed.



# File 'lib/spidr/agent.rb', line 710

def get_page(url)
  url = URI(url)

  prepare_request(url) do |session,path,headers|
    new_page = Page.new(url,session.get(path,headers))

    # save any new cookies
    @cookies.from_page(new_page)

    yield new_page if block_given?
    return new_page
  end
end

#ignore_extsArray<String, Regexp, Proc>

Specifies the patterns that match URI path extensions to not visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The URI path extension patterns to not visit.



# File 'lib/spidr/agent/filters.rb', line 330

def ignore_exts
  @ext_rules.reject
end

#ignore_exts_like(pattern = nil) {|ext| ... } ⇒ Object

Adds a given pattern to the #ignore_exts.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match URI path extensions with.

Yields:

  • (ext)

    If a block is given, it will be used to filter URI path extensions.

Yield Parameters:

  • ext (String)

    A URI path extension to reject or accept.



# File 'lib/spidr/agent/filters.rb', line 346

def ignore_exts_like(pattern=nil,&block)
  if pattern
    ignore_exts << pattern
  elsif block
    ignore_exts << block
  end

  return self
end
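Since the method returns self, calls can be chained, and each call contributes either its pattern or its block. A minimal sketch of that behavior; FilterDemo is a hypothetical stand-in that keeps its patterns in a plain Array.

```ruby
# Hypothetical stand-in; the real agent stores these patterns in its rules object.
class FilterDemo
  attr_reader :ignore_exts

  def initialize
    @ignore_exts = []
  end

  # Same shape as the listing above: a pattern wins over a block.
  def ignore_exts_like(pattern = nil, &block)
    if pattern
      ignore_exts << pattern
    elsif block
      ignore_exts << block
    end

    self
  end
end

demo = FilterDemo.new
demo.ignore_exts_like('pdf')
    .ignore_exts_like(/\Ajpe?g\z/)
    .ignore_exts_like { |ext| ext.length > 4 }
```

Note that when both a pattern and a block are passed in one call, only the pattern is recorded.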

#ignore_hosts ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match host-names to not visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The host-name patterns to not visit.



# File 'lib/spidr/agent/filters.rb', line 62

def ignore_hosts
  @host_rules.reject
end

#ignore_hosts_like(pattern = nil) {|host| ... } ⇒ Object

Adds a given pattern to the #ignore_hosts.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match host-names with.

Yields:

  • (host)

    If a block is given, it will be used to filter host-names.

Yield Parameters:

  • host (String)

    A host-name to reject or accept.



# File 'lib/spidr/agent/filters.rb', line 78

def ignore_hosts_like(pattern=nil,&block)
  if pattern
    ignore_hosts << pattern
  elsif block
    ignore_hosts << block
  end

  return self
end

#ignore_links ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match links to not visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The link patterns to not visit.



# File 'lib/spidr/agent/filters.rb', line 194

def ignore_links
  @link_rules.reject
end

#ignore_links_like(pattern = nil) {|link| ... } ⇒ Object

Adds a given pattern to the #ignore_links.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match links with.

Yields:

  • (link)

    If a block is given, it will be used to filter links.

Yield Parameters:

  • link (String)

    A link to reject or accept.



# File 'lib/spidr/agent/filters.rb', line 210

def ignore_links_like(pattern=nil,&block)
  if pattern
    ignore_links << pattern
  elsif block
    ignore_links << block
  end

  return self
end

#ignore_ports ⇒ Array<Integer, Regexp, Proc>

Specifies the patterns that match ports to not visit.

Returns:

  • (Array<Integer, Regexp, Proc>)

    The port patterns to not visit.



# File 'lib/spidr/agent/filters.rb', line 126

def ignore_ports
  @port_rules.reject
end

#ignore_ports_like(pattern = nil) {|port| ... } ⇒ Object

Adds a given pattern to the #ignore_ports.

Parameters:

  • pattern (Integer, Regexp) (defaults to: nil)

    The pattern to match ports with.

Yields:

  • (port)

    If a block is given, it will be used to filter ports.

Yield Parameters:

  • port (Integer)

    A port to reject or accept.



# File 'lib/spidr/agent/filters.rb', line 142

def ignore_ports_like(pattern=nil,&block)
  if pattern
    ignore_ports << pattern
  elsif block
    ignore_ports << block
  end

  return self
end

#ignore_urls ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match URLs to not visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The URL patterns to not visit.

Since:

  • 0.2.4



# File 'lib/spidr/agent/filters.rb', line 264

def ignore_urls
  @url_rules.reject
end

#ignore_urls_like(pattern = nil) {|url| ... } ⇒ Object

Adds a given pattern to the #ignore_urls.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match URLs with.

Yields:

  • (url)

    If a block is given, it will be used to filter URLs.

Yield Parameters:

  • url (URI::HTTP, URI::HTTPS)

    A URL to reject or accept.

Since:

  • 0.2.4



# File 'lib/spidr/agent/filters.rb', line 282

def ignore_urls_like(pattern=nil,&block)
  if pattern
    ignore_urls << pattern
  elsif block
    ignore_urls << block
  end

  return self
end

#initialize_actions ⇒ Object (protected)



# File 'lib/spidr/agent/actions.rb', line 101

def initialize_actions
  @paused = false
end

#initialize_events ⇒ Object (protected)



# File 'lib/spidr/agent/events.rb', line 525

def initialize_events
  @every_url_blocks        = []
  @every_failed_url_blocks = []
  @every_url_like_blocks   = Hash.new { |hash,key| hash[key] = [] }

  @every_page_blocks = []
  @every_link_blocks = []
end
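The Hash.new block in the listing gives every new key its own empty Array, so callbacks can be appended per pattern without first checking whether the key exists. A minimal sketch of that idiom, using symbols as placeholder callbacks:

```ruby
# Each missing key is assigned a fresh Array on first access.
blocks  = Hash.new { |hash, key| hash[key] = [] }
pattern = /\.pdf\z/

blocks[pattern] << :first_callback
blocks[pattern] << :second_callback

# Reading a missing key also inserts it, so use key? for a side-effect-free check.
known_before = blocks.key?('other')
blocks['other']
known_after = blocks.key?('other')
```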

#initialize_filters(schemes: self.class.default_schemes, host: nil, hosts: nil, ignore_hosts: nil, ports: nil, ignore_ports: nil, links: nil, ignore_links: nil, urls: nil, ignore_urls: nil, exts: nil, ignore_exts: nil) ⇒ Object (protected)

Initializes filtering rules.

Parameters:

  • schemes (Array<String>) (defaults to: self.class.default_schemes)

    The list of acceptable URI schemes to visit. The https scheme will be ignored if net/https cannot be loaded.

  • host (String) (defaults to: nil)

    The host-name to visit.

  • hosts (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the host-names to visit.

  • ignore_hosts (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the host-names to not visit.

  • ports (Array<Integer, Regexp, Proc>) (defaults to: nil)

    The patterns which match the ports to visit.

  • ignore_ports (Array<Integer, Regexp, Proc>) (defaults to: nil)

    The patterns which match the ports to not visit.

  • links (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the links to visit.

  • ignore_links (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the links to not visit.

  • urls (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the URLs to visit.

  • ignore_urls (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the URLs to not visit.

  • exts (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the URI path extensions to visit.

  • ignore_exts (Array<String, Regexp, Proc>) (defaults to: nil)

    The patterns which match the URI path extensions to not visit.



# File 'lib/spidr/agent/filters.rb', line 398

def initialize_filters(schemes:      self.class.default_schemes,
                       host:         nil,
                       hosts:        nil,
                       ignore_hosts: nil,
                       ports:        nil,
                       ignore_ports: nil,
                       links:        nil,
                       ignore_links: nil,
                       urls:         nil,
                       ignore_urls:  nil,
                       exts:         nil,
                       ignore_exts:  nil)
  @schemes = schemes.map(&:to_s)

  @host_rules = Rules.new(accept: hosts, reject: ignore_hosts)
  @port_rules = Rules.new(accept: ports, reject: ignore_ports)
  @link_rules = Rules.new(accept: links, reject: ignore_links)
  @url_rules  = Rules.new(accept: urls,  reject: ignore_urls)
  @ext_rules  = Rules.new(accept: exts,  reject: ignore_exts)

  visit_hosts_like(host) if host
end
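The Rules class used above is not shown in this section. The sketch below is one plausible way accept and reject lists of String, Regexp, and Proc patterns could be evaluated. DemoRules and its policy (a non-empty accept list takes precedence, otherwise anything not rejected is accepted) are assumptions made for illustration, not Spidr's documented behavior.

```ruby
# Hypothetical stand-in for a rules object; policy and matching are assumptions.
class DemoRules
  attr_reader :accept, :reject

  def initialize(accept: nil, reject: nil)
    @accept = Array(accept)
    @reject = Array(reject)
  end

  # Accepts data if it matches an accept rule; with no accept rules,
  # accepts anything that matches no reject rule.
  def accept?(data)
    if @accept.empty?
      @reject.none? { |rule| matches?(data, rule) }
    else
      @accept.any? { |rule| matches?(data, rule) }
    end
  end

  private

  # Proc rules are called, Regexp rules are matched, anything else is compared.
  def matches?(data, rule)
    case rule
    when Proc   then rule.call(data) == true
    when Regexp then rule.match?(data.to_s)
    else             data == rule
    end
  end
end

allow_list = DemoRules.new(accept: ['example.com', /\.example\.org\z/])
deny_list  = DemoRules.new(reject: [->(host) { host.end_with?('.internal') }])
```

Under these assumptions, allow_list.accept?('www.example.org') is true and deny_list.accept?('db.internal') is false.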

#initialize_robots ⇒ Object

Initializes the robots filter.



# File 'lib/spidr/agent/robots.rb', line 13

def initialize_robots
  unless Object.const_defined?(:Robots)
    raise(ArgumentError,":robots option given but unable to require 'robots' gem")
  end

  @robots = Robots.new(@user_agent)
end

#initialize_sanitizers(strip_fragments: true, strip_query: false) ⇒ Object (protected)

Initializes the Sanitizer rules.

Parameters:

  • strip_fragments (Boolean) (defaults to: true)

    Specifies whether or not to strip the fragment component from URLs.

  • strip_query (Boolean) (defaults to: false)

    Specifies whether or not to strip the query component from URLs.

Since:

  • 0.2.2



# File 'lib/spidr/agent/sanitizers.rb', line 47

def initialize_sanitizers(strip_fragments: true, strip_query: false)
  @strip_fragments = strip_fragments
  @strip_query     = strip_query
end

#limit_reached? ⇒ Boolean (protected)

Determines if the maximum limit has been reached.

Returns:

  • (Boolean)

Since:

  • 0.6.0



# File 'lib/spidr/agent.rb', line 933

def limit_reached?
  @limit && @history.length >= @limit
end

#pause! ⇒ Object

Pauses the agent, causing spidering to temporarily stop.

Raises:

  • (Paused)

    Indicates to the agent that it should pause spidering.



# File 'lib/spidr/agent/actions.rb', line 63

def pause!
  @paused = true
  raise(Actions::Paused)
end

#pause=(state) ⇒ Object

Sets the pause state of the agent.

Parameters:

  • state (Boolean)

    The new pause state of the agent.



# File 'lib/spidr/agent/actions.rb', line 53

def pause=(state)
  @paused = state
end

#paused? ⇒ Boolean

Determines whether the agent is paused.

Returns:

  • (Boolean)

    Specifies whether the agent is paused.



# File 'lib/spidr/agent/actions.rb', line 74

def paused?
  @paused == true
end

#post_page(url, post_data = '') {|page| ... } ⇒ Page?

Posts supplied form data and creates a new Page object from a given URL.

Parameters:

  • url (URI::HTTP)

    The URL to request.

  • post_data (String) (defaults to: '')

    Form option data.

Yields:

  • (page)

    If a block is given, it will be passed the page that represents the response.

Yield Parameters:

  • page (Page)

    The page for the response.

Returns:

  • (Page, nil)

    The page for the response, or nil if the request failed.

Since:

  • 0.2.2



# File 'lib/spidr/agent.rb', line 745

def post_page(url,post_data='')
  url = URI(url)

  prepare_request(url) do |session,path,headers|
    new_page = Page.new(url,session.post(path,post_data,headers))

    # save any new cookies
    @cookies.from_page(new_page)

    yield new_page if block_given?
    return new_page
  end
end

#prepare_request(url) {|request| ... } ⇒ Object (protected)

Normalizes the request path and grabs a session to handle page get and post requests.

Parameters:

  • url (URI::HTTP)

    The URL to request.

Yields:

  • (request)

    A block whose purpose is to make a page request.

Yield Parameters:

  • session (Net::HTTP)

    An HTTP session object.

  • path (String)

    The normalized request path, including the query string if the URL has one.

  • headers (Hash)

    A Hash of request header options.

Since:

  • 0.2.2



# File 'lib/spidr/agent.rb', line 885

def prepare_request(url,&block)
  path = unless url.path.empty?
           url.path
         else
           '/'
         end

  # append the URL query to the path
  path += "?#{url.query}" if url.query

  headers = prepare_request_headers(url)

  begin
    sleep(@delay) if @delay > 0

    yield @sessions[url], path, headers
  rescue SystemCallError,
         Timeout::Error,
         SocketError,
         IOError,
         OpenSSL::SSL::SSLError,
         Net::HTTPBadResponse,
         Zlib::Error

    @sessions.kill!(url)

    failed(url)
    return nil
  end
end
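The path normalization at the top of the listing can be tried in isolation. The sketch below extracts it into a helper named request_path, a hypothetical name used only for this illustration.

```ruby
require 'uri'

# Builds the request path the same way the listing does:
# an empty path becomes '/', and any query string is appended.
def request_path(url)
  url  = URI(url)
  path = url.path.empty? ? '/' : url.path
  path += "?#{url.query}" if url.query
  path
end

root_path   = request_path('http://example.com')
query_path  = request_path('http://example.com/search?q=ruby')
no_fragment = request_path('http://example.com/docs#intro')
```

The fragment is never part of the request path. For HTTP URIs, the standard library's URI::HTTP#request_uri yields the same kind of string.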

#prepare_request_headers(url) ⇒ Hash{String => String} (protected)

Prepares request headers for the given URL.

Parameters:

  • url (URI::HTTP)

    The URL to prepare the request headers for.

Returns:

  • (Hash{String => String})

    The prepared headers.

Since:

  • 0.6.0



# File 'lib/spidr/agent.rb', line 836

def prepare_request_headers(url)
  # set any additional HTTP headers
  headers = @default_headers.dup

  unless @host_headers.empty?
    @host_headers.each do |name,header|
      if url.host.match(name)
        headers['Host'] = header
        break
      end
    end
  end

  headers['Host']     ||= @host_header if @host_header
  headers['User-Agent'] = @user_agent if @user_agent
  headers['Referer']    = @referer if @referer

  if (authorization = @authorized.for_url(url))
    headers['Authorization'] = "Basic #{authorization}"
  end

  if (header_cookies = @cookies.for_host(url.host))
    headers['Cookie'] = header_cookies
  end

  return headers
end
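The Host header selection in the listing has two details worth seeing in action: the first matching entry wins, and String keys are converted to a Regexp by String#match, so they match as unanchored patterns in which "." matches any character. The sketch below isolates that logic in a hypothetical helper, host_header_for, with made-up host names.

```ruby
# Hypothetical helper mirroring the Host header logic of the listing.
def host_header_for(host, host_headers, fallback = nil)
  headers = {}

  host_headers.each do |name, header|
    # String#match converts a String pattern to a Regexp before matching.
    if host.match(name)
      headers['Host'] = header
      break
    end
  end

  # The fallback only applies when no per-host entry matched.
  headers['Host'] ||= fallback if fallback
  headers
end

host_headers = {
  'example.com' => 'internal.example.com',
  /\.test\z/    => 'staging.example.com'
}

subdomain = host_header_for('www.example.com', host_headers)
wildcard  = host_header_for('exampleXcom.net', host_headers)
regexp    = host_header_for('api.test', host_headers)
fallback  = host_header_for('other.org', host_headers, 'default.example.com')
nothing   = host_header_for('other.org', host_headers)
```

If an exact host match is intended, an anchored Regexp key such as /\Aexample\.com\z/ avoids the loose matching shown by the wildcard case.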

#proxy ⇒ Proxy

The proxy information the agent uses.

Returns:

  • (Proxy)

    The proxy information.

Since:

  • 0.2.2



# File 'lib/spidr/agent.rb', line 434

def proxy
  @sessions.proxy
end

#proxy=(new_proxy) ⇒ Proxy

Sets the proxy information that the agent uses.

Parameters:

  • new_proxy (Proxy, Hash, URI::HTTP, String, nil)

    The new proxy information.

Returns:

  • (Proxy)

    The new proxy information.

Since:

  • 0.2.2



# File 'lib/spidr/agent.rb', line 451

def proxy=(new_proxy)
  @sessions.proxy = new_proxy
end

#queued?(url) ⇒ Boolean

Determines whether a given URL has been enqueued.

Parameters:

  • url (URI::HTTP)

    The URL to search for in the queue.

Returns:

  • (Boolean)

    Specifies whether the given URL has been queued for visiting.



# File 'lib/spidr/agent.rb', line 644

def queued?(url)
  @queue.include?(url)
end

#robot_allowed?(url) ⇒ Boolean

Determines whether a URL is allowed by the robot policy.

Parameters:

  • url (URI::HTTP, String)

    The URL to check.

Returns:

  • (Boolean)

    Specifies whether a URL is allowed by the robot policy.



# File 'lib/spidr/agent/robots.rb', line 30

def robot_allowed?(url)
  if @robots
    @robots.allowed?(url)
  else
    true
  end
end

#run {|page| ... } ⇒ Object

Start spidering until the queue becomes empty or the agent is paused.

Yields:

  • (page)

    If a block is given, it will be passed every page visited.

Yield Parameters:

  • page (Page)

    A page which has been visited.



# File 'lib/spidr/agent.rb', line 492

def run(&block)
  @running = true

  until (@queue.empty? || paused? || limit_reached?)
    begin
      visit_page(dequeue,&block)
    rescue Actions::Paused
      return self
    rescue Actions::Action
    end
  end

  @running = false
  @sessions.clear
  return self
end
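The loop above uses exceptions as control flow: pausing raises an exception that unwinds out of the visit, and the loop returns while leaving the remaining queue intact. A minimal sketch of that pattern follows. DemoActions, LoopDemo, and the resume helper are hypothetical, and the exception hierarchy (Paused as a subclass of Action) is an assumption inferred from the order of the rescue clauses.

```ruby
# Hypothetical exception hierarchy (assumption: Paused is a kind of Action).
module DemoActions
  class Action < RuntimeError; end
  class Paused < Action; end
end

# Hypothetical stand-in for the agent's run loop.
class LoopDemo
  attr_reader :visited

  def initialize(queue)
    @queue   = queue
    @visited = []
    @paused  = false
  end

  def pause!
    @paused = true
    raise DemoActions::Paused
  end

  def paused?
    @paused == true
  end

  # Hypothetical helper: clears the pause flag and continues with the queue.
  def resume(&block)
    @paused = false
    run(&block)
  end

  def run
    until @queue.empty? || paused?
      begin
        url = @queue.shift
        @visited << url
        yield url if block_given?
      rescue DemoActions::Paused
        return self
      rescue DemoActions::Action
        # other actions only abort the current visit
      end
    end

    self
  end
end

demo = LoopDemo.new(%w[a b c d])
demo.run { |url| demo.pause! if url == 'b' }
visited_at_pause = demo.visited.dup
paused_at_pause  = demo.paused?
demo.resume
```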

#running? ⇒ Boolean

Determines if the agent is running.

Returns:

  • (Boolean)

    Specifies whether the agent is running or stopped.



# File 'lib/spidr/agent.rb', line 515

def running?
  @running == true
end

#sanitize_url(url) ⇒ URI::HTTP, URI::HTTPS

Sanitizes a URL based on filtering options.

Parameters:

  • url (URI::HTTP, URI::HTTPS, String)

    The URL to be sanitized.

Returns:

  • (URI::HTTP, URI::HTTPS)

    The new sanitized URL.

Since:

  • 0.2.2



# File 'lib/spidr/agent/sanitizers.rb', line 25

def sanitize_url(url)
  url = URI(url)

  url.fragment = nil if @strip_fragments
  url.query    = nil if @strip_query

  return url
end
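The effect of the two options can be reproduced with the standard library alone. In the sketch below, sanitize is a hypothetical free-standing version of the listing that takes the two flags as keyword arguments instead of reading instance variables.

```ruby
require 'uri'

# Hypothetical stand-alone version of the listing above.
def sanitize(url, strip_fragments: true, strip_query: false)
  url = URI(url)

  url.fragment = nil if strip_fragments
  url.query    = nil if strip_query

  url
end

default_result = sanitize('http://example.com/page?x=1#top').to_s
no_query       = sanitize('http://example.com/page?x=1#top', strip_query: true).to_s
untouched      = sanitize('http://example.com/page?x=1#top', strip_fragments: false).to_s

# URI() returns an existing URI object as-is, so the caller's object is modified.
shared = URI('http://example.com/#section')
sanitize(shared)
```

Pass a String, or a dup of the URI, when the original object must keep its fragment or query.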

#skip_link! ⇒ Object

Causes the agent to skip the link being enqueued.

Raises:

  • (SkipLink)

    Indicates to the agent that the current link should be skipped and neither enqueued nor visited.



# File 'lib/spidr/agent/actions.rb', line 85

def skip_link!
  raise(Actions::SkipLink)
end

#skip_page! ⇒ Object

Causes the agent to skip the page being visited.

Raises:

  • (SkipPage)

    Indicates to the agent that the current page should be skipped.



# File 'lib/spidr/agent/actions.rb', line 95

def skip_page!
  raise(Actions::SkipPage)
end

#start_at(url) {|page| ... } ⇒ Object

Start spidering at a given URL.

Parameters:

  • url (URI::HTTP, String)

    The URL to start spidering at.

Yields:

  • (page)

    If a block is given, it will be passed every page visited.

Yield Parameters:

  • page (Page)

    A page which has been visited.



# File 'lib/spidr/agent.rb', line 477

def start_at(url,&block)
  enqueue(url)
  return run(&block)
end

#to_hash ⇒ Hash

Converts the agent into a Hash.

Returns:

  • (Hash)

    The agent represented as a Hash containing the history and the queue of the agent.



# File 'lib/spidr/agent.rb', line 819

def to_hash
  {history: @history, queue: @queue}
end

#urls_like(pattern, &block) ⇒ Object

See Also:

  • #every_url_like



# File 'lib/spidr/agent/events.rb', line 56

def urls_like(pattern,&block)
  every_url_like(pattern,&block)
end

#visit?(url) ⇒ Boolean (protected)

Determines if a given URL should be visited.

Parameters:

  • url (URI::HTTP)

    The URL in question.

Returns:

  • (Boolean)

    Specifies whether the given URL should be visited.



# File 'lib/spidr/agent.rb', line 946

def visit?(url)
  !visited?(url) &&
   visit_scheme?(url.scheme) &&
   visit_host?(url.host) &&
   visit_port?(url.port) &&
   visit_link?(url.to_s) &&
   visit_url?(url) &&
   visit_ext?(url.path) &&
   robot_allowed?(url.to_s)
end

#visit_ext?(path) ⇒ Boolean (protected)

Determines if a given URI path extension should be visited.

Parameters:

  • path (String)

    The path that contains the extension.

Returns:

  • (Boolean)

    Specifies whether the given URI path extension should be visited.



# File 'lib/spidr/agent/filters.rb', line 525

def visit_ext?(path)
  @ext_rules.accept?(File.extname(path)[1..-1])
end
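The expression File.extname(path)[1..-1] drops the leading dot from the extension and evaluates to nil when the path has no extension. The sketch below shows both cases; extension_of is a hypothetical helper name used only here.

```ruby
# Returns the extension without its leading dot, or nil if there is none.
def extension_of(path)
  File.extname(path)[1..-1]
end

html_ext = extension_of('/docs/index.html')
last_ext = extension_of('/downloads/archive.tar.gz')
no_ext   = extension_of('/about')
```

Only the final extension is considered, and extensionless paths hand nil to the extension rules.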

#visit_exts ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match the URI path extensions to visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The URI path extension patterns to visit.



# File 'lib/spidr/agent/filters.rb', line 298

def visit_exts
  @ext_rules.accept
end

#visit_exts_like(pattern = nil) {|ext| ... } ⇒ Object

Adds a given pattern to the #visit_exts.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match URI path extensions with.

Yields:

  • (ext)

    If a block is given, it will be used to filter URI path extensions.

Yield Parameters:

  • ext (String)

    A URI path extension to accept or reject.



# File 'lib/spidr/agent/filters.rb', line 314

def visit_exts_like(pattern=nil,&block)
  if pattern
    visit_exts << pattern
  elsif block
    visit_exts << block
  end

  return self
end

#visit_host?(host) ⇒ Boolean (protected)

Determines if a given host-name should be visited.

Parameters:

  • host (String)

    The host-name.

Returns:

  • (Boolean)

    Specifies whether the given host-name should be visited.



# File 'lib/spidr/agent/filters.rb', line 471

def visit_host?(host)
  @host_rules.accept?(host)
end

#visit_hosts ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match host-names to visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The host-name patterns to visit.



# File 'lib/spidr/agent/filters.rb', line 30

def visit_hosts
  @host_rules.accept
end

#visit_hosts_like(pattern = nil) {|host| ... } ⇒ Object

Adds a given pattern to the #visit_hosts.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match host-names with.

Yields:

  • (host)

    If a block is given, it will be used to filter host-names.

Yield Parameters:

  • host (String)

    A host-name to accept or reject.



# File 'lib/spidr/agent/filters.rb', line 46

def visit_hosts_like(pattern=nil,&block)
  if pattern
    visit_hosts << pattern
  elsif block
    visit_hosts << block
  end

  return self
end

#visit_link?(link) ⇒ Boolean (protected)

Determines if a given link should be visited.

Parameters:

  • link (String)

    The link.

Returns:

  • (Boolean)

    Specifies whether the given link should be visited.



# File 'lib/spidr/agent/filters.rb', line 497

def visit_link?(link)
  @link_rules.accept?(link)
end

#visit_links ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match the links to visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The link patterns to visit.

Since:

  • 0.2.4



# File 'lib/spidr/agent/filters.rb', line 160

def visit_links
  @link_rules.accept
end

#visit_links_like(pattern = nil) {|link| ... } ⇒ Object

Adds a given pattern to the #visit_links.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match links with.

Yields:

  • (link)

    If a block is given, it will be used to filter links.

Yield Parameters:

  • link (String)

    A link to accept or reject.

Since:

  • 0.2.4



# File 'lib/spidr/agent/filters.rb', line 178

def visit_links_like(pattern=nil,&block)
  if pattern
    visit_links << pattern
  elsif block
    visit_links << block
  end

  return self
end

#visit_page(url) {|page| ... } ⇒ Page?

Visits a given URL, and enqueues the links recovered from the URL to be visited later.

Parameters:

  • url (URI::HTTP, String)

    The URL to visit.

Yields:

  • (page)

    If a block is given, it will be passed the page which was visited.

Yield Parameters:

  • page (Page)

    The page which was visited.

Returns:

  • (Page, nil)

    The page that was visited. If nil is returned, either the request for the page failed, or the page was skipped.



# File 'lib/spidr/agent.rb', line 776

def visit_page(url)
  url = sanitize_url(url)

  get_page(url) do |page|
    @history << page.url

    begin
      @every_page_blocks.each { |page_block| page_block.call(page) }

      yield page if block_given?
    rescue Actions::Paused => action
      raise(action)
    rescue Actions::SkipPage
      return nil
    rescue Actions::Action
    end

    page.each_url do |next_url|
      begin
        @every_link_blocks.each do |link_block|
          link_block.call(page.url,next_url)
        end
      rescue Actions::Paused => action
        raise(action)
      rescue Actions::SkipLink
        next
      rescue Actions::Action
      end

      if (@max_depth.nil? || @max_depth > @levels[url])
        enqueue(next_url,@levels[url] + 1)
      end
    end
  end
end
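The final condition in the listing limits crawl depth: links found on a page are only enqueued while the page's own level is below max_depth, and each enqueued link is recorded one level deeper. The sketch below applies the same rule to an in-memory link graph. The crawl helper, the graph, and the bookkeeping details are hypothetical simplifications, not the agent's actual queueing code.

```ruby
# Hypothetical breadth-first crawl over an in-memory link graph.
def crawl(links, start, max_depth: nil)
  levels  = Hash.new(0) # the start URL is at level 0
  queue   = [start]
  history = []

  until queue.empty?
    url = queue.shift
    history << url

    links.fetch(url, []).each do |next_url|
      # Same test as the listing: only follow links while below max_depth.
      next unless max_depth.nil? || max_depth > levels[url]
      next if history.include?(next_url) || queue.include?(next_url)

      levels[next_url] = levels[url] + 1
      queue << next_url
    end
  end

  history
end

site = {
  '/'      => ['/a', '/b'],
  '/a'     => ['/a/1'],
  '/b'     => [],
  '/a/1'   => ['/a/1/x'],
  '/a/1/x' => []
}

unlimited  = crawl(site, '/')
one_level  = crawl(site, '/', max_depth: 1)
start_only = crawl(site, '/', max_depth: 0)
```

With max_depth: 1 the start page and its direct links are visited; with max_depth: 0 only the start page is.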

#visit_port?(port) ⇒ Boolean (protected)

Determines if a given port should be visited.

Parameters:

  • port (Integer)

    The port number.

Returns:

  • (Boolean)

    Specifies whether the given port should be visited.



# File 'lib/spidr/agent/filters.rb', line 484

def visit_port?(port)
  @port_rules.accept?(port)
end

#visit_ports ⇒ Array<Integer, Regexp, Proc>

Specifies the patterns that match the ports to visit.

Returns:

  • (Array<Integer, Regexp, Proc>)

    The port patterns to visit.



# File 'lib/spidr/agent/filters.rb', line 94

def visit_ports
  @port_rules.accept
end

#visit_ports_like(pattern = nil) {|port| ... } ⇒ Object

Adds a given pattern to the #visit_ports.

Parameters:

  • pattern (Integer, Regexp) (defaults to: nil)

    The pattern to match ports with.

Yields:

  • (port)

    If a block is given, it will be used to filter ports.

Yield Parameters:

  • port (Integer)

    A port to accept or reject.



# File 'lib/spidr/agent/filters.rb', line 110

def visit_ports_like(pattern=nil,&block)
  if pattern
    visit_ports << pattern
  elsif block
    visit_ports << block
  end

  return self
end

#visit_scheme?(scheme) ⇒ Boolean (protected)

Determines if a given URI scheme should be visited.

Parameters:

  • scheme (String)

    The URI scheme.

Returns:

  • (Boolean)

    Specifies whether the given scheme should be visited.



# File 'lib/spidr/agent/filters.rb', line 454

def visit_scheme?(scheme)
  if scheme
    @schemes.include?(scheme)
  else
    true
  end
end

#visit_url?(link) ⇒ Boolean (protected)

Determines if a given URL should be visited.

Parameters:

  • link (URI::HTTP, URI::HTTPS)

    The URL.

Returns:

  • (Boolean)

    Specifies whether the given URL should be visited.

Since:

  • 0.2.4



# File 'lib/spidr/agent/filters.rb', line 512

def visit_url?(link)
  @url_rules.accept?(link)
end

#visit_urls ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match the URLs to visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The URL patterns to visit.

Since:

  • 0.2.4



# File 'lib/spidr/agent/filters.rb', line 228

def visit_urls
  @url_rules.accept
end

#visit_urls_like(pattern = nil) {|url| ... } ⇒ Object

Adds a given pattern to the #visit_urls.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match URLs with.

Yields:

  • (url)

    If a block is given, it will be used to filter URLs.

Yield Parameters:

  • url (URI::HTTP, URI::HTTPS)

    A URL to accept or reject.

Since:

  • 0.2.4



# File 'lib/spidr/agent/filters.rb', line 246

def visit_urls_like(pattern=nil,&block)
  if pattern
    visit_urls << pattern
  elsif block
    visit_urls << block
  end

  return self
end

#visited?(url) ⇒ Boolean

Determines whether a URL was visited or not.

Parameters:

  • url (URI::HTTP, String)

    The URL to search for.

Returns:

  • (Boolean)

    Specifies whether a URL was visited.



# File 'lib/spidr/agent.rb', line 572

def visited?(url)
  @history.include?(URI(url))
end

#visited_hosts ⇒ Array<String>

Specifies all hosts that were visited.

Returns:

  • (Array<String>)

    The hosts which have been visited.



# File 'lib/spidr/agent.rb', line 559

def visited_hosts
  visited_urls.map(&:host).uniq
end
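The one-liner above reduces a list of visited URIs to its distinct hosts. The sketch below applies the same expression to a small hand-built history; the assumption is that visited_urls returns URI objects, which this section does not show.

```ruby
require 'uri'

# Stand-in for the agent's visited URLs (assumed to be URI objects).
history = [
  URI('http://example.com/'),
  URI('http://example.com/about'),
  URI('https://blog.example.com/post')
]

# Distinct hosts, in order of first appearance.
hosts = history.map(&:host).uniq
```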

#visited_links ⇒ Array<String>

Specifies the links which have been visited.

Returns:

  • (Array<String>)

    The links which have been visited.



# File 'lib/spidr/agent.rb', line 549

def visited_links
  @history.map(&:to_s)
end