Module: Spidr::Filters

Included in:
Agent
Defined in:
lib/spidr/filters.rb

Overview

The Filters module adds methods to Agent for controlling which URLs the agent will visit.

Instance Attribute Details

#schemes ⇒ Object

List of acceptable URL schemes to follow.



# File 'lib/spidr/filters.rb', line 10

def schemes
  @schemes
end

Instance Method Details

#ignore_exts ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match URI path extensions to not visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The URI path extension patterns to not visit.



# File 'lib/spidr/filters.rb', line 331

def ignore_exts
  @ext_rules.reject
end

#ignore_exts_like(pattern = nil) {|ext| ... } ⇒ Object

Adds a given pattern to the #ignore_exts.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match URI path extensions with.

Yields:

  • (ext)

    If a block is given, it will be used to filter URI path extensions.

Yield Parameters:

  • ext (String)

    A URI path extension to reject or accept.



# File 'lib/spidr/filters.rb', line 347

def ignore_exts_like(pattern=nil,&block)
  if pattern
    ignore_exts << pattern
  elsif block
    ignore_exts << block
  end

  return self
end
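
As a rough illustration of how the pattern and block forms behave, the sketch below copies the method body onto a hypothetical stand-in class. `ExtFilterSketch` is not part of Spidr, and a plain Array stands in for the reject list. It suggests two things: calls can be chained because the method returns self, and a block is ignored when a pattern is also passed.

```ruby
# Hypothetical stand-in for the agent; only the reject list is modeled.
class ExtFilterSketch
  def initialize
    @reject = []
  end

  # Stands in for #ignore_exts: exposes the mutable reject list.
  def ignore_exts
    @reject
  end

  # Same branching as the documented method body.
  def ignore_exts_like(pattern = nil, &block)
    if pattern
      ignore_exts << pattern
    elsif block
      ignore_exts << block
    end

    self
  end
end

agent = ExtFilterSketch.new

# String, Regexp, and block forms, chained through the returned self.
agent.ignore_exts_like('pdf')
     .ignore_exts_like(/\Ajpe?g\z/)
     .ignore_exts_like { |ext| ext.length > 4 }

# When both a pattern and a block are given, only the pattern is stored.
agent.ignore_exts_like('zip') { |_ext| true }
```

After these calls the list holds four rules: two literal patterns, one Proc, and the string 'zip'.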

#ignore_hosts ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match host-names to not visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The host-name patterns to not visit.



# File 'lib/spidr/filters.rb', line 63

def ignore_hosts
  @host_rules.reject
end

#ignore_hosts_like(pattern = nil) {|host| ... } ⇒ Object

Adds a given pattern to the #ignore_hosts.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match host-names with.

Yields:

  • (host)

    If a block is given, it will be used to filter host-names.

Yield Parameters:

  • host (String)

    A host-name to reject or accept.



# File 'lib/spidr/filters.rb', line 79

def ignore_hosts_like(pattern=nil,&block)
  if pattern
    ignore_hosts << pattern
  elsif block
    ignore_hosts << block
  end

  return self
end

#ignore_links ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match links to not visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The link patterns to not visit.



# File 'lib/spidr/filters.rb', line 195

def ignore_links
  @link_rules.reject
end

#ignore_links_like(pattern = nil) {|link| ... } ⇒ Object

Adds a given pattern to the #ignore_links.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match links with.

Yields:

  • (link)

    If a block is given, it will be used to filter links.

Yield Parameters:

  • link (String)

    A link to reject or accept.



# File 'lib/spidr/filters.rb', line 211

def ignore_links_like(pattern=nil,&block)
  if pattern
    ignore_links << pattern
  elsif block
    ignore_links << block
  end

  return self
end

#ignore_ports ⇒ Array<Integer, Regexp, Proc>

Specifies the patterns that match ports to not visit.

Returns:

  • (Array<Integer, Regexp, Proc>)

    The port patterns to not visit.



# File 'lib/spidr/filters.rb', line 127

def ignore_ports
  @port_rules.reject
end

#ignore_ports_like(pattern = nil) {|port| ... } ⇒ Object

Adds a given pattern to the #ignore_ports.

Parameters:

  • pattern (Integer, Regexp) (defaults to: nil)

    The pattern to match ports with.

Yields:

  • (port)

    If a block is given, it will be used to filter ports.

Yield Parameters:

  • port (Integer)

    A port to reject or accept.



# File 'lib/spidr/filters.rb', line 143

def ignore_ports_like(pattern=nil,&block)
  if pattern
    ignore_ports << pattern
  elsif block
    ignore_ports << block
  end

  return self
end

#ignore_urls ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match URLs to not visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The URL patterns to not visit.

Since:

  • 0.2.4



# File 'lib/spidr/filters.rb', line 265

def ignore_urls
  @url_rules.reject
end

#ignore_urls_like(pattern = nil) {|url| ... } ⇒ Object

Adds a given pattern to the #ignore_urls.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match URLs with.

Yields:

  • (url)

    If a block is given, it will be used to filter URLs.

Yield Parameters:

  • url (URI::HTTP, URI::HTTPS)

    A URL to reject or accept.

Since:

  • 0.2.4



# File 'lib/spidr/filters.rb', line 283

def ignore_urls_like(pattern=nil,&block)
  if pattern
    ignore_urls << pattern
  elsif block
    ignore_urls << block
  end

  return self
end
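
Because the block for URL filters is documented as receiving a URI object rather than a String, a block can inspect parsed components such as the query. The sketch below is only an illustration: a plain Array stands in for the list returned by #ignore_urls, and the matching logic in the `rejected` lambda is an assumption, since this page does not show how the rules are evaluated.

```ruby
require 'uri'

# A plain Array stands in for the list returned by #ignore_urls.
ignore_urls = []
ignore_urls << %r{/logout\b}                # Regexp pattern
ignore_urls << ->(url) { !url.query.nil? }  # block form: receives a URI object

# Assumed matching semantics (not shown in this page): Procs are called
# with the URL, Regexps are matched against its string form.
rejected = lambda do |url|
  ignore_urls.any? do |rule|
    rule.is_a?(Proc) ? rule.call(url) : rule.match?(url.to_s)
  end
end

logout = rejected.call(URI('http://example.com/logout'))         # matches the Regexp
search = rejected.call(URI('http://example.com/search?q=ruby'))  # has a query string
about  = rejected.call(URI('http://example.com/about'))          # matches neither rule
```

Under these assumptions the first two URLs are rejected and the third is not.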

#initialize_filters(options = {}) ⇒ Object (protected)

Initializes filtering rules.

Parameters:

  • options (Hash) (defaults to: {})

    Additional options.

Options Hash (options):

  • :schemes (Array) — default: ['http', 'https']

    The list of acceptable URI schemes to visit. The https scheme will be ignored if net/https cannot be loaded.

  • :host (String)

    The host-name to visit.

  • :hosts (Array<String, Regexp, Proc>)

    The patterns which match the host-names to visit.

  • :ignore_hosts (Array<String, Regexp, Proc>)

    The patterns which match the host-names to not visit.

  • :ports (Array<Integer, Regexp, Proc>)

    The patterns which match the ports to visit.

  • :ignore_ports (Array<Integer, Regexp, Proc>)

    The patterns which match the ports to not visit.

  • :links (Array<String, Regexp, Proc>)

    The patterns which match the links to visit.

  • :ignore_links (Array<String, Regexp, Proc>)

    The patterns which match the links to not visit.

  • :urls (Array<String, Regexp, Proc>)

    The patterns which match the URLs to visit.

  • :ignore_urls (Array<String, Regexp, Proc>)

    The patterns which match the URLs to not visit.

  • :exts (Array<String, Regexp, Proc>)

    The patterns which match the URI path extensions to visit.

  • :ignore_exts (Array<String, Regexp, Proc>)

    The patterns which match the URI path extensions to not visit.



# File 'lib/spidr/filters.rb', line 402

def initialize_filters(options={})
  @schemes = []

  if options[:schemes]
    @schemes += options[:schemes]
  else
    @schemes << 'http'

    begin
      require 'net/https'

      @schemes << 'https'
    rescue Gem::LoadError => e
      raise(e)
    rescue ::LoadError
      STDERR.puts "Warning: cannot load 'net/https', https support disabled"
    end
  end

  @host_rules = Rules.new(
    :accept => options[:hosts],
    :reject => options[:ignore_hosts]
  )
  @port_rules = Rules.new(
    :accept => options[:ports],
    :reject => options[:ignore_ports]
  )
  @link_rules = Rules.new(
    :accept => options[:links],
    :reject => options[:ignore_links]
  )
  @url_rules = Rules.new(
    :accept => options[:urls],
    :reject => options[:ignore_urls]
  )
  @ext_rules = Rules.new(
    :accept => options[:exts],
    :reject => options[:ignore_exts]
  )

  if options[:host]
    visit_hosts_like(options[:host])
  end

  if options[:queue]
    self.queue = options[:queue]
  end

  if options[:history]
    self.history = options[:history]
  end
end
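
The Rules class that receives the :accept and :reject options is not shown on this page. To make the accept/reject pairing more concrete, here is a minimal sketch of how such a rule set might be evaluated. `SketchRules` is a hypothetical stand-in, and its semantics (a non-empty accept list takes priority, otherwise anything not rejected is accepted) are an assumption rather than the library's confirmed behavior.

```ruby
# Hypothetical stand-in for the Rules class; semantics are assumed.
class SketchRules
  attr_reader :accept, :reject

  def initialize(options = {})
    # nil options (as passed when the caller omits them) become empty lists.
    @accept = Array(options[:accept]).dup
    @reject = Array(options[:reject]).dup
  end

  # Assumption: a non-empty accept list decides on its own;
  # otherwise the data passes unless a reject rule matches.
  def accept?(data)
    return @accept.any? { |rule| match?(rule, data) } unless @accept.empty?

    @reject.none? { |rule| match?(rule, data) }
  end

  private

  # Procs are called, Regexps are matched against the string form,
  # and anything else is compared for equality.
  def match?(rule, data)
    case rule
    when Proc   then rule.call(data) ? true : false
    when Regexp then rule.match?(data.to_s)
    else rule == data
    end
  end
end

hosts = SketchRules.new(accept: ['example.com', /\.example\.com\z/], reject: nil)
ports = SketchRules.new(accept: nil, reject: [8080])

hosts.accept?('www.example.com')  # matched by the Regexp in the accept list
hosts.accept?('other.org')        # not in the accept list
ports.accept?(80)                 # not rejected
ports.accept?(8080)               # rejected
```

With empty accept and reject lists this sketch accepts everything, which is consistent with a filter that is inactive until a pattern is added.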

#visit_ext?(path) ⇒ Boolean (protected)

Determines if a given URI path extension should be visited.

Parameters:

  • path (String)

    The path that contains the extension.

Returns:

  • (Boolean)

    Specifies whether the given URI path extension should be visited.



# File 'lib/spidr/filters.rb', line 535

def visit_ext?(path)
  @ext_rules.accept?(File.extname(path)[1..-1])
end
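
To see what value actually reaches the extension rules, the expression used in the method body can be tried on its own. The helper name `ext_of` below is made up for this illustration; only `File.extname` and the `[1..-1]` slice come from the source.

```ruby
# Hypothetical helper wrapping the expression from the method body.
def ext_of(path)
  File.extname(path)[1..-1]
end

html = ext_of('/docs/index.html')       # => "html" (leading dot removed)
gz   = ext_of('/files/archive.tar.gz')  # => "gz"   (only the last extension)
none = ext_of('/about')                 # => nil    (no extension at all)
```

Patterns therefore match against the extension without its leading dot, and a path with no extension yields nil rather than an empty string.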

#visit_exts ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match the URI path extensions to visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The URI path extension patterns to visit.



# File 'lib/spidr/filters.rb', line 299

def visit_exts
  @ext_rules.accept
end

#visit_exts_like(pattern = nil) {|ext| ... } ⇒ Object

Adds a given pattern to the #visit_exts.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match URI path extensions with.

Yields:

  • (ext)

    If a block is given, it will be used to filter URI path extensions.

Yield Parameters:

  • ext (String)

    A URI path extension to accept or reject.



# File 'lib/spidr/filters.rb', line 315

def visit_exts_like(pattern=nil,&block)
  if pattern
    visit_exts << pattern
  elsif block
    visit_exts << block
  end

  return self
end

#visit_host?(host) ⇒ Boolean (protected)

Determines if a given host-name should be visited.

Parameters:

  • host (String)

    The host-name.

Returns:

  • (Boolean)

    Specifies whether the given host-name should be visited.



# File 'lib/spidr/filters.rb', line 481

def visit_host?(host)
  @host_rules.accept?(host)
end

#visit_hosts ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match host-names to visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The host-name patterns to visit.



# File 'lib/spidr/filters.rb', line 31

def visit_hosts
  @host_rules.accept
end

#visit_hosts_like(pattern = nil) {|host| ... } ⇒ Object

Adds a given pattern to the #visit_hosts.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match host-names with.

Yields:

  • (host)

    If a block is given, it will be used to filter host-names.

Yield Parameters:

  • host (String)

    A host-name to accept or reject.



# File 'lib/spidr/filters.rb', line 47

def visit_hosts_like(pattern=nil,&block)
  if pattern
    visit_hosts << pattern
  elsif block
    visit_hosts << block
  end

  return self
end

#visit_link?(link) ⇒ Boolean (protected)

Determines if a given link should be visited.

Parameters:

  • link (String)

    The link.

Returns:

  • (Boolean)

    Specifies whether the given link should be visited.



# File 'lib/spidr/filters.rb', line 507

def visit_link?(link)
  @link_rules.accept?(link)
end

#visit_links ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match the links to visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The link patterns to visit.

Since:

  • 0.2.4



# File 'lib/spidr/filters.rb', line 161

def visit_links
  @link_rules.accept
end

#visit_links_like(pattern = nil) {|link| ... } ⇒ Object

Adds a given pattern to the #visit_links.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match links with.

Yields:

  • (link)

    If a block is given, it will be used to filter links.

Yield Parameters:

  • link (String)

    A link to accept or reject.

Since:

  • 0.2.4



# File 'lib/spidr/filters.rb', line 179

def visit_links_like(pattern=nil,&block)
  if pattern
    visit_links << pattern
  elsif block
    visit_links << block
  end

  return self
end

#visit_port?(port) ⇒ Boolean (protected)

Determines if a given port should be visited.

Parameters:

  • port (Integer)

    The port number.

Returns:

  • (Boolean)

    Specifies whether the given port should be visited.



# File 'lib/spidr/filters.rb', line 494

def visit_port?(port)
  @port_rules.accept?(port)
end

#visit_ports ⇒ Array<Integer, Regexp, Proc>

Specifies the patterns that match the ports to visit.

Returns:

  • (Array<Integer, Regexp, Proc>)

    The port patterns to visit.



# File 'lib/spidr/filters.rb', line 95

def visit_ports
  @port_rules.accept
end

#visit_ports_like(pattern = nil) {|port| ... } ⇒ Object

Adds a given pattern to the #visit_ports.

Parameters:

  • pattern (Integer, Regexp) (defaults to: nil)

    The pattern to match ports with.

Yields:

  • (port)

    If a block is given, it will be used to filter ports.

Yield Parameters:

  • port (Integer)

    A port to accept or reject.



# File 'lib/spidr/filters.rb', line 111

def visit_ports_like(pattern=nil,&block)
  if pattern
    visit_ports << pattern
  elsif block
    visit_ports << block
  end

  return self
end

#visit_scheme?(scheme) ⇒ Boolean (protected)

Determines if a given URI scheme should be visited.

Parameters:

  • scheme (String)

    The URI scheme.

Returns:

  • (Boolean)

    Specifies whether the given scheme should be visited.



# File 'lib/spidr/filters.rb', line 464

def visit_scheme?(scheme)
  if scheme
    return @schemes.include?(scheme)
  else
    return true
  end
end
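
The nil branch matters for links that carry no scheme, such as relative paths. The following sketch restates the method's decision as a standalone lambda (`visit_scheme` and the local `schemes` list are illustrative names, with the list set to the documented default) and feeds it the scheme of a relative URI.

```ruby
require 'uri'

schemes = ['http', 'https'] # the documented default list

# Same decision as the method body, written as a lambda.
visit_scheme = ->(scheme) { scheme ? schemes.include?(scheme) : true }

https_ok    = visit_scheme.call('https')                       # listed scheme
ftp_ok      = visit_scheme.call('ftp')                         # unlisted scheme
relative_ok = visit_scheme.call(URI('/relative/path').scheme)  # scheme is nil here
```

A listed scheme and a missing scheme both pass, while an unlisted scheme such as ftp does not.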

#visit_url?(link) ⇒ Boolean (protected)

Determines if a given URL should be visited.

Parameters:

  • link (URI::HTTP, URI::HTTPS)

    The URL.

Returns:

  • (Boolean)

    Specifies whether the given URL should be visited.

Since:

  • 0.2.4



# File 'lib/spidr/filters.rb', line 522

def visit_url?(link)
  @url_rules.accept?(link)
end

#visit_urls ⇒ Array<String, Regexp, Proc>

Specifies the patterns that match the URLs to visit.

Returns:

  • (Array<String, Regexp, Proc>)

    The URL patterns to visit.

Since:

  • 0.2.4



# File 'lib/spidr/filters.rb', line 229

def visit_urls
  @url_rules.accept
end

#visit_urls_like(pattern = nil) {|url| ... } ⇒ Object

Adds a given pattern to the #visit_urls.

Parameters:

  • pattern (String, Regexp) (defaults to: nil)

    The pattern to match URLs with.

Yields:

  • (url)

    If a block is given, it will be used to filter URLs.

Yield Parameters:

  • url (URI::HTTP, URI::HTTPS)

    A URL to accept or reject.

Since:

  • 0.2.4



# File 'lib/spidr/filters.rb', line 247

def visit_urls_like(pattern=nil,&block)
  if pattern
    visit_urls << pattern
  elsif block
    visit_urls << block
  end

  return self
end