Module: Spidr::Filters
- Included in:
- Agent
- Defined in:
- lib/spidr/filters.rb
Overview
Instance Attribute Summary
-
#schemes ⇒ Object
List of acceptable URL schemes to follow.
Instance Method Summary
-
#ignore_exts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match URI path extensions to not visit.
-
#ignore_exts_like(pattern = nil) {|ext| ... } ⇒ Object
Adds a given pattern to the #ignore_exts.
-
#ignore_hosts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match host-names to not visit.
-
#ignore_hosts_like(pattern = nil) {|host| ... } ⇒ Object
Adds a given pattern to the #ignore_hosts.
-
#ignore_links ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match links to not visit.
-
#ignore_links_like(pattern = nil) {|link| ... } ⇒ Object
Adds a given pattern to the #ignore_links.
-
#ignore_ports ⇒ Array<Integer, Regexp, Proc>
Specifies the patterns that match ports to not visit.
-
#ignore_ports_like(pattern = nil) {|port| ... } ⇒ Object
Adds a given pattern to the #ignore_ports.
-
#ignore_urls ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match URLs to not visit.
-
#ignore_urls_like(pattern = nil) {|url| ... } ⇒ Object
Adds a given pattern to the #ignore_urls.
-
#initialize_filters(options = {}) ⇒ Object
protected
Initializes filtering rules.
-
#visit_ext?(path) ⇒ Boolean
protected
Determines if a given URI path extension should be visited.
-
#visit_exts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the URI path extensions to visit.
-
#visit_exts_like(pattern = nil) {|ext| ... } ⇒ Object
Adds a given pattern to the #visit_exts.
-
#visit_host?(host) ⇒ Boolean
protected
Determines if a given host-name should be visited.
-
#visit_hosts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match host-names to visit.
-
#visit_hosts_like(pattern = nil) {|host| ... } ⇒ Object
Adds a given pattern to the #visit_hosts.
-
#visit_link?(link) ⇒ Boolean
protected
Determines if a given link should be visited.
-
#visit_links ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the links to visit.
-
#visit_links_like(pattern = nil) {|link| ... } ⇒ Object
Adds a given pattern to the #visit_links.
-
#visit_port?(port) ⇒ Boolean
protected
Determines if a given port should be visited.
-
#visit_ports ⇒ Array<Integer, Regexp, Proc>
Specifies the patterns that match the ports to visit.
-
#visit_ports_like(pattern = nil) {|port| ... } ⇒ Object
Adds a given pattern to the #visit_ports.
-
#visit_scheme?(scheme) ⇒ Boolean
protected
Determines if a given URI scheme should be visited.
-
#visit_url?(link) ⇒ Boolean
protected
Determines if a given URL should be visited.
-
#visit_urls ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the URLs to visit.
-
#visit_urls_like(pattern = nil) {|url| ... } ⇒ Object
Adds a given pattern to the #visit_urls.
Instance Attribute Details
#schemes ⇒ Object
List of acceptable URL schemes to follow.
# File 'lib/spidr/filters.rb', line 10
def schemes
  @schemes
end
Instance Method Details
#ignore_exts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match URI path extensions to not visit.
# File 'lib/spidr/filters.rb', line 331
def ignore_exts
  @ext_rules.reject
end
#ignore_exts_like(pattern = nil) {|ext| ... } ⇒ Object
Adds a given pattern to the #ignore_exts.
# File 'lib/spidr/filters.rb', line 347
def ignore_exts_like(pattern=nil,&block)
  if pattern
    ignore_exts << pattern
  elsif block
    ignore_exts << block
  end

  return self
end
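Every `*_like` method in this module has the same shape as `#ignore_exts_like`: it appends either the pattern argument or the block to the matching list and returns `self`. The sketch below shows what that allows, namely chained calls that mix the three pattern kinds. `ExtFilter` is a hypothetical stand-in for an object that includes `Spidr::Filters`, and a plain Array takes the place of `@ext_rules.reject`, which this page does not show.

```ruby
# Hypothetical stand-in for an object that includes Spidr::Filters.
# A plain Array replaces @ext_rules.reject (an assumption for illustration).
class ExtFilter
  attr_reader :ignore_exts

  def initialize
    @ignore_exts = []
  end

  # Same logic as the documented method: store the pattern if one is given,
  # otherwise store the block, then return self so calls can be chained.
  def ignore_exts_like(pattern = nil, &block)
    if pattern
      ignore_exts << pattern
    elsif block
      ignore_exts << block
    end

    self
  end
end

filter = ExtFilter.new
filter.ignore_exts_like('pdf')
      .ignore_exts_like(/\Ajpe?g\z/)
      .ignore_exts_like { |ext| ext.length > 4 }

p filter.ignore_exts.map(&:class)  # => [String, Regexp, Proc]
```

Note that when both a pattern and a block are passed, the `if`/`elsif` keeps only the pattern and drops the block.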
#ignore_hosts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match host-names to not visit.
# File 'lib/spidr/filters.rb', line 63
def ignore_hosts
  @host_rules.reject
end
#ignore_hosts_like(pattern = nil) {|host| ... } ⇒ Object
Adds a given pattern to the #ignore_hosts.
# File 'lib/spidr/filters.rb', line 79
def ignore_hosts_like(pattern=nil,&block)
  if pattern
    ignore_hosts << pattern
  elsif block
    ignore_hosts << block
  end

  return self
end
#ignore_links ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match links to not visit.
# File 'lib/spidr/filters.rb', line 195
def ignore_links
  @link_rules.reject
end
#ignore_links_like(pattern = nil) {|link| ... } ⇒ Object
Adds a given pattern to the #ignore_links.
# File 'lib/spidr/filters.rb', line 211
def ignore_links_like(pattern=nil,&block)
  if pattern
    ignore_links << pattern
  elsif block
    ignore_links << block
  end

  return self
end
#ignore_ports ⇒ Array<Integer, Regexp, Proc>
Specifies the patterns that match ports to not visit.
# File 'lib/spidr/filters.rb', line 127
def ignore_ports
  @port_rules.reject
end
#ignore_ports_like(pattern = nil) {|port| ... } ⇒ Object
Adds a given pattern to the #ignore_ports.
# File 'lib/spidr/filters.rb', line 143
def ignore_ports_like(pattern=nil,&block)
  if pattern
    ignore_ports << pattern
  elsif block
    ignore_ports << block
  end

  return self
end
#ignore_urls ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match URLs to not visit.
# File 'lib/spidr/filters.rb', line 265
def ignore_urls
  @url_rules.reject
end
#ignore_urls_like(pattern = nil) {|url| ... } ⇒ Object
Adds a given pattern to the #ignore_urls.
# File 'lib/spidr/filters.rb', line 283
def ignore_urls_like(pattern=nil,&block)
  if pattern
    ignore_urls << pattern
  elsif block
    ignore_urls << block
  end

  return self
end
#initialize_filters(options = {}) ⇒ Object (protected)
Initializes filtering rules.
# File 'lib/spidr/filters.rb', line 402
def initialize_filters(options={})
  @schemes = []

  if options[:schemes]
    @schemes += options[:schemes]
  else
    @schemes << 'http'

    begin
      require 'net/https'

      @schemes << 'https'
    rescue Gem::LoadError => e
      raise(e)
    rescue ::LoadError
      STDERR.puts "Warning: cannot load 'net/https', https support disabled"
    end
  end

  @host_rules = Rules.new(
    :accept => options[:hosts],
    :reject => options[:ignore_hosts]
  )
  @port_rules = Rules.new(
    :accept => options[:ports],
    :reject => options[:ignore_ports]
  )
  @link_rules = Rules.new(
    :accept => options[:links],
    :reject => options[:ignore_links]
  )
  @url_rules = Rules.new(
    :accept => options[:urls],
    :reject => options[:ignore_urls]
  )
  @ext_rules = Rules.new(
    :accept => options[:exts],
    :reject => options[:ignore_exts]
  )

  if options[:host]
    visit_hosts_like(options[:host])
  end

  if options[:queue]
    self.queue = options[:queue]
  end

  if options[:history]
    self.history = options[:history]
  end
end
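The scheme-selection branch of `#initialize_filters` is easy to misread: an explicit `:schemes` option replaces the defaults rather than extending them. The following sketch isolates that branch as a standalone method so the behavior can be checked on its own. The name `select_schemes` is hypothetical and not part of Spidr; it uses `warn` where the listing uses `STDERR.puts`, which is equivalent for this purpose.

```ruby
# Hypothetical helper that isolates the scheme-selection logic of
# initialize_filters. The name is an assumption made for illustration.
def select_schemes(options = {})
  schemes = []

  if options[:schemes]
    # An explicit list is used as-is; 'http' and 'https' are NOT added.
    schemes += options[:schemes]
  else
    schemes << 'http'

    begin
      # 'https' is only offered when the standard library's TLS support loads.
      require 'net/https'
      schemes << 'https'
    rescue Gem::LoadError => e
      raise(e)
    rescue ::LoadError
      warn "Warning: cannot load 'net/https', https support disabled"
    end
  end

  schemes
end

p select_schemes(schemes: ['ftp'])  # => ["ftp"]
p select_schemes                    # 'http', plus 'https' when net/https loads
```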
#visit_ext?(path) ⇒ Boolean (protected)
Determines if a given URI path extension should be visited.
# File 'lib/spidr/filters.rb', line 535
def visit_ext?(path)
  @ext_rules.accept?(File.extname(path)[1..-1])
end
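The value that reaches the extension rules is `File.extname(path)[1..-1]`, that is, the extension without its leading dot. A short sketch of what that expression produces for a few paths; `ext_of` is a hypothetical helper name used only here.

```ruby
# Hypothetical helper wrapping the expression used inside visit_ext?.
def ext_of(path)
  # File.extname returns ".html"; [1..-1] drops the leading dot.
  File.extname(path)[1..-1]
end

p ext_of('/docs/index.html')           # => "html"
p ext_of('/downloads/archive.tar.gz')  # => "gz"
p ext_of('/about')                     # => nil
```

Two consequences follow: patterns should be written without a dot (`'pdf'`, not `'.pdf'`), and a path with no extension yields `nil`, so a block given to `#visit_exts_like` or `#ignore_exts_like` may need to handle `nil`.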
#visit_exts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the URI path extensions to visit.
# File 'lib/spidr/filters.rb', line 299
def visit_exts
  @ext_rules.accept
end
#visit_exts_like(pattern = nil) {|ext| ... } ⇒ Object
Adds a given pattern to the #visit_exts.
# File 'lib/spidr/filters.rb', line 315
def visit_exts_like(pattern=nil,&block)
  if pattern
    visit_exts << pattern
  elsif block
    visit_exts << block
  end

  return self
end
#visit_host?(host) ⇒ Boolean (protected)
Determines if a given host-name should be visited.
# File 'lib/spidr/filters.rb', line 481
def visit_host?(host)
  @host_rules.accept?(host)
end
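All of the `visit_*?` predicates delegate to a `Rules` object built from an accept list and a reject list, but `Rules` itself is not documented on this page. The sketch below is one plausible reading of how such an object could decide, written as a hypothetical `SketchRules` class. The policy (a non-empty accept list acts as an allow-list, otherwise the reject list is consulted) and the matching rules for `String`, `Regexp`, and `Proc` are assumptions, not a copy of the library's implementation.

```ruby
# Hypothetical stand-in for Spidr::Rules. The real class is not shown on
# this page, so the semantics below are assumed for illustration only.
class SketchRules
  attr_reader :accept, :reject

  def initialize(accept: nil, reject: nil)
    @accept = Array(accept)
    @reject = Array(reject)
  end

  # Assumed policy: if any accept patterns exist, data must match one of
  # them; otherwise data passes unless a reject pattern matches it.
  def accept?(data)
    if @accept.empty?
      @reject.none? { |rule| match?(data, rule) }
    else
      @accept.any? { |rule| match?(data, rule) }
    end
  end

  private

  # Procs are called, Regexps are matched against the string form,
  # and anything else is compared with ==.
  def match?(data, rule)
    case rule
    when Proc   then rule.call(data) == true
    when Regexp then !(data.to_s =~ rule).nil?
    else             data == rule
    end
  end
end

hosts = SketchRules.new(accept: ['example.com', /\.example\.org\z/])
p hosts.accept?('example.com')      # => true
p hosts.accept?('www.example.org')  # => true
p hosts.accept?('other.net')        # => false

blocked = SketchRules.new(reject: [->(host) { host.end_with?('.internal') }])
p blocked.accept?('db.internal')    # => false
p blocked.accept?('example.com')    # => true
```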
#visit_hosts ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match host-names to visit.
# File 'lib/spidr/filters.rb', line 31
def visit_hosts
  @host_rules.accept
end
#visit_hosts_like(pattern = nil) {|host| ... } ⇒ Object
Adds a given pattern to the #visit_hosts.
# File 'lib/spidr/filters.rb', line 47
def visit_hosts_like(pattern=nil,&block)
  if pattern
    visit_hosts << pattern
  elsif block
    visit_hosts << block
  end

  return self
end
#visit_link?(link) ⇒ Boolean (protected)
Determines if a given link should be visited.
# File 'lib/spidr/filters.rb', line 507
def visit_link?(link)
  @link_rules.accept?(link)
end
#visit_links ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the links to visit.
# File 'lib/spidr/filters.rb', line 161
def visit_links
  @link_rules.accept
end
#visit_links_like(pattern = nil) {|link| ... } ⇒ Object
Adds a given pattern to the #visit_links.
# File 'lib/spidr/filters.rb', line 179
def visit_links_like(pattern=nil,&block)
  if pattern
    visit_links << pattern
  elsif block
    visit_links << block
  end

  return self
end
#visit_port?(port) ⇒ Boolean (protected)
Determines if a given port should be visited.
# File 'lib/spidr/filters.rb', line 494
def visit_port?(port)
  @port_rules.accept?(port)
end
#visit_ports ⇒ Array<Integer, Regexp, Proc>
Specifies the patterns that match the ports to visit.
# File 'lib/spidr/filters.rb', line 95
def visit_ports
  @port_rules.accept
end
#visit_ports_like(pattern = nil) {|port| ... } ⇒ Object
Adds a given pattern to the #visit_ports.
# File 'lib/spidr/filters.rb', line 111
def visit_ports_like(pattern=nil,&block)
  if pattern
    visit_ports << pattern
  elsif block
    visit_ports << block
  end

  return self
end
#visit_scheme?(scheme) ⇒ Boolean (protected)
Determines if a given URI scheme should be visited.
# File 'lib/spidr/filters.rb', line 464
def visit_scheme?(scheme)
  if scheme
    return @schemes.include?(scheme)
  else
    return true
  end
end
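`#visit_scheme?` treats a missing scheme as acceptable and only checks the list when a scheme is present. A likely case for a `nil` scheme is a relative URL, although this page does not say so; the sketch below shows the check as a standalone method fed from the standard library's `URI` parser. `scheme_allowed?` is a hypothetical name, and it takes the scheme list as an argument instead of reading `@schemes`.

```ruby
require 'uri'

# Hypothetical standalone version of visit_scheme?; the schemes list is
# passed in explicitly instead of being read from @schemes.
def scheme_allowed?(schemes, scheme)
  # A nil scheme is always allowed; otherwise it must be in the list.
  scheme ? schemes.include?(scheme) : true
end

schemes = ['http', 'https']

p scheme_allowed?(schemes, URI('https://example.com/').scheme)    # => true
p scheme_allowed?(schemes, URI('mailto:user@example.com').scheme) # => false
p scheme_allowed?(schemes, URI('/about').scheme)                  # => true (scheme is nil)
```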
#visit_url?(link) ⇒ Boolean (protected)
Determines if a given URL should be visited.
# File 'lib/spidr/filters.rb', line 522
def visit_url?(link)
  @url_rules.accept?(link)
end
#visit_urls ⇒ Array<String, Regexp, Proc>
Specifies the patterns that match the URLs to visit.
# File 'lib/spidr/filters.rb', line 229
def visit_urls
  @url_rules.accept
end
#visit_urls_like(pattern = nil) {|url| ... } ⇒ Object
Adds a given pattern to the #visit_urls.
# File 'lib/spidr/filters.rb', line 247
def visit_urls_like(pattern=nil,&block)
  if pattern
    visit_urls << pattern
  elsif block
    visit_urls << block
  end

  return self
end