Class: WWMD::Spider
- Inherits:
- Object
  - Object
  - WWMD::Spider
- Defined in:
- lib/wwmd/page/spider.rb
Overview
When a WWMD::Page object is created, it creates its own WWMD::Spider object, which can be accessed using page.spider.method. The page.set_data method calls page.spider.add with the current url and a list of links scraped from the page. This class doesn’t do any real heavy lifting.
A simple spider can be written just by repeatedly calling page.spider.next until the queue is empty.
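The loop described above can be sketched without a network. This is a minimal illustration, not WWMD itself: MiniSpider is a hypothetical stand-in that copies only the queue behaviour documented on this page (no ignore list, no local-only check), and the site hash is an invented link graph that replaces real page fetches.

```ruby
# Hypothetical stand-in for WWMD::Spider: queue and visited list only.
class MiniSpider
  attr_reader :queued, :visited

  def initialize
    @queued  = []
    @visited = []
  end

  # Record url as visited, then queue any of its links not seen before.
  def add(url, links = [])
    return nil if @visited.include?(url)
    @visited.push(url)
    links.each { |l| push_url(l) }
    nil
  end

  def push_url(url)
    return false if @visited.include?(url) || @queued.include?(url)
    @queued.push(url)
    true
  end

  def next?
    !@queued.empty?
  end

  def get_next
    @queued.shift
  end
end

# Invented link graph standing in for fetched pages (assumption).
site = {
  "/"  => ["/a", "/b"],
  "/a" => ["/b", "/c"],
  "/b" => ["/"],
  "/c" => []
}

spider = MiniSpider.new
url = "/"
loop do
  spider.add(url, site.fetch(url, []))  # the step page.set_data performs
  break unless spider.next?
  url = spider.get_next
end

puts spider.visited.inspect  # ["/", "/a", "/b", "/c"]
```

Each url is visited once even though "/b" and "/" are linked from more than one page, because push_url refuses anything already visited or queued.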
Constant Summary
- DEFAULT_IGNORE =
[ /logoff/i, /logout/i, ]
Instance Attribute Summary
-
#bypass ⇒ Object
Returns the value of attribute bypass.
-
#csrf_token ⇒ Object
Returns the value of attribute csrf_token.
-
#ignore ⇒ Object
Returns the value of attribute ignore.
-
#local_only ⇒ Object
Returns the value of attribute local_only.
-
#opts ⇒ Object
readonly
Returns the value of attribute opts.
-
#queued ⇒ Object
Returns the value of attribute queued.
-
#visited ⇒ Object
Returns the value of attribute visited.
Instance Method Summary
- #_check_ignore(url) ⇒ Object
- #_de_csrf(url) ⇒ Object
-
#add(url = '', links = []) ⇒ Object
mark url as visited and add its links to the queue.
-
#get_last(url) ⇒ Object
get the last url we visited? this doesn’t look right.
-
#get_next ⇒ Object
(also: #next)
get the next url in the queue.
-
#initialize(opts = {}, ignore = nil) ⇒ Spider
constructor
pass me opts and an array of regexps to ignore. if no ignore list is given, a set of sane(ish) defaults is used.
-
#next? ⇒ Boolean
more elements in the queue?
-
#push_url(url) ⇒ Object
(also: #push)
push a url onto the queue.
-
#set_ignore(arr) ⇒ Object
set up the ignore list. the ignore list is an array of regexp objects. remember to set this up before calling any Page methods.
-
#show_queue(id = nil) ⇒ Object
(also: #q)
show the current queue (or return the entry in the queue at [id]).
-
#show_visited(id = nil) ⇒ Object
(also: #v)
show the visited list (or the entry in the list at [id]).
-
#skip(tim = 1) ⇒ Object
skip items in the queue.
Constructor Details
#initialize(opts = {}, ignore = nil) ⇒ Spider
pass me opts and an array of regexps to ignore. if no ignore list is given, a set of sane(ish) defaults is used.
# File 'lib/wwmd/page/spider.rb', line 26
def initialize(opts={},ignore=nil)
  @opts = opts
  @visited = []
  @queued = []
  @local_only = true
  @csrf_token = nil
  if !opts[:spider_local_only].nil?
    @local_only = opts[:spider_local_only]
  end
  @ignore = ignore || DEFAULT_IGNORE
end
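The defaulting rules in the constructor can be read as a small function of its two arguments. The sketch below is an illustration only: resolve is a hypothetical lambda (not part of WWMD) that returns the settings the constructor would end up with, and default_ignore mirrors DEFAULT_IGNORE from the Constant Summary.

```ruby
# Mirrors DEFAULT_IGNORE as listed in the Constant Summary.
default_ignore = [/logoff/i, /logout/i]

# Hypothetical helper: reproduces the constructor's defaulting rules and
# returns the resulting settings as a hash.
resolve = lambda do |opts = {}, ignore = nil|
  local_only = true
  # Only an explicit non-nil value overrides the default of true.
  local_only = opts[:spider_local_only] unless opts[:spider_local_only].nil?
  { local_only: local_only, ignore: ignore || default_ignore }
end

defaults = resolve.call
relaxed  = resolve.call({ spider_local_only: false }, [/signout/i])

puts defaults[:local_only]  # true
puts relaxed[:local_only]   # false
puts relaxed[:ignore].inspect
```

Note that passing an ignore array replaces the defaults rather than extending them, so a custom list that should still skip logout links needs to include those patterns itself.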
Instance Attribute Details
#bypass ⇒ Object
Returns the value of attribute bypass.
# File 'lib/wwmd/page/spider.rb', line 13
def bypass
  @bypass
end
#csrf_token ⇒ Object
Returns the value of attribute csrf_token.
# File 'lib/wwmd/page/spider.rb', line 17
def csrf_token
  @csrf_token
end
#ignore ⇒ Object
Returns the value of attribute ignore.
# File 'lib/wwmd/page/spider.rb', line 16
def ignore
  @ignore
end
#local_only ⇒ Object
Returns the value of attribute local_only.
# File 'lib/wwmd/page/spider.rb', line 14
def local_only
  @local_only
end
#opts ⇒ Object (readonly)
Returns the value of attribute opts.
# File 'lib/wwmd/page/spider.rb', line 15
def opts
  @opts
end
#queued ⇒ Object
Returns the value of attribute queued.
# File 'lib/wwmd/page/spider.rb', line 11
def queued
  @queued
end
#visited ⇒ Object
Returns the value of attribute visited.
# File 'lib/wwmd/page/spider.rb', line 12
def visited
  @visited
end
Instance Method Details
#_check_ignore(url) ⇒ Object
# File 'lib/wwmd/page/spider.rb', line 122
def _check_ignore(url)
  @ignore.each { |x| return true if (url =~ x) }
  return false
end
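As a quick illustration of what this check accepts and rejects, here is a standalone sketch using the same patterns as DEFAULT_IGNORE. The ignored lambda and the example.test urls are made up for the example.

```ruby
ignore = [/logoff/i, /logout/i]  # same patterns as DEFAULT_IGNORE

# Equivalent test to _check_ignore: true when any pattern matches the url.
ignored = ->(url) { ignore.any? { |pattern| url =~ pattern } }

puts ignored.call("http://example.test/account/Logout?next=/")  # true
puts ignored.call("http://example.test/blog/signing-in")        # false
puts ignored.call("http://example.test/tools/logoutput.txt")    # true
```

The last case shows that the patterns are unanchored: any url containing "logout" anywhere is skipped, which is usually the safe direction for a spider that must not end its own session.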
#_de_csrf(url) ⇒ Object
# File 'lib/wwmd/page/spider.rb', line 113
def _de_csrf(url)
  return url if @csrf_token.nil?
  act,params = url.clopa
  form = params.to_form
  return url if !form.has_key?(@csrf_token)
  form[@csrf_token] = ''
  url = act + form.to_get
end
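The listing relies on WWMD's own String and form helpers (clopa, to_form, to_get), which are not documented on this page. The idea it appears to implement, blanking the token parameter so that urls differing only in their token look identical, can be sketched with Ruby's standard URI library instead. blank_csrf is a hypothetical name, and the exact string WWMD produces may differ from this one.

```ruby
require "uri"

# Hypothetical stand-in for _de_csrf built on the standard library:
# blank the value of the token parameter, leave everything else alone.
def blank_csrf(url, csrf_token)
  return url if csrf_token.nil?
  action, query = url.split("?", 2)
  return url if query.nil?
  params = URI.decode_www_form(query)
  return url unless params.any? { |key, _| key == csrf_token }
  params = params.map { |key, value| [key, key == csrf_token ? "" : value] }
  "#{action}?#{URI.encode_www_form(params)}"
end

a = blank_csrf("http://example.test/edit?id=7&token=abc123", "token")
b = blank_csrf("http://example.test/edit?id=7&token=zzz999", "token")

puts a
puts a == b  # true
```

With the token blanked, the two urls compare equal, so a visited or queued check based on include? would treat them as the same page.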
#add(url = '', links = []) ⇒ Object
mark url as visited and add its links to the queue
# File 'lib/wwmd/page/spider.rb', line 99
def add(url='',links=[])
  return nil if @visited.include?(url)
  @visited.push(url)
  links.each { |l| self.push_url l }
  nil
end
#get_last(url) ⇒ Object
get the last url we visited? this doesn’t look right
# File 'lib/wwmd/page/spider.rb', line 69
def get_last(url)
  tmp = @visited.reject { |v| v =~ /#{url}/ }
  return tmp[-1]
end
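One reading of the "doesn’t look right" note: the url argument is interpolated into an unanchored regexp, so every visited entry that merely contains it is thrown away. The sketch below runs the same expression against an invented visited list.

```ruby
visited = [
  "http://example.test/",
  "http://example.test/login",
  "http://example.test/profile"
]

# Same expression as get_last: drop entries matching url, keep the last survivor.
get_last = ->(url) { visited.reject { |v| v =~ /#{url}/ }[-1] }

puts get_last.call("http://example.test/profile").inspect  # "http://example.test/login"
puts get_last.call("http://example.test/").inspect         # nil
```

Asking for the page before the site root returns nil here, because the root url is a substring of every other entry and all of them are rejected.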
#get_next ⇒ Object Also known as: next
get the next url in the queue
# File 'lib/wwmd/page/spider.rb', line 57
def get_next
  queued.shift
end
#next? ⇒ Boolean
more elements in the queue?
# File 'lib/wwmd/page/spider.rb', line 64
def next?
  !queued.empty?
end
#push_url(url) ⇒ Object Also known as: push
push a url onto the queue
# File 'lib/wwmd/page/spider.rb', line 39
def push_url(url)
  return false if _check_ignore(url)
  if @local_only
    return false if !(url =~ /#{@opts[:base_url]}/)
  end
  return false if (@visited.include?(url) or @queued.include?(url))
  @queued.push(url)
  true
end
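The three filters in push_url (ignore list, local-only check, duplicate check) can be exercised in isolation. In this sketch accept? is a hypothetical function that takes the spider's state as keyword arguments instead of instance variables, and all urls are invented.

```ruby
# Hypothetical stand-in for push_url's filtering, with state passed in explicitly.
def accept?(url, base_url:, ignore:, visited:, queued:, local_only: true)
  return false if ignore.any? { |pattern| url =~ pattern }
  return false if local_only && url !~ /#{base_url}/
  return false if visited.include?(url) || queued.include?(url)
  true
end

state = {
  base_url: "http://example.test",
  ignore:   [/logoff/i, /logout/i],
  visited:  ["http://example.test/"],
  queued:   ["http://example.test/a"]
}

checks = {
  new_page:       accept?("http://example.test/b", **state),
  already_queued: accept?("http://example.test/a", **state),
  off_site:       accept?("http://other.test/b", **state),
  logout_link:    accept?("http://example.test/logout", **state),
  embedded_base:  accept?("http://other.test/?r=http://example.test", **state)
}

checks.each { |name, ok| puts "#{name}: #{ok}" }
```

The embedded_base case is accepted: the base url is matched as an unanchored regexp, so an off-site url that carries the base url in its query string passes the local-only check. An empty or missing base url interpolates to an empty pattern, which matches every url.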
#set_ignore(arr) ⇒ Object
set up the ignore list. the ignore list is an array of regexp objects. remember to set this up before calling any Page methods
# File 'lib/wwmd/page/spider.rb', line 109
def set_ignore(arr)
  @ignore = arr
end
#show_queue(id = nil) ⇒ Object Also known as: q
show the current queue (or return the entry in the queue at [id])
# File 'lib/wwmd/page/spider.rb', line 87
def show_queue(id=nil)
  if id.nil?
    @queued.each_index { |i| putx i.to_s + " :: " + @queued[i].to_s }
    return nil
  else
    return @queued[id]
  end
end
#show_visited(id = nil) ⇒ Object Also known as: v
show the visited list (or the entry in the list at [id])
# File 'lib/wwmd/page/spider.rb', line 75
def show_visited(id=nil)
  if id.nil?
    @visited.each_index { |i| putx i.to_s + " :: " + @visited[i].to_s }
    return nil
  else
    return @visited[id]
  end
end
#skip(tim = 1) ⇒ Object
skip items in the queue
# File 'lib/wwmd/page/spider.rb', line 51
def skip(tim=1)
  tim.times { |i| @queued.shift }
  true
end
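A short sketch of the same loop on a plain array shows what skip discards and how it behaves when asked to skip more entries than the queue holds. The queue contents are invented.

```ruby
queued = %w[/a /b /c /d]

# Same loop as skip(tim): shift the queue tim times, always return true.
skip = ->(tim = 1) { tim.times { queued.shift }; true }

skip.call      # drops "/a"
skip.call(2)   # drops "/b" and "/c"
after_three = queued.dup
puts after_three.inspect  # ["/d"]

# Shifting an empty array returns nil rather than raising,
# so over-skipping simply empties the queue.
skip.call(5)
puts queued.inspect       # []
```

Skipped urls are not added to the visited list, so they can be queued again if a later page links to them.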