Class: WWMD::Spider

Inherits:
Object
Defined in:
lib/wwmd/page/spider.rb

Overview

When a WWMD::Page object is created, it creates its own WWMD::Spider object, which can be accessed through page.spider. The page.set_data method calls page.spider.add with the current URL and the list of links scraped from the page. This class doesn’t do any real heavy lifting: it only keeps track of queued and visited URLs.

A simple spider can be written just by calling page.spider.next repeatedly until the queue is empty.
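The crawl loop described above can be sketched without the WWMD gem. Everything here is an assumption made for illustration: `SketchSpider` is a hypothetical stand-in that copies only the queue logic shown on this page, and the `site` hash is a made-up link graph that replaces real page fetches.

```ruby
# Hypothetical stand-in for WWMD::Spider: only the queue logic documented here.
class SketchSpider
  attr_reader :queued, :visited

  def initialize
    @queued  = []
    @visited = []
  end

  # Mirrors #add: record the visited url, then queue links not seen before.
  def add(url, links = [])
    return nil if @visited.include?(url)
    @visited.push(url)
    links.each do |l|
      @queued.push(l) unless @visited.include?(l) || @queued.include?(l)
    end
    nil
  end

  # Mirrors #get_next / #next: remove and return the head of the queue.
  def next
    @queued.shift
  end
end

# Visit `start`, then keep pulling urls from the spider until the queue is empty.
# `site` maps each url to the links "scraped" from it (made-up data).
def crawl(spider, site, start)
  url = start
  while url
    spider.add(url, site.fetch(url, []))
    url = spider.next
  end
  spider.visited
end

site = { "/" => ["/a", "/b"], "/a" => ["/b", "/c"], "/b" => ["/"], "/c" => [] }
p crawl(SketchSpider.new, site, "/")
# => ["/", "/a", "/b", "/c"]
```

With a real WWMD::Page, the `site.fetch` step would presumably be replaced by fetching the page, since page.set_data is what calls page.spider.add.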

Constant Summary

DEFAULT_IGNORE =
[
  /logoff/i,
  /logout/i,
]

Constructor Details

#initialize(opts = {}, ignore = nil) ⇒ Spider

Pass in opts and an array of regexps to ignore. If no ignore list is given, a set of sane(ish) defaults (DEFAULT_IGNORE) is used.



# File 'lib/wwmd/page/spider.rb', line 26

def initialize(opts={},ignore=nil)
  @opts    = opts
  @visited = []
  @queued  = []
  @local_only = true
  @csrf_token = nil
  if !opts[:spider_local_only].nil?
    @local_only = opts[:spider_local_only]
  end
  @ignore = ignore || DEFAULT_IGNORE
end
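As a rough illustration of how the two constructor arguments are resolved, the following sketch reproduces the defaulting logic in a free-standing method. `resolve_spider_settings` is a hypothetical name used only for this example; it is not part of WWMD.

```ruby
# Hypothetical helper mirroring how #initialize resolves opts and ignore.
def resolve_spider_settings(opts = {}, ignore = nil)
  default_ignore = [/logoff/i, /logout/i]   # same patterns as DEFAULT_IGNORE
  # local_only stays true unless :spider_local_only is given explicitly.
  local_only = opts[:spider_local_only].nil? ? true : opts[:spider_local_only]
  { local_only: local_only, ignore: ignore || default_ignore }
end

p resolve_spider_settings
p resolve_spider_settings({ spider_local_only: false }, [/signout/i])
```

Note that an explicit `false` for :spider_local_only is honored, because the check is for nil rather than for truthiness.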

Instance Attribute Details

#bypass ⇒ Object

Returns the value of attribute bypass.



# File 'lib/wwmd/page/spider.rb', line 13

def bypass
  @bypass
end

#csrf_token ⇒ Object

Returns the value of attribute csrf_token.



# File 'lib/wwmd/page/spider.rb', line 17

def csrf_token
  @csrf_token
end

#ignore ⇒ Object

Returns the value of attribute ignore.



# File 'lib/wwmd/page/spider.rb', line 16

def ignore
  @ignore
end

#local_only ⇒ Object

Returns the value of attribute local_only.



# File 'lib/wwmd/page/spider.rb', line 14

def local_only
  @local_only
end

#opts ⇒ Object (readonly)

Returns the value of attribute opts.



# File 'lib/wwmd/page/spider.rb', line 15

def opts
  @opts
end

#queued ⇒ Object

Returns the value of attribute queued.



# File 'lib/wwmd/page/spider.rb', line 11

def queued
  @queued
end

#visited ⇒ Object

Returns the value of attribute visited.



# File 'lib/wwmd/page/spider.rb', line 12

def visited
  @visited
end

Instance Method Details

#_check_ignore(url) ⇒ Object



# File 'lib/wwmd/page/spider.rb', line 122

def _check_ignore(url)
  @ignore.each { |x| return true if (url =~ x) }
  return false
end
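The ignore check is a plain "does any pattern match" test. A minimal stand-alone sketch follows; `ignored?` is a hypothetical name, and the example.test URLs are made up.

```ruby
# Stand-alone version of the ignore check: true if any pattern matches url.
def ignored?(url, patterns = [/logoff/i, /logout/i])
  patterns.any? { |re| re.match?(url) }
end

puts ignored?("http://example.test/Logout.aspx")   # prints "true" (patterns are case-insensitive)
puts ignored?("http://example.test/account")       # prints "false"
```

Because #set_ignore replaces the list outright, a custom list has to repeat the logoff and logout patterns if those should still be skipped.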

#_de_csrf(url) ⇒ Object



# File 'lib/wwmd/page/spider.rb', line 113

def _de_csrf(url)
  return url if @csrf_token.nil?
  act,params = url.clopa
  form = params.to_form
  return url if !form.has_key?(@csrf_token)
  form[@csrf_token] = ''
  url = act + form.to_get
end
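The method above relies on helpers (`clopa`, `to_form`, `to_get`) that are defined elsewhere in WWMD and are not shown on this page. The same idea, blanking the CSRF token value so that links differing only in their token compare equal, can be sketched with Ruby's standard `uri` library instead. `strip_token` is a hypothetical name, and the exact output format of the WWMD helpers may differ from what this sketch produces.

```ruby
require "uri"

# Blank the value of `token_name` in the query string of `url`.
# Returns url unchanged when there is no token name, no query, or no such key.
def strip_token(url, token_name)
  return url if token_name.nil?
  base, query = url.split("?", 2)
  return url if query.nil?
  pairs = URI.decode_www_form(query)
  return url unless pairs.any? { |k, _| k == token_name }
  blanked = pairs.map { |k, v| [k, k == token_name ? "" : v] }
  "#{base}?#{URI.encode_www_form(blanked)}"
end

a = strip_token("http://example.test/edit?id=7&csrf=abc123", "csrf")
b = strip_token("http://example.test/edit?id=7&csrf=zzz999", "csrf")
puts a        # prints "http://example.test/edit?id=7&csrf="
puts a == b   # prints "true"
```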

#add(url = '', links = []) ⇒ Object

Mark url as visited and push each of its links onto the queue. Does nothing if url has already been visited.



# File 'lib/wwmd/page/spider.rb', line 99

def add(url='',links=[])
  return nil if @visited.include?(url)
  @visited.push(url)
  links.each { |l| self.push_url l }
  nil
end

#get_last(url) ⇒ Object

Get the last URL we visited? This doesn’t look right: as written, it returns the last visited entry that does not match url.



# File 'lib/wwmd/page/spider.rb', line 69

def get_last(url)
  tmp =  @visited.reject { |v| v =~ /#{url}/ }
  return tmp[-1]
end
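One reason the result may look wrong is that url is interpolated into a regexp without escaping, so characters such as `?` and `.` are treated as regexp syntax. The sketch below shows the difference; `last_not_matching` and its `escape:` flag are hypothetical, introduced only for this illustration.

```ruby
# Return the last entry of `visited` that does not match `url`.
# With escape: false the url is interpolated raw, as in #get_last.
def last_not_matching(visited, url, escape: false)
  pattern = escape ? /#{Regexp.escape(url)}/ : /#{url}/
  visited.reject { |v| v =~ pattern }.last
end

visited = ["http://example.test/", "http://example.test/p?id=1"]
current = "http://example.test/p?id=1"

# Raw interpolation: "p?" means "optional p", so the url fails to match itself.
puts last_not_matching(visited, current)                 # prints "http://example.test/p?id=1"
# Escaped: the url matches itself and is rejected.
puts last_not_matching(visited, current, escape: true)   # prints "http://example.test/"
```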

#get_next ⇒ Object Also known as: next

Remove and return the next URL in the queue (nil when the queue is empty).



# File 'lib/wwmd/page/spider.rb', line 57

def get_next
  queued.shift
end

#next? ⇒ Boolean

Are there more elements in the queue?

Returns:

  • (Boolean)


# File 'lib/wwmd/page/spider.rb', line 64

def next?
  !queued.empty?
end

#push_url(url) ⇒ Object Also known as: push

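Push a URL onto the queue. Returns false without queueing when the URL matches the ignore list, is not local (when local_only is set), or has already been visited or queued; returns true otherwise.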



# File 'lib/wwmd/page/spider.rb', line 39

def push_url(url)
  return false if _check_ignore(url)
  if @local_only
    return false if !(url =~ /#{@opts[:base_url]}/)
  end
  return false if (@visited.include?(url) or @queued.include?(url))
  @queued.push(url)
  true
end
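The filter chain above can be exercised in isolation. In this sketch `enqueue` is a hypothetical free-standing version of the method that takes the spider's state as keyword arguments, and the example.test and other.test URLs are made up.

```ruby
# Hypothetical stand-alone version of the checks in #push_url.
def enqueue(url, queued:, visited:, base_url:, ignore: [/logoff/i, /logout/i], local_only: true)
  return false if ignore.any? { |re| re.match?(url) }          # ignore list
  return false if local_only && !url.match?(/#{base_url}/)     # off-site link
  return false if visited.include?(url) || queued.include?(url) # already seen
  queued.push(url)
  true
end

queued  = []
visited = ["http://example.test/"]
state   = { queued: queued, visited: visited, base_url: "http://example.test" }

puts enqueue("http://example.test/a", **state)       # prints "true"
puts enqueue("http://example.test/a", **state)       # prints "false" (already queued)
puts enqueue("http://other.test/x", **state)         # prints "false" (not local)
puts enqueue("http://example.test/logout", **state)  # prints "false" (ignored)
puts enqueue("http://example.test/", **state)        # prints "false" (already visited)
```

One consequence of interpolating the base URL into a regexp: if the base URL is nil, the pattern is empty and matches every string, so the local-only check filters nothing.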

#set_ignore(arr) ⇒ Object

Set up the ignore list. The ignore list is an array of Regexp objects. Remember to set this up before calling any Page methods.



# File 'lib/wwmd/page/spider.rb', line 109

def set_ignore(arr)
  @ignore = arr
end

#show_queue(id = nil) ⇒ Object Also known as: q

Print the current queue, or return the entry in the queue at [id].



# File 'lib/wwmd/page/spider.rb', line 87

def show_queue(id=nil)
  if id.nil?
    @queued.each_index { |i| putx i.to_s + " :: " + @queued[i].to_s }
    return nil
  else
    return @queued[id]
  end
end

#show_visited(id = nil) ⇒ Object Also known as: v

Print the visited list, or return the entry in the list at [id].



# File 'lib/wwmd/page/spider.rb', line 75

def show_visited(id=nil)
  if id.nil?
    @visited.each_index { |i| putx i.to_s + " :: " + @visited[i].to_s }
    return nil
  else
    return @visited[id]
  end
end

#skip(tim = 1) ⇒ Object

Skip (discard) the next tim items in the queue.



# File 'lib/wwmd/page/spider.rb', line 51

def skip(tim=1)
  tim.times { |i| @queued.shift }
  true
end