Class: Scrapes::Session

Inherits: Object

Defined in: lib/scrapes/session.rb

Overview

Session is used to process all web pages under a single session. This may be necessary when a web site requires you to log in, or otherwise establish a session ID in a cookie, before you can continue processing pages.
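
For example, a login-based session might look like the following sketch. The login URI, form fields, and the ProductPage parser class are placeholders for illustration and are not part of this library:

Scrapes::Session.from_post('http://example.com/login',
                           'user' => 'me', 'password' => 'secret') do |session|
  # the login URI and POST data are stored on the session and replayed
  # whenever the session needs to be (re)established (see #refresh)
  session.page(ProductPage, 'http://example.com/products/1') do |product|
    puts product.inspect
  end
end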


Constructor Details

#initialize(log = nil) ⇒ Session

Returns a new instance of Session.



# File 'lib/scrapes/session.rb', line 79

def initialize log = nil
  @uri = nil
  @post = {}
  @when = Time.at(0)
  @timeout = 900
  @cookies = Cookies.new
  @base_uris = []
  @crawler = Crawler.new(self)
  @crawler.log = @log = log
  @refreshing = false
end
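
A minimal sketch of constructing a session directly and adjusting its writable attributes afterwards (the logger and values below are illustrative, and the library itself is assumed to be required already):

require 'logger'

session = Scrapes::Session.new(Logger.new($stderr))
session.timeout = 30 * 60                 # expire after 30 minutes instead of the default 900 seconds
session.uri     = 'http://example.com/'   # used by #refresh to re-establish the session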

Instance Attribute Details

#base_uris ⇒ Object (readonly)

Returns the value of attribute base_uris.



# File 'lib/scrapes/session.rb', line 52

def base_uris
  @base_uris
end

#cookies ⇒ Object

Returns the value of attribute cookies.



# File 'lib/scrapes/session.rb', line 43

def cookies
  @cookies
end

#crawler ⇒ Object (readonly)

Returns the value of attribute crawler.



# File 'lib/scrapes/session.rb', line 49

def crawler
  @crawler
end

#log ⇒ Object (readonly)

Returns the value of attribute log.



# File 'lib/scrapes/session.rb', line 34

def log
  @log
end

#post ⇒ Object

Returns the value of attribute post.



# File 'lib/scrapes/session.rb', line 37

def post
  @post
end

#timeout ⇒ Object

Returns the value of attribute timeout.



# File 'lib/scrapes/session.rb', line 40

def timeout
  @timeout
end

#uri ⇒ Object

Returns the value of attribute uri.



# File 'lib/scrapes/session.rb', line 46

def uri
  @uri
end

Class Method Details

.from_get(uri, &block) ⇒ Object

Start a session using an HTTP GET



# File 'lib/scrapes/session.rb', line 56

def self.from_get (uri, &block)
  session = self.new
  session.uri = uri
  block ? yield(session) : session
end
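
A short usage sketch with a placeholder URI. With a block the block's return value is returned; without one, the configured session itself is:

# block form: the session is yielded and the block's result is returned
Scrapes::Session.from_get('http://example.com/') do |session|
  # ... work with the session ...
end

# non-block form: the configured session is returned directly
session = Scrapes::Session.from_get('http://example.com/')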

.from_post(uri, post, &block) ⇒ Object

Start a session using an HTTP POST



# File 'lib/scrapes/session.rb', line 64

def self.from_post (uri, post, &block)
  session = self.new
  session.uri = uri
  session.post = post
  block ? yield(session) : session
end
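
As with from_get, the URI and POST data are only recorded on the session at this point; a sketch with a placeholder login form:

session = Scrapes::Session.from_post('http://example.com/login',
                                     'user' => 'me', 'password' => 'secret')
session.post   #=> {"user"=>"me", "password"=>"secret"}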

.start(log = nil, &block) ⇒ Object

Start a session without having to establish one with the web site first.



# File 'lib/scrapes/session.rb', line 73

def self.start (log=nil,&block)
  session = self.new(log)
  block ? yield(session) : session
end
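
A sketch of starting a session without an initial request, optionally passing a logger (SomePage is a hypothetical page class):

require 'logger'

Scrapes::Session.start(Logger.new($stdout)) do |session|
  session.page(SomePage, 'http://example.com/index.html')
end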

Instance Method Details

#absolute_uri(uri) ⇒ Object

Convert a relative URI to an absolute URI



# File 'lib/scrapes/session.rb', line 146

def absolute_uri (uri)
  return uri if @base_uris.empty?
  base = URI.parse(@base_uris.last)
  base.merge(uri).to_s
end
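
The base used for resolution is the most recently fetched URI (pushed by #fetch, see below); a sketch of the merge semantics:

# with @base_uris == ['http://example.com/shop/index.html']
session.absolute_uri('item?id=1')   #=> "http://example.com/shop/item?id=1"
session.absolute_uri('/cart')       #=> "http://example.com/cart"

# with an empty base stack the URI is returned untouched
session.absolute_uri('item?id=1')   #=> "item?id=1"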

#fetch(uri, post = {}) {|@crawler.fetch(u, post)| ... } ⇒ Object

Fetch a URI within the session, but without wrapping the response in a Scrapes::Page



# File 'lib/scrapes/session.rb', line 116

def fetch (uri, post={}, &block)
  u = absolute_uri(uri)
  @base_uris.push(u)
  yield(@crawler.fetch(u, post))
  @base_uris.pop
end
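
A sketch of fetching a raw response inside a session (the URIs are placeholders). The resolved URI is pushed onto the base stack for the duration of the block, so nested fetches resolve relative links against it:

session.fetch('http://example.com/sitemap.xml') do |res|
  puts res.body                            # raw crawler response, no Scrapes::Page involved
  session.fetch('page2.html') do |inner|   # resolved against the sitemap's URI
    puts inner.body
  end
end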

#page(page_class, link, post = {}, &block) ⇒ Object

Process a web page



# File 'lib/scrapes/session.rb', line 99

def page (page_class, link, post={}, &block)
  return if link.nil?
  link = [link] unless link.respond_to?(:to_ary)
  block ||= lambda {|data| data}
  result = nil

  link.each do |u|
    fetch(u, post) do |res|
      result = page_class.extract(res.body, u, self, &block)
    end
  end

  result
end
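
A sketch of processing a page, where ProductPage stands in for a page class that implements .extract (such as a Scrapes::Page subclass). A single link or an array of links may be given; without a block the extracted data is returned as-is, and with several links the result of the last one is returned:

# single link: returns whatever ProductPage.extract produces
product = session.page(ProductPage, 'http://example.com/products/42')

# several links: each is fetched and extracted in turn
session.page(ProductPage, ['http://example.com/products/1',
                           'http://example.com/products/2']) do |data|
  save(data)   # hypothetical handler for extracted data
end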

#refreshObject

Refresh the session. This is sometimes necessary when pages have been served from the cache, but the session has since expired on the live web site.



# File 'lib/scrapes/session.rb', line 126

def refresh
  if !@refreshing and @uri and (Time.now - @when) > @timeout
    begin
      @refreshing = true
      @when = Time.now
      @cookies.clear

      @crawler.cache.without_cache do
        @crawler.fetch(uri, post)
      end
    ensure
      @refreshing = false
    end
  end

  self
end
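
Since #refresh returns self it can be chained; a sketch that makes sure the session is re-established before hitting the live site again (AccountPage is a hypothetical page class):

# re-establishes the session only when a session URI is set and more than
# `timeout` seconds have passed since the session was last established;
# otherwise a no-op
session.refresh.page(AccountPage, 'http://example.com/account')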