Class: Scrapes::Session
- Inherits:
- Object
- Defined in:
- lib/scrapes/session.rb
Overview
Session processes a group of web pages under a single session. This is needed when a web site requires you to log in, or otherwise to establish a session ID in a cookie, before you can continue processing pages.
Instance Attribute Summary
- #base_uris ⇒ Object (readonly)
  Returns the value of attribute base_uris.
- #cookies ⇒ Object
  Returns the value of attribute cookies.
- #crawler ⇒ Object (readonly)
  Returns the value of attribute crawler.
- #log ⇒ Object (readonly)
  Returns the value of attribute log.
- #post ⇒ Object
  Returns the value of attribute post.
- #timeout ⇒ Object
  Returns the value of attribute timeout.
- #uri ⇒ Object
  Returns the value of attribute uri.
Class Method Summary
- .from_get(uri, &block) ⇒ Object
  Start a session using an HTTP GET.
- .from_post(uri, post, &block) ⇒ Object
  Start a session using an HTTP POST.
- .start(log = nil, &block) ⇒ Object
  Start a session without having to create a session with the web site first.
Instance Method Summary
- #absolute_uri(uri) ⇒ Object
  Convert a relative URI to an absolute URI.
- #fetch(uri, post = {}) {|@crawler.fetch(u, post)| ... } ⇒ Object
  Fetch a URL within the session without processing it through a Scrapes::Page.
- #initialize(log = nil) ⇒ Session (constructor)
  Returns a new instance of Session.
- #page(page_class, link, post = {}, &block) ⇒ Object
  Process a web page.
- #refresh ⇒ Object
  Refresh the session. This is sometimes necessary when pages have been served from the cache and a later request to the real web site finds that the session has expired.
Constructor Details
#initialize(log = nil) ⇒ Session
Returns a new instance of Session.
# File 'lib/scrapes/session.rb', line 79
def initialize log = nil
  @uri = nil
  @post = {}
  @when = Time.at(0)
  @timeout = 900
  @cookies = Cookies.new
  @base_uris = []
  @crawler = Crawler.new(self)
  @crawler.log = @log = log
  @refreshing = false
end
Instance Attribute Details
#base_uris ⇒ Object (readonly)
Returns the value of attribute base_uris.
# File 'lib/scrapes/session.rb', line 52
def base_uris
  @base_uris
end
#cookies ⇒ Object
Returns the value of attribute cookies.
# File 'lib/scrapes/session.rb', line 43
def cookies
  @cookies
end
#crawler ⇒ Object (readonly)
Returns the value of attribute crawler.
# File 'lib/scrapes/session.rb', line 49
def crawler
  @crawler
end
#log ⇒ Object (readonly)
Returns the value of attribute log.
# File 'lib/scrapes/session.rb', line 34
def log
  @log
end
#post ⇒ Object
Returns the value of attribute post.
# File 'lib/scrapes/session.rb', line 37
def post
  @post
end
#timeout ⇒ Object
Returns the value of attribute timeout.
# File 'lib/scrapes/session.rb', line 40
def timeout
  @timeout
end
#uri ⇒ Object
Returns the value of attribute uri.
# File 'lib/scrapes/session.rb', line 46
def uri
  @uri
end
Class Method Details
.from_get(uri, &block) ⇒ Object
Start a session using an HTTP GET.
# File 'lib/scrapes/session.rb', line 56
def self.from_get (uri, &block)
  session = self.new
  session.uri = uri
  block ? yield(session) : session
end
.from_post(uri, post, &block) ⇒ Object
Start a session using an HTTP POST.
# File 'lib/scrapes/session.rb', line 64
def self.from_post (uri, post, &block)
  session = self.new
  session.uri = uri
  session.post = post
  block ? yield(session) : session
end
.start(log = nil, &block) ⇒ Object
Start a session without having to create a session with the web site first.
# File 'lib/scrapes/session.rb', line 73
def self.start (log=nil,&block)
  session = self.new(log)
  block ? yield(session) : session
end
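All three class methods share one calling convention: with a block, the new session is yielded and the block's value is returned; without a block, the session itself is returned. The sketch below illustrates that convention only. DemoSession is a hypothetical stand-in written for this example, not the real Scrapes::Session, and the URL is a placeholder.

```ruby
# DemoSession is a hypothetical stand-in, not the real Scrapes::Session.
class DemoSession
  attr_accessor :uri, :post

  def initialize
    @uri = nil
    @post = {}
  end

  # Same shape as Session.from_post: configure, then yield or return.
  def self.from_post(uri, post, &block)
    session = new
    session.uri = uri
    session.post = post
    block ? yield(session) : session
  end
end

# Without a block, the configured session itself is returned.
session = DemoSession.from_post("http://example.com/login", { "user" => "me" })

# With a block, the block's value is returned instead of the session.
value = DemoSession.from_post("http://example.com/login", {}) { |s| s.uri.length }
```

The block form is convenient when the session is only needed for the duration of one scraping run.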
Instance Method Details
#absolute_uri(uri) ⇒ Object
Convert a relative URI to an absolute URI, resolved against the most recently pushed base URI. If no base URI has been pushed, the URI is returned unchanged.
# File 'lib/scrapes/session.rb', line 146
def absolute_uri (uri)
  return uri if @base_uris.empty?
  base = URI.parse(@base_uris.last)
  base.merge(uri).to_s
end
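The resolution itself is done by URI#merge from Ruby's standard library. The following sketch repeats the same logic as a free-standing method so that it can be tried without the library. The method name resolve and the example URLs are made up for illustration.

```ruby
require "uri"

# Free-standing version of the same logic; base_uris plays the role of @base_uris.
def resolve(base_uris, uri)
  return uri if base_uris.empty?
  URI.parse(base_uris.last).merge(uri).to_s
end

stack = ["http://example.com/catalog/index.html"]

no_base  = resolve([], "page2.html")                 # no base yet: returned unchanged
sibling  = resolve(stack, "page2.html")              # resolved next to index.html
rooted   = resolve(stack, "/about")                  # resolved against the host root
absolute = resolve(stack, "http://other.example/x")  # already absolute: kept as given
```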
#fetch(uri, post = {}) {|@crawler.fetch(u, post)| ... } ⇒ Object
Fetch a URL within the session without processing it through a Scrapes::Page. The response is yielded to the block, so a block is required.
# File 'lib/scrapes/session.rb', line 116
def fetch (uri, post={}, &block)
  u = absolute_uri(uri)
  @base_uris.push(u)
  yield(@crawler.fetch(u, post))
  @base_uris.pop
end
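Because the resolved URL is pushed onto base_uris for the duration of the block, a nested fetch with a relative link resolves against the page currently being processed. The sketch below reproduces that push, yield, pop sequence. FakeCrawler and StackDemo are hypothetical stand-ins that perform no network access, and the seen list exists only so the example can show which URLs were requested.

```ruby
require "uri"

# FakeCrawler and StackDemo are hypothetical stand-ins; nothing is downloaded.
class FakeCrawler
  def fetch(uri, post)
    "response for #{uri}"
  end
end

class StackDemo
  attr_reader :base_uris, :seen

  def initialize
    @base_uris = []
    @crawler = FakeCrawler.new
    @seen = [] # demonstration only: records every resolved URL
  end

  def absolute_uri(uri)
    return uri if @base_uris.empty?
    URI.parse(@base_uris.last).merge(uri).to_s
  end

  # Same steps as Session#fetch: resolve, push, yield the response, pop.
  def fetch(uri, post = {})
    u = absolute_uri(uri)
    @seen << u
    @base_uris.push(u)
    yield(@crawler.fetch(u, post))
    @base_uris.pop
  end
end

demo = StackDemo.new
returned = demo.fetch("http://example.com/list/index.html") do |_res|
  # Inside the block, relative links resolve against the enclosing page.
  demo.fetch("item1.html") { |_inner| }
end
```

As the listing above is written, the return value of fetch is the URL popped from the stack rather than the block's value, and the pop is skipped if the block raises.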
#page(page_class, link, post = {}, &block) ⇒ Object
Process a web page. The link argument may be a single URI or an array of URIs. Returns the result of extracting the last page, or nil if link is nil.
# File 'lib/scrapes/session.rb', line 99
def page (page_class, link, post={}, &block)
  return if link.nil?
  link = [link] unless link.respond_to?(:to_ary)
  block ||= lambda {|data| data}
  result = nil
  link.each do |u|
    fetch(u, post) do |res|
      result = page_class.extract(res.body, u, self, &block)
    end
  end
  result
end
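The control flow of page can be tried in isolation with stand-ins. In the sketch below, DemoResponse, TitlePage, and PageDemo are all hypothetical. In particular, TitlePage.extract simply calls the block with a string, which is an assumption made for the example; the real extract of a Scrapes::Page subclass may use the block differently.

```ruby
# DemoResponse, TitlePage and PageDemo are hypothetical stand-ins; only the
# control flow of Session#page is reproduced here.
DemoResponse = Struct.new(:body)

class TitlePage
  # Assumed behaviour for this sketch: hand one string to the block.
  def self.extract(body, uri, session, &block)
    block.call("#{uri}: #{body}")
  end
end

class PageDemo
  # Simplified fetch that yields a canned response instead of downloading.
  def fetch(uri, post = {})
    yield(DemoResponse.new("body of #{uri}"))
  end

  def page(page_class, link, post = {}, &block)
    return if link.nil?
    link = [link] unless link.respond_to?(:to_ary)
    block ||= lambda { |data| data }
    result = nil
    link.each do |u|
      fetch(u, post) do |res|
        result = page_class.extract(res.body, u, self, &block)
      end
    end
    result
  end
end

demo = PageDemo.new
single = demo.page(TitlePage, "a.html")   # one link, default pass-through block
collected = []
last = demo.page(TitlePage, ["a.html", "b.html"]) do |data|
  collected << data                       # the block sees every page
  data
end
nothing = demo.page(TitlePage, nil)       # nil link: returns nil immediately
```

Since only the last result is returned, collecting inside the block is one way to keep the data from every page in an array of links.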
#refresh ⇒ Object
Refresh the session. This is sometimes necessary when pages have been served from the cache and a later request to the real web site finds that the session has expired.
# File 'lib/scrapes/session.rb', line 126
def refresh
  if !@refreshing and @uri and (Time.now - @when) > @timeout
    begin
      @refreshing = true
      @when = Time.now
      @cookies.clear
      @crawler.cache.without_cache do
        @crawler.fetch(uri, post)
      end
    ensure
      @refreshing = false
    end
  end
  self
end
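The refresh only fires when three conditions hold: no refresh is already in progress, a session URI is set, and more than timeout seconds have passed since the last refresh. The sketch below isolates that timing logic. RefreshDemo is a hypothetical stand-in: the logins counter replaces the real uncached fetch, and the cookie clearing is left out.

```ruby
# RefreshDemo is a hypothetical stand-in; logins counts simulated re-fetches.
class RefreshDemo
  attr_accessor :uri, :timeout
  attr_reader :logins

  def initialize
    @uri = nil
    @timeout = 900
    @when = Time.at(0)   # epoch, so the first refresh is always overdue
    @refreshing = false
    @logins = 0
  end

  def refresh
    if !@refreshing && @uri && (Time.now - @when) > @timeout
      begin
        @refreshing = true   # guards against re-entry while the fetch runs
        @when = Time.now
        @logins += 1         # the real method fetches uri with post here
      ensure
        @refreshing = false
      end
    end
    self
  end
end

idle = RefreshDemo.new
idle.refresh                 # no uri set (as with Session.start): nothing happens

demo = RefreshDemo.new
demo.uri = "http://example.com/login"
demo.refresh                 # overdue: the session URI is fetched
demo.refresh                 # still within the timeout: skipped
demo.timeout = -1
demo.refresh                 # any elapsed time exceeds -1: fetched again
```

Because @when starts at the epoch, a session created with from_get or from_post performs its first login the first time refresh is called, while a session created with start never refreshes because its uri stays nil.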