Class: Scrapes::Session

Inherits: Object

Defined in: lib/scrapes/session.rb

Overview

Session processes a group of web pages under a single session. This is needed when a web site requires you to log in, or otherwise sets a session ID in a cookie, before it will serve further pages.

Instance Attribute Summary

#base_uris ⇒ Object readonly

Returns the value of attribute base_uris.
#cookies ⇒ Object

Returns the value of attribute cookies.
#crawler ⇒ Object readonly

Returns the value of attribute crawler.
#log ⇒ Object readonly

Returns the value of attribute log.
#post ⇒ Object

Returns the value of attribute post.
#timeout ⇒ Object

Returns the value of attribute timeout.
#uri ⇒ Object

Returns the value of attribute uri.

Class Method Summary

.from_get(uri, &block) ⇒ Object

Start a session using an HTTP GET.
.from_post(uri, post, &block) ⇒ Object

Start a session using HTTP POST.
.start(log = nil, &block) ⇒ Object

Start a session without having to create a session with the web site first.

Instance Method Summary

#absolute_uri(uri) ⇒ Object

Convert a relative URI to an absolute URI.
#fetch(uri, post = {}) {|@crawler.fetch(u, post)| ... } ⇒ Object

Fetch a URL within the session without processing it through a Scrapes::Page.
#initialize(log = nil) ⇒ Session constructor

A new instance of Session.
#page(page_class, link, post = {}, &block) ⇒ Object

Process a web page.
#refresh ⇒ Object

Refresh the session. This is sometimes necessary when pages have been served from the cache and the session has expired by the time you go to the real web site.

Constructor Details

#initialize(log = nil) ⇒ `Session`

Returns a new instance of Session.

# File 'lib/scrapes/session.rb', line 79

def initialize log = nil
  @uri = nil
  @post = {}
  @when = Time.at(0)
  @timeout = 900
  @cookies = Cookies.new
  @base_uris = []
  @crawler = Crawler.new(self)
  @crawler.log = @log = log
  @refreshing = false
end

Instance Attribute Details

#base_uris ⇒ `Object` (readonly)

Returns the value of attribute base_uris.



# File 'lib/scrapes/session.rb', line 52

def base_uris
  @base_uris
end

#cookies ⇒ `Object`

Returns the value of attribute cookies.



# File 'lib/scrapes/session.rb', line 43

def cookies
  @cookies
end

#crawler ⇒ `Object` (readonly)

Returns the value of attribute crawler.



# File 'lib/scrapes/session.rb', line 49

def crawler
  @crawler
end

#log ⇒ `Object` (readonly)

Returns the value of attribute log.



# File 'lib/scrapes/session.rb', line 34

def log
  @log
end

#post ⇒ `Object`

Returns the value of attribute post.



# File 'lib/scrapes/session.rb', line 37

def post
  @post
end

#timeout ⇒ `Object`

Returns the value of attribute timeout.



# File 'lib/scrapes/session.rb', line 40

def timeout
  @timeout
end

#uri ⇒ `Object`

Returns the value of attribute uri.



# File 'lib/scrapes/session.rb', line 46

def uri
  @uri
end

Class Method Details

.from_get(uri, &block) ⇒ `Object`

Start a session using an HTTP GET.

# File 'lib/scrapes/session.rb', line 56

def self.from_get (uri, &block)
  session = self.new
  session.uri = uri
  block ? yield(session) : session
end

.from_post(uri, post, &block) ⇒ `Object`

Start a session using HTTP POST

# File 'lib/scrapes/session.rb', line 64

def self.from_post (uri, post, &block)
  session = self.new
  session.uri = uri
  session.post = post
  block ? yield(session) : session
end

.start(log = nil, &block) ⇒ `Object`

Start a session without having to create a session with the web site first.

# File 'lib/scrapes/session.rb', line 73

def self.start (log=nil,&block)
  session = self.new(log)
  block ? yield(session) : session
end
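
All three class methods end with `block ? yield(session) : session`, so the return value depends on whether a block is given. The following is a minimal sketch of that pattern only; `SessionSketch` is a hypothetical stand-in (not the real class) and the URL and form fields are made up for illustration.

```ruby
# Hypothetical stand-in that copies only the constructor pattern of
# Scrapes::Session.from_post; it performs no HTTP requests.
class SessionSketch
  attr_accessor :uri, :post

  def initialize
    @uri = nil
    @post = {}
  end

  def self.from_post(uri, post, &block)
    session = new
    session.uri = uri
    session.post = post
    # With a block, the block's value is returned; otherwise the session is.
    block ? yield(session) : session
  end
end

# Without a block: the configured session object comes back.
def login_without_block
  SessionSketch.from_post('http://example.com/login', { 'user' => 'alice' })
end

# With a block: the caller gets whatever the block evaluates to.
def login_with_block
  SessionSketch.from_post('http://example.com/login', { 'user' => 'alice' }) do |session|
    "logged in via #{session.uri}"
  end
end
```

Under these assumptions, the block form suits scripts that do all their work inside the block, while the block-less form lets the session be stored and reused.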

Instance Method Details

#absolute_uri(uri) ⇒ `Object`

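Convert a relative URI to an absolute URI by resolving it against the last entry of base_uris (the URI currently being fetched). Returns uri unchanged when base_uris is empty.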

# File 'lib/scrapes/session.rb', line 146

def absolute_uri (uri)
  return uri if @base_uris.empty?
  base = URI.parse(@base_uris.last)
  base.merge(uri).to_s
end
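
The resolution itself is done by `URI#merge` from Ruby's standard library. As a rough illustration, the sketch below reproduces the same logic outside the class; the `resolve` helper and the example.com URLs are assumptions made for this example, not part of the library.

```ruby
require 'uri'

# Same logic as Session#absolute_uri, with the base URI stack passed in
# explicitly instead of being read from @base_uris.
def resolve(base_uris, uri)
  return uri if base_uris.empty?
  URI.parse(base_uris.last).merge(uri).to_s
end

stack = ['http://example.com/catalog/index.html']

resolve(stack, 'items/42.html')  # => "http://example.com/catalog/items/42.html"
resolve(stack, '/about.html')    # => "http://example.com/about.html"
resolve([], 'items/42.html')     # => "items/42.html" (no base, returned as-is)
```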

#fetch(uri, post = {}) {|@crawler.fetch(u, post)| ... } ⇒ `Object`

Fetch a URL in the session, but without a Scrapes::Page

Yields:

(@crawler.fetch(u, post))

# File 'lib/scrapes/session.rb', line 116

def fetch (uri, post={}, &block)
  u = absolute_uri(uri)
  @base_uris.push(u)
  yield(@crawler.fetch(u, post))
  @base_uris.pop
end
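
Because `fetch` pushes the absolute URI onto `@base_uris` before yielding and pops it afterwards, a nested `fetch` inside the block resolves relative links against the enclosing page. The sketch below illustrates this with hypothetical stand-ins: `StubCrawler` returns a string instead of making a request, and `MiniSession` copies only the two methods involved.

```ruby
require 'uri'

# Hypothetical replacement for the crawler: no network access, it just
# reports which URI it was asked to fetch.
class StubCrawler
  def fetch(uri, post = {})
    "response for #{uri}"
  end
end

# Minimal copy of the URI stack handling in Session, for illustration only.
class MiniSession
  attr_reader :base_uris

  def initialize
    @base_uris = []
    @crawler = StubCrawler.new
  end

  def absolute_uri(uri)
    return uri if @base_uris.empty?
    URI.parse(@base_uris.last).merge(uri).to_s
  end

  def fetch(uri, post = {})
    u = absolute_uri(uri)
    @base_uris.push(u)
    yield(@crawler.fetch(u, post))
    @base_uris.pop
  end
end

# Fetch a listing page, then a detail page linked relatively from it.
def nested_fetch_demo
  session = MiniSession.new
  seen = []
  session.fetch('http://example.com/list/index.html') do |outer|
    seen << outer
    session.fetch('detail/1.html') { |inner| seen << inner }
  end
  [seen, session.base_uris]
end
```

In this sketch the inner call is resolved to `http://example.com/list/detail/1.html`, and the stack is empty again once both blocks have returned.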

#page(page_class, link, post = {}, &block) ⇒ `Object`

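Process a web page. link may be a single URI or an array of URIs; each one is fetched and passed to page_class.extract, and the result for the last link is returned. Returns nil when link is nil.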

# File 'lib/scrapes/session.rb', line 99

def page (page_class, link, post={}, &block)
  return if link.nil?
  link = [link] unless link.respond_to?(:to_ary)
  block ||= lambda {|data| data}
  result = nil

  link.each do |u|
    fetch(u, post) do |res|
      result = page_class.extract(res.body, u, self, &block)
    end
  end

  result
end
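
The handling of `link` and the default block can be sketched without the rest of the library. In the example below, `TitlePage`, `Response`, and `page_sketch` are hypothetical names introduced for illustration; the real `extract` parses the page body, and the real method goes through `fetch` rather than building a fake response.

```ruby
# Hypothetical page class: hands a description of the page to the block.
class TitlePage
  def self.extract(body, uri, session, &block)
    block.call("#{uri}: #{body}")
  end
end

# Stand-in for the crawler's response object; only #body is used here.
Response = Struct.new(:body)

# Mirrors the control flow of Session#page with a fake fetch.
def page_sketch(page_class, link, &block)
  return if link.nil?
  link = [link] unless link.respond_to?(:to_ary)  # accept one link or many
  block ||= lambda { |data| data }                # default: return data as-is
  result = nil

  link.each do |u|
    res = Response.new("body of #{u}")
    result = page_class.extract(res.body, u, nil, &block)
  end

  result  # only the value for the last link survives
end

page_sketch(TitlePage, 'a.html')              # => "a.html: body of a.html"
page_sketch(TitlePage, ['a.html', 'b.html'])  # => "b.html: body of b.html"
page_sketch(TitlePage, nil)                   # => nil
```

When every page in a list matters, collecting results inside the block is the safer approach, since the return value reflects only the final link.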

#refresh ⇒ `Object`

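Refresh the session. This is sometimes necessary when pages have been served from the cache and the session has expired by the time you go to the real web site. The session URI is fetched again only if uri is set, no refresh is already in progress, and more than timeout seconds have passed since the last refresh.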

# File 'lib/scrapes/session.rb', line 126

def refresh
  if !@refreshing and @uri and (Time.now - @when) > @timeout
    begin
      @refreshing = true
      @when = Time.now
      @cookies.clear

      @crawler.cache.without_cache do
        @crawler.fetch(uri, post)
      end
    ensure
      @refreshing = false
    end
  end

  self
end
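
The guard conditions in `refresh` can be exercised in isolation. The following sketch assumes a hypothetical `RefreshSketch` class that counts refreshes instead of clearing cookies and re-fetching the session URI; the login URL is made up.

```ruby
# Hypothetical stand-in that keeps only the timing logic of Session#refresh.
class RefreshSketch
  attr_accessor :uri, :timeout
  attr_reader :logins

  def initialize
    @uri = nil
    @timeout = 900          # seconds, same default as Session
    @when = Time.at(0)      # epoch, so the first refresh is always due
    @refreshing = false
    @logins = 0
  end

  def refresh
    if !@refreshing && @uri && (Time.now - @when) > @timeout
      begin
        @refreshing = true
        @when = Time.now
        @logins += 1        # the real method clears cookies and fetches uri
      ensure
        @refreshing = false
      end
    end

    self
  end
end

def refresh_demo
  s = RefreshSketch.new
  s.refresh                          # no uri set: nothing happens
  without_uri = s.logins
  s.uri = 'http://example.com/login'
  s.refresh                          # first real refresh
  s.refresh                          # within the timeout: skipped
  [without_uri, s.logins]
end
```

In this sketch `refresh_demo` returns `[0, 1]`: sessions created with `start` (no `uri`) never refresh, and repeated calls inside the timeout window are no-ops.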
