Class: Aquanaut::Worker

Inherits:

Object


Defined in: lib/aquanaut/worker.rb

Overview

The worker contains the actual crawling procedure.

Instance Method Summary

#explore ⇒ Object

Triggers the crawling process.

#initialize(target) ⇒ Worker constructor

A new instance of Worker.

#internal?(link) ⇒ Boolean

Evaluates if a link stays in the initial domain.

#links(uri) ⇒ Array<URI>, Array<Hash>

Retrieves all links to pages and static assets from a given page.

Constructor Details

#initialize(target) ⇒ `Worker`

Returns a new instance of Worker.

# File 'lib/aquanaut/worker.rb', line 8

def initialize(target)
  uri = URI.parse(target)
  @queue = [uri]
  @domain = PublicSuffix.parse(uri.host)

  @visited = Hash.new(false)

  @agent = Mechanize.new do |agent|
    agent.open_timeout = 5
    agent.read_timeout = 5
  end
end

Instance Method Details

#explore ⇒ `Object`

Triggers the crawling process.

# File 'lib/aquanaut/worker.rb', line 23

def explore
  until @queue.empty?
    uri = @queue.shift  # dequeue
    next if @visited[uri]

    @visited[uri] = true
    puts "Visit #{uri}"

    links, assets = links(uri)
    links.each do |link|
      @queue.push(link) unless @visited[link]  # enqueue
    end

    yield uri, links, assets if block_given?
  end
end
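
The queue and visited-hash pattern in #explore amounts to a breadth-first traversal. The following is a minimal sketch of that pattern on an in-memory link graph, with no network access. `SITE` and `crawl` are hypothetical names used only for illustration and are not part of Aquanaut.

```ruby
# Hypothetical in-memory site: each page maps to the pages it links to.
SITE = {
  "/"       => ["/about", "/blog"],
  "/about"  => ["/", "/team"],
  "/blog"   => ["/about", "/blog/1"],
  "/team"   => [],
  "/blog/1" => ["/"]
}.freeze

# Breadth-first traversal mirroring the queue/visited pattern of #explore.
# Returns the pages in the order they were visited.
def crawl(start, site)
  queue = [start]
  visited = Hash.new(false)
  order = []

  until queue.empty?
    page = queue.shift          # dequeue the oldest entry
    next if visited[page]       # a page can be queued more than once

    visited[page] = true
    order << page

    links = site.fetch(page, [])
    links.each do |link|
      queue.push(link) unless visited[link]  # enqueue unseen pages
    end

    yield page, links if block_given?
  end

  order
end
```

Because a page is marked as visited only when it is dequeued, the same page can sit in the queue several times. The `next if visited[page]` guard is what prevents it from being processed twice.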

#internal?(link) ⇒ `Boolean`

Evaluates if a link stays in the initial domain.

Used to keep the crawler inside the initial domain. The decision is based on the second-level and top-level domain. If the public suffix cannot be detected, possibly because the host is invalid, the method returns true so that the link does not go unchecked.

Parameters:

link (URI) —

the link to be checked.

Returns:

(Boolean) —

whether the link is internal or not.

# File 'lib/aquanaut/worker.rb', line 105

def internal?(link)
  return true unless PublicSuffix.valid?(link.host)
  link_domain = PublicSuffix.parse(link.host)
  @domain.sld == link_domain.sld and @domain.tld == link_domain.tld
end
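
To illustrate the comparison without depending on the PublicSuffix gem, here is a simplified sketch that uses only the standard library. It assumes the last two host labels are the second-level and top-level domain, which is wrong for multi-label suffixes such as `co.uk`. The real method relies on PublicSuffix for exactly that reason. `naive_domain` and `internal_link?` are hypothetical helpers, not part of Aquanaut.

```ruby
require "uri"

# Simplified stand-in for PublicSuffix (an assumption for this sketch):
# treats the last two host labels as [second-level, top-level] domain.
# Returns nil when the host cannot be split into at least two labels.
def naive_domain(host)
  return nil if host.nil?

  labels = host.downcase.split(".")
  return nil if labels.size < 2 || labels.any?(&:empty?)

  labels.last(2)
end

# True when `link` stays in the domain of `base`. Links whose domain cannot
# be determined count as internal, mirroring the fallback in #internal?.
def internal_link?(base, link)
  link_domain = naive_domain(link.host)
  return true if link_domain.nil?

  naive_domain(base.host) == link_domain
end
```

With a base of `http://www.example.com/`, a link to `http://blog.example.com/post` counts as internal because both hosts reduce to `example` and `com`, while `http://example.org/` does not.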

#links(uri) ⇒ `Array<URI>`, `Array<Hash>`

Retrieves all links to pages and static assets from a given page.

Whether a link points to an internal or external domain cannot be decided by just examining the link’s URL. Because of possible HTTP 3xx responses, the link needs to be resolved. Hence, each link is processed through an HTTP HEAD request to retrieve the final location.
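
The method body is not shown at this point, so the following is only a sketch of the idea of resolving a link through HEAD requests. It uses `Net::HTTP` from the standard library rather than the Mechanize agent the worker creates. `http_head`, `resolve`, and the hop limit of 5 are assumptions made for illustration.

```ruby
require "net/http"
require "uri"

# Issues a single HEAD request and returns [status_code, location_header].
# Performs network I/O when called.
def http_head(uri)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.head(uri.request_uri)
  end
  [response.code.to_i, response["location"]]
end

# Follows 3xx responses for at most `limit` hops and returns the final URI.
# `head` is any callable returning [status, location]; passing it in keeps
# the redirect logic testable without network access.
def resolve(uri, head: method(:http_head), limit: 5)
  limit.times do
    status, location = head.call(uri)
    return uri unless (300..399).cover?(status) && location

    uri = URI.join(uri, location)  # the Location header may be relative
  end
  uri
end
```

Only after a link has been resolved this way does it make sense to compare its final host against the initial domain. The hop limit guards against redirect loops.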