Class: Aquanaut::Worker

Inherits: Object
Defined in: lib/aquanaut/worker.rb

Overview
The worker contains the actual crawling procedure.
Instance Method Summary

- #explore ⇒ Object
  Triggers the crawling process.
- #initialize(target) ⇒ Worker (constructor)
  A new instance of Worker.
- #internal?(link) ⇒ Boolean
  Evaluates if a link stays in the initial domain.
- #links(uri) ⇒ Array<URI>, Array<Hash>
  Retrieves all links to pages and static assets from a given page.
Constructor Details
#initialize(target) ⇒ Worker
Returns a new instance of Worker.
# File 'lib/aquanaut/worker.rb', line 8

def initialize(target)
  uri = URI.parse(target)

  @queue = [uri]
  @domain = PublicSuffix.parse(uri.host)
  @visited = Hash.new(false)

  @agent = Mechanize.new do |agent|
    agent.open_timeout = 5
    agent.read_timeout = 5
  end
end
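A minimal usage sketch, assuming the gem has been required as aquanaut; the target URL is purely illustrative:

require 'aquanaut'

# The target seeds the queue; PublicSuffix derives the domain
# ("example.com") that later confines the crawl.
worker = Aquanaut::Worker.new('http://example.com')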
Instance Method Details
#explore ⇒ Object
Triggers the crawling process.
# File 'lib/aquanaut/worker.rb', line 23

def explore
  while not @queue.empty?
    uri = @queue.shift # dequeue

    next if @visited[uri]
    @visited[uri] = true

    puts "Visit #{uri}"
    links, assets = links(uri)

    links.each do |link|
      @queue.push(link) unless @visited[link] # enqueue
    end

    yield uri, links, assets if block_given?
  end
end
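Because #explore yields each visited page, callers can collect results as the crawl proceeds. A short sketch, reusing the worker from the constructor example above; the sitemap hash is illustrative, not part of the gem:

sitemap = {}

worker.explore do |uri, links, assets|
  # Record the outgoing links and static assets per visited page.
  sitemap[uri] = { 'links' => links, 'assets' => assets }
end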
#internal?(link) ⇒ Boolean
Evaluates if a link stays in the initial domain.
Used to keep the crawler inside the initial domain. To determine this, it compares the second-level and top-level domains. If the public suffix cannot be detected, possibly because the host is invalid, it returns true to make sure the link does not go unchecked.
# File 'lib/aquanaut/worker.rb', line 105

def internal?(link)
  return true unless PublicSuffix.valid?(link.host)
  link_domain = PublicSuffix.parse(link.host)
  @domain.sld == link_domain.sld and @domain.tld == link_domain.tld
end
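For illustration, assuming a worker seeded with http://example.com: a subdomain shares the second-level and top-level domain and counts as internal, while a different top-level domain does not.

worker.internal?(URI.parse('http://blog.example.com/post')) # => true  (sld "example" and tld "com" match)
worker.internal?(URI.parse('http://example.org/'))          # => false (tld "org" differs)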
#links(uri) ⇒ Array<URI>, Array<Hash>
Retrieves all links to pages and static assets from a given page. The decision whether a link points to an internal or external domain cannot be made by just examining the link’s URL. Due to possible HTTP 3xx responses the link needs to be resolved. Hence, each link is processed through an HTTP HEAD request to retrieve the final location.
# File 'lib/aquanaut/worker.rb', line 51

def links(uri)
  page = @agent.get(uri)
  grabbed = Hash.new(false)

  return [], [] unless page.is_a?(Mechanize::Page)

  assets = page.images.map do |image|
    uri = URI.join(page.uri, image.url)
    { 'uri' => uri, 'type' => 'image' }
  end

  page.parser.css('link[rel="stylesheet"]').each do |stylesheet|
    uri = URI.join(page.uri, stylesheet['href'])
    asset = { 'uri' => uri, 'type' => 'stylesheet' }
    assets << asset
  end

  links = page.links.map do |link|
    begin
      next if link.uri.nil?

      reference = URI.join(page.uri, link.uri)
      next if grabbed[reference]

      header = @agent.head(reference)
      location = header.uri

      next if not internal?(location) or not header.is_a?(Mechanize::Page)

      grabbed[reference] = true
      grabbed[location] = true

      location
    rescue Mechanize::Error, URI::InvalidURIError,
           Net::HTTP::Persistent::Error, Net::OpenTimeout, Net::ReadTimeout
      next
    end
  end.compact

  return links, assets
rescue Mechanize::Error, Net::OpenTimeout, Net::ReadTimeout,
       Net::HTTP::Persistent::Error
  return [], [] # swallow
end
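The resolution step relies on Mechanize following 3xx responses for HEAD requests, so the returned object reports the final location. A sketch with a hypothetical redirecting URL:

agent = Mechanize.new

# If /moved redirects elsewhere, header.uri holds the final
# destination, which is what internal? is checked against.
header = agent.head('http://example.com/moved')
header.uri # => URI of the final location after redirects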