Class: Crawler

Inherits:
Mechanize
  • Object
Defined in:
lib/iron-crawler/crawler.rb

Overview

Spiders websites using Mechanize.

Constructor Details

#initialize ⇒ Crawler

Returns a new instance of Crawler.



# File 'lib/iron-crawler/crawler.rb', line 6

def initialize
  @mech = Mechanize.new
  # nil removes the history size limit, so every fetched page is retained.
  @mech.max_history = nil
end

Instance Method Details

#reject(link) ⇒ Boolean

Whether a URL should be rejected rather than spidered.

Parameters:

  • link (Mechanize::Page::Link)

    A Mechanize page link.

Returns:

  • (Boolean)

    true if the URL should be rejected.



# File 'lib/iron-crawler/crawler.rb', line 60

def reject(link)
  # TODO: are we accounting for subdomains?
  # Reject when any one of the three checks flags the link.
  not_valid_uri?(link) || not_same_domain?(link) || already_spidered?(link)
end
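
The three predicates called by reject are not shown on this page. As a rough illustration only, the sketch below shows one way such checks could be combined using nothing but the Ruby standard library. The LinkFilter class, its reject? method, and the example.com URLs are hypothetical and are not part of iron-crawler. Note that this sketch compares hosts exactly, so a subdomain is treated as a different site, which is the open question raised in the TODO above.

```ruby
require 'uri'
require 'set'

# Hypothetical stand-in for the crawler's predicates; the real
# implementations are not shown in this documentation.
class LinkFilter
  def initialize(start_url)
    @host = URI.parse(start_url).host
    @seen = Set.new
  end

  # True when the URL should be skipped.
  def reject?(url)
    uri = parse(url)
    return true if uri.nil?               # not a valid HTTP(S) URI
    return true unless uri.host == @host  # different host, subdomains included
    !@seen.add?(uri.to_s)                 # add? returns nil when already present
  end

  private

  # Returns a URI::HTTP (or URI::HTTPS) object, or nil for anything else.
  def parse(url)
    uri = URI.parse(url)
    uri.is_a?(URI::HTTP) ? uri : nil
  rescue URI::InvalidURIError
    nil
  end
end

filter = LinkFilter.new('http://example.com/')
p filter.reject?('http://example.com/about')    # => false (first visit)
p filter.reject?('http://example.com/about')    # => true  (already seen)
p filter.reject?('http://www.example.com/')     # => true  (different host)
p filter.reject?('mailto:someone@example.com')  # => true  (not HTTP)
```

Keeping the seen-set inside the filter means a URL is marked as visited the first time it is accepted, so each URL is accepted at most once.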

#spiderize(url) ⇒ Mechanize::History

Kicks off the spidering of a site.

Parameters:

  • url (String)

    A simple URL string to crawl.

Returns:

  • (Mechanize::History)

    The agent's history: an Array-like list of every page fetched during the crawl.



# File 'lib/iron-crawler/crawler.rb', line 16

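def spiderize(url)
  page = @mech.get(url)

  # Seed the stack with the start page's anchors and script sources.
  # dup so the array returned by page.links is not mutated.
  stack = page.links.dup
  stack.push(*src_links(page))

  while (link = stack.pop)
    next if reject(link)
    puts "crawling #{link.uri}"
    begin
      page = link.click
      # Only HTML pages carry further links; skip images, files, etc.
      next unless page.is_a?(Mechanize::Page)
      stack.push(*src_links(page))
      stack.push(*page.links)
    rescue Mechanize::ResponseCodeError
      # Ignore links that return an HTTP error status and keep crawling.
    end
  end
  @mech.history
end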
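
The loop in spiderize is a depth-first traversal driven by an explicit stack: pop a link, skip it if it is rejected, otherwise fetch it and push whatever it links to. The sketch below shows the same shape on a small in-memory link graph so that it runs without Mechanize or network access. The crawl method and the site hash are hypothetical names made up for this illustration.

```ruby
require 'set'

# Depth-first crawl driven by an explicit stack, mirroring the shape of
# the spiderize loop. `site` maps each path to the paths it links to.
# Returns the paths in the order they were visited.
def crawl(site, start)
  visited = [start]
  seen = Set[start]
  stack = site.fetch(start).dup   # dup so the site map is not mutated

  while (path = stack.pop)
    next unless seen.add?(path)   # skip anything already visited
    visited << path
    stack.push(*site.fetch(path, []))
  end
  visited
end

# Hypothetical four-page site with a cycle between '/' and '/a'.
site = {
  '/'  => ['/a', '/b'],
  '/a' => ['/', '/c'],
  '/b' => ['/c'],
  '/c' => []
}

p crawl(site, '/')   # => ["/", "/b", "/c", "/a"]
```

Because the stack is last-in, first-out, the most recently discovered link is followed first, which is why '/b' is visited before '/a' here. Swapping pop for shift would turn the same loop into a breadth-first crawl.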

#src_links(page) ⇒ Array

Since Mechanize doesn’t treat src attributes as links, this returns a link for each script element on the page that has a src attribute.

Parameters:

  • page (Mechanize::Page)

    A Mechanize page object.

Returns:

  • (Array)

    An array of created Mechanize::Page::Link objects.



# File 'lib/iron-crawler/crawler.rb', line 42

def src_links(page)
  links = []
  page.search('script').each do |element|
    next if element.attributes['src'].nil?
    # Build a throwaway node whose href holds the script's src, so that
    # Mechanize::Page::Link can wrap it like an ordinary anchor.
    doc = Nokogiri::HTML::Document.new
    node = Nokogiri::XML::Node.new('foo', doc)
    node['href'] = element.attributes['src'].value
    link = Mechanize::Page::Link.new(node, @mech, page)
    links.push(link)
  end
  links
end