Class: Crawler
- Inherits: Mechanize
  - Object
    - Mechanize
      - Crawler
- Defined in: lib/iron-crawler/crawler.rb
Overview
Spiders websites using Mechanize.
Instance Method Summary
- #initialize ⇒ Crawler (constructor)
  A new instance of Crawler.
- #reject(link) ⇒ Boolean
  Whether the crawler should refuse to spider a link.
- #spiderize(url) ⇒ Hash
  Kicks off the spidering of a site.
- #src_links(page) ⇒ Array
  Mechanize does not treat src attributes as links, so this returns a link for every script element on the page that has a src attribute.
Constructor Details
#initialize ⇒ Crawler
Returns a new instance of Crawler.
    # File 'lib/iron-crawler/crawler.rb', line 6
    def initialize
      @mech = Mechanize.new
      # nil lifts the cap on history size, so every visited page is kept.
      @mech.max_history = nil
    end
Instance Method Details
#reject(link) ⇒ Boolean
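Whether the crawler should refuse to spider a link. A link is rejected when its URI is invalid, when it points to a different domain, or when it has already been spidered.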
    # File 'lib/iron-crawler/crawler.rb', line 60
    def reject(link)
      # TODO: are we accounting for subdomains?
      if not_valid_uri?(link) || not_same_domain?(link) || already_spidered?(link)
        true
      else
        false
      end
    end
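The three checks that reject combines can be sketched with the standard library alone. The helpers not_valid_uri?, not_same_domain? and already_spidered? are not shown in this documentation, so everything below is an assumption for illustration: the LinkFilter class, its reject? method and the parse_http helper are hypothetical names, and the sketch works on plain URL strings rather than Mechanize links.

```ruby
require 'set'
require 'uri'

# Hypothetical stand-in for the checks behind reject. The real helper
# implementations are not part of this documentation.
class LinkFilter
  def initialize(root_url)
    @root_host = URI.parse(root_url).host
    @seen = Set.new
  end

  # Returns true when the URL should be skipped.
  def reject?(href)
    uri = parse_http(href)
    return true if uri.nil?                # not an absolute HTTP(S) URI
    return true if uri.host != @root_host  # different host, subdomains included

    !@seen.add?(uri.to_s)                  # add? returns nil if already present
  end

  private

  # Parses href and returns it only if it is an http or https URI.
  def parse_http(href)
    uri = URI.parse(href)
    uri.is_a?(URI::HTTP) ? uri : nil
  rescue URI::InvalidURIError
    nil
  end
end

filter = LinkFilter.new('http://example.com/')
filter.reject?('http://example.com/about') # false on the first visit
filter.reject?('http://example.com/about') # true on the second
```

Comparing hosts exactly, as this sketch does, treats sub.example.com as a different site from example.com, which is one possible answer to the subdomain question raised in the TODO.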
#spiderize(url) ⇒ Hash
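Kicks off the spidering of a site. Starting from url, it follows every link that reject does not filter out and returns the Mechanize history once no links remain.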
    # File 'lib/iron-crawler/crawler.rb', line 16
    def spiderize(url)
      page = @mech.get(url)
      stack = page.links
      stack.push(*src_links(page))

      while (link = stack.pop)
        next if reject(link)

        puts "crawling #{link.uri}"
        begin
          page = link.click
          # Only parsed HTML pages carry further links; skip other response types.
          next unless page.is_a?(Mechanize::Page)

          stack.push(*src_links(page))
          stack.push(*page.links)
        rescue Mechanize::ResponseCodeError
          # Skip links that return an HTTP error status.
        end
      end

      return @mech.history
    end
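The traversal in spiderize is a stack-driven walk: seed the stack with the start page's links, pop one, skip it if it is rejected, otherwise visit it and push whatever it links to. A minimal sketch of that loop over an in-memory graph, with no network access, might look like the following. The SITE hash and the crawl method are hypothetical and exist only for this illustration; a Set of visited paths stands in for the "already spidered" check.

```ruby
require 'set'

# Hypothetical in-memory site: each page maps to the pages it links to.
SITE = {
  '/'  => ['/a', '/b'],
  '/a' => ['/', '/c'],
  '/b' => ['/c'],
  '/c' => []
}.freeze

# Visits every page reachable from start and returns the visit order.
def crawl(site, start)
  visited = Set[start]
  order = [start]
  stack = site.fetch(start, []).dup

  while (link = stack.pop)
    next unless visited.add?(link) # nil when the page was already visited

    order << link
    stack.push(*site.fetch(link, []))
  end

  order
end

crawl(SITE, '/') # => ["/", "/b", "/c", "/a"]
```

Because links are popped from the end of the stack, the most recently discovered link is followed first, which gives a depth-first order.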
#src_links(page) ⇒ Array
Mechanize does not treat src attributes as links, so this method returns a link for every script element on the page that has a src attribute.
    # File 'lib/iron-crawler/crawler.rb', line 42
    def src_links(page)
      links = []
      page.search('script').each do |element|
        next if element.attributes['src'].nil?

        # Wrap the src value in a synthetic node so Mechanize can treat it as a link.
        doc = Nokogiri::HTML::Document.new
        node = Nokogiri::XML::Node.new('foo', doc)
        node['href'] = element.attributes['src'].value
        link = Mechanize::Page::Link.new(node, @mech, page)
        links.push(link)
      end
      return links
    end