Class: Magellan::Cartographer

Inherits:
Object
Includes:
Observable
Defined in:
lib/magellan/cartographer.rb

Overview

An instance of the Cartographer class maps a set of domains, starting from a given URL. Every time a new response is received, the cartographer notifies any observers listening to it. To subscribe to the updates:

  cartographer = Cartographer.new(settings)
  cartographer.add_observer(some_observer_instance)

Your observer instance should implement an update(time, result) method that takes the current time and a Magellan::Result from the crawl.
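A minimal sketch of the observer contract described above, using Ruby's standard Observable module. FakeCartographer and FakeResult are hypothetical stand-ins so the example runs without Magellan installed; only the update(time, result) signature and the url reader on the result come from this page.

```ruby
require 'observer'

# Hypothetical stand-in for Magellan::Result; only #url is assumed here.
FakeResult = Struct.new(:url)

# Hypothetical stand-in for the cartographer: it notifies observers the
# same way recursive_explore does (changed, then notify_observers).
class FakeCartographer
  include Observable

  def publish(result)
    changed
    notify_observers(Time.now, result)
  end
end

# An observer only needs an update(time, result) method.
class UrlCollector
  attr_reader :seen

  def initialize
    @seen = []
  end

  def update(time, result)
    @seen << [time, result.url]
  end
end

collector = UrlCollector.new
cartographer = FakeCartographer.new
cartographer.add_observer(collector)
cartographer.publish(FakeResult.new('http://example.com/'))
```

With a real Cartographer, the same collector would be registered through add_observer before calling crawl.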


Constructor Details

#initialize(settings) ⇒ Cartographer

Create a new Cartographer with a hash of settings:

:origin_url
  • where to start exploring

:ignored_urls
  • an array of absolute URLs that should not be explored

:domains
  • domains we should crawl

:depth_to_explore
  • how deep to explore

:links_to_explore
  • the kind of resources we will follow, e.g. //a

:trace
  • enable a step by step trace
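As an illustration, a settings hash might look like the sketch below. All values are hypothetical. The key :links_to_explore is used because that is the key the constructor source reads, and the string '//a' simply mirrors the example given in the list; the exact shape that value should take is not shown on this page. The domains are written as full URLs because the constructor passes each one through URI.parse, and a bare host name parses without a host.

```ruby
require 'uri'

# Hypothetical values; only the keys come from the constructor source.
settings = {
  origin_url: 'http://example.com/',
  ignored_urls: ['http://example.com/logout'],
  domains: ['http://example.com'],
  depth_to_explore: 3,
  links_to_explore: '//a', # assumed shape, mirrors the doc's example
  trace: false
}

# The constructor maps each domain through URI.parse.
parsed_domains = settings[:domains].map { |domain| URI.parse(domain) }

# A bare host name is parsed as a path, so its host is nil.
bare_host = URI.parse('example.com').host
```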



# File 'lib/magellan/cartographer.rb', line 22

def initialize(settings)
  @origin_url = settings[:origin_url]
  @known_urls = settings[:ignored_urls]
  @domains = settings[:domains].map {|domain| URI.parse(domain)}
  @depth_to_explore = settings[:depth_to_explore]
  @links_we_want_to_explore = settings[:links_to_explore]
  @trace = settings[:trace]
end

Instance Method Details

#a_domain_we_care_about?(url) ⇒ Boolean

Is a given url in a domain that we care about?

Returns:

  • (Boolean)


# File 'lib/magellan/cartographer.rb', line 71

def a_domain_we_care_about?(url)
  # Compare the parsed host of the url against each configured domain.
  @domains.any? { |domain| URI.parse(url).host == domain.host }
rescue
  # Fall back to a prefix match when the url cannot be parsed.
  @domains.any? { |domain| url.gsub(/https?:\/\//, '').starts_with?(domain.host) }
end
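The two-step check can be tried in isolation with the sketch below. It is not the library's method: domain_we_care_about? is a hypothetical free function that takes the domain list as an argument, it rescues only URI::InvalidURIError rather than every StandardError, and it uses core Ruby's start_with? in place of starts_with?, which is presumably supplied by ActiveSupport or a project extension.

```ruby
require 'uri'

# Hypothetical standalone version of the host check.
def domain_we_care_about?(url, domains)
  host = URI.parse(url).host
  domains.any? { |domain| domain.host == host }
rescue URI::InvalidURIError
  # Unparseable url: strip the scheme and fall back to a prefix match.
  stripped = url.sub(%r{\Ahttps?://}, '')
  domains.any? { |domain| stripped.start_with?(domain.host) }
end

domains = ['http://example.com'].map { |d| URI.parse(d) }

inside   = domain_we_care_about?('http://example.com/about', domains)
outside  = domain_we_care_about?('http://other.org/', domains)
# The space makes URI.parse raise, so this goes through the fallback.
fallback = domain_we_care_about?('http://example.com/a b', domains)
```

Note that the fallback is a prefix match, so an unparseable URL whose host merely begins with a configured host would also pass.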

#crawl ⇒ Object

Start recursively exploring the site at the origin URL you specify.



# File 'lib/magellan/cartographer.rb', line 32

def crawl
  recursive_explore([@origin_url],1)
end

#i_am_not_too_deep?(depth) ⇒ Boolean

Should we keep exploring at this depth?

Returns:

  • (Boolean)


# File 'lib/magellan/cartographer.rb', line 66

def i_am_not_too_deep?(depth)
  depth <= @depth_to_explore
end

#i_have_seen_this_url_before?(url) ⇒ Boolean

Has the cartographer seen this url before?

Returns:

  • (Boolean)


# File 'lib/magellan/cartographer.rb', line 61

def i_have_seen_this_url_before?(url) 
  @known_urls.include?(url.remove_fragment)
end
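remove_fragment is not a core Ruby String method, so it is presumably a project extension that is not shown on this page. The sketch below assumes it drops everything from the first # onward, which is why two links to different anchors on one page count as the same URL. The free function here is hypothetical.

```ruby
# Hypothetical helper: assumes remove_fragment strips the "#..." suffix.
def remove_fragment(url)
  url.sub(/#.*\z/m, '')
end

known_urls = []
known_urls << remove_fragment('http://example.com/page#top')

# A different anchor on the same page is treated as already seen.
seen_again = known_urls.include?(remove_fragment('http://example.com/page#bottom'))
unchanged  = remove_fragment('http://example.com/page')
```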

#recursive_explore(urls, depth) ⇒ Object

Recursively explore a list of URLs until the given depth is reached or there are no unexplored URLs left.



# File 'lib/magellan/cartographer.rb', line 37

def recursive_explore(urls,depth)
  if i_am_not_too_deep?(depth)
    $stdout.puts "\nexploring:\n#{urls.join("\n")}" if @trace
    results = Explorer.new(urls,@links_we_want_to_explore).explore
    results.each do |result|
      changed
      notify_observers(Time.now, result)
      @known_urls << result.url.remove_fragment
      @known_urls << result.destination_url.remove_fragment
      remove_javascript_and_print_warning result
    end

    all_urls = results.map {|result| result.absolute_linked_resources }.flatten
    all_urls.uniq!
    #TODO: handle any other url parsing error
    all_urls.delete_if { |url| !a_domain_we_care_about?(url)}
    all_urls.delete_if { |url| i_have_seen_this_url_before?(url)}
    all_urls.chunk(40).each do |result_chunk|
      recursive_explore(result_chunk,depth+1)
    end
  end
end
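The control flow above can be sketched without any HTTP traffic. In the example below, LINKS is a hypothetical in-memory link graph standing in for Explorer, and explore is a hypothetical free function. The source calls chunk(40) with an integer argument, which core Ruby's Enumerable#chunk does not accept, so that is presumably a project extension; the sketch swaps in each_slice, which splits an array into fixed-size batches.

```ruby
# Hypothetical link graph: each key links to the urls in its value.
LINKS = {
  'a' => %w[b c],
  'b' => %w[c d],
  'c' => %w[a],
  'd' => %w[e],
  'e' => []
}.freeze

def explore(urls, depth, max_depth, known, visited, batch_size: 2)
  return if depth > max_depth # same test as i_am_not_too_deep?

  urls.each do |url|
    visited << [depth, url]
    known << url
  end

  # Collect outgoing links, de-duplicate, and drop urls already seen.
  next_urls = urls.flat_map { |url| LINKS.fetch(url, []) }.uniq
  next_urls.reject! { |url| known.include?(url) }

  # Recurse one level deeper, one batch at a time.
  next_urls.each_slice(batch_size) do |batch|
    explore(batch, depth + 1, max_depth, known, visited, batch_size: batch_size)
  end
end

known = []
visited = []
explore(['a'], 1, 3, known, visited)
```

With a maximum depth of 3, the walk visits a at depth 1, b and c at depth 2, and d at depth 3; e is never reached because it sits at depth 4.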

#remove_javascript_and_print_warning(result) ⇒ Object

Remove the javascript: links from the set of links on the page. Despite the method name, no warning is printed yet; the source leaves logging as a TODO.



# File 'lib/magellan/cartographer.rb', line 80

def remove_javascript_and_print_warning(result)
  #TODO: put this in the logger
  result.linked_resources.delete_if { |linked_resource| linked_resource.downcase.starts_with?("javascript:") }
end
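The filter can be tried on a plain array, as in the sketch below. The link values are made up, and core Ruby's start_with? replaces starts_with?. delete_if mutates the array in place, which is why the method above does not need to reassign result.linked_resources.

```ruby
# Hypothetical list of links as they might appear on a page.
links = [
  'http://example.com/',
  'JavaScript:void(0)',
  'javascript:open()',
  '/about'
]

# Case-insensitive match on the "javascript:" scheme, removed in place.
links.delete_if { |link| link.downcase.start_with?('javascript:') }
```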