Class: Crawler
- Inherits: Object
- Defined in: lib/analyzer_tools/crawl.rb
Overview
A fast web crawler that stays on the site it started from. Crawler randomly picks a URL from the retrieved page and follows it. If it can't find a URL for the next page, Crawler starts over from the start URL.
Crawler is multi-threaded and can run as many threads as you choose.
Instance Attribute Summary
- #times ⇒ Object (readonly)
  Array of response times in seconds.
Instance Method Summary
- #do_request(url) ⇒ Object
  Performs a request of url and returns the request body.
- #extract_url_from(body, original_url) ⇒ Object
  Returns a random URL on the same site as original_url from body, using original_url to resolve relative paths.
- #initialize(start_url, threads = 1) ⇒ Crawler (constructor)
  Creates a new Crawler that will start at start_url and run threads concurrent threads.
- #run ⇒ Object
  Begins crawling.
- #stop ⇒ Object
  Stops crawling.
- #time ⇒ Object
  Returns the amount of time taken to execute the given block.
- #timed_request(url) ⇒ Object
  Performs a request of url and records the time taken into times.
Constructor Details
#initialize(start_url, threads = 1) ⇒ Crawler
Creates a new Crawler that will start at start_url and run threads concurrent threads.

# File 'lib/analyzer_tools/crawl.rb', line 25

def initialize(start_url, threads = 1)
  raise ArgumentError, "Thread count must be more than 0" if threads < 1

  @start_url = start_url
  @thread_count = threads
  @threads = ThreadGroup.new
  @times = []
end
Instance Attribute Details
#times ⇒ Object (readonly)
Array of response times in seconds.

# File 'lib/analyzer_tools/crawl.rb', line 19

def times
  @times
end
Instance Method Details
#do_request(url) ⇒ Object
Performs a request of url
and returns the request body.
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 |
# File 'lib/analyzer_tools/crawl.rb', line 64 def do_request(url) req = [] req << "GET #{url.request_uri} HTTP/1.0" req << "Host: #{url.host}" req << "User-Agent: RubyCrawl" req << "" req << "" req = req.join "\r\n" puts req begin s = TCPSocket.new url.host, url.port s.write req s.flush response = s.read ensure s.close unless s.nil? end headers, body = response.split(/\r\n\r\n/) headers = headers.split(/\r\n/) status = headers.shift headers = Hash[*headers.map { |h| h.split ': ', 2 }.flatten] puts status case status when / 302 / then body = "href=\"#{headers['Location']}\"" when / 500 / then body = "href=\"#{@start_url}\"" end return body end |
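The response-parsing step of do_request can be exercised on its own. A minimal sketch using a hypothetical raw HTTP/1.0 response string (the URL and header values are made up for illustration):

```ruby
# Hypothetical raw response, as s.read would return it for a redirect.
response = "HTTP/1.0 302 Found\r\n" \
           "Location: http://example.com/next\r\n" \
           "Content-Type: text/html\r\n" \
           "\r\n" \
           "<html></html>"

# Same splitting logic as do_request: a blank line separates headers from
# the body, the first header line is the status line, and the remaining
# lines become a Hash of header name to value.
headers, body = response.split(/\r\n\r\n/)
headers = headers.split(/\r\n/)
status  = headers.shift
headers = Hash[*headers.map { |h| h.split ': ', 2 }.flatten]

# status              => "HTTP/1.0 302 Found"
# headers['Location'] => "http://example.com/next"
# body                => "<html></html>"
```

The 302 branch then rewrites the body into a fake `href="…"` fragment so the next call to #extract_url_from follows the redirect target like any other link.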
#extract_url_from(body, original_url) ⇒ Object
Returns a random URL on the same site as original_url from body, using original_url to resolve relative paths. If no valid URL is found, the start URL is returned.

# File 'lib/analyzer_tools/crawl.rb', line 126

def extract_url_from(body, original_url)
  urls = body.scan(/href="(.+?)"/)

  until urls.empty? do
    begin
      rand_url = urls.delete_at(rand(urls.length)).first
      new_url = original_url + rand_url
      return new_url if new_url.host == original_url.host
    rescue URI::InvalidURIError
      retry
    end
  end

  return @start_url
end
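extract_url_from leans on URI#+ to resolve each href against the current page, and on a host comparison to stay on-site. A sketch of that resolution with made-up URLs:

```ruby
require 'uri'

# Hypothetical page URL, purely for illustration.
original_url = URI.parse 'http://example.com/articles/index.html'

# URI#+ resolves an href the way extract_url_from does:
relative = original_url + 'page2.html'            # same directory
rooted   = original_url + '/top.html'             # site root
offsite  = original_url + 'http://other.example/' # absolute, other host

# The same-site check keeps only URLs whose host matches the page's host,
# so offsite would be skipped and another candidate drawn.
on_site = [relative, rooted, offsite].select { |u| u.host == original_url.host }
```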
#run ⇒ Object
Begins crawling.
# File 'lib/analyzer_tools/crawl.rb', line 36

def run
  url = @start_url

  @thread_count.times do
    Thread.start do
      @threads.add Thread.current
      loop do
        puts ">>> #{url}"
        body = timed_request url
        url = extract_url_from body, url
      end
    end
    Thread.pass
  end

  @threads.list.first.join until @threads.list.empty?
end
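The thread-management shape of #run (N workers registered in a ThreadGroup, joined until the group drains) can be reduced to a self-contained sketch. The squaring work is a hypothetical stand-in for the crawl loop, and registration is moved to the spawner here so the drain loop cannot start against a still-empty group:

```ruby
group   = ThreadGroup.new
results = Queue.new
go      = Queue.new

3.times do |i|
  t = Thread.start do
    go.pop              # hold until every worker is registered
    results << i * i    # stand-in for timed_request / extract_url_from
  end
  group.add t
end

3.times { go << true }  # release the workers

# ThreadGroup#list only contains live threads, so this loop ends once every
# worker has finished. The assignment guards against a worker dying between
# the emptiness check and the join.
while (t = group.list.first)
  t.join
end
```

This is the same drain idiom as `@threads.list.first.join until @threads.list.empty?` in #run, which works because terminated threads drop out of the group's list.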
#stop ⇒ Object
Stops crawling.
# File 'lib/analyzer_tools/crawl.rb', line 57

def stop
  @threads.list.first.kill until @threads.list.empty?
end
#time ⇒ Object
Returns the amount of time taken to execute the given block.
# File 'lib/analyzer_tools/crawl.rb', line 104

def time
  start_time = Time.now.to_f
  yield
  end_time = Time.now.to_f

  return end_time - start_time
end
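#time measures with wall-clock Time.now, which can jump if the system clock is adjusted mid-request. A hedged alternative sketch (not the library's code) keeps the same yield-and-subtract shape but reads the monotonic clock instead:

```ruby
# Same pattern as #time, but CLOCK_MONOTONIC cannot move backwards,
# so the difference is always a real elapsed duration. A sketch, not
# the library's implementation.
def monotonic_time
  start = Process.clock_gettime Process::CLOCK_MONOTONIC
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

elapsed = monotonic_time { sleep 0.05 }
# elapsed is a Float of roughly 0.05 or more
```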
#timed_request(url) ⇒ Object
Performs a request of url and records the time taken into times. Returns the body of the request.

# File 'lib/analyzer_tools/crawl.rb', line 115

def timed_request(url)
  body = nil
  @times << time { body = do_request(url) }
  return body
end