Class: Wgit::Crawler

Inherits:
Object
Includes:
Assertable
Defined in:
lib/wgit/crawler.rb

Overview

The Crawler class provides a means of crawling web-based HTTP Wgit::Urls, serialising their HTML into Wgit::Document instances. This is the only Wgit class which contains network logic, e.g. request/response handling.
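
For illustration, a minimal usage sketch (the URL below is hypothetical):

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com') # Hypothetical URL.

crawler.crawl_url(url) do |doc|
  puts doc.url unless doc.empty? # Each crawled page is yielded as a Wgit::Document.
end

puts crawler.last_response.status # The Wgit::Response of the most recent crawl.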

Constant Summary

SUPPORTED_FILE_EXTENSIONS =

The URL file extensions (from <a> hrefs) which will be crawled by #crawl_site. The idea is to omit anything that isn't HTML and therefore won't keep the crawl of the site going. All URLs without a file extension will be crawled, because they're assumed to be HTML.

Set.new(
  %w[asp aspx cfm cgi htm html htmlx jsp php]
)
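
For example (illustrative only), typical HTML extensions pass the filter while binary assets do not:

require 'wgit'

Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS.include?('html') # => true
Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS.include?('png')  # => false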

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::WRONG_METHOD_MSG

Instance Attribute Summary

Instance Method Summary

Methods included from Assertable

#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(redirect_limit: 5, time_out: 5, encode: true) ⇒ Crawler

Initializes and returns a Wgit::Crawler instance.

Parameters:

  • redirect_limit (Integer) (defaults to: 5)

    The number of redirects allowed before an error is raised. Set to 0 to disable redirects completely.

  • time_out (Integer, Float) (defaults to: 5)

    The maximum amount of time (in seconds) a crawl request has to complete before raising an error. Set to 0 to disable time outs completely.

  • encode (Boolean) (defaults to: true)

    Whether or not to UTF-8 encode the response body once crawled. Set to false if crawling more than just HTML e.g. images.



# File 'lib/wgit/crawler.rb', line 51

def initialize(redirect_limit: 5, time_out: 5, encode: true)
  @redirect_limit = redirect_limit
  @time_out       = time_out
  @encode         = encode
end
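
A quick sketch of tuning these options; the values shown are illustrative only:

require 'wgit'

crawler = Wgit::Crawler.new(redirect_limit: 1, time_out: 10, encode: false)
crawler.redirect_limit # => 1
crawler.time_out       # => 10
crawler.encode         # => false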

Instance Attribute Details

#encode ⇒ Object

Whether or not to UTF-8 encode the response body once crawled. Set to false if crawling more than just HTML e.g. images.



# File 'lib/wgit/crawler.rb', line 37

def encode
  @encode
end

#last_response ⇒ Object (readonly)

The Wgit::Response of the most recently crawled URL.
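
A small sketch, assuming a crawl has already taken place (the URL is hypothetical):

require 'wgit'

crawler = Wgit::Crawler.new
crawler.crawl_url(Wgit::Url.new('https://example.com')) # Hypothetical URL.

crawler.last_response.status   # => e.g. 200
crawler.last_response.failure? # => false on a successful crawl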



# File 'lib/wgit/crawler.rb', line 40

def last_response
  @last_response
end

#redirect_limit ⇒ Object

The number of redirects allowed before an error is raised. Set to 0 to disable redirects completely; alternatively, pass follow_redirects: false to any Wgit::Crawler#crawl_* method.



# File 'lib/wgit/crawler.rb', line 29

def redirect_limit
  @redirect_limit
end

#time_out ⇒ Object

The maximum amount of time (in seconds) a crawl request has to complete before raising an error. Set to 0 to disable time outs completely.



# File 'lib/wgit/crawler.rb', line 33

def time_out
  @time_out
end

Instance Method Details

#crawl_site(url, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r

Crawls an entire website's HTML pages by recursively going through its internal <a> links. Each crawled Document is yielded to a block. Use doc.empty? to determine if the crawled link is valid.

Use the allow and disallow paths params to partially and selectively crawl a site; the glob syntax is fully supported, e.g. 'wiki/*'. Note that each path must NOT start with a slash; the only exception is a / on its own with no other characters, which refers to the index page.

Only redirects to the same host are followed. For example, the Url 'http://www.example.co.uk/how' has a host of 'www.example.co.uk', meaning a link which redirects to 'https://ftp.example.co.uk' or 'https://www.example.com' will not be followed. The only exception to this is the initially crawled url, which is allowed to redirect anywhere; its host is then used for other link redirections on the site, as described above.
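
A minimal sketch of a partial site crawl; the site URL and 'wiki/*' glob are hypothetical:

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com') # Hypothetical site URL.

external_urls = crawler.crawl_site(url, allow_paths: 'wiki/*') do |doc|
  puts doc.url unless doc.empty? # Yielded once for every crawled page of the site.
end

external_urls # => Unique Array of external Wgit::Urls, or nil.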

Parameters:

  • url (Wgit::Url)

    The base URL of the website to be crawled. It is recommended that this URL be the index page of the site to give a greater chance of finding all pages within that site/host.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters links by selecting them if their path File.fnmatch? one of allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters links by rejecting them if their path File.fnmatch? one of disallow_paths.

Yields:

  • (doc)

    Given each crawled page (Wgit::Document) of the site. A block is the only way to interact with each crawled Document. Use doc.empty? to determine if the page is valid.

Returns:

  • (Array<Wgit::Url>, nil)

    Unique Array of external urls collected from all of the site's pages or nil if the given url could not be crawled successfully.



# File 'lib/wgit/crawler.rb', line 87

def crawl_site(url, allow_paths: nil, disallow_paths: nil, &block)
  doc = crawl_url(url, &block)
  return nil if doc.nil?

  path_opts = { allow_paths: allow_paths, disallow_paths: disallow_paths }
  alt_url   = url.end_with?('/') ? url.chop : url + '/'

  crawled   = Set.new([url, alt_url])
  externals = Set.new(doc.external_links)
  internals = Set.new(get_internal_links(doc, **path_opts))

  return externals.to_a if internals.empty?

  loop do
    links = internals - crawled
    break if links.empty?

    links.each do |link|
      orig_link = link.dup
      doc = crawl_url(link, follow_redirects: :host, &block)

      crawled += [orig_link, link] # Push both links in case of redirects.
      next if doc.nil?

      internals += get_internal_links(doc, **path_opts)
      externals += doc.external_links
    end
  end

  externals.to_a
end

#crawl_url(url, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document? Also known as: crawl_page

Crawls the url, returning the response as a Wgit::Document or nil if an error occurs.
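
A short sketch, assuming a hypothetical URL; passing :host restricts any redirect to the same host:

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com/about') # Hypothetical URL.

doc = crawler.crawl_url(url, follow_redirects: :host) do |d|
  puts d.url # Yielded even if the crawl failed, so Document#url etc. can be used.
end

puts 'Crawl failed' if doc.nil?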

Parameters:

  • url (Wgit::Url)

    The Url to crawl, which will likely be modified.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :base, :host, :domain or :brand. See Wgit::Url#relative? opts param.

Yields:

  • (doc)

    The crawled HTML page (Wgit::Document), regardless of whether the crawl was successful or not. Therefore, Document#url etc. can be used.

Returns:

  • (Wgit::Document, nil)

    The crawled HTML Document or nil if the crawl was unsuccessful.



# File 'lib/wgit/crawler.rb', line 155

def crawl_url(url, follow_redirects: true)
  # A String url isn't allowed because it's passed by value not reference,
  # meaning a redirect isn't reflected; A Wgit::Url is passed by reference.
  assert_type(url, Wgit::Url)

  html = fetch(url, follow_redirects: follow_redirects)
  doc  = Wgit::Document.new(url, html, encode: @encode)

  yield(doc) if block_given?

  doc.empty? ? nil : doc
end

#crawl_urls(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl, crawl_pages

Crawls one or more individual urls using Wgit::Crawler#crawl_url underneath. See Wgit::Crawler#crawl_site for crawling entire sites.
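
A brief sketch crawling two hypothetical URLs:

require 'wgit'

crawler = Wgit::Crawler.new
urls    = [
  Wgit::Url.new('https://example.com'),        # Hypothetical URLs.
  Wgit::Url.new('https://example.com/contact')
]

crawler.crawl_urls(*urls) do |doc|
  puts "#{doc.url} - empty: #{doc.empty?}"
end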

Parameters:

  • urls (*Wgit::Url)

    The Urls to crawl.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :base, :host, :domain or :brand. See Wgit::Url#relative? opts param. This value will be used for all urls crawled.

Yields:

  • (doc)

    Given each crawled page (Wgit::Document); this is the only way to interact with them.

Returns:

  • (Wgit::Document)

    The last Wgit::Document crawled.

Raises:

  • (StandardError)

    If no urls are provided.



# File 'lib/wgit/crawler.rb', line 132

def crawl_urls(*urls, follow_redirects: true, &block)
  raise 'You must provide at least one Url' if urls.empty?

  opts = { follow_redirects: follow_redirects }
  doc = nil

  Wgit::Utils.each(urls) { |url| doc = crawl_url(url, **opts, &block) }

  doc
end

#fetch(url, follow_redirects: true) ⇒ String? (protected)

Returns the url's HTML String or nil. Handles any errors that arise and sets @last_response. Errors, or any HTTP response that doesn't return an HTML body, will be ignored, returning nil.

Parameters:

  • url (Wgit::Url)

    The URL to fetch. This Url object is passed by reference and gets modified as a result of the fetch/crawl.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :base, :host, :domain or :brand. See Wgit::Url#relative? opts param.

Returns:

  • (String, nil)

    The crawled HTML or nil if the crawl was unsuccessful.

Raises:

  • (StandardError)

    If url isn't valid and absolute.



# File 'lib/wgit/crawler.rb', line 183

def fetch(url, follow_redirects: true)
  response = Wgit::Response.new
  raise "Invalid url: #{url}" if url.invalid?

  resolve(url, response, follow_redirects: follow_redirects)
  response.body_or_nil
rescue StandardError => e
  Wgit.logger.debug("Wgit::Crawler#fetch('#{url}') exception: #{e}")

  nil
ensure
  url.crawled        = true # Sets date_crawled underneath.
  url.crawl_duration = response.total_time

  @last_response = response
end

#get_internal_links(doc, allow_paths: nil, disallow_paths: nil) ⇒ Array<Wgit::Url> (protected)

Returns a doc's internal HTML page links in absolute form; used when crawling a site. Use the allow and disallow paths params to partially and selectively crawl a site; the glob syntax is supported, e.g. 'wiki/*'. Note that each path should NOT start with a slash.

Override this method in a subclass to change how a site is crawled, not what is extracted from each page (Document extensions should be used for this purpose instead). Just remember that only HTML files containing <a> links keep the crawl going beyond the base URL.
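
A minimal sketch of such a subclass; the 'drafts' filter below is purely illustrative and relies only on File.fnmatch?:

require 'wgit'

class MyCrawler < Wgit::Crawler
  protected

  # Keep the default link extraction, then drop any link under a
  # hypothetical 'drafts' path so those pages never get crawled.
  def get_internal_links(doc, allow_paths: nil, disallow_paths: nil)
    super.reject { |link| File.fnmatch?('*/drafts/*', link.to_s) }
  end
end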

Parameters:

  • doc (Wgit::Document)

    The document from which to extract its internal (absolute) page links.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters links by selecting them if their path File.fnmatch? one of allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters links by rejecting them if their path File.fnmatch? one of disallow_paths.

Returns:

  • (Array<Wgit::Url>)

    The internal page links from doc.



# File 'lib/wgit/crawler.rb', line 309

def get_internal_links(doc, allow_paths: nil, disallow_paths: nil)
  links = doc
          .internal_absolute_links
          .map(&:omit_fragment) # Because fragments don't alter content.
          .uniq
          .select do |link|
    ext = link.to_extension
    ext ? SUPPORTED_FILE_EXTENSIONS.include?(ext.downcase) : true
  end

  return links if allow_paths.nil? && disallow_paths.nil?

  process_paths(links, allow_paths, disallow_paths)
end

#get_response(url, response) ⇒ Wgit::Response (protected)

Makes an HTTP request and enriches the given Wgit::Response from it.

Parameters:

  • url (String)

    The url to GET. Will call url#normalize if possible.

  • response (Wgit::Response)

    The response to enrich. Modifies by reference.

Returns:

  • (Wgit::Response)

    The enriched Wgit::Response object.

Raises:

  • (StandardError)

    If a response can't be obtained.



# File 'lib/wgit/crawler.rb', line 249

def get_response(url, response)
  # Perform a HTTP GET request.
  orig_url = url.to_s
  url      = url.normalize if url.respond_to?(:normalize)

  http_response = http_get(url)

  # Enrich the given Wgit::Response object.
  response.adapter_response = http_response
  response.url              = orig_url
  response.status           = http_response.code
  response.headers          = http_response.headers
  response.body             = http_response.body
  response.ip_address       = http_response.primary_ip
  response.add_total_time(http_response.total_time)

  # Log the request/response details.
  log_http(response)

  # Handle a failed response.
  raise "No response (within timeout: #{@time_out} second(s))" \
  if response.failure?
end

#http_get(url) ⇒ Typhoeus::Response (protected)

Performs an HTTP GET request and returns the response.

Parameters:

  • url (String)

    The url to GET.

Returns:

  • (Typhoeus::Response)

    The HTTP response object.



# File 'lib/wgit/crawler.rb', line 277

def http_get(url)
  opts = {
    followlocation: false,
    timeout: @time_out,
    accept_encoding: 'gzip',
    headers: {
      'User-Agent' => "wgit/#{Wgit::VERSION}",
      'Accept'     => 'text/html'
    }
  }

  # See https://rubydoc.info/gems/typhoeus for more info.
  Typhoeus.get(url, opts)
end

#resolve(url, response, follow_redirects: true) ⇒ Object (protected)

GETs the given url, resolving any redirects. The given response object will be enriched.

Parameters:

  • url (Wgit::Url)

    The URL to GET and resolve.

  • response (Wgit::Response)

    The response to enrich. Modifies by reference.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :base, :host, :domain or :brand. See Wgit::Url#relative? opts param.

Raises:

  • (StandardError)

    If a redirect isn't allowed etc.



# File 'lib/wgit/crawler.rb', line 211

def resolve(url, response, follow_redirects: true)
  orig_url_base = url.to_url.to_base # Recorded before any redirects.
  follow_redirects, within = redirect?(follow_redirects)

  loop do
    get_response(url, response)
    break unless response.redirect?

    # Handle response 'Location' header.
    location = Wgit::Url.new(response.headers.fetch(:location, ''))
    raise 'Encountered redirect without Location header' if location.empty?

    yield(url, response, location) if block_given?

    # Validate if the redirect is allowed.
    raise "Redirect not allowed: #{location}" unless follow_redirects

    if within && !location.relative?(within => orig_url_base)
      raise "Redirect (outside of #{within}) is not allowed: '#{location}'"
    end

    raise "Too many redirects, exceeded: #{@redirect_limit}" \
    if response.redirect_count >= @redirect_limit

    # Process the location to be crawled next.
    location = url.to_base.concat(location) if location.relative?
    response.redirections[url.to_s] = location.to_s
    url.replace(location) # Update the url on redirect.
  end
end