Class: Wgit::Crawler

Inherits:
Object
Includes:
Assertable
Defined in:
lib/wgit/crawler.rb

Overview

The Crawler class provides a means of crawling web-based HTTP Wgit::Urls, serialising their HTML into Wgit::Document instances. This is the only Wgit class which contains network logic, e.g. request/response handling.
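
For illustration, a minimal usage sketch (the URL below is hypothetical):

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com') # Hypothetical URL.

crawler.crawl_url(url) do |doc|
  puts doc.url unless doc.empty? # Each crawled page is yielded as a Wgit::Document.
end

puts crawler.last_response.status # The Wgit::Response of the most recent crawl.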

Constant Summary

SUPPORTED_FILE_EXTENSIONS =

The URL file extensions (from <a> hrefs) which will be crawled by #crawl_site. The idea is to omit anything that isn't HTML and therefore won't keep the crawl of the site going. All URLs without a file extension will be crawled, because they're assumed to be HTML.

Set.new(
  %w[asp aspx cfm cgi htm html htmlx jsp php]
)
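
For example (illustrative only), typical HTML extensions pass the filter while binary assets do not:

require 'wgit'

Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS.include?('html') # => true
Wgit::Crawler::SUPPORTED_FILE_EXTENSIONS.include?('png')  # => false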

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::WRONG_METHOD_MSG

Instance Attribute Summary

Instance Method Summary

Methods included from Assertable

#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(redirect_limit: 5, time_out: 5, encode: true) ⇒ Crawler

Initializes and returns a Wgit::Crawler instance.

Parameters:

  • redirect_limit (Integer) (defaults to: 5)

    The number of redirects allowed before an error is raised. Set to 0 to disable redirects completely.

  • time_out (Integer, Float) (defaults to: 5)

    The maximum amount of time (in seconds) a crawl request has to complete before raising an error. Set to 0 to disable time outs completely.

  • encode (Boolean) (defaults to: true)

    Whether or not to UTF-8 encode the response body once crawled. Set to false if crawling more than just HTML e.g. images.



# File 'lib/wgit/crawler.rb', line 51

def initialize(redirect_limit: 5, time_out: 5, encode: true)
  @redirect_limit = redirect_limit
  @time_out       = time_out
  @encode         = encode
end
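
A quick sketch of tuning these options; the values shown are illustrative only:

require 'wgit'

crawler = Wgit::Crawler.new(redirect_limit: 1, time_out: 10, encode: false)
crawler.redirect_limit # => 1
crawler.time_out       # => 10
crawler.encode         # => false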

Instance Attribute Details

#encode ⇒ Object

Whether or not to UTF-8 encode the response body once crawled. Set to false if crawling more than just HTML e.g. images.



# File 'lib/wgit/crawler.rb', line 37

def encode
  @encode
end

#last_response ⇒ Object (readonly)

The Wgit::Response of the most recently crawled URL.
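
A small sketch, assuming a crawl has already taken place (the URL is hypothetical):

require 'wgit'

crawler = Wgit::Crawler.new
crawler.crawl_url(Wgit::Url.new('https://example.com')) # Hypothetical URL.

crawler.last_response.status   # => e.g. 200
crawler.last_response.failure? # => false on a successful crawl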



# File 'lib/wgit/crawler.rb', line 40

def last_response
  @last_response
end

#redirect_limit ⇒ Object

The number of redirects allowed before an error is raised. Set to 0 to disable redirects completely; alternatively, pass follow_redirects: false to any Wgit::Crawler#crawl_* method.



# File 'lib/wgit/crawler.rb', line 29

def redirect_limit
  @redirect_limit
end

#time_out ⇒ Object

The maximum amount of time (in seconds) a crawl request has to complete before raising an error. Set to 0 to disable time outs completely.



# File 'lib/wgit/crawler.rb', line 33

def time_out
  @time_out
end

Instance Method Details

#crawl_site(url, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r

Crawls an entire website's HTML pages by recursively going through its internal <a> links. Each crawled Document is yielded to a block. Use doc.empty? to determine if the crawled link is valid.

Use the allow and disallow paths params to partially and selectively crawl a site; the glob syntax is fully supported, e.g. 'wiki/*'. Note that each path must NOT start with a slash; the only exception is a / on its own with no other characters, which refers to the index page.

Only redirects to the same host are followed. For example, the Url 'http://www.example.co.uk/how' has a host of 'www.example.co.uk', meaning a link which redirects to 'https://ftp.example.co.uk' or 'https://www.example.com' will not be followed. The only exception to this is the initially crawled url, which is allowed to redirect anywhere; its host is then used for other link redirections on the site, as described above.
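
A minimal sketch of a partial site crawl; the site URL and 'wiki/*' glob are hypothetical:

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com') # Hypothetical site URL.

external_urls = crawler.crawl_site(url, allow_paths: 'wiki/*') do |doc|
  puts doc.url unless doc.empty? # Yielded once for every crawled page of the site.
end

external_urls # => Unique Array of external Wgit::Urls, or nil.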

Parameters:

  • url (Wgit::Url)

    The base URL of the website to be crawled. It is recommended that this URL be the index page of the site to give a greater chance of finding all pages within that site/host.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters links by selecting them if their path File.fnmatch? one of allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters links by rejecting them if their path File.fnmatch? one of disallow_paths.

Yields:

  • (doc)

    Given each crawled page (Wgit::Document) of the site. A block is the only way to interact with each crawled Document. Use doc.empty? to determine if the page is valid.

Returns:

  • (Array<Wgit::Url>, nil)

    Unique Array of external urls collected from all of the site's pages or nil if the given url could not be crawled successfully.



# File 'lib/wgit/crawler.rb', line 87

def crawl_site(url, allow_paths: nil, disallow_paths: nil, &block)
  doc = crawl_url(url, &block)
  return nil if doc.nil?

  path_opts = { allow_paths: allow_paths, disallow_paths: disallow_paths }
  alt_url   = url.end_with?('/') ? url.chop : url + '/'

  crawled   = Set.new([url, alt_url])
  externals = Set.new(doc.external_links)
  internals = Set.new(get_internal_links(doc, **path_opts))

  return externals.to_a if internals.empty?

  loop do
    links = internals - crawled
    break if links.empty?

    links.each do |link|
      orig_link = link.dup
      doc = crawl_url(link, follow_redirects: :host, &block)

      crawled += [orig_link, link] # Push both links in case of redirects.
      next if doc.nil?

      internals += get_internal_links(doc, **path_opts)
      externals += doc.external_links
    end
  end

  externals.to_a
end

#crawl_url(url, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document? Also known as: crawl_page

Crawls the url, returning the response as a Wgit::Document or nil if an error occurs.
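
A short sketch, assuming a hypothetical URL; passing :host restricts any redirect to the same host:

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com/about') # Hypothetical URL.

doc = crawler.crawl_url(url, follow_redirects: :host) do |d|
  puts d.url # Yielded even if the crawl failed, so Document#url etc. can be used.
end

puts 'Crawl failed' if doc.nil?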

Parameters:

  • url (Wgit::Url)

    The Url to crawl, which will likely be modified.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :base, :host, :domain or :brand. See Wgit::Url#relative? opts param.

Yields:

  • (doc)

    The crawled HTML page (Wgit::Document), regardless of whether the crawl was successful or not. Therefore, Document#url etc. can be used.

Returns:

  • (Wgit::Document, nil)

    The crawled HTML Document or nil if the crawl was unsuccessful.



# File 'lib/wgit/crawler.rb', line 155

def crawl_url(url, follow_redirects: true)
  # A String url isn't allowed because it's passed by value not reference,
  # meaning a redirect isn't reflected; A Wgit::Url is passed by reference.
  assert_type(url, Wgit::Url)

  html = fetch(url, follow_redirects: follow_redirects)
  doc  = Wgit::Document.new(url, html, encode: @encode)

  yield(doc) if block_given?

  doc.empty? ? nil : doc
end

#crawl_urls(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl, crawl_pages

Crawls one or more individual urls using Wgit::Crawler#crawl_url underneath. See Wgit::Crawler#crawl_site for crawling entire sites.
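
A brief sketch crawling two hypothetical URLs:

require 'wgit'

crawler = Wgit::Crawler.new
urls    = [
  Wgit::Url.new('https://example.com'),        # Hypothetical URLs.
  Wgit::Url.new('https://example.com/contact')
]

crawler.crawl_urls(*urls) do |doc|
  puts "#{doc.url} - empty: #{doc.empty?}"
end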

Parameters:

  • urls (*Wgit::Url)

    The Urls to crawl.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :base, :host, :domain or :brand. See Wgit::Url#relative? opts param. This value will be used for all urls crawled.

Yields:

  • (doc)

    Given each crawled page (Wgit::Document); this is the only way to interact with them.

Returns:

  • (Wgit::Document)

    The last Wgit::Document crawled.

Raises:

  • (StandardError)

    If no urls are provided.



# File 'lib/wgit/crawler.rb', line 132

def crawl_urls(*urls, follow_redirects: true, &block)
  raise 'You must provide at least one Url' if urls.empty?

  opts = { follow_redirects: follow_redirects }
  doc = nil

  Wgit::Utils.each(urls) { |url| doc = crawl_url(url, **opts, &block) }

  doc
end

#fetch(url, follow_redirects: true) ⇒ String? (protected)

Returns the url's HTML String or nil. Handles any errors that arise and sets @last_response. Errors, or any HTTP response that doesn't return an HTML body, will be ignored, returning nil.

Parameters:

  • url (Wgit::Url)

    The URL to fetch. This Url object is passed by reference and gets modified as a result of the fetch/crawl.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :base, :host, :domain or :brand. See Wgit::Url#relative? opts param.

Returns:

  • (String, nil)

    The crawled HTML or nil if the crawl was unsuccessful.

Raises:

  • (StandardError)

    If url isn't valid and absolute.



# File 'lib/wgit/crawler.rb', line 183

def fetch(url, follow_redirects: true)
  response = Wgit::Response.new
  raise "Invalid url: #{url}" if url.invalid?

  resolve(url, response, follow_redirects: follow_redirects)
  response.body_or_nil
rescue StandardError => e
  Wgit.logger.debug("Wgit::Crawler#fetch('#{url}') exception: #{e}")

  nil
ensure
  url.crawled        = true # Sets date_crawled underneath.
  url.crawl_duration = response.total_time

  @last_response = response
end

#get_internal_links(doc, allow_paths: nil, disallow_paths: nil) ⇒ Array<Wgit::Url> (protected)

Returns a doc's internal HTML page links in absolute form; used when crawling a site. Use the allow and disallow paths params to partially and selectively crawl a site; the glob syntax is supported, e.g. 'wiki/*'. Note that each path should NOT start with a slash.

Override this method in a subclass to change how a site is crawled, not what is extracted from each page (Document extensions should be used for this purpose instead). Just remember that only HTML files containing <a> links keep the crawl going beyond the base URL.
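
A minimal sketch of such a subclass; the 'drafts' filter below is purely illustrative and relies only on File.fnmatch?:

require 'wgit'

class MyCrawler < Wgit::Crawler
  protected

  # Keep the default link extraction, then drop any link under a
  # hypothetical 'drafts' path so those pages never get crawled.
  def get_internal_links(doc, allow_paths: nil, disallow_paths: nil)
    super.reject { |link| File.fnmatch?('*/drafts/*', link.to_s) }
  end
end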

Parameters:

  • doc (Wgit::Document)

    The document from which to extract its internal (absolute) page links.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters links by selecting them if their path File.fnmatch? one of allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters links by rejecting them if their path File.fnmatch? one of disallow_paths.

Returns:

  • (Array<Wgit::Url>)

    The internal page links from doc.



# File 'lib/wgit/crawler.rb', line 309

def get_internal_links(doc, allow_paths: nil, disallow_paths: nil)
  links = doc
          .internal_absolute_links
          .map(&:omit_fragment) # Because fragments don't alter content.
          .uniq
          .select do |link|
    ext = link.to_extension
    ext ? SUPPORTED_FILE_EXTENSIONS.include?(ext.downcase) : true
  end

  return links if allow_paths.nil? && disallow_paths.nil?

  process_paths(links, allow_paths, disallow_paths)
end

#get_response(url, response) ⇒ Wgit::Response (protected)

Makes an HTTP request and enriches the given Wgit::Response from it.

Parameters:

  • url (String)

    The url to GET. Will call url#normalize if possible.

  • response (Wgit::Response)

    The response to enrich. Modifies by reference.

Returns:

  • (Wgit::Response)

    The enriched Wgit::Response object.

Raises:

  • (StandardError)

    If a response can't be obtained.



# File 'lib/wgit/crawler.rb', line 249

def get_response(url, response)
  # Perform a HTTP GET request.
  orig_url = url.to_s
  url      = url.normalize if url.respond_to?(:normalize)

  http_response = http_get(url)

  # Enrich the given Wgit::Response object.
  response.adapter_response = http_response
  response.url              = orig_url
  response.status           = http_response.code
  response.headers          = http_response.headers
  response.body             = http_response.body
  response.ip_address       = http_response.primary_ip
  response.add_total_time(http_response.total_time)

  # Log the request/response details.
  log_http(response)

  # Handle a failed response.
  raise "No response (within timeout: #{@time_out} second(s))" \
  if response.failure?
end

#http_get(url) ⇒ Typhoeus::Response (protected)

Performs an HTTP GET request and returns the response.

Parameters:

  • url (String)

    The url to GET.

Returns:

  • (Typhoeus::Response)

    The HTTP response object.



# File 'lib/wgit/crawler.rb', line 277

def http_get(url)
  opts = {
    followlocation: false,
    timeout: @time_out,
    accept_encoding: 'gzip',
    headers: {
      'User-Agent' => "wgit/#{Wgit::VERSION}",
      'Accept'     => 'text/html'
    }
  }

  # See https://rubydoc.info/gems/typhoeus for more info.
  Typhoeus.get(url, opts)
end

#resolve(url, response, follow_redirects: true) ⇒ Object (protected)

GETs the given url, resolving any redirects. The given response object will be enriched.

Parameters:

  • url (Wgit::Url)

    The URL to GET and resolve.

  • response (Wgit::Response)

    The response to enrich. Modifies by reference.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :base, :host, :domain or :brand. See Wgit::Url#relative? opts param.

Raises:

  • (StandardError)

    If a redirect isn't allowed etc.



# File 'lib/wgit/crawler.rb', line 211

def resolve(url, response, follow_redirects: true)
  orig_url_base = url.to_url.to_base # Recorded before any redirects.
  follow_redirects, within = redirect?(follow_redirects)

  loop do
    get_response(url, response)
    break unless response.redirect?

    # Handle response 'Location' header.
    location = Wgit::Url.new(response.headers.fetch(:location, ''))
    raise 'Encountered redirect without Location header' if location.empty?

    yield(url, response, location) if block_given?

    # Validate if the redirect is allowed.
    raise "Redirect not allowed: #{location}" unless follow_redirects

    if within && !location.relative?(within => orig_url_base)
      raise "Redirect (outside of #{within}) is not allowed: '#{location}'"
    end

    raise "Too many redirects, exceeded: #{@redirect_limit}" \
    if response.redirect_count >= @redirect_limit

    # Process the location to be crawled next.
    location = url.to_base.concat(location) if location.relative?
    response.redirections[url.to_s] = location.to_s
    url.replace(location) # Update the url on redirect.
  end
end