Class: Wgit::Crawler

Inherits:
Object
Includes:
Assertable
Defined in:
lib/wgit/crawler.rb

Overview

The Crawler class provides a means of crawling web-based HTTP Wgit::Urls and serialising their HTML into Wgit::Document instances. This is the only Wgit class containing network logic (HTTP request/response handling).
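
For example, a minimal single-page crawl might look like the following sketch (the URL is a placeholder):

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com') # Placeholder URL.

doc = crawler.crawl_url(url) # Serialises the response HTML into a Wgit::Document.
puts doc.empty? ? 'Crawl failed' : "Crawled #{doc.url}"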

Constant Summary

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::NON_ENUMERABLE_MSG

Class Attribute Summary

Instance Attribute Summary

Instance Method Summary

Methods included from Assertable

#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(redirect_limit: 5, timeout: 5, encode: true, parse_javascript: false, parse_javascript_delay: 1) ⇒ Crawler

Initializes and returns a Wgit::Crawler instance.

Parameters:

  • redirect_limit (Integer) (defaults to: 5)

    The number of allowed redirects before an error is raised. Set to 0 to disable redirects completely.

  • timeout (Integer, Float) (defaults to: 5)

    The maximum amount of time (in seconds) a crawl request has to complete before raising an error. Set to 0 to disable timeouts completely.

  • encode (Boolean) (defaults to: true)

    Whether or not to UTF-8 encode the response body once crawled. Set to false if crawling more than just HTML e.g. images.

  • parse_javascript (Boolean) (defaults to: false)

    Whether or not to parse the Javascript of the crawled document. Parsing requires Chrome/Chromium to be installed and in $PATH.

  • parse_javascript_delay (Integer) (defaults to: 1)

    The delay time given to a page's JS to update the DOM. After the delay, the HTML is crawled.



# File 'lib/wgit/crawler.rb', line 74

def initialize(redirect_limit: 5, timeout: 5, encode: true,
               parse_javascript: false, parse_javascript_delay: 1)
  @redirect_limit         = redirect_limit
  @timeout                = timeout
  @encode                 = encode
  @parse_javascript       = parse_javascript
  @parse_javascript_delay = parse_javascript_delay
end
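
For example, a crawler tuned for slower, JavaScript-heavy pages might be initialised as follows (the values are illustrative only):

crawler = Wgit::Crawler.new(
  redirect_limit: 3,          # Raise after 3 redirects.
  timeout: 10,                # Allow up to 10 seconds per request.
  encode: true,               # UTF-8 encode the response body.
  parse_javascript: true,     # Requires Chrome/Chromium in $PATH.
  parse_javascript_delay: 2   # Give the page's JS 2 seconds to update the DOM.
)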

Class Attribute Details

.supported_file_extensions ⇒ Object (readonly)

The URL file extensions (from <a> hrefs) which will be crawled by #crawl_site. The idea is to omit anything that isn't HTML and therefore doesn't keep the crawl of the site going. All URLs without a file extension will be crawled, because they're assumed to be HTML. The #crawl method will crawl anything since it's given the URL(s). You can add your own site's URL file extension e.g. Wgit::Crawler.supported_file_extensions << 'html5' etc.



# File 'lib/wgit/crawler.rb', line 32

def supported_file_extensions
  @supported_file_extensions
end
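
For example, to have #crawl_site follow links to an additional file extension (as mentioned above):

Wgit::Crawler.supported_file_extensions << 'html5'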

Instance Attribute Details

#encode ⇒ Object

Whether or not to UTF-8 encode the response body once crawled. Set to false if crawling more than just HTML e.g. images.



# File 'lib/wgit/crawler.rb', line 46

def encode
  @encode
end
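
For example, when crawling more than just HTML (e.g. images), encoding can be disabled; a minimal sketch:

crawler = Wgit::Crawler.new(encode: false)
# Or toggle it on an existing instance:
crawler.encode = false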

#last_response ⇒ Object (readonly)

The Wgit::Response of the most recently crawled URL.



# File 'lib/wgit/crawler.rb', line 58

def last_response
  @last_response
end
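
For example, the most recent response can be inspected after a crawl (a sketch using a placeholder URL):

crawler = Wgit::Crawler.new
crawler.crawl_url(Wgit::Url.new('https://example.com'))

response = crawler.last_response
puts response.status     # HTTP status code of the final response.
puts response.total_time # Time taken (in seconds) for the crawl.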

#parse_javascript ⇒ Object

Whether or not to parse the Javascript of the crawled document. Parsing requires Chrome/Chromium to be installed and in $PATH.



# File 'lib/wgit/crawler.rb', line 50

def parse_javascript
  @parse_javascript
end

#parse_javascript_delay ⇒ Object

The delay (in seconds) between checks of a page's HTML size. When the page has stopped "growing", the Javascript is assumed to have finished dynamically updating the DOM. The value should balance a good UX against allowing enough JS parse time.



# File 'lib/wgit/crawler.rb', line 55

def parse_javascript_delay
  @parse_javascript_delay
end
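
For example, JS parsing can be enabled with a longer delay for heavier pages (values are illustrative):

crawler = Wgit::Crawler.new(parse_javascript: true)
crawler.parse_javascript_delay = 2 # Check the DOM's HTML size every 2 seconds.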

#redirect_limit ⇒ Object

The number of allowed redirects before an error is raised. Set to 0 to disable redirects completely; alternatively, pass follow_redirects: false to any Wgit::Crawler#crawl_* method.



# File 'lib/wgit/crawler.rb', line 38

def redirect_limit
  @redirect_limit
end
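
For example, either of the following prevents redirects from being followed, as described above:

crawler = Wgit::Crawler.new(redirect_limit: 0) # Disable redirects entirely.
# Or per crawl:
crawler.crawl_url(Wgit::Url.new('https://example.com'), follow_redirects: false)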

#timeout ⇒ Object

The maximum amount of time (in seconds) a crawl request has to complete before raising an error. Set to 0 to disable timeouts completely.



# File 'lib/wgit/crawler.rb', line 42

def timeout
  @timeout
end

Instance Method Details

#browser_get(url) ⇒ Ferrum::Browser (protected)

Performs an HTTP GET request in a web browser and parses the response JS before returning the HTML body of the fully rendered webpage. This allows Javascript (SPAs etc.) to generate HTML dynamically.

Parameters:

  • url (String)

    The url to browse to.

Returns:

  • (Ferrum::Browser)

    The browser response object.



# File 'lib/wgit/crawler.rb', line 372

def browser_get(url)
  @browser ||= Ferrum::Browser.new(timeout: @timeout, process_timeout: 10)
  @browser.goto(url)

  # Wait for the page's JS to finish dynamically manipulating the DOM.
  html = @browser.body
  loop do
    sleep @parse_javascript_delay
    break if html.size == @browser.body.size

    html = @browser.body
  end

  @browser
end

#crawl_site(url, follow: :default, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r

Crawls an entire website's HTML pages by recursively going through its internal <a> links; this can be overridden with follow: xpath. Each crawled Document is yielded to a block. Use doc.empty? to determine if the crawled link was successful / is valid.

Use the allow and disallow paths params to partially and selectively crawl a site; the glob syntax is fully supported e.g. 'wiki/*' etc.

Only redirects to the same host are followed. For example, the Url 'http://www.example.co.uk/how' has a host of 'www.example.co.uk', meaning a link which redirects to 'https://ftp.example.co.uk' or 'https://www.example.com' will not be followed. The only exception to this is the initially crawled url, which is allowed to redirect anywhere; its host is then used for other link redirections on the site, as described above.

Parameters:

  • url (Wgit::Url)

    The base URL of the website to be crawled. It is recommended that this URL be the index page of the site to give a greater chance of finding all pages within that site/host.

  • follow (String) (defaults to: :default)

    The xpath extracting links to be followed during the crawl. This changes how a site is crawled. Only links pointing to the site domain are allowed. The :default is any <a> href returning HTML.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the follow: links by selecting them if their path matches (via File.fnmatch?) one of allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the follow: links by rejecting them if their path matches (via File.fnmatch?) one of disallow_paths.

Yields:

  • (doc)

    Given each crawled page (Wgit::Document) of the site. A block is the only way to interact with each crawled Document. Use doc.empty? to determine if the page is valid.

Returns:

  • (Array<Wgit::Url>, nil)

    Unique Array of external urls collected from all of the site's pages or nil if the given url could not be crawled successfully.



# File 'lib/wgit/crawler.rb', line 116

def crawl_site(
  url, follow: :default, allow_paths: nil, disallow_paths: nil, &block
)
  doc = crawl_url(url, &block)
  return nil if doc.empty?

  total_pages = 1
  link_opts = { xpath: follow, allow_paths:, disallow_paths: }

  crawled   = Set.new(url.redirects_journey)
  externals = Set.new(doc.external_links)
  internals = Set.new(next_internal_links(doc, **link_opts))

  return externals.to_a if internals.empty?

  loop do
    links = subtract_links(internals, crawled)
    break if links.empty?

    links.each do |link|
      doc = crawl_url(link, follow_redirects: :host, &block)

      crawled += link.redirects_journey
      next if doc.empty?

      total_pages += 1
      internals   += next_internal_links(doc, **link_opts)
      externals   += doc.external_links
    end
  end

  Wgit.logger.debug("Crawled #{total_pages} documents for the site: #{url}")

  externals.to_a
end
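
For example, a selective site crawl limited to a wiki section might look like this sketch (the URL and path are placeholders):

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com/index')

external_urls = crawler.crawl_site(url, allow_paths: 'wiki/*') do |doc|
  next if doc.empty? # Skip pages that failed to crawl.

  puts doc.url
end

puts 'The initial URL could not be crawled' if external_urls.nil?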

#crawl_url(url, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl_page

Crawls the url, returning the response Wgit::Document or nil if an error occurs.

Parameters:

  • url (Wgit::Url)

    The Url to crawl; which will be modified in the event of a redirect.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :origin, :host, :domain or :brand. See Wgit::Url#relative? opts param.

Yields:

  • (doc)

    The crawled HTML page (Wgit::Document) regardless if the crawl was successful or not. Therefore, Document#url etc. can be used. Use doc.empty? to determine if the page is valid.

Returns:

  • (Wgit::Document)

    The crawled HTML Document. Check if the crawl was successful with doc.empty? (true if unsuccessful).



# File 'lib/wgit/crawler.rb', line 191

def crawl_url(url, follow_redirects: true)
  # A String url isn't allowed because it's passed by value not reference,
  # meaning a redirect isn't reflected; A Wgit::Url is passed by reference.
  assert_type(url, Wgit::Url)

  html = fetch(url, follow_redirects:)
  doc  = Wgit::Document.new(url, html, encode: @encode)

  yield(doc) if block_given?

  doc
end
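
For example, crawling a single page while only allowing same-host redirects (placeholder URL):

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com/about')

doc = crawler.crawl_url(url, follow_redirects: :host)

if doc.empty?
  puts "Failed to crawl #{doc.url}"
else
  puts doc.external_links.size
end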

#crawl_urls(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl, crawl_pages

Crawls one or more individual urls using Wgit::Crawler#crawl_url underneath. See Wgit::Crawler#crawl_site for crawling entire sites.

Parameters:

  • urls (*Wgit::Url)

    The Urls to crawl.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :origin, :host, :domain or :brand. See Wgit::Url#relative? opts param. This value will be used for all urls crawled.

Yields:

  • (doc)

    Given each crawled page (Wgit::Document); this is the only way to interact with them. Use doc.empty? to determine if the page is valid.

Returns:

  • (Wgit::Document)

    The last crawled Wgit::Document.

Raises:

  • (StandardError)

    If no urls are provided.



# File 'lib/wgit/crawler.rb', line 166

def crawl_urls(*urls, follow_redirects: true, &block)
  raise 'You must provide at least one Url' if urls.empty?

  opts = { follow_redirects: }
  doc = nil

  Wgit::Utils.each(urls) { |url| doc = crawl_url(url, **opts, &block) }

  doc
end
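
For example, crawling several individual pages in one call (placeholder URLs):

crawler = Wgit::Crawler.new
urls    = [
  Wgit::Url.new('https://example.com'),
  Wgit::Url.new('https://example.org')
]

crawler.crawl_urls(*urls) do |doc|
  puts "#{doc.url} => #{doc.empty? ? 'failed' : 'ok'}"
end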

#fetch(url, follow_redirects: true) ⇒ String? (protected)

Returns the URL's HTML String or nil. Handles any errors that arise and sets @last_response. Errors, or any HTTP response that doesn't return an HTML body, are ignored and nil is returned.

If @parse_javascript is true, then the final resolved URL will be browsed to and Javascript parsed allowing for dynamic HTML generation.

Parameters:

  • url (Wgit::Url)

    The URL to fetch. This Url object is passed by reference and gets modified as a result of the fetch/crawl.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :origin, :host, :domain or :brand. See Wgit::Url#relative? opts param.

Returns:

  • (String, nil)

    The crawled HTML or nil if the crawl was unsuccessful.

Raises:

  • (StandardError)

    If url isn't valid and absolute.



# File 'lib/wgit/crawler.rb', line 222

def fetch(url, follow_redirects: true)
  response = Wgit::Response.new
  raise "Invalid url: #{url}" if url.invalid?

  resolve(url, response, follow_redirects:)
  get_browser_response(url, response) if @parse_javascript

  response.body_or_nil
rescue StandardError => e
  Wgit.logger.debug("Wgit::Crawler#fetch('#{url}') exception: #{e}")

  nil
ensure
  url.crawled        = true # Sets date_crawled underneath.
  url.crawl_duration = response.total_time

  # Don't override previous url.redirects if response is fully resolved.
  url.redirects      = response.redirects unless response.redirects.empty?

  @last_response = response
end

#get_browser_response(url, response) {|browser| ... } ⇒ Wgit::Response (protected)

Makes a browser request and enriches the given Wgit::Response from it.

Parameters:

  • url (String)

    The url to browse to. Will call url#normalize if possible.

  • response (Wgit::Response)

    The response to enrich. Modifies by reference.

Yields:

  • (browser)

Returns:

  • (Wgit::Response)

    The enriched Wgit::Response object.

Raises:

  • (StandardError)

    If a response can't be obtained.



# File 'lib/wgit/crawler.rb', line 325

def get_browser_response(url, response)
  url     = url.normalize if url.respond_to?(:normalize)
  browser = nil

  crawl_time = Benchmark.measure { browser = browser_get(url) }.real
  yield browser if block_given?

  # Enrich the given Wgit::Response object (on top of Typhoeus response).
  response.adapter_response = browser.network.response
  response.status           = browser.network.response.status
  response.headers          = browser.network.response.headers
  response.body             = browser.body
  response.add_total_time(crawl_time)

  # Log the request/response details.
  log_net(:browser, response, crawl_time)

  # Handle a failed response.
  raise "No browser response (within timeout: #{@timeout} second(s))" \
  if response.failure?
end

#get_http_response(url, response) ⇒ Wgit::Response (protected)

Makes an HTTP request and enriches the given Wgit::Response from it.

Parameters:

  • url (String)

    The url to GET. Will call url#normalize if possible.

  • response (Wgit::Response)

    The response to enrich. Modifies by reference.

Returns:

  • (Wgit::Response)

    The enriched Wgit::Response object.

Raises:

  • (StandardError)

    If a response can't be obtained.



# File 'lib/wgit/crawler.rb', line 293

def get_http_response(url, response)
  # Perform a HTTP GET request.
  orig_url = url.to_s
  url      = url.normalize if url.respond_to?(:normalize)

  http_response = http_get(url)

  # Enrich the given Wgit::Response object.
  response.adapter_response = http_response
  response.url              = orig_url
  response.status           = http_response.code
  response.headers          = http_response.headers
  response.body             = http_response.body
  response.ip_address       = http_response.primary_ip
  response.add_total_time(http_response.total_time)

  # Log the request/response details.
  log_net(:http, response, http_response.total_time)

  # Handle a failed response.
  raise "No response (within timeout: #{@timeout} second(s))" \
  if response.failure?
end

#http_get(url) ⇒ Typhoeus::Response (protected)

Performs an HTTP GET request and returns the response.

Parameters:

  • url (String)

    The url to GET.

Returns:

  • (Typhoeus::Response)

    The HTTP response object.



# File 'lib/wgit/crawler.rb', line 351

def http_get(url)
  opts = {
    followlocation: false,
    timeout: @timeout,
    accept_encoding: 'gzip',
    headers: {
      'User-Agent' => "wgit/#{Wgit::VERSION}",
      'Accept'     => 'text/html'
    }
  }

  # See https://rubydoc.info/gems/typhoeus for more info.
  Typhoeus.get(url, **opts)
end

#next_internal_links(doc, xpath: :default, allow_paths: nil, disallow_paths: nil) ⇒ Array<Wgit::Url> (protected)

Returns a doc's internal HTML page links in absolute form; used when crawling a site. By default, any <a> href returning HTML is returned; override this with xpath: if desired.

Use the allow and disallow paths params to partially and selectively crawl a site; the glob syntax is supported e.g. 'wiki/*' etc. Note that each path should NOT start with a slash.

Parameters:

  • doc (Wgit::Document)

    The document whose internal page links are extracted.

  • xpath (Symbol, String) (defaults to: :default)

    The xpath extracting the links to be followed. The :default is any <a> href returning HTML.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the links by selecting them if their path matches (via File.fnmatch?) one of allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the links by rejecting them if their path matches (via File.fnmatch?) one of disallow_paths.

Returns:

  • (Array<Wgit::Url>)

    The internal page links from doc.



# File 'lib/wgit/crawler.rb', line 407

def next_internal_links(
  doc, xpath: :default, allow_paths: nil, disallow_paths: nil
)
  links = if xpath && xpath != :default
            follow_xpath(doc, xpath)
          else
            follow_default(doc)
          end

  return links if allow_paths.nil? && disallow_paths.nil?

  process_paths(links, allow_paths, disallow_paths)
end

#resolve(url, response, follow_redirects: true) ⇒ Object (protected)

GETs the given url, resolving any redirects. The given response object will be enriched.

Parameters:

  • url (Wgit::Url)

    The URL to GET and resolve.

  • response (Wgit::Response)

    The response to enrich. Modifies by reference.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :origin, :host, :domain or :brand. See Wgit::Url#relative? opts param.

Raises:

  • (StandardError)

    If a redirect isn't allowed etc.



# File 'lib/wgit/crawler.rb', line 255

def resolve(url, response, follow_redirects: true)
  origin = url.to_origin # Record the origin before any redirects.
  follow_redirects, within = redirect?(follow_redirects)

  loop do
    get_http_response(url, response)
    break unless response.redirect?

    # Handle response 'Location' header.
    location = Wgit::Url.new(response.headers.fetch(:location, ''))
    raise 'Encountered redirect without Location header' if location.empty?

    yield(url, response, location) if block_given?

    # Validate if the redirect is allowed.
    raise "Redirect not allowed: #{location}" unless follow_redirects

    if within && !location.relative?(within => origin)
      raise "Redirect (outside of #{within}) is not allowed: '#{location}'"
    end

    raise "Too many redirects, exceeded: #{@redirect_limit}" \
    if response.redirect_count >= @redirect_limit

    # Process the location to be crawled next.
    location = url.to_origin.join(location) if location.relative?
    response.redirections[url.to_s] = location.to_s
    url.replace(location) # Update the url on redirect.
  end
end
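
As a sketch of the redirect scoping above: passing follow_redirects: :host to a crawl results in a check equivalent to the following, reusing the example URLs from #crawl_site's documentation:

origin   = Wgit::Url.new('http://www.example.co.uk/how').to_origin
location = Wgit::Url.new('https://ftp.example.co.uk')

# The redirect is only followed if the location stays within the original
# host; here it doesn't, so #resolve would raise an error.
location.relative?(host: origin) # => false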