Class: Wgit::Crawler

Inherits:
Object
Includes:
Assertable
Defined in:
lib/wgit/crawler.rb

Overview

The Crawler class provides a means of crawling web-based HTTP Wgit::Urls and serialising their HTML into Wgit::Document instances. This is the only Wgit class containing network logic (HTTP request/response handling).
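
For example, a minimal single-page crawl might look like the following sketch (the URL is a placeholder):

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com') # Placeholder URL.

doc = crawler.crawl_url(url) # Serialises the response HTML into a Wgit::Document.
puts doc.empty? ? 'Crawl failed' : "Crawled #{doc.url}"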

Constant Summary

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::NON_ENUMERABLE_MSG

Class Attribute Summary

Instance Attribute Summary

Instance Method Summary

Methods included from Assertable

#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(redirect_limit: 5, timeout: 5, encode: true, parse_javascript: false, parse_javascript_delay: 1) ⇒ Crawler

Initializes and returns a Wgit::Crawler instance.

Parameters:

  • redirect_limit (Integer) (defaults to: 5)

    The number of allowed redirects before an error is raised. Set to 0 to disable redirects completely.

  • timeout (Integer, Float) (defaults to: 5)

    The maximum amount of time (in seconds) a crawl request has to complete before raising an error. Set to 0 to disable timeouts completely.

  • encode (Boolean) (defaults to: true)

    Whether or not to UTF-8 encode the response body once crawled. Set to false if crawling more than just HTML e.g. images.

  • parse_javascript (Boolean) (defaults to: false)

    Whether or not to parse the Javascript of the crawled document. Parsing requires Chrome/Chromium to be installed and in $PATH.

  • parse_javascript_delay (Integer) (defaults to: 1)

    The delay time given to a page's JS to update the DOM. After the delay, the HTML is crawled.



# File 'lib/wgit/crawler.rb', line 74

def initialize(redirect_limit: 5, timeout: 5, encode: true,
               parse_javascript: false, parse_javascript_delay: 1)
  @redirect_limit         = redirect_limit
  @timeout                = timeout
  @encode                 = encode
  @parse_javascript       = parse_javascript
  @parse_javascript_delay = parse_javascript_delay
end
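
For example, a crawler tuned for slower, JavaScript-heavy pages might be initialised as follows (the values are illustrative only):

crawler = Wgit::Crawler.new(
  redirect_limit: 3,          # Raise after 3 redirects.
  timeout: 10,                # Allow up to 10 seconds per request.
  encode: true,               # UTF-8 encode the response body.
  parse_javascript: true,     # Requires Chrome/Chromium in $PATH.
  parse_javascript_delay: 2   # Give the page's JS 2 seconds to update the DOM.
)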

Class Attribute Details

.supported_file_extensions ⇒ Object (readonly)

The URL file extensions (from <a> hrefs) which will be crawled by #crawl_site. The idea is to omit anything that isn't HTML and therefore doesn't keep the crawl of the site going. All URLs without a file extension will be crawled, because they're assumed to be HTML. The #crawl method will crawl anything since it's given the URL(s). You can add your own site's URL file extension e.g. Wgit::Crawler.supported_file_extensions << 'html5' etc.



# File 'lib/wgit/crawler.rb', line 32

def supported_file_extensions
  @supported_file_extensions
end
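
For example, to have #crawl_site follow links to an additional file extension (as mentioned above):

Wgit::Crawler.supported_file_extensions << 'html5'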

Instance Attribute Details

#encode ⇒ Object

Whether or not to UTF-8 encode the response body once crawled. Set to false if crawling more than just HTML e.g. images.



# File 'lib/wgit/crawler.rb', line 46

def encode
  @encode
end
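
For example, when crawling more than just HTML (e.g. images), encoding can be disabled; a minimal sketch:

crawler = Wgit::Crawler.new(encode: false)
# Or toggle it on an existing instance:
crawler.encode = false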

#last_response ⇒ Object (readonly)

The Wgit::Response of the most recently crawled URL.



# File 'lib/wgit/crawler.rb', line 58

def last_response
  @last_response
end
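
For example, the most recent response can be inspected after a crawl (a sketch using a placeholder URL):

crawler = Wgit::Crawler.new
crawler.crawl_url(Wgit::Url.new('https://example.com'))

response = crawler.last_response
puts response.status     # HTTP status code of the final response.
puts response.total_time # Time taken (in seconds) for the crawl.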

#parse_javascript ⇒ Object

Whether or not to parse the Javascript of the crawled document. Parsing requires Chrome/Chromium to be installed and in $PATH.



# File 'lib/wgit/crawler.rb', line 50

def parse_javascript
  @parse_javascript
end

#parse_javascript_delay ⇒ Object

The delay (in seconds) between checks of a page's HTML size. When the page has stopped "growing", the Javascript is assumed to have finished dynamically updating the DOM. The value should balance a good UX against allowing enough JS parse time.



# File 'lib/wgit/crawler.rb', line 55

def parse_javascript_delay
  @parse_javascript_delay
end
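
For example, JS parsing can be enabled with a longer delay for heavier pages (values are illustrative):

crawler = Wgit::Crawler.new(parse_javascript: true)
crawler.parse_javascript_delay = 2 # Check the DOM's HTML size every 2 seconds.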

#redirect_limit ⇒ Object

The number of allowed redirects before an error is raised. Set to 0 to disable redirects completely; alternatively, pass follow_redirects: false to any Wgit::Crawler#crawl_* method.



# File 'lib/wgit/crawler.rb', line 38

def redirect_limit
  @redirect_limit
end
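
For example, either of the following prevents redirects from being followed, as described above:

crawler = Wgit::Crawler.new(redirect_limit: 0) # Disable redirects entirely.
# Or per crawl:
crawler.crawl_url(Wgit::Url.new('https://example.com'), follow_redirects: false)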

#timeout ⇒ Object

The maximum amount of time (in seconds) a crawl request has to complete before raising an error. Set to 0 to disable timeouts completely.



# File 'lib/wgit/crawler.rb', line 42

def timeout
  @timeout
end

Instance Method Details

#browser_get(url) ⇒ Ferrum::Browser (protected)

Performs an HTTP GET request in a web browser and parses the response JS before returning the HTML body of the fully rendered webpage. This allows Javascript (SPAs etc.) to generate HTML dynamically.

Parameters:

  • url (String)

    The url to browse to.

Returns:

  • (Ferrum::Browser)

    The browser response object.



# File 'lib/wgit/crawler.rb', line 372

def browser_get(url)
  @browser ||= Ferrum::Browser.new(timeout: @timeout, process_timeout: 10)
  @browser.goto(url)

  # Wait for the page's JS to finish dynamically manipulating the DOM.
  html = @browser.body
  loop do
    sleep @parse_javascript_delay
    break if html.size == @browser.body.size

    html = @browser.body
  end

  @browser
end

#crawl_site(url, follow: :default, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r

Crawls an entire website's HTML pages by recursively going through its internal <a> links; this can be overridden with follow: xpath. Each crawled Document is yielded to a block. Use doc.empty? to determine if the crawled link was successful / is valid.

Use the allow and disallow paths params to partially and selectively crawl a site; the glob syntax is fully supported e.g. 'wiki/*' etc.

Only redirects to the same host are followed. For example, the Url 'http://www.example.co.uk/how' has a host of 'www.example.co.uk', meaning a link which redirects to 'https://ftp.example.co.uk' or 'https://www.example.com' will not be followed. The only exception to this is the initially crawled url, which is allowed to redirect anywhere; its host is then used for other link redirections on the site, as described above.

Parameters:

  • url (Wgit::Url)

    The base URL of the website to be crawled. It is recommended that this URL be the index page of the site to give a greater chance of finding all pages within that site/host.

  • follow (String) (defaults to: :default)

    The xpath extracting links to be followed during the crawl. This changes how a site is crawled. Only links pointing to the site domain are allowed. The :default is any <a> href returning HTML.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the follow: links by selecting them if their path matches (via File.fnmatch?) one of allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the follow: links by rejecting them if their path matches (via File.fnmatch?) one of disallow_paths.

Yields:

  • (doc)

    Given each crawled page (Wgit::Document) of the site. A block is the only way to interact with each crawled Document. Use doc.empty? to determine if the page is valid.

Returns:

  • (Array<Wgit::Url>, nil)

    Unique Array of external urls collected from all of the site's pages or nil if the given url could not be crawled successfully.



# File 'lib/wgit/crawler.rb', line 116

def crawl_site(
  url, follow: :default, allow_paths: nil, disallow_paths: nil, &block
)
  doc = crawl_url(url, &block)
  return nil if doc.empty?

  total_pages = 1
  link_opts = { xpath: follow, allow_paths:, disallow_paths: }

  crawled   = Set.new(url.redirects_journey)
  externals = Set.new(doc.external_links)
  internals = Set.new(next_internal_links(doc, **link_opts))

  return externals.to_a if internals.empty?

  loop do
    links = subtract_links(internals, crawled)
    break if links.empty?

    links.each do |link|
      doc = crawl_url(link, follow_redirects: :host, &block)

      crawled += link.redirects_journey
      next if doc.empty?

      total_pages += 1
      internals   += next_internal_links(doc, **link_opts)
      externals   += doc.external_links
    end
  end

  Wgit.logger.debug("Crawled #{total_pages} documents for the site: #{url}")

  externals.to_a
end
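
For example, a selective site crawl limited to a wiki section might look like this sketch (the URL and path are placeholders):

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com/index')

external_urls = crawler.crawl_site(url, allow_paths: 'wiki/*') do |doc|
  next if doc.empty? # Skip pages that failed to crawl.

  puts doc.url
end

puts 'The initial URL could not be crawled' if external_urls.nil?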

#crawl_url(url, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl_page

Crawls the url, returning the response Wgit::Document or nil if an error occurs.

Parameters:

  • url (Wgit::Url)

    The Url to crawl; which will be modified in the event of a redirect.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :origin, :host, :domain or :brand. See Wgit::Url#relative? opts param.

Yields:

  • (doc)

    The crawled HTML page (Wgit::Document) regardless if the crawl was successful or not. Therefore, Document#url etc. can be used. Use doc.empty? to determine if the page is valid.

Returns:

  • (Wgit::Document)

    The crawled HTML Document. Check if the crawl was successful with doc.empty? (true if unsuccessful).



# File 'lib/wgit/crawler.rb', line 191

def crawl_url(url, follow_redirects: true)
  # A String url isn't allowed because it's passed by value not reference,
  # meaning a redirect isn't reflected; A Wgit::Url is passed by reference.
  assert_type(url, Wgit::Url)

  html = fetch(url, follow_redirects:)
  doc  = Wgit::Document.new(url, html, encode: @encode)

  yield(doc) if block_given?

  doc
end
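
For example, crawling a single page while only allowing same-host redirects (placeholder URL):

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com/about')

doc = crawler.crawl_url(url, follow_redirects: :host)

if doc.empty?
  puts "Failed to crawl #{doc.url}"
else
  puts doc.external_links.size
end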

#crawl_urls(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl, crawl_pages

Crawls one or more individual urls using Wgit::Crawler#crawl_url underneath. See Wgit::Crawler#crawl_site for crawling entire sites.

Parameters:

  • urls (*Wgit::Url)

    The Urls to crawl.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :origin, :host, :domain or :brand. See Wgit::Url#relative? opts param. This value will be used for all urls crawled.

Yields:

  • (doc)

    Given each crawled page (Wgit::Document); this is the only way to interact with them. Use doc.empty? to determine if the page is valid.

Returns:

  • (Wgit::Document)

    The last crawled Wgit::Document.

Raises:

  • (StandardError)

    If no urls are provided.



# File 'lib/wgit/crawler.rb', line 166

def crawl_urls(*urls, follow_redirects: true, &block)
  raise 'You must provide at least one Url' if urls.empty?

  opts = { follow_redirects: }
  doc = nil

  Wgit::Utils.each(urls) { |url| doc = crawl_url(url, **opts, &block) }

  doc
end
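
For example, crawling several individual pages in one call (placeholder URLs):

crawler = Wgit::Crawler.new
urls    = [
  Wgit::Url.new('https://example.com'),
  Wgit::Url.new('https://example.org')
]

crawler.crawl_urls(*urls) do |doc|
  puts "#{doc.url} => #{doc.empty? ? 'failed' : 'ok'}"
end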

#fetch(url, follow_redirects: true) ⇒ String? (protected)

Returns the URL's HTML String or nil. Handles any errors that arise and sets @last_response. Errors, or any HTTP response that doesn't return an HTML body, are ignored and nil is returned.

If @parse_javascript is true, then the final resolved URL will be browsed to and Javascript parsed allowing for dynamic HTML generation.

Parameters:

  • url (Wgit::Url)

    The URL to fetch. This Url object is passed by reference and gets modified as a result of the fetch/crawl.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :origin, :host, :domain or :brand. See Wgit::Url#relative? opts param.

Returns:

  • (String, nil)

    The crawled HTML or nil if the crawl was unsuccessful.

Raises:

  • (StandardError)

    If url isn't valid and absolute.



# File 'lib/wgit/crawler.rb', line 222

def fetch(url, follow_redirects: true)
  response = Wgit::Response.new
  raise "Invalid url: #{url}" if url.invalid?

  resolve(url, response, follow_redirects:)
  get_browser_response(url, response) if @parse_javascript

  response.body_or_nil
rescue StandardError => e
  Wgit.logger.debug("Wgit::Crawler#fetch('#{url}') exception: #{e}")

  nil
ensure
  url.crawled        = true # Sets date_crawled underneath.
  url.crawl_duration = response.total_time

  # Don't override previous url.redirects if response is fully resolved.
  url.redirects      = response.redirects unless response.redirects.empty?

  @last_response = response
end

#get_browser_response(url, response) {|browser| ... } ⇒ Wgit::Response (protected)

Makes a browser request and enriches the given Wgit::Response from it.

Parameters:

  • url (String)

    The url to browse to. Will call url#normalize if possible.

  • response (Wgit::Response)

    The response to enrich. Modifies by reference.

Yields:

  • (browser)

Returns:

  • (Wgit::Response)

    The enriched Wgit::Response object.

Raises:

  • (StandardError)

    If a response can't be obtained.



# File 'lib/wgit/crawler.rb', line 325

def get_browser_response(url, response)
  url     = url.normalize if url.respond_to?(:normalize)
  browser = nil

  crawl_time = Benchmark.measure { browser = browser_get(url) }.real
  yield browser if block_given?

  # Enrich the given Wgit::Response object (on top of Typhoeus response).
  response.adapter_response = browser.network.response
  response.status           = browser.network.response.status
  response.headers          = browser.network.response.headers
  response.body             = browser.body
  response.add_total_time(crawl_time)

  # Log the request/response details.
  log_net(:browser, response, crawl_time)

  # Handle a failed response.
  raise "No browser response (within timeout: #{@timeout} second(s))" \
  if response.failure?
end

#get_http_response(url, response) ⇒ Wgit::Response (protected)

Makes an HTTP request and enriches the given Wgit::Response from it.

Parameters:

  • url (String)

    The url to GET. Will call url#normalize if possible.

  • response (Wgit::Response)

    The response to enrich. Modifies by reference.

Returns:

  • (Wgit::Response)

    The enriched Wgit::Response object.

Raises:

  • (StandardError)

    If a response can't be obtained.



# File 'lib/wgit/crawler.rb', line 293

def get_http_response(url, response)
  # Perform a HTTP GET request.
  orig_url = url.to_s
  url      = url.normalize if url.respond_to?(:normalize)

  http_response = http_get(url)

  # Enrich the given Wgit::Response object.
  response.adapter_response = http_response
  response.url              = orig_url
  response.status           = http_response.code
  response.headers          = http_response.headers
  response.body             = http_response.body
  response.ip_address       = http_response.primary_ip
  response.add_total_time(http_response.total_time)

  # Log the request/response details.
  log_net(:http, response, http_response.total_time)

  # Handle a failed response.
  raise "No response (within timeout: #{@timeout} second(s))" \
  if response.failure?
end

#http_get(url) ⇒ Typhoeus::Response (protected)

Performs an HTTP GET request and returns the response.

Parameters:

  • url (String)

    The url to GET.

Returns:

  • (Typhoeus::Response)

    The HTTP response object.



# File 'lib/wgit/crawler.rb', line 351

def http_get(url)
  opts = {
    followlocation: false,
    timeout: @timeout,
    accept_encoding: 'gzip',
    headers: {
      'User-Agent' => "wgit/#{Wgit::VERSION}",
      'Accept'     => 'text/html'
    }
  }

  # See https://rubydoc.info/gems/typhoeus for more info.
  Typhoeus.get(url, **opts)
end

#next_internal_links(doc, xpath: :default, allow_paths: nil, disallow_paths: nil) ⇒ Array<Wgit::Url> (protected)

Returns a doc's internal HTML page links in absolute form; used when crawling a site. By default, any <a> href returning HTML is returned; override this with xpath: if desired.

Use the allow and disallow paths params to partially and selectively crawl a site; the glob syntax is supported e.g. 'wiki/*' etc. Note that each path should NOT start with a slash.

Parameters:

  • doc (Wgit::Document)

    The document whose internal page links are extracted.

  • xpath (Symbol, String) (defaults to: :default)

    The xpath extracting the links to be followed. The :default is any <a> href returning HTML.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the links by selecting them if their path matches (via File.fnmatch?) one of allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the links by rejecting them if their path matches (via File.fnmatch?) one of disallow_paths.

Returns:

  • (Array<Wgit::Url>)

    The internal page links from doc.



# File 'lib/wgit/crawler.rb', line 407

def next_internal_links(
  doc, xpath: :default, allow_paths: nil, disallow_paths: nil
)
  links = if xpath && xpath != :default
            follow_xpath(doc, xpath)
          else
            follow_default(doc)
          end

  return links if allow_paths.nil? && disallow_paths.nil?

  process_paths(links, allow_paths, disallow_paths)
end

#resolve(url, response, follow_redirects: true) ⇒ Object (protected)

GETs the given url, resolving any redirects. The given response object will be enriched.

Parameters:

  • url (Wgit::Url)

    The URL to GET and resolve.

  • response (Wgit::Response)

    The response to enrich. Modifies by reference.

  • follow_redirects (Boolean, Symbol) (defaults to: true)

    Whether or not to follow redirects. Pass a Symbol to limit where the redirect is allowed to go e.g. :host only allows redirects within the same host. Choose from :origin, :host, :domain or :brand. See Wgit::Url#relative? opts param.

Raises:

  • (StandardError)

    If a redirect isn't allowed etc.



# File 'lib/wgit/crawler.rb', line 255

def resolve(url, response, follow_redirects: true)
  origin = url.to_origin # Record the origin before any redirects.
  follow_redirects, within = redirect?(follow_redirects)

  loop do
    get_http_response(url, response)
    break unless response.redirect?

    # Handle response 'Location' header.
    location = Wgit::Url.new(response.headers.fetch(:location, ''))
    raise 'Encountered redirect without Location header' if location.empty?

    yield(url, response, location) if block_given?

    # Validate if the redirect is allowed.
    raise "Redirect not allowed: #{location}" unless follow_redirects

    if within && !location.relative?(within => origin)
      raise "Redirect (outside of #{within}) is not allowed: '#{location}'"
    end

    raise "Too many redirects, exceeded: #{@redirect_limit}" \
    if response.redirect_count >= @redirect_limit

    # Process the location to be crawled next.
    location = url.to_origin.join(location) if location.relative?
    response.redirections[url.to_s] = location.to_s
    url.replace(location) # Update the url on redirect.
  end
end
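
As a sketch of the redirect scoping above: passing follow_redirects: :host to a crawl results in a check equivalent to the following, reusing the example URLs from #crawl_site's documentation:

origin   = Wgit::Url.new('http://www.example.co.uk/how').to_origin
location = Wgit::Url.new('https://ftp.example.co.uk')

# The redirect is only followed if the location stays within the original
# host; here it doesn't, so #resolve would raise an error.
location.relative?(host: origin) # => false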