Method: Wgit::Crawler#crawl_site

Defined in:
lib/wgit/crawler.rb

#crawl_site(url, follow: :default, max_pages: nil, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r

Crawls an entire website's HTML pages by recursively following its internal <a> links; the links followed can be overridden with the follow: xpath parameter. Each crawled Document is yielded to the given block. Use doc.empty? to determine whether the crawl of that link was successful / the page is valid.

Use the allow_paths and disallow_paths params to partially and selectively crawl a site; glob syntax is fully supported, e.g. 'wiki/*'.

Only redirects to the same host are followed. For example, the Url 'http://www.example.co.uk/how' has a host of 'www.example.co.uk', meaning a link which redirects to 'https://ftp.example.co.uk' or 'https://www.example.com' will not be followed. The only exception is the initially crawled url, which is allowed to redirect anywhere; its host is then used for all other link redirections on the site, as described above.
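The host comparison described above can be sketched with Ruby's stdlib URI, using the doc's own example URLs (the comparison shown here illustrates the rule; it is not Wgit's internal implementation):

```ruby
require 'uri'

origin   = URI('http://www.example.co.uk/how')
same     = URI('https://www.example.co.uk/other') # hypothetical same-host link
ftp_sub  = URI('https://ftp.example.co.uk')
com_site = URI('https://www.example.com')

# A redirect is only followed when the hosts match exactly.
puts origin.host == same.host     # true  - same host, redirect followed
puts origin.host == ftp_sub.host  # false - different subdomain, not followed
puts origin.host == com_site.host # false - different domain, not followed
```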

Parameters:

  • url (Wgit::Url)

    The base URL of the website to be crawled. It is recommended that this URL be the index page of the site to give a greater chance of finding all pages within that site/host.

  • follow (String) (defaults to: :default)

    The xpath extracting links to be followed during the crawl. This changes how a site is crawled. Only links pointing to the site domain are allowed. The :default is any <a> href returning HTML.

  • max_pages (Integer) (defaults to: nil)

    The maximum number of pages to crawl before stopping; nil means no limit.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the links extracted by follow:, selecting a link only if its path matches (via File.fnmatch?) one of allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the links extracted by follow:, rejecting a link if its path matches (via File.fnmatch?) one of disallow_paths.
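The glob matching behind both params uses Ruby's File.fnmatch?. A quick sketch (the paths below are hypothetical, not part of the Wgit API) of how a 'wiki/*' pattern would accept or reject link paths:

```ruby
# File.fnmatch? compares a glob pattern against a path string.
allow = 'wiki/*'

puts File.fnmatch?(allow, 'wiki/Ruby')      # true  - under wiki/, selected
puts File.fnmatch?(allow, 'blog/new-post')  # false - outside wiki/, rejected
```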

Yields:

  • (doc)

    Given each crawled page (Wgit::Document) of the site. A block is the only way to interact with each crawled Document. Use doc.empty? to determine if the page is valid.

Returns:

  • (Array<Wgit::Url>, nil)

    A unique Array of external URLs collected from all of the site's pages, or nil if the given url could not be crawled successfully.
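A minimal usage sketch, assuming the wgit gem is installed and using 'https://example.com' as a stand-in for a real site (both are assumptions for illustration, not part of this doc):

```ruby
require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com') # hypothetical site

# Crawl up to 10 pages, only following links whose path matches 'blog/*'.
externals = crawler.crawl_site(url, max_pages: 10, allow_paths: 'blog/*') do |doc|
  puts doc.url unless doc.empty? # skip pages that failed to crawl
end

# externals is a unique Array<Wgit::Url> of off-site links,
# or nil if the initial url could not be crawled.
```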



# File 'lib/wgit/crawler.rb', line 146

def crawl_site(
  url, follow: :default, max_pages: nil,
  allow_paths: nil, disallow_paths: nil, &block
)
  doc = crawl_url(url, &block)
  return nil if doc.empty?

  total_pages = 1
  limit_reached = max_pages && total_pages >= max_pages
  link_opts = { xpath: follow, allow_paths:, disallow_paths: }

  crawled   = Set.new(url.redirects_journey)
  externals = Set.new(doc.external_links)
  internals = Set.new(next_internal_links(doc, **link_opts))

  return externals.to_a if internals.empty?

  loop do
    if limit_reached
      Wgit.logger.debug("Crawled and reached the max_pages limit of: #{max_pages}")
      break
    end

    links = subtract_links(internals, crawled)
    break if links.empty?

    links.each do |link|
      limit_reached = max_pages && total_pages >= max_pages
      break if limit_reached

      doc = crawl_url(link, follow_redirects: :host, &block)

      crawled += link.redirects_journey
      next if doc.empty?

      total_pages += 1
      internals   += next_internal_links(doc, **link_opts)
      externals   += doc.external_links
    end
  end

  Wgit.logger.debug("Crawled #{total_pages} documents for the site: #{url}")

  externals.to_a
end