Method: Wgit::Crawler#crawl_site

Defined in:
lib/wgit/crawler.rb

#crawl_site(url, follow: :default, max_pages: nil, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r

Crawls an entire website's HTML pages by recursively following its internal <a> links; the links followed can be overridden with the follow: xpath parameter. Each crawled Document is yielded to the given block. Use doc.empty? to determine whether the crawl of that link was successful / the page is valid.

Use the allow_paths and disallow_paths params to partially and selectively crawl a site; glob syntax is fully supported, e.g. 'wiki/*'.

Only redirects to the same host are followed. For example, the Url 'http://www.example.co.uk/how' has a host of 'www.example.co.uk', meaning a link which redirects to 'https://ftp.example.co.uk' or 'https://www.example.com' will not be followed. The only exception is the initially crawled url, which is allowed to redirect anywhere; its host is then used for all other link redirections on the site, as described above.
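The host comparison described above can be sketched with Ruby's stdlib URI, using the doc's own example URLs (the comparison shown here illustrates the rule; it is not Wgit's internal implementation):

```ruby
require 'uri'

origin   = URI('http://www.example.co.uk/how')
same     = URI('https://www.example.co.uk/other') # hypothetical same-host link
ftp_sub  = URI('https://ftp.example.co.uk')
com_site = URI('https://www.example.com')

# A redirect is only followed when the hosts match exactly.
puts origin.host == same.host     # true  - same host, redirect followed
puts origin.host == ftp_sub.host  # false - different subdomain, not followed
puts origin.host == com_site.host # false - different domain, not followed
```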

Parameters:

  • url (Wgit::Url)

    The base URL of the website to be crawled. It is recommended that this URL be the index page of the site to give a greater chance of finding all pages within that site/host.

  • follow (String) (defaults to: :default)

    The xpath extracting links to be followed during the crawl. This changes how a site is crawled. Only links pointing to the site domain are allowed. The :default is any <a> href returning HTML.

  • max_pages (Integer) (defaults to: nil)

    The maximum number of pages to crawl before stopping; nil means no limit.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the links extracted by follow:, selecting a link only if its path matches (via File.fnmatch?) one of allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the links extracted by follow:, rejecting a link if its path matches (via File.fnmatch?) one of disallow_paths.
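The glob matching behind both params uses Ruby's File.fnmatch?. A quick sketch (the paths below are hypothetical, not part of the Wgit API) of how a 'wiki/*' pattern would accept or reject link paths:

```ruby
# File.fnmatch? compares a glob pattern against a path string.
allow = 'wiki/*'

puts File.fnmatch?(allow, 'wiki/Ruby')      # true  - under wiki/, selected
puts File.fnmatch?(allow, 'blog/new-post')  # false - outside wiki/, rejected
```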

Yields:

  • (doc)

    Given each crawled page (Wgit::Document) of the site. A block is the only way to interact with each crawled Document. Use doc.empty? to determine if the page is valid.

Returns:

  • (Array<Wgit::Url>, nil)

    A unique Array of external URLs collected from all of the site's pages, or nil if the given url could not be crawled successfully.
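A minimal usage sketch, assuming the wgit gem is installed and using 'https://example.com' as a stand-in for a real site (both are assumptions for illustration, not part of this doc):

```ruby
require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com') # hypothetical site

# Crawl up to 10 pages, only following links whose path matches 'blog/*'.
externals = crawler.crawl_site(url, max_pages: 10, allow_paths: 'blog/*') do |doc|
  puts doc.url unless doc.empty? # skip pages that failed to crawl
end

# externals is a unique Array<Wgit::Url> of off-site links,
# or nil if the initial url could not be crawled.
```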



# File 'lib/wgit/crawler.rb', line 146

def crawl_site(
  url, follow: :default, max_pages: nil,
  allow_paths: nil, disallow_paths: nil, &block
)
  doc = crawl_url(url, &block)
  return nil if doc.empty?

  total_pages = 1
  limit_reached = max_pages && total_pages >= max_pages
  link_opts = { xpath: follow, allow_paths:, disallow_paths: }

  crawled   = Set.new(url.redirects_journey)
  externals = Set.new(doc.external_links)
  internals = Set.new(next_internal_links(doc, **link_opts))

  return externals.to_a if internals.empty?

  loop do
    if limit_reached
      Wgit.logger.debug("Crawled and reached the max_pages limit of: #{max_pages}")
      break
    end

    links = subtract_links(internals, crawled)
    break if links.empty?

    links.each do |link|
      limit_reached = max_pages && total_pages >= max_pages
      break if limit_reached

      doc = crawl_url(link, follow_redirects: :host, &block)

      crawled += link.redirects_journey
      next if doc.empty?

      total_pages += 1
      internals   += next_internal_links(doc, **link_opts)
      externals   += doc.external_links
    end
  end

  Wgit.logger.debug("Crawled #{total_pages} documents for the site: #{url}")

  externals.to_a
end