Class: Wgit::Crawler
Overview
The Crawler class provides a means of crawling web-based HTTP Wgit::Urls
and serialising their HTML into Wgit::Document instances. This is the
only Wgit class containing network logic (HTTP request/response handling).
Constant Summary
Constants included from Assertable
Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::MIXED_ENUMERABLE_MSG, Assertable::NON_ENUMERABLE_MSG
Class Attribute Summary
- .supported_file_extensions ⇒ Object (readonly)
  The URL file extensions (from <a> hrefs) which will be crawled by #crawl_site.
Instance Attribute Summary
- #encode ⇒ Object
  Whether or not to UTF-8 encode the response body once crawled.
- #ferrum_opts ⇒ Object
  The opts Hash passed directly to the ferrum Chrome browser when parse_javascript: true.
- #last_response ⇒ Object (readonly)
  The Wgit::Response of the most recently crawled URL.
- #parse_javascript ⇒ Object
  Whether or not to parse the Javascript of the crawled document.
- #parse_javascript_delay ⇒ Object
  The delay between checks of a page's HTML size.
- #redirect_limit ⇒ Object
  The number of allowed redirects before raising an error.
- #timeout ⇒ Object
  The maximum amount of time (in seconds) a crawl request has to complete before raising an error.
Instance Method Summary
- #browser_get(url) ⇒ Ferrum::Browser (protected)
  Performs a HTTP GET request in a web browser and parses the response JS before returning the HTML body of the fully rendered webpage.
- #crawl_site(url, follow: :default, max_pages: nil, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? (also: #crawl_r)
  Crawls an entire website's HTML pages by recursively going through its internal <a> links; this can be overridden with follow: xpath.
- #crawl_url(url, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document (also: #crawl_page)
  Crawls the url, returning the response Wgit::Document or nil if an error occurs.
- #crawl_urls(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document (also: #crawl, #crawl_pages)
  Crawls one or more individual urls using Wgit::Crawler#crawl_url underneath.
- #fetch(url, follow_redirects: true) ⇒ String? (protected)
  Returns the URL's HTML String or nil.
- #get_browser_response(url, response) {|browser| ... } ⇒ Wgit::Response (protected)
  Makes a browser request and enriches the given Wgit::Response from it.
- #get_http_response(url, response) ⇒ Wgit::Response (protected)
  Makes a HTTP request and enriches the given Wgit::Response from it.
- #http_get(url) ⇒ Typhoeus::Response (protected)
  Performs a HTTP GET request and returns the response.
- #initialize(redirect_limit: 5, timeout: 5, encode: true, parse_javascript: false, parse_javascript_delay: 1, ferrum_opts: {}) ⇒ Crawler (constructor)
  Initializes and returns a Wgit::Crawler instance.
- #inspect ⇒ String
  Overrides Object#inspect to shorten the printed output of a Crawler.
- #next_internal_links(doc, xpath: :default, allow_paths: nil, disallow_paths: nil) ⇒ Array<Wgit::Url> (protected)
  Returns a doc's internal HTML page links in absolute form; used when crawling a site.
- #resolve(url, response, follow_redirects: true) ⇒ Object (protected)
  GETs the given url, resolving any redirects.
Methods included from Assertable
#assert_arr_types, #assert_common_arr_types, #assert_required_keys, #assert_respond_to, #assert_types
Constructor Details
#initialize(redirect_limit: 5, timeout: 5, encode: true, parse_javascript: false, parse_javascript_delay: 1, ferrum_opts: {}) ⇒ Crawler
Initializes and returns a Wgit::Crawler instance.
# File 'lib/wgit/crawler.rb', line 79

def initialize(redirect_limit: 5, timeout: 5, encode: true,
               parse_javascript: false, parse_javascript_delay: 1,
               ferrum_opts: {})
  assert_type(redirect_limit, Integer)
  assert_type(timeout, [Integer, Float])
  assert_type(encode, [TrueClass, FalseClass])
  assert_type(parse_javascript, [TrueClass, FalseClass])
  assert_type(parse_javascript_delay, Integer)
  assert_type(ferrum_opts, Hash)

  @redirect_limit = redirect_limit
  @timeout = timeout
  @encode = encode
  @parse_javascript = parse_javascript
  @parse_javascript_delay = parse_javascript_delay
  @ferrum_opts = default_ferrum_opts.merge(ferrum_opts)
end
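As a usage sketch, creating a crawler with non-default settings (the values below are illustrative, not Wgit recommendations):

require 'wgit'

crawler = Wgit::Crawler.new(
  redirect_limit: 3,      # Allow up to 3 redirects per crawl.
  timeout: 10,            # Seconds before a request raises an error.
  parse_javascript: false # No Chrome/Chromium needed when false.
)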
Class Attribute Details
.supported_file_extensions ⇒ Object (readonly)
The URL file extensions (from <a> hrefs) which will be crawled by
#crawl_site. The idea is to omit anything that isn't HTML and therefore
doesn't keep the crawl of the site going. All URLs without a file
extension will be crawled, because they're assumed to be HTML.
The #crawl method will crawl anything since it's given the URL(s).
You can add your own site's URL file extension e.g.
Wgit::Crawler.supported_file_extensions << 'html5' etc.
# File 'lib/wgit/crawler.rb', line 32

def supported_file_extensions
  @supported_file_extensions
end
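For example, registering an extra extension so that matching hrefs are followed by #crawl_site (using the 'html5' example from the description above):

Wgit::Crawler.supported_file_extensions << 'html5'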
Instance Attribute Details
#encode ⇒ Object
Whether or not to UTF-8 encode the response body once crawled. Set to false if crawling more than just HTML e.g. images.
# File 'lib/wgit/crawler.rb', line 46

def encode
  @encode
end
#ferrum_opts ⇒ Object
The opts Hash passed directly to the ferrum Chrome browser when
parse_javascript: true.
See https://github.com/rubycdp/ferrum for details.
# File 'lib/wgit/crawler.rb', line 60

def ferrum_opts
  @ferrum_opts
end
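A sketch of passing options through to Ferrum (the :headless and :timeout keys are standard Ferrum::Browser options, not Wgit-specific; see the ferrum README for the full list):

crawler = Wgit::Crawler.new(
  parse_javascript: true,
  ferrum_opts: { headless: true, timeout: 20 } # Merged over Wgit's defaults.
)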
#last_response ⇒ Object (readonly)
The Wgit::Response of the most recently crawled URL.
# File 'lib/wgit/crawler.rb', line 63

def last_response
  @last_response
end
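For instance, a sketch of checking the last response after a crawl (the URL is illustrative; status and total_time are set by the crawler, as shown in #get_http_response below):

crawler = Wgit::Crawler.new
crawler.crawl(Wgit::Url.new('https://example.com'))

crawler.last_response.status     # HTTP status code e.g. 200.
crawler.last_response.total_time # Crawl duration in seconds.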
#parse_javascript ⇒ Object
Whether or not to parse the Javascript of the crawled document. Parsing requires Chrome/Chromium to be installed and in $PATH.
# File 'lib/wgit/crawler.rb', line 50

def parse_javascript
  @parse_javascript
end
#parse_javascript_delay ⇒ Object
The delay between checks of a page's HTML size. When the page has stopped "growing", the Javascript has finished dynamically updating the DOM. The value should balance a good UX against allowing enough JS parse time.
# File 'lib/wgit/crawler.rb', line 55

def parse_javascript_delay
  @parse_javascript_delay
end
#redirect_limit ⇒ Object
The number of allowed redirects before raising an error. Set to 0 to
disable redirects completely; or you can pass follow_redirects: false
to any Wgit::Crawler#crawl_* method.
# File 'lib/wgit/crawler.rb', line 38

def redirect_limit
  @redirect_limit
end
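A sketch of the two ways to disable redirects (the URL is illustrative):

# Disable redirects for every crawl made by this instance:
crawler = Wgit::Crawler.new(redirect_limit: 0)

# Or disable them for a single crawl only:
crawler = Wgit::Crawler.new
crawler.crawl_url(Wgit::Url.new('https://example.com'), follow_redirects: false)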
#timeout ⇒ Object
The maximum amount of time (in seconds) a crawl request has to complete before raising an error. Set to 0 to disable timeouts completely.
# File 'lib/wgit/crawler.rb', line 42

def timeout
  @timeout
end
Instance Method Details
#browser_get(url) ⇒ Ferrum::Browser (protected)
Performs a HTTP GET request in a web browser and parses the response JS before returning the HTML body of the fully rendered webpage. This allows Javascript (SPA apps etc.) to generate HTML dynamically.
# File 'lib/wgit/crawler.rb', line 404

def browser_get(url)
  @browser ||= Ferrum::Browser.new(**@ferrum_opts)
  @browser.goto(url)

  # Wait for the page's JS to finish dynamically manipulating the DOM.
  html = @browser.body
  loop do
    sleep @parse_javascript_delay
    break if html.size == @browser.body.size

    html = @browser.body
  end

  @browser
end
#crawl_site(url, follow: :default, max_pages: nil, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r
Crawls an entire website's HTML pages by recursively going through
its internal <a> links; this can be overridden with follow: xpath.
Each crawled Document is yielded to a block. Use doc.empty? to
determine if the crawled link was successful / is valid.
Use the allow and disallow paths params to partially and selectively
crawl a site; the glob syntax is fully supported e.g. 'wiki/\*' etc.
Only redirects to the same host are followed. For example, the Url
'http://www.example.co.uk/how' has a host of 'www.example.co.uk', meaning a
link which redirects to 'https://ftp.example.co.uk' or 'https://www.example.com'
will not be followed. The only exception to this is the initially crawled url,
which is allowed to redirect anywhere; its host is then used for other link
redirections on the site, as described above.
# File 'lib/wgit/crawler.rb', line 138

def crawl_site(
  url, follow: :default, max_pages: nil,
  allow_paths: nil, disallow_paths: nil, &block
)
  doc = crawl_url(url, &block)
  return nil if doc.empty?

  total_pages = 1
  limit_reached = max_pages && total_pages >= max_pages
  link_opts = { xpath: follow, allow_paths:, disallow_paths: }

  crawled   = Set.new(url.redirects_journey)
  externals = Set.new(doc.external_links)
  internals = Set.new(next_internal_links(doc, **link_opts))

  return externals.to_a if internals.empty?

  loop do
    if limit_reached
      Wgit.logger.debug("Crawled and reached the max_pages limit of: #{max_pages}")
      break
    end

    links = subtract_links(internals, crawled)
    break if links.empty?

    links.each do |link|
      limit_reached = max_pages && total_pages >= max_pages
      break if limit_reached

      doc = crawl_url(link, follow_redirects: :host, &block)

      crawled += link.redirects_journey
      next if doc.empty?

      total_pages += 1
      internals += next_internal_links(doc, **link_opts)
      externals += doc.external_links
    end
  end

  Wgit.logger.debug("Crawled #{total_pages} documents for the site: #{url}")

  externals.to_a
end
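As a usage sketch (the URL and allow_paths value below are illustrative, not from the Wgit docs):

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com')

# Crawl at most 50 pages, limited to the blog section of the site.
externals = crawler.crawl_site(url, max_pages: 50, allow_paths: 'blog/*') do |doc|
  puts doc.url unless doc.empty?
end

# crawl_site returns the site's external links (or nil if the first crawl failed).
puts externals&.size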
#crawl_url(url, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl_page
Crawls the url, returning the response Wgit::Document or nil if an error occurs.
# File 'lib/wgit/crawler.rb', line 223

def crawl_url(url, follow_redirects: true)
  # A String url isn't allowed because it's passed by value not reference,
  # meaning a redirect isn't reflected; A Wgit::Url is passed by reference.
  assert_type(url, Wgit::Url)

  html = fetch(url, follow_redirects:)
  doc = Wgit::Document.new(url, html, encode: @encode)

  yield(doc) if block_given?

  doc
end
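A usage sketch (the URL is illustrative):

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com')

doc = crawler.crawl_url(url) { |d| puts d.url unless d.empty? }

# The crawled Wgit::Document is also returned.
puts doc.html.size unless doc.empty?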
#crawl_urls(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl, crawl_pages
Crawls one or more individual urls using Wgit::Crawler#crawl_url underneath. See Wgit::Crawler#crawl_site for crawling entire sites.
# File 'lib/wgit/crawler.rb', line 198

def crawl_urls(*urls, follow_redirects: true, &block)
  raise 'You must provide at least one Url' if urls.empty?

  opts = { follow_redirects: }
  doc = nil

  Wgit::Utils.each(urls) { |url| doc = crawl_url(url, **opts, &block) }

  doc
end
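A usage sketch crawling several pages in one call (the URLs are illustrative):

crawler = Wgit::Crawler.new
urls    = [
  Wgit::Url.new('https://example.com'),
  Wgit::Url.new('https://example.org')
]

# #crawl is an alias of #crawl_urls; each crawled Document is yielded in turn.
crawler.crawl(*urls) do |doc|
  puts "#{doc.url} empty?: #{doc.empty?}"
end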
#fetch(url, follow_redirects: true) ⇒ String? (protected)
Returns the URL's HTML String or nil. Handles any errors that arise and sets the @last_response. Errors or any HTTP response that doesn't return a HTML body will be ignored, returning nil.
If @parse_javascript is true, then the final resolved URL will be browsed to and Javascript parsed allowing for dynamic HTML generation.
# File 'lib/wgit/crawler.rb', line 254

def fetch(url, follow_redirects: true)
  response = Wgit::Response.new
  raise "Invalid url: #{url}" if url.invalid?

  resolve(url, response, follow_redirects:)
  get_browser_response(url, response) if @parse_javascript

  response.body_or_nil
rescue StandardError => e
  Wgit.logger.debug("Wgit::Crawler#fetch('#{url}') exception: #{e}")

  nil
ensure
  url.crawled = true # Sets date_crawled underneath.
  url.crawl_duration = response.total_time

  # Don't override previous url.redirects if response is fully resolved.
  url.redirects = response.redirects unless response.redirects.empty?

  @last_response = response
end
#get_browser_response(url, response) {|browser| ... } ⇒ Wgit::Response (protected)
Makes a browser request and enriches the given Wgit::Response from it.
# File 'lib/wgit/crawler.rb', line 357

def get_browser_response(url, response)
  url = url.normalize if url.respond_to?(:normalize)
  browser = nil

  crawl_time = Benchmark.measure { browser = browser_get(url) }.real
  yield browser if block_given?

  # Enrich the given Wgit::Response object (on top of Typhoeus response).
  response.adapter_response = browser.network.response
  response.status = browser.network.response.status
  response.headers = browser.network.response.headers
  response.body = browser.body
  response.add_total_time(crawl_time)

  # Log the request/response details.
  log_net(:browser, response, crawl_time)

  # Handle a failed response.
  raise "No browser response (within timeout: #{@timeout} second(s))" \
    if response.failure?
end
#get_http_response(url, response) ⇒ Wgit::Response (protected)
Makes a HTTP request and enriches the given Wgit::Response from it.
# File 'lib/wgit/crawler.rb', line 325

def get_http_response(url, response)
  # Perform a HTTP GET request.
  orig_url = url.to_s
  url = url.normalize if url.respond_to?(:normalize)

  http_response = http_get(url)

  # Enrich the given Wgit::Response object.
  response.adapter_response = http_response
  response.url = orig_url
  response.status = http_response.code
  response.headers = http_response.headers
  response.body = http_response.body
  response.ip_address = http_response.primary_ip
  response.add_total_time(http_response.total_time)

  # Log the request/response details.
  log_net(:http, response, http_response.total_time)

  # Handle a failed response.
  raise "No response (within timeout: #{@timeout} second(s))" \
    if response.failure?
end
#http_get(url) ⇒ Typhoeus::Response (protected)
Performs a HTTP GET request and returns the response.
# File 'lib/wgit/crawler.rb', line 383

def http_get(url)
  opts = {
    followlocation: false,
    timeout: @timeout,
    accept_encoding: 'gzip',
    headers: {
      'User-Agent' => "wgit/#{Wgit::VERSION}",
      'Accept' => 'text/html'
    }
  }

  # See https://rubydoc.info/gems/typhoeus for more info.
  Typhoeus.get(url, **opts)
end
#inspect ⇒ String
Overrides Object#inspect to shorten the printed output of a Crawler.
# File 'lib/wgit/crawler.rb', line 100

def inspect
  "#<Wgit::Crawler timeout=#{@timeout} redirect_limit=#{@redirect_limit} encode=#{@encode} parse_javascript=#{@parse_javascript} parse_javascript_delay=#{@parse_javascript_delay} ferrum_opts=#{@ferrum_opts}>"
end
#next_internal_links(doc, xpath: :default, allow_paths: nil, disallow_paths: nil) ⇒ Array<Wgit::Url> (protected)
Returns a doc's internal HTML page links in absolute form; used when
crawling a site. By default, any <a> href returning HTML is returned;
override this with xpath: if desired.
Use the allow and disallow paths params to partially and selectively
crawl a site; the glob syntax is supported e.g. 'wiki/\*' etc. Note
that each path should NOT start with a slash.
# File 'lib/wgit/crawler.rb', line 439

def next_internal_links(
  doc, xpath: :default, allow_paths: nil, disallow_paths: nil
)
  links = if xpath && xpath != :default
            follow_xpath(doc, xpath)
          else
            follow_default(doc)
          end

  return links if allow_paths.nil? && disallow_paths.nil?

  process_paths(links, allow_paths, disallow_paths)
end
#resolve(url, response, follow_redirects: true) ⇒ Object (protected)
GETs the given url, resolving any redirects. The given response object will be enriched.
# File 'lib/wgit/crawler.rb', line 287

def resolve(url, response, follow_redirects: true)
  origin = url.to_origin # Record the origin before any redirects.
  follow_redirects, within = redirect?(follow_redirects)

  loop do
    get_http_response(url, response)
    break unless response.redirect?

    # Handle response 'Location' header.
    location = Wgit::Url.new(response.headers.fetch(:location, ''))
    raise 'Encountered redirect without Location header' if location.empty?

    yield(url, response, location) if block_given?

    # Validate if the redirect is allowed.
    raise "Redirect not allowed: #{location}" unless follow_redirects

    if within && !location.relative?(within => origin)
      raise "Redirect (outside of #{within}) is not allowed: '#{location}'"
    end

    raise "Too many redirects, exceeded: #{@redirect_limit}" \
      if response.redirect_count >= @redirect_limit

    # Process the location to be crawled next.
    location = url.to_origin.join(location) if location.relative?
    response.redirections[url.to_s] = location.to_s
    url.replace(location) # Update the url on redirect.
  end
end