Class: Wgit::Crawler
Overview
The Crawler class provides a means of crawling web-based HTTP Wgit::Url's, serialising their HTML into Wgit::Document instances. This is the only Wgit class which contains network logic, e.g. request/response handling.
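For orientation, a minimal usage sketch is shown below; the URL is illustrative and not part of this reference:

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://example.com') # Any HTTP(S) URL you want to crawl.

crawler.crawl_url(url) { |doc| puts doc.title unless doc.empty? }
puts crawler.last_response.status # e.g. 200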
Constant Summary
- SUPPORTED_FILE_EXTENSIONS =
The URL file extensions (from <a> hrefs) which will be crawled by #crawl_site. The idea is to omit anything that isn't HTML and therefore won't keep the crawl of the site going. All URLs without a file extension will be crawled, because they're assumed to be HTML.
Set.new(%w[asp aspx cfm cgi htm html htmlx jsp php])
Constants included from Assertable
Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::WRONG_METHOD_MSG
Instance Attribute Summary
- #encode ⇒ Object
Whether or not to UTF-8 encode the response body once crawled.
- #last_response ⇒ Object (readonly)
The Wgit::Response of the most recently crawled URL.
- #redirect_limit ⇒ Object
The number of allowed redirects before an error is raised.
- #time_out ⇒ Object
The maximum amount of time (in seconds) a crawl request has to complete before raising an error.
Instance Method Summary
- #crawl_site(url, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? (also: #crawl_r)
Crawls an entire website's HTML pages by recursively following its internal <a> links.
- #crawl_url(url, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document? (also: #crawl_page)
Crawls the url, returning the response Wgit::Document or nil if an error occurs.
- #crawl_urls(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document (also: #crawl, #crawl_pages)
Crawls one or more individual urls using Wgit::Crawler#crawl_url underneath.
- #fetch(url, follow_redirects: true) ⇒ String? (protected)
Returns the url's HTML String or nil.
- #get_internal_links(doc, allow_paths: nil, disallow_paths: nil) ⇒ Array<Wgit::Url> (protected)
Returns a doc's internal HTML page links in absolute form; used when crawling a site.
- #get_response(url, response) ⇒ Wgit::Response (protected)
Makes a HTTP request and enriches the given Wgit::Response from it.
- #http_get(url) ⇒ Typhoeus::Response (protected)
Performs a HTTP GET request and returns the response.
- #initialize(redirect_limit: 5, time_out: 5, encode: true) ⇒ Crawler (constructor)
Initializes and returns a Wgit::Crawler instance.
- #resolve(url, response, follow_redirects: true) ⇒ Object (protected)
GETs the given url, resolving any redirects.
Methods included from Assertable
#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types
Constructor Details
#initialize(redirect_limit: 5, time_out: 5, encode: true) ⇒ Crawler
Initializes and returns a Wgit::Crawler instance.
# File 'lib/wgit/crawler.rb', line 51

def initialize(redirect_limit: 5, time_out: 5, encode: true)
  @redirect_limit = redirect_limit
  @time_out       = time_out
  @encode         = encode
end
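A sketch of constructing crawlers with non-default options (the values are illustrative):

# Allow more redirects and give slow servers longer to respond.
crawler = Wgit::Crawler.new(redirect_limit: 10, time_out: 30)

# Skip UTF-8 encoding of response bodies, e.g. when crawling images.
binary_crawler = Wgit::Crawler.new(encode: false)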
Instance Attribute Details
#encode ⇒ Object
Whether or not to UTF-8 encode the response body once crawled. Set to false if crawling more than just HTML e.g. images.
# File 'lib/wgit/crawler.rb', line 37

def encode
  @encode
end
#last_response ⇒ Object (readonly)
The Wgit::Response of the most recently crawled URL.
# File 'lib/wgit/crawler.rb', line 40

def last_response
  @last_response
end
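As the attribute is read-only, it's typically inspected after a crawl; for example (assuming a crawler and a Wgit::Url, e.g. from the overview sketch):

crawler.crawl_url(url)

response = crawler.last_response # A Wgit::Response.
puts response.status             # HTTP status code, e.g. 200.
puts response.total_time         # Time taken (in seconds) for the request(s).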
#redirect_limit ⇒ Object
The number of allowed redirects before an error is raised. Set to 0 to disable redirects completely; alternatively, you can pass follow_redirects: false to any Wgit::Crawler#crawl_* method.
# File 'lib/wgit/crawler.rb', line 29

def redirect_limit
  @redirect_limit
end
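For example, either of the following prevents redirects from being followed (a sketch, assuming a crawler instance and url):

crawler.redirect_limit = 0                      # Affects every subsequent crawl.
crawler.crawl_url(url, follow_redirects: false) # Affects this crawl only.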
#time_out ⇒ Object
The maximum amount of time (in seconds) a crawl request has to complete before raising an error. Set to 0 to disable time outs completely.
# File 'lib/wgit/crawler.rb', line 33

def time_out
  @time_out
end
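For example (illustrative values, assuming a crawler instance):

crawler.time_out = 30 # Allow up to 30 seconds per request.
crawler.time_out = 0  # Never time out.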
Instance Method Details
#crawl_site(url, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r
Crawls an entire website's HTML pages by recursively going through its internal <a> links. Each crawled Document is yielded to a block. Use doc.empty? to determine if the crawled link is valid.

Use the allow and disallow paths params to partially and selectively crawl a site; the glob syntax is fully supported, e.g. 'wiki/*' etc. Note that each path must NOT start with a slash; the only exception being a / on its own with no other characters, referring to the index page.

Only redirects to the same host are followed. For example, the Url 'http://www.example.co.uk/how' has a host of 'www.example.co.uk', meaning a link which redirects to 'https://ftp.example.co.uk' or 'https://www.example.com' will not be followed. The only exception to this is the initially crawled url, which is allowed to redirect anywhere; its host is then used for other link redirections on the site, as described above.
# File 'lib/wgit/crawler.rb', line 87

def crawl_site(url, allow_paths: nil, disallow_paths: nil, &block)
  doc = crawl_url(url, &block)
  return nil if doc.nil?

  path_opts = { allow_paths: allow_paths, disallow_paths: disallow_paths }
  alt_url   = url.end_with?('/') ? url.chop : url + '/'

  crawled   = Set.new([url, alt_url])
  externals = Set.new(doc.external_links)
  internals = Set.new(get_internal_links(doc, **path_opts))

  return externals.to_a if internals.empty?

  loop do
    links = internals - crawled
    break if links.empty?

    links.each do |link|
      orig_link = link.dup
      doc = crawl_url(link, follow_redirects: :host, &block)

      crawled += [orig_link, link] # Push both links in case of redirects.
      next if doc.nil?

      internals += get_internal_links(doc, **path_opts)
      externals += doc.external_links
    end
  end

  externals.to_a
end
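A usage sketch, assuming a Wgit::Crawler instance named crawler; the site URL and paths below are hypothetical:

url = Wgit::Url.new('https://example.com')

# Crawl only the blog section, skipping its archive pages.
externals = crawler.crawl_site(url, allow_paths: 'blog/*', disallow_paths: 'blog/archive/*') do |doc|
  puts doc.url unless doc.empty? # An empty doc means the link couldn't be crawled.
end

puts externals # The external Urls found across the site, or nil if the initial crawl failed.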
#crawl_url(url, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document? Also known as: crawl_page
Crawls the url, returning the response Wgit::Document or nil if an error occurs.
# File 'lib/wgit/crawler.rb', line 155

def crawl_url(url, follow_redirects: true)
  # A String url isn't allowed because it's passed by value not reference,
  # meaning a redirect isn't reflected; a Wgit::Url is passed by reference.
  assert_type(url, Wgit::Url)

  html = fetch(url, follow_redirects: follow_redirects)
  doc  = Wgit::Document.new(url, html, encode: @encode)

  yield(doc) if block_given?

  doc.empty? ? nil : doc
end
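For example (the URL is illustrative, assuming a crawler instance):

url = Wgit::Url.new('https://example.com/about')

doc = crawler.crawl_url(url) do |d|
  puts d.title unless d.empty?
end

puts 'Crawl failed - see crawler.last_response' if doc.nil?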
#crawl_urls(*urls, follow_redirects: true) {|doc| ... } ⇒ Wgit::Document Also known as: crawl, crawl_pages
Crawls one or more individual urls using Wgit::Crawler#crawl_url underneath. See Wgit::Crawler#crawl_site for crawling entire sites.
# File 'lib/wgit/crawler.rb', line 132

def crawl_urls(*urls, follow_redirects: true, &block)
  raise 'You must provide at least one Url' if urls.empty?

  opts = { follow_redirects: follow_redirects }
  doc  = nil

  Wgit::Utils.each(urls) { |url| doc = crawl_url(url, **opts, &block) }

  doc
end
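For example, crawling several pages in one call (URLs illustrative, assuming a crawler instance):

urls = [
  Wgit::Url.new('https://example.com'),
  Wgit::Url.new('https://example.com/about')
]

crawler.crawl_urls(*urls) do |doc|
  puts "#{doc.url} - #{doc.title}" unless doc.empty?
end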
#fetch(url, follow_redirects: true) ⇒ String? (protected)
Returns the url's HTML String or nil. Handles any errors that arise and sets @last_response. Errors, or any HTTP response that doesn't return a HTML body, will be ignored, returning nil.
# File 'lib/wgit/crawler.rb', line 183

def fetch(url, follow_redirects: true)
  response = Wgit::Response.new
  raise "Invalid url: #{url}" if url.invalid?

  resolve(url, response, follow_redirects: follow_redirects)
  response.body_or_nil
rescue StandardError => e
  Wgit.logger.debug("Wgit::Crawler#fetch('#{url}') exception: #{e}")

  nil
ensure
  url.crawled        = true # Sets date_crawled underneath.
  url.crawl_duration = response.total_time

  @last_response = response
end
#get_internal_links(doc, allow_paths: nil, disallow_paths: nil) ⇒ Array<Wgit::Url> (protected)
Returns a doc's internal HTML page links in absolute form; used when crawling a site. Use the allow and disallow paths params to partially and selectively crawl a site; the glob syntax is supported, e.g. 'wiki/*' etc. Note that each path should NOT start with a slash.

Override this method in a subclass to change how a site is crawled, not what is extracted from each page (Document extensions should be used for that purpose instead). Just remember that only HTML files containing <a> links keep the crawl going beyond the base URL.
# File 'lib/wgit/crawler.rb', line 309

def get_internal_links(doc, allow_paths: nil, disallow_paths: nil)
  links = doc
          .internal_absolute_links
          .map(&:omit_fragment) # Because fragments don't alter content.
          .uniq
          .select do |link|
            ext = link.to_extension
            ext ? SUPPORTED_FILE_EXTENSIONS.include?(ext.downcase) : true
          end

  return links if allow_paths.nil? && disallow_paths.nil?

  process_paths(links, allow_paths, disallow_paths)
end
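Since the description above suggests overriding this method to change how a site is crawled, here is a hedged sketch of a subclass that additionally drops internal links carrying a query string; the filtering rule is an assumption for illustration, not part of Wgit:

class NoQueryCrawler < Wgit::Crawler
  protected

  # Reuse the default link extraction, then drop links with query strings.
  def get_internal_links(doc, allow_paths: nil, disallow_paths: nil)
    super.reject { |link| link.to_s.include?('?') } # Hypothetical filtering rule.
  end
end

Calling crawl_site on a NoQueryCrawler instance then behaves as before, except that pages reached only via query-string links are never visited.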
#get_response(url, response) ⇒ Wgit::Response (protected)
Makes a HTTP request and enriches the given Wgit::Response from it.
# File 'lib/wgit/crawler.rb', line 249

def get_response(url, response)
  # Perform a HTTP GET request.
  orig_url = url.to_s
  url      = url.normalize if url.respond_to?(:normalize)

  http_response = http_get(url)

  # Enrich the given Wgit::Response object.
  response.adapter_response = http_response
  response.url              = orig_url
  response.status           = http_response.code
  response.headers          = http_response.headers
  response.body             = http_response.body
  response.ip_address       = http_response.primary_ip
  response.add_total_time(http_response.total_time)

  # Log the request/response details.
  log_http(response)

  # Handle a failed response.
  raise "No response (within timeout: #{@time_out} second(s))" \
    if response.failure?
end
#http_get(url) ⇒ Typhoeus::Response (protected)
Performs a HTTP GET request and returns the response.
# File 'lib/wgit/crawler.rb', line 277

def http_get(url)
  opts = {
    followlocation: false,
    timeout: @time_out,
    accept_encoding: 'gzip',
    headers: {
      'User-Agent' => "wgit/#{Wgit::VERSION}",
      'Accept'     => 'text/html'
    }
  }

  # See https://rubydoc.info/gems/typhoeus for more info.
  Typhoeus.get(url, opts)
end
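Because the Typhoeus request options live in this one method, a subclass can tweak the outgoing request without touching the rest of the crawl logic. A sketch (the extra header is illustrative):

class CustomHeaderCrawler < Wgit::Crawler
  protected

  # Same options as the parent, plus one custom request header.
  def http_get(url)
    opts = {
      followlocation: false,
      timeout: @time_out,
      accept_encoding: 'gzip',
      headers: {
        'User-Agent'      => "wgit/#{Wgit::VERSION}",
        'Accept'          => 'text/html',
        'X-Custom-Header' => 'my-value' # Hypothetical extra header.
      }
    }

    Typhoeus.get(url, opts)
  end
end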
#resolve(url, response, follow_redirects: true) ⇒ Object (protected)
GETs the given url, resolving any redirects. The given response object will be enriched.
# File 'lib/wgit/crawler.rb', line 211

def resolve(url, response, follow_redirects: true)
  orig_url_base = url.to_url.to_base # Recorded before any redirects.
  follow_redirects, within = redirect?(follow_redirects)

  loop do
    get_response(url, response)
    break unless response.redirect?

    # Handle response 'Location' header.
    location = Wgit::Url.new(response.headers.fetch(:location, ''))
    raise 'Encountered redirect without Location header' if location.empty?

    yield(url, response, location) if block_given?

    # Validate if the redirect is allowed.
    raise "Redirect not allowed: #{location}" unless follow_redirects

    if within && !location.relative?(within => orig_url_base)
      raise "Redirect (outside of #{within}) is not allowed: '#{location}'"
    end

    raise "Too many redirects, exceeded: #{@redirect_limit}" \
      if response.redirect_count >= @redirect_limit

    # Process the location to be crawled next.
    location = url.to_base.concat(location) if location.relative?
    response.redirections[url.to_s] = location.to_s
    url.replace(location) # Update the url on redirect.
  end
end