Class: Wgit::Url
Overview
Class modeling a web based HTTP URL.
Can be an internal/relative link e.g. "about.html" or an absolute URL e.g. "http://www.google.co.uk". Is a subclass of String and uses 'uri' and 'addressable/uri' internally.
Most of the methods in this class return new Wgit::Url instances making the method calls chainable e.g. url.omit_base.omit_fragment etc. The methods also try to be idempotent where possible.
Constant Summary
Constants included from Assertable
Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::WRONG_METHOD_MSG
Instance Attribute Summary
-
#crawl_duration ⇒ Object
The duration of the crawl for this Url (in seconds).
-
#crawled ⇒ Object
(also: #crawled?)
Whether or not the Url has been crawled.
-
#date_crawled ⇒ Object
The Time stamp of when this Url was crawled.
Class Method Summary
-
.parse(obj) ⇒ Wgit::Url
Initialises a new Wgit::Url instance from a String or subclass of String e.g. Wgit::Url.
Instance Method Summary
-
#absolute? ⇒ Boolean
(also: #is_absolute?)
Returns true if self is an absolute Url; false if relative.
-
#concat(other) ⇒ Wgit::Url
(also: #+)
Concats self and other together before returning a new Url.
-
#fragment? ⇒ Boolean
(also: #is_fragment?)
Returns true if self is a URL fragment e.g. #top etc.
-
#initialize(url_or_obj, crawled: false, date_crawled: nil, crawl_duration: nil) ⇒ Url
constructor
Initializes a new instance of Wgit::Url which represents a web based HTTP URL.
-
#invalid? ⇒ Boolean
Returns true if self is an invalid (e.g. relative) HTTP URL.
-
#normalize ⇒ Wgit::Url
(also: #normalise)
Normalises/escapes self and returns a new Wgit::Url.
-
#omit(*components) ⇒ Wgit::Url
Omits the given URL components from self and returns a new Wgit::Url.
-
#omit_base ⇒ Wgit::Url
Returns a new Wgit::Url with the base (scheme and host) removed e.g. given http://google.com/search?q=something#about, search?q=something#about is returned.
-
#omit_fragment ⇒ Wgit::Url
Returns a new Wgit::Url with the fragment portion removed e.g. given http://google.com/search#about, http://google.com/search is returned.
-
#omit_leading_slash ⇒ Wgit::Url
Returns a new Wgit::Url containing self without a leading slash.
-
#omit_query ⇒ Wgit::Url
Returns a new Wgit::Url with the query string portion removed e.g. given http://google.com/search?q=hello, http://google.com/search is returned.
-
#omit_slashes ⇒ Wgit::Url
Returns a new Wgit::Url containing self without a leading or trailing slash.
-
#omit_trailing_slash ⇒ Wgit::Url
Returns a new Wgit::Url containing self without a trailing slash.
-
#prefix_base(doc) ⇒ Wgit::Url
Returns an absolute form of self within the context of doc.
-
#prefix_scheme(protocol: :http) ⇒ Wgit::Url
Returns a new Wgit::Url with a protocol scheme prefixed, or self if already absolute.
-
#query? ⇒ Boolean
(also: #is_query?)
Returns true if self is a URL query string e.g. ?q=hello etc.
-
#relative?(opts = {}) ⇒ Boolean
(also: #is_relative?)
Returns true if self is a relative Url; false if absolute.
-
#replace(new_url) ⇒ String
Overrides String#replace setting the new_url @uri and String value.
-
#to_addressable_uri ⇒ Addressable::URI
Returns the Addressable::URI object for this URL.
-
#to_base ⇒ Wgit::Url?
(also: #base)
Returns only the base of this URL e.g. the protocol scheme and host combined.
-
#to_brand ⇒ Wgit::Url?
(also: #brand)
Returns a new Wgit::Url containing just the brand of this URL e.g. given http://www.google.co.uk/about.html, google is returned.
-
#to_domain ⇒ Wgit::Url?
(also: #domain)
Returns a new Wgit::Url containing just the domain of this URL e.g. given http://www.google.co.uk/about.html, google.co.uk is returned.
-
#to_endpoint ⇒ Wgit::Url
(also: #endpoint)
Returns the endpoint of this URL e.g. the bit after the host with any slashes included.
-
#to_extension ⇒ Wgit::Url?
(also: #extension)
Returns a new Wgit::Url containing just the file extension of this URL e.g. given http://google.com/about.html, html is returned.
-
#to_fragment ⇒ Wgit::Url?
(also: #fragment)
Returns a new Wgit::Url containing just the fragment string of this URL e.g. given http://google.com#about, about is returned.
-
#to_h ⇒ Hash
Returns a Hash containing this Url's instance vars excluding @uri.
-
#to_host ⇒ Wgit::Url?
(also: #host)
Returns a new Wgit::Url containing just the host of this URL e.g. given http://www.google.co.uk/about.html, www.google.co.uk is returned.
-
#to_path ⇒ Wgit::Url?
(also: #path)
Returns the path of this URL e.g. the bit after the host without slashes.
-
#to_query ⇒ Wgit::Url?
(also: #query)
Returns a new Wgit::Url containing just the query string of this URL e.g. given http://google.com?q=ruby, q=ruby is returned.
-
#to_scheme ⇒ Wgit::Url?
(also: #scheme)
Returns a new Wgit::Url containing just the scheme of this URL e.g. given http://www.google.co.uk, http is returned.
-
#to_uri ⇒ URI::HTTP, URI::HTTPS
(also: #uri)
Returns a normalised URI object for this URL.
-
#to_url ⇒ Wgit::Url
(also: #url)
Returns self.
-
#valid? ⇒ Boolean
(also: #is_valid?)
Returns true if self is a valid and absolute HTTP URL.
Methods included from Assertable
#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types
Constructor Details
#initialize(url_or_obj, crawled: false, date_crawled: nil, crawl_duration: nil) ⇒ Url
Initializes a new instance of Wgit::Url which represents a web based HTTP URL.
# File 'lib/wgit/url.rb', line 45

def initialize(
  url_or_obj, crawled: false, date_crawled: nil, crawl_duration: nil
)
  # Init from a URL String.
  if url_or_obj.is_a?(String)
    url = url_or_obj.to_s
  # Else init from a Hash like object e.g. database object.
  else
    obj = url_or_obj
    assert_respond_to(obj, :fetch)

    url = obj.fetch('url') # Should always be present.
    crawled = obj.fetch('crawled', false)
    date_crawled = obj.fetch('date_crawled', nil)
    crawl_duration = obj.fetch('crawl_duration', nil)
  end

  @uri = Addressable::URI.parse(url)
  @crawled = crawled
  @date_crawled = date_crawled
  @crawl_duration = crawl_duration

  super(url)
end
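The two initialization paths above (a String, or a Hash-like object responding to fetch) can be sketched without Wgit installed. SketchUrl is a hypothetical stand-in class, not part of Wgit, and it leaves out the Addressable::URI parsing step.

```ruby
# Hypothetical stand-in for Wgit::Url; it skips the Addressable::URI parsing.
class SketchUrl < String
  attr_reader :crawled, :date_crawled, :crawl_duration

  def initialize(url_or_obj, crawled: false, date_crawled: nil, crawl_duration: nil)
    if url_or_obj.is_a?(String)
      url = url_or_obj.to_s
    else
      # Hash-like object e.g. a database record.
      obj = url_or_obj
      raise 'obj must respond to fetch' unless obj.respond_to?(:fetch)

      url            = obj.fetch('url') # Raises KeyError if missing.
      crawled        = obj.fetch('crawled', false)
      date_crawled   = obj.fetch('date_crawled', nil)
      crawl_duration = obj.fetch('crawl_duration', nil)
    end

    @crawled = crawled
    @date_crawled = date_crawled
    @crawl_duration = crawl_duration

    super(url)
  end
end

from_string = SketchUrl.new('http://example.com')
from_record = SketchUrl.new({ 'url' => 'http://example.com', 'crawled' => true })
```

Note that when a Hash-like object is given, its values take precedence over any keyword arguments passed alongside it.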
Instance Attribute Details
#crawl_duration ⇒ Object
The duration of the crawl for this Url (in seconds).
# File 'lib/wgit/url.rb', line 29

def crawl_duration
  @crawl_duration
end
#crawled ⇒ Object Also known as: crawled?
Whether or not the Url has been crawled. A custom crawled= method is provided by this class.
# File 'lib/wgit/url.rb', line 23

def crawled
  @crawled
end
#date_crawled ⇒ Object
The Time stamp of when this Url was crawled.
# File 'lib/wgit/url.rb', line 26

def date_crawled
  @date_crawled
end
Class Method Details
.parse(obj) ⇒ Wgit::Url
Initialises a new Wgit::Url instance from a String or subclass of String e.g. Wgit::Url. Any other obj type will raise an error.
If obj is already a Wgit::Url then it will be returned as is to maintain its state. Otherwise, a new Wgit::Url is instantiated and returned. This differs from Wgit::Url.new which always instantiates a new Wgit::Url.
Note: Only use this method if you are allowing obj to be either a String or a Wgit::Url whose state you want to preserve e.g. when passing a URL to a crawl method which might redirect (calling Wgit::Url#replace). If you're sure of the type or don't care about preserving the state of the Wgit::Url, use Wgit::Url.new instead.
# File 'lib/wgit/url.rb', line 86

def self.parse(obj)
  raise 'Can only parse if obj#is_a?(String)' unless obj.is_a?(String)

  # Return a Wgit::Url as is to avoid losing state e.g. date_crawled etc.
  obj.is_a?(Wgit::Url) ? obj : new(obj)
end
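The difference between parse and new comes down to object identity. A minimal sketch, assuming a hypothetical ParseSketchUrl class that stands in for Wgit::Url and keeps only the state needed to show the behaviour:

```ruby
# Hypothetical stand-in for Wgit::Url, reduced to what parse needs.
class ParseSketchUrl < String
  attr_accessor :crawled

  def self.parse(obj)
    raise 'Can only parse if obj#is_a?(String)' unless obj.is_a?(String)

    # An existing instance is returned as is, so its state survives.
    obj.is_a?(self) ? obj : new(obj)
  end
end

crawled_url = ParseSketchUrl.new('http://example.com')
crawled_url.crawled = true

same_object = ParseSketchUrl.parse(crawled_url)          # The very same instance.
fresh_copy  = ParseSketchUrl.new(crawled_url)            # A new, separate instance.
from_string = ParseSketchUrl.parse('http://example.com') # Wraps a plain String.
```

Because parse hands back the same object, a later in-place change (such as a redirect calling replace) is visible to the caller that passed the URL in.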
Instance Method Details
#absolute? ⇒ Boolean Also known as: is_absolute?
Returns true if self is an absolute Url; false if relative.
# File 'lib/wgit/url.rb', line 182

def absolute?
  @uri.absolute?
end
#concat(other) ⇒ Wgit::Url Also known as: +
Concats self and other together before returning a new Url. Self is not modified.
# File 'lib/wgit/url.rb', line 211

def concat(other)
  other = Wgit::Url.new(other)
  raise 'other must be relative' unless other.relative?

  other = other.omit_leading_slash
  separator = %w[# ? .].include?(other[0]) ? '' : '/'

  # We use to_s below to call String#+, not Wgit::Url#+ (alias for concat).
  concatted = omit_trailing_slash.to_s + separator.to_s + other.to_s

  Wgit::Url.new(concatted)
end
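The separator rule in concat can be illustrated with plain Strings. This is a sketch only; concat_sketch is a hypothetical helper, not a Wgit method, and it omits the check that other is relative.

```ruby
# Plain-String sketch of the separator rule; not Wgit's implementation.
def concat_sketch(base, other)
  other = other.delete_prefix('/')
  # No '/' is inserted before a fragment, query string or file extension.
  separator = %w[# ? .].include?(other[0]) ? '' : '/'

  base.delete_suffix('/') + separator + other
end

path_join     = concat_sketch('http://example.com/', '/about') # "http://example.com/about"
fragment_join = concat_sketch('http://example.com', '#top')    # "http://example.com#top"
query_join    = concat_sketch('http://example.com/', '?q=ruby') # "http://example.com?q=ruby"
```

Stripping one slash from each side before joining is what prevents results such as http://example.com//about.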
#fragment? ⇒ Boolean Also known as: is_fragment?
Returns true if self is a URL fragment e.g. #top etc. Note this shouldn't be used to determine if self contains a fragment.
# File 'lib/wgit/url.rb', line 510

def fragment?
  start_with?('#')
end
#invalid? ⇒ Boolean
Returns true if self is an invalid (e.g. relative) HTTP URL. See Wgit::Url#valid? for the inverse (and more information).
# File 'lib/wgit/url.rb', line 202

def invalid?
  !valid?
end
#normalize ⇒ Wgit::Url Also known as: normalise
Normalises/escapes self and returns a new Wgit::Url. Self isn't modified.
# File 'lib/wgit/url.rb', line 227

def normalize
  Wgit::Url.new(@uri.normalize.to_s)
end
#omit(*components) ⇒ Wgit::Url
Omits the given URL components from self and returns a new Wgit::Url.
Calls Addressable::URI#omit underneath and creates a new Wgit::Url from the output. See the Addressable::URI docs for more information.
# File 'lib/wgit/url.rb', line 421

def omit(*components)
  omitted = @uri.omit(*components)
  Wgit::Url.new(omitted.to_s)
end
#omit_base ⇒ Wgit::Url
Returns a new Wgit::Url with the base (scheme and host) removed e.g. given http://google.com/search?q=something#about, search?q=something#about is returned. If self is relative and has no base, nothing is removed other than the slashes. If self consists of only the base, optionally followed by a slash, self is returned as is. Otherwise, leading and trailing slashes are stripped from the return value.
# File 'lib/wgit/url.rb', line 460

def omit_base
  base_url = to_base
  omit_base = base_url ? gsub(base_url, '') : self

  return self if ['', '/'].include?(omit_base)

  Wgit::Url.new(omit_base).omit_slashes
end
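The branches of omit_base can be traced with the standard library alone. In this sketch omit_base_sketch is a hypothetical helper that works on plain Strings, uses stdlib URI in place of Addressable::URI, and uses sub where the source uses gsub.

```ruby
require 'uri'

# Stdlib sketch of the omit_base logic; not Wgit's implementation.
def omit_base_sketch(url)
  uri  = URI.parse(url)
  base = uri.scheme && uri.host ? "#{uri.scheme}://#{uri.host}" : nil
  rest = base ? url.sub(base, '') : url

  # A URL that is nothing but its base is returned untouched.
  return url if ['', '/'].include?(rest)

  rest.delete_prefix('/').delete_suffix('/')
end

with_base = omit_base_sketch('http://google.com/search?q=something#about')
base_only = omit_base_sketch('http://google.com/')
relative  = omit_base_sketch('/about/')
```

Here with_base is "search?q=something#about", base_only is returned unchanged, and relative has only its slashes stripped.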
#omit_fragment ⇒ Wgit::Url
Returns a new Wgit::Url with the fragment portion removed e.g. given http://google.com/search#about, http://google.com/search is returned. If no fragment is present, the returned Url has the same value as self. A URL consisting of only a fragment e.g. '#about' will return an empty URL. This method assumes that the fragment is correctly placed at the very end of the URL.
# File 'lib/wgit/url.rb', line 491

def omit_fragment
  fragment = to_fragment
  omit_fragment = fragment ? gsub("##{fragment}", '') : self

  Wgit::Url.new(omit_fragment)
end
#omit_leading_slash ⇒ Wgit::Url
Returns a new Wgit::Url containing self without a leading slash. Is idempotent: if there is no leading slash, self is returned as is.
# File 'lib/wgit/url.rb', line 431

def omit_leading_slash
  start_with?('/') ? Wgit::Url.new(self[1..-1]) : self
end
#omit_query ⇒ Wgit::Url
Returns a new Wgit::Url with the query string portion removed e.g. given http://google.com/search?q=hello, http://google.com/search is returned. If no query string is present, the returned Url has the same value as self. A URL consisting of only a query string e.g. '?q=hello' will return an empty URL.
# File 'lib/wgit/url.rb', line 476

def omit_query
  query = to_query
  omit_query_string = query ? gsub("?#{query}", '') : self

  Wgit::Url.new(omit_query_string)
end
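Both omit_query and omit_fragment re-add the delimiter ('?' or '#') before substituting, because the parsed query and fragment values do not include it. A sketch with stdlib URI standing in for Addressable::URI (an assumption made so the example runs without extra gems):

```ruby
require 'uri'

uri = URI.parse('http://google.com/search?q=hello#about')

query    = uri.query    # "q=hello", no leading '?'
fragment = uri.fragment # "about", no leading '#'

# The delimiter is put back so that it is removed along with the value.
without_fragment = uri.to_s.sub("##{fragment}", '')
without_query    = without_fragment.sub("?#{query}", '')
```

After both substitutions only http://google.com/search remains.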
#omit_slashes ⇒ Wgit::Url
Returns a new Wgit::Url containing self without a leading or trailing slash. Is idempotent: if there are no leading or trailing slashes, self is returned as is.
# File 'lib/wgit/url.rb', line 449

def omit_slashes
  omit_leading_slash
    .omit_trailing_slash
end
#omit_trailing_slash ⇒ Wgit::Url
Returns a new Wgit::Url containing self without a trailing slash. Is idempotent: if there is no trailing slash, self is returned as is.
# File 'lib/wgit/url.rb', line 440

def omit_trailing_slash
  end_with?('/') ? Wgit::Url.new(chop) : self
end
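Idempotence of the slash-omitting methods means that a second call changes nothing. A plain-String sketch, where omit_slashes_sketch is a hypothetical helper rather than a Wgit method:

```ruby
# Plain-String sketch of omit_leading_slash followed by omit_trailing_slash.
def omit_slashes_sketch(str)
  str = str[1..-1] if str.start_with?('/')
  str = str.chop if str.end_with?('/')
  str
end

once      = omit_slashes_sketch('/about/') # "about"
twice     = omit_slashes_sketch(once)      # Still "about".
untouched = 'contact'
same      = omit_slashes_sketch(untouched) # The same object, nothing to strip.
```

Only one slash is removed from each end, so '//about' would become '/about' on the first call.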
#prefix_base(doc) ⇒ Wgit::Url
Returns an absolute form of self within the context of doc. Doesn't modify the receiver.
If self is absolute then it's returned as is, making this method idempotent. Otherwise, the doc's base URL (as returned by doc.base_url) is concatted with self.
Typically used to build an absolute link obtained from a document.
# File 'lib/wgit/url.rb', line 251

def prefix_base(doc)
  assert_type(doc, Wgit::Document)

  absolute? ? self : doc.base_url(link: self).concat(self)
end
#prefix_scheme(protocol: :http) ⇒ Wgit::Url
Returns a new Wgit::Url with the protocol scheme prefixed. Doesn't modify the receiver. If self is already absolute (has a scheme), self is returned as is; the method is therefore idempotent.
# File 'lib/wgit/url.rb', line 262

def prefix_scheme(protocol: :http)
  return self if absolute?

  case protocol
  when :http
    Wgit::Url.new("http://#{url}")
  when :https
    Wgit::Url.new("https://#{url}")
  else
    raise "protocol must be :http or :https, not :#{protocol}"
  end
end
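The early return for absolute URLs is what makes prefix_scheme safe to call repeatedly. A sketch on plain Strings, assuming stdlib URI for the absolute check; prefix_scheme_sketch is a hypothetical helper, not a Wgit method.

```ruby
require 'uri'

# Stdlib sketch of prefix_scheme; not Wgit's implementation.
def prefix_scheme_sketch(url, protocol: :http)
  # An absolute URL already has a scheme, so it is returned untouched.
  return url if URI.parse(url).absolute?

  unless %i[http https].include?(protocol)
    raise ArgumentError, "protocol must be :http or :https, not :#{protocol}"
  end

  "#{protocol}://#{url}"
end

http_url  = prefix_scheme_sketch('example.com/about')
https_url = prefix_scheme_sketch('example.com/about', protocol: :https)
unchanged = prefix_scheme_sketch(http_url, protocol: :https)
```

The last call returns http_url as is: the existing http scheme is not swapped for https.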
#query? ⇒ Boolean Also known as: is_query?
Returns true if self is a URL query string e.g. ?q=hello etc. Note this shouldn't be used to determine if self contains a query.
# File 'lib/wgit/url.rb', line 502

def query?
  start_with?('?')
end
#relative?(opts = {}) ⇒ Boolean Also known as: is_relative?
Returns true if self is a relative Url; false if absolute.
An absolute URL must have a scheme prefix e.g. 'http://', otherwise the URL is regarded as being relative (regardless of whether it's valid or not). The only exception is if an opts arg is provided and self is a page belonging to that arg type e.g. host; then the link is relative.
# File 'lib/wgit/url.rb', line 145

def relative?(opts = {})
  defaults = { base: nil, host: nil, domain: nil, brand: nil }
  opts = defaults.merge(opts)
  raise 'Url (self) cannot be empty' if empty?

  return true if @uri.relative?

  # Self is absolute but may be relative to the opts param e.g. host.
  opts.select! { |_k, v| v }
  raise "Provide only one of: #{defaults.keys}" if opts.length > 1

  return false if opts.empty?

  type, url = opts.first
  url = Wgit::Url.new(url)
  if url.invalid?
    raise "Invalid opts param value, it must be absolute, containing a \
protocol scheme and domain (e.g. http://example.com): #{url}"
  end

  case type
  when :base   # http://www.google.com
    to_base == url.to_base
  when :host   # www.google.com
    to_host == url.to_host
  when :domain # google.com
    to_domain == url.to_domain
  when :brand  # google
    to_brand == url.to_brand
  else
    raise "Unknown opts param: :#{type}, use one of: #{defaults.keys}"
  end
end
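The opts exception can be seen by comparing hosts directly. This sketch covers only the host: case and uses stdlib URI in place of Addressable::URI; relative_to_host? is a hypothetical helper, not a Wgit method.

```ruby
require 'uri'

# Stdlib sketch of relative?(host: ...); not Wgit's implementation.
def relative_to_host?(link, host_url)
  uri = URI.parse(link)
  # A link with no scheme is relative regardless of the host given.
  return true if uri.relative?

  # An absolute link counts as relative when it lives on the same host.
  uri.host == URI.parse(host_url).host
end

no_scheme  = relative_to_host?('about.html', 'http://example.com')
same_host  = relative_to_host?('http://example.com/about', 'http://example.com')
other_host = relative_to_host?('http://other.com/about', 'http://example.com')
```

Here no_scheme and same_host are true while other_host is false, which is the distinction a crawler needs when deciding whether a link stays on the current site.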
#replace(new_url) ⇒ String
Overrides String#replace setting the new_url @uri and String value.
# File 'lib/wgit/url.rb', line 109

def replace(new_url)
  @uri = Addressable::URI.parse(new_url)

  super(new_url)
end
#to_addressable_uri ⇒ Addressable::URI
Returns the Addressable::URI object for this URL.
# File 'lib/wgit/url.rb', line 295

def to_addressable_uri
  @uri
end
#to_base ⇒ Wgit::Url? Also known as: base
Returns only the base of this URL e.g. the protocol scheme and host combined.
# File 'lib/wgit/url.rb', line 347

def to_base
  return nil if @uri.scheme.nil? || @uri.host.nil?

  base = "#{@uri.scheme}://#{@uri.host}"
  Wgit::Url.new(base)
end
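Since the base is built from the scheme and host only, anything else in the authority (such as a port) is dropped, and a URL missing either part yields nil. A sketch with stdlib URI standing in for Addressable::URI; base_sketch is a hypothetical helper.

```ruby
require 'uri'

# Stdlib sketch of to_base; not Wgit's implementation.
def base_sketch(url)
  uri = URI.parse(url)
  return nil if uri.scheme.nil? || uri.host.nil?

  "#{uri.scheme}://#{uri.host}"
end

absolute_base = base_sketch('http://www.google.co.uk/about.html')
port_dropped  = base_sketch('http://localhost:8080/about')
relative_base = base_sketch('about.html')
```

The three results are "http://www.google.co.uk", "http://localhost" and nil respectively.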
#to_brand ⇒ Wgit::Url? Also known as: brand
Returns a new Wgit::Url containing just the brand of this URL e.g. Given http://www.google.co.uk/about.html, google is returned.
# File 'lib/wgit/url.rb', line 337

def to_brand
  domain = to_domain
  domain ? Wgit::Url.new(domain.split('.').first) : nil
end
#to_domain ⇒ Wgit::Url? Also known as: domain
Returns a new Wgit::Url containing just the domain of this URL e.g. Given http://www.google.co.uk/about.html, google.co.uk is returned.
# File 'lib/wgit/url.rb', line 328

def to_domain
  domain = @uri.domain
  domain ? Wgit::Url.new(domain) : nil
end
#to_endpoint ⇒ Wgit::Url Also known as: endpoint
Returns the endpoint of this URL e.g. the bit after the host with any slashes included. For example: Wgit::Url.new("http://www.google.co.uk/about.html/").to_endpoint returns "/about.html/". See Wgit::Url#to_path if you don't want the slashes.
# File 'lib/wgit/url.rb', line 375

def to_endpoint
  endpoint = @uri.path
  endpoint = '/' + endpoint unless endpoint.start_with?('/')
  Wgit::Url.new(endpoint)
end
#to_extension ⇒ Wgit::Url? Also known as: extension
Returns a new Wgit::Url containing just the file extension of this URL e.g. given http://google.com/about.html, html is returned. The extension is taken from the path (see Wgit::Url#to_path), so nil is returned if the path is empty or contains no '.'.
# File 'lib/wgit/url.rb', line 403

def to_extension
  path = to_path
  return nil unless path

  segs = path.split('.')
  segs.length > 1 ? Wgit::Url.new(segs.last) : nil
end
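Because the extension is derived from the path, text after a '#' does not count. A sketch with stdlib URI standing in for Addressable::URI; extension_sketch is a hypothetical helper, not a Wgit method.

```ruby
require 'uri'

# Stdlib sketch of to_extension; not Wgit's implementation.
def extension_sketch(url)
  path = URI.parse(url).path
  return nil if path.nil? || path.empty?

  segs = path.split('.')
  segs.length > 1 ? segs.last : nil
end

from_path     = extension_sketch('http://google.com/about.html') # "html"
from_fragment = extension_sketch('http://google.com#about.html') # nil, the path is empty
no_extension  = extension_sketch('http://google.com/about')      # nil
```

Taking the last segment also means that a path such as /archive.tar.gz yields gz rather than tar.gz.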
#to_fragment ⇒ Wgit::Url? Also known as: fragment
Returns a new Wgit::Url containing just the fragment string of this URL e.g. given http://google.com#about, about is returned (without the leading '#').
# File 'lib/wgit/url.rb', line 394

def to_fragment
  fragment = @uri.fragment
  fragment ? Wgit::Url.new(fragment) : nil
end
#to_h ⇒ Hash
Returns a Hash containing this Url's instance vars excluding @uri. Used when storing the URL in a Database e.g. MongoDB etc.
# File 'lib/wgit/url.rb', line 279

def to_h
  ignore = ['@uri']
  h = Wgit::Utils.to_h(self, ignore: ignore)
  Hash[h.to_a.insert(0, ['url', self])] # Insert url at position 0.
end
#to_host ⇒ Wgit::Url? Also known as: host
Returns a new Wgit::Url containing just the host of this URL e.g. Given http://www.google.co.uk/about.html, www.google.co.uk is returned.
# File 'lib/wgit/url.rb', line 319

def to_host
  host = @uri.host
  host ? Wgit::Url.new(host) : nil
end
#to_path ⇒ Wgit::Url? Also known as: path
Returns the path of this URL e.g. the bit after the host without slashes. For example: Wgit::Url.new("http://www.google.co.uk/about.html/").to_path returns "about.html". See Wgit::Url#to_endpoint if you want the slashes.
# File 'lib/wgit/url.rb', line 360

def to_path
  path = @uri.path
  return nil if path.nil? || path.empty?
  return Wgit::Url.new('/') if path == '/'

  Wgit::Url.new(path).omit_slashes
end
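The difference between to_endpoint and to_path is only whether the surrounding slashes are kept. A sketch using stdlib URI in place of Addressable::URI, working on plain Strings:

```ruby
require 'uri'

raw_path = URI.parse('http://www.google.co.uk/about.html/').path

# Endpoint: the raw path, guaranteed to start with a slash.
endpoint = raw_path.start_with?('/') ? raw_path : "/#{raw_path}"

# Path: the same value with one leading and one trailing slash removed.
path = raw_path.delete_prefix('/').delete_suffix('/')
```

This gives "/about.html/" for endpoint and "about.html" for path, matching the examples given for the two methods.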
#to_query ⇒ Wgit::Url? Also known as: query
Returns a new Wgit::Url containing just the query string of this URL e.g. given http://google.com?q=ruby, q=ruby is returned (without the leading '?').
# File 'lib/wgit/url.rb', line 385

def to_query
  query = @uri.query
  query ? Wgit::Url.new(query) : nil
end
#to_scheme ⇒ Wgit::Url? Also known as: scheme
Returns a new Wgit::Url containing just the scheme of this URL e.g. Given http://www.google.co.uk, http is returned.
# File 'lib/wgit/url.rb', line 310

def to_scheme
  scheme = @uri.scheme
  scheme ? Wgit::Url.new(scheme) : nil
end
#to_uri ⇒ URI::HTTP, URI::HTTPS Also known as: uri
Returns a normalised URI object for this URL.
# File 'lib/wgit/url.rb', line 288

def to_uri
  URI(normalize)
end
#to_url ⇒ Wgit::Url Also known as: url
Returns self.
# File 'lib/wgit/url.rb', line 302

def to_url
  self
end
#valid? ⇒ Boolean Also known as: is_valid?
Returns true if self is a valid and absolute HTTP URL. Self should always be crawlable if this method returns true.
# File 'lib/wgit/url.rb', line 190

def valid?
  return false if relative?
  return false unless to_base && to_domain
  return false if URI::DEFAULT_PARSER.make_regexp.match(normalize).nil?

  true
end
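A rough approximation of the "valid and absolute HTTP URL" check can be written with the standard library. Note this is a different, simpler technique than the one above: it relies on URI.parse returning a URI::HTTP (or URI::HTTPS) object instead of the domain check and regexp match, and crawlable_sketch? is a hypothetical helper.

```ruby
require 'uri'

# Simplified stdlib check; not equivalent to Wgit::Url#valid?.
def crawlable_sketch?(url)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so both schemes pass.
  uri.is_a?(URI::HTTP) && !uri.host.nil? && !uri.host.empty?
rescue URI::InvalidURIError
  false
end

absolute_ok  = crawlable_sketch?('https://example.com/about')
relative_bad = crawlable_sketch?('about.html')
scheme_bad   = crawlable_sketch?('ftp://example.com/file.txt')
garbled_bad  = crawlable_sketch?('http://exa mple.com')
```

Only absolute_ok is true; the relative link, the non-HTTP scheme and the malformed URL are all rejected.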