Class: Wgit::Url

Inherits:
String
Includes:
Assertable
Defined in:
lib/wgit/url.rb

Overview

Class modeling a web based HTTP URL.

A Url can be a relative (internal) link e.g. "about.html", or an absolute URL e.g. "http://www.google.co.uk". Wgit::Url is a subclass of String and uses 'uri' and 'addressable/uri' internally.

Most of the methods in this class return new Wgit::Url instances, making the method calls chainable e.g. url.omit_base.omit_fragment. The methods also try to be idempotent where possible.
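The chaining described above can be illustrated with a minimal sketch. ChainableUrl is a hypothetical stand-in written for this example (it is not Wgit::Url and needs no gems); it only borrows the slash-stripping logic shown later on this page.

```ruby
# Hypothetical stand-in for Wgit::Url, for illustration only.
class ChainableUrl < String
  # Returns a new instance without a leading slash, or self if there is none.
  def omit_leading_slash
    start_with?('/') ? ChainableUrl.new(self[1..-1]) : self
  end

  # Returns a new instance without a trailing slash, or self if there is none.
  def omit_trailing_slash
    end_with?('/') ? ChainableUrl.new(chop) : self
  end
end

url    = ChainableUrl.new('/about/')
result = url.omit_leading_slash.omit_trailing_slash

puts result       # prints "about"
puts result.class # prints "ChainableUrl"
puts url          # prints "/about/" (the receiver is unchanged)
```

Because each call returns an instance of the same class, further calls can be appended; calling a method again on an already stripped value simply returns that value.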

Constant Summary

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::WRONG_METHOD_MSG

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from Assertable

#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(url_or_obj, crawled: false, date_crawled: nil, crawl_duration: nil) ⇒ Url

Initializes a new instance of Wgit::Url which represents a web based HTTP URL.

Parameters:

  • url_or_obj: Either a String based URL or an object representing a Database record e.g. a MongoDB document/object.

  • crawled (defaults to: false)

    Whether or not the HTML of the URL's web page has been crawled. Only used if url_or_obj is a String.

  • date_crawled (defaults to: nil)

    Should only be provided if crawled is true. A suitable object can be returned from Wgit::Utils.time_stamp. Only used if url_or_obj is a String.

  • crawl_duration (defaults to: nil)

    Should only be provided if crawled is true. The duration of the crawl for this Url (in seconds).

Raises:

  • If url_or_obj isn't a String and is missing required methods (it must respond to #fetch).



# File 'lib/wgit/url.rb', line 45

def initialize(
  url_or_obj, crawled: false, date_crawled: nil, crawl_duration: nil
)
  # Init from a URL String.
  if url_or_obj.is_a?(String)
    url = url_or_obj.to_s
  # Else init from a Hash like object e.g. database object.
  else
    obj = url_or_obj
    assert_respond_to(obj, :fetch)

    url            = obj.fetch('url') # Should always be present.
    crawled        = obj.fetch('crawled', false)
    date_crawled   = obj.fetch('date_crawled', nil)
    crawl_duration = obj.fetch('crawl_duration', nil)
  end

  @uri            = Addressable::URI.parse(url)
  @crawled        = crawled
  @date_crawled   = date_crawled
  @crawl_duration = crawl_duration

  super(url)
end
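As a rough illustration of the two initialisation paths, here is a self-contained sketch. RecordUrl is a hypothetical class made up for this example; it omits the Addressable::URI parsing and uses a plain respond_to? check in place of assert_respond_to.

```ruby
# Hypothetical stand-in for Wgit::Url's constructor logic, for illustration only.
class RecordUrl < String
  attr_reader :crawled, :date_crawled, :crawl_duration

  def initialize(url_or_obj, crawled: false, date_crawled: nil, crawl_duration: nil)
    if url_or_obj.is_a?(String)
      # Init from a URL String; the keyword arguments are used as given.
      url = url_or_obj.to_s
    else
      # Init from a Hash like object; its values replace the keyword arguments.
      obj = url_or_obj
      raise ArgumentError, 'expected a String or an object responding to #fetch' unless obj.respond_to?(:fetch)

      url            = obj.fetch('url') # Raises KeyError if absent.
      crawled        = obj.fetch('crawled', false)
      date_crawled   = obj.fetch('date_crawled', nil)
      crawl_duration = obj.fetch('crawl_duration', nil)
    end

    @crawled        = crawled
    @date_crawled   = date_crawled
    @crawl_duration = crawl_duration

    super(url)
  end
end

from_string = RecordUrl.new('http://example.com', crawled: true, crawl_duration: 0.42)
record      = { 'url' => 'http://example.com/about', 'crawled' => true }
from_record = RecordUrl.new(record)

puts from_string.crawl_duration # prints 0.42
puts from_record                # prints "http://example.com/about"
puts from_record.crawled        # prints true
```

The record here is a plain Hash with String keys, which is an assumption about what a database document looks like.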

Instance Attribute Details

#crawl_duration ⇒ Object

The duration of the crawl for this Url (in seconds).



# File 'lib/wgit/url.rb', line 29

def crawl_duration
  @crawl_duration
end

#crawled ⇒ Object Also known as: crawled?

Whether or not the Url has been crawled. A custom crawled= method is provided by this class.



# File 'lib/wgit/url.rb', line 23

def crawled
  @crawled
end

#date_crawled ⇒ Object

The timestamp of when this Url was crawled.



# File 'lib/wgit/url.rb', line 26

def date_crawled
  @date_crawled
end

Class Method Details

.parse(obj) ⇒ Wgit::Url

Initialises a new Wgit::Url instance from a String or subclass of String e.g. Wgit::Url. Any other obj type will raise an error.

If obj is already a Wgit::Url then it will be returned as is to maintain its state. Otherwise, a new Wgit::Url is instantiated and returned. This differs from Wgit::Url.new which always instantiates a new Wgit::Url.

Note: Only use this method if you are allowing obj to be either a String or a Wgit::Url whose state you want to preserve e.g. when passing a URL to a crawl method which might redirect (calling Wgit::Url#replace). If you're sure of the type or don't care about preserving the state of the Wgit::Url, use Wgit::Url.new instead.

Parameters:

  • obj: The object to parse, which must satisfy #is_a?(String).

Returns:

  • A Wgit::Url instance.

Raises:

  • If obj.is_a?(String) is false.



# File 'lib/wgit/url.rb', line 86

def self.parse(obj)
  raise 'Can only parse if obj#is_a?(String)' unless obj.is_a?(String)

  # Return a Wgit::Url as is to avoid losing state e.g. date_crawled etc.
  obj.is_a?(Wgit::Url) ? obj : new(obj)
end
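The difference between parse and new can be sketched as follows. StatefulUrl is a hypothetical class for this example only; its single crawled attribute stands in for the state that a real Wgit::Url carries.

```ruby
# Hypothetical stand-in showing why parse preserves state and new doesn't.
class StatefulUrl < String
  attr_accessor :crawled

  def self.parse(obj)
    raise 'Can only parse if obj#is_a?(String)' unless obj.is_a?(String)

    # An existing StatefulUrl is returned as is, keeping its state.
    obj.is_a?(StatefulUrl) ? obj : new(obj)
  end
end

existing = StatefulUrl.new('http://example.com')
existing.crawled = true

same  = StatefulUrl.parse(existing)             # The very same object.
fresh = StatefulUrl.parse('http://example.com') # A new object from a String.
copy  = StatefulUrl.new(existing)               # A new object; state isn't carried over.

puts same.equal?(existing) # prints true
puts same.crawled          # prints true
puts copy.crawled.inspect  # prints nil
```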

Instance Method Details

#absolute? ⇒ Boolean Also known as: is_absolute?

Returns true if self is an absolute Url; false if relative.

Returns:

  • True if absolute, false if relative.



# File 'lib/wgit/url.rb', line 182

def absolute?
  @uri.absolute?
end

#concat(other) ⇒ Wgit::Url Also known as: +

Concatenates self and other, returning a new Url. Self is not modified.

Parameters:

  • other: The other Url to concatenate to the end of self; it must be relative.

Returns:

  • self + separator + other, separator depends on other.



# File 'lib/wgit/url.rb', line 211

def concat(other)
  other = Wgit::Url.new(other)
  raise 'other must be relative' unless other.relative?

  other = other.omit_leading_slash
  separator = %w[# ? .].include?(other[0]) ? '' : '/'

  # We use to_s below to call String#+, not Wgit::Url#+ (alias for concat).
  concatted = omit_trailing_slash.to_s + separator.to_s + other.to_s

  Wgit::Url.new(concatted)
end
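The separator rule used by concat can be tried in isolation. The join_url helper below is a hypothetical plain-String version written for this example; it skips the relative check and the Wgit::Url wrapping.

```ruby
# Hypothetical helper mirroring concat's separator rule on plain Strings.
def join_url(base, other)
  other = other.sub(%r{\A/}, '') # Drop a single leading slash from other.
  # No separator is needed before a fragment, query string or extension.
  separator = %w[# ? .].include?(other[0]) ? '' : '/'

  base.chomp('/') + separator + other
end

puts join_url('http://example.com/', '/about')      # prints "http://example.com/about"
puts join_url('http://example.com', '?q=ruby')      # prints "http://example.com?q=ruby"
puts join_url('http://example.com/page', '#top')    # prints "http://example.com/page#top"
puts join_url('http://example.com/about', '.html')  # prints "http://example.com/about.html"
```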

#fragment? ⇒ Boolean Also known as: is_fragment?

Returns true if self is a URL fragment e.g. #top etc. Note this shouldn't be used to determine if self contains a fragment.

Returns:

  • True if self is a fragment, false otherwise.



# File 'lib/wgit/url.rb', line 510

def fragment?
  start_with?('#')
end

#invalid? ⇒ Boolean

Returns true if self is an invalid (e.g. relative) HTTP URL. See Wgit::Url#valid? for the inverse (and more information).

Returns:

  • True if invalid, otherwise false.



# File 'lib/wgit/url.rb', line 202

def invalid?
  !valid?
end

#normalize ⇒ Wgit::Url Also known as: normalise

Normalises/escapes self and returns a new Wgit::Url. Self isn't modified.

Returns:

  • An escaped version of self.



# File 'lib/wgit/url.rb', line 227

def normalize
  Wgit::Url.new(@uri.normalize.to_s)
end

#omit(*components) ⇒ Wgit::Url

Omits the given URL components from self and returns a new Wgit::Url.

Calls Addressable::URI#omit underneath and creates a new Wgit::Url from the output. See the Addressable::URI docs for more information.

Parameters:

  • One or more Symbols representing the URL components to omit. The following components are supported: :scheme, :user, :password, :userinfo, :host, :port, :authority, :path, :query, :fragment.

Returns:

  • Self's URL value with the given components omitted.



# File 'lib/wgit/url.rb', line 421

def omit(*components)
  omitted = @uri.omit(*components)
  Wgit::Url.new(omitted.to_s)
end

#omit_base ⇒ Wgit::Url

Returns a new Wgit::Url with the base (scheme and host) removed e.g. given http://google.com/search?q=something#about, search?q=something#about is returned. If self is relative and no base is present then self is returned. Leading and trailing slashes are always stripped from the return value.

Returns:

  • Self containing everything after the base.



# File 'lib/wgit/url.rb', line 460

def omit_base
  base_url = to_base
  omit_base = base_url ? gsub(base_url, '') : self

  return self if ['', '/'].include?(omit_base)

  Wgit::Url.new(omit_base).omit_slashes
end

#omit_fragment ⇒ Wgit::Url

Returns a new Wgit::Url with the fragment portion removed e.g. given http://google.com/search#about, http://google.com/search is returned. Self is returned as is if no fragment is present. A URL consisting of only a fragment e.g. '#about' will return an empty URL. This method assumes that the fragment is correctly placed at the very end of the URL.

Returns:

  • Self with the fragment portion removed.



# File 'lib/wgit/url.rb', line 491

def omit_fragment
  fragment = to_fragment
  omit_fragment = fragment ? gsub("##{fragment}", '') : self

  Wgit::Url.new(omit_fragment)
end
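The fragment removal can be approximated with Ruby's standard library alone. The strip_fragment helper below is a hypothetical sketch: it uses the stdlib URI class instead of Addressable::URI, and String#delete_suffix instead of gsub, which relies on the same assumption that the fragment sits at the very end of the URL.

```ruby
require 'uri'

# Hypothetical helper approximating omit_fragment on plain Strings.
def strip_fragment(url)
  fragment = URI(url).fragment # nil when no fragment is present.
  fragment ? url.delete_suffix("##{fragment}") : url
end

puts strip_fragment('http://google.com/search#about') # prints "http://google.com/search"
puts strip_fragment('http://google.com/search')       # prints "http://google.com/search"
puts strip_fragment('#about').inspect                 # prints ""
```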

#omit_leading_slash ⇒ Wgit::Url

Returns a new Wgit::Url containing self without a leading slash. Is idempotent, meaning self is returned as is when there's no leading slash.

Returns:

  • Self without a leading slash.



# File 'lib/wgit/url.rb', line 431

def omit_leading_slash
  start_with?('/') ? Wgit::Url.new(self[1..-1]) : self
end

#omit_query ⇒ Wgit::Url

Returns a new Wgit::Url with the query string portion removed e.g. given http://google.com/search?q=hello, http://google.com/search is returned. Self is returned as is if no query string is present. A URL consisting of only a query string e.g. '?q=hello' will return an empty URL.

Returns:

  • Self with the query string portion removed.



# File 'lib/wgit/url.rb', line 476

def omit_query
  query = to_query
  omit_query_string = query ? gsub("?#{query}", '') : self

  Wgit::Url.new(omit_query_string)
end

#omit_slashes ⇒ Wgit::Url

Returns a new Wgit::Url containing self without a leading or trailing slash. Is idempotent: if no leading or trailing slash is present, self is returned as is.

Returns:

  • Self without leading or trailing slashes.



# File 'lib/wgit/url.rb', line 449

def omit_slashes
  omit_leading_slash
    .omit_trailing_slash
end

#omit_trailing_slash ⇒ Wgit::Url

Returns a new Wgit::Url containing self without a trailing slash. Is idempotent, meaning self is returned as is when there's no trailing slash.

Returns:

  • Self without a trailing slash.



# File 'lib/wgit/url.rb', line 440

def omit_trailing_slash
  end_with?('/') ? Wgit::Url.new(chop) : self
end

#prefix_base(doc) ⇒ Wgit::Url

Returns an absolute form of self within the context of doc. Doesn't modify the receiver.

If self is absolute then it's returned as is, making this method idempotent. The doc's <base> element is used if present; otherwise doc.url is used as the base, which is then concatenated with self.

Typically used to build an absolute link obtained from a document.

Examples:

link = Wgit::Url.new('/favicon.png')
doc  = Wgit::Document.new('http://example.com')

link.prefix_base(doc) # => "http://example.com/favicon.png"

Parameters:

  • doc: The doc whose base Url is concatenated with self.

Returns:

  • Self in absolute form.

Raises:

  • If doc isn't a Wgit::Document or if doc.base_url raises an Exception.



# File 'lib/wgit/url.rb', line 251

def prefix_base(doc)
  assert_type(doc, Wgit::Document)

  absolute? ? self : doc.base_url(link: self).concat(self)
end

#prefix_scheme(protocol: :http) ⇒ Wgit::Url

Returns a new Url with a protocol scheme prefixed; the receiver isn't modified. If self is already absolute (has a scheme) then self is returned as is, making this method idempotent.

Parameters:

  • protocol (defaults to: :http)

    Either :http or :https.

Returns:

  • Self with a protocol scheme prefix.



# File 'lib/wgit/url.rb', line 262

def prefix_scheme(protocol: :http)
  return self if absolute?

  case protocol
  when :http
    Wgit::Url.new("http://#{url}")
  when :https
    Wgit::Url.new("https://#{url}")
  else
    raise "protocol must be :http or :https, not :#{protocol}"
  end
end
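A standalone sketch of the same branching is shown below. The add_scheme helper is hypothetical and works on plain Strings, using the stdlib URI class for the absolute check rather than Addressable::URI.

```ruby
require 'uri'

# Hypothetical helper mirroring prefix_scheme on plain Strings.
def add_scheme(url, protocol: :http)
  return url if URI(url).absolute? # Already has a scheme; nothing to do.

  case protocol
  when :http  then "http://#{url}"
  when :https then "https://#{url}"
  else raise ArgumentError, "protocol must be :http or :https, not :#{protocol}"
  end
end

puts add_scheme('example.com')                    # prints "http://example.com"
puts add_scheme('example.com', protocol: :https)  # prints "https://example.com"
puts add_scheme('https://example.com')            # prints "https://example.com"
```

As in the listing above, the protocol is only validated when a prefix is actually needed, so an absolute URL is returned unchanged whatever protocol is passed.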

#query? ⇒ Boolean Also known as: is_query?

Returns true if self is a URL query string e.g. ?q=hello etc. Note this shouldn't be used to determine if self contains a query.

Returns:

  • True if self is a query string, false otherwise.



# File 'lib/wgit/url.rb', line 502

def query?
  start_with?('?')
end

#relative?(opts = {}) ⇒ Boolean Also known as: is_relative?

Returns true if self is a relative Url; false if absolute.

An absolute URL must have a scheme prefix e.g. 'http://', otherwise the URL is regarded as being relative (regardless of whether it's valid or not). The only exception is if an opts arg is provided and self is a page belonging to that arg type e.g. host; then the link is relative.

Examples:

url = Wgit::Url.new('http://example.com/about')

url.relative? # => false
url.relative?(host: 'http://example.com') # => true

Parameters:

  • opts (defaults to: {})

    The options with which to check relativity. Only one opts param should be provided. The provided opts param Url must be absolute and be prefixed with a scheme. Consider using the output of Wgit::Url#to_base which should work (unless it's nil).

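Options Hash (opts):

  • :base: A Url whose base is compared with self's base e.g. http://www.google.com.
  • :host: A Url whose host is compared with self's host e.g. www.google.com.
  • :domain: A Url whose domain is compared with self's domain e.g. google.com.
  • :brand: A Url whose brand is compared with self's brand e.g. google.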

Returns:

  • True if relative, false if absolute.

Raises:

  • If self is invalid (e.g. empty) or an invalid opts param has been provided.



# File 'lib/wgit/url.rb', line 145

def relative?(opts = {})
  defaults = { base: nil, host: nil, domain: nil, brand: nil }
  opts = defaults.merge(opts)
  raise 'Url (self) cannot be empty' if empty?

  return true if @uri.relative?

  # Self is absolute but may be relative to the opts param e.g. host.
  opts.select! { |_k, v| v }
  raise "Provide only one of: #{defaults.keys}" if opts.length > 1

  return false if opts.empty?

  type, url = opts.first
  url = Wgit::Url.new(url)
  if url.invalid?
    raise "Invalid opts param value, it must be absolute, containing a \
protocol scheme and domain (e.g. http://example.com): #{url}"
  end

  case type
  when :base   # http://www.google.com
    to_base   == url.to_base
  when :host   # www.google.com
    to_host   == url.to_host
  when :domain # google.com
    to_domain == url.to_domain
  when :brand  # google
    to_brand  == url.to_brand
  else
    raise "Unknown opts param: :#{type}, use one of: #{defaults.keys}"
  end
end
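To make the opts behaviour concrete, here is a reduced sketch covering only the host comparison. The relative_to_host? helper is hypothetical, takes the host Url as a plain optional argument, and uses the stdlib URI class in place of Addressable::URI.

```ruby
require 'uri'

# Hypothetical helper: relative if there's no scheme, or if the hosts match.
def relative_to_host?(url, host_url = nil)
  raise ArgumentError, 'url cannot be empty' if url.empty?

  uri = URI(url)
  return true if uri.relative?  # No scheme, so always relative.
  return false if host_url.nil? # Absolute and nothing to compare against.

  uri.host == URI(host_url).host
end

puts relative_to_host?('about.html')                                      # prints true
puts relative_to_host?('http://example.com/about')                        # prints false
puts relative_to_host?('http://example.com/about', 'http://example.com')  # prints true
puts relative_to_host?('http://example.com/about', 'http://other.com')    # prints false
```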

#replace(new_url) ⇒ String

Overrides String#replace, setting both @uri and the String value from new_url.

Parameters:

  • new_url: The new URL value.

Returns:

  • The new URL value once set.



# File 'lib/wgit/url.rb', line 109

def replace(new_url)
  @uri = Addressable::URI.parse(new_url)

  super(new_url)
end
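The reason for overriding replace is that the parsed URI would otherwise go stale when the String value changes in place. The sketch below shows the idea with a hypothetical SyncedUrl class that parses with the stdlib URI class instead of Addressable::URI.

```ruby
require 'uri'

# Hypothetical stand-in that keeps a parsed URI in step with the String value.
class SyncedUrl < String
  attr_reader :uri

  def initialize(url)
    @uri = URI.parse(url)
    super(url)
  end

  # Re-parse on replace so @uri always reflects the current String value.
  def replace(new_url)
    @uri = URI.parse(new_url)
    super(new_url)
  end
end

url = SyncedUrl.new('http://example.com')
url.replace('https://example.com/home')

puts url            # prints "https://example.com/home"
puts url.uri.scheme # prints "https"
puts url.uri.path   # prints "/home"
```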

#to_addressable_uri ⇒ Addressable::URI

Returns the Addressable::URI object for this URL.

Returns:

  • The Addressable::URI object of self.



# File 'lib/wgit/url.rb', line 295

def to_addressable_uri
  @uri
end

#to_base ⇒ Wgit::Url? Also known as: base

Returns only the base of this URL e.g. the protocol scheme and host combined.

Returns:

  • The base of self e.g. http://www.google.co.uk, or nil if the scheme or host is missing.



# File 'lib/wgit/url.rb', line 347

def to_base
  return nil if @uri.scheme.nil? || @uri.host.nil?

  base = "#{@uri.scheme}://#{@uri.host}"
  Wgit::Url.new(base)
end
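A plain-String sketch of the same rule follows. The base_of helper is hypothetical and uses the stdlib URI class rather than Addressable::URI.

```ruby
require 'uri'

# Hypothetical helper mirroring to_base on plain Strings.
def base_of(url)
  uri = URI(url)
  return nil if uri.scheme.nil? || uri.host.nil?

  "#{uri.scheme}://#{uri.host}"
end

puts base_of('http://www.google.co.uk/about.html') # prints "http://www.google.co.uk"
puts base_of('about.html').inspect                 # prints nil
puts base_of('http://localhost:3000/status')       # prints "http://localhost"
```

Since only the scheme and host are interpolated, a port is not part of the result in this sketch; the listing above builds its base the same way.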

#to_brand ⇒ Wgit::Url? Also known as: brand

Returns a new Wgit::Url containing just the brand of this URL e.g. given http://www.google.co.uk/about.html, google is returned.

Returns:

  • Containing just the brand or nil.



# File 'lib/wgit/url.rb', line 337

def to_brand
  domain = to_domain
  domain ? Wgit::Url.new(domain.split('.').first) : nil
end

#to_domain ⇒ Wgit::Url? Also known as: domain

Returns a new Wgit::Url containing just the domain of this URL e.g. Given http://www.google.co.uk/about.html, google.co.uk is returned.

Returns:

  • Containing just the domain or nil.



# File 'lib/wgit/url.rb', line 328

def to_domain
  domain = @uri.domain
  domain ? Wgit::Url.new(domain) : nil
end

#to_endpoint ⇒ Wgit::Url Also known as: endpoint

Returns the endpoint of this URL e.g. the bit after the host with any slashes included. For example: Wgit::Url.new("http://www.google.co.uk/about.html/").to_endpoint returns "/about.html/". See Wgit::Url#to_path if you don't want the slashes.

Returns:

  • Endpoint of self e.g. /about.html/. For a URL without an endpoint, / is returned.



# File 'lib/wgit/url.rb', line 375

def to_endpoint
  endpoint = @uri.path
  endpoint = '/' + endpoint unless endpoint.start_with?('/')
  Wgit::Url.new(endpoint)
end

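#to_extension ⇒ Wgit::Url? Also known as: extension

Returns a new Wgit::Url containing just the file extension of this URL e.g. given http://google.com/about.html, html is returned. The extension is taken from the path, so nil is returned if the path is empty or contains no '.'.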

Returns:

  • Containing just the extension string or nil.



# File 'lib/wgit/url.rb', line 403

def to_extension
  path = to_path
  return nil unless path

  segs = path.split('.')
  segs.length > 1 ? Wgit::Url.new(segs.last) : nil
end

#to_fragment ⇒ Wgit::Url? Also known as: fragment

Returns a new Wgit::Url containing just the fragment string of this URL e.g. given http://google.com#about, about is returned (without the leading '#').

Returns:

  • Containing just the fragment string or nil.



# File 'lib/wgit/url.rb', line 394

def to_fragment
  fragment = @uri.fragment
  fragment ? Wgit::Url.new(fragment) : nil
end

#to_h ⇒ Hash

Returns a Hash containing this Url's instance vars excluding @uri. Used when storing the URL in a Database e.g. MongoDB etc.

Returns:

  • self's instance vars as a Hash.



# File 'lib/wgit/url.rb', line 279

def to_h
  ignore = ['@uri']
  h = Wgit::Utils.to_h(self, ignore: ignore)
  Hash[h.to_a.insert(0, ['url', self])] # Insert url at position 0.
end

#to_host ⇒ Wgit::Url? Also known as: host

Returns a new Wgit::Url containing just the host of this URL e.g. Given http://www.google.co.uk/about.html, www.google.co.uk is returned.

Returns:

  • Containing just the host or nil.



# File 'lib/wgit/url.rb', line 319

def to_host
  host = @uri.host
  host ? Wgit::Url.new(host) : nil
end

#to_path ⇒ Wgit::Url? Also known as: path

Returns the path of this URL e.g. the bit after the host without slashes. For example: Wgit::Url.new("http://www.google.co.uk/about.html/").to_path returns "about.html". See Wgit::Url#to_endpoint if you want the slashes.

Returns:

  • Path of self e.g. about.html or nil.



# File 'lib/wgit/url.rb', line 360

def to_path
  path = @uri.path
  return nil if path.nil? || path.empty?
  return Wgit::Url.new('/') if path == '/'

  Wgit::Url.new(path).omit_slashes
end
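The contrast between to_endpoint (slashes kept) and to_path (slashes stripped) can be seen side by side in the sketch below. Both helpers are hypothetical, operate on plain Strings and use the stdlib URI class rather than Addressable::URI.

```ruby
require 'uri'

# Hypothetical helper mirroring to_endpoint: always starts with a slash.
def endpoint_of(url)
  path = URI(url).path
  path.start_with?('/') ? path : "/#{path}"
end

# Hypothetical helper mirroring to_path: nil when empty, slashes stripped.
def path_of(url)
  path = URI(url).path
  return nil if path.nil? || path.empty?
  return '/' if path == '/'

  path.sub(%r{\A/}, '').sub(%r{/\z}, '')
end

url = 'http://www.google.co.uk/about.html/'

puts endpoint_of(url)                        # prints "/about.html/"
puts path_of(url)                            # prints "about.html"
puts endpoint_of('http://www.google.co.uk')  # prints "/"
puts path_of('http://www.google.co.uk').inspect # prints nil
```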

#to_query ⇒ Wgit::Url? Also known as: query

Returns a new Wgit::Url containing just the query string of this URL e.g. given http://google.com?q=ruby, q=ruby is returned (without the leading '?').

Returns:

  • Containing just the query string or nil.



# File 'lib/wgit/url.rb', line 385

def to_query
  query = @uri.query
  query ? Wgit::Url.new(query) : nil
end

#to_scheme ⇒ Wgit::Url? Also known as: scheme

Returns a new Wgit::Url containing just the scheme of this URL e.g. Given http://www.google.co.uk, http is returned.

Returns:

  • Containing just the scheme or nil.



# File 'lib/wgit/url.rb', line 310

def to_scheme
  scheme = @uri.scheme
  scheme ? Wgit::Url.new(scheme) : nil
end

#to_uri ⇒ URI::HTTP, URI::HTTPS Also known as: uri

Returns a normalised URI object for this URL.

Returns:

  • The URI object of self.



# File 'lib/wgit/url.rb', line 288

def to_uri
  URI(normalize)
end

#to_url ⇒ Wgit::Url Also known as: url

Returns self.

Returns:

  • This (self) Url.



# File 'lib/wgit/url.rb', line 302

def to_url
  self
end

#valid? ⇒ Boolean Also known as: is_valid?

Returns true if self is a valid and absolute HTTP URL, false otherwise. Self should always be crawlable if this method returns true.

Returns:

  • True if valid, absolute and crawlable, otherwise false.



# File 'lib/wgit/url.rb', line 190

def valid?
  return false if relative?
  return false unless to_base && to_domain
  return false if URI::DEFAULT_PARSER.make_regexp.match(normalize).nil?

  true
end