Class: Wgit::Url

Inherits: String
Inheritance chain: Object, String, Wgit::Url

Includes: Assertable

Defined in: lib/wgit/url.rb

Overview

Class modeling/serialising a web based HTTP URL.

Can be an internal/relative link e.g. "about.html" or an absolute URL e.g. "http://www.google.co.uk". Is a subclass of String and uses URI and addressable/uri internally for parsing.

Most of the methods in this class return new Wgit::Url instances making the method calls chainable e.g. url.omit_base.omit_fragment etc. The methods also try to be idempotent where possible.
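
The chainable style described above can be illustrated with a minimal, self-contained sketch. `ChainableUrl` is a hypothetical stand-in (it is not part of Wgit) that only mirrors the two slash helpers whose source appears later on this page:

```ruby
# Hypothetical stand-in for illustration only; this is not Wgit::Url.
class ChainableUrl < String
  # Returns a new instance without one leading slash, or self if there is none.
  def omit_leading_slash
    start_with?("/") ? self.class.new(self[1..]) : self
  end

  # Returns a new instance without one trailing slash, or self if there is none.
  def omit_trailing_slash
    end_with?("/") ? self.class.new(chop) : self
  end
end

url     = ChainableUrl.new("/about/")
chained = url.omit_leading_slash.omit_trailing_slash

puts chained       # about
puts chained.class # ChainableUrl
puts url           # /about/ (the receiver is left unchanged)
```

Because each helper returns an instance of the same class, calls can be chained in any order, and calling them again on the result changes nothing.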

Constant Summary

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::MIXED_ENUMERABLE_MSG, Assertable::NON_ENUMERABLE_MSG

Instance Attribute Summary

#crawl_duration ⇒ Object
The duration of the crawl for this Url (in seconds).
#crawled ⇒ Object (also: #crawled?)
Whether or not the Url has been crawled.
#date_crawled ⇒ Object
The Time stamp of when this Url was crawled.
#redirects ⇒ Object
Record the redirects from the initial Url to the final Url.

Class Method Summary

.parse(obj) ⇒ Wgit::Url
Initialises a new Wgit::Url instance from a String or subclass of String e.g. Wgit::Url.
.parse?(obj) ⇒ Wgit::Url, nil
Returns a Wgit::Url instance from Wgit::Url.parse, or nil if obj cannot be parsed successfully e.g. the String is invalid.

Instance Method Summary

#absolute? ⇒ Boolean (also: #is_absolute?)
Returns true if self is an absolute Url; false if relative.
#concat(other) ⇒ String
Overrides String#concat so that a plain String is returned rather than a Wgit::Url.
#fragment? ⇒ Boolean (also: #is_fragment?)
Returns true if self is a URL fragment e.g. #top.
#index? ⇒ Boolean (also: #is_index?)
Returns true if self equals '/' a.k.a. index.
#initialize(url_or_obj, crawled: false, date_crawled: nil, crawl_duration: nil) ⇒ Url constructor
Initializes a new instance of Wgit::Url which models a web based HTTP URL.
#inspect ⇒ String
Overrides String#inspect to distinguish this Url from a String.
#invalid? ⇒ Boolean
Returns true if self is an invalid (e.g. relative) HTTP URL.
#join(other) ⇒ Wgit::Url
Joins self and other together before returning a new Url.
#make_absolute(doc) ⇒ Wgit::Url
Returns an absolute form of self within the context of doc.
#normalize ⇒ Wgit::Url
Normalizes/escapes self and returns a new Wgit::Url.
#omit(*components) ⇒ Wgit::Url
Omits the given URL components from self and returns a new Wgit::Url.
#omit_base ⇒ Wgit::Url
Returns a new Wgit::Url with the base (scheme and host) removed.
#omit_fragment ⇒ Wgit::Url
Returns a new Wgit::Url with the fragment portion removed.
#omit_leading_slash ⇒ Wgit::Url
Returns a new Wgit::Url containing self without a leading slash.
#omit_origin ⇒ Wgit::Url
Returns a new Wgit::Url with the origin (base + port) removed.
#omit_query ⇒ Wgit::Url
Returns a new Wgit::Url with the query string portion removed.
#omit_slashes ⇒ Wgit::Url
Returns a new Wgit::Url containing self without a leading or trailing slash.
#omit_trailing_slash ⇒ Wgit::Url
Returns a new Wgit::Url containing self without a trailing slash.
#prefix_scheme(scheme = :http) ⇒ Wgit::Url
Returns self having prefixed a scheme/protocol.
#query? ⇒ Boolean (also: #is_query?)
Returns true if self is a URL query string e.g. ?q=hello.
#redirects_journey ⇒ Array<Wgit::Url>
Returns the Wgit::Urls starting with the originally requested Url, followed by each redirected-to Url, and finishing with the final crawled Url.
#relative?(opts = {}) ⇒ Boolean (also: #is_relative?)
Returns true if self is a relative Url; false if absolute.
#replace(new_url) ⇒ String
Overrides String#replace setting the new_url @uri and String value.
#scheme_relative? ⇒ Boolean (also: #is_scheme_relative?)
Returns true if self starts with '//' a.k.a a scheme/protocol relative path.
#to_addressable_uri ⇒ Addressable::URI
Returns the Addressable::URI object for this URL.
#to_base ⇒ Wgit::Url? (also: #base)
Returns only the base of this URL e.g. the protocol scheme and host combined.
#to_brand ⇒ Wgit::Url? (also: #brand)
Returns a new Wgit::Url containing just the brand of this URL.
#to_domain ⇒ Wgit::Url? (also: #domain)
Returns a new Wgit::Url containing just the domain of this URL.
#to_endpoint ⇒ Wgit::Url (also: #endpoint)
Returns the endpoint of this URL e.g. the bit after the host with any slashes included.
#to_extension ⇒ Wgit::Url? (also: #extension)
Returns a new Wgit::Url containing just the file extension of this URL.
#to_fragment ⇒ Wgit::Url? (also: #fragment)
Returns a new Wgit::Url containing just the fragment string of this URL.
#to_h ⇒ Hash
Returns a Hash containing this Url's instance vars excluding @uri.
#to_host ⇒ Wgit::Url? (also: #host)
Returns a new Wgit::Url containing just the host of this URL.
#to_origin ⇒ Wgit::Url? (also: #origin)
Returns only the origin of this URL e.g. the protocol scheme, host and port combined.
#to_password ⇒ Wgit::Url? (also: #password)
Returns a new Wgit::Url containing just the password string of this URL.
#to_path ⇒ Wgit::Url? (also: #path)
Returns the path of this URL e.g. the bit after the host.
#to_port ⇒ Wgit::Url? (also: #port)
Returns a new Wgit::Url containing just the port of this URL.
#to_query ⇒ Wgit::Url? (also: #query)
Returns a new Wgit::Url containing just the query string of this URL.
#to_query_hash(symbolize_keys: false) ⇒ Hash<String | Symbol, String> (also: #query_hash)
Returns a Hash containing just the query string parameters of this URL.
#to_scheme ⇒ Wgit::Url? (also: #scheme)
Returns a new Wgit::Url containing just the scheme of this URL.
#to_sub_domain ⇒ Wgit::Url? (also: #sub_domain)
Returns a new Wgit::Url containing just the sub domain of this URL.
#to_uri ⇒ URI::HTTP, URI::HTTPS (also: #uri)
Returns a normalised URI object for this URL.
#to_url ⇒ Wgit::Url (also: #url)
Returns self.
#to_user ⇒ Wgit::Url? (also: #user)
Returns a new Wgit::Url containing just the username string of this URL.
#valid? ⇒ Boolean (also: #is_valid?)
Returns true if self is a valid and absolute HTTP URL.

Methods included from Assertable

#assert_arr_types, #assert_common_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(url_or_obj, crawled: false, date_crawled: nil, crawl_duration: nil) ⇒ `Url`

Initializes a new instance of Wgit::Url which models a web based HTTP URL.

Parameters:

url_or_obj (String, Wgit::Url, #fetch#[]) —
Is either a String based URL or an object representing a Database record e.g. a MongoDB document/object.
crawled (Boolean) (defaults to: false) —
Whether or not the HTML of the URL's web page has been crawled. Only used if url_or_obj is a String.
date_crawled (Time) (defaults to: nil) —
Should only be provided if crawled is true. A suitable object can be returned from Wgit::Utils.time_stamp. Only used if url_or_obj is a String.
crawl_duration (Float) (defaults to: nil) —
Should only be provided if crawled is true. The duration of the crawl for this Url (in seconds).

Raises:

(StandardError) —
If url_or_obj is an Object with missing methods.

# File 'lib/wgit/url.rb', line 48

def initialize(
  url_or_obj, crawled: false, date_crawled: nil, crawl_duration: nil
)
  # Init from a URL String.
  if url_or_obj.is_a?(String)
    url = url_or_obj.to_s
  # Else init from a Hash like object e.g. database object.
  else
    obj = url_or_obj
    assert_respond_to(obj, :fetch)

    url            = obj.fetch("url") # Should always be present.
    crawled        = obj.fetch("crawled", false)
    date_crawled   = obj.fetch("date_crawled", nil)
    crawl_duration = obj.fetch("crawl_duration", nil)
    redirects      = obj.fetch("redirects", {})
  end

  @uri            = Addressable::URI.parse(url)
  @crawled        = crawled
  @date_crawled   = date_crawled
  @crawl_duration = crawl_duration
  @redirects      = redirects || {}

  super(url)
end
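
The two initialisation paths above (a plain String, or a record-like object that responds to #fetch) can be sketched in isolation. `RecordUrl` is a hypothetical stand-in that copies only the branching logic; it skips the URI parsing and the Assertable helper, and it should not be read as the library's implementation:

```ruby
# Hypothetical stand-in for illustration only; this is not Wgit::Url.
class RecordUrl < String
  attr_reader :crawled, :date_crawled

  def initialize(url_or_obj, crawled: false, date_crawled: nil)
    if url_or_obj.is_a?(String)
      # Init from a URL String; the keyword arguments are used as given.
      url = url_or_obj.to_s
    else
      # Anything responding to #fetch (e.g. a Hash) is treated as a record.
      raise "url_or_obj must respond to #fetch" unless url_or_obj.respond_to?(:fetch)

      url          = url_or_obj.fetch("url") # Raises KeyError on a Hash if absent.
      crawled      = url_or_obj.fetch("crawled", false)
      date_crawled = url_or_obj.fetch("date_crawled", nil)
    end

    @crawled      = crawled
    @date_crawled = date_crawled

    super(url)
  end
end

from_string = RecordUrl.new("http://example.com")
from_record = RecordUrl.new({ "url" => "http://example.com", "crawled" => true })

puts from_string.crawled # false
puts from_record.crawled # true
```

Note that when a record is given, its values replace the keyword arguments, which matches the "Only used if url_or_obj is a String" remarks in the parameter list.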

Instance Attribute Details

#crawl_duration ⇒ `Object`

The duration of the crawl for this Url (in seconds).




# File 'lib/wgit/url.rb', line 29

def crawl_duration
  @crawl_duration
end

#crawled ⇒ `Object` Also known as: crawled?

Whether or not the Url has been crawled. A custom crawled= method is provided by this class.




# File 'lib/wgit/url.rb', line 23

def crawled
  @crawled
end

#date_crawled ⇒ `Object`

The Time stamp of when this Url was crawled.




# File 'lib/wgit/url.rb', line 26

def date_crawled
  @date_crawled
end

#redirects ⇒ `Object`

Record the redirects from the initial Url to the final Url.




# File 'lib/wgit/url.rb', line 32

def redirects
  @redirects
end

Class Method Details

.parse(obj) ⇒ `Wgit::Url`

Initialises a new Wgit::Url instance from a String or subclass of String e.g. Wgit::Url. Any other obj type will raise an error.

If obj is already a Wgit::Url then it will be returned as is to maintain its state. Otherwise, a new Wgit::Url is instantiated and returned. This differs from Wgit::Url.new which always instantiates a new Wgit::Url.

Note: Only use this method if you are allowing obj to be either a String or a Wgit::Url whose state you want to preserve e.g. when passing a URL to a crawl method which might redirect (calling Wgit::Url#replace). If you're sure of the type or don't care about preserving the state of the Wgit::Url, use Wgit::Url.new instead.

Parameters:

obj (Object) —
The object to parse, which #is_a?(String).

Returns:

(Wgit::Url) —
A Wgit::Url instance.

Raises:

(StandardError) —
If obj.is_a?(String) is false.

# File 'lib/wgit/url.rb', line 91

def self.parse(obj)
  raise "Can only parse if obj#is_a?(String)" unless obj.is_a?(String)

  # Return a Wgit::Url as is to avoid losing state e.g. date_crawled etc.
  obj.is_a?(Wgit::Url) ? obj : new(obj)
end
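
The difference between parse and new described above comes down to whether existing state survives. A minimal sketch, using a hypothetical `StatefulUrl` class with a single piece of state (it is not the library class):

```ruby
# Hypothetical stand-in for illustration only; this is not Wgit::Url.
class StatefulUrl < String
  attr_accessor :crawled

  def initialize(str, crawled: false)
    @crawled = crawled
    super(str)
  end

  # Returns an existing StatefulUrl untouched, otherwise builds a new one.
  def self.parse(obj)
    raise "Can only parse if obj#is_a?(String)" unless obj.is_a?(String)

    obj.is_a?(self) ? obj : new(obj)
  end
end

crawled_url = StatefulUrl.new("http://example.com", crawled: true)

parsed = StatefulUrl.parse(crawled_url) # Same object, state preserved.
copied = StatefulUrl.new(crawled_url)   # New object, state reset to defaults.

puts parsed.equal?(crawled_url) # true
puts parsed.crawled             # true
puts copied.crawled             # false
```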

.parse?(obj) ⇒ `Wgit::Url`, `nil`

Returns a Wgit::Url instance from Wgit::Url.parse, or nil if obj cannot be parsed successfully e.g. the String is invalid.

Use this method when you can't guarantee that obj is parsable as a URL. See Wgit::Url.parse for more information.

Parameters:

obj (Object) —
The object to parse, which #is_a?(String).

Returns:

(Wgit::Url) —
A Wgit::Url instance or nil (if obj is invalid).

Raises:

(StandardError) —
If obj.is_a?(String) is false.

# File 'lib/wgit/url.rb', line 107

def self.parse?(obj)
  parse(obj)
rescue Addressable::URI::InvalidURIError
  Wgit.logger.debug("Wgit::Url.parse?('#{obj}') exception: \
Addressable::URI::InvalidURIError")
  nil
end
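
The "return nil instead of raising" pattern used by parse? can be sketched with the standard library alone. The helper name `parse_or_nil` is hypothetical, and stdlib `URI` is used here in place of Addressable so that the example has no gem dependencies:

```ruby
require "uri"

# Hypothetical helper: returns a URI object, or nil if the String is not a
# parsable URI. A non-String argument still raises, as with parse?.
def parse_or_nil(str)
  raise "Can only parse if str#is_a?(String)" unless str.is_a?(String)

  URI.parse(str)
rescue URI::InvalidURIError
  # Only the parse failure is swallowed; the type check above is not.
  nil
end

good = parse_or_nil("http://example.com/about")
bad  = parse_or_nil("http://exa mple.com") # The space makes this unparsable.

puts good.host  # example.com
puts bad.inspect # nil
```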

Instance Method Details

#absolute? ⇒ `Boolean` Also known as: is_absolute?

Returns true if self is an absolute Url; false if relative.

Returns:

(Boolean) —
True if absolute, false if relative.




# File 'lib/wgit/url.rb', line 265

def absolute?
  @uri.absolute?
end

#concat(other) ⇒ `String`

Overrides String#concat, which would otherwise return a Wgit::Url object, so that a plain String is returned instead. The result is the same as calling String#concat (or String#+) on a plain String copy of self, which is the desired behaviour for this method. If you want to join two Urls, use the Wgit::Url#join method.

Parameters:

other (String) —
The String to concat onto this one.

Returns:

(String) —
The new concatted String, not a Wgit::Url.




# File 'lib/wgit/url.rb', line 139

def concat(other)
  to_s.concat(other.to_s)
end
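
The override relies on String#to_s returning a plain String copy when called on a subclass instance. A small sketch of that behaviour, using a hypothetical empty subclass rather than Wgit::Url:

```ruby
# Hypothetical empty subclass, used only to show how #to_s behaves.
class SubString < String; end

sub   = SubString.new("http://example.com")
plain = sub.to_s.concat("/about")

puts plain.class # String
puts plain       # http://example.com/about
puts sub         # http://example.com (the receiver is not modified)
```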

#fragment? ⇒ `Boolean` Also known as: is_fragment?

Returns true if self is a URL fragment e.g. #top etc. Note this shouldn't be used to determine if self contains a fragment.

Returns:

(Boolean) —
True if self is a fragment, false otherwise.




# File 'lib/wgit/url.rb', line 687

def fragment?
  start_with?("#")
end

#index? ⇒ `Boolean` Also known as: is_index?

Returns true if self equals '/' a.k.a. index.

Returns:

(Boolean) —
True if self equals '/', false otherwise.




# File 'lib/wgit/url.rb', line 694

def index?
  self == "/"
end

#inspect ⇒ `String`

Overrides String#inspect to distinguish this Url from a String.

Returns:

(String) —
A short textual representation of this Url.




# File 'lib/wgit/url.rb', line 118

def inspect
  "#<Wgit::Url url=\"#{self}\" crawled=#{@crawled}>"
end

#invalid? ⇒ `Boolean`

Returns true if self is an invalid (e.g. relative) HTTP URL. See Wgit::Url#valid? for the inverse (and more information).

Returns:

(Boolean) —
True if invalid, otherwise false.




# File 'lib/wgit/url.rb', line 285

def invalid?
  !valid?
end

#join(other) ⇒ `Wgit::Url`

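Joins self and other together before returning a new Url. Self is not modified. Any leading slash is removed from other, then a separator is chosen: an empty String if other starts with '#', '?' or '.', or if self already ends with '/'; otherwise '/'. See the source code for more information.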

Parameters:

other (Wgit::Url, String) —
The other (relative) Url to join to the end of self.

Returns:

(Wgit::Url) —
self + separator + other, separator depends on other.

# File 'lib/wgit/url.rb', line 296

def join(other)
  other = Wgit::Url.new(other)
  raise "other must be relative" unless other.relative?

  other = other.omit_leading_slash
  separator = %w[# ? .].include?(other[0]) ? "" : "/"
  separator = "" if end_with?("/")
  joined = self + separator + other

  Wgit::Url.new(joined)
end
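
The separator rule in the source above can be tried out on plain Strings. `join_urls` is a hypothetical helper that copies only that rule; it skips the relative check and does not wrap the result in a Wgit::Url:

```ruby
# Hypothetical helper mirroring the separator logic of the method above.
def join_urls(base, other)
  other = other.delete_prefix("/") # Same effect as omit_leading_slash.

  # No separator before a fragment, query string or dot-prefixed segment.
  separator = %w[# ? .].include?(other[0]) ? "" : "/"
  # No separator if the base already ends with a slash.
  separator = "" if base.end_with?("/")

  base + separator + other
end

puts join_urls("http://example.com", "/about")        # http://example.com/about
puts join_urls("http://example.com/", "about")        # http://example.com/about
puts join_urls("http://example.com/search", "?q=ruby") # http://example.com/search?q=ruby
puts join_urls("http://example.com/page", "#top")     # http://example.com/page#top
```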

#make_absolute(doc) ⇒ `Wgit::Url`

Returns an absolute form of self within the context of doc. Doesn't modify the receiver.

If self is absolute then it's returned as is, making this method idempotent. The doc's <base> element is used if present, otherwise doc.url is used as the base; which is joined with self.

Typically used to build an absolute link obtained from a document.

Examples:

link = Wgit::Url.new('/favicon.png')
doc  = Wgit::Document.new('http://example.com')

link.make_absolute(doc) # => "http://example.com/favicon.png"

Parameters:

doc (Wgit::Document) —
The doc whose base Url is joined with self.

Returns:

(Wgit::Url) —
Self in absolute form.

Raises:

(StandardError) —
If doc isn't a Wgit::Document or if doc.base_url raises an Exception.

# File 'lib/wgit/url.rb', line 336

def make_absolute(doc)
  assert_type(doc, Wgit::Document)
  raise "Cannot make absolute when Document @url is not valid" \
  unless doc.url.valid?

  return prefix_scheme(doc.url.to_scheme&.to_sym) if scheme_relative?

  absolute? ? self : doc.base_url(link: self).join(self)
end

#normalize ⇒ `Wgit::Url`

Normalizes/escapes self and returns a new Wgit::Url. Self isn't modified. This should be used before GET'ing the url, in case it has IRI chars.

Returns:

(Wgit::Url) —
An escaped version of self.




# File 'lib/wgit/url.rb', line 312

def normalize
  Wgit::Url.new(@uri.normalize.to_s)
end

#omit(*components) ⇒ `Wgit::Url`

Omits the given URL components from self and returns a new Wgit::Url.

Calls Addressable::URI#omit underneath and creates a new Wgit::Url from the output. See the Addressable::URI docs for more information.

Parameters:

components (*Symbol) —
One or more Symbols representing the URL components to omit. The following components are supported: :scheme, :user, :password, :userinfo, :host, :port, :authority, :path, :query, :fragment.

Returns:

(Wgit::Url) —
Self's URL value with the given components omitted.

# File 'lib/wgit/url.rb', line 583

def omit(*components)
  omitted = @uri.omit(*components)
  Wgit::Url.new(omitted.to_s)
end

#omit_base ⇒ `Wgit::Url`

Returns a new Wgit::Url with the base (scheme and host) removed e.g. Given http://google.com/search?q=something#about, search?q=something#about is returned. If relative and base isn't present then self is returned. Self is also returned if nothing but '/' would remain. As the source below shows, a leading slash is stripped from the return value.

Returns:

(Wgit::Url) —
Self containing everything after the base.

# File 'lib/wgit/url.rb', line 622

def omit_base
  base_url = to_base
  omit_base = base_url ? gsub(base_url, "") : self

  return self if ["", "/"].include?(omit_base)

  Wgit::Url.new(omit_base).omit_leading_slash
end

#omit_fragment ⇒ `Wgit::Url`

Returns a new Wgit::Url with the fragment portion removed e.g. Given http://google.com/search#about, http://google.com/search is returned. Self is returned as is if no fragment is present. A URL consisting of only a fragment e.g. '#about' will return an empty URL. This method assumes that the fragment is correctly placed at the very end of the URL.

Returns:

(Wgit::Url) —
Self with the fragment portion removed.

# File 'lib/wgit/url.rb', line 668

def omit_fragment
  fragment = to_fragment
  omit_fragment = fragment ? gsub("##{fragment}", "") : self

  Wgit::Url.new(omit_fragment)
end

#omit_leading_slash ⇒ `Wgit::Url`

Returns a new Wgit::Url containing self without a leading slash. Is idempotent in that self is returned as is when there's no leading slash.

Returns:

(Wgit::Url) —
Self without a leading slash.




# File 'lib/wgit/url.rb', line 593

def omit_leading_slash
  start_with?("/") ? Wgit::Url.new(self[1..]) : self
end

#omit_origin ⇒ `Wgit::Url`

Returns a new Wgit::Url with the origin (base + port) removed e.g. Given http://google.com:81/search?q=something#about, search?q=something#about is returned. If relative and base isn't present then self is returned. Self is also returned if nothing but '/' would remain. As the source below shows, a leading slash is stripped from the return value.

Returns:

(Wgit::Url) —
Self containing everything after the origin.

# File 'lib/wgit/url.rb', line 637

def omit_origin
  origin = to_origin
  omit_origin = origin ? gsub(origin, "") : self

  return self if ["", "/"].include?(omit_origin)

  Wgit::Url.new(omit_origin).omit_leading_slash
end

#omit_query ⇒ `Wgit::Url`

Returns a new Wgit::Url with the query string portion removed e.g. Given http://google.com/search?q=hello, http://google.com/search is returned. Self is returned as is if no query string is present. A URL consisting of only a query string e.g. '?q=hello' will return an empty URL.

Returns:

(Wgit::Url) —
Self with the query string portion removed.

# File 'lib/wgit/url.rb', line 653

def omit_query
  query = to_query
  omit_query_string = query ? gsub("?#{query}", "") : self

  Wgit::Url.new(omit_query_string)
end

#omit_slashes ⇒ `Wgit::Url`

Returns a new Wgit::Url containing self without a leading or trailing slash. Is idempotent in that self is returned as is when there are no leading or trailing slashes.

Returns:

(Wgit::Url) —
Self without leading or trailing slashes.

# File 'lib/wgit/url.rb', line 611

def omit_slashes
  omit_leading_slash
    .omit_trailing_slash
end

#omit_trailing_slash ⇒ `Wgit::Url`

Returns a new Wgit::Url containing self without a trailing slash. Is idempotent in that self is returned as is when there's no trailing slash.

Returns:

(Wgit::Url) —
Self without a trailing slash.




# File 'lib/wgit/url.rb', line 602

def omit_trailing_slash
  end_with?("/") ? Wgit::Url.new(chop) : self
end

#prefix_scheme(scheme = :http) ⇒ `Wgit::Url`

Returns self having prefixed a scheme/protocol. Doesn't modify receiver. Returns self even if absolute (with scheme); therefore is idempotent.

Parameters:

scheme (Symbol) (defaults to: :http) —
Either :http or :https.

Returns:

(Wgit::Url) —
Self with a scheme prefix.

# File 'lib/wgit/url.rb', line 351

def prefix_scheme(scheme = :http)
  unless %i[http https].include?(scheme)
    raise "scheme must be :http or :https, not :#{scheme}"
  end

  return self if absolute? && !scheme_relative?

  separator = scheme_relative? ? "" : "//"
  Wgit::Url.new("#{scheme}:#{separator}#{self}")
end
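
The prefixing rules above can be sketched on plain Strings. `with_scheme` is a hypothetical helper; it detects an existing scheme with a simple regular expression, which is an assumption made for this sketch and not how the library decides whether self is absolute:

```ruby
# Hypothetical helper mirroring the prefixing rules of the method above.
def with_scheme(url, scheme = :http)
  unless %i[http https].include?(scheme)
    raise ArgumentError, "scheme must be :http or :https, not :#{scheme}"
  end

  # Assumption for this sketch: "has a scheme" means it starts with xyz://
  return url if url.match?(%r{\A[a-z][a-z0-9+.-]*://}i)

  # A scheme relative URL already starts with //, so only add "scheme:".
  separator = url.start_with?("//") ? "" : "//"
  "#{scheme}:#{separator}#{url}"
end

puts with_scheme("example.com")            # http://example.com
puts with_scheme("//example.com", :https)  # https://example.com
puts with_scheme("https://example.com")    # https://example.com
```

Calling the helper on its own output returns the same value, which is the idempotency the description refers to.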

#query? ⇒ `Boolean` Also known as: is_query?

Returns true if self is a URL query string e.g. ?q=hello etc. Note this shouldn't be used to determine if self contains a query.

Returns:

(Boolean) —
True if self is a query string, false otherwise.




# File 'lib/wgit/url.rb', line 679

def query?
  start_with?("?")
end

#redirects_journey ⇒ `Array<Wgit::Url>`

Returns the Wgit::Urls starting with the originally requested Url, followed by each redirected-to Url, and finishing with the final crawled Url e.g.

Example Url redirects journey (dictated by the webserver):

http://example.com   => 301 to https://example.com
https://example.com  => 301 to https://example.com/
https://example.com/ => 200 OK (no more redirects, crawl complete)

Would return an Array of Wgit::Url's in the form of:

%w(
  http://example.com
  https://example.com
  https://example.com/
)

Returns:

(Array<Wgit::Url>) —
Each redirected-to Url, finishing with the final (successfully) crawled Url. If no redirects took place, then just the originally requested Url is returned inside the Array.




# File 'lib/wgit/url.rb', line 193

def redirects_journey
  [redirects.keys, self].flatten
end
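
The one-line implementation above can be traced with plain data. This sketch assumes the redirects Hash maps each requested URL to the URL it redirected to, in the order the redirects happened; the page only shows that its keys are used, so treat the shape of the values as an assumption:

```ruby
# Assumed shape: { requested_url => redirected_to_url }, in crawl order.
redirects = {
  "http://example.com"  => "https://example.com",
  "https://example.com" => "https://example.com/"
}
final_url = "https://example.com/" # Stands in for self after the crawl.

journey = [redirects.keys, final_url].flatten
puts journey
# http://example.com
# https://example.com
# https://example.com/

# With no redirects recorded, only the final URL is returned.
no_redirects = [{}.keys, final_url].flatten
puts no_redirects.inspect # ["https://example.com/"]
```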

#relative?(opts = {}) ⇒ `Boolean` Also known as: is_relative?

Returns true if self is a relative Url; false if absolute.

An absolute URL must have a scheme prefix e.g. 'http://', otherwise the URL is regarded as being relative (regardless of whether it's valid or not). The only exception is if an opts arg is provided and self is a page belonging to that arg type e.g. host; then the link is relative.

Examples:

url = Wgit::Url.new('http://example.com/about')

url.relative? # => false
url.relative?(host: 'http://example.com') # => true

Parameters:

opts (Hash) (defaults to: {}) —
The options with which to check relativity. Only one opts param should be provided. The provided opts param Url must be absolute and be prefixed with a scheme. Consider using the output of Wgit::Url#to_origin which should work (unless it's nil).

Options Hash (opts):

:origin (Wgit::Url, String) —
The Url origin e.g. http://www.google.com:81/how which gives an origin of 'http://www.google.com:81'.
:host (Wgit::Url, String) —
The Url host e.g. http://www.google.com/how which gives a host of 'www.google.com'.
:domain (Wgit::Url, String) —
The Url domain e.g. http://www.google.com/how which gives a domain of 'google.com'.
:brand (Wgit::Url, String) —
The Url brand e.g. http://www.google.com/how which gives a brand of 'google'.

Returns:

(Boolean) —
True if relative, false if absolute.

Raises:

(StandardError) —
If self is invalid (e.g. empty) or an invalid opts param has been provided.

# File 'lib/wgit/url.rb', line 227

def relative?(opts = {})
  defaults = { origin: nil, host: nil, domain: nil, brand: nil }
  opts = defaults.merge(opts)
  raise "Url (self) cannot be empty" if empty?

  return false if scheme_relative?
  return true  if @uri.relative?

  # Self is absolute but may be relative to the opts param e.g. host.
  opts.select! { |_k, v| v }
  raise "Provide only one of: #{defaults.keys}" if opts.length > 1

  return false if opts.empty?

  type, url = opts.first
  url = Wgit::Url.new(url)
  if url.invalid?
    raise "Invalid opts param value, it must be absolute, containing a \
protocol scheme and domain (e.g. http://example.com): #{url}"
  end

  case type
  when :origin # http://www.google.com:81
    to_origin == url.to_origin
  when :host   # www.google.com
    to_host   == url.to_host
  when :domain # google.com
    to_domain == url.to_domain
  when :brand  # google
    to_brand  == url.to_brand
  else
    raise "Unknown opts param: :#{type}, use one of: #{defaults.keys}"
  end
end
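
The two notions of relativity described above (no scheme at all, or absolute but belonging to the same host) can be sketched with stdlib `URI`. Both helper names are hypothetical, and only the :host option is modelled here:

```ruby
require "uri"

# True if the URL has no scheme, i.e. it is relative in the usual sense.
def relative_url?(url)
  URI.parse(url).relative?
end

# True if both URLs share a host, which is how an absolute URL can still be
# regarded as relative to a given host.
def same_host?(url, other)
  URI.parse(url).host == URI.parse(other).host
end

puts relative_url?("/about")                                         # true
puts relative_url?("http://example.com/about")                       # false
puts same_host?("http://example.com/about", "http://example.com")    # true
puts same_host?("http://example.com/about", "http://other.com")      # false
```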

#replace(new_url) ⇒ `String`

Overrides String#replace setting the new_url @uri and String value.

Parameters:

new_url (Wgit::Url, String) —
The new URL value.

Returns:

(String) —
The new URL value once set.

# File 'lib/wgit/url.rb', line 126

def replace(new_url)
  @uri = Addressable::URI.parse(new_url)

  super(new_url)
end

#scheme_relative? ⇒ `Boolean` Also known as: is_scheme_relative?

Returns true if self starts with '//' a.k.a a scheme/protocol relative path.

Returns:

(Boolean) —
True if self starts with '//', false otherwise.




# File 'lib/wgit/url.rb', line 702

def scheme_relative?
  start_with?("//")
end

#to_addressable_uri ⇒ `Addressable::URI`

Returns the Addressable::URI object for this URL.

Returns:

(Addressable::URI) —
The Addressable::URI object of self.




# File 'lib/wgit/url.rb', line 381

def to_addressable_uri
  @uri
end

#to_base ⇒ `Wgit::Url`? Also known as: base

Returns only the base of this URL e.g. the protocol scheme and host combined.

Returns:

(Wgit::Url, nil) —
The base of self e.g. http://www.google.co.uk or nil.

# File 'lib/wgit/url.rb', line 461

def to_base
  return nil unless @uri.scheme && @uri.host

  base = "#{@uri.scheme}://#{@uri.host}"
  Wgit::Url.new(base)
end

#to_brand ⇒ `Wgit::Url`? Also known as: brand

Returns a new Wgit::Url containing just the brand of this URL e.g. Given http://www.google.co.uk/about.html, google is returned.

Returns:

(Wgit::Url, nil) —
Containing just the brand or nil.

# File 'lib/wgit/url.rb', line 451

def to_brand
  domain = to_domain
  domain ? Wgit::Url.new(domain.split(".").first) : nil
end

#to_domain ⇒ `Wgit::Url`? Also known as: domain

Returns a new Wgit::Url containing just the domain of this URL e.g. Given http://www.google.co.uk/about.html, google.co.uk is returned.

Returns:

(Wgit::Url, nil) —
Containing just the domain or nil.

# File 'lib/wgit/url.rb', line 428

def to_domain
  domain = @uri.domain
  domain ? Wgit::Url.new(domain) : nil
end

#to_endpoint ⇒ `Wgit::Url` Also known as: endpoint

Returns the endpoint of this URL e.g. the bit after the host with any slashes included. For example: Wgit::Url.new("http://www.google.co.uk/about.html/").to_endpoint returns "/about.html/". See Wgit::Url#to_path if you don't want the slashes.

Returns:

(Wgit::Url) —
Endpoint of self e.g. /about.html/. For a URL without an endpoint, / is returned.

# File 'lib/wgit/url.rb', line 501

def to_endpoint
  endpoint = @uri.path
  endpoint = "/#{endpoint}" unless endpoint.start_with?("/")
  Wgit::Url.new(endpoint)
end

#to_extension ⇒ `Wgit::Url`? Also known as: extension

Returns a new Wgit::Url containing just the file extension of this URL e.g. Given http://google.com/about.html, html is returned.

Returns:

(Wgit::Url, nil) —
Containing just the extension string or nil.

# File 'lib/wgit/url.rb', line 547

def to_extension
  path = to_path&.omit_trailing_slash
  return nil unless path

  segs = path.split(".")
  segs.length > 1 ? Wgit::Url.new(segs.last) : nil
end

#to_fragment ⇒ `Wgit::Url`? Also known as: fragment

Returns a new Wgit::Url containing just the fragment string of this URL e.g. Given http://google.com#about, about is returned (without the leading '#').

Returns:

(Wgit::Url, nil) —
Containing just the fragment string or nil.

# File 'lib/wgit/url.rb', line 538

def to_fragment
  fragment = @uri.fragment
  fragment ? Wgit::Url.new(fragment) : nil
end

#to_h ⇒ `Hash`

Returns a Hash containing this Url's instance vars excluding @uri. Used when storing the URL in a Database e.g. MongoDB etc.

Returns:

(Hash) —
self's instance vars as a Hash.

# File 'lib/wgit/url.rb', line 366

def to_h
  h = Wgit::Utils.to_h(self, ignore: ["@uri"])
  Hash[h.to_a.insert(0, ["url", to_s])] # Insert url at position 0.
end
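
The second line of the method above puts the "url" key first in the resulting Hash. A small sketch of that step with plain data; the contents of `ivars` are an assumed example of what Wgit::Utils.to_h might return, not its documented output:

```ruby
# Assumed example of the instance vars Hash (without @uri).
ivars = { "crawled" => true, "date_crawled" => nil }

# Insert the url pair at position 0, then rebuild the Hash.
record = Hash[ivars.to_a.insert(0, ["url", "http://example.com"])]

puts record.keys.inspect # ["url", "crawled", "date_crawled"]

# The same ordering can be had by merging into a Hash that starts with url.
merged = { "url" => "http://example.com" }.merge(ivars)
puts merged.keys == record.keys # true
```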

#to_host ⇒ `Wgit::Url`? Also known as: host

Returns a new Wgit::Url containing just the host of this URL e.g. Given http://www.google.co.uk/about.html, www.google.co.uk is returned.

Returns:

(Wgit::Url, nil) —
Containing just the host or nil.

# File 'lib/wgit/url.rb', line 405

def to_host
  host = @uri.host
  host ? Wgit::Url.new(host) : nil
end

#to_origin ⇒ `Wgit::Url`? Also known as: origin

Returns only the origin of this URL e.g. the protocol scheme, host and port combined. For http://localhost:3000/api, http://localhost:3000 gets returned. If there's no port present, then to_base is returned.

Returns:

(Wgit::Url, nil) —
The origin of self or nil.

# File 'lib/wgit/url.rb', line 473

def to_origin
  return nil unless to_base
  return to_base unless to_port

  Wgit::Url.new("#{to_base}:#{to_port}")
end

#to_password ⇒ `Wgit::Url`? Also known as: password

Returns a new Wgit::Url containing just the password string of this URL e.g. Given a URL with credentials such as http://me:pass1@<host>, pass1 is returned.

Returns:

(Wgit::Url, nil) —
Containing just the password string or nil.

# File 'lib/wgit/url.rb', line 568

def to_password
  password = @uri.password
  password ? Wgit::Url.new(password) : nil
end

#to_path ⇒ `Wgit::Url`? Also known as: path

Returns the path of this URL e.g. the bit after the host without the leading slash. For example: Wgit::Url.new("http://www.google.co.uk/about.html/").to_path returns "about.html/", because the source below only removes the leading slash. See Wgit::Url#to_endpoint if you want the leading slash.

Returns:

(Wgit::Url, nil) —
Path of self e.g. about.html or nil.

# File 'lib/wgit/url.rb', line 486

def to_path
  path = @uri.path
  return nil if path.nil? || path.empty?
  return Wgit::Url.new("/") if path == "/"

  Wgit::Url.new(path).omit_leading_slash
end

#to_port ⇒ `Wgit::Url`? Also known as: port

Returns a new Wgit::Url containing just the port of this URL e.g. Given http://www.google.co.uk:443/about.html, '443' is returned.

Returns:

(Wgit::Url, nil) —
Containing just the port or nil.

# File 'lib/wgit/url.rb', line 414

def to_port
  port = @uri.port

  # @uri.port defaults port to 80/443 if missing, so we check for :#{port}.
  return nil unless port
  return nil unless include?(":#{port}")

  Wgit::Url.new(port.to_s)
end
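
The check for ":#{port}" in the source above guards against a parser filling in a default port. Stdlib `URI` is known to do this, so it makes a convenient stand-in for a sketch; `explicit_port` is a hypothetical helper, not a library method:

```ruby
require "uri"

# Hypothetical helper: returns the port as a String only if it was written
# in the URL, otherwise nil.
def explicit_port(url)
  port = URI.parse(url).port # Defaults to 80/443 when the URL has no port.
  return nil unless port
  return nil unless url.include?(":#{port}")

  port.to_s
end

puts URI.parse("http://example.com").port                 # 80
puts explicit_port("http://example.com").inspect          # nil
puts explicit_port("http://localhost:3000/api").inspect   # "3000"
```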

#to_query ⇒ `Wgit::Url`? Also known as: query

Returns a new Wgit::Url containing just the query string of this URL e.g. Given http://google.com?q=foo&bar=1, 'q=foo&bar=1' is returned.

Returns:

(Wgit::Url, nil) —
Containing just the query string or nil.

# File 'lib/wgit/url.rb', line 511

def to_query
  query = @uri.query
  query ? Wgit::Url.new(query) : nil
end

#to_query_hash(symbolize_keys: false) ⇒ `Hash<String | Symbol, String>` Also known as: query_hash

Returns a Hash containing just the query string parameters of this URL e.g. Given http://google.com?q=ruby, { 'q' => 'ruby' } is returned.

Parameters:

symbolize_keys (Boolean) (defaults to: false) —
The returned Hash keys will be Symbols if true, Strings otherwise.

Returns:

(Hash<String | Symbol, String>) —
Containing the query string params or empty if the URL doesn't contain any query parameters.

# File 'lib/wgit/url.rb', line 523

def to_query_hash(symbolize_keys: false)
  query_str = to_query
  return {} unless query_str

  query_str.split("&").each_with_object({}) do |param, hash|
    k, v = param.split("=")
    k = k.to_sym if symbolize_keys
    hash[k] = v
  end
end
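
The splitting logic above can be exercised on a bare query String. `query_hash` is a hypothetical helper that copies the same steps without needing a Wgit::Url:

```ruby
# Hypothetical helper mirroring the splitting logic of the method above.
def query_hash(query_str, symbolize_keys: false)
  return {} if query_str.nil? || query_str.empty?

  query_str.split("&").each_with_object({}) do |param, hash|
    k, v = param.split("=")
    k = k.to_sym if symbolize_keys
    hash[k] = v
  end
end

puts query_hash("q=ruby&page=2").inspect
puts query_hash("q=ruby&page=2", symbolize_keys: true).inspect
puts query_hash("flag").inspect # A key without '=' gets a nil value.
```

Two properties follow from splitting on "=": values are not URL-decoded, and a value that itself contains "=" keeps only the part before the second "=".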

#to_scheme ⇒ `Wgit::Url`? Also known as: scheme

Returns a new Wgit::Url containing just the scheme of this URL e.g. Given http://www.google.co.uk, http is returned.

Returns:

(Wgit::Url, nil) —
Containing just the scheme or nil.

# File 'lib/wgit/url.rb', line 396

def to_scheme
  scheme = @uri.scheme
  scheme ? Wgit::Url.new(scheme) : nil
end

#to_sub_domain ⇒ `Wgit::Url`? Also known as: sub_domain

Returns a new Wgit::Url containing just the sub domain of this URL e.g. Given http://scripts.dev.google.com, scripts.dev is returned.

Returns:

(Wgit::Url, nil) —
Containing just the sub domain or nil.

# File 'lib/wgit/url.rb', line 437

def to_sub_domain
  return nil unless to_host

  dot_domain = ".#{to_domain}"
  return nil unless include?(dot_domain)

  sub_domain = to_host.sub(dot_domain, "")
  Wgit::Url.new(sub_domain)
end

#to_uri ⇒ `URI::HTTP`, `URI::HTTPS` Also known as: uri

Returns a normalised URI object for this URL.

Returns:

(URI::HTTP, URI::HTTPS) —
The URI object of self.




# File 'lib/wgit/url.rb', line 374

def to_uri
  URI(normalize)
end

#to_url ⇒ `Wgit::Url` Also known as: url

Returns self.

Returns:

(Wgit::Url) —
This (self) Url.




# File 'lib/wgit/url.rb', line 388

def to_url
  self
end

#to_user ⇒ `Wgit::Url`? Also known as: user

Returns a new Wgit::Url containing just the username string of this URL e.g. Given a URL with credentials such as http://me:pass1@<host>, me is returned.

Returns:

(Wgit::Url, nil) —
Containing just the user string or nil.

# File 'lib/wgit/url.rb', line 559

def to_user
  user = @uri.user
  user ? Wgit::Url.new(user) : nil
end

#valid? ⇒ `Boolean` Also known as: is_valid?

Returns true if self is a valid and absolute HTTP URL, otherwise false. Self should always be crawlable if this method returns true.

Returns:

(Boolean) —
True if valid, absolute and crawlable, otherwise false.

# File 'lib/wgit/url.rb', line 273

def valid?
  return false if relative?
  return false unless to_origin && to_domain
  return false unless URI::DEFAULT_PARSER.make_regexp.match(normalize)

  true
end
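
The final check above relies on the regular expression built by the standard library. The sketch below isolates just that step; matches_uri_regexp? is a hypothetical helper, not part of Wgit, and it skips the relative?, to_origin, to_domain and normalize steps that valid? also performs.

```ruby
require "uri"

# Regexp built by the stdlib parser; it looks for an absolute URI,
# i.e. one that has a scheme followed by ":".
ABSOLUTE_URI_REGEXP = URI::DEFAULT_PARSER.make_regexp

# Hypothetical helper (not part of Wgit) covering only the regexp step.
def matches_uri_regexp?(str)
  !ABSOLUTE_URI_REGEXP.match(str).nil?
end

matches_uri_regexp?("http://www.google.co.uk") # => true
matches_uri_regexp?("about.html")              # => false
```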
