Class: URLNormalizer

Inherits:
Object
  • Object
show all
Defined in:
lib/url_normalizer.rb

Overview

Class with methods related to normalizing URLs.

Class Method Summary collapse

Class Method Details

.normalize_entry_url(url, entry) ⇒ Object

Normalize an entry URL:

  • make sure that it is an absolute URL, prepending the feed host if necessary

  • make sure that the URL has an http or https scheme, using the feed's scheme by default

  • If the URL contains non-ascii characters, convert to ASCII using punycode

(see en.wikipedia.org/wiki/Internationalized_domain_name)

Receives as argument an URL string.

If a nil or empty string is passed, returns nil.


63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
# File 'lib/url_normalizer.rb', line 63

def self.normalize_entry_url(url, entry)
  # Check that the passed string contains something
  return nil if url.blank?

  normalized_url = strip_url url

  # Addressable treats scheme-relative URIs as relative URIs, but we do not want to add the feed host etc to
  # scheme-relative URIs in entries. So, if URI is scheme-relative, skip the manipulations performed on relative
  # URIs.
  if normalized_url =~ /\A\/\//
    Rails.logger.info "Value #{url} is a scheme relative URI, leaving it unchanged"
  # Data-uris do not need to be further manipulated
  elsif normalized_url =~ /\Adata:/
    Rails.logger.info "Value #{url} is a data-uri, leaving it unchanged"
  # Object-URLs (pointing to in-memory blobs) are not allowed because of security concerns, they are removed
  elsif normalized_url =~ /\Ablob:/
    Rails.logger.info "Value #{url} is an object-url (blob), removing it"
    normalized_url = ''
  else
    # if the entry url is relative, try to make it absolute using the feed's host
    begin
      parsed_uri = Addressable::URI.parse normalized_url
      if parsed_uri.relative?
        # Path must begin with a '/'
        normalized_url = "/#{normalized_url}" if parsed_uri.path[0] != '/'

        # Use host from feed URL, or if the feed only has a fetch URL use it instead.
        uri_feed = entry_feed_uri entry
        normalized_url = "#{uri_feed.host}#{normalized_url}"
      end

      # If url has no http or https scheme, add http://
      unless normalized_url =~ /\Ahttp:\/\//i || normalized_url =~ /\Ahttps:\/\//i
        # Do not recalculate feed URI if previously calculated.
        uri_feed ||= entry_feed_uri entry
        Rails.logger.info "Value #{url} has no http or https URI scheme, trying to add scheme from feed url #{uri_feed}"
        normalized_url = "#{uri_feed.scheme}://#{normalized_url}"
      end
    rescue Addressable::URI::InvalidURIError => e
      Rails.logger.warn "URL #{url} is not a parseable URL, removing it"
      normalized_url = ''
    end
  end

  begin
    normalized_url = Addressable::URI.parse(normalized_url).normalize.to_s
  rescue Addressable::URI::InvalidURIError => e
    Rails.logger.warn "URL #{normalized_url} cannot be normalized, removing it"
    normalized_url = ''
  end

  return normalized_url
end

.normalize_feed_url(url) ⇒ Object

Normalize the passed URL:

  • Make sure that the URL passed as argument has an http:// or scheme.

  • If the URL contains non-ascii characters, convert to ASCII using punycode

(see en.wikipedia.org/wiki/Internationalized_domain_name)

Receives as argument an URL string.

The algorithm for scheme manipulations performed by the method is:

  • If the URL has no scheme it is returned prepended with http://

  • If the URL has a feed: or feed:// scheme, it is removed and an http:// scheme added if necessary.

For details about this uri-scheme see en.wikipedia.org/wiki/Feed_URI_scheme

  • If the URL has an http:// or https:// scheme, it is returned untouched.

If a nil or empty string is passed, returns nil.


22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
# File 'lib/url_normalizer.rb', line 22

def self.normalize_feed_url(url)
  # Check that the passed string contains something
  return nil if url.blank?

  normalized_url = strip_url url

  # If the url has the feed:// or feed: uri-schemes, remove them.
  # The order in which these removals happen is critical, don't change it!!!
  normalized_url.sub! /\Afeed:\/\//i, ''
  normalized_url.sub! /\Afeed:/i, ''

  # If the url is scheme relative, remove the leading '//', later 'http://' will be prepended
  normalized_url.sub! /\A\/\//, ''

  # If url has no http or https scheme, add http://
  unless normalized_url =~ /\Ahttp:\/\//i || normalized_url =~ /\Ahttps:\/\//i
    Rails.logger.info "Value #{url} has no http or https URI scheme, trying to add http:// scheme"
    normalized_url = "http://#{normalized_url}"
  end

  begin
    normalized_url = Addressable::URI.parse(normalized_url).normalize.to_s
  rescue Addressable::URI::InvalidURIError => e
    Rails.logger.warn "URL #{normalized_url} cannot be parsed, removing it"
    normalized_url = ''
  end

  return normalized_url
end