UrlParser

Extended URI capabilities built on top of Addressable::URI. Parse URIs into granular components, unescape encoded characters, extract embedded URIs, normalize URIs, generate canonical URLs, and validate domains. Inspired by PostRank-URI and URI.js.

Installation

Add this line to your application's Gemfile:

gem 'url_parser'

And then execute:

$ bundle

Or install it yourself as:

$ gem install url_parser

Example

uri = UrlParser.parse('foo://username:password@ww2.foo.bar.example.com:123/hello/world/there.html?name=ferret#foo')

uri.class               #=> UrlParser::URI
uri.scheme              #=> 'foo'
uri.username            #=> 'username'
uri.user                #=> 'username' # Alias for #username
uri.password            #=> 'password'
uri.userinfo            #=> 'username:password'
uri.hostname            #=> 'ww2.foo.bar.example.com'
uri.naked_hostname      #=> 'foo.bar.example.com'
uri.port                #=> 123
uri.host                #=> 'ww2.foo.bar.example.com:123'
uri.www                 #=> 'ww2'
uri.tld                 #=> 'com'
uri.top_level_domain    #=> 'com' # Alias for #tld
uri.extension           #=> 'com' # Alias for #tld
uri.sld                 #=> 'example'
uri.second_level_domain #=> 'example' # Alias for #sld
uri.domain_name         #=> 'example' # Alias for #sld
uri.trd                 #=> 'ww2.foo.bar'
uri.third_level_domain  #=> 'ww2.foo.bar' # Alias for #trd
uri.subdomains          #=> 'ww2.foo.bar' # Alias for #trd
uri.naked_trd           #=> 'foo.bar'
uri.naked_subdomain     #=> 'foo.bar' # Alias for #naked_trd
uri.domain              #=> 'example.com'
uri.subdomain           #=> 'ww2.foo.bar.example.com'
uri.origin              #=> 'foo://ww2.foo.bar.example.com:123'
uri.authority           #=> 'username:password@ww2.foo.bar.example.com:123'
uri.site                #=> 'foo://username:password@ww2.foo.bar.example.com:123'
uri.path                #=> '/hello/world/there.html'
uri.segment             #=> 'there.html'
uri.directory           #=> '/hello/world'
uri.filename            #=> 'there.html'
uri.suffix              #=> 'html'
uri.query               #=> 'name=ferret'
uri.query_values        #=> { 'name' => 'ferret' }
uri.fragment            #=> 'foo'
uri.resource            #=> 'there.html?name=ferret#foo'
uri.location            #=> '/hello/world/there.html?name=ferret#foo'

Usage

Parse

Parse takes the provided URI and breaks it down into its component parts. For a full list of the components provided, see URI Data Model below. If you provide an instance of Addressable::URI, it is treated as already parsed.

uri = UrlParser.parse('http://example.org/foo?bar=baz')
uri.class 
#=> UrlParser::URI
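
Passing an instance of Addressable::URI skips re-parsing (a minimal sketch; the exact return shown here is an assumption):

require 'addressable/uri'

prepared = Addressable::URI.parse('http://example.org/foo?bar=baz')
uri = UrlParser.parse(prepared)
uri.class
#=> UrlParser::URI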

Unembed, canonicalize, normalize, and clean all rely on parse.

Unembed

Unembed searches the provided URI's query values for redirection URLs. By default it searches the u and url params, but you can configure custom params to search.

uri = UrlParser.unembed('http://energy.gov/exit?url=https%3A//twitter.com/energy')
uri.to_s 
#=> "https://twitter.com/energy"

With custom embedded param keys:

uri = UrlParser.unembed('https://www.upwork.com/leaving?ref=https%3A%2F%2Fwww.example.com', embedded_params: [ 'u', 'url', 'ref' ])
uri.to_s 
#=> "https://www.example.com/"

Canonicalize

Canonicalize filters param keys to remove common tracking params, making it easier to identify duplicate URIs. For a full list of the filtered params, see db.yml.

uri = UrlParser.canonicalize('https://en.wikipedia.org/wiki/Ruby_(programming_language)?source=ABCD&utm_source=EFGH')
uri.to_s 
#=> "https://en.wikipedia.org/wiki/Ruby_(programming_language)?"

Normalize

Normalize standardizes paths, query strings, anchors, whitespace, hostnames, and trailing slashes.

# Normalize paths
uri = UrlParser.normalize('http://example.com/a/b/../../')
uri.to_s 
#=> "http://example.com/"

# Normalize query strings
uri = UrlParser.normalize('http://example.com/?')
uri.to_s 
#=> "http://example.com/"

# Normalize anchors
uri = UrlParser.normalize('http://example.com/#test')
uri.to_s 
#=> "http://example.com/"

# Normalize whitespace
uri = UrlParser.normalize('http://example.com/a/../? #test')
uri.to_s 
#=> "http://example.com/"

# Normalize hostnames
uri = UrlParser.normalize("💩.la")
uri.to_s
#=> "http://xn--ls8h.la/"

# Normalize trailing slashes
uri = UrlParser.normalize('http://example.com/a/b/')
uri.to_s
#=> "http://example.com/a/b"

Clean

Clean combines parsing, unembedding, canonicalization, and normalization into a single call. It is designed for cross-referencing URLs that point to the same resource.

uri = UrlParser.clean('http://example.com/a/../?url=https%3A//💩.la/&utm_source=google')
uri.to_s 
#=> "https://xn--ls8h.la/"

uri = UrlParser.clean('https://en.wikipedia.org/wiki/Ruby_(programming_language)?source=ABCD&utm_source%3Danalytics')
uri.to_s 
#=> "https://en.wikipedia.org/wiki/Ruby_(programming_language)"

UrlParser::URI

Parsing a URI with UrlParser returns an instance of UrlParser::URI, with the following methods available:

URI Data Model

 * :scheme              # Top level URI naming structure / protocol.
 * :username            # Username portion of the userinfo.
 * :user                # Alias for #username.
 * :password            # Password portion of the userinfo.
 * :userinfo            # URI username and password for authentication.
 * :hostname            # Fully qualified domain name or IP address.
 * :naked_hostname      # Hostname without any ww? prefix.
 * :port                # Port number.
 * :host                # Hostname and port.
 * :www                 # The ww? portion of the subdomain.
 * :tld                 # Returns the top level domain portion, aka the extension.
 * :top_level_domain    # Alias for #tld.
 * :extension           # Alias for #tld.
 * :sld                 # Returns the second level domain portion, aka the domain part.
 * :second_level_domain # Alias for #sld.
 * :domain_name         # Alias for #sld.
 * :trd                 # Returns the third level domain portion, aka the subdomain part.
 * :third_level_domain  # Alias for #trd.
 * :subdomains          # Alias for #trd.
 * :naked_trd           # Any non-ww? subdomains.
 * :naked_subdomain     # Alias for #naked_trd.
 * :domain              # The domain name with the tld.
 * :subdomain           # All subdomains, including ww?.
 * :origin              # Scheme and host.
 * :authority           # Userinfo and host.
 * :site                # Scheme, userinfo, and host.
 * :path                # Directory and segment.
 * :segment             # Last portion of the path.
 * :directory           # Any directories following the site within the URI.
 * :filename            # Segment if a file extension is present.
 * :suffix              # The file extension of the filename.
 * :query               # Params and values as a string.
 * :query_values        # A hash of params and values.
 * :fragment            # Fragment identifier.
 * :resource            # Path, query, and fragment.
 * :location            # Directory and resource - everything after the site.

Additional URI Methods

uri = UrlParser.clean('#')
uri.unescaped?      #=> true
uri.parsed?         #=> true 
uri.unembedded?     #=> true 
uri.canonicalized?  #=> true
uri.normalized?     #=> true
uri.cleaned?        #=> true 

# IP / localhost methods 
uri.localhost? 
uri.ip_address?
uri.ipv4?
uri.ipv6? 
uri.ipv4 #=> returns IPv4 address if applicable 
uri.ipv6 #=> returns IPv6 address if applicable 
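
# For example, with a loopback address (a sketch; the return values below are
# assumptions, not verified output):
uri = UrlParser.parse('http://127.0.0.1/')
uri.ip_address?  #=> true
uri.ipv4?        #=> true
uri.ipv6?        #=> false
uri.ipv4         #=> "127.0.0.1"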

# UrlParser::URI#relative? 
uri = UrlParser.parse('/')
uri.relative?       
#=> true 

# UrlParser::URI#absolute? 
uri = UrlParser.parse('http://example.com/')
uri.absolute?       
#=> true 

# UrlParser::URI#clean - return a cleaned string 
uri = UrlParser.parse('http://example.com/?utm_source=google')
uri.clean 
#=> "http://example.com/"

# UrlParser::URI#canonical - cleans and strips the scheme 
uri = UrlParser.parse('http://example.com/?utm_source%3Danalytics')
uri.canonical
#=> "//example.com/"

# Joining URIs
uri = UrlParser.parse('http://foo.com/zee/zaw/zoom.html')
joined_uri = uri + '/bar#id'
joined_uri.to_s
#=> "http://foo.com/bar#id"

# UrlParser::URI#raw / #to_s - return the URI as a string 
uri = UrlParser.parse('http://example.com/')
uri.raw 
#=> "http://example.com/"

# Compare URIs 
# Taking into account the scheme: 
uri = UrlParser.parse('http://example.com/a/../?')
uri == 'http://example.com/'
#=> true 
uri == 'https://example.com/'
#=> false

# Ignoring the scheme: 
uri =~ 'https://example.com/'
#=> true

# UrlParser::URI#valid? - checks if URI is absolute and domain is valid 
uri = UrlParser.parse('http://example.qqq/')
uri.valid?          
#=> false 

Configuration

embedded_params

Set the params the unembed parser uses to search for embedded URIs. Default is [ 'u', 'url' ]. Set to an empty array to disable unembedding.

UrlParser.configure do |config|
  config.embedded_params = [ 'ref' ]
end

uri = UrlParser.unembed('https://www.upwork.com/leaving?ref=https%3A%2F%2Fwww.example.com')
uri.to_s 
#=> "https://www.example.com/"

default_scheme

Set a default scheme to apply when one is not present. Can be set to nil if no default scheme should be added. Default is 'http'.

UrlParser.configure do |config|
  config.default_scheme = 'https'
end

uri = UrlParser.parse('example.com')
uri.to_s 
#=> "https://example.com/"

scheme_map

Replace schemes that match keys in the map with their corresponding values. Useful for replacing invalid or outdated schemes. Default is an empty hash.

UrlParser.configure do |config|
  config.scheme_map = { 'feed' => 'http' }
end

uri = UrlParser.parse('feed://feeds.feedburner.com/YourBlog')
uri.to_s 
#=> "http://feeds.feedburner.com/YourBlog"

TODO

  • Extract URIs from text
  • Enable custom rules for normalization, canonicalization, escaping, and extraction

Contributing

  1. Fork it ( https://github.com/[my-github-username]/url_parser/fork )
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request