Class: Arachni::URI

Inherits:
Object
Extended by:
Arachni::UI::Output, Utilities
Includes:
Arachni::UI::Output, Utilities
Defined in:
lib/arachni/uri.rb

Overview

The URI class automatically normalizes the URLs it is passed to parse, while maintaining compatibility with Ruby’s core URI class by delegating missing methods to it. You can therefore treat it like a Ruby URI and get some extra functionality on top.

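It also provides cached (to keep latency low) helper class methods to ease common operations such as normalization (normalize), parsing (parse, ruby_parse, cheap_parse) and conversion to absolute URLs (to_absolute).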

Author:

Defined Under Namespace

Classes: Error

Constant Summary

CACHE_SIZES =
{
    parse:       600,
    ruby_parse:  600,
    cheap_parse: 600,
    normalize:   1000,
    to_absolute: 1000
}
CACHE =
{
    parser:      ::URI::Parser.new,
    ruby_parse:  Support::Cache::RandomReplacement.new( CACHE_SIZES[:ruby_parse] ),
    parse:       Support::Cache::RandomReplacement.new( CACHE_SIZES[:parse] ),
    cheap_parse: Support::Cache::RandomReplacement.new( CACHE_SIZES[:cheap_parse] ),
    normalize:   Support::Cache::RandomReplacement.new( CACHE_SIZES[:normalize] ),
    to_absolute: Support::Cache::RandomReplacement.new( CACHE_SIZES[:to_absolute] )
}
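
The caches above are bounded by `CACHE_SIZES`, so a full cache has to drop an entry before it can store a new one. The following is a minimal sketch of the general random-replacement idea using a hypothetical `TinyRandomCache` class. It is an assumption about how such a cache can behave, not the implementation of `Support::Cache::RandomReplacement`.

```ruby
# Hypothetical, simplified random-replacement cache (not Arachni's class).
class TinyRandomCache
    def initialize( max_size )
        @max_size = max_size
        @store    = {}
    end

    def []( key )
        @store[key]
    end

    # Stores a value, evicting one randomly chosen entry when the cache is full.
    def []=( key, value )
        if !@store.key?( key ) && @store.size >= @max_size
            @store.delete( @store.keys.sample )
        end
        @store[key] = value
    end

    def size
        @store.size
    end
end

cache = TinyRandomCache.new( 2 )
cache['a'] = 1
cache['b'] = 2
cache['c'] = 3 # evicts either 'a' or 'b'
```

Random eviction needs no bookkeeping about access order, which keeps lookups and inserts cheap.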

Class Method Summary

Instance Method Summary

Methods included from Arachni::UI::Output

debug?, debug_off, debug_on, disable_only_positives, error_logfile, flush_buffer, log_error, mute, muted?, old_reset_output_options, only_positives, only_positives?, print_bad, print_debug, print_debug_backtrace, print_debug_pp, print_error, print_error_backtrace, print_info, print_line, print_ok, print_status, print_verbose, reroute_to_file, reroute_to_file?, reset_output_options, set_buffer_cap, set_error_logfile, uncap_buffer, unmute, verbose, verbose?

Methods included from Utilities

available_port, cookie_encode, cookies_from_document, cookies_from_file, cookies_from_response, exception_jail, exclude_path?, extract_domain, follow_protocol?, form_decode, form_encode, form_parse_request_body, forms_from_document, forms_from_response, generate_token, get_path, html_decode, html_encode, include_path?, links_from_document, links_from_response, normalize_url, page_from_response, page_from_url, parse_query, parse_set_cookie, parse_url_vars, path_in_domain?, path_too_deep?, port_available?, rand_port, redundant_path?, remove_constants, seed, skip_page?, skip_path?, skip_resource?, uri_decode, uri_encode, uri_parse, uri_parser, url_sanitize

Constructor Details

#initialize(url) ⇒ URI

Normalizes and parses the provided URL.

Will discard the fragment component, if there is one.

Parameters:

  • url (String, URI, Hash, Arachni::URI)



# File 'lib/arachni/uri.rb', line 468

def initialize( url )
    @arachni_opts = Options.instance

    @parsed_url = case url
                      when String
                          self.class.ruby_parse( url )

                      when ::URI
                          url.dup

                      when Hash
                          ::URI::Generic.build( url )

                      when Arachni::URI
                          url.parsed_url.dup

                      else
                          to_string = url.to_s rescue ''
                          msg = "Argument must either be String, URI or Hash"
                          msg << " -- #{url.class.name} '#{to_string}' passed."
                          fail TypeError.new( msg )
                  end

    fail Error, 'Failed to parse URL.' if !@parsed_url
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method

#method_missing(sym, *args, &block) ⇒ Object (private)

Delegates unimplemented methods to Ruby’s `URI::Generic` class for compatibility.



# File 'lib/arachni/uri.rb', line 665

def method_missing( sym, *args, &block )
    if @parsed_url.respond_to?( sym )
        @parsed_url.send( sym, *args, &block )
    else
        super
    end
end
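
To illustrate the delegation approach in isolation, here is a minimal sketch built on Ruby’s standard `uri` library only. `UriWrapper` is a hypothetical class, not part of Arachni. The sketch also defines `respond_to_missing?`, which the listing above does not show, so that `respond_to?` agrees with what `method_missing` forwards.

```ruby
require 'uri'

# Hypothetical wrapper that forwards unknown methods to a wrapped ::URI.
class UriWrapper
    def initialize( url )
        @parsed_url = ::URI.parse( url )
    end

    private

    def method_missing( sym, *args, &block )
        if @parsed_url.respond_to?( sym )
            @parsed_url.send( sym, *args, &block )
        else
            super
        end
    end

    # Keeps #respond_to? consistent with the delegation above.
    def respond_to_missing?( sym, include_private = false )
        @parsed_url.respond_to?( sym, include_private ) || super
    end
end

wrapped = UriWrapper.new( 'http://example.com:8080/path?a=b' )
wrapped.host # delegated to the wrapped ::URI
wrapped.port # delegated to the wrapped ::URI
```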

Class Method Details

.addressable_parse(url) ⇒ Hash

Note:

The Hash is suitable for passing to `::URI::Generic.build`. If you plan on doing that, however, you are better off using ruby_parse, which does the same thing and caches the results for extra speed.

Performs a parse using the `Addressable::URI` lib while normalizing the URL (will also discard the fragment).

This method is not cached and solely exists as a fallback used by cheap_parse.

Parameters:

Returns:

  • (Hash)

    URL components:

    * `:scheme` -- HTTP or HTTPS
    * `:userinfo` -- `username:password`
    * `:host`
    * `:port`
    * `:path`
    * `:query`
    


# File 'lib/arachni/uri.rb', line 351

def self.addressable_parse( url )
    u = Addressable::URI.parse( html_decode( url.to_s ) ).normalize
    u.fragment = nil
    h = u.to_hash

    h[:path].gsub!( /\/+/, '/' ) if h[:path]
    if h[:user]
        h[:userinfo] = h.delete( :user )
        h[:userinfo] << ":#{h.delete( :password )}" if h[:password]
    end
    h
end
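
The post-processing step above (collapsing repeated slashes and folding `:user` and `:password` into `:userinfo`) can be tried without Addressable. This is a sketch on a plain Hash; `merge_userinfo` is a hypothetical helper and the input values are made up for illustration.

```ruby
# Hypothetical helper mirroring the post-processing above on a plain Hash.
def merge_userinfo( components )
    h = components.dup

    # Collapse runs of slashes in the path.
    h[:path] = h[:path].gsub( /\/+/, '/' ) if h[:path]

    if h[:user]
        userinfo = h.delete( :user ).dup
        userinfo << ":#{h.delete( :password )}" if h[:password]
        h[:userinfo] = userinfo
    end
    h
end

parts = merge_userinfo(
    { scheme: 'http', user: 'john', password: 'secret',
      host: 'example.com', path: '//a///b' }
)
parts[:userinfo] # user and password joined by ':'
parts[:path]     # repeated slashes collapsed
```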

.cheap_parse(url) ⇒ Hash

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value, duplicate it first, because there may be references to it elsewhere.

Note:

The Hash is suitable for passing to `::URI::Generic.build`. If you plan on doing that, however, you are better off using ruby_parse, which does the same thing and caches the results for extra speed.

Performs a parse that is less resource-intensive than the one in Ruby’s URI lib, while normalizing the URL (also discards the fragment and path parameters).

Parameters:

Returns:

  • (Hash)

    URL components (frozen):

    * `:scheme` -- HTTP or HTTPS
    * `:userinfo` -- `username:password`
    * `:host`
    * `:port`
    * `:path`
    * `:query`
    


# File 'lib/arachni/uri.rb', line 199

def self.cheap_parse( url )
    return if !url || url.empty?

    cache = CACHE[__method__]

    url   = url.to_s.dup
    c_url = url.to_s.dup

    components = {
        scheme:   nil,
        userinfo: nil,
        host:     nil,
        port:     nil,
        path:     nil,
        query:    nil
    }

    valid_schemes = %w(http https)

    begin
        if (v = cache[url]) && v == :err
            return
        elsif v
            return v
        end

        # we're not smart enough for scheme-less URLs and if we're to go
        # into heuristics then there's no reason to not just use Addressable's parser
        if url.start_with?( '//' )
            return cache[c_url] = addressable_parse( c_url ).freeze
        end

        url = url.encode( 'UTF-8', undef: :replace, invalid: :replace )

        # remove the fragment if there is one
        url = url.split( '#', 2 )[0...-1].join if url.include?( '#' )

        url = html_decode( url )

        dupped_url = url.dup
        has_path = true

        splits = url.split( ':' )
        if !splits.empty? && valid_schemes.include?( splits.first.downcase )
            splits = url.split( '://', 2 )
            components[:scheme] = splits.shift
            components[:scheme].downcase! if components[:scheme]

            if url = splits.shift
                splits = url.split( '?' ).first.split( '@', 2 )

                if splits.size > 1
                    components[:userinfo] = splits.first
                    url = splits.shift
                end

                if !splits.empty?
                    splits = splits.last.split( '/', 2 )
                    url = splits.last

                    splits = splits.first.split( ':', 2 )
                    if splits.size == 2
                        host = splits.first
                        components[:port] = Integer( splits.last ) if splits.last && !splits.last.empty?
                        components[:port] = nil if components[:port] == 80
                        url.gsub!( ':' + components[:port].to_s, '' )
                    else
                        host = splits.last
                    end

                    if components[:host] = host
                        url.gsub!( host, '' )
                        components[:host].downcase!
                    end
                else
                    has_path = false
                end
            else
                has_path = false
            end
        end

        if has_path
            splits = url.split( '?', 2 )
            if components[:path] = splits.shift
                components[:path] = '/' + components[:path] if components[:scheme]
                components[:path].gsub!( /\/+/, '/' )
                components[:path] =
                    encode( decode( components[:path] ),
                            Addressable::URI::CharacterClasses::PATH )
            end

            if c_url.include?( '?' ) && !(query = dupped_url.split( '?', 2 ).last).empty?
                components[:query] = (query.split( '&', -1 ).map do |pair|
                    Addressable::URI.normalize_component( pair,
                        Addressable::URI::CharacterClasses::QUERY.sub( '\\&', '' )
                    )
                end).join( '&' )
            end
        end

        components[:path] ||= components[:scheme] ? '/' : nil

        # Remove path params
        if components[:path]
            components[:path] = components[:path].split( ';', 2 ).first
        end

        components.values.each( &:freeze )

        cache[c_url] = components.freeze
    rescue => e
        begin
            print_debug "Failed to fast-parse '#{c_url}', falling back to slow-parse."
            print_debug "Error: #{e}"
            print_debug_backtrace( e )

            cache[c_url] = addressable_parse( c_url ).freeze
        rescue => ex
            print_debug "Failed to parse '#{c_url}'."
            print_debug "Error: #{ex}"
            print_debug_backtrace( ex )

            cache[c_url] = :err
            nil
        end
    end
end
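
Because the returned Hash and its values are frozen and shared through the cache, callers that need to change a component have to work on copies. The sketch below uses a hand-built stand-in Hash (an assumption, not real `cheap_parse` output) to show that `Hash#dup` unfreezes only the Hash itself, not the Strings inside it.

```ruby
# Stand-in for a cached, frozen result such as the Hash described above.
cached = {
    scheme: 'http'.freeze,
    host:   'example.com'.freeze,
    path:   '/a'.freeze
}.freeze

# Hash#dup returns an unfrozen copy, but its values are the same frozen Strings.
copy = cached.dup
copy[:path] = copy[:path] + 'b' # build a new String instead of mutating in place

begin
    cached[:path] << 'b'        # mutating the cached value raises
rescue FrozenError => e
    error = e
end
```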

.decode(string) ⇒ String

URL decodes a string.

Parameters:

Returns:



# File 'lib/arachni/uri.rb', line 108

def self.decode( string )
    Addressable::URI.unencode( string )
end

.deep_decode(string) ⇒ String

Iteratively URL decodes a String until there are no more characters to be unescaped.

Parameters:

Returns:



# File 'lib/arachni/uri.rb', line 120

def self.deep_decode( string )
    string = decode( string ) while string =~ /%[a-fA-F0-9]{2}/
    # A `while` modifier evaluates to nil, so return the decoded String explicitly.
    string
end
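
Iterative decoding matters for input that has been encoded more than once. The following standalone sketch replaces the Addressable-based decoder with a hypothetical `simple_decode` that only handles ASCII escapes. Note the explicit final `string`: a `while` modifier evaluates to `nil`, so the decoded value has to be returned separately.

```ruby
# Minimal percent-decoder used only for this sketch (ASCII escapes only).
def simple_decode( string )
    string.gsub( /%([a-fA-F0-9]{2})/ ) { $1.hex.chr }
end

# Decodes repeatedly until no percent-escapes remain, then returns the result.
def simple_deep_decode( string )
    string = simple_decode( string ) while string =~ /%[a-fA-F0-9]{2}/
    string
end

decoded = simple_deep_decode( '%2541' ) # '%2541' -> '%41' -> 'A'
```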

.encode(string, bad_characters = nil) ⇒ String

URL encodes a string.

Parameters:

  • string (String)
  • bad_characters (String, Regexp) (defaults to: nil)

    Class of characters to encode. If a String is passed, it should be formatted as a regexp (for `Regexp.new`).

Returns:



# File 'lib/arachni/uri.rb', line 97

def self.encode( string, bad_characters = nil )
    Addressable::URI.encode_component( *[string, bad_characters].compact )
end

.normalize(url) ⇒ String

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value, duplicate it first, because there may be references to it elsewhere.

Uses cheap_parse to parse and normalize the URL and then converts it to a common String format.

Parameters:

Returns:

  • (String)

    Normalized URL (frozen).



# File 'lib/arachni/uri.rb', line 415

def self.normalize( url )
    return if !url || url.empty?

    cache = CACHE[__method__]

    url   = url.to_s.strip.dup
    c_url = url.to_s.strip.dup

    begin
        if (v = cache[url]) && v == :err
            return
        elsif v
            return v
        end

        components = cheap_parse( url )

        normalized = ''
        normalized << components[:scheme] + '://' if components[:scheme]

        if components[:userinfo]
            normalized << components[:userinfo]
            normalized << '@'
        end

        if components[:host]
            normalized << components[:host]
            normalized << ':' + components[:port].to_s if components[:port]
        end

        normalized << components[:path] if components[:path]
        normalized << '?' + components[:query] if components[:query]

        cache[c_url] = normalized.freeze
    rescue => e
        print_debug "Failed to normalize '#{c_url}'."
        print_debug "Error: #{e}"
        print_debug_backtrace( e )

        cache[c_url] = :err
        nil
    end
end
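
The assembly step of the method above can be exercised on its own. This sketch uses a hypothetical `assemble_url` helper and a hand-written components Hash, and leaves out the caching and error handling.

```ruby
# Hypothetical helper: joins a components Hash into a URL String,
# in the same order as the method above.
def assemble_url( components )
    normalized = ''.dup
    normalized << components[:scheme] + '://' if components[:scheme]

    if components[:userinfo]
        normalized << components[:userinfo] << '@'
    end

    if components[:host]
        normalized << components[:host]
        normalized << ':' + components[:port].to_s if components[:port]
    end

    normalized << components[:path] if components[:path]
    normalized << '?' + components[:query] if components[:query]
    normalized
end

url = assemble_url(
    { scheme: 'https', host: 'example.com', port: 8443, path: '/a', query: 'b=1' }
)
```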

.parse(url) ⇒ Object

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value, duplicate it first, because there may be references to it elsewhere.

Cached version of #initialize. Use this method if there’s a chance that the same URL will need to be parsed multiple times.

See Also:



# File 'lib/arachni/uri.rb', line 134

def self.parse( url )
    return url if !url || url.is_a?( Arachni::URI )
    CACHE[__method__][url] ||= begin
        new( url )
    rescue => e
        print_debug "Failed to parse '#{url}'."
        print_debug "Error: #{e}"
        print_debug_backtrace( e )
        nil
    end
end
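
One consequence of memoizing with `||=` is that a `nil` result (a failed parse) is not remembered, so the work is repeated on the next call. The sketch below demonstrates this with a plain Hash and a hypothetical `expensive_parse` lambda. Neither is Arachni code.

```ruby
calls = Hash.new( 0 )
cache = {}

# Hypothetical expensive parser; returns nil for input it cannot handle.
expensive_parse = lambda do |url|
    calls[url] += 1
    url.start_with?( 'http' ) ? url.upcase : nil
end

# Same memoization shape as the method above: ||= with a begin/rescue block.
cached_parse = lambda do |url|
    cache[url] ||= begin
        expensive_parse.call( url )
    rescue StandardError
        nil
    end
end

2.times { cached_parse.call( 'http://example.com/' ) } # computed once
2.times { cached_parse.call( 'bogus' ) }               # computed twice, nil is not memoized
```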

.parser ⇒ URI::Parser

Returns cached URI parser.

Returns:

  • (URI::Parser)

    cached URI parser



# File 'lib/arachni/uri.rb', line 83

def self.parser
    CACHE[__method__]
end

.ruby_parse(url) ⇒ URI

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value, duplicate it first, because there may be references to it elsewhere.

Normalizes `url` and uses Ruby’s core URI lib to parse it.

Parameters:

  • url (String)

    URL to parse

Returns:



# File 'lib/arachni/uri.rb', line 157

def self.ruby_parse( url )
    return url if url.to_s.empty? || url.is_a?( ::URI )
    CACHE[__method__][url] ||= begin
        ::URI::Generic.build( cheap_parse( url ) )
    rescue
        begin
            parser.parse( normalize( url ).dup )
        rescue => e
            print_debug "Failed to parse '#{url}'."
            print_debug "Error: #{e}"
            print_debug_backtrace( e )
            nil
        end
    end
end
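
The first branch of the method above hands a components Hash to `::URI::Generic.build`. That step can be tried with Ruby’s standard library alone. The Hash below is hand-written for illustration and only assumes the component keys listed for cheap_parse.

```ruby
require 'uri'

# Components Hash in the shape described for cheap_parse (illustrative values).
components = {
    scheme: 'http',
    host:   'example.com',
    path:   '/search',
    query:  'q=ruby'
}

uri = ::URI::Generic.build( components )
uri.to_s # the components joined back into a URL String
```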

.to_absolute(relative, reference = Options.instance.url.to_s) ⇒ String

Note:

This method’s results are cached for performance reasons. If you plan on doing something destructive with its return value, duplicate it first, because there may be references to it elsewhere.

Normalizes and converts a `relative` URL to an absolute one by merging it with a `reference` URL.

Pretty much a cached version of #to_absolute.

Parameters:

  • relative (String)
  • reference (String) (defaults to: Options.instance.url.to_s)

    absolute url to use as a reference

Returns:

  • (String)

    absolute URL (frozen)



# File 'lib/arachni/uri.rb', line 379

def self.to_absolute( relative, reference = Options.instance.url.to_s )
    return reference if !relative || relative.empty?
    key = relative + ' :: ' + reference

    cache = CACHE[__method__]
    begin
        if (v = cache[key]) && v == :err
            return
        elsif v
            return v
        end

        parsed_ref = parse( reference )

        # scheme-less URLs are expensive to parse so let's resolve the issue here
        relative = "#{parsed_ref.scheme}:#{relative}" if relative.start_with?( '//' )

        cache[key] = parse( relative ).to_absolute( parsed_ref ).to_s.freeze
    rescue
        cache[key] = :err
        nil
    end
end
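
The resolution itself can be approximated with Ruby’s standard `uri` library. The sketch below uses a hypothetical `resolve` helper that mirrors two steps of the method above: giving scheme-less URLs the scheme of the reference, and merging the relative URL into the reference. It does no normalization or caching.

```ruby
require 'uri'

# Hypothetical helper mirroring the resolution step with Ruby's stdlib only.
def resolve( relative, reference )
    ref = ::URI.parse( reference )

    # Give scheme-less URLs ("//host/path") the scheme of the reference.
    relative = "#{ref.scheme}:#{relative}" if relative.start_with?( '//' )

    ref.merge( relative ).to_s
end

page  = resolve( '../img/logo.png', 'http://example.com/dir/page.html' )
asset = resolve( '//cdn.example.com/app.js', 'https://example.com/' )
```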

Instance Method Details

#==(other) ⇒ Object



# File 'lib/arachni/uri.rb', line 494

def ==( other )
    to_s == other.to_s
end

#domain ⇒ String

Returns domain_name.tld.

Returns:

  • (String)

    domain_name.tld



# File 'lib/arachni/uri.rb', line 547

def domain
    return host if ip_address?

    s = host.split( '.' )
    return s.first if s.size == 1
    return host if s.size == 2

    s[1..-1].join( '.' )
end
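
The label-dropping logic above is easy to check with a standalone sketch. `domain_of` is a hypothetical helper that takes a host String directly. The last call shows a limitation of this approach: multi-part TLDs are not special-cased.

```ruby
require 'ipaddr'

# Hypothetical standalone version of the logic above, taking a host String.
def domain_of( host )
    return host if (IPAddr.new( host ) rescue nil)

    s = host.split( '.' )
    return s.first if s.size == 1
    return host if s.size == 2

    s[1..-1].join( '.' )
end

domain_of( 'www.example.com' ) # drops the first label
domain_of( 'example.com' )     # two labels are returned as-is
domain_of( 'localhost' )       # single label
domain_of( '127.0.0.1' )       # IP addresses are returned untouched
domain_of( 'example.co.uk' )   # multi-part TLDs are not special-cased
```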

#exclude?(patterns) ⇒ Bool

Checks if self should be excluded based on the provided `patterns`.

Parameters:

Returns:

  • (Bool)

    `true` if self matches a pattern, `false` otherwise.



# File 'lib/arachni/uri.rb', line 581

def exclude?( patterns )
    fail TypeError.new( 'Array<Regexp,String> expected, got nil instead' ) if patterns.nil?
    ensure_patterns( patterns ).each { |pattern| return true if to_s =~ pattern }
    false
end

#hash ⇒ Object



# File 'lib/arachni/uri.rb', line 633

def hash
    to_s.hash
end

#in_domain?(include_subdomain, other) ⇒ Bool

Returns `true` if self is in the same domain as the `other` URL, `false` otherwise.

Parameters:

  • include_subdomain (Bool)

    Match subdomains too? If true will compare full hostnames, otherwise will discard subdomains.

  • other (Arachni::URI, URI, Hash, String)

    Reference URL.

Returns:

  • (Bool)

    `true` if self is in the same domain as the `other` URL, `false` otherwise.



# File 'lib/arachni/uri.rb', line 615

def in_domain?( include_subdomain, other )
    return true if !other

    other = self.class.new( other ) if !other.is_a?( Arachni::URI )
    include_subdomain ? other.host == host : other.domain == domain
rescue
    false
end

#include?(patterns) ⇒ Bool

Checks if self should be included based on the provided `patterns`.

Parameters:

Returns:

  • (Bool)

    `true` if self matches a pattern (or `patterns` are `nil` or empty), `false` otherwise.



# File 'lib/arachni/uri.rb', line 596

def include?( patterns )
    fail TypeError.new( 'Array<Regexp,String> expected, got nil instead' ) if patterns.nil?

    rules = ensure_patterns( patterns )
    return true if !rules || rules.empty?

    rules.each { |pattern| return true if to_s =~ pattern }
    false
end
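
Both #exclude? and #include? boil down to matching the URL String against a list of patterns. The sketch below shows that matching step. `to_patterns` is a hypothetical stand-in for the private `ensure_patterns` helper, which is not shown in this section, so its behavior here (wrapping the input in an Array and converting Strings to Regexps) is an assumption.

```ruby
# Hypothetical stand-in for ensure_patterns: always returns an Array of Regexps.
def to_patterns( patterns )
    [patterns].flatten.compact.map do |pattern|
        pattern.is_a?( Regexp ) ? pattern : Regexp.new( pattern.to_s )
    end
end

# True if the URL matches at least one of the patterns.
def matches_any?( url, patterns )
    to_patterns( patterns ).any? { |pattern| url =~ pattern }
end

url = 'http://example.com/admin/logout'
matches_any?( url, [/logout/, 'signout'] ) # first pattern matches
matches_any?( url, 'static' )              # nothing matches
```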

#ip_address? ⇒ Boolean

Returns `true` if the URI contains an IP address, `false` otherwise.

Returns:

  • (Boolean)

    `true` if the URI contains an IP address, `false` otherwise.



# File 'lib/arachni/uri.rb', line 559

def ip_address?
    !(IPAddr.new( host ) rescue nil).nil?
end

#mailto? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/arachni/uri.rb', line 624

def mailto?
    scheme == 'mailto'
end

#persistent_hash ⇒ Object



# File 'lib/arachni/uri.rb', line 637

def persistent_hash
    to_s.persistent_hash
end

#resource_extension ⇒ String

Returns the extension of the URI resource.

Returns:

  • (String)

    The extension of the URI resource.



# File 'lib/arachni/uri.rb', line 525

def resource_extension
    resource_name = path.split( '/' ).last.to_s
    return if !resource_name.include?( '.' )
    resource_name.split( '.' ).last
end
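
A standalone sketch of the same logic, using a hypothetical `extension_of` helper that takes a path String, makes the edge cases visible: only the part after the last dot counts, and a path without a resource name yields `nil`.

```ruby
# Hypothetical standalone version operating on a path String.
def extension_of( path )
    resource_name = path.split( '/' ).last.to_s
    return if !resource_name.include?( '.' )
    resource_name.split( '.' ).last
end

extension_of( '/dir/archive.tar.gz' ) # only the part after the last dot
extension_of( '/dir/page.php' )
extension_of( '/dir/' )               # no dot in the last segment, so nil
```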

#to_absolute(reference) ⇒ Arachni::URI

Converts self into an absolute URL using `reference` to fill in the missing data.

Parameters:

Returns:



# File 'lib/arachni/uri.rb', line 505

def to_absolute( reference )
    absolute = case reference
                   when Arachni::URI
                       reference.parsed_url
                   when ::URI
                       reference
                   else
                       self.class.new( reference.to_s ).parsed_url
               end.merge( @parsed_url )

    self.class.new( absolute )
end

#to_s ⇒ String

Returns URL.

Returns:



# File 'lib/arachni/uri.rb', line 629

def to_s
    @parsed_url.to_s
end

#too_deep?(depth) ⇒ Bool

Checks if self exceeds a given directory `depth`.

Parameters:

  • depth (Integer)

    Depth to check for.

Returns:

  • (Bool)

    `true` if self is deeper than `depth`, `false` otherwise.



# File 'lib/arachni/uri.rb', line 570

def too_deep?( depth )
    # Coerce depth consistently so that non-Integer input (e.g. a String) also works.
    depth.to_i > 0 && (depth.to_i + 1) <= path.to_s.count( '/' )
end
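
The depth check counts the slashes in the path. The sketch below redefines it as a hypothetical top-level method that takes the path as an argument, so that a few cases can be tried directly. It coerces `depth` with `to_i` in both places.

```ruby
# Hypothetical standalone version taking the path as an argument.
def too_deep?( path, depth )
    depth.to_i > 0 && (depth.to_i + 1) <= path.to_s.count( '/' )
end

too_deep?( '/a/b/c/page', 2 ) # 4 slashes against a limit of 2
too_deep?( '/a/b', 2 )        # 2 slashes, within the limit
too_deep?( '/a/b/c/page', 0 ) # a depth of 0 disables the check
```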

#up_to_path ⇒ String

Returns the URL up to its path component (no resource name, query, fragment, etc).

Returns:

  • (String)

    The URL up to its path component (no resource name, query, fragment, etc).



# File 'lib/arachni/uri.rb', line 533

def up_to_path
    return if !path
    uri_path = path.dup

    uri_path = File.dirname( uri_path ) if !File.extname( path ).empty?

    uri_path << '/' if uri_path[-1] != '/'

    uri_str = "#{scheme}://#{host}"
    uri_str << ':' + port.to_s if port && port != 80
    uri_str << uri_path
end
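
The path handling above relies on `File.extname` to decide whether the last segment is a resource. The following sketch isolates that step in a hypothetical `directory_path` helper. It shows that a last segment without an extension is treated as a directory.

```ruby
# Hypothetical standalone version of the path handling above.
def directory_path( path )
    uri_path = path.dup

    # Strip the resource name only when the path ends in something with an extension.
    uri_path = File.dirname( uri_path ) if !File.extname( path ).empty?

    uri_path << '/' if uri_path[-1] != '/'
    uri_path
end

directory_path( '/dir/page.php' ) # resource name removed
directory_path( '/dir/sub' )      # no extension, treated as a directory
directory_path( '/index.html' )   # resource directly under the root
```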

#without_query ⇒ String

Returns the URL up to its resource component, without the query, fragment, etc.

Returns:

  • (String)

    The URL up to its resource component, without the query, fragment, etc.



# File 'lib/arachni/uri.rb', line 520

def without_query
    to_s.split( '?', 2 ).first.to_s
end