Class: ContentUrls

Inherits:
Object
Defined in:
lib/content_urls.rb,
lib/content_urls/version.rb,
lib/content_urls/parsers/css_parser.rb,
lib/content_urls/parsers/html_parser.rb,
lib/content_urls/parsers/java_script_parser.rb

Overview

ContentUrls parses various file types (HTML, CSS, JavaScript, …) for URLs and provides methods for iterating through URLs and changing URLs.

Defined Under Namespace

Modules: Version

Classes: CssParser, HtmlParser, JavaScriptParser

Class Method Details

.rewrite_each_url(content, type, &block) ⇒ Object

Rewrites each URL in the content by calling the supplied block with each URL.

Examples:

Rewrite URLs in HTML code

content = '<html><a href="index.htm">Home</a></html>'
content = ContentUrls.rewrite_each_url(content, 'text/html') {|url| 'gone.html'}
puts "Rewritten: #{content}"
# => "Rewritten: <html><a href="gone.html">Home</a></html>"

Parameters:

  • content (String)

    the content to rewrite (HTML in the example above).

  • type (String)

    the media type of the content.

# File 'lib/content_urls.rb', line 49

def self.rewrite_each_url(content, type, &block)
  if (parser = get_parser(type))
    parser.rewrite_each_url(content) do |url|
      replacement = yield url
      (replacement.nil? ? url : replacement)
    end
  end
  content
end
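
The fallback in the source above means a block that returns nil leaves the URL unchanged. The following is a minimal sketch of that contract only: rewrite_hrefs is a hypothetical stand-in that handles just href attributes with a regular expression, and it is not how the library's parsers work (they are not shown in this section).

```ruby
# Hypothetical stand-in for a parser: rewrites href="..." values only.
# It illustrates the block contract, not ContentUrls' actual parsing.
def rewrite_hrefs(content)
  content.gsub(/href="([^"]*)"/) do
    url = Regexp.last_match(1)
    replacement = yield url
    # Same fallback as in rewrite_each_url: nil keeps the original URL.
    %(href="#{replacement.nil? ? url : replacement}")
  end
end

html = '<a href="index.htm">Home</a> <a href="about.html">About</a>'

# Rewrite only ".htm" links; returning nil leaves every other URL untouched.
rewritten = rewrite_hrefs(html) do |url|
  url.end_with?('.htm') ? url.sub(/\.htm\z/, '.html') : nil
end

puts rewritten
# => <a href="index.html">Home</a> <a href="about.html">About</a>
```

Returning nil for URLs that should stay as they are keeps the block short, since only the rewrite cases need to produce a value.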

.to_absolute(url, base_url) ⇒ Object

Converts a relative URL to an absolute URL using base_url (for example, the content’s original location, or the href attribute of an HTML document’s base tag). Returns nil if url is nil.

Examples:

Obtain the absolute URL of “../index.html” for a page obtained from “http://example.com/folder/sample.html”

puts ContentUrls.to_absolute("../index.html", "http://example.com/folder/sample.html")
# => "http://example.com/index.html"

# File 'lib/content_urls.rb', line 65

def self.to_absolute(url, base_url)
  return nil if url.nil?

  url = URI.encode(URI.decode(url.to_s.gsub(/#[a-zA-Z0-9_-]*$/,'')))  # remove anchor
  absolute = URI(base_url).merge(url)
  absolute.path = '/' if absolute.path.empty?
  absolute.to_s
end
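
The source above calls URI.encode and URI.decode, which were removed in Ruby 3.0, so it will not run unmodified on current Ruby versions. The following is a minimal sketch of the same resolution step using only URI#merge. absolute_url is a hypothetical helper, not part of ContentUrls. It assumes the input is already validly escaped, so it skips the decode/encode normalization, and it strips any fragment rather than only the simple anchors matched by the original pattern.

```ruby
require 'uri'

# Hypothetical helper, not part of ContentUrls: resolves url against
# base_url with URI#merge after dropping any "#fragment" suffix.
def absolute_url(url, base_url)
  return nil if url.nil?

  without_fragment = url.to_s.sub(/#.*\z/, '')
  absolute = URI(base_url).merge(without_fragment)
  # Normalize a host-only result such as "http://example.com" to end in "/".
  absolute.path = '/' if absolute.path.empty?
  absolute.to_s
end

puts absolute_url('../index.html', 'http://example.com/folder/sample.html')
# => http://example.com/index.html
puts absolute_url('page.html#top', 'http://example.com/folder/sample.html')
# => http://example.com/folder/page.html
```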

.urls(content, type) ⇒ Array

Returns the URLs found in the content.

Examples:

Parse HTML code for URLs

content = '<html><a href="index.html">Home</a></html>'
ContentUrls.urls(content, 'text/html').each do |url|
  puts "Found URL: #{url}"
end
# => "Found URL: index.html"

Parse content obtained from a robot

require 'net/http'

response = Net::HTTP.get_response(URI('http://example.com/sample-1'))
puts "URLs found at http://example.com/sample-1:"
ContentUrls.urls(response.body, response.content_type).each do |url|
  puts "  #{url}"
end
# => [a list of URLs found in the content located at http://example.com/sample-1]

Parameters:

  • content (String)

    the content.

  • type (String)

    the media type of the content.

Returns:

  • (Array)

    the unique URLs found in the content.

# File 'lib/content_urls.rb', line 29

def self.urls(content, type)
  urls = []
  if (parser = get_parser(type))
    parser.urls(content).each { |url| urls << url }
  end
  urls
end
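
The source above returns an empty array when get_parser finds no parser for the media type. The following is a minimal sketch of that dispatch pattern. PARSERS and find_urls are hypothetical names, and the regular expressions are simplified stand-ins for the library's parser classes, whose implementations are not shown in this section.

```ruby
# Hypothetical registry standing in for the get_parser lookup.
# Each entry maps a media type to a callable that extracts URLs.
PARSERS = {
  'text/html' => ->(content) { content.scan(/(?:href|src)="([^"]*)"/).flatten },
  'text/css'  => ->(content) { content.scan(/url\(["']?([^"')]+)["']?\)/).flatten }
}.freeze

def find_urls(content, type)
  parser = PARSERS[type]
  return [] unless parser # unknown media type: no parser, so no URLs

  parser.call(content).uniq # report each URL once
end

p find_urls('<a href="a.html">A</a><img src="b.png"><a href="a.html">Again</a>', 'text/html')
# => ["a.html", "b.png"]
p find_urls('body { background: url("bg.png"); }', 'text/css')
# => ["bg.png"]
p find_urls('binary data', 'application/octet-stream')
# => []
```

Returning an empty array for unsupported types lets callers iterate over the result without checking the media type first.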