Class: HtmlPageTitle

Inherits:
Object
Defined in:
lib/html_page_title.rb

Overview

A simple class for finding the title of the page at a given HTTP URL. It fetches the URL, follows any redirects, and parses the final response with Hpricot.
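
The fetch-and-follow step described above can be sketched with the standard library alone. This is a minimal illustration, not the library's RedirectFollower: the helper name `follow_redirects`, the `limit` parameter, and the injectable `fetch` lambda are assumptions made for this example, and relative Location headers are not handled.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of HtmlPageTitle): follows Location headers
# until a non-redirect response arrives or the hop limit is exhausted.
# `fetch` is injectable so the logic can be exercised without network access.
def follow_redirects(url, limit = 5, fetch: ->(u) { Net::HTTP.get_response(URI(u)) })
  raise ArgumentError, 'too many redirects' if limit.zero?

  response = fetch.call(url)
  if response.is_a?(Net::HTTPRedirection)
    # Assumes an absolute URL in the Location header.
    follow_redirects(response['location'], limit - 1, fetch: fetch)
  else
    [url, response] # final URL and the response found there
  end
end
```

The returned pair corresponds roughly to what the url and body methods expose: the address that was finally reached and the response served there.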

You can either use the shorthand form or initialize the instance properly:

* HtmlPageTitle('http://github.com')
* HtmlPageTitle.new('http://github.com')

Those calls are equivalent, with one subtle difference: the shorthand form swallows SocketError and returns nil (as happens for invalid URLs), while regular instantiation via new raises the error.
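
The difference between the two forms can be illustrated without network access. The sketch below uses a hypothetical stand-in class, `StubPageTitle`, which raises SocketError for any URL containing "invalid"; the real shorthand's implementation is not shown on this page, so treat the rescue wrapper as one plausible way to get the documented behavior.

```ruby
require 'socket' # defines SocketError

# Hypothetical stand-in for HtmlPageTitle so the sketch runs offline.
class StubPageTitle
  attr_reader :original_url

  def initialize(original_url)
    @original_url = original_url
    # Simulate the name-resolution failure an invalid host would cause.
    raise SocketError, 'getaddrinfo: Name or service not known' if original_url.include?('invalid')
  end
end

# Shorthand form: same as .new, except that a SocketError becomes nil.
def StubPageTitle(url)
  StubPageTitle.new(url)
rescue SocketError
  nil
end

StubPageTitle('http://invalid.example') # => nil
# StubPageTitle.new('http://invalid.example') would raise SocketError
```

The shorthand suits call sites that can simply skip unreachable URLs, while new is the better choice when the caller needs to know why a lookup failed.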

You can get the title, the heading (the content of the first h1 tag in the body), or the label. The label is the first available of the heading, the title, or the target URL after redirects. Note that title and heading return nil when the corresponding tag cannot be found (e.g. in a non-HTML document), so label is the only method that always returns some kind of string.
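
The fallback order of label can be shown in isolation. `PageInfo` below is a hypothetical Struct used only for this illustration; it mirrors the three readers named above and applies the same heading, then title, then URL precedence.

```ruby
# Hypothetical value object standing in for a resolved HtmlPageTitle.
PageInfo = Struct.new(:heading, :title, :url) do
  # First non-nil value wins: heading, then title, then url.
  def label
    heading || title || url
  end
end

PageInfo.new('Welcome', 'Home', 'http://example.com/').label # => "Welcome"
PageInfo.new(nil, 'Home', 'http://example.com/').label       # => "Home"
PageInfo.new(nil, nil, 'http://example.com/').label          # => "http://example.com/"
```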

Constructor Details

#initialize(original_url) ⇒ HtmlPageTitle

Returns a new instance of HtmlPageTitle.


# File 'lib/html_page_title.rb', line 35

def initialize(original_url)
  @original_url = original_url
  title # fetch eagerly so errors such as SocketError are raised at construction time
end

Instance Attribute Details

#original_url ⇒ Object (readonly)

Returns the value of attribute original_url.


# File 'lib/html_page_title.rb', line 34

def original_url
  @original_url
end

Instance Method Details

#body ⇒ Object

Returns the body of the document at the target URL, after any redirects.


# File 'lib/html_page_title.rb', line 77

def body
  redirect.body
end

#document ⇒ Object


# File 'lib/html_page_title.rb', line 40

def document
  @document ||= Hpricot(redirect.body)
end

#heading ⇒ Object

Retrieves the first h1 tag in the page body and returns its content, or nil if there is none.


# File 'lib/html_page_title.rb', line 52

def heading
  return @heading if @heading
  if heading_tag = document.at('body h1')
    @heading = HTMLEntities.new.decode(heading_tag.inner_html.strip.chomp)
  end
end

#label ⇒ Object

Returns the heading, the title, or the URL, whichever is available first in that order.


# File 'lib/html_page_title.rb', line 61

def label
  heading or title or url
end

#redirect ⇒ Object

Returns the redirect follower instance used for resolving this instance's URL.


# File 'lib/html_page_title.rb', line 67

def redirect
  @redirect ||= RedirectFollower.new(original_url) # memoize so url and body share one follower
end

#title ⇒ Object


# File 'lib/html_page_title.rb', line 44

def title
  return @title if @title
  if title_tag = document.at('head title')
    @title = HTMLEntities.new.decode(title_tag.inner_html.strip.chomp)
  end
end

#url ⇒ Object

Returns the target URL after all redirects.


# File 'lib/html_page_title.rb', line 72

def url
  redirect.url
end