Class: MetaInspector::Scraper

Inherits: Object

Defined in: lib/meta_inspector/scraper.rb

Instance Attribute Summary

#allow_redirections ⇒ Object readonly

Returns the value of attribute allow_redirections.
#content_type ⇒ Object readonly

Returns the content_type of the fetched document.
#errors ⇒ Object readonly

Returns the value of attribute errors.
#host ⇒ Object readonly

Returns the value of attribute host.
#html_content_only ⇒ Object readonly

Returns the value of attribute html_content_only.
#root_url ⇒ Object readonly

Returns the value of attribute root_url.
#scheme ⇒ Object readonly

Returns the value of attribute scheme.
#timeout ⇒ Object readonly

Returns the value of attribute timeout.
#url ⇒ Object readonly

Returns the value of attribute url.
#verbose ⇒ Object readonly

Returns the value of attribute verbose.

Instance Method Summary

#charset ⇒ Object

Returns the charset from the meta tags, looking for it in the following order: first <meta charset='utf-8' />, then <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />.
#description ⇒ Object

Returns the meta description if one is present; otherwise guesses a description from the first paragraph with more than 120 characters.
#document ⇒ Object

Returns the original, unparsed document.
#external_links ⇒ Object

External links found on the page, as absolute URLs.
#feed ⇒ Object

Returns the feed link from the parsed document's meta tags, trying RSS first and then Atom.
#image ⇒ Object

Returns the image from Facebook's Open Graph property tags. Most major websites now define this property, and it is usually very relevant. See the documentation at developers.facebook.com/docs/opengraph/.
#images ⇒ Object

Images found on the page, as absolute URLs.
#initialize(url, options = {}) ⇒ Scraper constructor

Initializes a new instance of MetaInspector::Scraper, setting the URL to the one given. Options include timeout (defaults to 20 seconds) and html_content_only (whether an exception should be raised if the request content-type is not text/html).
#internal_links ⇒ Object

Internal links found on the page, as absolute URLs.
#links ⇒ Object

Links found on the page, as absolute URLs.
#ok? ⇒ Boolean

Returns true if there are no errors.
#parsed_document ⇒ Object

Returns the whole parsed document.
#title ⇒ Object

Returns the parsed document title, from the content of the <title> tag.
#to_hash ⇒ Object

Returns all parsed data as a nested Hash.

Constructor Details

#initialize(url, options = {}) ⇒ `Scraper`

Initializes a new instance of MetaInspector::Scraper, setting the URL to the one given. Options:

> timeout: defaults to 20 seconds

> html_content_only: whether an exception should be raised if the request content-type is not text/html. Defaults to false

> allow_redirections: when :safe, allows HTTP => HTTPS redirections. When :all, it also allows HTTPS => HTTP

> document: the HTML of the URL as a string

> verbose: whether errors should be logged to the screen

# File 'lib/meta_inspector/scraper.rb', line 22

def initialize(url, options = {})
  options   = defaults.merge(options)

  @url      = with_default_scheme(encode_url(url))
  @scheme   = URI.parse(@url).scheme
  @host     = URI.parse(@url).host
  @root_url = "#{@scheme}://#{@host}/"
  @timeout  = options[:timeout]
  @data     = Hashie::Rash.new
  @errors   = []
  @html_content_only  = options[:html_content_only]
  @allow_redirections = options[:allow_redirections]
  @verbose            = options[:verbose]
  @document           = options[:document]
end
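The constructor derives scheme, host and root_url from the normalized URL. The following is a minimal, stdlib-only sketch of that derivation. The with_default_scheme helper shown here is a hypothetical stand-in (the library's own private helper is not shown in this page), and it assumes http is the default scheme; url_parts is an illustrative name, not part of the library.

```ruby
require "uri"

# Hypothetical stand-in for the private helper of the same name:
# prepends "http://" when the URL carries no scheme (assumed default).
def with_default_scheme(url)
  url =~ %r{\A[a-z][a-z0-9+.-]*://}i ? url : "http://#{url}"
end

# Illustrative helper (not part of the library): mirrors how the
# constructor fills @url, @scheme, @host and @root_url.
def url_parts(url)
  full = with_default_scheme(url)
  uri  = URI.parse(full)

  {
    url:      full,
    scheme:   uri.scheme,
    host:     uri.host,
    root_url: "#{uri.scheme}://#{uri.host}/"
  }
end

parts = url_parts("example.com/path?q=1")
puts parts[:root_url]
```

Note that root_url always ends in a slash and drops the path and query, which is what makes it usable as a base for relative links.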

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

#method_missing(method_name) ⇒ `Object` (private)

Scrapers for all meta tags are automatically defined in the form meta_name. This has been tested for meta name: keywords, description, robots, generator; and for meta http-equiv: content-language, Content-Type.

It will first try with meta name="…" and, if nothing is found, with meta http-equiv="…", substituting "_" by "-". TODO: define respond_to? to return true on the meta_name methods.

# File 'lib/meta_inspector/scraper.rb', line 151

def method_missing(method_name)
  if method_name.to_s =~ /^meta_(.*)/
    key = $1
    key = "og:#{$1}" if key =~ /^og_(.*)/ # special treatment for og:

    scrape_meta_data

    @data.meta.name && (@data.meta.name[key.downcase]) || (@data.meta.property && @data.meta.property[key.downcase])
  else
    super
  end
end
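One way the TODO above could be addressed is by pairing method_missing with respond_to_missing?, which is the standard Ruby hook for this purpose. The sketch below is not the library's code: MetaLookup is a hypothetical class that takes its meta data as plain hashes instead of scraping a page, so that the dispatch logic can be shown in isolation.

```ruby
# Hypothetical stand-in for the scraper: meta data is supplied up front
# instead of being scraped from a document.
class MetaLookup
  def initialize(names: {}, properties: {})
    @names      = names       # e.g. { "keywords" => "ruby, scraping" }
    @properties = properties  # e.g. { "og:image" => "http://example.com/a.png" }
  end

  # Makes respond_to?(:meta_keywords) return true for any meta_* name.
  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.start_with?("meta_") || super
  end

  private

  def method_missing(method_name, *args)
    match = /\Ameta_(.+)\z/.match(method_name.to_s)
    return super unless match

    # meta_og_image is looked up as "og:image", as in the method above.
    key = match[1].sub(/\Aog_/, "og:").downcase
    @names[key] || @properties[key]
  end
end

lookup = MetaLookup.new(names: { "keywords" => "ruby, scraping" })
puts lookup.respond_to?(:meta_keywords)
puts lookup.meta_keywords
```

Because respond_to_missing? accepts every meta_* name, respond_to? returns true even for tags the page does not define; the call itself then returns nil.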

Instance Attribute Details

#allow_redirections ⇒ `Object` (readonly)

Returns the value of attribute allow_redirections.




# File 'lib/meta_inspector/scraper.rb', line 13

def allow_redirections
  @allow_redirections
end

#content_type ⇒ `Object` (readonly)

Returns the content_type of the fetched document.

# File 'lib/meta_inspector/scraper.rb', line 125

def content_type
  @content_type
end

#errors ⇒ `Object` (readonly)

Returns the value of attribute errors.




# File 'lib/meta_inspector/scraper.rb', line 12

def errors
  @errors
end

#host ⇒ `Object` (readonly)

Returns the value of attribute host.




# File 'lib/meta_inspector/scraper.rb', line 12

def host
  @host
end

#html_content_only ⇒ `Object` (readonly)

Returns the value of attribute html_content_only.




# File 'lib/meta_inspector/scraper.rb', line 12

def html_content_only
  @html_content_only
end

#root_url ⇒ `Object` (readonly)

Returns the value of attribute root_url.




# File 'lib/meta_inspector/scraper.rb', line 12

def root_url
  @root_url
end

#scheme ⇒ `Object` (readonly)

Returns the value of attribute scheme.




# File 'lib/meta_inspector/scraper.rb', line 12

def scheme
  @scheme
end

#timeout ⇒ `Object` (readonly)

Returns the value of attribute timeout.




# File 'lib/meta_inspector/scraper.rb', line 12

def timeout
  @timeout
end

#url ⇒ `Object` (readonly)

Returns the value of attribute url.




# File 'lib/meta_inspector/scraper.rb', line 12

def url
  @url
end

#verbose ⇒ `Object` (readonly)

Returns the value of attribute verbose.




# File 'lib/meta_inspector/scraper.rb', line 13

def verbose
  @verbose
end

Instance Method Details

#charset ⇒ `Object`

Returns the charset from the meta tags, looking for it in the following order: first <meta charset='utf-8' />, then <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />.

# File 'lib/meta_inspector/scraper.rb', line 85

def charset
  @charset ||= (charset_from_meta_charset || charset_from_content_type)
end
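The lookup order can be illustrated with a simplified sketch. The two helpers below reuse the names seen in the method above, but their bodies are assumptions: they run regular expressions over an HTML string purely for illustration, and the library's own private helpers are not shown in this page. detect_charset is a hypothetical wrapper name.

```ruby
# Assumed, regex-based stand-in: reads <meta charset="...">.
def charset_from_meta_charset(html)
  html[/<meta\s+charset=["']?([^"'\s\/>]+)/i, 1]
end

# Assumed, regex-based stand-in: reads the charset parameter from
# <meta http-equiv="Content-Type" content="...; charset=...">.
def charset_from_content_type(html)
  tag = html[/<meta[^>]+http-equiv=["']?Content-Type["']?[^>]*>/i]
  tag && tag[/charset=([^"'\s;\/>]+)/i, 1]
end

# Hypothetical wrapper: the first source that yields a value wins.
def detect_charset(html)
  charset_from_meta_charset(html) || charset_from_content_type(html)
end

html = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />'
puts detect_charset(html)
```

The || chain is what gives <meta charset> priority: the Content-Type tag is only consulted when the first lookup returns nil.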

#description ⇒ `Object`

Returns the meta description if one is present; otherwise guesses a description from the first paragraph with more than 120 characters.

# File 'lib/meta_inspector/scraper.rb', line 46

def description
  meta_description.nil? ? secondary_description : meta_description
end
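The fallback can be sketched in isolation. The secondary_description helper is not shown in this page, so the version below is an assumption based only on the description above; pick_description is a hypothetical name, and paragraphs are passed in as plain strings rather than read from a parsed document.

```ruby
# Hypothetical helper: prefers the meta description and otherwise falls
# back to the first paragraph longer than 120 characters (assumed rule).
# Returns nil when neither source yields a value.
def pick_description(meta_description, paragraphs)
  return meta_description unless meta_description.nil?

  paragraphs.find { |text| text.length > 120 }
end

paragraphs = ["Short intro.", "x" * 121, "y" * 200]
puts pick_description(nil, paragraphs).length
```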

#document ⇒ `Object`

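Returns the original, unparsed document. When html_content_only is set and the content type is not text/html, the resulting exception is recorded through add_fatal_error instead of propagating.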

# File 'lib/meta_inspector/scraper.rb', line 114

def document
  @document ||= if html_content_only && content_type != "text/html"
                  raise "The url provided contains #{content_type} content instead of text/html content" and nil
                else
                  request.read
                end
  rescue Exception => e
    add_fatal_error "Scraping exception: #{e.message}"
end

#external_links ⇒ `Object`

External links found on the page, as absolute URLs.

# File 'lib/meta_inspector/scraper.rb', line 61

def external_links
  @external_links ||= links.select {|link| host_from_url(link) != host }
end

#feed ⇒ `Object`

Returns the feed link from the parsed document's meta tags, trying RSS first and then Atom.

# File 'lib/meta_inspector/scraper.rb', line 78

def feed
  @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
end

#image ⇒ `Object`

Returns the image from Facebook's Open Graph property tags. Most major websites now define this property, and it is usually very relevant. See the documentation at developers.facebook.com/docs/opengraph/.

# File 'lib/meta_inspector/scraper.rb', line 73

def image
  meta_og_image
end

#images ⇒ `Object`

Images found on the page, as absolute URLs.

# File 'lib/meta_inspector/scraper.rb', line 66

def images
  @images ||= parsed_images.map{ |i| absolutify_url(i) }
end

#internal_links ⇒ `Object`

Internal links found on the page, as absolute URLs.

# File 'lib/meta_inspector/scraper.rb', line 56

def internal_links
  @internal_links ||= links.select {|link| host_from_url(link) == host }
end
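Both #internal_links and #external_links classify a link by comparing its host with the scraper's host. A minimal sketch of that comparison follows. The host_from_url body is an assumption (the library's private helper is not shown here), and split_links is a hypothetical name that returns both groups at once.

```ruby
require "uri"

# Assumed implementation: the host part of the URL, or nil when the
# URL cannot be parsed.
def host_from_url(url)
  URI.parse(url).host
rescue URI::InvalidURIError
  nil
end

# Hypothetical helper: splits links into same-host and other-host groups.
def split_links(links, host)
  internal, external = links.partition { |link| host_from_url(link) == host }
  { internal: internal, external: external }
end

links = ["http://example.com/about", "http://other.org/page"]
result = split_links(links, "example.com")
puts result[:internal].inspect
puts result[:external].inspect
```

With an exact string comparison like this one, a subdomain such as www.example.com counts as a different host from example.com.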

#links ⇒ `Object`

Links found on the page, as absolute URLs.

# File 'lib/meta_inspector/scraper.rb', line 51

def links
  @links ||= parsed_links.map{ |l| absolutify_url(unrelativize_url(l)) }.compact
end
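Turning relative hrefs into absolute URLs can be sketched with URI.join from the standard library. This is only an illustration of the idea: absolutify_url and unrelativize_url are private helpers whose bodies are not shown in this page, so the absolutify function below is a hypothetical substitute and may differ from them in edge cases.

```ruby
require "uri"

# Hypothetical substitute for the private helpers: resolves an href
# against the root URL, returning nil when the href cannot be parsed.
def absolutify(root_url, href)
  URI.join(root_url, href).to_s
rescue URI::Error
  nil
end

# map + compact mirrors the shape of #links: unresolvable hrefs are dropped.
def absolute_links(root_url, hrefs)
  hrefs.map { |href| absolutify(root_url, href) }.compact
end

hrefs = ["/about", "//cdn.example.com/x.js", "http://other.org/b", "bad href"]
puts absolute_links("http://example.com/", hrefs).inspect
```

URI.join follows the usual reference-resolution rules, so a protocol-relative href such as //cdn.example.com/x.js inherits the scheme of the root URL.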

#ok? ⇒ `Boolean`

Returns true if there are no errors.

Returns:

(Boolean)

# File 'lib/meta_inspector/scraper.rb', line 130

def ok?
  errors.empty?
end
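The pattern behind #ok? is to collect errors rather than let them propagate, then let the caller check afterwards. A self-contained sketch of that pattern follows. SafeFetcher is a hypothetical class, and the add_fatal_error body is an assumption (the library's version is not shown here and may also log when verbose is set). The sketch rescues StandardError rather than Exception, which is a deliberate difference from the listings in this page.

```ruby
# Hypothetical class: wraps a loader block and records failures
# instead of raising them.
class SafeFetcher
  attr_reader :errors

  def initialize(&loader)
    @loader = loader
    @errors = []
  end

  # Returns the loaded document, or nil after recording the error.
  def document
    @document ||= @loader.call
  rescue StandardError => e
    add_fatal_error "Scraping exception: #{e.message}"
  end

  def ok?
    errors.empty?
  end

  private

  # Assumed behavior: store the message and return nil.
  def add_fatal_error(message)
    @errors << message
    nil
  end
end

fetcher = SafeFetcher.new { raise "connection refused" }
fetcher.document
puts fetcher.ok?
puts fetcher.errors.inspect
```

A caller would typically trigger the work first (here by calling document) and only then consult ok?, since no error can be recorded before anything has been fetched.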

#parsed_document ⇒ `Object`

Returns the whole parsed document.

# File 'lib/meta_inspector/scraper.rb', line 107

def parsed_document
  @parsed_document ||= Nokogiri::HTML(document)
  rescue Exception => e
    add_fatal_error "Parsing exception: #{e.message}"
end

#title ⇒ `Object`

Returns the parsed document title, from the content of the <title> tag. This is not the same as the meta_title tag.

# File 'lib/meta_inspector/scraper.rb', line 40

def title
  @title ||= parsed_document.css('title').inner_html.gsub(/\t|\n|\r/, '') rescue nil
end

#to_hash ⇒ `Object`

Returns all parsed data as a nested Hash.

# File 'lib/meta_inspector/scraper.rb', line 90

def to_hash
  scrape_meta_data

  {
    'url' => url,
    'title' => title,
    'links' => links,
    'internal_links' => internal_links,
    'external_links' => external_links,
    'images' => images,
    'charset' => charset,
    'feed' => feed,
    'content_type' => content_type
  }.merge @data.to_hash
end
