Class: MetaInspector::Scraper
- Inherits: Object
- Defined in: lib/meta_inspector/scraper.rb
Instance Attribute Summary
- #allow_redirections ⇒ Object (readonly)
  Returns the value of attribute allow_redirections.
- #content_type ⇒ Object (readonly)
  Returns the content_type of the fetched document.
- #errors ⇒ Object (readonly)
  Returns the value of attribute errors.
- #host ⇒ Object (readonly)
  Returns the value of attribute host.
- #html_content_only ⇒ Object (readonly)
  Returns the value of attribute html_content_only.
- #root_url ⇒ Object (readonly)
  Returns the value of attribute root_url.
- #scheme ⇒ Object (readonly)
  Returns the value of attribute scheme.
- #timeout ⇒ Object (readonly)
  Returns the value of attribute timeout.
- #url ⇒ Object (readonly)
  Returns the value of attribute url.
- #verbose ⇒ Object (readonly)
  Returns the value of attribute verbose.
Instance Method Summary
- #charset ⇒ Object
  Returns the charset from the meta tags, looking first for <meta charset='utf-8' /> and then for <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />.
- #description ⇒ Object
  Returns the meta description if present; otherwise guesses one from the first paragraph with more than 120 characters.
- #document ⇒ Object
  Returns the original, unparsed document.
- #external_links ⇒ Object
  External links found on the page, as absolute URLs.
- #feed ⇒ Object
  Returns the document's RSS feed link, falling back to the Atom feed link.
- #image ⇒ Object
  Returns the parsed image from Facebook's Open Graph property tags. Most major websites now define this property, and it is usually very relevant. See the documentation at developers.facebook.com/docs/opengraph/.
- #images ⇒ Object
  Images found on the page, as absolute URLs.
- #initialize(url, options = {}) ⇒ Scraper (constructor)
  Initializes a new instance of MetaInspector::Scraper for the given URL. Options include timeout (defaults to 20 seconds) and html_content_only (whether to raise an exception if the response content-type is not text/html).
- #internal_links ⇒ Object
  Internal links found on the page, as absolute URLs.
- #links ⇒ Object
  Links found on the page, as absolute URLs.
- #ok? ⇒ Boolean
  Returns true if there are no errors.
- #parsed_document ⇒ Object
  Returns the whole parsed document.
- #title ⇒ Object
  Returns the parsed document title, from the content of the <title> tag.
- #to_hash ⇒ Object
  Returns all parsed data as a nested Hash.
Constructor Details
#initialize(url, options = {}) ⇒ Scraper
Initializes a new instance of MetaInspector::Scraper for the given URL. Options:
> timeout: request timeout in seconds. Defaults to 20
> html_content_only: whether to raise an exception if the response content-type is not text/html. Defaults to false
> allow_redirections: when :safe, allows HTTP => HTTPS redirections. When :all, it also allows HTTPS => HTTP
> document: the HTML of the URL as a string
> verbose: whether errors should be logged to the screen
# File 'lib/meta_inspector/scraper.rb', line 22
def initialize(url, options = {})
  options = defaults.merge(options)
  @url = with_default_scheme(encode_url(url))
  @scheme = URI.parse(@url).scheme
  @host = URI.parse(@url).host
  @root_url = "#{@scheme}://#{@host}/"
  @timeout = options[:timeout]
  @data = Hashie::Rash.new
  @errors = []
  @html_content_only = options[:html_content_only]
  @allow_redirections = options[:allow_redirections]
  @verbose = options[:verbose]
  @document = options[:document]
end
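The constructor's handling of defaults and URL parts can be illustrated with a small standalone sketch. It uses only the standard library. The `DEFAULTS` hash, the `scraper_settings` method, and the default-scheme rule (prepend `http://` when no scheme is present) are assumptions made for illustration; the private `defaults`, `with_default_scheme`, and `encode_url` helpers are not shown on this page and may behave differently.

```ruby
require 'uri'

# Only the two defaults stated in the option list are included here.
DEFAULTS = { timeout: 20, html_content_only: false }.freeze

# Hypothetical helper: mirrors how the constructor derives its attributes.
def scraper_settings(url, options = {})
  options = DEFAULTS.merge(options) # caller-supplied values win over defaults
  # Assumed default-scheme behavior: add "http://" when the URL has no scheme.
  url = "http://#{url}" unless url =~ %r{\A[a-z][a-z0-9+.-]*://}i
  uri = URI.parse(url)
  {
    url: url,
    scheme: uri.scheme,
    host: uri.host,
    root_url: "#{uri.scheme}://#{uri.host}/",
    timeout: options[:timeout],
    html_content_only: options[:html_content_only]
  }
end

settings = scraper_settings('example.com/about', timeout: 5)
puts settings[:root_url]
```

Merging the caller's hash into the defaults, rather than the other way around, is what lets any option be omitted without leaving the corresponding attribute undefined.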
Dynamic Method Handling
This class handles dynamic methods through the method_missing method
#method_missing(method_name) ⇒ Object (private)
Scrapers for all meta tags are defined automatically as methods of the form meta_name. This has been tested for meta name: keywords, description, robots, generator; and for meta http-equiv: content-language, Content-Type.
It first tries meta name="…" and, if nothing is found, meta http-equiv="…", replacing "_" with "-". TODO: define respond_to? to return true on the meta_name methods
# File 'lib/meta_inspector/scraper.rb', line 151
def method_missing(method_name)
  if method_name.to_s =~ /^meta_(.*)/
    key = $1
    key = "og:#{$1}" if key =~ /^og_(.*)/ # special treatment for og:
    @data..name && (@data..name[key.downcase]) || (@data..property && @data..property[key.downcase])
  else
    super
  end
end
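The dynamic lookup can be sketched in isolation with two plain hashes in place of the scraper's internal data store. The `MetaLookup` class and its constructor arguments are hypothetical, and the sketch omits the http-equiv fallback. It also adds `respond_to_missing?`, which is one way the TODO above might be addressed; it is not taken from the library.

```ruby
# Hypothetical stand-in for the scraper: names and properties are plain hashes.
class MetaLookup
  def initialize(names, properties)
    @names = names
    @properties = properties
  end

  def method_missing(method_name, *args)
    if method_name.to_s =~ /\Ameta_(.*)/
      key = $1
      # meta_og_image is looked up as "og:image"
      key = "og:#{$1}" if key =~ /\Aog_(.*)/
      @names[key.downcase] || @properties[key.downcase]
    else
      super
    end
  end

  # Makes respond_to?(:meta_anything) return true, matching method_missing.
  def respond_to_missing?(method_name, include_private = false)
    method_name.to_s.start_with?('meta_') || super
  end
end

lookup = MetaLookup.new(
  { 'keywords' => 'ruby, scraping' },
  { 'og:image' => 'http://example.com/cover.png' }
)
puts lookup.meta_keywords
puts lookup.meta_og_image
```

Any method name that does not start with `meta_` still reaches `super` and raises NoMethodError, so typos in ordinary method names are not silently swallowed.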
Instance Attribute Details
#allow_redirections ⇒ Object (readonly)
Returns the value of attribute allow_redirections.
# File 'lib/meta_inspector/scraper.rb', line 13
def allow_redirections
  @allow_redirections
end
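The constructor documents two values for this attribute, :safe and :all. A minimal sketch of that rule follows, assuming that same-scheme redirects are always followed and that any other value allows no scheme change. The `redirection_allowed?` method is hypothetical and is not part of the library; how the scraper actually enforces the option is not shown on this page.

```ruby
require 'uri'

# Hypothetical helper: would a redirect from one URL to another be followed?
def redirection_allowed?(from, to, mode)
  from_scheme = URI.parse(from).scheme
  to_scheme = URI.parse(to).scheme
  # Assumption: a redirect that keeps the scheme is always acceptable.
  return true if from_scheme == to_scheme

  case mode
  when :safe then from_scheme == 'http' && to_scheme == 'https'
  when :all then true # both HTTP => HTTPS and HTTPS => HTTP
  else false
  end
end

puts redirection_allowed?('http://example.com', 'https://example.com', :safe)
puts redirection_allowed?('https://example.com', 'http://example.com', :safe)
```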
#content_type ⇒ Object (readonly)
Returns the content_type of the fetched document
# File 'lib/meta_inspector/scraper.rb', line 125
def content_type
  @content_type
end
#errors ⇒ Object (readonly)
Returns the value of attribute errors.
# File 'lib/meta_inspector/scraper.rb', line 12
def errors
  @errors
end
#host ⇒ Object (readonly)
Returns the value of attribute host.
# File 'lib/meta_inspector/scraper.rb', line 12
def host
  @host
end
#html_content_only ⇒ Object (readonly)
Returns the value of attribute html_content_only.
# File 'lib/meta_inspector/scraper.rb', line 12
def html_content_only
  @html_content_only
end
#root_url ⇒ Object (readonly)
Returns the value of attribute root_url.
# File 'lib/meta_inspector/scraper.rb', line 12
def root_url
  @root_url
end
#scheme ⇒ Object (readonly)
Returns the value of attribute scheme.
# File 'lib/meta_inspector/scraper.rb', line 12
def scheme
  @scheme
end
#timeout ⇒ Object (readonly)
Returns the value of attribute timeout.
# File 'lib/meta_inspector/scraper.rb', line 12
def timeout
  @timeout
end
#url ⇒ Object (readonly)
Returns the value of attribute url.
# File 'lib/meta_inspector/scraper.rb', line 12
def url
  @url
end
#verbose ⇒ Object (readonly)
Returns the value of attribute verbose.
# File 'lib/meta_inspector/scraper.rb', line 13
def verbose
  @verbose
end
Instance Method Details
#charset ⇒ Object
Returns the charset from the meta tags, looking for it in the following order:
<meta charset='utf-8' />
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />
# File 'lib/meta_inspector/scraper.rb', line 85
def charset
  @charset ||= ( || charset_from_content_type)
end
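The lookup order described above can be sketched with two regular expressions over raw HTML. This is a simplification for illustration only: the scraper works on a parsed document, and the `charset_from_html` method is a hypothetical name, not part of the library.

```ruby
# Hypothetical helper: tries <meta charset=...> first, then the
# http-equiv Content-Type form; returns nil when neither is present.
def charset_from_html(html)
  if html =~ /<meta\s+charset=["']?([\w-]+)/i
    $1
  elsif html =~ /<meta[^>]+http-equiv=["']?content-type["']?[^>]*content=["'][^"']*charset=([\w-]+)/i
    $1
  end
end

short_form = '<meta charset="utf-8" />'
long_form = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />'
puts charset_from_html(short_form)
puts charset_from_html(long_form)
```

When a page declares both forms, the first branch wins, which matches the documented order.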
#description ⇒ Object
Returns the meta description if present; otherwise guesses one from the first paragraph with more than 120 characters
# File 'lib/meta_inspector/scraper.rb', line 46
def description
  .nil? ? secondary_description :
end
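A standalone sketch of this fallback, with paragraph texts passed in as plain strings rather than read from a parsed document. The `guess_description` method is hypothetical, and returning nil when no paragraph qualifies is an assumption; the private `secondary_description` helper is not shown on this page.

```ruby
# Hypothetical helper: prefer the meta description, else the first
# paragraph whose text is longer than 120 characters.
def guess_description(meta_description, paragraphs)
  return meta_description unless meta_description.nil?

  paragraphs.find { |text| text.length > 120 }
end

paragraphs = ['Short intro.', 'A' * 150, 'B' * 200]
puts guess_description(nil, paragraphs).length
puts guess_description('From the meta tag', paragraphs)
```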
#document ⇒ Object
Returns the original, unparsed document
# File 'lib/meta_inspector/scraper.rb', line 114
def document
  @document ||= if html_content_only && content_type != "text/html"
    raise "The url provided contains #{content_type} content instead of text/html content" and nil
  else
    request.read
  end
rescue Exception => e
  add_fatal_error "Scraping exception: #{e.}"
end
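The pattern here is to catch failures, record them, and let callers check `ok?` afterwards instead of handling exceptions themselves. A minimal sketch follows, assuming that `add_fatal_error` appends a message to `errors` (its body is not shown on this page). The `DocumentLoader` class is hypothetical, and it rescues StandardError rather than Exception, which is a deliberate departure from the listing above.

```ruby
# Hypothetical loader: the block stands in for the HTTP request.
class DocumentLoader
  attr_reader :errors

  def initialize(content_type, html_content_only: false, &reader)
    @content_type = content_type
    @html_content_only = html_content_only
    @reader = reader
    @errors = []
  end

  # Returns the document, or nil after recording an error message.
  def document
    @document ||= begin
      if @html_content_only && @content_type != 'text/html'
        raise "The url provided contains #{@content_type} content instead of text/html content"
      end
      @reader.call
    end
  rescue StandardError => e
    @errors << "Scraping exception: #{e.message}"
    nil
  end

  def ok?
    errors.empty?
  end
end

loader = DocumentLoader.new('application/pdf', html_content_only: true) { '%PDF-1.4' }
loader.document
puts loader.ok?
puts loader.errors.first
```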
#external_links ⇒ Object
External links found on the page, as absolute URLs
# File 'lib/meta_inspector/scraper.rb', line 61
def external_links
  @external_links ||= links.select {|link| host_from_url(link) != host }
end
#feed ⇒ Object
Returns the document's RSS feed link, falling back to the Atom feed link
# File 'lib/meta_inspector/scraper.rb', line 78
def feed
  @feed ||= (parsed_feed('rss') || parsed_feed('atom'))
end
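The RSS-then-Atom preference can be sketched over a list of link attributes. The `pick_feed` method and the matching rule (rel="alternate" with a type of application/rss+xml or application/atom+xml) are assumptions for illustration; the private `parsed_feed` helper is not shown on this page.

```ruby
# Hypothetical helper: links is an array of hashes with :rel, :type, :href.
def pick_feed(links)
  href_for = lambda do |format|
    match = links.find do |link|
      link[:rel] == 'alternate' && link[:type] == "application/#{format}+xml"
    end
    match && match[:href]
  end
  # RSS is preferred; Atom is used only when no RSS link exists.
  href_for.call('rss') || href_for.call('atom')
end

links = [
  { rel: 'alternate', type: 'application/atom+xml', href: 'http://example.com/atom.xml' },
  { rel: 'alternate', type: 'application/rss+xml', href: 'http://example.com/rss.xml' }
]
puts pick_feed(links)
```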
#image ⇒ Object
Returns the parsed image from Facebook's Open Graph property tags. Most major websites now define this property, and it is usually very relevant. See the documentation at developers.facebook.com/docs/opengraph/
# File 'lib/meta_inspector/scraper.rb', line 73
def image
end
#images ⇒ Object
Images found on the page, as absolute URLs
# File 'lib/meta_inspector/scraper.rb', line 66
def images
  @images ||= parsed_images.map{ |i| absolutify_url(i) }
end
#internal_links ⇒ Object
Internal links found on the page, as absolute URLs
# File 'lib/meta_inspector/scraper.rb', line 56
def internal_links
  @internal_links ||= links.select {|link| host_from_url(link) == host }
end
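Both #internal_links and #external_links filter the same list by comparing each link's host with the page's host. A standalone sketch of that split follows; `host_of` is a hypothetical stand-in for the private `host_from_url` helper, which is not shown on this page, and treating unparseable links as external is an assumption.

```ruby
require 'uri'

# Hypothetical stand-in for host_from_url: nil when the link cannot be parsed.
def host_of(link)
  URI.parse(link).host
rescue URI::InvalidURIError
  nil
end

# Splits absolute links into those on the given host and all others.
def split_links(links, host)
  internal, external = links.partition { |link| host_of(link) == host }
  { internal: internal, external: external }
end

links = [
  'http://example.com/about',
  'http://other.com/page',
  'http://www.example.com/contact'
]
result = split_links(links, 'example.com')
puts result[:internal].inspect
puts result[:external].inspect
```

Because the comparison is an exact string match on the host, a subdomain such as www.example.com counts as external in this sketch.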
#links ⇒ Object
Links found on the page, as absolute URLs
# File 'lib/meta_inspector/scraper.rb', line 51
def links
  @links ||= parsed_links.map{ |l| absolutify_url(unrelativize_url(l)) }.compact
end
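Turning raw href values into absolute URLs and dropping the ones that cannot be resolved might look like the following sketch. It uses the standard library's `URI.join` in place of the private `absolutify_url` and `unrelativize_url` helpers, which are not shown on this page, so the library's actual rules may differ. The `absolute_links` method is a hypothetical name.

```ruby
require 'uri'

# Hypothetical helper: resolves each href against base_url and drops
# any href that is not a valid URI, mirroring the map + compact shape above.
def absolute_links(base_url, hrefs)
  hrefs.map do |href|
    begin
      URI.join(base_url, href).to_s
    rescue URI::InvalidURIError
      nil
    end
  end.compact
end

hrefs = ['/about', 'post-1', 'http://other.com/x', 'not a valid uri']
puts absolute_links('http://example.com/blog/', hrefs).inspect
```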
#ok? ⇒ Boolean
Returns true if there are no errors
# File 'lib/meta_inspector/scraper.rb', line 130
def ok?
  errors.empty?
end
#parsed_document ⇒ Object
Returns the whole parsed document
# File 'lib/meta_inspector/scraper.rb', line 107
def parsed_document
  @parsed_document ||= Nokogiri::HTML(document)
rescue Exception => e
  add_fatal_error "Parsing exception: #{e.}"
end
#title ⇒ Object
Returns the parsed document title, from the content of the <title> tag. This is not the same as the meta_title tag
# File 'lib/meta_inspector/scraper.rb', line 40
def title
  @title ||= parsed_document.css('title').inner_html.gsub(/\t|\n|\r/, '') rescue nil
end
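The cleanup step applied to the title can be tried on its own. The `clean_title` method is a hypothetical name that wraps the same `gsub` call shown in the listing above.

```ruby
# Removes tabs, newlines, and carriage returns; ordinary spaces are kept.
def clean_title(raw)
  raw.gsub(/\t|\n|\r/, '')
end

puts clean_title("\n\tExample\r\n Domain")
```

Since only tabs and line breaks are removed, leading or trailing spaces in the <title> content survive; callers that want a trimmed value would need to strip it themselves.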
#to_hash ⇒ Object
Returns all parsed data as a nested Hash
# File 'lib/meta_inspector/scraper.rb', line 90
def to_hash
  {
    'url' => url,
    'title' => title,
    'links' => links,
    'internal_links' => internal_links,
    'external_links' => external_links,
    'images' => images,
    'charset' => charset,
    'feed' => feed,
    'content_type' => content_type
  }.merge @data.to_hash
end
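The shape of the result can be illustrated with plain hashes. The `build_page_hash` method is hypothetical, and the nested layout of the dynamic data used here (a 'meta' key holding 'name' and 'property' hashes) is an assumption; this page does not show what `@data.to_hash` actually contains.

```ruby
# Hypothetical helper: fixed keys first, then the dynamically collected data.
def build_page_hash(fixed, dynamic)
  fixed.merge(dynamic)
end

fixed = { 'url' => 'http://example.com/', 'title' => 'Example' }
dynamic = {
  'meta' => {
    'name' => { 'keywords' => 'ruby, scraping' },
    'property' => { 'og:image' => 'http://example.com/cover.png' }
  }
}
page = build_page_hash(fixed, dynamic)
puts page['meta']['name']['keywords']
```

Hash#merge gives precedence to its argument, so a top-level key in the dynamic data that shares a name with a fixed key would replace the fixed value.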