Class: Spidr::Page
- Inherits:
- Object
- Object
- Spidr::Page
- Includes:
- Enumerable
- Defined in:
- lib/spidr/page.rb,
lib/spidr/page/html.rb,
lib/spidr/page/cookies.rb,
lib/spidr/page/status_codes.rb,
lib/spidr/page/content_types.rb
Overview
Represents a requested page from a website.
Constant Summary
- RESERVED_COOKIE_NAMES =
Reserved names used within Cookie strings
/^(?:Path|Expires|Domain|Secure|HTTPOnly)$/i
Instance Attribute Summary
-
#headers ⇒ Object
readonly
Headers returned with the body.
-
#response ⇒ Object
readonly
HTTP Response.
-
#url ⇒ Object
readonly
URL of the page.
Instance Method Summary
-
#at(*arguments) ⇒ Nokogiri::HTML::Node, ...
(also: #%)
Searches for the first occurrence of an XPath or CSS Path expression.
-
#atom? ⇒ Boolean
Determines if the page is an Atom feed.
-
#bad_request? ⇒ Boolean
Determines if the response code is 400.
-
#body ⇒ String
(also: #to_s)
The body of the response.
-
#code ⇒ Integer
The response code from the page.
-
#content_charset ⇒ String?
The charset included in the Content-Type.
-
#content_type ⇒ String
The Content-Type of the page.
-
#content_types ⇒ Array<String>
The content types of the page.
-
#cookie ⇒ String
(also: #raw_cookie)
The raw Cookie String sent along with the page.
-
#cookie_params ⇒ Hash{String => String}
The Cookie key -> value pairs returned with the response.
-
#cookies ⇒ Array<String>
The Cookie values sent along with the page.
-
#css? ⇒ Boolean
Determines if the page is a CSS stylesheet.
-
#directory? ⇒ Boolean
Determines if the page is a Directory Listing.
-
#doc ⇒ Nokogiri::HTML::Document, ...
Returns a parsed document object for HTML, XML, RSS and Atom pages.
-
#each_link {|link| ... } ⇒ Enumerator
Enumerates over every link in the page.
-
#each_mailto {|link| ... } ⇒ Enumerator
Enumerates over every mailto: link in the page.
-
#each_meta_redirect {|link| ... } ⇒ Enumerator
Enumerates over the meta-redirect links in the page.
-
#each_redirect {|link| ... } ⇒ Enumerator
Enumerates over every HTTP or meta-redirect link in the page.
-
#each_url {|url| ... } ⇒ Enumerator
(also: #each)
Enumerates over every absolute URL in the page.
-
#gif? ⇒ Boolean
Determines if the page is a GIF image.
-
#had_internal_server_error? ⇒ Boolean
Determines if the response code is 500.
-
#html? ⇒ Boolean
Determines if the page is an HTML document.
-
#ico? ⇒ Boolean
(also: #icon?)
Determines if the page is an ICO image.
-
#initialize(url, response) ⇒ Page
constructor
Creates a new Page object.
-
#is_content_type?(type) ⇒ Boolean
Determines if any of the content-types of the page include a given type.
-
#is_forbidden? ⇒ Boolean
(also: #forbidden?)
Determines if the response code is 403.
-
#is_missing? ⇒ Boolean
(also: #missing?)
Determines if the response code is 404.
-
#is_ok? ⇒ Boolean
(also: #ok?)
Determines if the response code is 200.
-
#is_redirect? ⇒ Boolean
(also: #redirect?)
Determines if the response code is 300, 301, 302, 303 or 307.
-
#is_timedout? ⇒ Boolean
(also: #timedout?)
Determines if the response code is 408.
-
#is_unauthorized? ⇒ Boolean
(also: #unauthorized?)
Determines if the response code is 401.
-
#javascript? ⇒ Boolean
Determines if the page is JavaScript.
-
#jpeg? ⇒ Boolean
Determines if the page is a JPEG image.
-
#json? ⇒ Boolean
Determines if the page is JSON.
-
#links ⇒ Array<String>
The links from within the page.
-
#mailtos ⇒ Array<String>
mailto: links in the page.
-
#meta_redirect ⇒ Array<String>
deprecated
Deprecated in 0.3.0 and will be removed in 0.4.0. Use #meta_redirects instead.
-
#meta_redirect? ⇒ Boolean
Determines whether page-level meta redirects are present in the page.
-
#meta_redirects ⇒ Array<String>
The meta-redirect links of the page.
-
#method_missing(name, *arguments, &block) ⇒ String
protected
Provides transparent access to the values in #headers.
-
#ms_word? ⇒ Boolean
Determines if the page is an MS Word document.
-
#pdf? ⇒ Boolean
Determines if the page is a PDF document.
-
#plain_text? ⇒ Boolean
(also: #txt?)
Determines if the page is plain-text.
-
#png? ⇒ Boolean
Determines if the page is a PNG image.
-
#redirects_to ⇒ Array<String>
URLs that this document redirects to.
-
#rss? ⇒ Boolean
Determines if the page is an RSS feed.
-
#search(*paths) ⇒ Array
(also: #/)
Searches the document for the given XPath or CSS Path expressions.
-
#title ⇒ String
The title of the HTML page.
-
#to_absolute(link) ⇒ URI::HTTP
Normalizes and expands a given link into a proper URI.
-
#urls ⇒ Array<URI::HTTP>
Absolute URIs from within the page.
-
#xml? ⇒ Boolean
Determines if the page is an XML document.
-
#xsl? ⇒ Boolean
Determines if the page is an XML Stylesheet (XSL).
-
#zip? ⇒ Boolean
Determines if the page is a ZIP archive.
Constructor Details
#initialize(url, response) ⇒ Page
Creates a new Page object.
    # File 'lib/spidr/page.rb', line 25
    def initialize(url,response)
      @url = url
      @response = response
      @headers = response.to_hash
      @doc = nil
    end
Dynamic Method Handling
This class handles dynamic methods through the method_missing method.
#method_missing(name, *arguments, &block) ⇒ String (protected)
Provides transparent access to the values in #headers.
    # File 'lib/spidr/page.rb', line 134
    def method_missing(name,*arguments,&block)
      if (arguments.empty? && block.nil?)
        header_name = name.to_s.tr('_','-')

        if @response.key?(header_name)
          return @response[header_name]
        end
      end

      return super(name,*arguments,&block)
    end
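The header lookup performed by #method_missing can be tried without the Spidr gem. The following is a minimal sketch, assuming only Ruby's standard `net/http` library; `header_for` is a hypothetical helper written for illustration and is not part of Spidr. It mirrors the two steps in the listing: underscores in the method name become dashes, and the header is returned only if the response has it.

```ruby
require 'net/http'

# Hypothetical helper (not part of Spidr) mirroring the header lookup:
# :content_length is translated to the header name "content-length".
def header_for(response, name)
  header_name = name.to_s.tr('_', '-')

  response[header_name] if response.key?(header_name)
end

# Build a bare response object by hand; no network access is involved.
response = Net::HTTPOK.new('1.1', '200', 'OK')
response['Content-Length'] = '42'

p header_for(response, :content_length)
p header_for(response, :x_missing)
```

In the real class a missing header falls through to `super`, which raises `NoMethodError`; the sketch returns `nil` instead to keep the example short.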
Instance Attribute Details
#headers ⇒ Object (readonly)
Headers returned with the body
    # File 'lib/spidr/page.rb', line 14
    def headers
      @headers
    end
#response ⇒ Object (readonly)
HTTP Response
    # File 'lib/spidr/page.rb', line 11
    def response
      @response
    end
#url ⇒ Object (readonly)
URL of the page
    # File 'lib/spidr/page.rb', line 8
    def url
      @url
    end
Instance Method Details
#at(*arguments) ⇒ Nokogiri::HTML::Node, ... Also known as: %
Searches for the first occurrence of an XPath or CSS Path expression.
    # File 'lib/spidr/page.rb', line 108
    def at(*arguments)
      if doc
        doc.at(*arguments)
      end
    end
#atom? ⇒ Boolean
Determines if the page is an Atom feed.
    # File 'lib/spidr/page/content_types.rb', line 191
    def atom?
      is_content_type?('application/atom+xml')
    end
#bad_request? ⇒ Boolean
Determines if the response code is 400.
    # File 'lib/spidr/page/status_codes.rb', line 31
    def bad_request?
      code == 400
    end
#body ⇒ String Also known as: to_s
The body of the response.
    # File 'lib/spidr/page.rb', line 38
    def body
      (response.body || '')
    end
#code ⇒ Integer
The response code from the page.
    # File 'lib/spidr/page/status_codes.rb', line 9
    def code
      @response.code.to_i
    end
#content_charset ⇒ String?
The charset included in the Content-Type.
    # File 'lib/spidr/page/content_types.rb', line 33
    def content_charset
      content_types.each do |value|
        if value.include?(';')
          value.split(';').each do |param|
            param.strip!

            if param.start_with?('charset=')
              return param.split('=',2).last
            end
          end
        end
      end

      return nil
    end
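To see what the charset lookup returns for typical header values, here is a small stand-alone sketch. `charset_from` is a hypothetical function that takes the Content-Type values as an Array of Strings, which is what #content_types returns; it is an illustration of the same parsing steps, not Spidr's own code.

```ruby
# Hypothetical stand-alone version of the charset lookup.
# Returns the first "charset=" parameter found, or nil if there is none.
def charset_from(content_types)
  content_types.each do |value|
    value.split(';').each do |param|
      param = param.strip

      return param.split('=', 2).last if param.start_with?('charset=')
    end
  end

  nil
end

p charset_from(['text/html; charset=UTF-8'])
p charset_from(['text/html'])
```

As in the listing, the parameter name is matched case-sensitively, so a header written as `Charset=UTF-8` would not be picked up by this sketch.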
#content_type ⇒ String
The Content-Type of the page.
    # File 'lib/spidr/page/content_types.rb', line 9
    def content_type
      @response['Content-Type'] || ''
    end
#content_types ⇒ Array<String>
The content types of the page.
    # File 'lib/spidr/page/content_types.rb', line 21
    def content_types
      @response.get_fields('content-type') || []
    end
#cookie ⇒ String Also known as: raw_cookie
The raw Cookie String sent along with the page.
    # File 'lib/spidr/page/cookies.rb', line 16
    def cookie
      @response['Set-Cookie'] || ''
    end
#cookie_params ⇒ Hash{String => String}
The Cookie key -> value pairs returned with the response.
    # File 'lib/spidr/page/cookies.rb', line 42
    def cookie_params
      params = {}

      cookies.each do |value|
        value.split(';').each do |param|
          param.strip!

          name, value = param.split('=',2)

          unless name =~ RESERVED_COOKIE_NAMES
            params[name] = (value || '')
          end
        end
      end

      return params
    end
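The effect of RESERVED_COOKIE_NAMES is easiest to see on sample input. The sketch below is a stand-alone approximation: `parse_cookie_params` and the `RESERVED` constant are hypothetical names, and the input is an Array of raw Set-Cookie values such as #cookies returns. The regular expression is copied from the Constant Summary.

```ruby
# Same pattern as RESERVED_COOKIE_NAMES, under a hypothetical local name.
RESERVED = /^(?:Path|Expires|Domain|Secure|HTTPOnly)$/i

# Hypothetical stand-alone version of the key/value extraction.
def parse_cookie_params(set_cookie_values)
  params = {}

  set_cookie_values.each do |cookie|
    cookie.split(';').each do |param|
      name, value = param.strip.split('=', 2)

      # Skip empty fragments and cookie attributes such as Path or Secure.
      next if name.nil? || name =~ RESERVED

      params[name] = (value || '')
    end
  end

  params
end

# Keeps "session" and "theme"; drops Path, HttpOnly and Secure.
p parse_cookie_params(['session=abc123; Path=/; HttpOnly', 'theme=dark; Secure'])
```

Attributes that are not in the reserved list, for example `Max-Age`, would be treated as ordinary cookie names by this pattern.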
#cookies ⇒ Array<String>
The Cookie values sent along with the page.
    # File 'lib/spidr/page/cookies.rb', line 30
    def cookies
      (@response.get_fields('Set-Cookie') || [])
    end
#css? ⇒ Boolean
Determines if the page is a CSS stylesheet.
    # File 'lib/spidr/page/content_types.rb', line 170
    def css?
      is_content_type?('text/css')
    end
#directory? ⇒ Boolean
Determines if the page is a Directory Listing.
    # File 'lib/spidr/page/content_types.rb', line 106
    def directory?
      is_content_type?('text/directory')
    end
#doc ⇒ Nokogiri::HTML::Document, ...
Returns a parsed document object for HTML, XML, RSS and Atom pages.
    # File 'lib/spidr/page.rb', line 55
    def doc
      unless body.empty?
        doc_class = if html?
                      Nokogiri::HTML::Document
                    elsif rss? || atom? || xml? || xsl?
                      Nokogiri::XML::Document
                    end

        if doc_class
          begin
            @doc ||= doc_class.parse(body, @url.to_s, content_charset)
          rescue
          end
        end
      end
    end
#each_link {|link| ... } ⇒ Enumerator
Enumerates over every link in the page.
    # File 'lib/spidr/page/html.rb', line 180
    def each_link(&block)
      return enum_for(__method__) unless block_given?

      each_redirect(&block) if is_redirect?

      if (html? && doc)
        doc.search('//a[@href[string()]]').each do |a|
          yield a.get_attribute('href')
        end

        doc.search('//frame[@src[string()]]').each do |iframe|
          yield iframe.get_attribute('src')
        end

        doc.search('//iframe[@src[string()]]').each do |iframe|
          yield iframe.get_attribute('src')
        end

        doc.search('//link[@href[string()]]').each do |link|
          yield link.get_attribute('href')
        end

        doc.search('//script[@src[string()]]').each do |script|
          yield script.get_attribute('src')
        end
      end
    end
#each_mailto {|link| ... } ⇒ Enumerator
Enumerates over every mailto: link in the page.
    # File 'lib/spidr/page/html.rb', line 144
    def each_mailto
      return enum_for(__method__) unless block_given?

      if (html? && doc)
        doc.search('//a[starts-with(@href,"mailto:")]').each do |a|
          yield a.get_attribute('href')[7..-1]
        end
      end
    end
#each_meta_redirect {|link| ... } ⇒ Enumerator
Enumerates over the meta-redirect links in the page.
    # File 'lib/spidr/page/html.rb', line 35
    def each_meta_redirect
      return enum_for(__method__) unless block_given?

      if (html? && doc)
        search('//meta[@http-equiv and @content]').each do |node|
          if node.get_attribute('http-equiv') =~ /refresh/i
            content = node.get_attribute('content')

            if (redirect = content.match(/url=(\S+)$/))
              yield redirect[1]
            end
          end
        end
      end
    end
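The regular expression applied to the `content` attribute decides which meta refresh tags produce a redirect. As a rough illustration, the sketch below applies the same pattern to plain Strings; `refresh_target` is a hypothetical helper and no HTML parsing is involved.

```ruby
# Hypothetical helper: extracts the target from the content attribute of
# a <meta http-equiv="refresh"> tag, using the pattern from the listing.
def refresh_target(content)
  match = content.match(/url=(\S+)$/)

  match && match[1]
end

p refresh_target('5; url=http://example.com/new') # => "http://example.com/new"
p refresh_target('5')                             # => nil
p refresh_target('0; URL=http://example.com/')    # => nil
```

The last call returns `nil` because the pattern has no `i` flag, so only a lowercase `url=` matches; the `http-equiv` check in the listing, by contrast, is case-insensitive.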
#each_redirect {|link| ... } ⇒ Enumerator
Enumerates over every HTTP or meta-redirect link in the page.
    # File 'lib/spidr/page/html.rb', line 105
    def each_redirect(&block)
      return enum_for(__method__) unless block

      locations = @response.get_fields('Location')

      unless (locations.nil? || locations.empty?)
        # Location headers override any meta-refresh redirects in the HTML
        locations.each(&block)
      else
        # check page-level meta redirects if there isn't a location header
        each_meta_redirect(&block)
      end
    end
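The precedence rule in the comments (Location headers win over meta refresh redirects) can be sketched with hand-built responses from Ruby's standard `net/http` library. `redirect_targets` is a hypothetical function, and the meta redirects are passed in as a plain Array rather than parsed from HTML.

```ruby
require 'net/http'

# Hypothetical helper: prefer Location headers, and fall back to the
# given meta redirects only when no Location header is present.
def redirect_targets(response, meta_redirects)
  locations = response.get_fields('Location')

  if locations.nil? || locations.empty?
    meta_redirects
  else
    locations
  end
end

# Responses are built by hand; no network access is involved.
moved = Net::HTTPFound.new('1.1', '302', 'Found')
moved['Location'] = 'http://example.com/moved'

plain = Net::HTTPOK.new('1.1', '200', 'OK')

p redirect_targets(moved, ['http://example.com/meta']) # => ["http://example.com/moved"]
p redirect_targets(plain, ['http://example.com/meta']) # => ["http://example.com/meta"]
```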
#each_url {|url| ... } ⇒ Enumerator Also known as: each
Enumerates over every absolute URL in the page.
    # File 'lib/spidr/page/html.rb', line 233
    def each_url
      return enum_for(__method__) unless block_given?

      each_link do |link|
        if (url = to_absolute(link))
          yield url
        end
      end
    end
#gif? ⇒ Boolean
Determines if the page is a GIF image.
    # File 'lib/spidr/page/content_types.rb', line 245
    def gif?
      is_content_type?('image/gif')
    end
#had_internal_server_error? ⇒ Boolean
Determines if the response code is 500.
    # File 'lib/spidr/page/status_codes.rb', line 89
    def had_internal_server_error?
      code == 500
    end
#html? ⇒ Boolean
Determines if the page is an HTML document.
    # File 'lib/spidr/page/content_types.rb', line 116
    def html?
      is_content_type?('text/html')
    end
#ico? ⇒ Boolean Also known as: icon?
Determines if the page is an ICO image.
    # File 'lib/spidr/page/content_types.rb', line 269
    def ico?
      is_content_type?('image/x-icon') ||
      is_content_type?('image/vnd.microsoft.icon')
    end
#is_content_type?(type) ⇒ Boolean
Determines if any of the content-types of the page include a given type.
    # File 'lib/spidr/page/content_types.rb', line 67
    def is_content_type?(type)
      if type.include?('/')
        # otherwise only match the first param
        content_types.any? do |value|
          value = value.split(';',2).first

          value == type
        end
      else
        # otherwise only match the sub-type
        content_types.any? do |value|
          value = value.split(';',2).first
          value = value.split('/',2).last

          value == type
        end
      end
    end
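The two branches above accept either a full media type (`text/html`) or just a sub-type (`html`). A compact stand-alone sketch of the same comparison follows; `content_type_match?` is a hypothetical function that takes the Content-Type values as an Array of Strings instead of reading them from a response.

```ruby
# Hypothetical stand-alone version of the comparison: parameters after
# ';' are ignored, and a type without '/' is compared to the sub-type.
def content_type_match?(content_types, type)
  content_types.any? do |value|
    media_type = value.split(';', 2).first
    media_type = media_type.split('/', 2).last unless type.include?('/')

    media_type == type
  end
end

types = ['text/html; charset=utf-8']

p content_type_match?(types, 'text/html') # => true
p content_type_match?(types, 'html')      # => true
p content_type_match?(types, 'xml')       # => false
```

The sub-type comparison is exact, so `'xml'` would not match `application/atom+xml` in this sketch; the whole sub-type `atom+xml` would.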
#is_forbidden? ⇒ Boolean Also known as: forbidden?
Determines if the response code is 403.
    # File 'lib/spidr/page/status_codes.rb', line 53
    def is_forbidden?
      code == 403
    end
#is_missing? ⇒ Boolean Also known as: missing?
Determines if the response code is 404.
    # File 'lib/spidr/page/status_codes.rb', line 65
    def is_missing?
      code == 404
    end
#is_ok? ⇒ Boolean Also known as: ok?
Determines if the response code is 200.
    # File 'lib/spidr/page/status_codes.rb', line 19
    def is_ok?
      code == 200
    end
#is_redirect? ⇒ Boolean Also known as: redirect?
Determines if the response code is 300, 301, 302, 303 or 307. Also checks for "soft" redirects added at the page level by a meta refresh tag.
    # File 'lib/spidr/page/status_codes.rb', line 101
    def is_redirect?
      case code
      when 300..303, 307
        true
      when 200
        meta_redirect?
      else
        false
      end
    end
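The status code classification can be illustrated without a response object. In the sketch below, `redirect_code?` is a hypothetical function; its second argument stands in for the result of #meta_redirect?, which only matters for a 200 response.

```ruby
# Hypothetical stand-alone version of the redirect check.
# has_meta_redirect stands in for the page-level meta refresh check.
def redirect_code?(code, has_meta_redirect = false)
  case code
  when 300..303, 307
    true
  when 200
    has_meta_redirect
  else
    false
  end
end

p redirect_code?(301)       # => true
p redirect_code?(200)       # => false
p redirect_code?(200, true) # => true
p redirect_code?(308)       # => false
```

Codes outside the listed set, such as 304 and 308, are not treated as redirects by this logic.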
#is_timedout? ⇒ Boolean Also known as: timedout?
Determines if the response code is 408.
    # File 'lib/spidr/page/status_codes.rb', line 77
    def is_timedout?
      code == 408
    end
#is_unauthorized? ⇒ Boolean Also known as: unauthorized?
Determines if the response code is 401.
    # File 'lib/spidr/page/status_codes.rb', line 41
    def is_unauthorized?
      code == 401
    end
#javascript? ⇒ Boolean
Determines if the page is JavaScript.
    # File 'lib/spidr/page/content_types.rb', line 147
    def javascript?
      is_content_type?('text/javascript') || \
      is_content_type?('application/javascript')
    end
#jpeg? ⇒ Boolean
Determines if the page is a JPEG image.
    # File 'lib/spidr/page/content_types.rb', line 257
    def jpeg?
      is_content_type?('image/jpeg')
    end
#json? ⇒ Boolean
Determines if the page is JSON.
    # File 'lib/spidr/page/content_types.rb', line 160
    def json?
      is_content_type?('application/json')
    end
#links ⇒ Array<String>
The links from within the page.
    # File 'lib/spidr/page/html.rb', line 215
    def links
      each_link.to_a
    end
#mailtos ⇒ Array<String>
mailto: links in the page.
    # File 'lib/spidr/page/html.rb', line 162
    def mailtos
      each_mailto.to_a
    end
#meta_redirect ⇒ Array<String>
Deprecated in 0.3.0 and will be removed in 0.4.0. Use #meta_redirects instead.
The meta-redirect links of the page.
    # File 'lib/spidr/page/html.rb', line 84
    def meta_redirect
      warn 'DEPRECATION: Spidr::Page#meta_redirect will be removed in 0.3.0'
      warn 'DEPRECATION: Use Spidr::Page#meta_redirects instead'

      meta_redirects
    end
#meta_redirect? ⇒ Boolean
Determines whether page-level meta redirects are present in the page.
    # File 'lib/spidr/page/html.rb', line 58
    def meta_redirect?
      !each_meta_redirect.first.nil?
    end
#meta_redirects ⇒ Array<String>
The meta-redirect links of the page.
    # File 'lib/spidr/page/html.rb', line 70
    def meta_redirects
      each_meta_redirect.to_a
    end
#ms_word? ⇒ Boolean
Determines if the page is an MS Word document.
    # File 'lib/spidr/page/content_types.rb', line 201
    def ms_word?
      is_content_type?('application/msword')
    end
#pdf? ⇒ Boolean
Determines if the page is a PDF document.
    # File 'lib/spidr/page/content_types.rb', line 211
    def pdf?
      is_content_type?('application/pdf')
    end
#plain_text? ⇒ Boolean Also known as: txt?
Determines if the page is plain-text.
    # File 'lib/spidr/page/content_types.rb', line 92
    def plain_text?
      is_content_type?('text/plain')
    end
#png? ⇒ Boolean
Determines if the page is a PNG image.
    # File 'lib/spidr/page/content_types.rb', line 233
    def png?
      is_content_type?('image/png')
    end
#redirects_to ⇒ Array<String>
URLs that this document redirects to.
    # File 'lib/spidr/page/html.rb', line 126
    def redirects_to
      each_redirect.to_a
    end
#rss? ⇒ Boolean
Determines if the page is an RSS feed.
    # File 'lib/spidr/page/content_types.rb', line 180
    def rss?
      is_content_type?('application/rss+xml') || \
      is_content_type?('application/rdf+xml')
    end
#search(*paths) ⇒ Array Also known as: /
Searches the document for the given XPath or CSS Path expressions.
    # File 'lib/spidr/page.rb', line 88
    def search(*paths)
      if doc
        doc.search(*paths)
      else
        []
      end
    end
#title ⇒ String
The title of the HTML page.
    # File 'lib/spidr/page/html.rb', line 14
    def title
      if (node = at('//title'))
        node.inner_text
      end
    end
#to_absolute(link) ⇒ URI::HTTP
Normalizes and expands a given link into a proper URI.
    # File 'lib/spidr/page/html.rb', line 264
    def to_absolute(link)
      link = link.to_s
      new_url = begin
                  url.merge(link)
                rescue Exception
                  return
                end

      if (!new_url.opaque) && (path = new_url.path)
        # ensure that paths begin with a leading '/' for URI::FTP
        if (new_url.scheme == 'ftp' && !path.start_with?('/'))
          path.insert(0,'/')
        end

        # make sure the path does not contain any .. or . directories,
        # since URI::Generic#merge cannot normalize paths such as
        # "/stuff/../"
        new_url.path = URI.expand_path(path)
      end

      return new_url
    end
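The first step of #to_absolute, merging a link into the page URL and giving up quietly on links that cannot be parsed, can be sketched with Ruby's standard `uri` library alone. `absolute_url` is a hypothetical helper; it does not perform the FTP or path-normalization steps shown in the listing, and it rescues `StandardError` rather than `Exception`.

```ruby
require 'uri'

# Hypothetical helper: resolve a link against a base URI, returning nil
# when the link cannot be parsed as a URI.
def absolute_url(base, link)
  base.merge(link.to_s)
rescue StandardError
  nil
end

base = URI('http://example.com/docs/index.html')

puts absolute_url(base, 'about.html')       # prints http://example.com/docs/about.html
puts absolute_url(base, '/contact')         # prints http://example.com/contact
p absolute_url(base, 'not a valid link')    # => nil
```

Returning `nil` for unparseable links is what lets #each_url skip them instead of raising.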
#urls ⇒ Array<URI::HTTP>
Absolute URIs from within the page.
    # File 'lib/spidr/page/html.rb', line 251
    def urls
      each_url.to_a
    end
#xml? ⇒ Boolean
Determines if the page is an XML document.
    # File 'lib/spidr/page/content_types.rb', line 126
    def xml?
      is_content_type?('text/xml') || \
      is_content_type?('application/xml')
    end
#xsl? ⇒ Boolean
Determines if the page is an XML Stylesheet (XSL).
    # File 'lib/spidr/page/content_types.rb', line 137
    def xsl?
      is_content_type?('text/xsl')
    end
#zip? ⇒ Boolean
Determines if the page is a ZIP archive.
    # File 'lib/spidr/page/content_types.rb', line 221
    def zip?
      is_content_type?('application/zip')
    end