Class: Spidr::Page
- Inherits: Object
- Includes: Enumerable
- Defined in: lib/spidr/page.rb, lib/spidr/page/html.rb, lib/spidr/page/cookies.rb, lib/spidr/page/status_codes.rb, lib/spidr/page/content_types.rb
Overview
Represents a requested page from a website.
Constant Summary
- RESERVED_COOKIE_NAMES = /^(?:Path|Expires|Domain|Secure|HTTPOnly)$/i
Reserved names used within Cookie strings.
Instance Attribute Summary
-
#headers ⇒ Object
readonly
Headers returned with the body.
-
#response ⇒ Object
readonly
HTTP Response.
-
#url ⇒ Object
readonly
URL of the page.
Instance Method Summary
-
#at(*arguments) ⇒ Nokogiri::HTML::Node, ...
(also: #%)
Searches for the first occurrence of an XPath or CSS Path expression.
-
#atom? ⇒ Boolean
Determines if the page is an Atom feed.
-
#bad_request? ⇒ Boolean
Determines if the response code is 400.
-
#body ⇒ String
(also: #to_s)
The body of the response.
-
#code ⇒ Integer
The response code from the page.
-
#content_charset ⇒ String?
The charset included in the Content-Type.
-
#content_type ⇒ String
The Content-Type of the page.
-
#content_types ⇒ Array<String>
The content types of the page.
-
#cookie ⇒ String
(also: #raw_cookie)
The raw Cookie String sent along with the page.
-
#cookie_params ⇒ Hash{String => String}
The Cookie key -> value pairs returned with the response.
-
#cookies ⇒ Array<String>
The Cookie values sent along with the page.
-
#css? ⇒ Boolean
Determines if the page is a CSS stylesheet.
-
#directory? ⇒ Boolean
Determines if the page is a Directory Listing.
-
#doc ⇒ Nokogiri::HTML::Document, ...
Returns a parsed document object for HTML, XML, RSS and Atom pages.
-
#each_link {|link| ... } ⇒ Enumerator
Enumerates over every link in the page.
-
#each_mailto {|link| ... } ⇒ Enumerator
Enumerates over every mailto: link in the page.
-
#each_meta_redirect {|link| ... } ⇒ Enumerator
Enumerates over the meta-redirect links in the page.
-
#each_redirect {|link| ... } ⇒ Enumerator
Enumerates over every HTTP or meta-redirect link in the page.
-
#each_url {|url| ... } ⇒ Enumerator
(also: #each)
Enumerates over every absolute URL in the page.
-
#gif? ⇒ Boolean
Determines if the page is a GIF image.
-
#had_internal_server_error? ⇒ Boolean
Determines if the response code is 500.
-
#html? ⇒ Boolean
Determines if the page is an HTML document.
-
#ico? ⇒ Boolean
(also: #icon?)
Determines if the page is an ICO image.
-
#initialize(url, response) ⇒ Page
constructor
Creates a new Page object.
-
#is_content_type?(type) ⇒ Boolean
Determines if any of the content-types of the page include a given type.
-
#is_forbidden? ⇒ Boolean
(also: #forbidden?)
Determines if the response code is 403.
-
#is_missing? ⇒ Boolean
(also: #missing?)
Determines if the response code is 404.
-
#is_ok? ⇒ Boolean
(also: #ok?)
Determines if the response code is 200.
-
#is_redirect? ⇒ Boolean
(also: #redirect?)
Determines if the response code is 300, 301, 302, 303 or 307.
-
#is_timedout? ⇒ Boolean
(also: #timedout?)
Determines if the response code is 408.
-
#is_unauthorized? ⇒ Boolean
(also: #unauthorized?)
Determines if the response code is 401.
-
#javascript? ⇒ Boolean
Determines if the page is JavaScript.
-
#jpeg? ⇒ Boolean
Determines if the page is a JPEG image.
-
#json? ⇒ Boolean
Determines if the page is JSON.
-
#links ⇒ Array<String>
The links from within the page.
-
#mailtos ⇒ Array<String>
The mailto: links in the page.
-
#meta_redirect ⇒ Array<String>
deprecated
Deprecated in 0.3.0 and will be removed in 0.4.0. Use #meta_redirects instead.
-
#meta_redirect? ⇒ Boolean
Determines whether the page contains any page-level meta redirects.
-
#meta_redirects ⇒ Array<String>
The meta-redirect links of the page.
-
#method_missing(name, *arguments, &block) ⇒ String
protected
Provides transparent access to the values in #headers.
-
#ms_word? ⇒ Boolean
Determines if the page is an MS Word document.
-
#pdf? ⇒ Boolean
Determines if the page is a PDF document.
-
#plain_text? ⇒ Boolean
(also: #txt?)
Determines if the page is plain-text.
-
#png? ⇒ Boolean
Determines if the page is a PNG image.
-
#redirects_to ⇒ Array<String>
URLs that this document redirects to.
-
#rss? ⇒ Boolean
Determines if the page is an RSS feed.
-
#search(*paths) ⇒ Array
(also: #/)
Searches the document for XPath or CSS Path paths.
-
#title ⇒ String
The title of the HTML page.
-
#to_absolute(link) ⇒ URI::HTTP
Normalizes and expands a given link into a proper URI.
-
#urls ⇒ Array<URI::HTTP>
Absolute URIs from within the page.
-
#xml? ⇒ Boolean
Determines if the page is an XML document.
-
#xsl? ⇒ Boolean
Determines if the page is an XML Stylesheet (XSL).
-
#zip? ⇒ Boolean
Determines if the page is a ZIP archive.
Constructor Details
#initialize(url, response) ⇒ Page
Creates a new Page object.
    # File 'lib/spidr/page.rb', line 27
    def initialize(url,response)
      @url      = url
      @response = response
      @headers  = response.to_hash
      @doc      = nil
    end
Dynamic Method Handling
This class handles dynamic methods through the method_missing method.
#method_missing(name, *arguments, &block) ⇒ String (protected)
Provides transparent access to the values in #headers.
    # File 'lib/spidr/page.rb', line 136
    def method_missing(name,*arguments,&block)
      if (arguments.empty? && block.nil?)
        header_name = name.to_s.tr('_','-')

        if @response.key?(header_name)
          return @response[header_name]
        end
      end

      return super(name,*arguments,&block)
    end
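The header lookup above can be tried outside Spidr with a small standalone sketch. The `HeaderProxy` class below is hypothetical and wraps a plain Hash rather than an HTTP response; it only illustrates the underscore-to-hyphen name mapping and the fallback to `super`.

```ruby
# Hypothetical stand-in for Spidr::Page (not part of Spidr).
# It wraps a Hash of lower-case header names instead of a real response.
class HeaderProxy
  def initialize(headers)
    @headers = headers
  end

  private

  # Keeps respond_to? consistent with the dynamic header methods.
  def respond_to_missing?(name, include_private = false)
    @headers.key?(name.to_s.tr('_', '-')) || super
  end

  # Maps a call such as #content_length to the 'content-length' header.
  def method_missing(name, *arguments, &block)
    if arguments.empty? && block.nil?
      header_name = name.to_s.tr('_', '-')
      return @headers[header_name] if @headers.key?(header_name)
    end

    super
  end
end

proxy = HeaderProxy.new('content-length' => '1024')
puts proxy.content_length
```

Calls that do not match a header fall through to `super` and raise `NoMethodError` as usual.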
Instance Attribute Details
#headers ⇒ Object (readonly)
Headers returned with the body.
    # File 'lib/spidr/page.rb', line 16
    def headers
      @headers
    end
#response ⇒ Object (readonly)
HTTP Response.
    # File 'lib/spidr/page.rb', line 13
    def response
      @response
    end
#url ⇒ Object (readonly)
URL of the page.
    # File 'lib/spidr/page.rb', line 10
    def url
      @url
    end
Instance Method Details
#at(*arguments) ⇒ Nokogiri::HTML::Node, ... Also known as: %
Searches for the first occurrence of an XPath or CSS Path expression.
    # File 'lib/spidr/page.rb', line 110
    def at(*arguments)
      if doc
        doc.at(*arguments)
      end
    end
#atom? ⇒ Boolean
Determines if the page is an Atom feed.
    # File 'lib/spidr/page/content_types.rb', line 193
    def atom?
      is_content_type?('application/atom+xml')
    end
#bad_request? ⇒ Boolean
Determines if the response code is 400.
    # File 'lib/spidr/page/status_codes.rb', line 33
    def bad_request?
      code == 400
    end
#body ⇒ String Also known as: to_s
The body of the response.
    # File 'lib/spidr/page.rb', line 40
    def body
      (response.body || '')
    end
#code ⇒ Integer
The response code from the page.
    # File 'lib/spidr/page/status_codes.rb', line 11
    def code
      @response.code.to_i
    end
#content_charset ⇒ String?
The charset included in the Content-Type.
    # File 'lib/spidr/page/content_types.rb', line 35
    def content_charset
      content_types.each do |value|
        if value.include?(';')
          value.split(';').each do |param|
            param.strip!

            if param.start_with?('charset=')
              return param.split('=',2).last
            end
          end
        end
      end

      return nil
    end
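To see the charset extraction in isolation, here is a minimal sketch that applies the same splitting logic to an Array of Content-Type values. The helper name `charset_from` is an assumption made for illustration and does not exist in Spidr.

```ruby
# Hypothetical helper mirroring the logic of #content_charset.
# Takes an Array of raw Content-Type header values.
def charset_from(content_types)
  content_types.each do |value|
    # Values without parameters cannot carry a charset.
    next unless value.include?(';')

    value.split(';').each do |param|
      param = param.strip
      return param.split('=', 2).last if param.start_with?('charset=')
    end
  end

  nil
end

p charset_from(['text/html; charset=UTF-8'])  # => "UTF-8"
p charset_from(['text/html'])                 # => nil
```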
#content_type ⇒ String
The Content-Type of the page.
    # File 'lib/spidr/page/content_types.rb', line 11
    def content_type
      @response['Content-Type'] || ''
    end
#content_types ⇒ Array<String>
The content types of the page.
    # File 'lib/spidr/page/content_types.rb', line 23
    def content_types
      @response.get_fields('content-type') || []
    end
#cookie ⇒ String Also known as: raw_cookie
The raw Cookie String sent along with the page.
    # File 'lib/spidr/page/cookies.rb', line 18
    def cookie
      @response['Set-Cookie'] || ''
    end
#cookie_params ⇒ Hash{String => String}
The Cookie key -> value pairs returned with the response.
    # File 'lib/spidr/page/cookies.rb', line 44
    def cookie_params
      params = {}

      cookies.each do |value|
        value.split(';').each do |param|
          param.strip!

          name, value = param.split('=',2)

          unless name =~ RESERVED_COOKIE_NAMES
            params[name] = (value || '')
          end
        end
      end

      return params
    end
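The parsing above can be sketched as a standalone function that takes the Set-Cookie values directly. The names `RESERVED` and `parse_cookie_params` are assumptions for this example; the pattern is copied from RESERVED_COOKIE_NAMES.

```ruby
# Same pattern as RESERVED_COOKIE_NAMES; the constant name is illustrative.
RESERVED = /^(?:Path|Expires|Domain|Secure|HTTPOnly)$/i

# Hypothetical helper: builds a name => value Hash from Set-Cookie values,
# skipping the reserved attribute names.
def parse_cookie_params(set_cookie_values)
  params = {}

  set_cookie_values.each do |header|
    header.split(';').each do |param|
      name, value = param.strip.split('=', 2)
      next if name.nil? || name =~ RESERVED

      params[name] = value || ''
    end
  end

  params
end

p parse_cookie_params(['session=abc123; Path=/; HttpOnly', 'theme=dark; Secure'])
# => {"session"=>"abc123", "theme"=>"dark"}
```

Attributes outside the reserved pattern, such as `Max-Age`, are not filtered and would appear in the result as ordinary keys.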
#cookies ⇒ Array<String>
The Cookie values sent along with the page.
    # File 'lib/spidr/page/cookies.rb', line 32
    def cookies
      (@response.get_fields('Set-Cookie') || [])
    end
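The difference between the raw cookie String and the Array of cookie values comes from how Ruby's Net::HTTP exposes repeated headers. The sketch below builds a response object by hand, with no network access; the `sample_response` helper is hypothetical.

```ruby
require 'net/http'

# Hypothetical helper: a hand-built 200 response with two Set-Cookie headers.
def sample_response
  response = Net::HTTPOK.new('1.1', '200', 'OK')
  response.add_field('Set-Cookie', 'session=abc123; Path=/')
  response.add_field('Set-Cookie', 'theme=dark')
  response
end

response = sample_response

# [] joins repeated headers into one String separated by ", ".
puts response['Set-Cookie']

# get_fields returns one Array entry per header, or nil if the header is absent.
p response.get_fields('Set-Cookie')
p response.get_fields('X-Missing')
```

The `|| ''` and `|| []` fallbacks in the two methods cover the case where the header is absent and these calls return nil.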
#css? ⇒ Boolean
Determines if the page is a CSS stylesheet.
    # File 'lib/spidr/page/content_types.rb', line 172
    def css?
      is_content_type?('text/css')
    end
#directory? ⇒ Boolean
Determines if the page is a Directory Listing.
    # File 'lib/spidr/page/content_types.rb', line 108
    def directory?
      is_content_type?('text/directory')
    end
#doc ⇒ Nokogiri::HTML::Document, ...
Returns a parsed document object for HTML, XML, RSS and Atom pages.
    # File 'lib/spidr/page.rb', line 57
    def doc
      unless body.empty?
        doc_class = if html?
                      Nokogiri::HTML::Document
                    elsif rss? || atom? || xml? || xsl?
                      Nokogiri::XML::Document
                    end

        if doc_class
          begin
            @doc ||= doc_class.parse(body, @url.to_s, content_charset)
          rescue
          end
        end
      end
    end
#each_link {|link| ... } ⇒ Enumerator
Enumerates over every link in the page.
    # File 'lib/spidr/page/html.rb', line 183
    def each_link(&block)
      return enum_for(__method__) unless block_given?

      each_redirect(&block) if is_redirect?

      if (html? && doc)
        doc.search('//a[@href[string()]]').each do |a|
          yield a.get_attribute('href')
        end

        doc.search('//frame[@src[string()]]').each do |frame|
          yield frame.get_attribute('src')
        end

        doc.search('//iframe[@src[string()]]').each do |iframe|
          yield iframe.get_attribute('src')
        end

        doc.search('//link[@href[string()]]').each do |link|
          yield link.get_attribute('href')
        end

        doc.search('//script[@src[string()]]').each do |script|
          yield script.get_attribute('src')
        end
      end
    end
#each_mailto {|link| ... } ⇒ Enumerator
Enumerates over every mailto: link in the page.
    # File 'lib/spidr/page/html.rb', line 147
    def each_mailto
      return enum_for(__method__) unless block_given?

      if (html? && doc)
        doc.search('//a[starts-with(@href,"mailto:")]').each do |a|
          yield a.get_attribute('href')[7..-1]
        end
      end
    end
#each_meta_redirect {|link| ... } ⇒ Enumerator
Enumerates over the meta-redirect links in the page.
    # File 'lib/spidr/page/html.rb', line 38
    def each_meta_redirect
      return enum_for(__method__) unless block_given?

      if (html? && doc)
        search('//meta[@http-equiv and @content]').each do |node|
          if node.get_attribute('http-equiv') =~ /refresh/i
            content = node.get_attribute('content')

            if (redirect = content.match(/url=(\S+)$/))
              yield redirect[1]
            end
          end
        end
      end
    end
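The redirect target is pulled out of the meta tag's content attribute with the regular expression `/url=(\S+)$/`. A minimal sketch of that step on plain Strings follows; the helper name `meta_refresh_target` is an assumption for illustration.

```ruby
# Hypothetical helper: extracts the target from a meta-refresh content value
# using the same pattern as #each_meta_redirect.
def meta_refresh_target(content)
  match = content.match(/url=(\S+)$/)
  match && match[1]
end

p meta_refresh_target('5; url=http://example.com/next')  # => "http://example.com/next"
p meta_refresh_target('30')                              # => nil

# The pattern has no /i flag, so an upper-case "URL=" does not match.
p meta_refresh_target('5; URL=http://example.com/next')  # => nil
```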
#each_redirect {|link| ... } ⇒ Enumerator
Enumerates over every HTTP or meta-redirect link in the page.
    # File 'lib/spidr/page/html.rb', line 108
    def each_redirect(&block)
      return enum_for(__method__) unless block

      locations = @response.get_fields('Location')

      unless (locations.nil? || locations.empty?)
        # Location headers override any meta-refresh redirects in the HTML
        locations.each(&block)
      else
        # check page-level meta redirects if there isn't a location header
        each_meta_redirect(&block)
      end
    end
#each_url {|url| ... } ⇒ Enumerator Also known as: each
Enumerates over every absolute URL in the page.
    # File 'lib/spidr/page/html.rb', line 236
    def each_url
      return enum_for(__method__) unless block_given?

      each_link do |link|
        if (url = to_absolute(link))
          yield url
        end
      end
    end
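The `return enum_for(__method__) unless block_given?` guard used by the `each_*` methods is a common Ruby idiom: without a block the method returns an Enumerator, which is why `#links` and `#urls` can simply call `.to_a`. The `LinkList` class below is a hypothetical stand-in that shows the idiom together with Enumerable.

```ruby
# Hypothetical class (not part of Spidr) demonstrating the enum_for guard.
class LinkList
  include Enumerable

  def initialize(links)
    @links = links
  end

  # Yields each link, or returns an Enumerator when no block is given.
  def each_link
    return enum_for(__method__) unless block_given?

    @links.each { |link| yield link }
  end

  # Enumerable builds map, select, etc. on top of #each.
  alias each each_link
end

list = LinkList.new(['/a', '/b'])

p list.each_link.to_a        # => ["/a", "/b"]
p list.map { |l| l.upcase }  # => ["/A", "/B"]
```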
#gif? ⇒ Boolean
Determines if the page is a GIF image.
    # File 'lib/spidr/page/content_types.rb', line 247
    def gif?
      is_content_type?('image/gif')
    end
#had_internal_server_error? ⇒ Boolean
Determines if the response code is 500.
    # File 'lib/spidr/page/status_codes.rb', line 91
    def had_internal_server_error?
      code == 500
    end
#html? ⇒ Boolean
Determines if the page is an HTML document.
    # File 'lib/spidr/page/content_types.rb', line 118
    def html?
      is_content_type?('text/html')
    end
#ico? ⇒ Boolean Also known as: icon?
Determines if the page is an ICO image.
    # File 'lib/spidr/page/content_types.rb', line 271
    def ico?
      is_content_type?('image/x-icon') ||
      is_content_type?('image/vnd.microsoft.icon')
    end
#is_content_type?(type) ⇒ Boolean
Determines if any of the content-types of the page include a given type.
    # File 'lib/spidr/page/content_types.rb', line 69
    def is_content_type?(type)
      if type.include?('/')
        # match the full media type, ignoring any parameters
        content_types.any? do |value|
          value = value.split(';',2).first

          value == type
        end
      else
        # otherwise only match the sub-type
        content_types.any? do |value|
          value = value.split(';',2).first
          value = value.split('/',2).last

          value == type
        end
      end
    end
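The two matching modes can be tried out with a small standalone sketch: a type containing `/` is compared against the full media type, while a bare name is compared against the sub-type only. The function name `content_type_matches?` is hypothetical.

```ruby
# Hypothetical helper mirroring #is_content_type? on an explicit
# Array of Content-Type header values.
def content_type_matches?(content_types, type)
  content_types.any? do |value|
    # Drop parameters such as "; charset=utf-8".
    media_type = value.split(';', 2).first

    # For a bare name like "html", compare against the sub-type only.
    media_type = media_type.split('/', 2).last unless type.include?('/')

    media_type == type
  end
end

types = ['text/html; charset=utf-8']

p content_type_matches?(types, 'text/html')   # => true
p content_type_matches?(types, 'html')        # => true
p content_type_matches?(types, 'text/plain')  # => false
```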
#is_forbidden? ⇒ Boolean Also known as: forbidden?
Determines if the response code is 403.
    # File 'lib/spidr/page/status_codes.rb', line 55
    def is_forbidden?
      code == 403
    end
#is_missing? ⇒ Boolean Also known as: missing?
Determines if the response code is 404.
    # File 'lib/spidr/page/status_codes.rb', line 67
    def is_missing?
      code == 404
    end
#is_ok? ⇒ Boolean Also known as: ok?
Determines if the response code is 200.
    # File 'lib/spidr/page/status_codes.rb', line 21
    def is_ok?
      code == 200
    end
#is_redirect? ⇒ Boolean Also known as: redirect?
Determines if the response code is 300, 301, 302, 303 or 307. Also checks for "soft" redirects added at the page level by a meta refresh tag.
    # File 'lib/spidr/page/status_codes.rb', line 103
    def is_redirect?
      case code
      when 300..303, 307
        true
      when 200
        meta_redirect?
      else
        false
      end
    end
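As a rough sketch of the status-code check described above, the function below takes the code and a flag standing in for the meta-redirect check. Both the name `redirect_code?` and the `has_meta_redirect` parameter are assumptions for illustration.

```ruby
# Hypothetical helper: has_meta_redirect stands in for #meta_redirect?.
def redirect_code?(code, has_meta_redirect = false)
  case code
  when 300..303, 307
    true
  when 200
    # A 200 response only counts as a redirect via a meta refresh tag.
    has_meta_redirect
  else
    false
  end
end

p redirect_code?(301)        # => true
p redirect_code?(200)        # => false
p redirect_code?(200, true)  # => true
p redirect_code?(304)        # => false
```

Codes outside the listed set, such as 304 or 308, are not treated as redirects by this check.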
#is_timedout? ⇒ Boolean Also known as: timedout?
Determines if the response code is 408.
    # File 'lib/spidr/page/status_codes.rb', line 79
    def is_timedout?
      code == 408
    end
#is_unauthorized? ⇒ Boolean Also known as: unauthorized?
Determines if the response code is 401.
    # File 'lib/spidr/page/status_codes.rb', line 43
    def is_unauthorized?
      code == 401
    end
#javascript? ⇒ Boolean
Determines if the page is JavaScript.
    # File 'lib/spidr/page/content_types.rb', line 149
    def javascript?
      is_content_type?('text/javascript') || \
      is_content_type?('application/javascript')
    end
#jpeg? ⇒ Boolean
Determines if the page is a JPEG image.
    # File 'lib/spidr/page/content_types.rb', line 259
    def jpeg?
      is_content_type?('image/jpeg')
    end
#json? ⇒ Boolean
Determines if the page is JSON.
    # File 'lib/spidr/page/content_types.rb', line 162
    def json?
      is_content_type?('application/json')
    end
#links ⇒ Array<String>
The links from within the page.
    # File 'lib/spidr/page/html.rb', line 218
    def links
      each_link.to_a
    end
#mailtos ⇒ Array<String>
The mailto: links in the page.
    # File 'lib/spidr/page/html.rb', line 165
    def mailtos
      each_mailto.to_a
    end
#meta_redirect ⇒ Array<String>
Deprecated in 0.3.0 and will be removed in 0.4.0. Use #meta_redirects instead.
The meta-redirect links of the page.
    # File 'lib/spidr/page/html.rb', line 87
    def meta_redirect
      warn 'DEPRECATION: Spidr::Page#meta_redirect will be removed in 0.3.0'
      warn 'DEPRECATION: Use Spidr::Page#meta_redirects instead'

      meta_redirects
    end
#meta_redirect? ⇒ Boolean
Determines whether the page contains any page-level meta redirects.
    # File 'lib/spidr/page/html.rb', line 61
    def meta_redirect?
      !each_meta_redirect.first.nil?
    end
#meta_redirects ⇒ Array<String>
The meta-redirect links of the page.
    # File 'lib/spidr/page/html.rb', line 73
    def meta_redirects
      each_meta_redirect.to_a
    end
#ms_word? ⇒ Boolean
Determines if the page is an MS Word document.
    # File 'lib/spidr/page/content_types.rb', line 203
    def ms_word?
      is_content_type?('application/msword')
    end
#pdf? ⇒ Boolean
Determines if the page is a PDF document.
    # File 'lib/spidr/page/content_types.rb', line 213
    def pdf?
      is_content_type?('application/pdf')
    end
#plain_text? ⇒ Boolean Also known as: txt?
Determines if the page is plain-text.
    # File 'lib/spidr/page/content_types.rb', line 94
    def plain_text?
      is_content_type?('text/plain')
    end
#png? ⇒ Boolean
Determines if the page is a PNG image.
    # File 'lib/spidr/page/content_types.rb', line 235
    def png?
      is_content_type?('image/png')
    end
#redirects_to ⇒ Array<String>
URLs that this document redirects to.
    # File 'lib/spidr/page/html.rb', line 129
    def redirects_to
      each_redirect.to_a
    end
#rss? ⇒ Boolean
Determines if the page is an RSS feed.
    # File 'lib/spidr/page/content_types.rb', line 182
    def rss?
      is_content_type?('application/rss+xml') || \
      is_content_type?('application/rdf+xml')
    end
#search(*paths) ⇒ Array Also known as: /
Searches the document for XPath or CSS Path paths.
    # File 'lib/spidr/page.rb', line 90
    def search(*paths)
      if doc
        doc.search(*paths)
      else
        []
      end
    end
#title ⇒ String
The title of the HTML page.
    # File 'lib/spidr/page/html.rb', line 17
    def title
      if (node = at('//title'))
        node.inner_text
      end
    end
#to_absolute(link) ⇒ URI::HTTP
Normalizes and expands a given link into a proper URI.
    # File 'lib/spidr/page/html.rb', line 267
    def to_absolute(link)
      link    = link.to_s
      new_url = begin
                  url.merge(link)
                rescue URI::Error
                  return
                end

      if (!new_url.opaque) && (path = new_url.path)
        # ensure that paths begin with a leading '/' for URI::FTP
        if (new_url.scheme == 'ftp' && !path.start_with?('/'))
          path.insert(0,'/')
        end

        # make sure the path does not contain any .. or . directories,
        # since URI::Generic#merge cannot normalize paths such as
        # "/stuff/../"
        new_url.path = URI.expand_path(path)
      end

      return new_url
    end
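The first step of the method, resolving a link against the page URL with `URI#merge` and giving up on unparseable links, can be sketched with Ruby's standard `uri` library alone. The `resolve` helper is hypothetical and leaves out the FTP and path-normalization steps.

```ruby
require 'uri'

# Hypothetical helper: resolves a link against a base URI and returns nil
# when the link cannot be parsed as a URI.
def resolve(base, link)
  base.merge(link.to_s)
rescue URI::Error
  nil
end

base = URI('http://example.com/docs/index.html')

puts resolve(base, 'page2.html')  # relative to the base directory
puts resolve(base, '/about')      # relative to the host root
p resolve(base, 'bad link')       # contains a space, so parsing fails
```

Returning nil for invalid links is what lets a caller such as `#each_url` skip them without raising.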
#urls ⇒ Array<URI::HTTP>
Absolute URIs from within the page.
    # File 'lib/spidr/page/html.rb', line 254
    def urls
      each_url.to_a
    end
#xml? ⇒ Boolean
Determines if the page is an XML document.
    # File 'lib/spidr/page/content_types.rb', line 128
    def xml?
      is_content_type?('text/xml') || \
      is_content_type?('application/xml')
    end
#xsl? ⇒ Boolean
Determines if the page is an XML Stylesheet (XSL).
    # File 'lib/spidr/page/content_types.rb', line 139
    def xsl?
      is_content_type?('text/xsl')
    end
#zip? ⇒ Boolean
Determines if the page is a ZIP archive.
    # File 'lib/spidr/page/content_types.rb', line 223
    def zip?
      is_content_type?('application/zip')
    end