Class: Spidr::Page

Inherits:
Object
  • Object
Includes:
Enumerable
Defined in:
lib/spidr/page.rb,
lib/spidr/page/html.rb,
lib/spidr/page/cookies.rb,
lib/spidr/page/status_codes.rb,
lib/spidr/page/content_types.rb

Overview

Represents a requested page from a website.

Constant Summary

RESERVED_COOKIE_NAMES = /^(?:Path|Expires|Domain|Secure|HTTPOnly)$/i

Constructor Details

#initialize(url, response) ⇒ Page

Creates a new Page object.

Parameters:

  • url (URI::HTTP)

    The URL of the page.

  • response (Net::HTTPResponse)

    The response from the request for the page.



# File 'lib/spidr/page.rb', line 27

def initialize(url,response)
  @url      = url
  @response = response
  @headers  = response.to_hash
  @doc      = nil
end
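
The constructor stores `response.to_hash` as the headers. The following is a minimal sketch, using only Ruby's standard `net/http` and not Spidr itself, of what that Hash looks like. The `sample_response` helper and its header values are invented for illustration.

```ruby
require 'net/http'

# Hypothetical helper: builds a bare response by hand. A real crawl would
# receive this object from Net::HTTP instead.
def sample_response
  response = Net::HTTPOK.new('1.1', '200', 'OK')
  response['Content-Type'] = 'text/html; charset=UTF-8'
  response.add_field('Set-Cookie', 'session=abc123; Path=/')
  response.add_field('Set-Cookie', 'theme=dark')
  response
end

headers = sample_response.to_hash

# Keys are lower-cased header names; values are Arrays of Strings.
headers.each { |name, values| puts "#{name}: #{values.inspect}" }
```

Because each value is an Array, a header that was sent more than once keeps every occurrence as a separate element.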

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

#method_missing(name, *arguments, &block) ⇒ String (protected)

Provides transparent access to the values in #headers.

Parameters:

  • name (Symbol)

    The name of the missing method.

  • arguments (Array)

    Additional arguments for the missing method.

Returns:

  • (String)

    The missing method mapped to a header in #headers.

Raises:

  • (NoMethodError)

    The missing method did not map to a header in #headers.



# File 'lib/spidr/page.rb', line 136

def method_missing(name,*arguments,&block)
  if (arguments.empty? && block.nil?)
    header_name = name.to_s.tr('_','-')

    if @response.key?(header_name)
      return @response[header_name]
    end
  end

  return super(name,*arguments,&block)
end
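
The mapping from method name to header name is a plain underscore-to-dash substitution. The sketch below reproduces that lookup in a stand-alone class so it can be run without Spidr. `HeaderProxy` is a hypothetical name, and the `respond_to_missing?` override is an addition that the listing above does not include.

```ruby
require 'net/http'

# Hypothetical stand-in for Spidr::Page, reduced to the header lookup.
class HeaderProxy
  def initialize(response)
    @response = response
  end

  def method_missing(name, *arguments, &block)
    if arguments.empty? && block.nil?
      # content_length -> "content-length"
      header_name = name.to_s.tr('_', '-')
      return @response[header_name] if @response.key?(header_name)
    end

    super
  end

  # Not part of the original listing; keeps respond_to? consistent.
  def respond_to_missing?(name, include_private = false)
    @response.key?(name.to_s.tr('_', '-')) || super
  end
end

response = Net::HTTPOK.new('1.1', '200', 'OK')
response['Content-Length'] = '42'

proxy = HeaderProxy.new(response)
puts proxy.content_length
```

A method name that does not correspond to a header in the response falls through to `super` and raises NoMethodError.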

Instance Attribute Details

#headers ⇒ Object (readonly)

Headers returned with the body



# File 'lib/spidr/page.rb', line 16

def headers
  @headers
end

#response ⇒ Object (readonly)

HTTP Response



# File 'lib/spidr/page.rb', line 13

def response
  @response
end

#url ⇒ Object (readonly)

URL of the page



# File 'lib/spidr/page.rb', line 10

def url
  @url
end

Instance Method Details

#at(*arguments) ⇒ Nokogiri::HTML::Node, ... Also known as: %

Searches for the first occurrence of an XPath or CSS Path expression.

Examples:

page.at('//title')

Returns:

  • (Nokogiri::HTML::Node, Nokogiri::XML::Node, nil)

    The first matched node. Returns nil if no nodes could be matched, or if the page is not an HTML or XML document.

# File 'lib/spidr/page.rb', line 110

def at(*arguments)
  if doc
    doc.at(*arguments)
  end
end

#atom? ⇒ Boolean

Determines if the page is an Atom feed.

Returns:

  • (Boolean)

    Specifies whether the page is an Atom feed.



# File 'lib/spidr/page/content_types.rb', line 193

def atom?
  is_content_type?('application/atom+xml')
end

#bad_request? ⇒ Boolean

Determines if the response code is 400.

Returns:

  • (Boolean)

    Specifies whether the response code is 400.



# File 'lib/spidr/page/status_codes.rb', line 33

def bad_request?
  code == 400
end

#body ⇒ String Also known as: to_s

The body of the response.

Returns:

  • (String)

    The body of the response.



# File 'lib/spidr/page.rb', line 40

def body
  (response.body || '')
end

#code ⇒ Integer

The response code from the page.

Returns:

  • (Integer)

    Response code from the page.



# File 'lib/spidr/page/status_codes.rb', line 11

def code
  @response.code.to_i
end

#content_charset ⇒ String?

The charset included in the Content-Type.

Returns:

  • (String, nil)

    The charset of the content.

Since:

  • 0.4.0



# File 'lib/spidr/page/content_types.rb', line 35

def content_charset
  content_types.each do |value|
    if value.include?(';')
      value.split(';').each do |param|
        param.strip!

        if param.start_with?('charset=')
          return param.split('=',2).last
        end
      end
    end
  end

  return nil
end
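
To make the parsing above concrete, here is a stand-alone sketch of the same charset extraction that takes the Content-Type values as an argument instead of reading them from a response. `charset_from` is a hypothetical name and the sample values are invented.

```ruby
# Returns the charset parameter of the first Content-Type value that
# carries one, or nil when none does.
def charset_from(content_types)
  content_types.each do |value|
    value.split(';').each do |param|
      param = param.strip
      return param.split('=', 2).last if param.start_with?('charset=')
    end
  end

  nil
end

puts charset_from(['text/html; charset=UTF-8']).inspect
puts charset_from(['application/json']).inspect
```

The comparison against `'charset='` is case-sensitive, so in this sketch a parameter spelled `Charset=` would not be picked up.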

#content_type ⇒ String

The Content-Type of the page.

Returns:

  • (String)

    The Content-Type of the page.



# File 'lib/spidr/page/content_types.rb', line 11

def content_type
  @response['Content-Type'] || ''
end

#content_types ⇒ Array<String>

The content types of the page.

Returns:

  • (Array<String>)

    The values within the Content-Type header.

Since:

  • 0.2.2



# File 'lib/spidr/page/content_types.rb', line 23

def content_types
  @response.get_fields('content-type') || []
end

#cookie ⇒ String

The raw Cookie String sent along with the page.

Returns:

  • (String)

    The raw Cookie from the response.

Since:

  • 0.2.7



# File 'lib/spidr/page/cookies.rb', line 18

def cookie
  @response['Set-Cookie'] || ''
end

#cookie_params ⇒ Hash{String => String}

The Cookie key -> value pairs returned with the response.

Returns:

  • (Hash{String => String})

    The cookie keys and values.

Since:

  • 0.2.2



# File 'lib/spidr/page/cookies.rb', line 44

def cookie_params
  params = {}

  cookies.each do |value|
    value.split(';').each do |param|
      param.strip!

      name, value = param.split('=',2)

      unless name =~ RESERVED_COOKIE_NAMES
        params[name] = (value || '')
      end
    end
  end

  return params
end
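
The filtering step can be sketched without a response object. The version below takes the Set-Cookie values as an Array and applies the same reserved-name pattern. `RESERVED_ATTRIBUTES` and `cookie_params_from` are hypothetical names chosen for this example, and the cookie strings are invented.

```ruby
# Same pattern as the constant summarized at the top of this page.
RESERVED_ATTRIBUTES = /^(?:Path|Expires|Domain|Secure|HTTPOnly)$/i

# Collects name => value pairs from Set-Cookie values, skipping attributes
# whose names match the reserved pattern.
def cookie_params_from(set_cookie_values)
  params = {}

  set_cookie_values.each do |cookie|
    cookie.split(';').each do |param|
      name, value = param.strip.split('=', 2)
      params[name] = (value || '') unless name =~ RESERVED_ATTRIBUTES
    end
  end

  params
end

puts cookie_params_from(
  ['session=abc123; Path=/; HttpOnly', 'theme=dark; Secure']
).inspect
```

Attributes that the pattern does not list, such as `Max-Age`, are not filtered and would appear in the result as ordinary keys.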

#cookies ⇒ Array<String>

The Cookie values sent along with the page.

Returns:

  • (Array<String>)

    The Cookies from the response.

Since:

  • 0.2.2



# File 'lib/spidr/page/cookies.rb', line 32

def cookies
  (@response.get_fields('Set-Cookie') || [])
end
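
The difference between the raw cookie String and the Array of cookies comes from two accessors on the underlying response. The sketch below shows both using only the standard `net/http` library. The helper name and cookie values are invented for illustration.

```ruby
require 'net/http'

# Hypothetical helper: a response carrying two Set-Cookie headers.
def response_with_two_cookies
  response = Net::HTTPOK.new('1.1', '200', 'OK')
  response.add_field('Set-Cookie', 'session=abc123')
  response.add_field('Set-Cookie', 'theme=dark')
  response
end

response = response_with_two_cookies

# [] joins repeated headers into one String.
puts response['Set-Cookie']

# get_fields keeps one Array element per header.
puts response.get_fields('Set-Cookie').inspect
```

`get_fields` returns nil for a header that is absent, which is why the listings fall back to an empty Array.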

#css? ⇒ Boolean

Determines if the page is a CSS stylesheet.

Returns:

  • (Boolean)

    Specifies whether the page is a CSS stylesheet.



# File 'lib/spidr/page/content_types.rb', line 172

def css?
  is_content_type?('text/css')
end

#directory? ⇒ Boolean

Determines if the page is a Directory Listing.

Returns:

  • (Boolean)

    Specifies whether the page is a Directory Listing.

Since:

  • 0.3.0



# File 'lib/spidr/page/content_types.rb', line 108

def directory?
  is_content_type?('text/directory')
end

#doc ⇒ Nokogiri::HTML::Document, ...

Returns a parsed document object for HTML, XML, RSS and Atom pages.

Returns:

  • (Nokogiri::HTML::Document, Nokogiri::XML::Document, nil)

    The document that represents HTML or XML pages. Returns nil if the page is not HTML, XML, RSS or Atom, or if the page could not be parsed properly.

# File 'lib/spidr/page.rb', line 57

def doc
  unless body.empty?
    doc_class = if html?
                  Nokogiri::HTML::Document
                elsif rss? || atom? || xml? || xsl?
                  Nokogiri::XML::Document
                end

    if doc_class
      begin
        @doc ||= doc_class.parse(body, @url.to_s, content_charset)
      rescue
      end
    end
  end
end

#each_link {|link| ... } ⇒ Enumerator

Enumerates over every link in the page.

Yields:

  • (link)

    The given block will be passed every non-empty link in the page.

Yield Parameters:

  • link (String)

    A link in the page.

Returns:

  • (Enumerator)

    If no block is given, an enumerator object will be returned.

Since:

  • 0.3.0



# File 'lib/spidr/page/html.rb', line 183

def each_link(&block)
  return enum_for(__method__) unless block_given?

  each_redirect(&block) if is_redirect?

  if (html? && doc)
    doc.search('//a[@href[string()]]').each do |a|
      yield a.get_attribute('href')
    end

    doc.search('//frame[@src[string()]]').each do |iframe|
      yield iframe.get_attribute('src')
    end

    doc.search('//iframe[@src[string()]]').each do |iframe|
      yield iframe.get_attribute('src')
    end

    doc.search('//link[@href[string()]]').each do |link|
      yield link.get_attribute('href')
    end

    doc.search('//script[@src[string()]]').each do |script|
      yield script.get_attribute('src')
    end
  end
end

#each_mailto {|link| ... } ⇒ Enumerator

Enumerates over every mailto: link in the page.

Yields:

  • (link)

    The given block will be passed every mailto: link from the page.

Yield Parameters:

  • link (String)

    A mailto: link from the page.

Returns:

  • (Enumerator)

    If no block is given, an enumerator object will be returned.

Since:

  • 0.5.0



# File 'lib/spidr/page/html.rb', line 147

def each_mailto
  return enum_for(__method__) unless block_given?

  if (html? && doc)
    doc.search('//a[starts-with(@href,"mailto:")]').each do |a|
      yield a.get_attribute('href')[7..-1]
    end
  end
end

#each_meta_redirect {|link| ... } ⇒ Enumerator

Enumerates over the meta-redirect links in the page.

Yields:

  • (link)

    If a block is given, it will be passed every meta-redirect link from the page.

Yield Parameters:

  • link (String)

    A meta-redirect link from the page.

Returns:

  • (Enumerator)

    If no block is given, an enumerator object will be returned.

Since:

  • 0.3.0



# File 'lib/spidr/page/html.rb', line 38

def each_meta_redirect
  return enum_for(__method__) unless block_given?

  if (html? && doc)
    search('//meta[@http-equiv and @content]').each do |node|
      if node.get_attribute('http-equiv') =~ /refresh/i
        content = node.get_attribute('content')

        if (redirect = content.match(/url=(\S+)$/))
          yield redirect[1]
        end
      end
    end
  end
end
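
The redirect target is pulled out of the `content` attribute with a single regular expression. The sketch below isolates that step so it can be tried on plain Strings. `meta_refresh_target` is a hypothetical name and the sample attribute values are invented.

```ruby
# Extracts the URL from a meta refresh content attribute, using the same
# pattern as the listing above. Returns nil when nothing matches.
def meta_refresh_target(content)
  match = content.match(/url=(\S+)$/)
  match && match[1]
end

puts meta_refresh_target('0; url=http://example.com/next').inspect
puts meta_refresh_target('30').inspect
```

The pattern has no `i` flag, so in this sketch an attribute written as `URL=...` yields nil.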

#each_redirect {|link| ... } ⇒ Enumerator

Enumerates over every HTTP or meta-redirect link in the page.

Yields:

  • (link)

    The given block will be passed every redirection link from the page.

Yield Parameters:

  • link (String)

    An HTTP or meta-redirect link from the page.

Returns:

  • (Enumerator)

    If no block is given, an enumerator object will be returned.

Since:

  • 0.3.0



# File 'lib/spidr/page/html.rb', line 108

def each_redirect(&block)
  return enum_for(__method__) unless block

  locations = @response.get_fields('Location')

  unless (locations.nil? || locations.empty?)
    # Location headers override any meta-refresh redirects in the HTML
    locations.each(&block)
  else
    # check page-level meta redirects if there isn't a location header
    each_meta_redirect(&block)
  end
end
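
The precedence rule in the comments above, Location headers first and meta redirects only as a fallback, can be sketched as a small function. `redirect_targets` is a hypothetical name, and the meta redirects are passed in as a plain Array rather than parsed from HTML.

```ruby
require 'net/http'

# Returns the Location header values when present; otherwise falls back
# to the supplied page-level meta redirects.
def redirect_targets(response, meta_redirects)
  locations = response.get_fields('Location')

  if locations.nil? || locations.empty?
    meta_redirects
  else
    locations
  end
end

found = Net::HTTPFound.new('1.1', '302', 'Found')
found['Location'] = 'http://example.com/new'

puts redirect_targets(found, ['http://example.com/meta']).inspect
```

With a response that has no Location header, the same call returns the meta redirect list unchanged.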

#each_url {|url| ... } ⇒ Enumerator Also known as: each

Enumerates over every absolute URL in the page.

Yields:

  • (url)

    The given block will be passed every URL in the page.

Yield Parameters:

  • url (URI::HTTP)

    An absolute URL in the page.

Returns:

  • (Enumerator)

    If no block is given, an enumerator object will be returned.

Since:

  • 0.3.0



# File 'lib/spidr/page/html.rb', line 236

def each_url
  return enum_for(__method__) unless block_given?

  each_link do |link|
    if (url = to_absolute(link))
      yield url
    end
  end
end
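
The `each_*` methods on this page share one pattern: return an Enumerator when no block is given, otherwise yield. Since each_url is also known as each and the class includes Enumerable, the usual collection methods become available. A minimal sketch of that pattern follows, using a hypothetical `LinkList` class that stores its links in an Array instead of parsing a page.

```ruby
# Hypothetical container illustrating the enum_for pattern.
class LinkList
  include Enumerable

  def initialize(links)
    @links = links
  end

  def each_link
    # Without a block, hand back an Enumerator over this same method.
    return enum_for(__method__) unless block_given?

    @links.each { |link| yield link }
  end

  # Enumerable only needs #each.
  alias each each_link
end

list = LinkList.new(['/about', '/contact'])

puts list.each_link.to_a.inspect
puts list.map(&:upcase).inspect
```

This is also why the Array-returning methods on this page can be written as a single `each_*.to_a` call.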

#gif? ⇒ Boolean

Determines if the page is a GIF image.

Returns:

  • (Boolean)

    Specifies whether the page is a GIF image.

Since:

  • 0.7.0



# File 'lib/spidr/page/content_types.rb', line 247

def gif?
  is_content_type?('image/gif')
end

#had_internal_server_error? ⇒ Boolean

Determines if the response code is 500.

Returns:

  • (Boolean)

    Specifies whether the response code is 500.



# File 'lib/spidr/page/status_codes.rb', line 91

def had_internal_server_error?
  code == 500
end

#html? ⇒ Boolean

Determines if the page is an HTML document.

Returns:

  • (Boolean)

    Specifies whether the page is an HTML document.



# File 'lib/spidr/page/content_types.rb', line 118

def html?
  is_content_type?('text/html')
end

#ico? ⇒ Boolean Also known as: icon?

Determines if the page is an ICO image.

Returns:

  • (Boolean)

    Specifies whether the page is an ICO image.

Since:

  • 0.7.0



# File 'lib/spidr/page/content_types.rb', line 271

def ico?
  is_content_type?('image/x-icon') ||
    is_content_type?('image/vnd.microsoft.icon')
end

#is_content_type?(type) ⇒ Boolean

Determines if any of the content-types of the page include a given type.

Examples:

Match the Content-Type

page.is_content_type?('application/json')

Match the sub-type of the Content-Type

page.is_content_type?('json')

Parameters:

  • type (String)

    The content-type to test for.

Returns:

  • (Boolean)

    Specifies whether the page includes the given content-type.

Since:

  • 0.4.0



# File 'lib/spidr/page/content_types.rb', line 69

def is_content_type?(type)
  if type.include?('/')
    # match the full content-type, ignoring any parameters
    content_types.any? do |value|
      value = value.split(';',2).first

      value == type
    end
  else
    # otherwise only match the sub-type
    content_types.any? do |value|
      value = value.split(';',2).first
      value = value.split('/',2).last

      value == type
    end
  end
end
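
The two matching modes, full type versus sub-type only, can be sketched as a function that takes the Content-Type values directly. `content_type_matches?` is a hypothetical name and the sample values are invented.

```ruby
# Compares each Content-Type value, with parameters stripped, against the
# requested type. A type without a "/" is compared to the sub-type only.
def content_type_matches?(content_types, type)
  content_types.any? do |value|
    media_type = value.split(';', 2).first
    media_type = media_type.split('/', 2).last unless type.include?('/')

    media_type == type
  end
end

types = ['application/json; charset=utf-8']

puts content_type_matches?(types, 'application/json')
puts content_type_matches?(types, 'json')
puts content_type_matches?(types, 'text/html')
```

The comparison is exact, so a sub-type such as `atom+xml` is not matched by `xml` in this sketch.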

#is_forbidden? ⇒ Boolean Also known as: forbidden?

Determines if the response code is 403.

Returns:

  • (Boolean)

    Specifies whether the response code is 403.



# File 'lib/spidr/page/status_codes.rb', line 55

def is_forbidden?
  code == 403
end

#is_missing? ⇒ Boolean Also known as: missing?

Determines if the response code is 404.

Returns:

  • (Boolean)

    Specifies whether the response code is 404.



# File 'lib/spidr/page/status_codes.rb', line 67

def is_missing?
  code == 404
end

#is_ok? ⇒ Boolean Also known as: ok?

Determines if the response code is 200.

Returns:

  • (Boolean)

    Specifies whether the response code is 200.



# File 'lib/spidr/page/status_codes.rb', line 21

def is_ok?
  code == 200
end

#is_redirect? ⇒ Boolean Also known as: redirect?

Determines if the response code is 300, 301, 302, 303 or 307. Also checks for "soft" redirects added at the page level by a meta refresh tag.

Returns:

  • (Boolean)

    Specifies whether the response code is an HTTP redirect code.



# File 'lib/spidr/page/status_codes.rb', line 103

def is_redirect?
  case code
  when 300..303, 307
    true
  when 200
    meta_redirect?
  else
    false
  end
end
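
The decision table above can be exercised on bare status codes. In the sketch below, `redirect_code?` is a hypothetical name and the meta-redirect check is replaced by a boolean argument.

```ruby
# Mirrors the case expression above: hard redirects by status code, soft
# redirects only for a 200 response that carries a meta refresh.
def redirect_code?(code, has_meta_redirect)
  case code
  when 300..303, 307 then true
  when 200           then has_meta_redirect
  else                    false
  end
end

puts redirect_code?(301, false)
puts redirect_code?(200, true)
puts redirect_code?(404, true)
```

Codes outside the listed set, 308 for example, take the `else` branch and are not treated as redirects.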

#is_timedout? ⇒ Boolean Also known as: timedout?

Determines if the response code is 408.

Returns:

  • (Boolean)

    Specifies whether the response code is 408.



# File 'lib/spidr/page/status_codes.rb', line 79

def is_timedout?
  code == 408
end

#is_unauthorized? ⇒ Boolean Also known as: unauthorized?

Determines if the response code is 401.

Returns:

  • (Boolean)

    Specifies whether the response code is 401.



# File 'lib/spidr/page/status_codes.rb', line 43

def is_unauthorized?
  code == 401
end

#javascript? ⇒ Boolean

Determines if the page is JavaScript.

Returns:

  • (Boolean)

    Specifies whether the page is JavaScript.



# File 'lib/spidr/page/content_types.rb', line 149

def javascript?
  is_content_type?('text/javascript') || \
    is_content_type?('application/javascript')
end

#jpeg? ⇒ Boolean

Determines if the page is a JPEG image.

Returns:

  • (Boolean)

    Specifies whether the page is a JPEG image.

Since:

  • 0.7.0



# File 'lib/spidr/page/content_types.rb', line 259

def jpeg?
  is_content_type?('image/jpeg')
end

#json? ⇒ Boolean

Determines if the page is JSON.

Returns:

  • (Boolean)

    Specifies whether the page is JSON.

Since:

  • 0.3.0



# File 'lib/spidr/page/content_types.rb', line 162

def json?
  is_content_type?('application/json')
end

#links ⇒ Array<String>

The links from within the page.

Returns:

  • (Array<String>)

    All links within the HTML page, frame/iframe source URLs and any links in the Location header.



# File 'lib/spidr/page/html.rb', line 218

def links
  each_link.to_a
end

#mailtos ⇒ Array<String>

mailto: links in the page.

Returns:

  • (Array<String>)

    The mailto: links found within the page.

Since:

  • 0.5.0



# File 'lib/spidr/page/html.rb', line 165

def mailtos
  each_mailto.to_a
end

#meta_redirect ⇒ Array<String>

Deprecated.

Deprecated in 0.3.0 and will be removed in 0.4.0. Use #meta_redirects instead.

The meta-redirect links of the page.

Returns:

  • (Array<String>)

    All meta-redirect links in the page.



# File 'lib/spidr/page/html.rb', line 87

def meta_redirect
  warn 'DEPRECATION: Spidr::Page#meta_redirect will be removed in 0.3.0'
  warn 'DEPRECATION: Use Spidr::Page#meta_redirects instead'

  meta_redirects
end

#meta_redirect? ⇒ Boolean

Returns a boolean indicating whether or not page-level meta redirects are present in this page.

Returns:

  • (Boolean)

    Specifies whether the page includes page-level redirects.



# File 'lib/spidr/page/html.rb', line 61

def meta_redirect?
  !each_meta_redirect.first.nil?
end

#meta_redirects ⇒ Array<String>

The meta-redirect links of the page.

Returns:

  • (Array<String>)

    All meta-redirect links in the page.

Since:

  • 0.3.0



# File 'lib/spidr/page/html.rb', line 73

def meta_redirects
  each_meta_redirect.to_a
end

#ms_word? ⇒ Boolean

Determines if the page is an MS Word document.

Returns:

  • (Boolean)

    Specifies whether the page is an MS Word document.



# File 'lib/spidr/page/content_types.rb', line 203

def ms_word?
  is_content_type?('application/msword')
end

#pdf? ⇒ Boolean

Determines if the page is a PDF document.

Returns:

  • (Boolean)

    Specifies whether the page is a PDF document.



# File 'lib/spidr/page/content_types.rb', line 213

def pdf?
  is_content_type?('application/pdf')
end

#plain_text? ⇒ Boolean Also known as: txt?

Determines if the page is plain-text.

Returns:

  • (Boolean)

    Specifies whether the page is plain-text.



# File 'lib/spidr/page/content_types.rb', line 94

def plain_text?
  is_content_type?('text/plain')
end

#png? ⇒ Boolean

Determines if the page is a PNG image.

Returns:

  • (Boolean)

    Specifies whether the page is a PNG image.

Since:

  • 0.7.0



# File 'lib/spidr/page/content_types.rb', line 235

def png?
  is_content_type?('image/png')
end

#redirects_to ⇒ Array<String>

URLs that this document redirects to.

Returns:

  • (Array<String>)

    The links that this page redirects to (usually found in a location header or by way of a page-level meta redirect).



# File 'lib/spidr/page/html.rb', line 129

def redirects_to
  each_redirect.to_a
end

#rss? ⇒ Boolean

Determines if the page is an RSS feed.

Returns:

  • (Boolean)

    Specifies whether the page is an RSS feed.



# File 'lib/spidr/page/content_types.rb', line 182

def rss?
  is_content_type?('application/rss+xml') || \
    is_content_type?('application/rdf+xml')
end

#search(*paths) ⇒ Array Also known as: /

Searches the document using XPath or CSS Path expressions.

Examples:

page.search('//a[@href]')

Parameters:

  • paths (Array<String>)

    CSS or XPath expressions to search the document with.

Returns:

  • (Array)

    The matched nodes from the document. Returns an empty Array if no nodes were matched, or if the page is not an HTML or XML document.

# File 'lib/spidr/page.rb', line 90

def search(*paths)
  if doc
    doc.search(*paths)
  else
    []
  end
end

#title ⇒ String

The title of the HTML page.

Returns:

  • (String)

    The inner-text of the title element of the page.



# File 'lib/spidr/page/html.rb', line 17

def title
  if (node = at('//title'))
    node.inner_text
  end
end

#to_absolute(link) ⇒ URI::HTTP

Normalizes and expands a given link into a proper URI.

Parameters:

  • link (String)

    The link to normalize and expand.

Returns:

  • (URI::HTTP)

    The normalized URI.



# File 'lib/spidr/page/html.rb', line 267

def to_absolute(link)
  link    = link.to_s
  new_url = begin
              url.merge(link)
            rescue URI::Error
              return
            end

  if (!new_url.opaque) && (path = new_url.path)
    # ensure that paths begin with a leading '/' for URI::FTP
    if (new_url.scheme == 'ftp' && !path.start_with?('/'))
      path.insert(0,'/')
    end

    # make sure the path does not contain any .. or . directories,
    # since URI::Generic#merge cannot normalize paths such as
    # "/stuff/../"
    new_url.path = URI.expand_path(path)
  end

  return new_url
end
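
The core of the method is `URI#merge` with a rescue for malformed links. The sketch below keeps only that part and uses Ruby's standard `uri` library. It omits the FTP handling and the `URI.expand_path` normalization from the listing above, which the standard library does not provide. `absolute_url` is a hypothetical name and the URLs are invented.

```ruby
require 'uri'

# Resolves a link against a base URI. Returns nil when the link cannot be
# parsed, matching the early return in the listing above.
def absolute_url(base, link)
  base.merge(link.to_s)
rescue URI::Error
  nil
end

base = URI('http://example.com/dir/page.html')

puts absolute_url(base, 'other.html')
puts absolute_url(base, '/top')
puts absolute_url(base, 'not a valid link').inspect
```

Returning nil instead of raising lets a caller such as each_url skip unusable links with a simple conditional.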

#urls ⇒ Array<URI::HTTP>

Absolute URIs from within the page.

Returns:

  • (Array<URI::HTTP>)

    The links from within the page, converted to absolute URIs.



# File 'lib/spidr/page/html.rb', line 254

def urls
  each_url.to_a
end

#xml? ⇒ Boolean

Determines if the page is an XML document.

Returns:

  • (Boolean)

    Specifies whether the page is an XML document.



# File 'lib/spidr/page/content_types.rb', line 128

def xml?
  is_content_type?('text/xml') || \
    is_content_type?('application/xml')
end

#xsl? ⇒ Boolean

Determines if the page is an XML Stylesheet (XSL).

Returns:

  • (Boolean)

    Specifies whether the page is an XML Stylesheet (XSL).



# File 'lib/spidr/page/content_types.rb', line 139

def xsl?
  is_content_type?('text/xsl')
end

#zip? ⇒ Boolean

Determines if the page is a ZIP archive.

Returns:

  • (Boolean)

    Specifies whether the page is a ZIP archive.



# File 'lib/spidr/page/content_types.rb', line 223

def zip?
  is_content_type?('application/zip')
end