Class: Spidr::Page

Inherits: Object

Includes: Enumerable

Defined in: lib/spidr/page.rb,
lib/spidr/page/html.rb,
lib/spidr/page/cookies.rb,
lib/spidr/page/status_codes.rb,
lib/spidr/page/content_types.rb

Overview

Represents a requested page from a website.

Constant Summary

RESERVED_COOKIE_NAMES = /^(?:Path|Expires|Domain|Secure|HTTPOnly)$/i
Reserved names used within Cookie strings.

Instance Attribute Summary

#headers ⇒ Object readonly
Headers returned with the body.
#response ⇒ Object readonly
HTTP Response.
#url ⇒ Object readonly
URL of the page.

Instance Method Summary

#at(*arguments) ⇒ Nokogiri::HTML::Node, ... (also: #%)
Searches for the first occurrence of an XPath or CSS Path expression.
#atom? ⇒ Boolean
Determines if the page is an Atom feed.
#bad_request? ⇒ Boolean
Determines if the response code is 400.
#body ⇒ String (also: #to_s)
The body of the response.
#code ⇒ Integer
The response code from the page.
#content_charset ⇒ String?
The charset included in the Content-Type.
#content_type ⇒ String
The Content-Type of the page.
#content_types ⇒ Array<String>
The content types of the page.
#cookie ⇒ String (also: #raw_cookie)
The raw Cookie String sent along with the page.
#cookie_params ⇒ Hash{String => String}
The Cookie key -> value pairs returned with the response.
#cookies ⇒ Array<String>
The Cookie values sent along with the page.
#css? ⇒ Boolean
Determines if the page is a CSS stylesheet.
#directory? ⇒ Boolean
Determines if the page is a Directory Listing.
#doc ⇒ Nokogiri::HTML::Document, ...
Returns a parsed document object for HTML, XML, RSS and Atom pages.
#each_link {|link| ... } ⇒ Enumerator
Enumerates over every link in the page.
#each_mailto {|link| ... } ⇒ Enumerator
Enumerates over every mailto: link in the page.
#each_meta_redirect {|link| ... } ⇒ Enumerator
Enumerates over the meta-redirect links in the page.
#each_redirect {|link| ... } ⇒ Enumerator
Enumerates over every HTTP or meta-redirect link in the page.
#each_url {|url| ... } ⇒ Enumerator (also: #each)
Enumerates over every absolute URL in the page.
#gif? ⇒ Boolean
Determines if the page is a GIF image.
#had_internal_server_error? ⇒ Boolean
Determines if the response code is 500.
#html? ⇒ Boolean
Determines if the page is an HTML document.
#ico? ⇒ Boolean (also: #icon?)
Determines if the page is an ICO image.
#initialize(url, response) ⇒ Page constructor
Creates a new Page object.
#is_content_type?(type) ⇒ Boolean
Determines if any of the content-types of the page include a given type.
#is_forbidden? ⇒ Boolean (also: #forbidden?)
Determines if the response code is 403.
#is_missing? ⇒ Boolean (also: #missing?)
Determines if the response code is 404.
#is_ok? ⇒ Boolean (also: #ok?)
Determines if the response code is 200.
#is_redirect? ⇒ Boolean (also: #redirect?)
Determines if the response code is 300, 301, 302, 303 or 307.
#is_timedout? ⇒ Boolean (also: #timedout?)
Determines if the response code is 408.
#is_unauthorized? ⇒ Boolean (also: #unauthorized?)
Determines if the response code is 401.
#javascript? ⇒ Boolean
Determines if the page is JavaScript.
#jpeg? ⇒ Boolean
Determines if the page is a JPEG image.
#json? ⇒ Boolean
Determines if the page is JSON.
#links ⇒ Array<String>
The links from within the page.
#mailtos ⇒ Array<String>
mailto: links in the page.
#meta_redirect ⇒ Array<String> (deprecated)
Deprecated in 0.3.0 and will be removed in 0.4.0. Use #meta_redirects instead.
#meta_redirect? ⇒ Boolean
Returns a boolean indicating whether or not page-level meta redirects are present in this page.
#meta_redirects ⇒ Array<String>
The meta-redirect links of the page.
#method_missing(name, *arguments, &block) ⇒ String protected
Provides transparent access to the values in #headers.
#ms_word? ⇒ Boolean
Determines if the page is an MS Word document.
#pdf? ⇒ Boolean
Determines if the page is a PDF document.
#plain_text? ⇒ Boolean (also: #txt?)
Determines if the page is plain-text.
#png? ⇒ Boolean
Determines if the page is a PNG image.
#redirects_to ⇒ Array<String>
URLs that this document redirects to.
#rss? ⇒ Boolean
Determines if the page is an RSS feed.
#search(*paths) ⇒ Array (also: #/)
Searches the document with XPath or CSS Path expressions.
#title ⇒ String
The title of the HTML page.
#to_absolute(link) ⇒ URI::HTTP
Normalizes and expands a given link into a proper URI.
#urls ⇒ Array<URI::HTTP>
Absolute URIs from within the page.
#xml? ⇒ Boolean
Determines if the page is an XML document.
#xsl? ⇒ Boolean
Determines if the page is an XML Stylesheet (XSL).
#zip? ⇒ Boolean
Determines if the page is a ZIP archive.

Constructor Details

#initialize(url, response) ⇒ `Page`

Creates a new Page object.

Parameters:

url (URI::HTTP) —
The URL of the page.
response (Net::HTTPResponse) —
The response from the request for the page.

# File 'lib/spidr/page.rb', line 25

def initialize(url,response)
  @url      = url
  @response = response
  @headers  = response.to_hash
  @doc      = nil
end

Dynamic Method Handling

This class handles dynamic methods through the method_missing method.

#method_missing(name, *arguments, &block) ⇒ `String` (protected)

Provides transparent access to the values in #headers.

Parameters:

name (Symbol) —
The name of the missing method.
arguments (Array) —
Additional arguments for the missing method.

Returns:

(String) —
The missing method mapped to a header in #headers.

Raises:

(NoMethodError) —
The missing method did not map to a header in #headers.

# File 'lib/spidr/page.rb', line 134

def method_missing(name,*arguments,&block)
  if (arguments.empty? && block.nil?)
    header_name = name.to_s.tr('_','-')

    if @response.key?(header_name)
      return @response[header_name]
    end
  end

  return super(name,*arguments,&block)
end
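
The header lookup above can be tried outside of Spidr with a bare Net::HTTPResponse. This is a minimal sketch: `HeaderProxy` is a hypothetical stand-in class, not part of Spidr, and the `respond_to_missing?` method is an addition for the sketch that the listing above does not define.

```ruby
require 'net/http'

# Hypothetical stand-in that copies the header lookup shown above.
class HeaderProxy
  def initialize(response)
    @response = response
  end

  # Added for this sketch so that respond_to? agrees with method_missing.
  def respond_to_missing?(name, include_private = false)
    @response.key?(name.to_s.tr('_', '-')) || super
  end

  protected

  def method_missing(name, *arguments, &block)
    if arguments.empty? && block.nil?
      # content_length becomes "content-length"
      header_name = name.to_s.tr('_', '-')
      return @response[header_name] if @response.key?(header_name)
    end

    # No matching header: fall through to the usual NoMethodError.
    super
  end
end

response = Net::HTTPOK.new('1.1', '200', 'OK')
response['Content-Length'] = '42'

proxy = HeaderProxy.new(response)
proxy.content_length # => "42"
```

Header names are matched case-insensitively by Net::HTTPResponse, so only the underscore-to-hyphen translation is needed.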

Instance Attribute Details

#headers ⇒ `Object` (readonly)

Headers returned with the body.

# File 'lib/spidr/page.rb', line 14

def headers
  @headers
end

#response ⇒ `Object` (readonly)

HTTP Response.

# File 'lib/spidr/page.rb', line 11

def response
  @response
end

#url ⇒ `Object` (readonly)

URL of the page.

# File 'lib/spidr/page.rb', line 8

def url
  @url
end

Instance Method Details

#at(*arguments) ⇒ `Nokogiri::HTML::Node`, ... Also known as: %

Searches for the first occurrence of an XPath or CSS Path expression.

Examples:

page.at('//title')

Returns:

(Nokogiri::HTML::Node, Nokogiri::XML::Node, nil) —
The first matched node. Returns nil if no nodes could be matched, or if the page is not an HTML or XML document.

#atom? ⇒ `Boolean`

Determines if the page is an Atom feed.

Returns:

(Boolean) —
Specifies whether the page is an Atom feed.




# File 'lib/spidr/page/content_types.rb', line 191

def atom?
  is_content_type?('application/atom+xml')
end

#bad_request? ⇒ `Boolean`

Determines if the response code is 400.

Returns:

(Boolean) —
Specifies whether the response code is 400.




# File 'lib/spidr/page/status_codes.rb', line 31

def bad_request?
  code == 400
end

#body ⇒ `String` Also known as: to_s

The body of the response.

Returns:

(String) —
The body of the response.




# File 'lib/spidr/page.rb', line 38

def body
  (response.body || '')
end

#code ⇒ `Integer`

The response code from the page.

Returns:

(Integer) —
Response code from the page.




# File 'lib/spidr/page/status_codes.rb', line 9

def code
  @response.code.to_i
end

#content_charset ⇒ `String`?

The charset included in the Content-Type.

Returns:

(String, nil) —
The charset of the content.

Since:

0.4.0

# File 'lib/spidr/page/content_types.rb', line 33

def content_charset
  content_types.each do |value|
    if value.include?(';')
      value.split(';').each do |param|
        param.strip!

        if param.start_with?('charset=')
          return param.split('=',2).last
        end
      end
    end
  end

  return nil
end
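
The parsing steps above can be exercised without a response object. In this sketch `charset_from` is a hypothetical helper (not part of Spidr) that takes the Content-Type values directly and otherwise follows the same logic.

```ruby
# Hypothetical helper mirroring the logic of #content_charset.
def charset_from(content_types)
  content_types.each do |value|
    # Only values that carry parameters can contain a charset.
    next unless value.include?(';')

    value.split(';').each do |param|
      param = param.strip
      return param.split('=', 2).last if param.start_with?('charset=')
    end
  end

  nil
end

charset_from(['text/html; charset=UTF-8']) # => "UTF-8"
charset_from(['text/html'])                # => nil
```

The match on `charset=` is case-sensitive and the value is returned as written, so a quoted charset keeps its quotes.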

#content_type ⇒ `String`

The Content-Type of the page.

Returns:

(String) —
The Content-Type of the page.




# File 'lib/spidr/page/content_types.rb', line 9

def content_type
  @response['Content-Type'] || ''
end

#content_types ⇒ `Array<String>`

The content types of the page.

Returns:

(Array<String>) —
The values within the Content-Type header.

Since:

0.2.2




# File 'lib/spidr/page/content_types.rb', line 21

def content_types
  @response.get_fields('content-type') || []
end

#cookie ⇒ `String` Also known as: raw_cookie

The raw Cookie String sent along with the page.

Returns:

(String) —
The raw Cookie from the response.

Since:

0.2.7




# File 'lib/spidr/page/cookies.rb', line 16

def cookie
  @response['Set-Cookie'] || ''
end

#cookie_params ⇒ `Hash{String => String}`

The Cookie key -> value pairs returned with the response.

Returns:

(Hash{String => String}) —
The cookie keys and values.

Since:

0.2.2

# File 'lib/spidr/page/cookies.rb', line 42

def cookie_params
  params = {}

  cookies.each do |value|
    value.split(';').each do |param|
      param.strip!

      name, value = param.split('=',2)

      unless name =~ RESERVED_COOKIE_NAMES
        params[name] = (value || '')
      end
    end
  end

  return params
end
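
A standalone sketch of the same filtering, for illustration only. `cookie_params_from` and `RESERVED_NAMES` are hypothetical names; the helper takes the Set-Cookie values as an Array instead of reading them from the response.

```ruby
# Same pattern as RESERVED_COOKIE_NAMES, under a different name for the sketch.
RESERVED_NAMES = /^(?:Path|Expires|Domain|Secure|HTTPOnly)$/i

# Hypothetical helper mirroring the logic of #cookie_params.
def cookie_params_from(cookies)
  params = {}

  cookies.each do |cookie|
    cookie.split(';').each do |param|
      name, value = param.strip.split('=', 2)

      # Skip cookie attributes; keep real name/value pairs.
      params[name] = (value || '') unless name =~ RESERVED_NAMES
    end
  end

  params
end

cookie_params_from(['sid=abc123; Path=/; HttpOnly', 'theme=dark; Secure'])
# => {"sid"=>"abc123", "theme"=>"dark"}
```

Only the five names in the pattern are filtered, so an attribute such as Max-Age would come back as if it were a cookie value.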

#cookies ⇒ `Array<String>`

The Cookie values sent along with the page.

Returns:

(Array<String>) —
The Cookies from the response.

Since:

0.2.2




# File 'lib/spidr/page/cookies.rb', line 30

def cookies
  (@response.get_fields('Set-Cookie') || [])
end

#css? ⇒ `Boolean`

Determines if the page is a CSS stylesheet.

Returns:

(Boolean) —
Specifies whether the page is a CSS stylesheet.




# File 'lib/spidr/page/content_types.rb', line 170

def css?
  is_content_type?('text/css')
end

#directory? ⇒ `Boolean`

Determines if the page is a Directory Listing.

Returns:

(Boolean) —
Specifies whether the page is a Directory Listing.

Since:

0.3.0




# File 'lib/spidr/page/content_types.rb', line 106

def directory?
  is_content_type?('text/directory')
end

#doc ⇒ `Nokogiri::HTML::Document`, ...

Returns a parsed document object for HTML, XML, RSS and Atom pages.

Returns:

(Nokogiri::HTML::Document, Nokogiri::XML::Document, nil) —
The document that represents HTML or XML pages. Returns nil if the page is not HTML, XML, RSS or Atom, or if the page could not be parsed properly.

#each_link {|link| ... } ⇒ `Enumerator`

Enumerates over every link in the page.

Yields:

(link) —
The given block will be passed every non-empty link in the page.

Yield Parameters:

link (String) —
A link in the page.

Returns:

(Enumerator) —
If no block is given, an enumerator object will be returned.

Since:

0.3.0

# File 'lib/spidr/page/html.rb', line 180

def each_link(&block)
  return enum_for(__method__) unless block_given?

  each_redirect(&block) if is_redirect?

  if (html? && doc)
    doc.search('//a[@href[string()]]').each do |a|
      yield a.get_attribute('href')
    end

    doc.search('//frame[@src[string()]]').each do |iframe|
      yield iframe.get_attribute('src')
    end

    doc.search('//iframe[@src[string()]]').each do |iframe|
      yield iframe.get_attribute('src')
    end

    doc.search('//link[@href[string()]]').each do |link|
      yield link.get_attribute('href')
    end

    doc.search('//script[@src[string()]]').each do |script|
      yield script.get_attribute('src')
    end
  end
end

#each_mailto {|link| ... } ⇒ `Enumerator`

Enumerates over every mailto: link in the page.

Yields:

(link) —
The given block will be passed every mailto: link from the page.

Yield Parameters:

link (String) —
A mailto: link from the page.

Returns:

(Enumerator) —
If no block is given, an enumerator object will be returned.

Since:

0.5.0

# File 'lib/spidr/page/html.rb', line 144

def each_mailto
  return enum_for(__method__) unless block_given?

  if (html? && doc)
    doc.search('//a[starts-with(@href,"mailto:")]').each do |a|
      yield a.get_attribute('href')[7..-1]
    end
  end
end

#each_meta_redirect {|link| ... } ⇒ `Enumerator`

Enumerates over the meta-redirect links in the page.

Yields:

(link) —
If a block is given, it will be passed every meta-redirect link from the page.

Yield Parameters:

link (String) —
A meta-redirect link from the page.

Returns:

(Enumerator) —
If no block is given, an enumerator object will be returned.

Since:

0.3.0

# File 'lib/spidr/page/html.rb', line 35

def each_meta_redirect
  return enum_for(__method__) unless block_given?

  if (html? && doc)
    search('//meta[@http-equiv and @content]').each do |node|
      if node.get_attribute('http-equiv') =~ /refresh/i
        content = node.get_attribute('content')

        if (redirect = content.match(/url=(\S+)$/))
          yield redirect[1]
        end
      end
    end
  end
end
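
To see what the pattern in the listing above extracts, it can be applied to raw `content` attribute values without parsing any HTML. `META_URL` and `meta_redirect_target` are hypothetical names used only in this sketch.

```ruby
# The pattern from the listing above, applied to a content attribute value.
META_URL = /url=(\S+)$/

# Hypothetical helper: returns the redirect target, or nil if there is none.
def meta_redirect_target(content)
  match = content.match(META_URL)
  match && match[1]
end

meta_redirect_target('5; url=http://example.com/next') # => "http://example.com/next"
meta_redirect_target('5; URL=http://example.com/next') # => nil
meta_redirect_target('30')                             # => nil
```

The `http-equiv` check is case-insensitive, but the `url=` pattern is not, which is why the second call finds nothing.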

#each_redirect {|link| ... } ⇒ `Enumerator`

Enumerates over every HTTP or meta-redirect link in the page.

Yields:

(link) —
The given block will be passed every redirection link from the page.

Yield Parameters:

link (String) —
An HTTP or meta-redirect link from the page.

Returns:

(Enumerator) —
If no block is given, an enumerator object will be returned.

Since:

0.3.0

# File 'lib/spidr/page/html.rb', line 105

def each_redirect(&block)
  return enum_for(__method__) unless block

  locations = @response.get_fields('Location')

  unless (locations.nil? || locations.empty?)
    # Location headers override any meta-refresh redirects in the HTML
    locations.each(&block)
  else
    # check page-level meta redirects if there isn't a location header
    each_meta_redirect(&block)
  end
end

#each_url {|url| ... } ⇒ `Enumerator` Also known as: each

Enumerates over every absolute URL in the page.

Yields:

(url) —
The given block will be passed every URL in the page.

Yield Parameters:

url (URI::HTTP) —
An absolute URL in the page.

Returns:

(Enumerator) —
If no block is given, an enumerator object will be returned.

Since:

0.3.0

# File 'lib/spidr/page/html.rb', line 233

def each_url
  return enum_for(__method__) unless block_given?

  each_link do |link|
    if (url = to_absolute(link))
      yield url
    end
  end
end
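
The each_* methods on this page share one pattern: return an Enumerator when no block is given, otherwise yield. The sketch below shows that pattern in isolation. `LinkList` is a hypothetical class, not part of Spidr; it also aliases `each` and includes Enumerable, as the page does for each_url.

```ruby
# Hypothetical class showing the block-or-Enumerator pattern.
class LinkList
  include Enumerable

  def initialize(links)
    @links = links
  end

  def each_link
    # __method__ is the name this method was defined with (:each_link).
    return enum_for(__method__) unless block_given?

    @links.each { |link| yield link unless link.empty? }
  end

  alias each each_link
end

list = LinkList.new(['/a', '', '/b'])

list.each_link.to_a # => ["/a", "/b"]
list.map(&:upcase)  # => ["/A", "/B"]
```

Because `each` is defined, Enumerable methods such as `map`, `select` and `first` work on the object directly.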

#gif? ⇒ `Boolean`

Determines if the page is a GIF image.

Returns:

(Boolean) —
Specifies whether the page is a GIF image.

Since:

0.7.0




# File 'lib/spidr/page/content_types.rb', line 245

def gif?
  is_content_type?('image/gif')
end

#had_internal_server_error? ⇒ `Boolean`

Determines if the response code is 500.

Returns:

(Boolean) —
Specifies whether the response code is 500.




# File 'lib/spidr/page/status_codes.rb', line 89

def had_internal_server_error?
  code == 500
end

#html? ⇒ `Boolean`

Determines if the page is an HTML document.

Returns:

(Boolean) —
Specifies whether the page is an HTML document.

# File 'lib/spidr/page/content_types.rb', line 116

def html?
  is_content_type?('text/html')
end

#ico? ⇒ `Boolean` Also known as: icon?

Determines if the page is an ICO image.

Returns:

(Boolean) —
Specifies whether the page is an ICO image.

Since:

0.7.0

# File 'lib/spidr/page/content_types.rb', line 269

def ico?
  is_content_type?('image/x-icon') ||
    is_content_type?('image/vnd.microsoft.icon')
end

#is_content_type?(type) ⇒ `Boolean`

Determines if any of the content-types of the page include a given type.

Examples:

Match the Content-Type

page.is_content_type?('application/json')

Match the sub-type of the Content-Type

page.is_content_type?('json')

Parameters:

type (String) —
The content-type to test for.

Returns:

(Boolean) —
Specifies whether the page includes the given content-type.

Since:

0.4.0

# File 'lib/spidr/page/content_types.rb', line 67

def is_content_type?(type)
  if type.include?('/')
    # match the full media type, ignoring any parameters
    content_types.any? do |value|
      value = value.split(';',2).first

      value == type
    end
  else
    # otherwise only match the sub-type
    content_types.any? do |value|
      value = value.split(';',2).first
      value = value.split('/',2).last

      value == type
    end
  end
end
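
The two branches above differ only in whether the sub-type is split off before comparing. A compact sketch of the same comparison, using a hypothetical `content_type_match?` helper that takes the Content-Type values as an argument:

```ruby
# Hypothetical helper mirroring the logic of #is_content_type?.
def content_type_match?(content_types, type)
  content_types.any? do |value|
    # Drop parameters such as "; charset=utf-8".
    media_type = value.split(';', 2).first

    # Without a "/" in the query, compare against the sub-type only.
    media_type = media_type.split('/', 2).last unless type.include?('/')

    media_type == type
  end
end

types = ['application/json; charset=utf-8']

content_type_match?(types, 'application/json') # => true
content_type_match?(types, 'json')             # => true
content_type_match?(types, 'text/html')        # => false
```

The comparison is exact, so a sub-type query of `'xml'` would not match `application/atom+xml`.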

#is_forbidden? ⇒ `Boolean` Also known as: forbidden?

Determines if the response code is 403.

Returns:

(Boolean) —
Specifies whether the response code is 403.




# File 'lib/spidr/page/status_codes.rb', line 53

def is_forbidden?
  code == 403
end

#is_missing? ⇒ `Boolean` Also known as: missing?

Determines if the response code is 404.

Returns:

(Boolean) —
Specifies whether the response code is 404.




# File 'lib/spidr/page/status_codes.rb', line 65

def is_missing?
  code == 404
end

#is_ok? ⇒ `Boolean` Also known as: ok?

Determines if the response code is 200.

Returns:

(Boolean) —
Specifies whether the response code is 200.




# File 'lib/spidr/page/status_codes.rb', line 19

def is_ok?
  code == 200
end

#is_redirect? ⇒ `Boolean` Also known as: redirect?

Determines if the response code is 300, 301, 302, 303 or 307. Also checks for "soft" redirects added at the page level by a meta refresh tag.

Returns:

(Boolean) —
Specifies whether the response code is an HTTP redirect code.

# File 'lib/spidr/page/status_codes.rb', line 101

def is_redirect?
  case code
  when 300..303, 307
    true
  when 200
    meta_redirect?
  else
    false
  end
end
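
The decision table above can be checked with plain integers. In this sketch `redirect_code?` is a hypothetical helper, and the `meta_redirect:` keyword stands in for the result of #meta_redirect?.

```ruby
# Hypothetical helper mirroring the case statement in #is_redirect?.
def redirect_code?(code, meta_redirect: false)
  case code
  when 300..303, 307 then true
  when 200 then meta_redirect # "soft" redirect via a meta refresh tag
  else false
  end
end

redirect_code?(301)                      # => true
redirect_code?(304)                      # => false
redirect_code?(308)                      # => false
redirect_code?(200, meta_redirect: true) # => true
```

As written, 304 and 308 fall through to the `else` branch, and a meta refresh only counts when the response code is 200.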

#is_timedout? ⇒ `Boolean` Also known as: timedout?

Determines if the response code is 408.

Returns:

(Boolean) —
Specifies whether the response code is 408.




# File 'lib/spidr/page/status_codes.rb', line 77

def is_timedout?
  code == 408
end

#is_unauthorized? ⇒ `Boolean` Also known as: unauthorized?

Determines if the response code is 401.

Returns:

(Boolean) —
Specifies whether the response code is 401.




# File 'lib/spidr/page/status_codes.rb', line 41

def is_unauthorized?
  code == 401
end

#javascript? ⇒ `Boolean`

Determines if the page is JavaScript.

Returns:

(Boolean) —
Specifies whether the page is JavaScript.

# File 'lib/spidr/page/content_types.rb', line 147

def javascript?
  is_content_type?('text/javascript') || \
    is_content_type?('application/javascript')
end

#jpeg? ⇒ `Boolean`

Determines if the page is a JPEG image.

Returns:

(Boolean) —
Specifies whether the page is a JPEG image.

Since:

0.7.0




# File 'lib/spidr/page/content_types.rb', line 257

def jpeg?
  is_content_type?('image/jpeg')
end

#json? ⇒ `Boolean`

Determines if the page is JSON.

Returns:

(Boolean) —
Specifies whether the page is JSON.

Since:

0.3.0




# File 'lib/spidr/page/content_types.rb', line 160

def json?
  is_content_type?('application/json')
end

#links ⇒ `Array<String>`

The links from within the page.

Returns:

(Array<String>) —
All links within the HTML page, frame/iframe source URLs and any links in the Location header.




# File 'lib/spidr/page/html.rb', line 215

def links
  each_link.to_a
end

#mailtos ⇒ `Array<String>`

mailto: links in the page.

Returns:

(Array<String>) —
The mailto: links found within the page.

Since:

0.5.0




# File 'lib/spidr/page/html.rb', line 162

def mailtos
  each_mailto.to_a
end

#meta_redirect ⇒ `Array<String>`

Deprecated.

Deprecated in 0.3.0 and will be removed in 0.4.0. Use #meta_redirects instead.

The meta-redirect links of the page.

Returns:

(Array<String>) —
All meta-redirect links in the page.

# File 'lib/spidr/page/html.rb', line 84

def meta_redirect
  warn 'DEPRECATION: Spidr::Page#meta_redirect will be removed in 0.3.0'
  warn 'DEPRECATION: Use Spidr::Page#meta_redirects instead'

  meta_redirects
end

#meta_redirect? ⇒ `Boolean`

Returns a boolean indicating whether or not page-level meta redirects are present in this page.

Returns:

(Boolean) —
Specifies whether the page includes page-level redirects.




# File 'lib/spidr/page/html.rb', line 58

def meta_redirect?
  !each_meta_redirect.first.nil?
end

#meta_redirects ⇒ `Array<String>`

The meta-redirect links of the page.

Returns:

(Array<String>) —
All meta-redirect links in the page.

Since:

0.3.0




# File 'lib/spidr/page/html.rb', line 70

def meta_redirects
  each_meta_redirect.to_a
end

#ms_word? ⇒ `Boolean`

Determines if the page is an MS Word document.

Returns:

(Boolean) —
Specifies whether the page is an MS Word document.

# File 'lib/spidr/page/content_types.rb', line 201

def ms_word?
  is_content_type?('application/msword')
end

#pdf? ⇒ `Boolean`

Determines if the page is a PDF document.

Returns:

(Boolean) —
Specifies whether the page is a PDF document.




# File 'lib/spidr/page/content_types.rb', line 211

def pdf?
  is_content_type?('application/pdf')
end

#plain_text? ⇒ `Boolean` Also known as: txt?

Determines if the page is plain-text.

Returns:

(Boolean) —
Specifies whether the page is plain-text.




# File 'lib/spidr/page/content_types.rb', line 92

def plain_text?
  is_content_type?('text/plain')
end

#png? ⇒ `Boolean`

Determines if the page is a PNG image.

Returns:

(Boolean) —
Specifies whether the page is a PNG image.

Since:

0.7.0




# File 'lib/spidr/page/content_types.rb', line 233

def png?
  is_content_type?('image/png')
end

#redirects_to ⇒ `Array<String>`

URLs that this document redirects to.

Returns:

(Array<String>) —
The links that this page redirects to (usually found in a location header or by way of a page-level meta redirect).




# File 'lib/spidr/page/html.rb', line 126

def redirects_to
  each_redirect.to_a
end

#rss? ⇒ `Boolean`

Determines if the page is an RSS feed.

Returns:

(Boolean) —
Specifies whether the page is an RSS feed.

# File 'lib/spidr/page/content_types.rb', line 180

def rss?
  is_content_type?('application/rss+xml') || \
    is_content_type?('application/rdf+xml')
end

#search(*paths) ⇒ `Array` Also known as: /

Searches the document with XPath or CSS Path expressions.

Examples:

page.search('//a[@href]')

Parameters:

paths (Array<String>) —
CSS or XPath expressions to search the document with.

Returns:

(Array) —
The matched nodes from the document. Returns an empty Array if no nodes were matched, or if the page is not an HTML or XML document.

#title ⇒ `String`

The title of the HTML page.

Returns:

(String) —
The inner-text of the title element of the page.

# File 'lib/spidr/page/html.rb', line 14

def title
  if (node = at('//title'))
    node.inner_text
  end
end

#to_absolute(link) ⇒ `URI::HTTP`

Normalizes and expands a given link into a proper URI.

Parameters:

link (String) —
The link to normalize and expand.

Returns:

(URI::HTTP) —
The normalized URI.

# File 'lib/spidr/page/html.rb', line 264

def to_absolute(link)
  link    = link.to_s
  new_url = begin
              url.merge(link)
            rescue Exception
              return
            end

  if (!new_url.opaque) && (path = new_url.path)
    # ensure that paths begin with a leading '/' for URI::FTP
    if (new_url.scheme == 'ftp' && !path.start_with?('/'))
      path.insert(0,'/')
    end

    # make sure the path does not contain any .. or . directories,
    # since URI::Generic#merge cannot normalize paths such as
    # "/stuff/../"
    new_url.path = URI.expand_path(path)
  end

  return new_url
end
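
The first half of the method, merging a link against the page URL and giving up on unparsable links, can be sketched with Ruby's standard uri library alone. `absolute_or_nil` is a hypothetical helper. It rescues the narrower URI::Error rather than Exception, and it leaves out the URI.expand_path step from the listing above, since that method is not part of the standard uri library.

```ruby
require 'uri'

# Hypothetical helper: resolve a link against a base URI, or return nil
# when the link cannot be parsed.
def absolute_or_nil(base, link)
  base.merge(link.to_s)
rescue URI::Error
  nil
end

base = URI('http://example.com/docs/guide/index.html')

absolute_or_nil(base, '../api/page.html').to_s # => "http://example.com/docs/api/page.html"
absolute_or_nil(base, '/top.html').to_s        # => "http://example.com/top.html"
absolute_or_nil(base, 'bad link with spaces')  # => nil
```

Returning nil for bad links is what lets #each_url skip them silently.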

#urls ⇒ `Array<URI::HTTP>`

Absolute URIs from within the page.

Returns:

(Array<URI::HTTP>) —
The links from within the page, converted to absolute URIs.




# File 'lib/spidr/page/html.rb', line 251

def urls
  each_url.to_a
end

#xml? ⇒ `Boolean`

Determines if the page is an XML document.

Returns:

(Boolean) —
Specifies whether the page is an XML document.

# File 'lib/spidr/page/content_types.rb', line 126

def xml?
  is_content_type?('text/xml') || \
    is_content_type?('application/xml')
end

#xsl? ⇒ `Boolean`

Determines if the page is an XML Stylesheet (XSL).

Returns:

(Boolean) —
Specifies whether the page is an XML Stylesheet (XSL).

# File 'lib/spidr/page/content_types.rb', line 137

def xsl?
  is_content_type?('text/xsl')
end

#zip? ⇒ `Boolean`

Determines if the page is a ZIP archive.

Returns:

(Boolean) —
Specifies whether the page is a ZIP archive.




# File 'lib/spidr/page/content_types.rb', line 221

def zip?
  is_content_type?('application/zip')
end
