File: ChangeLog — Documentation for spidr (0.7.0)

0.7.0 / 2022-12-31

Added Spidr.domain and Spidr::Agent.domain.
Added Spidr::Page#gif?.
Added Spidr::Page#jpeg?.
Added Spidr::Page#icon? and Spidr::Page#ico?.
Added Spidr::Page#png?.
Spidr::Settings::Proxy#proxy= and Spidr::Agent#proxy= can now accept a String or a URI::HTTP object.
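
As a rough illustration of what accepting either type can look like, the sketch below coerces a proxy value into a URI::HTTP using only Ruby's standard library. The `normalize_proxy` helper is hypothetical and is not part of Spidr; it only shows the general pattern.

```ruby
require 'uri'

# Hypothetical helper (not part of Spidr): coerce a proxy setting
# given as a String or a URI::HTTP object into a URI object.
def normalize_proxy(value)
  case value
  when URI::HTTP then value            # already parsed (URI::HTTPS is a subclass)
  when String    then URI.parse(value) # parse "http://host:port" style Strings
  when nil       then nil              # no proxy configured
  else
    raise ArgumentError, "expected a String, URI::HTTP, or nil, got #{value.class}"
  end
end

proxy = normalize_proxy('http://proxy.example.com:8080')
puts proxy.host  # proxy.example.com
puts proxy.port  # 8080
```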

0.6.1 / 2019-10-24

Check for the opaque component of URIs before attempting to set the path component (@kyaroch). This fixes URI::InvalidURIError: path conflicts with opaque exceptions.
Fix @robots instance variable warning (@spk).
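
The opaque check mentioned in this release can be reproduced with Ruby's standard URI library alone. The sketch below is an assumption about the general shape of such a guard, not Spidr's actual code, and `ensure_path` is a hypothetical name.

```ruby
require 'uri'

# Hypothetical guard: assign a default path only to hierarchical URIs.
# Opaque URIs (mailto:, tel:, ...) have no path component to set.
def ensure_path(uri)
  uri.path = '/' if uri.opaque.nil? && uri.path.empty?
  uri
end

puts ensure_path(URI('http://example.com')).to_s       # http://example.com/
puts ensure_path(URI('mailto:user@example.com')).to_s  # mailto:user@example.com

# Without the guard, setting the path on an opaque URI raises
# URI::InvalidURIError.
begin
  URI('mailto:user@example.com').path = '/'
rescue URI::InvalidURIError => error
  puts error.message
end
```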

0.6.0 / 2016-08-04

Added Spidr::Proxy.
Added more options to Spidr::Agent#initialize:
- :default_headers: specifies the default headers to set in all requests (@maccman).
- :limit: specifies the maximum number of links to visit.
- :open_timeout, :read_timeout, :ssl_timeout, :continue_timeout, and :keep_alive_timeout: set the corresponding Net::HTTP timeouts.
Allow Spidr.proxy= to accept nil.
Use Net::HTTPResponse#get_fields in Spidr::Page to correctly return multiple values for repeated headers.
Fixed a bug in Spidr::Page#method_missing where method names were not being correctly converted to header names.
Fixed a bug in Spidr::Page#cookie_params where Set-Cookie flags were not being filtered out.
Rewrote the specs to use webmock and increased spec coverage.
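
To show why get_fields matters for repeated headers, and what filtering Set-Cookie flags can mean, here is a small standard-library sketch. The response is built by hand for illustration, and the `cookie_params` helper with its flag list is an assumption, not Spidr's implementation.

```ruby
require 'net/http'

# Build a response by hand purely for illustration; a crawler would
# receive this object from Net::HTTP instead.
response = Net::HTTPOK.new('1.1', '200', 'OK')
response.add_field('Set-Cookie', 'session=abc123; Path=/; HttpOnly')
response.add_field('Set-Cookie', 'theme=dark; Secure')

# [] joins repeated headers into one String; get_fields keeps them apart.
joined = response['Set-Cookie']
fields = response.get_fields('Set-Cookie')

# Hypothetical helper: keep only name=value pairs that are not
# Set-Cookie attributes. The flag list below is an assumption.
def cookie_params(fields)
  flags = %w[path expires domain secure httponly max-age samesite]

  fields.each_with_object({}) do |field, params|
    field.split(';').each do |param|
      name, value = param.strip.split('=', 2)
      next if name.nil? || flags.include?(name.downcase)

      params[name] = value.to_s
    end
  end
end

puts joined  # session=abc123; Path=/; HttpOnly, theme=dark; Secure
p fields     # ["session=abc123; Path=/; HttpOnly", "theme=dark; Secure"]
puts cookie_params(fields).keys.join(', ')  # session, theme
```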

0.5.0 / 2016-01-03

Added support for respecting robots.txt files.

    Spidr.site('http://reddit.com/', robots: true)
Added Spidr.robots= and Spidr.robots?.
Added Spidr::Page#each_mailto and Spidr::Page#mailtos.
Fixed a bug in Spidr::Agent.host that limited spidering to only http:// URIs.
Rescue Zlib::Error to catch Zlib::DataError and Zlib::BufError exceptions caused by web servers that use incompatible gzip compression.
Fixed a bug in URI.expand_path where /../foo was being expanded to foo instead of /foo.
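
The expected behavior for /../foo can be sketched with a small segment stack. This is an illustrative stand-in written for this note, not Spidr's URI.expand_path, and it ignores trailing slashes for brevity.

```ruby
# Illustrative stand-in (not Spidr's URI.expand_path): resolve "." and
# ".." segments while keeping absolute paths absolute.
def expand_path(path)
  absolute = path.start_with?('/')
  segments = []

  path.split('/').each do |segment|
    case segment
    when '', '.' then next          # skip empty and current-directory segments
    when '..'    then segments.pop  # go up one level; a no-op at the root
    else              segments << segment
    end
  end

  expanded = segments.join('/')
  absolute ? "/#{expanded}" : expanded
end

puts expand_path('/../foo')        # /foo
puts expand_path('/a/b/../c/./d')  # /a/c/d
puts expand_path('a/../b')         # b
```

Tracking whether the input was absolute, rather than inferring it from the remaining segments, is what keeps the leading / in the first case.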

0.4.1 / 2011-12-08

Catch OpenSSL::SSL::SSLError exceptions when initiating HTTPS sessions.

0.4.0 / 2011-08-07

Added Spidr::Headers#content_charset.
Pass the Page url and content_charset to Nokogiri in Spidr::Body#doc. This ensures that Nokogiri will preserve the body encoding.
Made Spidr::Headers#is_content_type? public.
Allow Spidr::Headers#is_content_type? to match the full Content-Type or the sub-type.
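
A minimal sketch of matching either the full Content-Type or only its sub-type, assuming header parameters such as a charset should be ignored. The `content_type_matches?` helper is hypothetical and only mirrors the behavior described above.

```ruby
# Hypothetical helper (not Spidr's implementation): does a raw
# Content-Type header value match the given type?
def content_type_matches?(header_value, type)
  # Drop parameters such as "; charset=utf-8" and normalize case.
  media_type = header_value.split(';', 2).first.to_s.strip.downcase
  type = type.downcase

  if type.include?('/')
    media_type == type                     # full type, e.g. "text/html"
  else
    media_type.split('/', 2).last == type  # sub-type only, e.g. "html"
  end
end

puts content_type_matches?('text/html; charset=utf-8', 'text/html')  # true
puts content_type_matches?('text/html; charset=utf-8', 'html')       # true
puts content_type_matches?('application/json', 'html')               # false
```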

0.3.2 / 2011-06-20

Added separate initialize methods for Spidr::Actions, Spidr::Events, Spidr::Filters and Spidr::Sanitizers.
Aliased Spidr::Events#urls_like to Spidr::Events#every_url_like.
Reduce usage of self.included and module_eval.
Reduce usage of nested-blocks.
Reduce usage of return.

0.3.1 / 2011-04-22

Require set in spidr/headers.rb.

0.3.0 / 2011-04-14

Switched from Jeweler to Ore.
Split all header related methods out of Spidr::Page and into Spidr::Headers.
Split all body related methods out of Spidr::Page and into Spidr::Body.
Split all link related methods out of Spidr::Page and into Spidr::Links.
Added Spidr::Headers#directory?.
Added Spidr::Headers#json?.
Added Spidr::Links#each_url.
Added Spidr::Links#each_link.
Added Spidr::Links#each_redirect.
Added Spidr::Links#each_meta_redirect.
Aliased Spidr::Headers#raw_cookie to Spidr::Headers#cookie.
Aliased Spidr::Body#to_s to Spidr::Body#body.
Also check for application/xml in Spidr::Headers#xml?.
Catch all exceptions when merging URIs in Spidr::Links#to_absolute.
Always prepend a / to all FTP URI paths. Fixes a Ruby 1.8 specific bug, where it expects an absolute path for all FTP URIs.
Refactored URI.expand_path.
Start the session in Spidr::SessionCache#[] to prevent multiple CONNECT commands being sent to HTTP Proxies (thanks falaise).

0.2.7 / 2010-08-17

Added Spidr::CookieJar#cookies_for_host (thanks zapnap).
Renamed Spidr::Page#cookie to Spidr::Page#raw_cookie.
Rescue URI::InvalidComponentError exceptions in Spidr::Page#to_absolute (thanks zapnap).

0.2.6 / 2010-07-05

Fixed a bug in Spidr::Page#meta_redirect, by calling Nokogiri::XML::Element#get_attribute instead of attr.

0.2.5 / 2010-07-02

Added Spidr::Page#meta_redirect.
Added Spidr::Page#meta_redirect?.
Manage development dependencies with Bundler.
Support following "old-school" meta-refresh redirects (thanks zapnap).
Allow Spidr::CookieJar to inherit cookies set by a parent domain.
Fixed a constant lookup issue in Spidr::Agent.
Use yield instead of block.call when necessary.

0.2.4 / 2010-05-05

Added Spidr::Filters#visit_urls.
Added Spidr::Filters#visit_urls_like.
Added Spidr::Filters#ignore_urls.
Added Spidr::Filters#ignore_urls_like.
Added Spidr::Page#is_content_type?.
Default Spidr::Page#body to an empty String.
Default Spidr::Page#content_type to an empty String.
Default Spidr::Page#content_types to an empty Array.
Improved reliability of Spidr::Page#is_redirect?.
Improved content type detection in Spidr::Page to handle Content-Type headers containing charsets (thanks Josh Lindsey).

0.2.3 / 2010-02-27

Migrated to Jeweler for packaging and releasing RubyGems.
Switched to Markdown formatted YARD documentation.
Added Spidr::Events#every_link.
Added Spidr::SessionCache#active?.
Added specs for Spidr::SessionCache.

0.2.2 / 2010-01-06

Require Web Spider Obstacle Course (WSOC) >= 0.1.1.
Integrated the new WSOC into the specs.
Removed the built-in Web Spider Obstacle Course.
Added Spidr::Page#content_types.
Added Spidr::Page#cookie.
Added Spidr::Page#cookies.
Added Spidr::Page#cookie_params.
Added Spidr::Sanitizers.
Added Spidr::SessionCache.
Added Spidr::CookieJar (thanks Nick Plante).
Added Spidr::AuthStore (thanks Nick Plante).
Added Spidr::Agent#post_page (thanks Nick Plante).
Renamed Spidr::Agent#get_session to Spidr::SessionCache#[].
Renamed Spidr::Agent#kill_session to Spidr::SessionCache#kill!.

0.2.1 / 2009-11-25

Added Spidr::Events#every_ok_page.
Added Spidr::Events#every_redirect_page.
Added Spidr::Events#every_timedout_page.
Added Spidr::Events#every_bad_request_page.
Added Spidr::Events#every_unauthorized_page.
Added Spidr::Events#every_forbidden_page.
Added Spidr::Events#every_missing_page.
Added Spidr::Events#every_internal_server_error_page.
Added Spidr::Events#every_txt_page.
Added Spidr::Events#every_html_page.
Added Spidr::Events#every_xml_page.
Added Spidr::Events#every_xsl_page.
Added Spidr::Events#every_doc.
Added Spidr::Events#every_html_doc.
Added Spidr::Events#every_xml_doc.
Added Spidr::Events#every_xsl_doc.
Added Spidr::Events#every_rss_doc.
Added Spidr::Events#every_atom_doc.
Added Spidr::Events#every_javascript_page.
Added Spidr::Events#every_css_page.
Added Spidr::Events#every_rss_page.
Added Spidr::Events#every_atom_page.
Added Spidr::Events#every_ms_word_page.
Added Spidr::Events#every_pdf_page.
Added Spidr::Events#every_zip_page.
Fixed a bug where Spidr::Agent#delay was not being used to delay requesting pages.
Spider link and script tags in HTML pages (thanks Nick Plante).

0.2.0 / 2009-10-10

Added URI.expand_path.
Added Spidr::Page#search.
Added Spidr::Page#at.
Added Spidr::Page#title.
Added Spidr::Agent#failures=.
Added an HTTP session cache to Spidr::Agent, per the suggestion of falter.
- Added Spidr::Agent#get_session.
- Added Spidr::Agent#kill_session.
Added Spidr.proxy=.
Added Spidr.disable_proxy!.
Aliased Spidr::Page#txt? to Spidr::Page#plain_text?.
Aliased Spidr::Page#ok? to Spidr::Page#is_ok?.
Aliased Spidr::Page#redirect? to Spidr::Page#is_redirect?.
Aliased Spidr::Page#unauthorized? to Spidr::Page#is_unauthorized?.
Aliased Spidr::Page#forbidden? to Spidr::Page#is_forbidden?.
Aliased Spidr::Page#missing? to Spidr::Page#is_missing?.
Split URL filtering code out of Spidr::Agent and into Spidr::Filters.
Split URL / Page event code out of Spidr::Agent and into Spidr::Events.
Split pause! / continue! / skip_link! / skip_page! methods out of Spidr::Agent and into Spidr::Actions.
Fixed a bug in Spidr::Page#code, where it was not returning an Integer.
Make sure Spidr::Page#doc returns Nokogiri::XML::Document objects for RSS/RDF/Atom pages as well.
Fixed the handling of the Location header in Spidr::Page#links (thanks falter).
Fixed a bug in Spidr::Page#to_absolute where trailing / characters on URI paths were not being preserved (thanks falter).
Fixed a bug where the URI query was not being sent with the request in Spidr::Agent#get_page (thanks Damian Steer).
Fixed a bug where SSL sessions were not being properly set up (thanks falter).
Switched Spidr::Agent#history to be a Set, to improve search-time of the history (thanks falter).
Switched Spidr::Agent#failures to a Set.
Allow a block to be passed to Spidr::Agent#run, which will receive all pages visited.
Allow Spidr::Agent#start_at and Spidr::Agent#continue! to pass blocks to Spidr::Agent#run.
Made Spidr::Agent#visit_page public.
Moved to YARD based documentation.
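
The move of the history and failures lists to a Set in this release can be illustrated with the standard library. A Set looks entries up by hash, so membership checks do not slow down as the history grows, and duplicate URLs collapse into one entry. The snippet is a sketch with made-up URLs, not code from Spidr.

```ruby
require 'set'
require 'uri'

history = Set.new

# Adding the same URL twice leaves a single entry.
history << URI('http://example.com/')
history << URI('http://example.com/about')
history << URI('http://example.com/')

puts history.size                                         # 2
puts history.include?(URI('http://example.com/about'))    # true
puts history.include?(URI('http://example.com/missing'))  # false
```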

0.1.9 / 2009-06-13

Upgraded to Hoe 2.0.0.
- Use Hoe.spec instead of Hoe.new.
- Use the Hoe signing task for signed gems.
Added the Spidr::Agent#schemes and Spidr::Agent#schemes= methods.
Added a warning message if 'net/https' cannot be loaded.
Allow the list of acceptable URL schemes to be passed into Spidr::Agent#initialize.
Allow history and queue information to be passed into Spidr::Agent#initialize.
Spidr::Agent#start_at no longer clears the history or the queue.
Fixed a bug in the sanitization of semi-escaped URLs.
Fixed a bug where https URLs would be followed even if 'net/https' could not be loaded.
Removed Spidr::Agent::SCHEMES.

0.1.8 / 2009-05-27

Added the Spidr::Agent#pause! and Spidr::Agent#continue! methods.
Added the Spidr::Agent#running? and Spidr::Agent#paused? methods.
Added an alias for pending_urls to the queue methods.
Added Spidr::Agent#queue to provide read access to the queue.
Added Spidr::Agent#queue= and Spidr::Agent#history= for setting the queue and history.
Added Spidr::Agent#to_hash, which returns a Hash of the agent's queue and history.
Made Spidr::Agent#enqueue and Spidr::Agent#queued? public.
Added more specs.

0.1.7 / 2009-04-24

Added Spidr::Agent#all_headers.
Fixed a bug where Spidr::Page#headers was always nil.
Spidr::Agent will now follow the Location header in HTTP 300, 301, 302, 303 and 307 Redirects.
Spidr::Agent will now follow iframe and frame tags.

0.1.6 / 2009-04-14

Added Spidr::Agent#failures, a list of URLs which could not be visited.
Added Spidr::Agent#failed?.
Added Spidr::Agent#every_failed_url.
Added Spidr::Agent#clear, which clears the history and failures URL lists.
Improved fault tolerance in Spidr::Agent#get_page.
- If a Network or HTTP error is encountered, the URL will be added to the failures list and the next URL will be visited.
Fixed a typo in Spidr::Agent#ignore_exts_like.
Updated the Web Spider Obstacle Course with links that always fail to be visited.

0.1.5 / 2009-03-22

Catch malformed URIs in Spidr::Page#to_absolute and return nil.
Filter out nil URIs in Spidr::Page#urls.

0.1.4 / 2009-01-15

Use Nokogiri for HTML and XML parsing.

0.1.3 / 2009-01-10

Added the :host option to Spidr::Agent#initialize.
Added the Web Spider Obstacle Course files to the Manifest.
Aliased Spidr::Agent#visited_urls to Spidr::Agent#history.

0.1.2 / 2008-11-06

Fixed a bug in Spidr::Page#to_absolute where URLs with no path were not receiving a default path of /.
Fixed a bug in Spidr::Page#to_absolute where URL paths were not being expanded, in order to remove .. and . directories.
Fixed a bug where absolute URLs could have a blank path, thus causing Spidr::Agent#get_page to crash when it performed the HTTP request.
Added RSpec spec tests.
Created a Web-Spider Obstacle Course (http://spidr.rubyforge.org/course/start.html) which is used in the spec tests.

0.1.1 / 2008-10-04

Added a reader method for the response instance variable in Page.
Fixed a bug in Spidr::Page#method_missing.

0.1.0 / 2008-05-23

Initial release.
- Black-list or white-list URLs based upon:
  - Host name
  - Port number
  - Full link
  - URL extension
- Provides call-backs for:
  - Every visited Page.
  - Every visited URL.
  - Every visited URL that matches a specified pattern.