0.3.2 / 2011-06-20

0.3.1 / 2011-04-22

  • Require 'set' in spidr/headers.rb.

0.3.0 / 2011-04-14

0.2.7 / 2010-08-17

  • Added Spidr::CookieJar#cookies_for_host (thanks zapnap).
  • Renamed Spidr::Page#cookie to Spidr::Page#raw_cookie.
  • Rescue URI::InvalidComponentError exceptions in Spidr::Page#to_absolute (thanks zapnap).

0.2.6 / 2010-07-05

  • Fixed a bug in Spidr::Page#meta_redirect by calling Nokogiri::XML::Element#get_attribute instead of attr.

0.2.5 / 2010-07-02

  • Added Spidr::Page#meta_redirect.
  • Added Spidr::Page#meta_redirect?.
  • Manage development dependencies with Bundler.
  • Support following "old-school" meta-refresh redirects (thanks zapnap).
  • Allow Spidr::CookieJar to inherit cookies set by a parent domain.
  • Fixed a constant lookup issue in Spidr::Agent.
  • Use yield instead of block.call when necessary.
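
The parent-domain cookie inheritance noted above can be sketched roughly as follows. This is a minimal illustration, not Spidr's actual code: the SketchCookieJar class and its store and cookies_for methods are hypothetical names, and cookies are assumed to be plain name/value Hashes keyed by host.

```ruby
# Hypothetical sketch; not Spidr's actual implementation.
class SketchCookieJar
  def initialize
    @cookies = Hash.new { |hash, host| hash[host] = {} }
  end

  # Store cookies (a Hash of name => value) for a host.
  def store(host, cookies)
    @cookies[host].merge!(cookies)
  end

  # Collect cookies for the host and every parent domain,
  # letting the more specific host override its parents.
  def cookies_for(host)
    labels = host.split('.')
    result = {}

    # Walk from the broadest parent domain down to the full host name.
    (labels.length - 1).downto(0) do |index|
      domain = labels[index..-1].join('.')
      result.merge!(@cookies[domain]) if @cookies.key?(domain)
    end

    result
  end
end

jar = SketchCookieJar.new
jar.store('example.com', { 'session' => 'abc', 'lang' => 'en' })
jar.store('www.example.com', { 'lang' => 'fr' })

inherited = jar.cookies_for('www.example.com')
# inherited holds 'session' from example.com and 'lang' from www.example.com
```

Merging from the broadest domain inward means a cookie set on the exact host wins over one of the same name set by a parent.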

0.2.4 / 2010-05-05

0.2.3 / 2010-02-27

0.2.2 / 2010-01-06

0.2.1 / 2009-11-25

0.2.0 / 2009-10-10

  • Added URI.expand_path.
  • Added Spidr::Page#search.
  • Added Spidr::Page#at.
  • Added Spidr::Page#title.
  • Added Spidr::Agent#failures=.
  • Added an HTTP session cache to Spidr::Agent, at the suggestion of falter.
    • Added Spidr::Agent#get_session.
    • Added Spidr::Agent#kill_session.
  • Added Spidr.proxy=.
  • Added Spidr.disable_proxy!.
  • Aliased Spidr::Page#txt? to Spidr::Page#plain_text?.
  • Aliased Spidr::Page#ok? to Spidr::Page#is_ok?.
  • Aliased Spidr::Page#redirect? to Spidr::Page#is_redirect?.
  • Aliased Spidr::Page#unauthorized? to Spidr::Page#is_unauthorized?.
  • Aliased Spidr::Page#forbidden? to Spidr::Page#is_forbidden?.
  • Aliased Spidr::Page#missing? to Spidr::Page#is_missing?.
  • Split URL filtering code out of Spidr::Agent and into Spidr::Filters.
  • Split URL / Page event code out of Spidr::Agent and into Spidr::Events.
  • Split pause! / continue! / skip_link! / skip_page! methods out of Spidr::Agent and into Spidr::Actions.
  • Fixed a bug in Spidr::Page#code, where it was not returning an Integer.
  • Make sure Spidr::Page#doc returns Nokogiri::XML::Document objects for RSS/RDF/Atom pages as well.
  • Fixed the handling of the Location header in Spidr::Page#links (thanks falter).
  • Fixed a bug in Spidr::Page#to_absolute where trailing / characters on URI paths were not being preserved (thanks falter).
  • Fixed a bug where the URI query was not being sent with the request in Spidr::Agent#get_page (thanks Damian Steer).
  • Fixed a bug where SSL sessions were not being properly set up (thanks falter).
  • Switched Spidr::Agent#history to be a Set, to improve search-time of the history (thanks falter).
  • Switched Spidr::Agent#failures to a Set.
  • Allow a block to be passed to Spidr::Agent#run, which will receive all pages visited.
  • Allow Spidr::Agent#start_at and Spidr::Agent#continue! to pass blocks to Spidr::Agent#run.
  • Made Spidr::Agent#visit_page public.
  • Moved to YARD based documentation.
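
To illustrate why a Set suits the history and failures lists, here is a small sketch using only Ruby's standard library. The history variable is a hypothetical stand-in for the agent's visited-URL list, not Spidr's own object.

```ruby
require 'set'
require 'uri'

# Hypothetical stand-in for an agent's visited-URL history.
history = Set.new

history << URI('http://example.com/')
history << URI('http://example.com/about')
history << URI('http://example.com/') # duplicate, ignored by the Set

# Membership is a hash lookup rather than a linear Array scan.
visited = history.include?(URI('http://example.com/about'))
```

A Set also drops duplicate URLs on insertion, so the history never needs a separate uniqueness check.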

0.1.9 / 2009-06-13

  • Upgraded to Hoe 2.0.0.
    • Use Hoe.spec instead of Hoe.new.
    • Use the Hoe signing task for signed gems.
  • Added the Spidr::Agent#schemes and Spidr::Agent#schemes= methods.
  • Added a warning message if 'net/https' cannot be loaded.
  • Allow the list of acceptable URL schemes to be passed into Spidr::Agent#initialize.
  • Allow history and queue information to be passed into Spidr::Agent#initialize.
  • Spidr::Agent#start_at no longer clears the history or the queue.
  • Fixed a bug in the sanitization of semi-escaped URLs.
  • Fixed a bug where https URLs would be followed even if 'net/https' could not be loaded.
  • Removed Spidr::Agent::SCHEMES.

0.1.8 / 2009-05-27

0.1.7 / 2009-04-24

  • Added Spidr::Agent#all_headers.
  • Fixed a bug where Spidr::Page#headers was always nil.
  • Spidr::Agent will now follow the Location header in HTTP 300, 301, 302, 303 and 307 Redirects.
  • Spidr::Agent will now follow iframe and frame tags.

0.1.6 / 2009-04-14

  • Added Spidr::Agent#failures, a list of URLs which could not be visited.
  • Added Spidr::Agent#failed?.
  • Added Spidr::Agent#every_failed_url.
  • Added Spidr::Agent#clear, which clears the history and failures URL lists.
  • Improved fault tolerance in Spidr::Agent#get_page.
    • If a network or HTTP error is encountered, the URL will be added to the failures list and the next URL will be visited.
  • Fixed a typo in Spidr::Agent#ignore_exts_like.
  • Updated the Web Spider Obstacle Course with links that always fail to be visited.
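
The failure handling described above might look something like the sketch below. The fetch lambda, the queue and the failures set are all assumptions made for illustration; the real Spidr::Agent#get_page performs actual HTTP requests.

```ruby
require 'set'

# Hypothetical fetcher: raises for URLs containing "fail", as a stand-in
# for a network or HTTP error.
fetch = lambda do |url|
  raise IOError, "could not fetch #{url}" if url.include?('fail')

  "body of #{url}"
end

queue    = ['http://example.com/', 'http://example.com/fail', 'http://example.com/next']
visited  = []
failures = Set.new

until queue.empty?
  url = queue.shift

  begin
    fetch.call(url)
    visited << url
  rescue StandardError
    # Record the failure and move on to the next queued URL.
    failures << url
  end
end
```

Rescuing inside the loop keeps one bad URL from stopping the whole crawl.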

0.1.5 / 2009-03-22

  • Catch malformed URIs in Spidr::Page#to_absolute and return nil.
  • Filter out nil URIs in Spidr::Page#urls.

0.1.4 / 2009-01-15

  • Use Nokogiri for HTML and XML parsing.

0.1.3 / 2009-01-10

0.1.2 / 2008-11-06

  • Fixed a bug in Spidr::Page#to_absolute where URLs with no path were not receiving a default path of /.
  • Fixed a bug in Spidr::Page#to_absolute where URL paths were not being expanded to remove .. and . directory segments.
  • Fixed a bug where absolute URLs could have a blank path, thus causing Spidr::Agent#get_page to crash when it performed the HTTP request.
  • Added RSpec spec tests.
  • Created a Web-Spider Obstacle Course (http://spidr.rubyforge.org/course/start.html) which is used in the spec tests.
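
As a rough sketch of the path handling these fixes describe (a default path of /, and removal of . and .. segments), consider the helper below. The expand_path function here is a hypothetical, free-standing illustration and is not Spidr's implementation.

```ruby
# Hypothetical helper; a sketch of the idea, not Spidr's implementation.
def expand_path(path)
  stack = []

  path.split('/').each do |segment|
    case segment
    when '', '.' then next          # skip empty and current-directory segments
    when '..'    then stack.pop     # step up one directory (no-op at the root)
    else              stack << segment
    end
  end

  expanded = '/' + stack.join('/')
  # Preserve a trailing slash when the original path named a directory.
  expanded << '/' if path.end_with?('/', '/.', '/..') && !stack.empty?
  expanded
end

expand_path('')                # => "/"
expand_path('/a/b/../c/')      # => "/a/c/"
expand_path('/a/./b/../../c')  # => "/c"
```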

0.1.1 / 2008-10-04

0.1.0 / 2008-05-23

  • Initial release.
    • Black-list or white-list URLs based upon:
      • Host name
      • Port number
      • Full link
      • URL extension
    • Provides call-backs for:
      • Every visited Page.
      • Every visited URL.
      • Every visited URL that matches a specified pattern.
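
A white-list and black-list check of the kind listed above could be sketched as follows. The rule sets and the accept lambda are hypothetical and use only Ruby's standard library; they do not reflect Spidr's filtering API.

```ruby
require 'uri'

# Hypothetical rule sets; the names are illustrative only.
allowed_hosts = ['example.com', /\.example\.com\z/]
ignored_exts  = ['.jpg', '.png']

# A URL is accepted when its host matches a white-list rule and
# its path extension is not on the black-list.
accept = lambda do |url|
  uri = URI(url)
  host_ok = allowed_hosts.any? { |rule| rule === uri.host }
  ext_ok  = !ignored_exts.include?(File.extname(uri.path))
  host_ok && ext_ok
end
```

Using === lets a single rule list mix exact host Strings with Regexp patterns.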