Fast Ruby Link Checker

A fast Ruby link checker with support for multiple HTTP libraries. It does not parse documents; it only checks the links you give it. Anecdotal benchmarking on an M1 Mac with a T1 Internet connection yields ~50 URLs per second with LinkChecker::Typhoeus::Hydra.

Usage

Dependencies

The LinkChecker::Typhoeus::Hydra link checker is recommended.

Add typhoeus and ruby-link-checker to your Gemfile and run bundle install.

gem 'typhoeus'
gem 'ruby-link-checker'

Basic Usage

require 'typhoeus'
require 'ruby-link-checker'

# create a new checker instance
checker = LinkChecker::Typhoeus::Hydra::Checker.new

# queue URLs to check
links = [...]
links.each do |url|
  checker.check url
end

# run the checks
checker.run

# display buckets of results
checker.results.each_pair do |bucket, results|
  puts "#{bucket}: #{results.size}"
end
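
The buckets can also be reduced to a single pass/fail signal, for example in a CI job. The following is a minimal sketch, assuming the results hash has the shape { error: [...], failure: [...], success: [...] }; summarize is a hypothetical helper and not part of ruby-link-checker, and the sample hash stands in for checker.results.

```ruby
# Hypothetical helper, not part of ruby-link-checker: summarizes a results
# hash shaped like { error: [...], failure: [...], success: [...] }.
def summarize(results)
  counts = results.transform_values(&:size)
  broken = counts.fetch(:error, 0) + counts.fetch(:failure, 0)
  { counts: counts, broken: broken, ok: broken.zero? }
end

# Stand-in data; in real use this would be checker.results after checker.run.
sample = {
  success: ['https://www.example.org'],
  failure: ['https://www.example.org/missing'],
  error: []
}

summary = summarize(sample)
puts summary[:counts].map { |bucket, size| "#{bucket}: #{size}" }.join(', ')
puts summary[:ok] ? 'all links OK' : "#{summary[:broken]} broken link(s)"
```

A script could then call exit(1) when summary[:ok] is false.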

Passing Options

You can pass custom options to check and read them back from the result in event handlers.

checker.check 'https://www.example.org', { location: 'page.html' }

checker.on :success do |result|
  result.options # contains { location: 'page.html' }
end
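
One use for such options is reporting broken links per source page. The sketch below assumes results respond to url and options as described in this README; FakeResult and group_by_location are hypothetical names used only for illustration.

```ruby
# Stand-in for a checker result; only the two properties used here.
FakeResult = Struct.new(:url, :options, keyword_init: true)

# Hypothetical helper: groups URLs by the :location option passed to check.
def group_by_location(results)
  results.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |result, grouped|
    grouped[result.options[:location]] << result.url
  end
end

failures = [
  FakeResult.new(url: 'https://www.example.org/a', options: { location: 'page.html' }),
  FakeResult.new(url: 'https://www.example.org/b', options: { location: 'page.html' }),
  FakeResult.new(url: 'https://www.example.org/c', options: { location: 'about.html' })
]

group_by_location(failures).each do |location, urls|
  puts "#{location}: #{urls.join(', ')}"
end
```

In real use the array would be filled from a checker.on :error, :failure handler.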

Checkers

LinkChecker::Typhoeus::Hydra

Fast link checker that uses Typhoeus.

require 'typhoeus'
require 'ruby-link-checker'

# create a new instance of a checker
checker = LinkChecker::Typhoeus::Hydra::Checker.new(
  hydra: {
    # lower than the Typhoeus default of 200, seems to start breaking around 50+
    max_concurrency: 25
  }
)

# log requests and response codes
checker.logger.level = Logger::INFO

links = [...] # array of URLs
links.each do |url|
  checker.check url
end

# examine failures and errors as they come
checker.on :error, :failure do |result|
  puts "FAIL: #{result}"
end

# execute Hydra#run, will block until all requests have completed
checker.run

# examine results
checker.results.each_pair do |bucket, results|
  puts "#{bucket}: #{results.size}"
end

You can pass Typhoeus timeout options into a new instance of a checker, or configure timeouts globally.

LinkChecker::Typhoeus::Hydra.configure do |config|
  config.timeout = 5
  config.connecttimeout = 10
end

LinkChecker::Net::HTTP

Slow, sequential checker.

require 'net/http'
require 'ruby-link-checker'

# create a new instance of a checker
checker = LinkChecker::Net::HTTP::Checker.new

# log requests and response codes
checker.logger.level = Logger::INFO

links = [...] # array of URLs
links.each do |url|
  checker.check url
end

# examine results
checker.results.each_pair do |bucket, results|
  puts "#{bucket}: #{results.size}"
end

You can pass Net::HTTP timeout options into a new instance of a checker, or configure timeouts globally.

LinkChecker::Net::HTTP.configure do |config|
  config.read_timeout = 5
  config.open_timeout = 10
end

Options

Retries

By default link checkers do not retry. You can set a number of times to retry all errors and failures with retries.

checker = LinkChecker::Net::HTTP::Checker.new(retries: 1)
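
To make the retry count concrete, here is a small simulation. It is a sketch, not the gem's implementation: check_with_retries is a hypothetical function, and it assumes that a retry count of 1 means one extra attempt after the first one.

```ruby
# Hypothetical sketch, not the gem's implementation: runs the block once, then
# up to `retries` more times while the outcome is not :success.
def check_with_retries(retries: 0)
  attempts = 0
  loop do
    attempts += 1
    outcome = yield(attempts)
    return [outcome, attempts] if outcome == :success || attempts > retries
  end
end

# Simulated link that fails on the first attempt and succeeds on the second.
flaky = ->(attempt) { attempt < 2 ? :failure : :success }

p check_with_retries(retries: 0, &flaky) # no retries: a single attempt
p check_with_retries(retries: 1, &flaky) # one retry: the second attempt succeeds
```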

Results

By default checkers collect results.

checker = LinkChecker::Net::HTTP::Checker.new
...
checker.run

checker.results # => { error: [...], failure: [...], success: [...] }

You can disable this with results: false.

checker = LinkChecker::Net::HTTP::Checker.new(results: false)
...
checker.run

checker.results # => nil

Methods

By default checkers try a HEAD request, followed by a GET if HEAD fails. You can change this behavior by specifying other methods.

The following example disables GET and makes only HEAD requests.

checker = LinkChecker::Net::HTTP::Checker.new(methods: %w[HEAD])
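
The fallback order can be pictured with a short simulation. This is a sketch under stated assumptions, not the gem's code: first_successful_method is a hypothetical function, and the lambda stands in for a server that rejects HEAD but answers GET.

```ruby
# Hypothetical sketch: tries each method in order and stops at the first one
# that the block reports as successful.
def first_successful_method(methods)
  tried = []
  methods.each do |method|
    tried << method
    return { method: method, tried: tried } if yield(method)
  end
  { method: nil, tried: tried }
end

# Simulated server that rejects HEAD but answers GET.
rejects_head = ->(method) { method == 'GET' }

p first_successful_method(%w[HEAD GET], &rejects_head)
p first_successful_method(%w[HEAD], &rejects_head)
```

With only HEAD allowed, a server like the simulated one never gets a successful request, which is the trade-off of skipping GET.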

Logger

Pass your own logger.

checker = LinkChecker::Net::HTTP::Checker.new(logger: Logger.new(STDOUT))

User-Agent

Pass your own User-Agent string. The default is Ruby Link Checker/x.y.z.

checker = LinkChecker::Net::HTTP::Checker.new(user_agent: 'Custom Agent/1.0')

Global Configuration

All options can also be configured globally.

LinkChecker.configure do |config|
  config.user_agent = 'Custom Agent/1.0'
  config.methods = ['HEAD', 'GET']
  config.logger = ::Logger.new(STDOUT)
end

Callbacks and Events

Events enable processing of results as they become available.

checker.on :result do |result|
  puts result # any result
end

checker.on :error, :failure do |result|
  puts result # error or failure
end
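
For intuition about how one block can serve several events, here is a toy emitter. It is a hypothetical stand-in, not the gem's code: TinyEmitter and emit_result are made-up names, and firing the specific event before :result is an assumption.

```ruby
# Hypothetical stand-in for a checker's event handling, not the gem's code.
class TinyEmitter
  def initialize
    @handlers = Hash.new { |hash, event| hash[event] = [] }
  end

  # Registers one block for one or more events, e.g. on(:error, :failure).
  def on(*events, &block)
    events.each { |event| @handlers[event] << block }
    self
  end

  # Fires the specific result event, then the catch-all :result event.
  def emit_result(event, payload)
    [event, :result].each do |name|
      @handlers[name].each { |handler| handler.call(payload) }
    end
  end
end

emitter = TinyEmitter.new
seen = []
emitter.on(:error, :failure) { |result| seen << "broken: #{result}" }
emitter.on(:result) { |result| seen << "any: #{result}" }

emitter.emit_result(:success, 'https://www.example.org')
emitter.emit_result(:failure, 'https://www.example.org/missing')
puts seen
```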

Checkers support the following events.

Event      Description
:retry     A request is being retried on failure or error.
:result    A new result, any of success, failure, or error.
:success   A valid URL, usually a 2xx response from the server.
:failure   A failed URL, usually a 4xx or a 5xx response from the server.
:error     An error, such as an invalid URL or a network timeout.
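
The mapping from outcome to result event can be sketched as follows. This is an illustration of the descriptions above, not the gem's logic: bucket_for is a hypothetical function, and status codes outside the 2xx, 4xx, and 5xx ranges are deliberately left unclassified because their handling is not described here.

```ruby
require 'timeout'

# Hypothetical classifier, not the gem's logic. Status codes outside the
# ranges named above (for example 3xx) are left unclassified (nil).
def bucket_for(code: nil, error: nil)
  return :error if error

  case code
  when 200..299 then :success
  when 400..599 then :failure
  end
end

p bucket_for(code: 200)
p bucket_for(code: 404)
p bucket_for(code: 503)
p bucket_for(error: Timeout::Error.new('execution expired'))
```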

Events are called with results, which contain the following properties.

Property          Description
:url              The original URL before redirects.
:result_url       The last URL, different from url in case of redirects.
:method           The HTTP method that produced the result.
:code             The HTTP response status code.
:request_headers  Request headers.
:redirect_to      A redirect URL in case of redirects.
:error            A raised error in case of errors.

See result.rb for more details.

Contributing

You're encouraged to contribute to ruby-link-checker. See CONTRIBUTING for details.

Copyright (c) Daniel Doubrovkine and Contributors.

This project is licensed under the MIT License.