Fast Ruby Link Checker
A fast Ruby link checker with support for multiple HTTP libraries. It does not parse documents; it only checks links. Anecdotal benchmarking on an M1 Mac with a T1 Internet connection yields ~50 URLs per second with LinkChecker::Typhoeus::Hydra.
Usage
Dependencies
The LinkChecker::Typhoeus::Hydra link checker is recommended. Add typhoeus and ruby-link-checker to your Gemfile and run bundle install.
gem 'typhoeus'
gem 'ruby-link-checker'
Basic Usage
require 'typhoeus'
require 'ruby-link-checker'
# create a new checker instance
checker = LinkChecker::Typhoeus::Hydra::Checker.new
# queue URLs to check
links = [...]
links.each do |url|
  checker.check url
end
# run the checks
checker.run
# display buckets of results
checker.results.each_pair do |bucket, results|
  puts "#{bucket}: #{results.size}"
end
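Because the checker does not parse documents, the links array has to be built by the caller. The following is a minimal sketch of one way to do that with a regular expression; extract_links and URL_PATTERN are hypothetical names that are not part of ruby-link-checker, and the pattern is deliberately simple rather than a full URL grammar.

```ruby
# Hypothetical helper, not part of ruby-link-checker.
# Matches absolute http(s) URLs up to whitespace, quotes, angle brackets,
# or a closing parenthesis or square bracket.
URL_PATTERN = %r{https?://[^\s"'<>)\]]+}

# Returns the unique URLs found in the text, with trailing punctuation removed.
def extract_links(text)
  text.scan(URL_PATTERN).map { |url| url.sub(/[.,;:!?]+\z/, '') }.uniq
end

sample = <<~TEXT
  See https://www.example.org/docs, then [this page](https://example.com/a?b=1).
  Repeated: https://www.example.org/docs
TEXT

links = extract_links(sample)
puts links
```

The resulting array can then be queued with checker.check in the same way as above.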
Passing Options
You can pipe custom options through check and retrieve them in events as follows.
checker.check 'https://www.example.org', { location: 'page.html' }
checker.on :success do |result|
  result.options # contains { location: 'page.html' }
end
Checkers
LinkChecker::Typhoeus::Hydra
Fast link checker that uses Typhoeus.
require 'typhoeus'
require 'ruby-link-checker'
# create a new instance of a checker
checker = LinkChecker::Typhoeus::Hydra::Checker.new(
  hydra: {
    # lower than the Typhoeus default of 200, seems to start breaking around 50+
    max_concurrency: 25
  }
)
# log requests and response codes
checker.logger.level = Logger::INFO
links = [...] # array of URLs
links.each do |url|
  checker.check url
end
# examine failures and errors as they come
checker.on :error, :failure do |result|
  puts "FAIL: #{result}"
end
# execute Hydra#run, will block until all requests have completed
checker.run
# examine results
checker.results.each_pair do |bucket, results|
  puts "#{bucket}: #{results.size}"
end
You can pass Typhoeus timeout options into a new instance of a checker, or configure timeouts globally.
LinkChecker::Typhoeus::Hydra.configure do |config|
  config.timeout = 5
  config.connecttimeout = 10
end
LinkChecker::Net::HTTP
Slow, sequential checker.
require 'net/http'
require 'ruby-link-checker'
# create a new instance of a checker
checker = LinkChecker::Net::HTTP::Checker.new
# log requests and response codes
checker.logger.level = Logger::INFO
links = [...] # array of URLs
links.each do |url|
  checker.check url
end
# examine results
checker.results.each_pair do |bucket, results|
  puts "#{bucket}: #{results.size}"
end
You can pass Net::HTTP timeout options into a new instance of a checker, or configure timeouts globally.
LinkChecker::Net::HTTP.configure do |config|
  config.read_timeout = 5
  config.open_timeout = 10
end
Options
Retries
By default, link checkers do not retry. You can set the number of times to retry all errors and failures with retries.
checker = LinkChecker::Net::HTTP::Checker.new(retries: 1)
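To make the retry count concrete, here is a minimal sketch of a retry loop under the assumption that a count of N allows up to N additional attempts after the first one. check_with_retries is a hypothetical helper, not the library's implementation, and the block stands in for a single link check.

```ruby
# Hypothetical helper, not part of ruby-link-checker.
# Calls the block up to retries + 1 times and stops early on :success.
# Returns the last outcome together with the number of attempts made.
def check_with_retries(retries)
  attempts = 0
  outcome = nil
  loop do
    attempts += 1
    outcome = yield(attempts)
    break if outcome == :success || attempts > retries
  end
  [outcome, attempts]
end

# No retries: a failing check is attempted exactly once.
no_retry = check_with_retries(0) { |_attempt| :failure }

# One retry: the check fails first, then succeeds on the second attempt.
one_retry = check_with_retries(1) { |attempt| attempt == 1 ? :failure : :success }

puts no_retry.inspect
puts one_retry.inspect
```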
Results
By default checkers collect results.
checker = LinkChecker::Net::HTTP::Checker.new
...
checker.run
checker.results # => { error: [...], failure: [...], success: [...] }
You can disable this with results: false.
checker = LinkChecker::Net::HTTP::Checker.new(results: false)
...
checker.run
checker.results # => nil
Methods
By default, checkers try a HEAD request, followed by a GET if HEAD fails. You can change this behavior by specifying other methods.
The following example disables GET and only makes HEAD requests.
checker = LinkChecker::Net::HTTP::Checker.new(methods: %w[HEAD])
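The fallback order can be illustrated with plain Net::HTTP. This is a sketch of the general idea, not the library's implementation: status_for and first_successful_method are hypothetical helpers, and treating only 2xx codes as success is an assumption. The demonstration uses a stub instead of a real server so that it runs without network access.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: sends one request with the given HTTP method
# and returns the numeric status code.
def status_for(url, http_method)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.send_request(http_method, uri.request_uri).code.to_i
  end
end

# Hypothetical helper: tries each method in order and stops at the first
# 2xx status. Returns [method, status] for the last attempt made.
def first_successful_method(methods)
  attempt = nil
  methods.each do |http_method|
    attempt = [http_method, yield(http_method)]
    break if (200..299).cover?(attempt.last)
  end
  attempt
end

# Stub for a server that rejects HEAD with 405 but answers GET with 200.
stub = ->(http_method) { http_method == 'HEAD' ? 405 : 200 }

fallback = first_successful_method(%w[HEAD GET]) { |m| stub.call(m) }
head_only = first_successful_method(%w[HEAD]) { |m| stub.call(m) }

puts fallback.inspect
puts head_only.inspect
```

Against a real server, the block could call status_for('https://www.example.org', m) instead of the stub. With HEAD only, a server that rejects HEAD is reported as failing even though a GET would have succeeded, which is the trade-off of disabling GET.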
Logger
Pass your own logger.
checker = LinkChecker::Net::HTTP::Checker.new(logger: Logger.new(STDOUT))
User-Agent
Pass your own user-agent. The default is Ruby Link Checker/x.y.z.
checker = LinkChecker::Net::HTTP::Checker.new(user_agent: 'Custom Agent/1.0')
Global Configuration
All options can also be configured globally.
LinkChecker.configure do |config|
  config.user_agent = 'Custom Agent/1.0'
  config.methods = ['HEAD', 'GET']
  config.logger = ::Logger.new(STDOUT)
end
Callbacks and Events
Events enable processing of results as they become available.
checker.on :result do |result|
  puts result # any result
end
checker.on :error, :failure do |result|
  puts result # error or failure
end
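For readers unfamiliar with this callback style, the sketch below shows how an on method that accepts several event names at once might dispatch to handlers. TinyEmitter is a hypothetical stand-in written for illustration; it is not how ruby-link-checker is implemented.

```ruby
# Hypothetical stand-in for illustration, not part of ruby-link-checker.
class TinyEmitter
  def initialize
    # Each event name maps to a list of handler blocks.
    @handlers = Hash.new { |hash, event| hash[event] = [] }
  end

  # Registers the same block for one or more event names.
  def on(*events, &block)
    events.each { |event| @handlers[event] << block }
    self
  end

  # Calls every handler registered for the event with the payload.
  def emit(event, payload)
    @handlers[event].each { |handler| handler.call(payload) }
  end
end

seen = []
emitter = TinyEmitter.new
emitter.on(:error, :failure) { |result| seen << result }

emitter.emit(:success, 'https://www.example.org')        # no handler registered
emitter.emit(:failure, 'https://www.example.org/missing')
emitter.emit(:error, 'not a url')

puts seen.inspect
```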
Checkers support the following events.
Event | Description |
---|---|
:retry | A request is being retried on failure or error. |
:result | A new result, any of success, failure, or error. |
:success | A valid URL, usually a 2xx response from the server. |
:failure | A failed URL, usually a 4xx or a 5xx response from the server. |
:error | An error, such as an invalid URL or a network timeout. |
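The relationship between these outcomes and the result buckets can be sketched as a small classifier. bucket_for is a hypothetical helper, and the rule it applies (2xx is a success, any other status is a failure, a raised error or missing status is an error) is a simplifying assumption based on the "usually" wording in the table, not the library's exact logic.

```ruby
# Hypothetical helper, not part of ruby-link-checker.
# Assumption: 2xx is a success, any other status code is a failure,
# and a raised error or a missing status code is an error.
def bucket_for(code: nil, error: nil)
  return :error if error || code.nil?

  (200..299).cover?(code) ? :success : :failure
end

outcomes = [
  { code: 200 },
  { code: 404 },
  { code: 503 },
  { error: StandardError.new('timed out') }
]

# Group the outcomes into buckets and count each bucket.
counts = outcomes.group_by { |outcome| bucket_for(**outcome) }
                 .transform_values(&:size)

puts counts.inspect
```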
Events are called with results, which contain the following properties.
Property | Description |
---|---|
:url | The original URL before redirects. |
:result_url | The last URL, different from url in case of redirects. |
:method | The result HTTP method. |
:code | HTTP response status code. |
:request_headers | Request headers. |
:redirect_to | A redirect URL in case of redirects. |
:error | A raised error in case of errors. |
See result.rb for more details.
Contributing
You're encouraged to contribute to ruby-link-checker. See CONTRIBUTING for details.
Copyright and License
Copyright (c) Daniel Doubrovkine and Contributors.
This project is licensed under the MIT License.