SitemapGenerator

SitemapGenerator generates Sitemaps for your Rails application. The Sitemaps adhere to the Sitemap 0.9 protocol specification. You specify the contents of your Sitemap using a configuration file, à la Rails Routes. A set of rake tasks is included to help you manage your Sitemaps.

Features

  • Supports Video sitemaps and Image sitemaps
  • Rails3 compatible (beta)
  • Adheres to the Sitemap 0.9 protocol
  • Handles millions of links
  • Compresses Sitemaps using GZip
  • Notifies Search Engines (Google, Yahoo, Bing, Ask, SitemapWriter) of new sitemaps
  • Ensures your old Sitemaps stay in place if the new Sitemap fails to generate
  • You set the hostname (and protocol) of the links in your Sitemap

Changelog

Foreword

Adam Salter first created SitemapGenerator while we were working together in Sydney, Australia. Unfortunately, he passed away in 2009. Since then I have taken over development of SitemapGenerator.

Those who knew him know what an amazing guy he was, and what an excellent Rails programmer he was. His passing is a great loss to the Rails community.

The canonical repository is now: http://github.com/kjvarga/sitemap_generator

Install

Rails 3:

  1. Add the gem to your Gemfile

    gem 'sitemap_generator'

  2. $ rake sitemap:install

Rails 2.x: As a gem

  1. Add the gem as a dependency in your config/environment.rb

    config.gem 'sitemap_generator', :lib => false

  2. $ rake gems:install

  3. Add the following to your RAILS_ROOT/Rakefile

    begin
      require 'sitemap_generator/tasks'
    rescue LoadError => e
      puts "Warning, couldn't load gem tasks: #{e.message}! Skipping..."
    end

  4. $ rake sitemap:install

Rails 2.x: As a plugin

  1. $ ./script/plugin install git://github.com/kjvarga/sitemap_generator.git

Usage

rake sitemap:install creates a config/sitemap.rb file which will contain your logic for generating the Sitemap files.

Once you have configured your sitemap in config/sitemap.rb, run rake sitemap:refresh as needed to create or rebuild your Sitemap files. Sitemaps are generated into the public/ folder and are named sitemap_index.xml.gz, sitemap1.xml.gz, sitemap2.xml.gz, etc.
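
To check what a refresh actually produced, the compressed files can be read back with Ruby's standard Zlib library. The sketch below is not part of SitemapGenerator: the helper name sitemap_locs is made up for illustration, and it assumes every link sits in a <loc> element, as the Sitemap 0.9 protocol requires.

```ruby
require 'zlib'

# Hypothetical helper (not part of the gem): returns the text of every
# <loc> element found in a gzipped Sitemap file.
def sitemap_locs(path)
  xml = Zlib::GzipReader.open(path) { |gz| gz.read }
  xml.scan(%r{<loc>(.*?)</loc>}m).flatten.map { |loc| loc.strip }
end

# Example: print the first few URLs of the first Sitemap file, if it exists.
path = 'public/sitemap1.xml.gz'
puts sitemap_locs(path).first(5) if File.exist?(path)
```

A regular expression is enough for a quick look; for anything stricter a real XML parser would be the safer choice.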

Using rake sitemap:refresh will notify major search engines to let them know that a new Sitemap is available (Google, Yahoo, Bing, Ask, SitemapWriter). To generate new Sitemaps without notifying search engines (for example when running in a local environment) use rake sitemap:refresh:no_ping.

To ping Yahoo you will need to set your Yahoo AppID in config/sitemap.rb. For example: SitemapGenerator::Sitemap.yahoo_app_id = "my_app_id"

To disable all non-essential output (only errors will be displayed), run the rake tasks with the -s option. For example: rake -s sitemap:refresh

Cron

To keep your Sitemaps up-to-date, set up a cron job. Make sure to pass the -s option to silence rake. That way you will only get email when the sitemap build fails.

If you're using Whenever, your schedule would look something like the following:

# config/schedule.rb
every 1.day, :at => '5:00 am' do
  rake "-s sitemap:refresh"
end

Robots.txt

You should add the Sitemap index file to public/robots.txt to help search engines find your Sitemaps. The URL should be the complete URL to the Sitemap index file. For example:

Sitemap: http://www.example.org/sitemap_index.xml.gz
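
If you would rather not edit robots.txt by hand, a few lines of Ruby can add the directive for you. This is only a sketch: ensure_sitemap_in_robots is a hypothetical helper, not something the gem provides, and the path and URL you pass to it are up to you.

```ruby
# Hypothetical helper (not part of the gem): appends a Sitemap directive to
# robots.txt unless that exact line is already present.
# Returns true if the file was changed, false otherwise.
def ensure_sitemap_in_robots(robots_path, index_url)
  directive = "Sitemap: #{index_url}"
  existing = File.exist?(robots_path) ? File.read(robots_path) : ''
  return false if existing.lines.map { |line| line.strip }.include?(directive)

  File.open(robots_path, 'a') do |file|
    # Start on a fresh line if the existing content lacks a trailing newline.
    file.puts unless existing.empty? || existing.end_with?("\n")
    file.puts directive
  end
  true
end
```

Calling ensure_sitemap_in_robots('public/robots.txt', 'http://www.example.org/sitemap_index.xml.gz') a second time leaves the file untouched, so it should be safe to run on every deploy.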

Image and Video Sitemaps

Images can be added to a sitemap URL by passing an :images array to add(). Each item in the array must be a Hash containing tags defined by the Image Sitemap specification. For example:

sitemap.add('/index.html', :images => [{ :loc => 'http://www.example.com/image.png', :title => 'Image' }])

A video can be added to a sitemap URL by passing a :video Hash to add(). The Hash can contain tags defined by the Video Sitemap specification. To associate more than one tag with a video, pass the tags as an array with the key :tags.

sitemap.add('/index.html', :video => {
  :thumbnail_loc => 'http://www.example.com/video1_thumbnail.png',
  :title         => 'Title',
  :description   => 'Description',
  :content_loc   => 'http://www.example.com/cool_video.mpg',
  :tags          => %w[one two three],
  :category      => 'Category'
})

Example config/sitemap.rb

# Set the host name for URL creation
SitemapGenerator::Sitemap.default_host = "http://www.example.com"
SitemapGenerator::Sitemap.yahoo_app_id = nil # Set to your Yahoo AppID to ping Yahoo

SitemapGenerator::Sitemap.add_links do |sitemap|
  # Put links creation logic here.
  #
  # The Root Path ('/') and Sitemap Index file are added automatically.
  # Links are added to the Sitemap output in the order they are specified.
  #
  # Usage: sitemap.add path, options
  #        (default options are used if you don't specify them)
  #
  # Defaults: :priority => 0.5, :changefreq => 'weekly',
  #           :lastmod => Time.now, :host => default_host

  # add '/articles'
  sitemap.add articles_path, :priority => 0.7, :changefreq => 'daily'

  # add all articles
  Article.all.each do |a|
    sitemap.add article_path(a), :lastmod => a.updated_at
  end

  # add news page with images
  News.all.each do |news|
    images = news.images.collect do |image|
      { :loc => image.url, :title => image.name }
    end
    sitemap.add news_path(news), :images => images
  end
end

# Including Sitemaps from Rails Engines.
#
# These Sitemaps should be almost identical to a regular Sitemap file except
# they needn't define their own SitemapGenerator::Sitemap.default_host since
# they will undoubtedly share the host name of the application they belong to.
#
# As an example, say we have a Rails Engine in vendor/plugins/cadability_client
# We can include its Sitemap here as follows:
#
file = File.join(Rails.root, 'vendor/plugins/cadability_client/config/sitemap.rb')
eval(File.read(file), binding, file)

Raison d'être

Most of the Sitemap plugins out there seem to try to recreate the Sitemap links by iterating the Rails routes. In some cases this is possible, but in a great many cases it isn't.

a) There are probably quite a few routes in your routes file that don't need inclusion in the Sitemap. (AJAX routes I'm looking at you.)

and

b) How would you infer the correct series of links for the following route?

map.zipcode 'location/:state/:city/:zipcode', :controller => 'zipcode', :action => 'index'

Don't tell me it's trivial, because it isn't. It just looks trivial.

So my idea is to have another file similar to 'routes.rb' called 'sitemap.rb', where you can define what goes into the Sitemap.

Here's my solution:

Zipcode.find(:all, :include => :city).each do |z|
  sitemap.add zipcode_path(:state => z.city.state, :city => z.city, :zipcode => z)
end

Easy hey?

Other Sitemap settings for the link, like lastmod, priority, changefreq and host are entered automatically, although you can override them if you need to.
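
The override behaviour amounts to merging your options over a set of defaults. The sketch below illustrates the idea in plain Ruby using the default values listed in the example config; link_options and DEFAULT_HOST are hypothetical names, and this is not how the gem is implemented internally.

```ruby
# Illustrative stand-in for SitemapGenerator::Sitemap.default_host.
DEFAULT_HOST = 'http://www.example.com'

# Hypothetical helper (not the gem's code): returns the options a link
# would end up with once the caller's overrides are applied.
def link_options(overrides = {})
  defaults = {
    :priority   => 0.5,
    :changefreq => 'weekly',
    :lastmod    => Time.now,
    :host       => DEFAULT_HOST
  }
  # Keys supplied by the caller win; everything else keeps its default.
  defaults.merge(overrides)
end

opts = link_options(:priority => 0.7, :changefreq => 'daily')
puts opts[:priority]    # 0.7 (overridden)
puts opts[:changefreq]  # daily (overridden)
puts opts[:host]        # http://www.example.com (default)
```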

Compatibility

Tested and working on:

  • Rails 3.0.0
  • Rails 1.x - 2.3.8
  • Ruby 1.8.6, 1.8.7, 1.8.7 Enterprise Edition, 1.9.1

Notes

1) For large sitemaps it may be useful to split your generation into batches to avoid running out of memory. E.g.:

# add movies
Movie.find_in_batches(:batch_size => 1000) do |movies|
  movies.each do |movie|
    sitemap.add "/movies/show/#{movie.to_param}", :lastmod => movie.updated_at, :changefreq => 'weekly'
  end
end

2) New Capistrano deploys will remove your Sitemap files, unless you run rake sitemap:refresh. The way around this is to create a cap task:

after "deploy:update_code", "deploy:copy_old_sitemap"

namespace :deploy do
  task :copy_old_sitemap do
    run "if [ -e #{previous_release}/public/sitemap_index.xml.gz ]; then cp #{previous_release}/public/sitemap* #{current_release}/public/; fi"
  end
end

Known Bugs

  • There's no check on the size of a URL, which isn't supposed to exceed 2,048 bytes.
  • Currently only one Sitemap Index file is supported. It can contain 50,000 Sitemap files, each of which can contain 50,000 URLs, so the limit is 2,500,000,000 (2.5 billion) URLs. I personally have no need of support for more URLs, but the plugin could be improved to support this.
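
Until the gem checks URL length itself, you could guard against over-long URLs in your own config/sitemap.rb. The sketch below shows one way to do that, along with the arithmetic behind the 2.5 billion figure. The constant and method names are made up for illustration, and 2,048 is treated as an inclusive upper bound, following the wording above.

```ruby
MAX_URL_BYTES          = 2_048   # limit mentioned in the first bullet
MAX_LINKS_PER_SITEMAP  = 50_000
MAX_SITEMAPS_PER_INDEX = 50_000

# Hypothetical guard (not part of the gem): true when the URL fits the limit.
def url_within_limit?(url)
  url.bytesize <= MAX_URL_BYTES
end

# Number of Sitemap files needed for a given number of links
# (integer ceiling division).
def sitemap_files_needed(link_count)
  (link_count + MAX_LINKS_PER_SITEMAP - 1) / MAX_LINKS_PER_SITEMAP
end

puts MAX_LINKS_PER_SITEMAP * MAX_SITEMAPS_PER_INDEX             # 2500000000
puts sitemap_files_needed(120_000)                              # 3
puts url_within_limit?('http://www.example.com/' + 'a' * 3000)  # false
```

Inside an add_links block you might skip or log any link for which url_within_limit? returns false rather than adding it.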

Wishlist & Coming Soon

  • Ultimately I'd like to make this gem framework agnostic. It is better suited to being run as a command-line tool as opposed to Ruby-specific Rake tasks.
  • Add rake tasks/options to validate the generated sitemaps.
  • Support News, Mobile, Geo and other types of sitemaps
  • Support for generating sitemaps for sites with multiple domains. Sitemaps can be generated into subdirectories and we can use Rack middleware to rewrite requests for sitemaps to the correct subdirectory based on the request host.
  • Auto coverage testing. Generate a report of broken URLs by checking the status codes of each page in the sitemap.
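
To make the multiple-domain idea a little more concrete, here is a rough sketch of what such a Rack middleware could look like. Nothing like this ships with the gem today: the class name SitemapHostRewriter and the sitemaps/<host>/ directory layout are assumptions made purely for illustration.

```ruby
# Hypothetical middleware (not part of the gem): rewrites requests for
# /sitemap_index.xml.gz and /sitemapN.xml.gz to /sitemaps/<host>/... so
# that each domain is served its own set of files.
class SitemapHostRewriter
  SITEMAP_PATH = %r{\A/sitemap(_index|\d+)\.xml\.gz\z}
  # Dot-separated labels only, so a forged Host header cannot inject
  # path segments such as "..".
  SAFE_HOST = /\A[a-z0-9-]+(\.[a-z0-9-]+)*\z/i

  def initialize(app)
    @app = app
  end

  def call(env)
    path = env['PATH_INFO'].to_s
    # Use the Host header if present, dropping any port number.
    host = (env['HTTP_HOST'] || env['SERVER_NAME']).to_s.sub(/:\d+\z/, '')
    if path =~ SITEMAP_PATH && host =~ SAFE_HOST
      env = env.merge('PATH_INFO' => "/sitemaps/#{host}#{path}")
    end
    @app.call(env)
  end
end
```

For this to have any effect it would need to sit ahead of whatever serves the static files, and the Sitemaps for each domain would have to be generated into the matching subdirectory.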

Thanks (in no particular order)

Copyright (c) 2009 Karl Varga released under the MIT license