Class: SitemapGenerator::LinkSet

Inherits:
Object
Includes:
LocationHelpers
Defined in:
lib/sitemap_generator/link_set.rb

Defined Under Namespace

Modules: LocationHelpers

Constant Summary

@@requires_finalization_opts =
[:filename, :sitemaps_path, :sitemaps_host, :namer]
@@new_location_opts =
[:filename, :sitemaps_path, :namer]

Instance Attribute Summary

Instance Method Summary

Methods included from LocationHelpers

#compress, #compress=, #create_index=, #default_host=, #filename=, #namer, #namer=, #public_path, #public_path=, #search_engines, #search_engines=, #sitemap_index_location, #sitemap_location, #sitemaps_host=, #sitemaps_path=

Constructor Details

#initialize(options = {}) ⇒ LinkSet

Constructor

Options:

  • :adapter - instance of a class with a write method which takes a SitemapGenerator::Location and raw XML data and persists it. The default adapter is a SitemapGenerator::FileAdapter which simply writes files to the filesystem. You can use a SitemapGenerator::WaveAdapter for uploading sitemaps to remote servers - useful for read-only hosts such as Heroku. Or you can provide an instance of your own class to provide custom behavior.

  • :default_host - host including protocol to use in all sitemap links e.g. http://en.google.ca

  • :public_path - Full or relative path to the directory to write sitemaps into. Defaults to the public/ directory in your application root directory or the current working directory.

  • :sitemaps_host - String. Host including protocol to use when generating a link to a sitemap file i.e. the hostname of the server where the sitemaps are hosted. The value will differ from the hostname in your sitemap links. For example: `http://amazon.aws.com/`.

    Note that `include_index` is automatically turned off when the `sitemaps_host` does not match `default_host`, because the link to the sitemap index file that would otherwise be added would point to a different host than the rest of the links in the sitemap, which the sitemap rules forbid.

  • :sitemaps_path - path fragment within public to write sitemaps to e.g. ‘en/’. Sitemaps are written to public_path + sitemaps_path

  • :filename - symbol giving the base name for files (default :sitemap). The names are generated like “#{filename}.xml.gz”, “#{filename}1.xml.gz”, “#{filename}2.xml.gz” with the first file being the index if you have more than one sitemap file.

  • :include_index - Boolean. Whether to add a link pointing to the sitemap index to the current sitemap. This points search engines to your Sitemap Index to include it in the indexing of your site. Default is `false`. Turned off when `sitemaps_host` is set or within a `group()` block, because Google can complain about nested indexing and because a robot that is already reading your sitemap probably knows about the index.

  • :include_root - Boolean. Whether to add the root url i.e. ‘/’ to the current sitemap. Default is `true`. Turned off within a `group()` block.

  • :search_engines - Hash. A hash of search engine names mapped to ping URLs. See ping_search_engines.

  • :verbose - If true, output a summary line for each sitemap and sitemap index that is created. If unset, falls back to `SitemapGenerator.verbose`, and to true when that is also unset.

  • :create_index - Supported values: `true`, `false`, `:auto`. Default: `:auto`. Whether to create a sitemap index file. If `true` an index file is always created, regardless of how many links are in your sitemap. If `false` an index file is never created. If `:auto` an index file is created only if your sitemap has more than one sitemap file.

  • :namer - A SitemapGenerator::SimpleNamer instance for generating the sitemap and index file names. See :filename if you don’t need to do anything fancy, and can accept the default naming conventions.

  • :compress - Specifies which files to compress with gzip. Default is `true`. Accepted values:

    * `true` - Boolean; compress all files.
    * `false` - Boolean; write out only uncompressed files.
    * `:all_but_first` - Symbol; leave the first file uncompressed but compress any remaining files.
    

    The compression setting applies to groups too, so `:all_but_first` has the same effect within a group: the first file in the group is left uncompressed and the rest are compressed. If you require different behaviour for your groups, pass in a `:compress` option e.g. `group(:compress => false) { add('/link') }`

  • :max_sitemap_links - The maximum number of links to put in each sitemap. Default is `SitemapGenerator::MAX_SITEMAP_LINKS`, or 50,000.

Note: When adding a new option be sure to include it in `options_for_group()` if the option should be inherited by groups.
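The `:create_index` and `:compress` rules described above can be sketched as two small decision helpers. This is only an illustration of the documented behaviour, not the library's code: `write_index?` and `compress_file?` are hypothetical names, and the zero-based file position is an assumption made for the example.

```ruby
# Hypothetical helpers illustrating the documented option semantics.
# They are not part of SitemapGenerator.

# Should an index file be written, given the :create_index setting
# and the number of sitemap files that were produced?
def write_index?(create_index, sitemap_file_count)
  case create_index
  when true  then true                    # always create an index
  when false then false                   # never create an index
  when :auto then sitemap_file_count > 1  # only with more than one file
  else raise ArgumentError, "unsupported :create_index value: #{create_index.inspect}"
  end
end

# Should the file at the given zero-based position be gzipped,
# given the :compress setting?
def compress_file?(compress, position)
  case compress
  when true, false    then compress       # all files, or none
  when :all_but_first then position > 0   # first file stays uncompressed
  else raise ArgumentError, "unsupported :compress value: #{compress.inspect}"
  end
end

puts write_index?(:auto, 1)
puts write_index?(:auto, 3)
puts compress_file?(:all_but_first, 0)
puts compress_file?(:all_but_first, 1)
```

Raising on unknown values is a choice made for this sketch; the documentation above does not say how the library treats unsupported settings.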



# File 'lib/sitemap_generator/link_set.rb', line 120

def initialize(options={})
  @default_host, @sitemaps_host, @yield_sitemap, @sitemaps_path, @adapter, @verbose, @protect_index, @sitemap_index, @added_default_links, @created_group, @sitemap = nil

  options = SitemapGenerator::Utilities.reverse_merge(options,
    :include_root => true,
    :include_index => false,
    :filename => :sitemap,
    :search_engines => {
      :google         => "http://www.google.com/webmasters/tools/ping?sitemap=%s"
    },
    :create_index => :auto,
    :compress => true,
    :max_sitemap_links => SitemapGenerator::MAX_SITEMAP_LINKS
  )
  options.each_pair { |k, v| instance_variable_set("@#{k}".to_sym, v) }

  # If an index is passed in, protect it from modification.
  # Sitemaps can be added to the index but nothing else can be changed.
  if options[:sitemap_index]
    @protect_index = true
  end
end
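The constructor above fills in defaults with a reverse merge and then copies every option into an instance variable. A minimal standalone sketch of that pattern follows, assuming a simplified `reverse_merge` (a hypothetical stand-in for `SitemapGenerator::Utilities.reverse_merge`) and a toy `TinyLinkSet` class that exists only for this example.

```ruby
# Hypothetical stand-in for SitemapGenerator::Utilities.reverse_merge:
# caller-supplied options win, defaults only fill the gaps.
def reverse_merge(options, defaults)
  defaults.merge(options)
end

# Toy class for illustration only; it is not the real LinkSet.
class TinyLinkSet
  attr_reader :filename, :compress, :include_root

  def initialize(options = {})
    options = reverse_merge(options,
      filename: :sitemap,
      compress: true,
      include_root: true)
    # Each key becomes an instance variable of the same name.
    options.each_pair { |key, value| instance_variable_set("@#{key}", value) }
  end
end

set = TinyLinkSet.new(filename: :pages, compress: false)
puts set.filename.inspect
puts set.compress.inspect
puts set.include_root.inspect
```

One consequence of this pattern is that any key passed in becomes an instance variable, whether or not an accessor exists for it.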

Instance Attribute Details

#adapter ⇒ Object

Returns the value of attribute adapter.



# File 'lib/sitemap_generator/link_set.rb', line 11

def adapter
  @adapter
end

#create_index ⇒ Object (readonly)

Returns the value of attribute create_index.



# File 'lib/sitemap_generator/link_set.rb', line 10

def create_index
  @create_index
end

#default_host ⇒ Object (readonly)

Returns the value of attribute default_host.



# File 'lib/sitemap_generator/link_set.rb', line 10

def default_host
  @default_host
end

#filename ⇒ Object (readonly)

Returns the value of attribute filename.



# File 'lib/sitemap_generator/link_set.rb', line 10

def filename
  @filename
end

#include_index ⇒ Object

Returns the value of attribute include_index.



# File 'lib/sitemap_generator/link_set.rb', line 11

def include_index
  @include_index
end

#include_root ⇒ Object

Returns the value of attribute include_root.



# File 'lib/sitemap_generator/link_set.rb', line 11

def include_root
  @include_root
end

#max_sitemap_links ⇒ Object

Returns the value of attribute max_sitemap_links.



# File 'lib/sitemap_generator/link_set.rb', line 11

def max_sitemap_links
  @max_sitemap_links
end

#sitemaps_path ⇒ Object (readonly)

Returns the value of attribute sitemaps_path.



# File 'lib/sitemap_generator/link_set.rb', line 10

def sitemaps_path
  @sitemaps_path
end

#verbose ⇒ Object

Set verbose on the instance or by setting ENV['VERBOSE'] to true or false. By default verbose is true. When running rake tasks, pass the -s option to rake to turn verbose off.



# File 'lib/sitemap_generator/link_set.rb', line 368

def verbose
  if @verbose.nil?
    @verbose = SitemapGenerator.verbose.nil? ? true : SitemapGenerator.verbose
  end
  @verbose
end
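The fallback order in `#verbose` (instance setting first, then the module-level setting, then `true`) can be sketched as a pure function. `resolve_verbose` is a hypothetical helper written for this example; unlike the method above it does not memoize the result.

```ruby
# Hypothetical helper mirroring the nil checks in #verbose.
# An explicit false must win over the fallbacks, so the code tests
# for nil rather than relying on truthiness.
def resolve_verbose(instance_setting, global_setting)
  return instance_setting unless instance_setting.nil?
  global_setting.nil? ? true : global_setting
end

puts resolve_verbose(nil, nil)
puts resolve_verbose(nil, false)
puts resolve_verbose(false, true)
```

Using `instance_setting || global_setting || true` instead would silently turn an explicit `false` back into `true`, which is why the nil checks matter.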

#yield_sitemap ⇒ Object

Returns the value of attribute yield_sitemap.



# File 'lib/sitemap_generator/link_set.rb', line 11

def yield_sitemap
  @yield_sitemap
end

Instance Method Details

#add(link, options = {}) ⇒ Object

Add a link to a Sitemap. If a new Sitemap is required, one will be created for you.

  • link - A string link e.g. ‘/merchant’, ‘/article/1’.

  • options - A hash of options; see the README. The `:host` option is the host for the link and defaults to your `default_host`.


# File 'lib/sitemap_generator/link_set.rb', line 149

def add(link, options={})
  add_default_links if !@added_default_links
  sitemap.add(link, SitemapGenerator::Utilities.reverse_merge(options, :host => @default_host))
rescue SitemapGenerator::SitemapFullError
  finalize_sitemap!
  retry
rescue SitemapGenerator::SitemapFinalizedError
  @sitemap = sitemap.new
  retry
end
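`#add` relies on a method-level `rescue` with `retry`: when the current file reports that it is full, it is closed, a fresh one is started, and the same call runs again. The sketch below shows that control flow in isolation. `TinyFile`, `TinySet` and `FullError` are hypothetical stand-ins invented for the example, and the capacity of two links is arbitrary.

```ruby
# Toy classes for illustration only; not the SitemapGenerator API.
class FullError < StandardError; end

# A file that accepts a fixed number of links.
class TinyFile
  attr_reader :links

  def initialize(capacity)
    @capacity = capacity
    @links = []
  end

  def add(link)
    raise FullError if @links.size >= @capacity
    @links << link
  end
end

# A set that rolls over to a new file whenever the current one is full.
class TinySet
  def initialize(capacity)
    raise ArgumentError, "capacity must be positive" unless capacity > 0
    @capacity = capacity
    @finished = []
    @current = nil
  end

  def add(link)
    current.add(link)
  rescue FullError
    # Close the full file, drop the reference so a new file is created
    # lazily, then run the method body again for the same link.
    @finished << @current
    @current = nil
    retry
  end

  def files
    @finished + [@current].compact
  end

  private

  def current
    @current ||= TinyFile.new(@capacity)
  end
end

set = TinySet.new(2)
%w[/a /b /c /d /e].each { |link| set.add(link) }
set.files.each_with_index { |file, i| puts "file #{i}: #{file.links.join(' ')}" }
```

The positive-capacity guard matters in this sketch because a file that can never accept a link would make `retry` loop forever.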

#add_to_index(link, options = {}) ⇒ Object

Add a link to the Sitemap Index.

  • link - A string link e.g. ‘/sitemaps/sitemap1.xml.gz’ or a SitemapFile instance.

  • options - A hash of options including `:lastmod`, `:priority`, `:changefreq` and `:host`

The `:host` option defaults to the value of `sitemaps_host` which is the host where your sitemaps reside. If no `sitemaps_host` is set, the `default_host` is used.



# File 'lib/sitemap_generator/link_set.rb', line 166

def add_to_index(link, options={})
  sitemap_index.add(link, SitemapGenerator::Utilities.reverse_merge(options, :host => sitemaps_host))
end

#create(opts = {}, &block) ⇒ Object

Create a new sitemap index and sitemap files. Pass a block with calls to the following methods:

  • add - Add a link to the current sitemap

  • group - Start a new group of sitemaps

Options

Any option supported by new can be passed. The options will be set on the instance using the accessor methods. This is provided mostly as a convenience.

In addition to the options to new, the following options are supported:

  • :finalize - The sitemaps are written as they get full and at the end of the block. Pass false as the value to prevent the sitemap or sitemap index from being finalized. Default is true.

If you are calling create more than once in your sitemap configuration file, make sure that you set a different sitemaps_path or filename for each call otherwise the sitemaps may be overwritten.



# File 'lib/sitemap_generator/link_set.rb', line 33

def create(opts={}, &block)
  reset!
  set_options(opts)
  if verbose
    start_time = Time.now
    puts "In '#{sitemap_index.location.public_path}':"
  end
  interpreter.eval(:yield_sitemap => yield_sitemap?, &block)
  finalize!
  end_time = Time.now if verbose
  output(sitemap_index.stats_summary(:time_taken => end_time - start_time)) if verbose
  self
end

#finalize! ⇒ Object

All done. Write out remaining files.



# File 'lib/sitemap_generator/link_set.rb', line 342

def finalize!
  finalize_sitemap!
  finalize_sitemap_index!
end

#group(opts = {}, &block) ⇒ Object

Create a new group of sitemap files.

Returns a new LinkSet instance with the options passed in set on it. All groups share the sitemap index, which is not affected by any of the options passed here.

Options

Any of the options to LinkSet.new, except for :public_path which is shared by all groups.

The current options are inherited by the new group of sitemaps. The only exceptions are :include_index and :include_root, which default to false.

Pass a block to add links to the new LinkSet. If you pass a block the sitemaps will be finalized when the block returns.

If you are not changing any of the location settings like filename, sitemaps_path, sitemaps_host or namer, links you add within the group will be added to the current sitemap. Otherwise the current sitemap file is finalized and a new sitemap file started, using the options you specified.

Most commonly, you’ll want to give the group’s files a distinct name using the filename option.

Options like :default_host can be used and it will only affect the links within the group. Links added outside of the group will revert to the previous default_host.



# File 'lib/sitemap_generator/link_set.rb', line 197

def group(opts={}, &block)
  @created_group = true
  original_opts = opts.dup

  if (@@requires_finalization_opts & original_opts.keys).empty?
    # If no new filename or path is specified reuse the default sitemap file.
    # A new location object will be set on it for the duration of the group.
    original_opts[:sitemap] = sitemap
  elsif original_opts.key?(:sitemaps_host) && (@@new_location_opts & original_opts.keys).empty?
    # If no location options are provided we are creating the next sitemap in the
    # current series, so finalize and inherit the namer.
    finalize_sitemap!
    original_opts[:namer] = namer
  end

  opts = options_for_group(original_opts)
  @group = SitemapGenerator::LinkSet.new(opts)
  if opts.key?(:sitemap)
    # If the group is sharing the current sitemap, set the
    # new location options on the location object.
    @original_location = @sitemap.location.dup
    @sitemap.location.merge!(@group.sitemap_location)
    if block_given?
      @group.interpreter.eval(:yield_sitemap => @yield_sitemap || SitemapGenerator.yield_sitemap?, &block)
      @group.finalize_sitemap!
      @sitemap.location.merge!(@original_location)
    end
  else
    # Handle the case where a user only has one group, and it's being written
    # to a new sitemap file.  They would expect there to be an index.  So force
    # index creation.  If there is more than one group, we would have an index anyways,
    # so it's safe to force index creation in these other cases.  In the case that
    # the groups reuse the current sitemap, don't force index creation because
    # we want the default behaviour i.e. only an index if more than one sitemap file.
    # Don't force index creation if the user specifically requested no index.  This
    # unfortunately means that if they set it to :auto they may be getting an index
    # when they didn't expect one, but you shouldn't be using groups if you only have
    # one sitemap and don't want an index.  Rather, just add the links directly in the create()
    # block.
    @group.send(:create_index=, true, true) if @group.create_index != false

    if block_given?
      @group.interpreter.eval(:yield_sitemap => @yield_sitemap || SitemapGenerator.yield_sitemap?, &block)
      @group.finalize_sitemap!
    end
  end
  @group
end
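The first branch in `#group` uses an array intersection to decide whether the group can keep writing to the current sitemap file. A standalone sketch of that check follows. The key list is copied from `@@requires_finalization_opts` in the Constant Summary; the constant name and the helper `shares_current_sitemap?` are hypothetical and exist only for this example.

```ruby
# Copied from @@requires_finalization_opts; the constant name is an
# assumption made for this sketch.
REQUIRES_FINALIZATION_OPTS = [:filename, :sitemaps_path, :sitemaps_host, :namer].freeze

# Hypothetical helper: true when none of the group's option keys
# changes where files are written, so the current file can be reused.
def shares_current_sitemap?(group_opts)
  (REQUIRES_FINALIZATION_OPTS & group_opts.keys).empty?
end

puts shares_current_sitemap?({ default_host: 'http://example.org' })
puts shares_current_sitemap?({ filename: :geo })
puts shares_current_sitemap?({ sitemaps_path: 'en/', compress: false })
```

`Array#&` returns the elements common to both arrays, so an empty result means the group passed no location-changing option.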

#include_index? ⇒ Boolean

Return a boolean indicating whether to add a link to the sitemap index file to the current sitemap. This points search engines to your Sitemap Index so they include it in the indexing of your site, but is not strictly necessary. Default is `false`. Always false when `sitemaps_host` is set to a host that differs from `default_host`; also turned off within a `group()` block.

Returns:

  • (Boolean)


# File 'lib/sitemap_generator/link_set.rb', line 351

def include_index?
  if default_host && sitemaps_host && sitemaps_host != default_host
    false
  else
    @include_index
  end
end
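To make the host-mismatch rule concrete, the sketch below restates the check as a pure function and runs it on three combinations. `index_link_allowed?` is a hypothetical name, and the hosts are placeholder values chosen for the example.

```ruby
# Hypothetical restatement of the check in #include_index? as a pure
# function: the requested setting is honoured unless both hosts are
# set and differ from each other.
def index_link_allowed?(include_index, default_host, sitemaps_host)
  return false if default_host && sitemaps_host && sitemaps_host != default_host
  include_index
end

site = 'http://example.com'
cdn  = 'http://cdn.example.net'

puts index_link_allowed?(true, site, nil)   # no separate sitemaps host
puts index_link_allowed?(true, site, site)  # same host for both
puts index_link_allowed?(true, site, cdn)   # hosts differ
```

Only the third call is overridden, because a link to an index on `cdn` would point at a different host than the other links in the sitemap.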

#include_root? ⇒ Boolean

Return a boolean indicating whether to automatically add the root url i.e. ‘/’ to the current sitemap. Default is `true`. Turned off within a `group()` block.

Returns:

  • (Boolean)


# File 'lib/sitemap_generator/link_set.rb', line 361

def include_root?
  !!@include_root
end

#link_count ⇒ Object

Return a count of the total number of links in all sitemaps.



# File 'lib/sitemap_generator/link_set.rb', line 311

def link_count
  sitemap_index.total_link_count
end

#ping_search_engines(*args) ⇒ Object

Ping search engines to notify them of updated sitemaps.

Search engines are already notified for you if you run `rake sitemap:refresh`. If you want to ping search engines separately from your sitemap generation, run `rake sitemap:refresh:no_ping` and then run a rake task or script which calls this method as in the example below.

Arguments

  • sitemap_index_url - The full URL to your sitemap index file. If not provided the location is based on the `host` you have set and any other options like your `sitemaps_path`. The URL will be CGI escaped for you when included as part of the search engine ping URL.

Options

A hash of one or more search engines to ping in addition to the default search engines. The key is the name of the search engine as a string or symbol and the value is the full URL to ping, containing a `%s` placeholder that will be replaced by the CGI escaped sitemap index URL. If you have any literal percent characters in your URL you need to escape them with `%%`. For example, if your sitemap index URL is `http://example.com/sitemap.xml.gz` and your ping URL is `http://example.com/100%%/ping?url=%s`, then the final URL that is pinged will be `http://example.com/100%/ping?url=http%3A%2F%2Fexample.com%2Fsitemap.xml.gz`.

Examples

Both of these examples will ping the default search engines in addition to `http://superengine.com/ping?url=http%3A%2F%2Fexample.com%2Fsitemap.xml.gz`

SitemapGenerator::Sitemap.default_host = 'http://example.com/'
SitemapGenerator::Sitemap.ping_search_engines(:super_engine => 'http://superengine.com/ping?url=%s')

Is equivalent to:

SitemapGenerator::Sitemap.ping_search_engines('http://example.com/sitemap.xml.gz', :super_engine => 'http://superengine.com/ping?url=%s')


# File 'lib/sitemap_generator/link_set.rb', line 281

def ping_search_engines(*args)
  require 'cgi/session'
  require 'open-uri'
  require 'timeout'

  engines = args.last.is_a?(Hash) ? args.pop : {}
  unescaped_url = args.shift || sitemap_index_url
  index_url = CGI.escape(unescaped_url)

  output("\n")
  output("Pinging with URL '#{unescaped_url}':")
  search_engines.merge(engines).each do |engine, link|
    link = link % index_url
    name = Utilities.titleize(engine.to_s)
    begin
      Timeout::timeout(10) {
        if URI.respond_to?(:open) # Available since Ruby 2.5
          URI.open(link)
        else
          open(link) # using Kernel#open became deprecated since Ruby 2.7. See https://bugs.ruby-lang.org/issues/15893
        end
      }
      output("  Successful ping of #{name}")
    rescue Timeout::Error, StandardError => e
      output("Ping failed for #{name}: #{e.inspect} (URL #{link})")
    end
  end
end
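The ping URL is built with Ruby's format operator, which is why literal percent signs in a template must be doubled. The sketch below reproduces the `%%` example from the Options text. `ping_url` is a hypothetical helper, and it uses `URI.encode_www_form_component` as a stand-in for `CGI.escape`; the two produce the same result for this URL, but that equivalence is an assumption of the sketch rather than something the library relies on.

```ruby
require 'uri'

# Hypothetical helper: substitute the escaped sitemap index URL into a
# ping template. String#% treats "%s" as the placeholder and "%%" as a
# literal percent sign.
def ping_url(template, sitemap_index_url)
  # Stand-in for CGI.escape, used here to keep the sketch on 'uri' only.
  template % URI.encode_www_form_component(sitemap_index_url)
end

puts ping_url('http://example.com/100%%/ping?url=%s',
              'http://example.com/sitemap.xml.gz')
puts ping_url('http://superengine.com/ping?url=%s',
              'http://example.com/sitemap.xml.gz')
```

A template with a single, unescaped `%` followed by other characters would be read as a malformed format directive, which is the failure the `%%` rule avoids.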

#sitemap ⇒ Object

Lazy-initialize a sitemap instance and return it.



# File 'lib/sitemap_generator/link_set.rb', line 322

def sitemap
  @sitemap ||= SitemapGenerator::Builder::SitemapFile.new(sitemap_location)
end

#sitemap_index ⇒ Object

Lazy-initialize a sitemap index instance and return it.



# File 'lib/sitemap_generator/link_set.rb', line 327

def sitemap_index
  @sitemap_index ||= SitemapGenerator::Builder::SitemapIndexFile.new(sitemap_index_location)
end

#sitemap_index_url ⇒ Object

Return the full url to the sitemap index file. When `create_index` is `false` the first sitemap is technically the index, so this will be its URL. It’s important to use this method to get the index url because `sitemap_index.location.url` will not be correct in such situations.

KJV: This is somewhat confusing.



# File 'lib/sitemap_generator/link_set.rb', line 337

def sitemap_index_url
  sitemap_index.index_url
end

#sitemaps_host ⇒ Object

Return the host to use in links to the sitemap files. This defaults to your default_host.



# File 'lib/sitemap_generator/link_set.rb', line 317

def sitemaps_host
  @sitemaps_host || @default_host
end

#yield_sitemap? ⇒ Boolean

Return a boolean indicating whether or not to yield the sitemap.

Returns:

  • (Boolean)


# File 'lib/sitemap_generator/link_set.rb', line 376

def yield_sitemap?
  @yield_sitemap.nil? ? SitemapGenerator.yield_sitemap? : !!@yield_sitemap
end