Class: WaybackArchiver::Sitemapper

Inherits:
Object
Defined in:
lib/wayback_archiver/sitemapper.rb

Overview

Fetch and parse sitemaps recursively

Constant Summary

COMMON_SITEMAP_LOCATIONS =

Common locations for Sitemap(s)

%w[
  sitemap_index.xml.gz
  sitemap-index.xml.gz
  sitemap_index.xml
  sitemap-index.xml
  sitemap.xml.gz
  sitemap.xml
].freeze
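
A minimal sketch (not part of the library) of how these candidate paths are turned into full URLs during autodiscovery; it mirrors the join logic in .autodiscover below, with example.com as a placeholder domain.

base = 'https://example.com'
WaybackArchiver::Sitemapper::COMMON_SITEMAP_LOCATIONS.each do |path|
  # Append the candidate path, adding a "/" only when the base URL lacks one
  candidate = [base, path].join(base.end_with?('/') ? '' : '/')
  puts candidate # => https://example.com/sitemap_index.xml.gz, ...
end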

Class Method Summary

Class Method Details

.autodiscover(url) ⇒ Array<String>

Autodiscover the location of the Sitemap, then fetch and parse it recursively. It first tries /robots.txt, then common Sitemap locations, and finally the supplied URL.

Examples:

Get URLs defined in Sitemap for google.com

Sitemapper.autodiscover('https://google.com/')

Parameters:

  • url (URI)

    URL of the domain to autodiscover Sitemaps for.

Returns:

  • (Array<String>)

    of URLs defined in Sitemap(s).




# File 'lib/wayback_archiver/sitemapper.rb', line 27

def self.autodiscover(url)
  WaybackArchiver.logger.info 'Looking for Sitemap(s) in /robots.txt'
  robots = Robots.new(WaybackArchiver.user_agent)
  sitemaps = robots.other_values(url)['Sitemap']

  if sitemaps
    return sitemaps.flat_map do |sitemap|
      WaybackArchiver.logger.info "Fetching Sitemap at #{sitemap}"
      urls(url: sitemap)
    end
  end

  COMMON_SITEMAP_LOCATIONS.each do |path|
    WaybackArchiver.logger.info "Looking for Sitemap at #{path}"
    sitemap_url = [url, path].join(url.end_with?('/') ? '' : '/')
    response = Request.get(sitemap_url, raise_on_http_error: false)

    if response.success?
      WaybackArchiver.logger.info "Sitemap found at #{sitemap_url}"
      return urls(xml: response.body)
    end
  end

  WaybackArchiver.logger.info "Looking for Sitemap at #{url}"
  urls(url: url)
rescue Request::Error => e
  WaybackArchiver.logger.error "Error raised when requesting #{url}, #{e.class}, #{e.message}"
  []
end
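
A minimal usage sketch, assuming the wayback_archiver gem is installed and the target host is reachable; example.com is a placeholder. .autodiscover returns a flat Array of URL strings, and because Request::Error is rescued and logged, a failed lookup yields an empty Array.

require 'wayback_archiver'

urls = WaybackArchiver::Sitemapper.autodiscover('https://example.com/')
urls.each { |url| puts url }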

.urls(url: nil, xml: nil, visited: Set.new) ⇒ Array<String>

Fetch and parse sitemaps recursively.

Examples:

Get URLs defined in Sitemap for google.com

Sitemapper.urls(url: 'https://google.com/sitemap.xml')

Get URLs defined in Sitemap

Sitemapper.urls(xml: xml)

Parameters:

  • url (String) (defaults to: nil)

    URL to Sitemap.

  • xml (String) (defaults to: nil)

    Sitemap XML.

Returns:

  • (Array<String>)

    of URLs defined in Sitemap(s).




# File 'lib/wayback_archiver/sitemapper.rb', line 66

def self.urls(url: nil, xml: nil, visited: Set.new)
  if visited.include?(url)
    WaybackArchiver.logger.debug "Already visited #{url} skipping.."
    return []
  end

  visited << url if url

  xml = Request.get(url).body unless xml
  sitemap = Sitemap.new(xml)

  if sitemap.sitemap_index?
    sitemap.sitemaps.flat_map do |sitemap_url|
      urls(url: sitemap_url, visited: visited)
    end
  else
    sitemap.urls.map { |url| url&.strip }
  end
rescue Request::Error => e
  WaybackArchiver.logger.error "Error raised when requesting #{url}, #{e.class}, #{e.message}"

  []
end
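
A minimal usage sketch, assuming the gem is loaded: .urls accepts either a Sitemap URL or raw Sitemap XML, and the visited set prevents re-fetching a sitemap that has already been seen while recursing through a sitemap index. The local sitemap.xml path is hypothetical.

require 'wayback_archiver'

# Fetch and parse a remote Sitemap, recursing into any nested sitemaps.
from_url = WaybackArchiver::Sitemapper.urls(url: 'https://example.com/sitemap.xml')

# Parse Sitemap XML that has already been fetched (hypothetical local file).
from_xml = WaybackArchiver::Sitemapper.urls(xml: File.read('sitemap.xml'))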