Class: WaybackArchiver::Sitemapper
- Inherits: Object
- Defined in: lib/wayback_archiver/sitemapper.rb
Overview
Fetch and parse sitemaps recursively
Constant Summary
- COMMON_SITEMAP_LOCATIONS =
  %w[sitemap_index.xml.gz sitemap-index.xml.gz sitemap_index.xml sitemap-index.xml sitemap.xml.gz sitemap.xml].freeze
  Common locations for Sitemap(s)
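As a rough illustration of how these paths become full URLs, the sketch below mirrors the trailing-slash check that .autodiscover applies before requesting each location. `SITEMAP_PATHS` is a local copy of the constant and `candidate_sitemap_urls` is a hypothetical helper, neither is part of the library.

```ruby
# Local copy of the documented constant, so the sketch runs on its own.
SITEMAP_PATHS = %w[
  sitemap_index.xml.gz
  sitemap-index.xml.gz
  sitemap_index.xml
  sitemap-index.xml
  sitemap.xml.gz
  sitemap.xml
].freeze

# Hypothetical helper: build the candidate Sitemap URLs for a base URL,
# in the same order as the constant lists them.
def candidate_sitemap_urls(url)
  # Only add a separator when the base URL does not already end with one.
  separator = url.end_with?('/') ? '' : '/'
  SITEMAP_PATHS.map { |path| [url, path].join(separator) }
end
```

Calling `candidate_sitemap_urls('https://example.com')` and `candidate_sitemap_urls('https://example.com/')` should yield the same six URLs, which is the point of the separator check.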
Class Method Summary
- .autodiscover(url) ⇒ Array<String>
  Autodiscover the location of the Sitemap, then fetch and parse recursively.
- .urls(url: nil, xml: nil, visited: Set.new) ⇒ Array<String>
  Fetch and parse sitemaps recursively.
Class Method Details
.autodiscover(url) ⇒ Array<String>
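Autodiscover the location of the Sitemap, then fetch and parse it recursively. It first looks for Sitemap entries in /robots.txt, then tries the common Sitemap locations in order, and finally treats the supplied URL itself as a Sitemap. If a request error is raised, it is logged and an empty array is returned.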
    # File 'lib/wayback_archiver/sitemapper.rb', line 27

    def self.autodiscover(url)
      # 1. Sitemap entries listed in /robots.txt
      WaybackArchiver.logger.info 'Looking for Sitemap(s) in /robots.txt'
      robots = Robots.new(WaybackArchiver.user_agent)
      sitemaps = robots.other_values(url)['Sitemap']

      if sitemaps
        return sitemaps.flat_map do |sitemap|
          WaybackArchiver.logger.info "Fetching Sitemap at #{sitemap}"
          urls(url: sitemap)
        end
      end

      # 2. Common Sitemap locations, first successful response wins
      COMMON_SITEMAP_LOCATIONS.each do |path|
        WaybackArchiver.logger.info "Looking for Sitemap at #{path}"
        sitemap_url = [url, path].join(url.end_with?('/') ? '' : '/')
        response = Request.get(sitemap_url, raise_on_http_error: false)

        if response.success?
          WaybackArchiver.logger.info "Sitemap found at #{sitemap_url}"
          return urls(xml: response.body)
        end
      end

      # 3. Fall back to treating the supplied URL as a Sitemap
      WaybackArchiver.logger.info "Looking for Sitemap at #{url}"
      urls(url: url)
    rescue Request::Error => e
      WaybackArchiver.logger.error "Error raised when requesting #{url}, #{e.class}, #{e.message}"
      []
    end
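The /robots.txt step depends on `Sitemap:` lines in that file. The method above delegates this lookup to the `Robots` class; the sketch below is plain Ruby and only illustrates the idea, it is not that class's implementation. `sitemap_directives` is a hypothetical name.

```ruby
# Hypothetical helper: pull the URLs of all "Sitemap:" lines out of a
# robots.txt body. The directive name is matched case-insensitively.
def sitemap_directives(robots_txt)
  robots_txt.each_line.filter_map do |line|
    match = line.match(/\A\s*sitemap:\s*(\S+)/i)
    match && match[1]
  end
end

# Made-up robots.txt content for illustration.
robots_txt = <<~ROBOTS
  User-agent: *
  Disallow: /private
  Sitemap: https://example.com/sitemap.xml
  sitemap: https://example.com/news-sitemap.xml
ROBOTS

sitemap_urls = sitemap_directives(robots_txt)
```

A file with no such lines gives an empty array here, whereas the method above checks for a falsy `sitemaps` value before moving on to the common locations.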
.urls(url: nil, xml: nil, visited: Set.new) ⇒ Array<String>
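Fetch and parse sitemaps recursively. If the Sitemap is a sitemap index, each Sitemap it references is fetched and parsed in turn; otherwise its URLs are returned with surrounding whitespace stripped. URLs already present in visited are skipped, and a request error is logged and yields an empty array.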
    # File 'lib/wayback_archiver/sitemapper.rb', line 66

    def self.urls(url: nil, xml: nil, visited: Set.new)
      # Guard against sitemap indexes that reference each other
      if visited.include?(url)
        WaybackArchiver.logger.debug "Already visited #{url} skipping.."
        return []
      end

      visited << url if url
      xml = Request.get(url).body unless xml
      sitemap = Sitemap.new(xml)

      if sitemap.sitemap_index?
        sitemap.sitemaps.flat_map do |sitemap_url|
          urls(url: sitemap_url, visited: visited)
        end
      else
        sitemap.urls.map { |url| url&.strip }
      end
    rescue Request::Error => e
      WaybackArchiver.logger.error "Error raised when requesting #{url}, #{e.class}, #{e.message}"
      []
    end
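To show how the visited set keeps the recursion from looping, here is a minimal sketch that swaps HTTP requests and XML parsing for an in-memory hash. `FAKE_SITEMAPS` and `collect_urls` are hypothetical names and the data is made up; only the traversal pattern follows the method above.

```ruby
require 'set'

# Made-up stand-in for fetched sitemaps: an entry with :sitemaps acts as a
# sitemap index, an entry with :urls acts as a regular sitemap.
FAKE_SITEMAPS = {
  'https://example.com/sitemap_index.xml' => {
    sitemaps: ['https://example.com/a.xml', 'https://example.com/b.xml']
  },
  'https://example.com/a.xml' => {
    urls: ['https://example.com/', ' https://example.com/about ']
  },
  # b.xml is an index that points back at the top-level index (a cycle).
  'https://example.com/b.xml' => {
    sitemaps: ['https://example.com/sitemap_index.xml', 'https://example.com/c.xml']
  },
  'https://example.com/c.xml' => {
    urls: ['https://example.com/contact']
  }
}.freeze

# Hypothetical helper: depth-first traversal sharing one visited set across
# all recursive calls, so each sitemap is expanded at most once.
def collect_urls(url, visited = Set.new)
  return [] if visited.include?(url)

  visited << url
  entry = FAKE_SITEMAPS.fetch(url, {})

  if entry.key?(:sitemaps)
    entry[:sitemaps].flat_map { |child| collect_urls(child, visited) }
  else
    Array(entry[:urls]).map(&:strip)
  end
end

page_urls = collect_urls('https://example.com/sitemap_index.xml')
```

Because the same set is passed down each branch, the back-reference from b.xml to the top-level index returns an empty array instead of recursing forever.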