Method: Wgit::Crawler#crawl_site
- Defined in:
- lib/wgit/crawler.rb
#crawl_site(url, follow: :default, max_pages: nil, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Array<Wgit::Url>? Also known as: crawl_r
Crawls an entire website's HTML pages by recursively following
its internal <a> links; this default can be overridden by passing a custom XPath via the follow: parameter.
Each crawled Document is yielded to the given block. Use doc.empty? to
determine whether the crawled link was successful and returned a valid document.
Use the allow_paths: and disallow_paths: params to selectively crawl
only part of a site; glob syntax is fully supported, e.g. 'wiki/*'.
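As an illustration only, here is a minimal usage sketch; the example URL, block body and path globs are made up, but the crawl_site call itself matches the signature documented above:

require 'wgit'

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://www.example.com')

# Crawl only the blog section, skipping its archive (illustrative globs).
externals = crawler.crawl_site(
  url, allow_paths: 'blog/*', disallow_paths: 'blog/archive/*'
) do |doc|
  next if doc.empty? # An empty Document means the link couldn't be crawled.

  puts doc.url
end

puts "External links found: #{externals.size}" unless externals.nil?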
Only redirects to the same host are followed. For example, the Url 'http://www.example.co.uk/how' has a host of 'www.example.co.uk', meaning that a link which redirects to 'https://ftp.example.co.uk' or 'https://www.example.com' will not be followed. The only exception is the initially crawled url, which is allowed to redirect anywhere; its host is then used for the other link redirections on the site, as described above.
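To make the host comparison above concrete, here is a small sketch; it assumes Wgit::Url#to_host, which is not part of this method's signature:

host = Wgit::Url.new('http://www.example.co.uk/how').to_host # => "www.example.co.uk"

Wgit::Url.new('https://www.example.co.uk/about').to_host == host # => true  (followed)
Wgit::Url.new('https://ftp.example.co.uk').to_host       == host # => false (not followed)
Wgit::Url.new('https://www.example.com').to_host         == host # => false (not followed)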
# File 'lib/wgit/crawler.rb', line 146

def crawl_site(
  url, follow: :default, max_pages: nil,
  allow_paths: nil, disallow_paths: nil, &block
)
  doc = crawl_url(url, &block)
  return nil if doc.empty?

  total_pages   = 1
  limit_reached = max_pages && total_pages >= max_pages
  link_opts     = { xpath: follow, allow_paths:, disallow_paths: }

  crawled   = Set.new(url.redirects_journey)
  externals = Set.new(doc.external_links)
  internals = Set.new(next_internal_links(doc, **link_opts))

  return externals.to_a if internals.empty?

  loop do
    if limit_reached
      Wgit.logger.debug("Crawled and reached the max_pages limit of: #{max_pages}")
      break
    end

    links = subtract_links(internals, crawled)
    break if links.empty?

    links.each do |link|
      limit_reached = max_pages && total_pages >= max_pages
      break if limit_reached

      doc = crawl_url(link, follow_redirects: :host, &block)

      crawled += link.redirects_journey
      next if doc.empty?

      total_pages += 1
      internals   += next_internal_links(doc, **link_opts)
      externals   += doc.external_links
    end
  end

  Wgit.logger.debug("Crawled #{total_pages} documents for the site: #{url}")

  externals.to_a
end
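Building on the above, a hedged sketch of overriding follow: with a custom XPath and capping the crawl with max_pages:; the target site and XPath are illustrative only:

crawler = Wgit::Crawler.new
url     = Wgit::Url.new('https://quotes.toscrape.com')

# Follow only the pagination links (illustrative XPath), crawling at most 10 pages.
external_urls = crawler.crawl_site(
  url, follow: "//li[@class='next']/a/@href", max_pages: 10
) do |doc|
  puts doc.title unless doc.empty?
end

# The return value is the site's external links (or nil if the initial crawl failed).
puts external_urls&.length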