Class: Wgit::Indexer
Overview
Class which crawls and saves Documents to a database. It can be thought of as a combination of Wgit::Crawler and Wgit::Database::DatabaseAdapter.
Constant Summary
- WGIT_IGNORE_ROBOTS_TXT = "WGIT_IGNORE_ROBOTS_TXT".freeze
  The ENV var which, if set (to any value), causes robots.txt parsing to be skipped during an index. Applies to all index_* methods.
- SKIP_UPSERT = :skip.freeze
  The block return value used to skip saving a crawled document to the database. Applies to all index_* methods that take a block.
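As a rough usage sketch of both constants (assuming a database connection is already configured for Wgit::Database.new, e.g. via the WGIT_CONNECTION_STRING ENV var; the URL is illustrative only and doc.title refers to Wgit's default Document title extractor):

# Setting the ENV var (to any value) skips robots.txt parsing for all
# subsequent index_* calls.
ENV[Wgit::Indexer::WGIT_IGNORE_ROBOTS_TXT] = 'true'

indexer = Wgit::Indexer.new
indexer.index_url(Wgit::Url.new('https://example.com')) do |doc|
  # Returning SKIP_UPSERT (:skip) prevents this doc from being saved.
  next Wgit::Indexer::SKIP_UPSERT unless doc.title

  doc
end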
Constants included from Assertable
Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::MIXED_ENUMERABLE_MSG, Assertable::NON_ENUMERABLE_MSG
Instance Attribute Summary
- #crawler ⇒ Object (readonly)
  The crawler used to index the WWW.
- #db ⇒ Object (readonly) (also: #database)
  The database instance used to store Urls and Documents in.
Instance Method Summary
- #index_site(url, insert_externals: false, follow: :default, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer (also: #index_r)
  Crawls a single website's pages and stores them into the database.
- #index_url(url, insert_externals: false) {|doc| ... } ⇒ Object
  Crawls a single webpage and stores it into the database.
- #index_urls(*urls, insert_externals: false) {|doc| ... } ⇒ Object (also: #index)
  Crawls one or more webpages and stores them into the database.
- #index_www(max_sites: -1, max_data: 1_048_576_000, max_urls_per_iteration: 10) ⇒ Object
  Retrieves uncrawled urls from the database and recursively crawls each site, storing their internal pages in the database and adding their external urls to be crawled later on.
- #initialize(database = Wgit::Database.new, crawler = Wgit::Crawler.new) ⇒ Indexer (constructor)
  Initialize the Indexer.
- #keep_crawling?(site_count, max_sites, max_data) ⇒ Boolean (protected)
  Returns whether or not to keep crawling based on the DB size and current loop iteration.
- #upsert_doc(doc) ⇒ Object (protected)
  Write the doc to the DB.
- #upsert_external_urls(urls) ⇒ Integer (protected)
  Write the external urls to the DB.
- #upsert_url_and_redirects(url) ⇒ Integer (protected)
  Upsert the url and its redirects, setting all to crawled = true.
Methods included from Assertable
#assert_arr_types, #assert_common_arr_types, #assert_required_keys, #assert_respond_to, #assert_types
Constructor Details
#initialize(database = Wgit::Database.new, crawler = Wgit::Crawler.new) ⇒ Indexer
Initialize the Indexer.
# File 'lib/wgit/indexer.rb', line 32

def initialize(database = Wgit::Database.new, crawler = Wgit::Crawler.new)
  assert_type(database, Wgit::Database::DatabaseAdapter)
  assert_type(crawler, Wgit::Crawler)

  @db      = database
  @crawler = crawler
end
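As a minimal construction sketch (assuming a reachable database behind Wgit::Database.new, e.g. configured via the WGIT_CONNECTION_STRING ENV var):

require 'wgit'

# Use the default database adapter and crawler...
indexer = Wgit::Indexer.new

# ...or inject pre-configured instances of your own.
db      = Wgit::Database.new
crawler = Wgit::Crawler.new
indexer = Wgit::Indexer.new(db, crawler)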
Instance Attribute Details
#crawler ⇒ Object (readonly)
The crawler used to index the WWW.
# File 'lib/wgit/indexer.rb', line 22

def crawler
  @crawler
end
#db ⇒ Object (readonly) Also known as: database
The database instance used to store Urls and Documents in.
# File 'lib/wgit/indexer.rb', line 25

def db
  @db
end
Instance Method Details
#index_site(url, insert_externals: false, follow: :default, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer Also known as: index_r
Crawls a single website's pages and stores them into the database. There is no max download limit, so be careful which sites you index. Logs info on the crawl using Wgit.logger as it goes along. This method will honour the site's robots.txt and 'noindex' requests.
# File 'lib/wgit/indexer.rb', line 141

def index_site(
  url, insert_externals: false, follow: :default,
  allow_paths: nil, disallow_paths: nil
)
  parser = parse_robots_txt(url)
  if parser&.no_index?
    upsert_url_and_redirects(url)
    return 0
  end

  allow_paths, disallow_paths = merge_paths(parser, allow_paths, disallow_paths)
  crawl_opts = { follow:, allow_paths:, disallow_paths: }
  total_pages_indexed = 0

  ext_urls = @crawler.crawl_site(url, **crawl_opts) do |doc|
    next if no_index?(@crawler.last_response, doc)

    result = block_given? ? yield(doc) : true
    next if doc.empty? || result == SKIP_UPSERT

    upsert_doc(doc)
    total_pages_indexed += 1
  end

  upsert_url_and_redirects(url)
  upsert_external_urls(ext_urls) if insert_externals && ext_urls

  Wgit.logger.info("Crawled and indexed #{total_pages_indexed} documents \
for the site: #{url}")

  total_pages_indexed
end
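For illustration, a hedged usage sketch (the URL is hypothetical; doc.title refers to Wgit's default Document title extractor):

indexer = Wgit::Indexer.new
url     = Wgit::Url.new('https://example.com')

pages_indexed = indexer.index_site(url, insert_externals: true) do |doc|
  # Skip saving pages that have no title; everything else is upserted.
  doc.title ? doc : Wgit::Indexer::SKIP_UPSERT
end

puts "Indexed #{pages_indexed} pages for #{url}"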
#index_url(url, insert_externals: false) {|doc| ... } ⇒ Object
Crawls a single webpage and stores it into the database. There is no max download limit, so be careful of large pages. Logs info on the crawl using Wgit.logger as it goes along. This method will honour the site's robots.txt and 'noindex' requests in relation to the given url.
# File 'lib/wgit/indexer.rb', line 211

def index_url(url, insert_externals: false)
  parser = parse_robots_txt(url)
  if parser && (parser.no_index? || contains_path?(parser.disallow_paths, url))
    upsert_url_and_redirects(url)
    return
  end

  document = @crawler.crawl_url(url) do |doc|
    break if no_index?(@crawler.last_response, doc)

    result = block_given? ? yield(doc) : true
    break if doc.empty? || result == SKIP_UPSERT

    upsert_doc(doc)
  end

  upsert_url_and_redirects(url)

  ext_urls = document&.external_links
  upsert_external_urls(ext_urls) if insert_externals && ext_urls

  nil
end
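A rough single-page sketch (the URL is hypothetical); any block return value other than SKIP_UPSERT allows the doc to be saved:

indexer = Wgit::Indexer.new
url     = Wgit::Url.new('https://example.com/about')

indexer.index_url(url, insert_externals: true) do |doc|
  Wgit.logger.info("Indexing #{doc.url}")
end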
#index_urls(*urls, insert_externals: false) {|doc| ... } ⇒ Object Also known as: index
Crawls one or more webpages and stores them into the database. There is no max download limit, so be careful of large pages. Logs info on the crawl using Wgit.logger as it goes along. This method will honour each site's robots.txt and 'noindex' requests in relation to the given urls.
# File 'lib/wgit/indexer.rb', line 189

def index_urls(*urls, insert_externals: false, &block)
  raise 'You must provide at least one Url' if urls.empty?

  opts = { insert_externals: }
  Wgit::Utils.each(urls) { |url| index_url(url, **opts, &block) }

  nil
end
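A rough multi-url sketch (URLs are hypothetical; doc.text is Wgit's default array of extracted text segments):

indexer = Wgit::Indexer.new
urls = [
  Wgit::Url.new('https://example.com'),
  Wgit::Url.new('https://example.org')
]

indexer.index_urls(*urls) do |doc|
  # Only save pages whose extracted text mentions Ruby; skip the rest.
  mentions_ruby = doc.text.any? { |segment| segment.include?('Ruby') }
  mentions_ruby ? doc : Wgit::Indexer::SKIP_UPSERT
end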
#index_www(max_sites: -1, max_data: 1_048_576_000, max_urls_per_iteration: 10) ⇒ Object
Retrieves uncrawled urls from the database and recursively crawls each site, storing their internal pages in the database and adding their external urls to be crawled later on. Logs info on the crawl using Wgit.logger as it goes along. This method will honour all sites' robots.txt and 'noindex' requests.
# File 'lib/wgit/indexer.rb', line 57

def index_www(max_sites: -1, max_data: 1_048_576_000, max_urls_per_iteration: 10)
  if max_sites.negative?
    Wgit.logger.info("Indexing until the database has been filled or it \
runs out of urls to crawl (which might be never)")
  end

  site_count = 0

  while keep_crawling?(site_count, max_sites, max_data)
    Wgit.logger.info("Current database size: #{@db.size}")

    uncrawled_urls = @db.uncrawled_urls(limit: max_urls_per_iteration)

    if uncrawled_urls.empty?
      Wgit.logger.info('No urls to crawl, exiting')
      return
    end
    Wgit.logger.info("Starting indexing loop for: #{uncrawled_urls.map(&:to_s)}")

    docs_count = 0
    urls_count = 0

    uncrawled_urls.each do |url|
      unless keep_crawling?(site_count, max_sites, max_data)
        Wgit.logger.info("Reached max number of sites to crawl or \
database capacity, exiting")
        return
      end
      site_count += 1

      parser = parse_robots_txt(url)
      if parser&.no_index?
        upsert_url_and_redirects(url)
        next
      end

      site_docs_count = 0
      ext_links = @crawler.crawl_site(
        url, allow_paths: parser&.allow_paths, disallow_paths: parser&.disallow_paths
      ) do |doc|
        next if doc.empty? || no_index?(@crawler.last_response, doc)

        upsert_doc(doc)
        docs_count      += 1
        site_docs_count += 1
      end

      upsert_url_and_redirects(url)

      urls_count += upsert_external_urls(ext_links)
    end

    Wgit.logger.info("Crawled and indexed documents for #{docs_count} \
url(s) during this iteration")
    Wgit.logger.info("Found and saved #{urls_count} external url(s) for \
future iterations")
  end

  nil
end
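A rough sketch of a typical call, assuming the database already contains at least one uncrawled url for the loop to pick up (the limits shown are arbitrary):

indexer = Wgit::Indexer.new

# Stop after 5 sites or once roughly 500MB has been stored, pulling up to
# 10 uncrawled urls from the database on each loop iteration.
indexer.index_www(max_sites: 5, max_data: 500_000_000, max_urls_per_iteration: 10)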
#keep_crawling?(site_count, max_sites, max_data) ⇒ Boolean (protected)
Returns whether or not to keep crawling based on the DB size and current loop iteration.
# File 'lib/wgit/indexer.rb', line 247

def keep_crawling?(site_count, max_sites, max_data)
  return false if @db.size >= max_data
  return true if max_sites.negative?

  site_count < max_sites
end
#upsert_doc(doc) ⇒ Object (protected)
Write the doc to the DB. Note that the unique url index on the documents collection deliberately prevents duplicate inserts. If the document already exists, then it will be updated in the DB.
# File 'lib/wgit/indexer.rb', line 259

def upsert_doc(doc)
  if @db.upsert(doc)
    Wgit.logger.info("Saved document for url: #{doc.url}")
  else
    Wgit.logger.info("Updated document for url: #{doc.url}")
  end
end
#upsert_external_urls(urls) ⇒ Integer (protected)
Write the external urls to the DB. For any external url, its origin will be inserted; e.g. if the external url is http://example.com/contact then http://example.com will be inserted into the database. Note that the unique url index on the urls collection deliberately prevents duplicate inserts.
# File 'lib/wgit/indexer.rb', line 286

def upsert_external_urls(urls)
  urls = urls
         .reject(&:invalid?)
         .map(&:to_origin)
         .uniq
  return 0 if urls.empty?

  count = @db.bulk_upsert(urls)
  Wgit.logger.info("Saved #{count} external urls")

  count
end
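To illustrate the origin mapping described above (a small sketch using Wgit::Url#to_origin, which this method relies on):

Wgit::Url.new('http://example.com/contact').to_origin # => "http://example.com"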
#upsert_url_and_redirects(url) ⇒ Integer (protected)
Upsert the url and its redirects, setting all to crawled = true.
# File 'lib/wgit/indexer.rb', line 271

def upsert_url_and_redirects(url)
  url.crawled = true unless url.crawled?

  # Upsert the url and any url redirects, setting them as crawled also.
  @db.bulk_upsert(url.redirects_journey)
end