Class: Wgit::Indexer

Inherits:
Object
Includes:
Assertable
Defined in:
lib/wgit/indexer.rb

Overview

Class which crawls and saves Documents to a database. Can be thought of as a combination of Wgit::Crawler and Wgit::Database::DatabaseAdapter.

Constant Summary

WGIT_IGNORE_ROBOTS_TXT =

The ENV var used to skip robots.txt parsing during an index. Applies to all index_* methods if set in the ENV.

"WGIT_IGNORE_ROBOTS_TXT".freeze
SKIP_UPSERT =

The block return value used to skip saving a crawled document to the database. Applies to all index_* methods that take a block (see the sketch below).

:skip.freeze

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::MIXED_ENUMERABLE_MSG, Assertable::NON_ENUMERABLE_MSG
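
A hypothetical sketch of how WGIT_IGNORE_ROBOTS_TXT and SKIP_UPSERT might be used together; the url and the block's title check are illustrative assumptions, and a connected database is assumed (see #initialize):

# Ignore robots.txt for all subsequent index_* calls; per the description
# above, the var simply needs to be set in the ENV.
ENV[Wgit::Indexer::WGIT_IGNORE_ROBOTS_TXT] = "true"

indexer = Wgit::Indexer.new

indexer.index_site(Wgit::Url.new("http://example.com")) do |doc|
  # Returning SKIP_UPSERT prevents this document from being saved.
  next Wgit::Indexer::SKIP_UPSERT if doc.title.nil?

  doc
end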

Instance Attribute Summary

Instance Method Summary

Methods included from Assertable

#assert_arr_types, #assert_common_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(database = Wgit::Database.new, crawler = Wgit::Crawler.new) ⇒ Indexer

Initialize the Indexer.

Parameters:

  • database (Wgit::Database::DatabaseAdapter) (defaults to: Wgit::Database.new)

    The database instance (already initialized and connected) used for indexing.

  • crawler (Wgit::Crawler) (defaults to: Wgit::Crawler.new)

    The crawler instance used for indexing.



# File 'lib/wgit/indexer.rb', line 32

def initialize(database = Wgit::Database.new, crawler = Wgit::Crawler.new)
  assert_type(database, Wgit::Database::DatabaseAdapter)
  assert_type(crawler, Wgit::Crawler)

  @db      = database
  @crawler = crawler
end
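
A minimal usage sketch of the constructor; a configured and connected database is assumed (e.g. the default adapter reading its connection details from the environment):

db      = Wgit::Database.new   # Default database adapter (already connected).
crawler = Wgit::Crawler.new    # Default crawler.

indexer = Wgit::Indexer.new(db, crawler)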

Instance Attribute Details

#crawler ⇒ Object (readonly)

The crawler used to index the WWW.



# File 'lib/wgit/indexer.rb', line 22

def crawler
  @crawler
end

#db ⇒ Object (readonly) Also known as: database

The database instance used to store Urls and Documents.



# File 'lib/wgit/indexer.rb', line 25

def db
  @db
end

Instance Method Details

#index_site(url, insert_externals: false, follow: :default, allow_paths: nil, disallow_paths: nil) {|doc| ... } ⇒ Integer Also known as: index_r

Crawls a single website's pages and stores them into the database. There is no max download limit so be careful which sites you index. Logs info on the crawl using Wgit.logger as it goes along. This method will honour the site's robots.txt and 'noindex' requests.

Parameters:

  • url (Wgit::Url)

    The base Url of the website to crawl.

  • insert_externals (Boolean) (defaults to: false)

    Whether or not to insert the website's external Urls into the database.

  • follow (String) (defaults to: :default)

    The xpath extracting links to be followed during the crawl. This changes how a site is crawled. Only links pointing to the site domain are allowed. The :default is any <a> href returning HTML.

  • allow_paths (String, Array<String>) (defaults to: nil)

    Filters the follow: links by selecting them if their path matches (File.fnmatch?) one of the allow_paths.

  • disallow_paths (String, Array<String>) (defaults to: nil)

    Filters the follow: links by rejecting them if their path matches (File.fnmatch?) one of the disallow_paths.

Yields:

  • (doc)

    Given the Wgit::Document of each crawled web page before it's inserted into the database, allowing for prior manipulation. Return Wgit::Indexer::SKIP_UPSERT (:skip) from the block to prevent the document from being saved into the database.

Returns:

  • (Integer)

    The total number of webpages/documents indexed.



# File 'lib/wgit/indexer.rb', line 141

def index_site(
  url, insert_externals: false, follow: :default,
  allow_paths: nil, disallow_paths: nil
)
  parser = parse_robots_txt(url)
  if parser&.no_index?
    upsert_url_and_redirects(url)

    return 0
  end

  allow_paths, disallow_paths = merge_paths(parser, allow_paths, disallow_paths)
  crawl_opts = { follow:, allow_paths:, disallow_paths: }
  total_pages_indexed = 0

  ext_urls = @crawler.crawl_site(url, **crawl_opts) do |doc|
    next if no_index?(@crawler.last_response, doc)

    result = block_given? ? yield(doc) : true
    next if doc.empty? || result == SKIP_UPSERT

    upsert_doc(doc)
    total_pages_indexed += 1
  end

  upsert_url_and_redirects(url)
  upsert_external_urls(ext_urls) if insert_externals && ext_urls

  Wgit.logger.info("Crawled and indexed #{total_pages_indexed} documents \
for the site: #{url}")

  total_pages_indexed
end
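
A usage sketch for index_site; the url, xpath, path filter and block check are illustrative assumptions (a connected database is also assumed):

indexer = Wgit::Indexer.new

num_indexed = indexer.index_site(
  Wgit::Url.new("http://example.com"),
  follow: "//a/@href",      # Illustrative xpath; omit to use the :default <a> href behaviour.
  allow_paths: "blog/*"     # Only follow links whose path matches blog/*.
) do |doc|
  # Skip saving pages with no text content; everything else is upserted.
  next Wgit::Indexer::SKIP_UPSERT if doc.text.empty?

  doc
end

puts "Indexed #{num_indexed} pages"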

#index_url(url, insert_externals: false) {|doc| ... } ⇒ Object

Crawls a single webpage and stores it into the database. There is no max download limit so be careful of large pages. Logs info on the crawl using Wgit.logger as it goes along. This method will honour the site's robots.txt and 'noindex' requests in relation to the given url.

Parameters:

  • url (Wgit::Url)

    The webpage Url to crawl.

  • insert_externals (Boolean) (defaults to: false)

    Whether or not to insert the webpage's external Urls into the database.

Yields:

  • (doc)

    Given the Wgit::Document of the crawled webpage, before it's inserted into the database, allowing for prior manipulation. Return Wgit::Indexer::SKIP_UPSERT (:skip) from the block to prevent the document from being saved into the database.



# File 'lib/wgit/indexer.rb', line 211

def index_url(url, insert_externals: false)
  parser = parse_robots_txt(url)
  if parser && (parser.no_index? || contains_path?(parser.disallow_paths, url))
    upsert_url_and_redirects(url)

    return
  end

  document = @crawler.crawl_url(url) do |doc|
    break if no_index?(@crawler.last_response, doc)

    result = block_given? ? yield(doc) : true
    break if doc.empty? || result == SKIP_UPSERT

    upsert_doc(doc)
  end

  upsert_url_and_redirects(url)

  ext_urls = document&.external_links
  upsert_external_urls(ext_urls) if insert_externals && ext_urls

  nil
end
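
A usage sketch for index_url; the url and block logic are illustrative assumptions:

indexer = Wgit::Indexer.new

indexer.index_url(Wgit::Url.new("http://example.com/about"), insert_externals: true) do |doc|
  # Inspect or manipulate the document before it's saved; returning the doc allows the upsert.
  Wgit.logger.info("About to index: #{doc.url}")
  doc
end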

#index_urls(*urls, insert_externals: false) {|doc| ... } ⇒ Object Also known as: index

Crawls one or more webpages and stores them into the database. There is no max download limit so be careful of large pages. Logs info on the crawl using Wgit.logger as it goes along. This method will honour the site's robots.txt and 'noindex' requests in relation to the given urls.

Parameters:

  • urls (*Wgit::Url)

    The webpage Urls to crawl.

  • insert_externals (Boolean) (defaults to: false)

    Whether or not to insert the webpages' external Urls into the database.

Yields:

  • (doc)

    Given the Wgit::Document of each crawled webpage, before it's inserted into the database, allowing for prior manipulation. Return Wgit::Indexer::SKIP_UPSERT (:skip) from the block to prevent the document from being saved into the database.

Raises:

  • (StandardError)

    if no urls are provided.



# File 'lib/wgit/indexer.rb', line 189

def index_urls(*urls, insert_externals: false, &block)
  raise 'You must provide at least one Url' if urls.empty?

  opts = { insert_externals: }
  Wgit::Utils.each(urls) { |url| index_url(url, **opts, &block) }

  nil
end
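
A usage sketch for index_urls (via its index alias); the urls and block check are illustrative assumptions:

urls = [
  Wgit::Url.new("http://example.com"),
  Wgit::Url.new("http://example.org/contact")
]

indexer = Wgit::Indexer.new
indexer.index(*urls) do |doc|
  # Skip saving documents that contain no links; everything else is upserted.
  next Wgit::Indexer::SKIP_UPSERT if doc.links.empty?

  doc
end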

#index_www(max_sites: -1, max_data: 1_048_576_000, max_urls_per_iteration: 10) ⇒ Object

Retrieves uncrawled urls from the database and recursively crawls each site, storing their internal pages into the database and adding their external urls to be crawled later on. Logs info on the crawl using Wgit.logger as it goes along. This method will honour all sites' robots.txt and 'noindex' requests.

Parameters:

  • max_sites (Integer) (defaults to: -1)

    The number of separate and whole websites to be crawled before the method exits. Defaults to -1, which means the crawl will occur until manually stopped (Ctrl+C), max_data has been reached, or it runs out of external urls to index.

  • max_data (Integer) (defaults to: 1_048_576_000)

    The maximum amount of bytes that will be scraped from the web (default is 1GB). Note that this value is used to determine when to stop crawling; it's not a guarantee of the max data that will be obtained.

  • max_urls_per_iteration (Integer) (defaults to: 10)

    The maximum number of uncrawled urls to index for each iteration, before checking max_sites and max_data, possibly ending the crawl.



# File 'lib/wgit/indexer.rb', line 57

def index_www(max_sites: -1, max_data: 1_048_576_000, max_urls_per_iteration: 10)
  if max_sites.negative?
    Wgit.logger.info("Indexing until the database has been filled or it \
runs out of urls to crawl (which might be never)")
  end
  site_count = 0

  while keep_crawling?(site_count, max_sites, max_data)
    Wgit.logger.info("Current database size: #{@db.size}")

    uncrawled_urls = @db.uncrawled_urls(limit: max_urls_per_iteration)

    if uncrawled_urls.empty?
      Wgit.logger.info('No urls to crawl, exiting')

      return
    end
    Wgit.logger.info("Starting indexing loop for: #{uncrawled_urls.map(&:to_s)}")

    docs_count = 0
    urls_count = 0

    uncrawled_urls.each do |url|
      unless keep_crawling?(site_count, max_sites, max_data)
        Wgit.logger.info("Reached max number of sites to crawl or \
database capacity, exiting")

        return
      end
      site_count += 1

      parser = parse_robots_txt(url)
      if parser&.no_index?
        upsert_url_and_redirects(url)

        next
      end

      site_docs_count = 0
      ext_links = @crawler.crawl_site(
        url, allow_paths: parser&.allow_paths, disallow_paths: parser&.disallow_paths
      ) do |doc|
        next if doc.empty? || no_index?(@crawler.last_response, doc)

        upsert_doc(doc)
        docs_count += 1
        site_docs_count += 1
      end

      upsert_url_and_redirects(url)

      urls_count += upsert_external_urls(ext_links)
    end

    Wgit.logger.info("Crawled and indexed documents for #{docs_count} \
url(s) during this iteration")
    Wgit.logger.info("Found and saved #{urls_count} external url(s) for \
future iterations")
  end

  nil
end
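
A usage sketch for index_www; the limits below are illustrative assumptions, and the database is assumed to already contain at least one uncrawled url to seed the crawl:

indexer = Wgit::Indexer.new

# Crawl at most 5 whole sites or ~100 MB of data, whichever comes first,
# processing up to 10 uncrawled urls per iteration (the default).
indexer.index_www(max_sites: 5, max_data: 100 * 1_048_576)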

#keep_crawling?(site_count, max_sites, max_data) ⇒ Boolean (protected)

Returns whether or not to keep crawling based on the DB size and current loop iteration.

Parameters:

  • site_count (Integer)

    The current number of crawled sites.

  • max_sites (Integer)

    The maximum number of sites to crawl before stopping. Use -1 for an infinite number of sites.

  • max_data (Integer)

    The maximum amount of data to crawl before stopping.

Returns:

  • (Boolean)

    True if the crawl should continue, false otherwise.



# File 'lib/wgit/indexer.rb', line 247

def keep_crawling?(site_count, max_sites, max_data)
  return false if @db.size >= max_data
  return true  if max_sites.negative?

  site_count < max_sites
end

#upsert_doc(doc) ⇒ Object (protected)

Write the doc to the DB. Note that the unique url index on the documents collection deliberately prevents duplicate inserts. If the document already exists, then it will be updated in the DB.

Parameters:

  • doc (Wgit::Document)

    The document to write to the DB.

# File 'lib/wgit/indexer.rb', line 259

def upsert_doc(doc)
  if @db.upsert(doc)
    Wgit.logger.info("Saved document for url: #{doc.url}")
  else
    Wgit.logger.info("Updated document for url: #{doc.url}")
  end
end

#upsert_external_urls(urls) ⇒ Integer (protected)

Write the external urls to the DB. For any external url, its origin will be inserted, e.g. if the external url is http://example.com/contact then http://example.com will be inserted into the database. Note that the unique url index on the urls collection deliberately prevents duplicate inserts.

Parameters:

  • urls (Array<Wgit::Url>)

    The external urls to write to the DB.

Returns:

  • (Integer)

    The number of upserted urls.



# File 'lib/wgit/indexer.rb', line 286

def upsert_external_urls(urls)
  urls = urls
         .reject(&:invalid?)
         .map(&:to_origin)
         .uniq
  return 0 if urls.empty?

  count = @db.bulk_upsert(urls)
  Wgit.logger.info("Saved #{count} external urls")

  count
end

#upsert_url_and_redirects(url) ⇒ Integer (protected)

Upsert the url and its redirects, setting all to crawled = true.

Parameters:

  • url (Wgit::Url)

    The url to write to the DB.

Returns:

  • (Integer)

    The number of upserted urls (url + redirect urls).



# File 'lib/wgit/indexer.rb', line 271

def upsert_url_and_redirects(url)
  url.crawled = true unless url.crawled?

  # Upsert the url and any url redirects, setting them as crawled also.
  @db.bulk_upsert(url.redirects_journey)
end