Class: Wgit::Database::MongoDB

Inherits:
DatabaseAdapter
Defined in:
lib/wgit/database/adapters/mongo_db.rb

Overview

Database implementer class for MongoDB.

Constant Summary

URLS_COLLECTION = :urls

The default name of the urls collection.

DOCUMENTS_COLLECTION = :documents

The default name of the documents collection.

TEXT_INDEX = 'text_search'

The default name of the documents collection text search index.

UNIQUE_INDEX = 'unique_url'

The default name of the urls and documents collections unique index.

Constants inherited from DatabaseAdapter

DatabaseAdapter::NOT_IMPL_ERR

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::MIXED_ENUMERABLE_MSG, Assertable::NON_ENUMERABLE_MSG

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from Assertable

#assert_arr_types, #assert_common_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(connection_string = nil) ⇒ MongoDB

Initializes a connected database client using the provided connection_string or ENV['WGIT_CONNECTION_STRING'].

Parameters:

  • connection_string (String) (defaults to: nil)

    The connection string needed to connect to the database.

Raises:

  • (StandardError)

    If a connection string isn't provided, either as a parameter or via the environment.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 42

def initialize(connection_string = nil)
  connection_string ||= ENV['WGIT_CONNECTION_STRING']
  raise "connection_string and ENV['WGIT_CONNECTION_STRING'] are nil" \
  unless connection_string

  @client = MongoDB.establish_connection(connection_string)
  @connection_string = connection_string

  super
end
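
The fallback order used by the constructor above can be sketched in isolation. This is a minimal illustration only: the helper name `resolve_connection_string` and its `env` parameter are hypothetical (they are not part of Wgit), and no database connection is attempted.

```ruby
# Hypothetical helper mirroring the constructor's fallback logic.
# `env` defaults to the real ENV but accepts any Hash-like object,
# which keeps the sketch testable without touching the environment.
def resolve_connection_string(connection_string = nil, env = ENV)
  connection_string ||= env['WGIT_CONNECTION_STRING']
  raise "connection_string and ENV['WGIT_CONNECTION_STRING'] are nil" \
    unless connection_string

  connection_string
end

env = { 'WGIT_CONNECTION_STRING' => 'mongodb://localhost/from_env' }

# An explicit argument takes priority over the environment.
puts resolve_connection_string('mongodb://localhost/explicit', env)

# With no argument, the environment value is used.
puts resolve_connection_string(nil, env)
```

When neither source provides a value, the helper raises a RuntimeError (a StandardError), which matches the documented Raises entry.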

Instance Attribute Details

#client ⇒ Object (readonly)

The database client object. Gets set when a connection is established.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 30

def client
  @client
end

#connection_string ⇒ Object (readonly)

The connection string for the database.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 27

def connection_string
  @connection_string
end

#last_result ⇒ Object (readonly)

The raw MongoDB client result of the most recent operation.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 33

def last_result
  @last_result
end

Class Method Details

.connect(connection_string = nil) ⇒ Wgit::Database::MongoDB

A class alias for self.new.

Parameters:

  • connection_string (String) (defaults to: nil)

    The connection string needed to connect to the database.

Returns:

  • (Wgit::Database::MongoDB)

    A new, connected database instance.

Raises:

  • (StandardError)

    If a connection string isn't provided, either as a parameter or via the environment.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 60

def self.connect(connection_string = nil)
  new(connection_string)
end

.establish_connection(connection_string) ⇒ Mongo::Client

Initializes a connected database client using the connection string.

Parameters:

  • connection_string (String)

    The connection string needed to connect to the database.

Returns:

  • (Mongo::Client)

    The connected MongoDB client.

Raises:

  • (StandardError)

    If a connection cannot be established.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 70

def self.establish_connection(connection_string)
  # Only log for error (and more severe) scenarios.
  Mongo::Logger.logger          = Wgit.logger.clone
  Mongo::Logger.logger.progname = 'mongo'
  Mongo::Logger.logger.level    = Logger::ERROR

  # Connects to the database here.
  Mongo::Client.new(connection_string)
end

Instance Method Details

#bulk_upsert(objs) ⇒ Integer

Bulk upserts the objects in the database collection. The objs cannot be of mixed types; all must be Urls or all must be Documents.

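Parameters:

  • objs (Array<Wgit::Url>, Array<Wgit::Document>)

    The objects to upsert; all must be of the same type.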

Returns:

  • (Integer)

    The total number of upserted and modified objects.

Raises:

  • (StandardError)

    If objs is empty.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 185

def bulk_upsert(objs)
  assert_common_arr_types(objs, [Wgit::Url, Wgit::Document])
  raise 'objs is empty' if objs.empty?

  collection = nil
  request_objs = objs.map do |obj|
    collection, query, model = get_model_info(obj)
    data_hash = model.merge(Wgit::Model.common_update_data)

    {
      update_many: {
        filter: query,
        update: { '$set' => data_hash },
        upsert: true
      }
    }
  end

  result = @client[collection].bulk_write(request_objs)
  result.upserted_count + result.modified_count
ensure
  @last_result = result
end
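
The request objects passed to bulk_write above can be sketched without a database. This is a minimal illustration under stated assumptions: `build_upsert_requests` is a hypothetical helper, the models are plain Hashes rather than Wgit objects, and the `:url` filter is a simplification of whatever get_model_info returns.

```ruby
# Hypothetical helper: builds bulk_write-style request hashes from plain
# model hashes. In #bulk_upsert, the filter and model come from
# get_model_info(obj) instead; the :url filter here is a simplification.
def build_upsert_requests(models, common_data = {})
  models.map do |model|
    {
      update_many: {
        filter: { url: model[:url] },
        update: { '$set' => model.merge(common_data) },
        upsert: true # Insert when no record matches the filter.
      }
    }
  end
end

models = [
  { url: 'http://example.com', crawled: false },
  { url: 'http://example.org', crawled: true }
]

# `touched: true` is a placeholder for Wgit::Model.common_update_data.
requests = build_upsert_requests(models, { touched: true })
requests.each { |request| p request }
```

One request is produced per object, so a single bulk_write call covers the whole batch.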

#crawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>

Returns Url records that have been crawled.

Parameters:

  • limit (Integer) (defaults to: 0)

    The max number of Url's to return. 0 returns all.

  • skip (Integer) (defaults to: 0)

    Skip n amount of Url's.

Yields:

  • (url)

    Given each Url object (Wgit::Url) returned from the DB.

Returns:

  • (Array<Wgit::Url>)

    The crawled Urls obtained from the DB.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 256

def crawled_urls(limit: 0, skip: 0, &block)
  urls(crawled: true, limit:, skip:, &block)
end

#create_collections ⇒ nil

Creates the 'urls' and 'documents' collections.

Returns:

  • (nil)

    Always returns nil.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 85

def create_collections
  @client[URLS_COLLECTION].create
  @client[DOCUMENTS_COLLECTION].create

  nil
end

#create_unique_indexes ⇒ nil

Creates the urls and documents unique 'url' indexes.

Returns:

  • (nil)

    Always returns nil.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 95

def create_unique_indexes
  @client[URLS_COLLECTION].indexes.create_one(
    { url: 1 }, name: UNIQUE_INDEX, unique: true
  )

  @client[DOCUMENTS_COLLECTION].indexes.create_one(
    { 'url.url' => 1 }, name: UNIQUE_INDEX, unique: true
  )

  nil
end

#delete(obj) ⇒ Integer

Deletes a record from the database with the matching 'url' field. Pass either a Wgit::Url or Wgit::Document instance.

Parameters:

  • obj (Wgit::Url, Wgit::Document)

    The Url or Document whose matching record should be deleted.

Returns:

  • (Integer)

    The number of records deleted - should always be 0 or 1 because urls are unique.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 462

def delete(obj)
  collection, query = get_model_info(obj)
  result = @client[collection].delete_one(query)
  result.n
ensure
  @last_result = result
end

#doc?(doc) ⇒ Boolean

Returns whether or not a record with the given doc 'url.url' field (which is unique) exists in the database's 'documents' collection.

Parameters:

  • doc (Wgit::Document)

    The Document to search the DB for.

Returns:

  • (Boolean)

    True if doc exists, otherwise false.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 409

def doc?(doc)
  assert_type(doc, Wgit::Document)
  query = { 'url.url' => doc.url }
  retrieve(DOCUMENTS_COLLECTION, query, limit: 1).any?
end

#docs(limit: 0, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>

Returns all Document records from the DB. Use #search to filter based on the Wgit::Model.search_fields of the documents collection.

All Documents are sorted by date_added ascending, in other words the first doc returned is the first one that was inserted into the DB.

Parameters:

  • limit (Integer) (defaults to: 0)

    The max number of returned records. 0 returns all.

  • skip (Integer) (defaults to: 0)

    Skip n records.

Yields:

  • (doc)

    Given each Document object (Wgit::Document) returned from the DB.

Returns:

  • (Array<Wgit::Document>)

    The Documents obtained from the DB.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 222

def docs(limit: 0, skip: 0, &block)
  results = retrieve(DOCUMENTS_COLLECTION, {},
                     sort: { date_added: 1 }, limit:, skip:)
  return [] if results.count < 1 # results#empty? doesn't exist.

  map_documents(results, &block)
end
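
The meaning of the sort, limit and skip options can be illustrated with an in-memory analogue. This sketch is not how MongoDB executes the query; the `page` helper and the Hash records are hypothetical stand-ins used only to show that results are ordered by date_added ascending, that skip drops the first n records, and that a limit of 0 returns everything.

```ruby
# In-memory analogue of the sort/skip/limit options (hypothetical helper).
def page(records, limit: 0, skip: 0)
  sorted = records.sort_by { |record| record[:date_added] }
  remaining = sorted.drop(skip)

  # A limit of 0 means "no limit".
  limit.zero? ? remaining : remaining.first(limit)
end

records = [
  { url: 'c', date_added: 3 },
  { url: 'a', date_added: 1 },
  { url: 'b', date_added: 2 }
]

p page(records).map { |record| record[:url] }
p page(records, limit: 1, skip: 1).map { |record| record[:url] }
```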

#empty ⇒ Integer Also known as: empty!

Deletes everything in the urls and documents collections.

Returns:

  • (Integer)

    The number of deleted records.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 493

def empty
  empty_urls + empty_docs
end

#empty_docs ⇒ Integer

Deletes everything in the documents collection.

Returns:

  • (Integer)

    The number of deleted records.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 483

def empty_docs
  result = @client[DOCUMENTS_COLLECTION].delete_many({})
  result.n
ensure
  @last_result = result
end

#empty_urls ⇒ Integer

Deletes everything in the urls collection.

Returns:

  • (Integer)

    The number of deleted records.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 473

def empty_urls
  result = @client[URLS_COLLECTION].delete_many({})
  result.n
ensure
  @last_result = result
end

#exists?(obj) ⇒ Boolean

Returns whether a record exists with the given obj's url.

Parameters:

  • obj (Wgit::Url, Wgit::Document)

    The Url or Document to search the DB for.

Returns:

  • (Boolean)

    True if a record exists with the url, false otherwise.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 420

def exists?(obj)
  obj.is_a?(String) ? url?(obj) : doc?(obj)
end

#get(obj) ⇒ Wgit::Url, ...

Returns a record from the database with the matching 'url' field; or nil. Pass either a Wgit::Url or Wgit::Document instance.

Parameters:

  • obj (Wgit::Url, Wgit::Document)

    The Url or Document whose matching record should be fetched.

Returns:



# File 'lib/wgit/database/adapters/mongo_db.rb', line 430

def get(obj)
  collection, query = get_model_info(obj)

  record = retrieve(collection, query, limit: 1).first
  return nil unless record

  obj.class.new(record)
end

#insert(data) ⇒ Object

Insert one or more Url or Document objects into the DB.

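Parameters:

  • data (Wgit::Url, Wgit::Document, Enumerable<Wgit::Url>, Enumerable<Wgit::Document>)

    The Url or Document object(s) to insert.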

Raises:

  • (StandardError)

    If data isn't valid.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 147

def insert(data)
  collection = nil
  request_obj = nil

  if data.respond_to?(:map)
    request_obj = data.map do |obj|
      collection, _, model = get_model_info(obj)
      model
    end
  else
    collection, _, model = get_model_info(data)
    request_obj = model
  end

  create(collection, request_obj)
end

#num_docs ⇒ Integer

Returns the total number of Document records in the DB.

Returns:

  • (Integer)

    The current number of Document records.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 382

def num_docs
  @client[DOCUMENTS_COLLECTION].count
end

#num_records ⇒ Integer Also known as: num_objects

Returns the total number of records (urls + docs) in the DB.

Returns:

  • (Integer)

    The current number of URL and Document records.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 389

def num_records
  num_urls + num_docs
end

#num_urls ⇒ Integer

Returns the total number of URL records in the DB.

Returns:

  • (Integer)

    The current number of URL records.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 375

def num_urls
  @client[URLS_COLLECTION].count
end

#search(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>

Searches the database's Documents for the given query using the Wgit::Model.search_fields.

The MongoDB search algorithm ranks/sorts the results in order (highest first) based on each document's "textScore" (which records the number of query hits). The "textScore" is then stored in each Document result object for use elsewhere if needed; accessed via Wgit::Document#score.

Parameters:

  • query (String)

    The text query to search with.

  • case_sensitive (Boolean) (defaults to: false)

    Whether character case must match.

  • whole_sentence (Boolean) (defaults to: true)

    Whether multiple words should be searched for together as a single phrase rather than separately.

  • limit (Integer) (defaults to: 10)

    The max number of results to return.

  • skip (Integer) (defaults to: 0)

    The number of results to skip.

Yields:

  • (doc)

    Given each search result (Wgit::Document) returned from the DB.

Returns:

  • (Array<Wgit::Document>)

    The search results obtained from the DB.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 287

def search(
  query, case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, &block
)
  query = query.to_s.strip
  query.replace("\"#{query}\"") if whole_sentence

  # Sort based on the most search hits (aka "textScore").
  # We use the sort_proj hash as both a sort and a projection below.
  sort_proj = { score: { :$meta => 'textScore' } }
  query = {
    :$text => {
      :$search => query,
      :$caseSensitive => case_sensitive
    }
  }

  results = retrieve(DOCUMENTS_COLLECTION, query,
                    sort: sort_proj, projection: sort_proj,
                    limit:, skip:)
  map_documents(results, &block)
end
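
The way the method above shapes its $text filter can be sketched on its own. This is a minimal illustration: `build_text_query` is a hypothetical helper that only builds the Hash and never contacts a database. Wrapping the search text in double quotes is how MongoDB's $text operator is asked to match a whole phrase.

```ruby
# Hypothetical helper reproducing how #search shapes its $text filter.
def build_text_query(query, case_sensitive: false, whole_sentence: true)
  text = query.to_s.strip
  # Wrapping the text in double quotes requests a phrase match.
  text = "\"#{text}\"" if whole_sentence

  { :$text => { :$search => text, :$caseSensitive => case_sensitive } }
end

phrase = build_text_query('  ruby crawler ')
words  = build_text_query('ruby crawler', whole_sentence: false)

puts phrase[:$text][:$search]
puts words[:$text][:$search]
```

With whole_sentence disabled, the words are passed through unquoted, so MongoDB matches documents containing any of them.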

#search!(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80, top_result_only: false) {|doc| ... } ⇒ Hash<String, String | Array<String>>

Searches the database's Documents for the given query and then searches each result in turn using doc.search. Instead of an Array of Documents, this method returns a Hash mapping each doc's url => search_results, creating a search-engine-like result set for quick access to text matches.

Parameters:

  • query (String)

    The text query to search with.

  • case_sensitive (Boolean) (defaults to: false)

    Whether character case must match.

  • whole_sentence (Boolean) (defaults to: true)

    Whether multiple words should be searched for together as a single phrase rather than separately.

  • limit (Integer) (defaults to: 10)

    The max number of results to return.

  • skip (Integer) (defaults to: 0)

    The number of results to skip.

  • sentence_limit (Integer) (defaults to: 80)

    The max length of each search result sentence.

  • top_result_only (Boolean) (defaults to: false)

    Whether to return all of the document's search results (Array) or just the top (most relevant) result (String).

Yields:

  • (doc)

    Given each search result (Wgit::Document) returned from the DB.

Returns:

  • (Hash<String, String | Array<String>>)

    The search results obtained from the DB, mapped as doc url => search_results. The format of search_results depends on the value of top_result_only.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 331

def search!(
  query, case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, sentence_limit: 80, top_result_only: false
)
  results = search(query, case_sensitive:, whole_sentence:, limit:, skip:)

  results
    .map do |doc|
      yield(doc) if block_given?

      results = doc.search(
        query, case_sensitive:, whole_sentence:, sentence_limit:
      )

      if results.empty?
        Wgit.logger.warn("MongoDB and Document #search calls have \
differing results")
        next nil
      end

      results = results.first if top_result_only
      [doc.url, results]
    end
    .compact
    .to_h
end
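
The shape of the returned Hash, and the effect of top_result_only, can be shown with stand-in objects. This is a sketch under stated assumptions: `FakeDoc` and `to_result_hash` are hypothetical, and `FakeDoc#search` returns canned matches instead of searching real document text.

```ruby
# Stand-in for Wgit::Document: only #url and a canned #search are modelled.
FakeDoc = Struct.new(:url, :matches) do
  def search(_query)
    matches
  end
end

# Hypothetical helper mirroring the map/compact/to_h pipeline in #search!.
def to_result_hash(docs, query, top_result_only: false)
  docs
    .map do |doc|
      matches = doc.search(query)
      next nil if matches.empty? # Docs without text matches are dropped.

      matches = matches.first if top_result_only
      [doc.url, matches]
    end
    .compact
    .to_h
end

docs = [
  FakeDoc.new('http://example.com/a', ['foo bar', 'bar baz']),
  FakeDoc.new('http://example.com/b', [])
]

p to_result_hash(docs, 'bar')
p to_result_hash(docs, 'bar', top_result_only: true)
```

Each value is an Array of Strings by default and a single String when top_result_only is true, which is why the return type is documented as String | Array<String>.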

#search_fields ⇒ Hash

Gets the documents collection text search fields and their weights.

Returns:

  • (Hash)

    The fields and their weights.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 133

def search_fields
  indexes = @client[DOCUMENTS_COLLECTION].indexes
  indexes.get(TEXT_INDEX)&.[]('weights')
end

#search_fields=(fields) ⇒ Object

Sets the documents collection search fields via a text index. This method is called from Wgit::Model.set_search_fields and shouldn't be called directly.

This method is labor intensive on large collections so change rarely and wisely. This method is idempotent in that it will remove the index if it already exists before it creates the new index.

Parameters:

  • fields (Hash<Symbol, Integer>)

    The field names or the field names and their corresponding search weights.

Raises:

  • (StandardError)

    If fields is not a Hash.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 118

def search_fields=(fields)
  assert_type(fields, Hash)

  indexes = @client[DOCUMENTS_COLLECTION].indexes

  indexes.drop_one(TEXT_INDEX) if indexes.get(TEXT_INDEX)
  indexes.create_one(
    fields.transform_values { 'text' },
    { name: TEXT_INDEX, weights: fields, background: true }
  )
end
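
The two arguments handed to create_one above can be derived from a fields Hash as follows. This is a minimal sketch: `text_index_args` is a hypothetical helper, the `title` and `text` field names are illustrative, and a TypeError stands in for the error raised by assert_type.

```ruby
# Hypothetical helper: derives the index keys and options from `fields`.
def text_index_args(fields, index_name = 'text_search')
  raise TypeError, 'fields must be a Hash' unless fields.is_a?(Hash)

  # Every field is indexed as 'text'; the weights travel in the options.
  keys = fields.transform_values { 'text' }
  options = { name: index_name, weights: fields, background: true }

  [keys, options]
end

keys, options = text_index_args({ title: 2, text: 1 })
p keys
p options
```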

#size ⇒ Integer

Returns the current size of the database.

Returns:

  • (Integer)

    The current size of the DB.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 368

def size
  stats[:dataSize]
end

#stats ⇒ BSON::Document#[]#fetch

Returns statistics about the database.

Returns:

  • (BSON::Document#[]#fetch)

    Similar to a Hash instance.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 361

def stats
  @client.command(dbStats: 0).documents[0]
end

#uncrawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>

Returns Url records that haven't yet been crawled.

Parameters:

  • limit (Integer) (defaults to: 0)

    The max number of Url's to return. 0 returns all.

  • skip (Integer) (defaults to: 0)

    Skip n amount of Url's.

Yields:

  • (url)

    Given each Url object (Wgit::Url) returned from the DB.

Returns:

  • (Array<Wgit::Url>)

    The uncrawled Urls obtained from the DB.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 266

def uncrawled_urls(limit: 0, skip: 0, &block)
  urls(crawled: false, limit:, skip:, &block)
end

#update(obj) ⇒ Integer

Update a Url or Document object in the DB.

Parameters:

  • obj (Wgit::Url, Wgit::Document)

    The Url or Document to update.

Returns:

  • (Integer)

    The number of updated records/objects.

Raises:

  • (StandardError)

    If the obj is not valid.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 446

def update(obj)
  collection, query, model = get_model_info(obj)
  data_hash = model.merge(Wgit::Model.common_update_data)

  mutate(collection, query, { '$set' => data_hash })
end

#upsert(obj) ⇒ Boolean

Inserts or updates the object in the database.

Parameters:

  • obj (Wgit::Url, Wgit::Document)

    The Url or Document to insert or update.

Returns:

  • (Boolean)

    True if inserted, false if updated.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 168

def upsert(obj)
  collection, query, model = get_model_info(obj)
  data_hash = model.merge(Wgit::Model.common_update_data)
  result = @client[collection].replace_one(query, data_hash, upsert: true)

  result.matched_count.zero?
ensure
  @last_result = result
end

#url?(url) ⇒ Boolean

Returns whether or not a record with the given 'url' field (which is unique) exists in the database's 'urls' collection.

Parameters:

  • url (Wgit::Url)

    The Url to search the DB for.

Returns:

  • (Boolean)

    True if url exists, otherwise false.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 398

def url?(url)
  assert_type(url, String) # This includes Wgit::Url's.
  query = { url: }
  retrieve(URLS_COLLECTION, query, limit: 1).any?
end

#urls(crawled: nil, limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>

Returns all Url records from the DB.

All Urls are sorted by date_added ascending, in other words the first url returned is the first one that was inserted into the DB.

Parameters:

  • crawled (Boolean) (defaults to: nil)

    Filter by Url#crawled value. nil returns all.

  • limit (Integer) (defaults to: 0)

    The max number of Url's to return. 0 returns all.

  • skip (Integer) (defaults to: 0)

    Skip n amount of Url's.

Yields:

  • (url)

    Given each Url object (Wgit::Url) returned from the DB.

Returns:

  • (Array<Wgit::Url>)

    The Urls obtained from the DB.



# File 'lib/wgit/database/adapters/mongo_db.rb', line 240

def urls(crawled: nil, limit: 0, skip: 0, &block)
  query = crawled.nil? ? {} : { crawled: }
  sort = { date_added: 1 }

  results = retrieve(URLS_COLLECTION, query, sort:, limit:, skip:)
  return [] if results.count < 1 # results#empty? doesn't exist.

  map_urls(results, &block)
end
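
The crawled filter above distinguishes nil from false, which is what lets #crawled_urls and #uncrawled_urls reuse this method. A minimal sketch of that branch, using a hypothetical `build_urls_filter` helper that only builds the filter Hash:

```ruby
# Hypothetical helper mirroring the filter built at the top of #urls.
def build_urls_filter(crawled = nil)
  # Only nil means "no filter"; false is a real filter value.
  crawled.nil? ? {} : { crawled: crawled }
end

p build_urls_filter        # No filter: matches every Url.
p build_urls_filter(true)  # Same filter as #crawled_urls.
p build_urls_filter(false) # Same filter as #uncrawled_urls.
```

A plain truthiness check would treat false like nil and return every Url, hence the explicit nil? test.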