Class: Wgit::Database::MongoDB
- Inherits:
- DatabaseAdapter (Object > DatabaseAdapter > Wgit::Database::MongoDB)
- Defined in:
- lib/wgit/database/adapters/mongo_db.rb
Overview
Database implementer class for MongoDB.
Constant Summary
- URLS_COLLECTION = :urls
The default name of the urls collection.
- DOCUMENTS_COLLECTION = :documents
The default name of the documents collection.
- TEXT_INDEX = 'text_search'
The default name of the documents collection text search index.
- UNIQUE_INDEX = 'unique_url'
The default name of the urls and documents collections unique index.
Constants inherited from DatabaseAdapter
Constants included from Assertable
Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::MIXED_ENUMERABLE_MSG, Assertable::NON_ENUMERABLE_MSG
Instance Attribute Summary
-
#client ⇒ Object
readonly
The database client object.
-
#connection_string ⇒ Object
readonly
The connection string for the database.
-
#last_result ⇒ Object
readonly
The raw MongoDB client result of the most recent operation.
Class Method Summary
-
.connect(connection_string = nil) ⇒ Wgit::Database::MongoDB
A class alias for self.new.
-
.establish_connection(connection_string) ⇒ Mongo::Client
Initializes a connected database client using the connection string.
Instance Method Summary
-
#bulk_upsert(objs) ⇒ Integer
Bulk upserts the objects in the database collection.
-
#crawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns Url records that have been crawled.
-
#create_collections ⇒ nil
Creates the 'urls' and 'documents' collections.
-
#create_unique_indexes ⇒ nil
Creates the urls and documents unique 'url' indexes.
-
#delete(obj) ⇒ Integer
Deletes a record from the database with the matching 'url' field.
-
#doc?(doc) ⇒ Boolean
Returns whether or not a record with the given doc 'url.url' field (which is unique) exists in the database's 'documents' collection.
-
#docs(limit: 0, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>
Returns all Document records from the DB.
-
#empty ⇒ Integer
(also: #empty!)
Deletes everything in the urls and documents collections.
-
#empty_docs ⇒ Integer
Deletes everything in the documents collection.
-
#empty_urls ⇒ Integer
Deletes everything in the urls collection.
-
#exists?(obj) ⇒ Boolean
Returns whether a record exists with the given obj's url.
-
#get(obj) ⇒ Wgit::Url, ...
Returns a record from the database with the matching 'url' field; or nil.
-
#initialize(connection_string = nil) ⇒ MongoDB
constructor
Initializes a connected database client using the provided connection_string or ENV['WGIT_CONNECTION_STRING'].
-
#insert(data) ⇒ Object
Insert one or more Url or Document objects into the DB.
-
#num_docs ⇒ Integer
Returns the total number of Document records in the DB.
-
#num_records ⇒ Integer
(also: #num_objects)
Returns the total number of records (urls + docs) in the DB.
-
#num_urls ⇒ Integer
Returns the total number of URL records in the DB.
-
#search(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>
Searches the database's Documents for the given query using the Wgit::Model.search_fields.
-
#search!(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80, top_result_only: false) {|doc| ... } ⇒ Hash<String, String | Array<String>>
Searches the database's Documents for the given query and then searches each result in turn using doc.search.
-
#search_fields ⇒ Hash
Gets the documents collection text search fields and their weights.
-
#search_fields=(fields) ⇒ Object
Sets the documents collection search fields via a text index.
-
#size ⇒ Integer
Returns the current size of the database.
-
#stats ⇒ BSON::Document#[]#fetch
Returns statistics about the database.
-
#uncrawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns Url records that haven't yet been crawled.
-
#update(obj) ⇒ Integer
Update a Url or Document object in the DB.
-
#upsert(obj) ⇒ Boolean
Inserts or updates the object in the database.
-
#url?(url) ⇒ Boolean
Returns whether or not a record with the given 'url' field (which is unique) exists in the database's 'urls' collection.
-
#urls(crawled: nil, limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns all Url records from the DB.
Methods included from Assertable
#assert_arr_types, #assert_common_arr_types, #assert_required_keys, #assert_respond_to, #assert_types
Constructor Details
#initialize(connection_string = nil) ⇒ MongoDB
Initializes a connected database client using the provided connection_string or ENV['WGIT_CONNECTION_STRING'].
# File 'lib/wgit/database/adapters/mongo_db.rb', line 42
def initialize(connection_string = nil)
  connection_string ||= ENV['WGIT_CONNECTION_STRING']
  raise "connection_string and ENV['WGIT_CONNECTION_STRING'] are nil" \
  unless connection_string

  @client = MongoDB.establish_connection(connection_string)
  @connection_string = connection_string

  super
end
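The fallback from the argument to ENV['WGIT_CONNECTION_STRING'] can be sketched in isolation. The helper name resolve_connection_string and the example connection strings below are hypothetical and not part of Wgit; the sketch only mirrors the lookup order and the error raised by the constructor, without opening a connection.

```ruby
# Hypothetical helper (not part of Wgit) mirroring the lookup order used by
# #initialize: an explicit argument wins, then the environment variable.
def resolve_connection_string(connection_string = nil, env = ENV)
  connection_string ||= env['WGIT_CONNECTION_STRING']
  unless connection_string
    raise "connection_string and ENV['WGIT_CONNECTION_STRING'] are nil"
  end

  connection_string
end

# An explicit argument is used as given.
explicit = resolve_connection_string('mongodb://localhost/crawler', {})

# With no argument, the value comes from the (here simulated) environment.
fake_env = { 'WGIT_CONNECTION_STRING' => 'mongodb://localhost/from_env' }
fallback = resolve_connection_string(nil, fake_env)

puts explicit
puts fallback
```

Passing the environment as a parameter is only a convenience for trying the logic without touching the real ENV.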
Instance Attribute Details
#client ⇒ Object (readonly)
The database client object. Gets set when a connection is established.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 30
def client
  @client
end
#connection_string ⇒ Object (readonly)
The connection string for the database.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 27
def connection_string
  @connection_string
end
#last_result ⇒ Object (readonly)
The raw MongoDB client result of the most recent operation.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 33
def last_result
  @last_result
end
Class Method Details
.connect(connection_string = nil) ⇒ Wgit::Database::MongoDB
A class alias for self.new.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 60
def self.connect(connection_string = nil)
  new(connection_string)
end
.establish_connection(connection_string) ⇒ Mongo::Client
Initializes a connected database client using the connection string.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 70
def self.establish_connection(connection_string)
  # Only log for error (and more severe) scenarios.
  Mongo::Logger.logger = Wgit.logger.clone
  Mongo::Logger.logger.progname = 'mongo'
  Mongo::Logger.logger.level = Logger::ERROR

  # Connects to the database here.
  Mongo::Client.new(connection_string)
end
Instance Method Details
#bulk_upsert(objs) ⇒ Integer
Bulk upserts the objects in the database collection. You cannot mix collection objs types, all must be Urls or Documents.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 185
def bulk_upsert(objs)
  assert_common_arr_types(objs, [Wgit::Url, Wgit::Document])
  raise 'objs is empty' if objs.empty?

  collection = nil
  request_objs = objs.map do |obj|
    collection, query, model = get_model_info(obj)
    data_hash = model.merge(Wgit::Model.common_update_data)

    {
      update_many: {
        filter: query,
        update: { '$set' => data_hash },
        upsert: true
      }
    }
  end

  result = @client[collection].bulk_write(request_objs)
  result.upserted_count + result.modified_count
ensure
  @last_result = result
end
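To see the shape of the requests that #bulk_upsert hands to bulk_write, the per-object step can be sketched with plain hashes. The helper build_upsert_request, the sample models and the contents of the common update data are all assumptions made for illustration; in Wgit the query and model come from get_model_info and the shared fields from Wgit::Model.common_update_data.

```ruby
# Hypothetical stand-in for the body of the objs.map block in #bulk_upsert.
def build_upsert_request(query, model, common_update_data)
  data_hash = model.merge(common_update_data)

  {
    update_many: {
      filter: query,                    # which record(s) to match
      update: { '$set' => data_hash },  # fields to write
      upsert: true                      # insert when nothing matches
    }
  }
end

# Assumed shape only; the real common update data is defined by Wgit::Model.
common = { date_modified: Time.now }

models = [
  { url: 'http://example.com', crawled: false },
  { url: 'http://example.org', crawled: true }
]

requests = models.map do |model|
  build_upsert_request({ url: model[:url] }, model, common)
end

puts requests.length
puts requests.first[:update_many][:filter]
```

Each request targets one url, so the array as a whole must belong to a single collection, which matches the rule that Urls and Documents cannot be mixed.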
#crawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns Url records that have been crawled.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 256
def crawled_urls(limit: 0, skip: 0, &block)
  urls(crawled: true, limit:, skip:, &block)
end
#create_collections ⇒ nil
Creates the 'urls' and 'documents' collections.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 85
def create_collections
  @client[URLS_COLLECTION].create
  @client[DOCUMENTS_COLLECTION].create

  nil
end
#create_unique_indexes ⇒ nil
Creates the urls and documents unique 'url' indexes.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 95
def create_unique_indexes
  @client[URLS_COLLECTION].indexes.create_one(
    { url: 1 }, name: UNIQUE_INDEX, unique: true
  )

  @client[DOCUMENTS_COLLECTION].indexes.create_one(
    { 'url.url' => 1 }, name: UNIQUE_INDEX, unique: true
  )

  nil
end
#delete(obj) ⇒ Integer
Deletes a record from the database with the matching 'url' field. Pass either a Wgit::Url or Wgit::Document instance.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 462
def delete(obj)
  collection, query = get_model_info(obj)
  result = @client[collection].delete_one(query)
  result.n
ensure
  @last_result = result
end
#doc?(doc) ⇒ Boolean
Returns whether or not a record with the given doc 'url.url' field (which is unique) exists in the database's 'documents' collection.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 409
def doc?(doc)
  assert_type(doc, Wgit::Document)
  query = { 'url.url' => doc.url }
  retrieve(DOCUMENTS_COLLECTION, query, limit: 1).any?
end
#docs(limit: 0, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>
Returns all Document records from the DB. Use #search to filter based on the Wgit::Model.search_fields of the documents collection.
All Documents are sorted by date_added ascending, in other words the first doc returned is the first one that was inserted into the DB.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 222
def docs(limit: 0, skip: 0, &block)
  results = retrieve(DOCUMENTS_COLLECTION, {},
                     sort: { date_added: 1 }, limit:, skip:)
  return [] if results.count < 1 # results#empty? doesn't exist.

  map_documents(results, &block)
end
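The ordering and paging behaviour described above can be illustrated with an in-memory array. This is a sketch under the assumption that limit and skip are passed straight through to MongoDB, where a limit of 0 means "no limit"; the helper name page and the sample records are hypothetical.

```ruby
# Hypothetical in-memory model of sort: { date_added: 1 } plus skip/limit.
def page(records, limit: 0, skip: 0)
  sorted = records.sort_by { |record| record[:date_added] } # oldest first
  remaining = sorted.drop(skip)

  # A limit of 0 is treated as "return everything that remains".
  limit.zero? ? remaining : remaining.first(limit)
end

records = [
  { url: 'http://c.example', date_added: 3 },
  { url: 'http://a.example', date_added: 1 },
  { url: 'http://b.example', date_added: 2 }
]

puts page(records).map { |r| r[:url] }.inspect
puts page(records, limit: 1, skip: 1).map { |r| r[:url] }.inspect
```

The same defaults (limit: 0, skip: 0) appear on #urls, #crawled_urls and #uncrawled_urls, so the sketch applies to those methods as well.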
#empty ⇒ Integer Also known as: empty!
Deletes everything in the urls and documents collections.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 493
def empty
  empty_urls + empty_docs
end
#empty_docs ⇒ Integer
Deletes everything in the documents collection.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 483
def empty_docs
  result = @client[DOCUMENTS_COLLECTION].delete_many({})
  result.n
ensure
  @last_result = result
end
#empty_urls ⇒ Integer
Deletes everything in the urls collection.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 473
def empty_urls
  result = @client[URLS_COLLECTION].delete_many({})
  result.n
ensure
  @last_result = result
end
#exists?(obj) ⇒ Boolean
Returns whether a record exists with the given obj's url.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 420
def exists?(obj)
  obj.is_a?(String) ? url?(obj) : doc?(obj)
end
#get(obj) ⇒ Wgit::Url, ...
Returns a record from the database with the matching 'url' field; or nil. Pass either a Wgit::Url or Wgit::Document instance.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 430
def get(obj)
  collection, query = get_model_info(obj)

  record = retrieve(collection, query, limit: 1).first
  return nil unless record

  obj.class.new(record)
end
#insert(data) ⇒ Object
Insert one or more Url or Document objects into the DB.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 147
def insert(data)
  collection = nil
  request_obj = nil

  if data.respond_to?(:map)
    request_obj = data.map do |obj|
      collection, _, model = get_model_info(obj)
      model
    end
  else
    collection, _, model = get_model_info(data)
    request_obj = model
  end

  create(collection, request_obj)
end
#num_docs ⇒ Integer
Returns the total number of Document records in the DB.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 382
def num_docs
  @client[DOCUMENTS_COLLECTION].count
end
#num_records ⇒ Integer Also known as: num_objects
Returns the total number of records (urls + docs) in the DB.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 389
def num_records
  num_urls + num_docs
end
#num_urls ⇒ Integer
Returns the total number of URL records in the DB.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 375
def num_urls
  @client[URLS_COLLECTION].count
end
#search(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>
Searches the database's Documents for the given query using the Wgit::Model.search_fields.
The MongoDB search algorithm ranks/sorts the results in order (highest first) based on each document's "textScore" (which records the number of query hits). The "textScore" is then stored in each Document result object for use elsewhere if needed; accessed via Wgit::Document#score.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 287
def search(
  query, case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, &block
)
  query = query.to_s.strip
  query.replace("\"#{query}\"") if whole_sentence

  # Sort based on the most search hits (aka "textScore").
  # We use the sort_proj hash as both a sort and a projection below.
  sort_proj = { score: { :$meta => 'textScore' } }
  query = {
    :$text => {
      :$search => query,
      :$caseSensitive => case_sensitive
    }
  }

  results = retrieve(DOCUMENTS_COLLECTION, query,
                     sort: sort_proj, projection: sort_proj,
                     limit:, skip:)
  map_documents(results, &block)
end
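The query that #search sends to MongoDB can be built and inspected without a database. The sketch below reproduces only the hash construction; build_text_query is a hypothetical name, and the sample query strings are illustrative.

```ruby
# Hypothetical helper reproducing how #search shapes its $text query.
def build_text_query(query, case_sensitive: false, whole_sentence: true)
  text = query.to_s.strip
  # Wrapping the text in double quotes asks MongoDB for a phrase match.
  text = "\"#{text}\"" if whole_sentence

  { :$text => { :$search => text, :$caseSensitive => case_sensitive } }
end

# Used for both sorting and projection, as in the method above.
SORT_PROJ = { score: { :$meta => 'textScore' } }.freeze

phrase_query = build_text_query('  ruby crawler ')
word_query   = build_text_query('ruby crawler', whole_sentence: false)

puts phrase_query.inspect
puts word_query.inspect
puts SORT_PROJ.inspect
```

With whole_sentence: false the words are searched individually, so documents containing either word can match.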
#search!(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0, sentence_limit: 80, top_result_only: false) {|doc| ... } ⇒ Hash<String, String | Array<String>>
Searches the database's Documents for the given query and then searches each result in turn using doc.search. Instead of an Array of Documents, this method returns a Hash of the docs url => search_results creating a search engine like result set for quick access to text matches.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 331
def search!(
  query, case_sensitive: false, whole_sentence: true,
  limit: 10, skip: 0, sentence_limit: 80, top_result_only: false
)
  results = search(query, case_sensitive:, whole_sentence:, limit:, skip:)

  results
    .map do |doc|
      yield(doc) if block_given?

      results = doc.search(
        query, case_sensitive:, whole_sentence:, sentence_limit:
      )

      if results.empty?
        Wgit.logger.warn("MongoDB and Document #search calls have \
differing results")
        next nil
      end

      results = results.first if top_result_only

      [doc.url, results]
    end
    .compact
    .to_h
end
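The url => search_results mapping can be sketched with stand-in objects. FakeDoc and to_result_hash are hypothetical; FakeDoc only imitates the url and search interface used above and returns canned matches rather than performing a real text search.

```ruby
# Hypothetical stand-in for Wgit::Document, returning pre-baked matches.
FakeDoc = Struct.new(:url, :matches) do
  def search(_query, **_opts)
    matches
  end
end

# Mirrors the map/compact/to_h pipeline in #search!.
def to_result_hash(docs, query, top_result_only: false)
  docs.map do |doc|
    results = doc.search(query)
    next nil if results.empty? # dropped later by compact

    results = results.first if top_result_only
    [doc.url, results]
  end.compact.to_h
end

docs = [
  FakeDoc.new('http://a.example', ['foo bar', 'foo baz']),
  FakeDoc.new('http://b.example', [])
]

puts to_result_hash(docs, 'foo').inspect
puts to_result_hash(docs, 'foo', top_result_only: true).inspect
```

Documents whose in-memory search finds nothing are omitted from the Hash, which is why the value type depends on top_result_only: an Array of Strings by default, a single String otherwise.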
#search_fields ⇒ Hash
Gets the documents collection text search fields and their weights.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 133
def search_fields
  indexes = @client[DOCUMENTS_COLLECTION].indexes
  indexes.get(TEXT_INDEX)&.[]('weights')
end
#search_fields=(fields) ⇒ Object
Sets the documents collection search fields via a text index. This method is called from Wgit::Model.set_search_fields and shouldn't be called directly.
Rebuilding the text index is expensive on large collections, so change the search fields rarely and deliberately. This method is idempotent in that it removes the index, if it already exists, before creating the new one.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 118
def search_fields=(fields)
  assert_type(fields, Hash)

  indexes = @client[DOCUMENTS_COLLECTION].indexes

  indexes.drop_one(TEXT_INDEX) if indexes.get(TEXT_INDEX)
  indexes.create_one(
    fields.transform_values { 'text' },
    { name: TEXT_INDEX, weights: fields, background: true }
  )
end
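How a weights Hash becomes a text index definition can be shown with plain Ruby. The helper text_index_spec and the field names and weights are assumptions for illustration; the index name is the TEXT_INDEX constant's value from this class.

```ruby
# Hypothetical helper returning the two arguments given to create_one.
def text_index_spec(fields, index_name = 'text_search')
  keys = fields.transform_values { 'text' } # every field is a text key
  options = { name: index_name, weights: fields, background: true }

  [keys, options]
end

# Assumed field weights: a higher weight ranks matches in that field higher.
keys, options = text_index_spec({ title: 2, text: 1 })

puts keys.inspect
puts options.inspect
```

The same Hash is used twice: its keys name the indexed fields and its values become the index weights.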
#size ⇒ Integer
Returns the current size of the database.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 368
def size
  stats[:dataSize]
end
#stats ⇒ BSON::Document#[]#fetch
Returns statistics about the database.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 361
def stats
  @client.command(dbStats: 0).documents[0]
end
#uncrawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns Url records that haven't yet been crawled.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 266
def uncrawled_urls(limit: 0, skip: 0, &block)
  urls(crawled: false, limit:, skip:, &block)
end
#update(obj) ⇒ Integer
Update a Url or Document object in the DB.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 446
def update(obj)
  collection, query, model = get_model_info(obj)
  data_hash = model.merge(Wgit::Model.common_update_data)

  mutate(collection, query, { '$set' => data_hash })
end
#upsert(obj) ⇒ Boolean
Inserts or updates the object in the database.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 168
def upsert(obj)
  collection, query, model = get_model_info(obj)
  data_hash = model.merge(Wgit::Model.common_update_data)
  result = @client[collection].replace_one(query, data_hash, upsert: true)

  result.matched_count.zero?
ensure
  @last_result = result
end
#url?(url) ⇒ Boolean
Returns whether or not a record with the given 'url' field (which is unique) exists in the database's 'urls' collection.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 398
def url?(url)
  assert_type(url, String) # This includes Wgit::Url's.
  query = { url: }
  retrieve(URLS_COLLECTION, query, limit: 1).any?
end
#urls(crawled: nil, limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>
Returns all Url records from the DB.
All Urls are sorted by date_added ascending, in other words the first url returned is the first one that was inserted into the DB.
# File 'lib/wgit/database/adapters/mongo_db.rb', line 240
def urls(crawled: nil, limit: 0, skip: 0, &block)
  query = crawled.nil? ? {} : { crawled: }
  sort = { date_added: 1 }

  results = retrieve(URLS_COLLECTION, query, sort:, limit:, skip:)
  return [] if results.count < 1 # results#empty? doesn't exist.

  map_urls(results, &block)
end
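The three-state crawled: parameter is what lets #crawled_urls and #uncrawled_urls delegate to #urls. A minimal sketch of the filter it produces, using the hypothetical helper name urls_filter:

```ruby
# Hypothetical helper mirroring the query built at the top of #urls.
def urls_filter(crawled = nil)
  # nil means "do not filter"; true and false each filter on the field.
  crawled.nil? ? {} : { crawled: crawled }
end

puts urls_filter.inspect        # all urls
puts urls_filter(true).inspect  # as used by #crawled_urls
puts urls_filter(false).inspect # as used by #uncrawled_urls
```

Checking nil? rather than truthiness matters here, because false is a meaningful value and must still produce a filter.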