Class: Wgit::Database

Inherits:
Object
Includes:
Assertable
Defined in:
lib/wgit/database/database.rb

Overview

Class modeling a DB connection and CRUD operations for the Url and Document collections.

Constant Summary

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::WRONG_METHOD_MSG

Methods included from Assertable

#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(connection_string = nil) ⇒ Database

Initializes a connected database client using the provided connection_string or ENV['WGIT_CONNECTION_STRING'].

Parameters:

  • connection_string (String) (defaults to: nil)

    The connection string needed to connect to the database.

Raises:

  • (StandardError)

    If a connection string isn't provided, either as a parameter or via the environment.



# File 'lib/wgit/database/database.rb', line 30

def initialize(connection_string = nil)
  connection_string ||= ENV['WGIT_CONNECTION_STRING']
  raise "connection_string and ENV['WGIT_CONNECTION_STRING'] are nil" \
  unless connection_string

  @client = Database.establish_connection(connection_string)
  @connection_string = connection_string
end
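
The fallback order described above can be sketched without a live database. The helper name `resolve_connection_string` and its `env` parameter are hypothetical and exist only for illustration; the real constructor goes on to call Database.establish_connection with the resolved string.

```ruby
# Hypothetical helper mirroring the constructor's fallback logic.
# `env` defaults to ENV but accepts any Hash-like object, so the
# behaviour can be exercised without touching the real environment.
def resolve_connection_string(connection_string = nil, env = ENV)
  connection_string ||= env['WGIT_CONNECTION_STRING']
  raise "connection_string and ENV['WGIT_CONNECTION_STRING'] are nil" \
    unless connection_string

  connection_string
end

# An explicit argument wins over the environment.
explicit = resolve_connection_string('mongodb://localhost/explicit', {})

# With no argument, the environment value is used instead.
from_env = resolve_connection_string(
  nil, { 'WGIT_CONNECTION_STRING' => 'mongodb://localhost/from_env' }
)
```

If neither source supplies a value, the helper raises a RuntimeError (a StandardError subclass), matching the Raises entry above.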

Instance Attribute Details

#client ⇒ Object (readonly)

The database client object. Gets set when a connection is established.



# File 'lib/wgit/database/database.rb', line 21

def client
  @client
end

#connection_string ⇒ Object (readonly)

The connection string for the database.



# File 'lib/wgit/database/database.rb', line 18

def connection_string
  @connection_string
end

Class Method Details

.connect(connection_string = nil) ⇒ Wgit::Database

A class alias for Database.new.

Parameters:

  • connection_string (String) (defaults to: nil)

    The connection string needed to connect to the database.

Returns:

  • (Wgit::Database)

    The connected Database instance.

Raises:

  • (StandardError)

    If a connection string isn't provided, either as a parameter or via the environment.



# File 'lib/wgit/database/database.rb', line 46

def self.connect(connection_string = nil)
  new(connection_string)
end

.establish_connection(connection_string) ⇒ Mongo::Client

Initializes a connected database client using the connection string.

Parameters:

  • connection_string (String)

    The connection string needed to connect to the database.

Returns:

  • (Mongo::Client)

    The connected MongoDB client.

Raises:

  • (StandardError)

    If a connection cannot be established.



# File 'lib/wgit/database/database.rb', line 56

def self.establish_connection(connection_string)
  # Only log for error (and more severe) scenarios.
  Mongo::Logger.logger          = Wgit.logger.clone
  Mongo::Logger.logger.progname = 'mongo'
  Mongo::Logger.logger.level    = Logger::ERROR

  # Connects to the database here.
  Mongo::Client.new(connection_string)
end
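
The logger setup above can be tried in isolation with Ruby's standard Logger. This is a minimal sketch: `app_logger` stands in for Wgit.logger and `driver_logger` for Mongo::Logger.logger (both names are assumptions), and no MongoDB connection is made. Cloning presumably keeps the stricter level and the progname from leaking back into the original logger.

```ruby
require 'logger'

# Stand-in for Wgit.logger (an assumption for this sketch).
app_logger = Logger.new($stdout)
app_logger.level = Logger::INFO

# Clone first, then customise the copy; the original is left untouched.
driver_logger = app_logger.clone
driver_logger.progname = 'mongo'
driver_logger.level    = Logger::ERROR # Only errors and above are logged.
```

After this runs, `app_logger` still logs at INFO while `driver_logger` only emits ERROR and more severe messages.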

Instance Method Details

#crawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>

Returns Url records that have been crawled.

Parameters:

  • limit (Integer) (defaults to: 0)

    The max number of Urls to return. 0 returns all.

  • skip (Integer) (defaults to: 0)

    The number of Urls to skip.

Yields:

  • (url)

    Given each Url object (Wgit::Url) returned from the DB.

Returns:

  • (Array<Wgit::Url>)

    The crawled Urls obtained from the DB.



# File 'lib/wgit/database/database.rb', line 122

def crawled_urls(limit: 0, skip: 0, &block)
  urls(crawled: true, limit: limit, skip: skip, &block)
end

#doc?(doc) ⇒ Boolean Also known as: document?

Returns whether or not a record with the given doc 'url.url' field (which is unique) exists in the database's 'documents' collection.

Parameters:

  • doc (Wgit::Document)

    The Document to search the DB for.

Returns:

  • (Boolean)

    True if doc exists, otherwise false.



# File 'lib/wgit/database/database.rb', line 233

def doc?(doc)
  assert_type(doc, Wgit::Document)
  hash = { 'url.url' => doc.url }
  @client[:documents].find(hash).any?
end

#insert(data) ⇒ Object

Insert one or more Url or Document objects into the DB.

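Parameters:

  • data (Wgit::Url, Wgit::Document, Enumerable<Wgit::Url, Wgit::Document>)

    A single Url or Document, or an Enumerable of them.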

Raises:

  • (StandardError)

    If data isn't valid.



# File 'lib/wgit/database/database.rb', line 74

def insert(data)
  data = data.dup # Avoid modifying by reference.
  type = data.is_a?(Enumerable) ? data.first : data

  case type
  when Wgit::Url
    insert_urls(data)
  when Wgit::Document
    insert_docs(data)
  else
    raise "Unsupported type - #{data.class}: #{data}"
  end
end
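
The type dispatch in #insert can be sketched on its own. `FakeUrl`, `FakeDoc` and `target_collection` are hypothetical stand-ins so that the example runs without the wgit gem or MongoDB; they are not the real Wgit classes.

```ruby
# Stand-in classes for illustration only. FakeUrl subclasses String
# because #url? above notes that a Wgit::Url is also a String.
class FakeUrl < String; end

class FakeDoc
  attr_reader :url

  def initialize(url)
    @url = url
  end
end

# Mirrors #insert: inspect the first element of a collection (or the
# object itself) to decide which collection the data belongs to.
def target_collection(data)
  type = data.is_a?(Enumerable) ? data.first : data

  case type
  when FakeUrl then :urls
  when FakeDoc then :documents
  else raise "Unsupported type - #{data.class}: #{data}"
  end
end

single = target_collection(FakeUrl.new('http://example.com'))
batch  = target_collection([FakeDoc.new('http://example.com')])
```

Only the first element decides the branch, so a collection is expected to hold a single type; in the real class the protected insert methods then assert the type of every element.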

#insert_docs(data) ⇒ Integer (protected) Also known as: insert_doc

Insert one or more Document objects into the DB.

Parameters:

  • data (Wgit::Document, Enumerable<Wgit::Document>)

    One or more Documents to insert.

Returns:

  • (Integer)

    The number of inserted Documents.

Raises:

  • (StandardError)

    If data type isn't supported.



# File 'lib/wgit/database/database.rb', line 283

def insert_docs(data)
  if data.respond_to?(:map)
    assert_arr_types(data, Wgit::Document)
    data.map! { |doc| Wgit::Model.document(doc) }
  else
    assert_types(data, Wgit::Document)
    data = Wgit::Model.document(data)
  end

  create(:documents, data)
end

#insert_urls(data) ⇒ Integer (protected) Also known as: insert_url

Insert one or more Url objects into the DB.

Parameters:

  • data (Wgit::Url, Enumerable<Wgit::Url>)

    One or more Urls to insert.

Returns:

  • (Integer)

    The number of inserted Urls.

Raises:

  • (StandardError)

    If data type isn't supported.



# File 'lib/wgit/database/database.rb', line 265

def insert_urls(data)
  if data.respond_to?(:map)
    assert_arr_type(data, Wgit::Url)
    data.map! { |url| Wgit::Model.url(url) }
  else
    assert_type(data, Wgit::Url)
    data = Wgit::Model.url(data)
  end

  create(:urls, data)
end

#num_docs ⇒ Integer Also known as: num_documents

Returns the total number of Document records in the DB.

Returns:

  • (Integer)

    The current number of Document records.



# File 'lib/wgit/database/database.rb', line 206

def num_docs
  @client[:documents].count
end

#num_records ⇒ Integer Also known as: num_objects

Returns the total number of records (urls + docs) in the DB.

Returns:

  • (Integer)

    The current number of URL and Document records.



# File 'lib/wgit/database/database.rb', line 213

def num_records
  num_urls + num_docs
end

#num_urls ⇒ Integer

Returns the total number of URL records in the DB.

Returns:

  • (Integer)

    The current number of URL records.



# File 'lib/wgit/database/database.rb', line 199

def num_urls
  @client[:urls].count
end

#search(query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0) {|doc| ... } ⇒ Array<Wgit::Document>

Searches the database's Documents for the given query.

The searched fields are determined by the text index set up on the documents collection. By default, the following fields are searched: "author", "keywords", "title" and "text".

MongoDB ranks the results in descending order of each document's "textScore", a relevance score reflecting how well the document matches the query. The "textScore" is stored in each Document result and can be read via Wgit::Document#score.

Parameters:

  • query (String)

    The text query to search with.

  • case_sensitive (Boolean) (defaults to: false)

    Whether character case must match.

  • whole_sentence (Boolean) (defaults to: true)

    Whether the query should be matched as a single phrase (true) or as separate words (false).

  • limit (Integer) (defaults to: 10)

    The max number of results to return.

  • skip (Integer) (defaults to: 0)

    The number of results to skip.

Yields:

  • (doc)

    Given each search result (Wgit::Document) returned from the DB.

Returns:

  • (Array<Wgit::Document>)

    The search results obtained from the DB.



# File 'lib/wgit/database/database.rb', line 156

def search(
  query, case_sensitive: false, whole_sentence: true, limit: 10, skip: 0
)
  query = query.to_s.strip
  query.replace('"' + query + '"') if whole_sentence

  # Sort based on the most search hits (aka "textScore").
  # We use the sort_proj hash as both a sort and a projection below.
  sort_proj = { score: { :$meta => 'textScore' } }
  query = { :$text => {
    :$search => query,
    :$caseSensitive => case_sensitive
  } }

  results = retrieve(:documents, query,
                     sort: sort_proj, projection: sort_proj,
                     limit: limit, skip: skip)
  return [] if results.count < 1 # respond_to? :empty? == false

  # results.respond_to? :map! is false so we use map and overwrite the var.
  results = results.map { |mongo_doc| Wgit::Document.new(mongo_doc) }
  results.each { |doc| yield(doc) } if block_given?

  results
end
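
The query construction in #search can be inspected without a database. In this sketch, `build_text_search` is a hypothetical helper name; it returns the same filter and sort/projection hashes that the method passes on, so the effect of whole_sentence is visible directly.

```ruby
# Hypothetical helper that builds the hashes the same way #search does.
def build_text_search(query, case_sensitive: false, whole_sentence: true)
  query = query.to_s.strip
  # Wrapping the query in double quotes makes MongoDB match it as a phrase.
  query = "\"#{query}\"" if whole_sentence

  filter = { :$text => {
    :$search => query,
    :$caseSensitive => case_sensitive
  } }
  # Used as both the sort order and the projection (exposes the score).
  sort_proj = { score: { :$meta => 'textScore' } }

  [filter, sort_proj]
end

phrase_filter, sort_proj = build_text_search('  ruby crawler ')
words_filter, = build_text_search('ruby crawler', whole_sentence: false)
```

With whole_sentence left at true the search string is quoted and treated as one phrase; with false the words are passed unquoted, and MongoDB matches documents containing any of them.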

#size ⇒ Integer Also known as: count, length

Returns the current size of the database.

Returns:

  • (Integer)

    The current size of the DB in bytes (the dataSize value reported by the dbStats command).



# File 'lib/wgit/database/database.rb', line 192

def size
  stats[:dataSize]
end

#stats ⇒ BSON::Document#[]#fetch

Returns statistics about the database.

Returns:

  • (BSON::Document#[]#fetch)

    Similar to a Hash instance.



# File 'lib/wgit/database/database.rb', line 185

def stats
  @client.command(dbStats: 0).documents[0]
end

#uncrawled_urls(limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>

Returns Url records that haven't yet been crawled.

Parameters:

  • limit (Integer) (defaults to: 0)

    The max number of Urls to return. 0 returns all.

  • skip (Integer) (defaults to: 0)

    The number of Urls to skip.

Yields:

  • (url)

    Given each Url object (Wgit::Url) returned from the DB.

Returns:

  • (Array<Wgit::Url>)

    The uncrawled Urls obtained from the DB.



# File 'lib/wgit/database/database.rb', line 132

def uncrawled_urls(limit: 0, skip: 0, &block)
  urls(crawled: false, limit: limit, skip: skip, &block)
end

#update(data) ⇒ Object

Update a Url or Document object in the DB.

Parameters:

  • data (Wgit::Url, Wgit::Document)

    The Url or Document to update.

Raises:

  • (StandardError)

    If the data is not valid.



# File 'lib/wgit/database/database.rb', line 245

def update(data)
  data = data.dup # Avoid modifying by reference.

  case data
  when Wgit::Url
    update_url(data)
  when Wgit::Document
    update_doc(data)
  else
    raise "Unsupported type - #{data.class}: #{data}"
  end
end

#update_doc(doc) ⇒ Integer (protected)

Update a Document record in the DB.

Parameters:

  • doc (Wgit::Document)

    The Document to update.

Returns:

  • (Integer)

    The number of updated records.



# File 'lib/wgit/database/database.rb', line 311

def update_doc(doc)
  assert_type(doc, Wgit::Document)
  selection = { 'url.url' => doc.url }
  doc_hash = Wgit::Model.document(doc).merge(Wgit::Model.common_update_data)
  update = { '$set' => doc_hash }
  mutate(true, :documents, selection, update)
end

#update_url(url) ⇒ Integer (protected)

Update a Url record in the DB.

Parameters:

  • url (Wgit::Url)

    The Url to update.

Returns:

  • (Integer)

    The number of updated records.



# File 'lib/wgit/database/database.rb', line 299

def update_url(url)
  assert_type(url, Wgit::Url)
  selection = { url: url }
  url_hash = Wgit::Model.url(url).merge(Wgit::Model.common_update_data)
  update = { '$set' => url_hash }
  mutate(true, :urls, selection, update)
end

#url?(url) ⇒ Boolean

Returns whether or not a record with the given 'url' field (which is unique) exists in the database's 'urls' collection.

Parameters:

  • url (Wgit::Url)

    The Url to search the DB for.

Returns:

  • (Boolean)

    True if url exists, otherwise false.



# File 'lib/wgit/database/database.rb', line 222

def url?(url)
  assert_type(url, String) # This includes Wgit::Url's.
  hash = { 'url' => url }
  @client[:urls].find(hash).any?
end

#urls(crawled: nil, limit: 0, skip: 0) {|url| ... } ⇒ Array<Wgit::Url>

Returns Url records from the DB.

All Urls are sorted by date_added ascending; in other words, the first Url returned is the first one that was inserted into the DB.

Parameters:

  • crawled (Boolean) (defaults to: nil)

    Filter by Url#crawled value. nil returns all.

  • limit (Integer) (defaults to: 0)

    The max number of Urls to return. 0 returns all.

  • skip (Integer) (defaults to: 0)

    The number of Urls to skip.

Yields:

  • (url)

    Given each Url object (Wgit::Url) returned from the DB.

Returns:

  • (Array<Wgit::Url>)

    The Urls obtained from the DB.



# File 'lib/wgit/database/database.rb', line 100

def urls(crawled: nil, limit: 0, skip: 0)
  query = crawled.nil? ? {} : { crawled: crawled }
  sort = { date_added: 1 }

  results = retrieve(:urls, query,
                     sort: sort, projection: {},
                     limit: limit, skip: skip)
  return [] if results.count < 1 # results#empty? doesn't exist.

  # results.respond_to? :map! is false so we use map and overwrite the var.
  results = results.map { |url_doc| Wgit::Url.new(url_doc) }
  results.each { |url| yield(url) } if block_given?

  results
end
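
The filter, sort, skip and limit semantics of #urls can be illustrated with an in-memory stand-in. `select_urls` and the sample records are hypothetical and only approximate what the database does: filter on crawled when given, sort by date_added ascending, then apply skip and limit, with a limit of 0 meaning no limit.

```ruby
# Each Hash plays the role of a stored Url record; values are made up.
records = [
  { url: 'http://c.example', crawled: false, date_added: 3 },
  { url: 'http://a.example', crawled: true,  date_added: 1 },
  { url: 'http://b.example', crawled: false, date_added: 2 }
]

# Hypothetical helper reproducing the documented behaviour of #urls.
def select_urls(records, crawled: nil, limit: 0, skip: 0)
  # A nil filter keeps every record, like the empty query hash in #urls.
  results = crawled.nil? ? records : records.select { |r| r[:crawled] == crawled }
  results = results.sort_by { |r| r[:date_added] }.drop(skip)
  results = results.first(limit) if limit.positive? # 0 returns all.
  results.each { |r| yield(r) } if block_given?

  results
end

all_urls  = select_urls(records).map { |r| r[:url] }
uncrawled = select_urls(records, crawled: false, limit: 1).map { |r| r[:url] }
```

Here `all_urls` comes back in insertion order (a, b, c) and `uncrawled` holds only the oldest uncrawled record, which is the behaviour #crawled_urls and #uncrawled_urls inherit by delegating to #urls.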