Class: Wgit::Document

Inherits:
Object
Includes:
Assertable
Defined in:
lib/wgit/document.rb

Overview

Class primarily modeling an HTML web document, although other MIME types will work e.g. images etc. Also doubles as a search result when loading Documents from the database via Wgit::Database#search.

The initialize method dynamically initializes instance variables from the Document HTML / Database object e.g. text. This bit is dynamic so that the Document class can be easily extended allowing you to pull out the bits of a webpage that are important to you. See Wgit::Document.define_extension.

Constant Summary

REGEX_EXTENSION_NAME = /[a-z0-9_]+/.freeze

    Regex for the allowed var names when defining an extension.

TEXT_ELEMENTS_XPATH = '//*/text()'.freeze

    The xpath used to extract the visible text on a page.

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::WRONG_METHOD_MSG

Methods included from Assertable

#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(url_or_obj, html = '', encode: true) ⇒ Document

Initialize takes either two strings (representing the URL and HTML) or an object representing a database record (of an HTTP crawled web page). This allows for initialisation from both crawled web pages and documents/web pages retrieved from the database.

During initialisation, the Document will call any private init_*_from_html and init_*_from_object methods it can find. See the README.md and Wgit::Document.define_extension method for more details.

Parameters:

  • url_or_obj (String, Wgit::Url, #fetch)

    Either a String representing a URL or a Hash-like object responding to :fetch. e.g. a MongoDB collection object. The Object's :fetch method should support Strings as keys.

  • html (String, NilClass) (defaults to: '')

    The crawled web page's content/HTML. This param is only used if url_or_obj is a String representing the web page's URL. Otherwise, the HTML comes from the database object. A nil html defaults to an empty String.

  • encode (Boolean) (defaults to: true)

    Whether or not to UTF-8 encode the html. Set to false if the Document content is an image etc.



# File 'lib/wgit/document.rb', line 65

def initialize(url_or_obj, html = '', encode: true)
  if url_or_obj.is_a?(String)
    init_from_strings(url_or_obj, html, encode: encode)
  else
    init_from_object(url_or_obj, encode: encode)
  end
end
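
The dispatch described above can be sketched without the gem. This is a minimal stand-in, not Wgit's implementation: the class SketchDocument and its init_title_* methods are hypothetical, and the title lookup uses a plain Regexp where the real class would query Nokogiri.

```ruby
# Hypothetical, simplified stand-in for Wgit::Document (stdlib only).
class SketchDocument
  attr_reader :url, :html, :title

  def initialize(url_or_obj, html = '')
    if url_or_obj.is_a?(String)
      @url  = url_or_obj
      @html = html || '' # A nil html defaults to an empty String.
      call_init_methods(/\Ainit_.+_from_html\z/)
    else
      # Anything Hash-like that responds to #fetch with String keys.
      @url  = url_or_obj.fetch('url')
      @html = url_or_obj.fetch('html', '')
      call_init_methods(/\Ainit_.+_from_object\z/, url_or_obj)
    end
  end

  private

  # Calls every private method of this class whose name matches the pattern.
  def call_init_methods(pattern, *args)
    self.class.private_instance_methods(false).grep(pattern).each do |name|
      send(name, *args)
    end
  end

  def init_title_from_html
    @title = @html[%r{<title>(.*?)</title>}m, 1]
  end

  def init_title_from_object(obj)
    @title = obj.fetch('title', nil)
  end
end

crawled = SketchDocument.new('http://example.com', '<title>Hello</title>')
stored  = SketchDocument.new({ 'url' => 'http://example.com', 'title' => 'Hi' })
```

Both objects end up with a title, one derived from the HTML string and one read from the Hash record, which is the behaviour the two constructor forms are meant to provide.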

Class Attribute Details

.extensions ⇒ Object (readonly)

Class level attr_reader for the Document defined extensions.



# File 'lib/wgit/document.rb', line 31

def extensions
  @extensions
end

Instance Attribute Details

#doc ⇒ Object (readonly)

The Nokogiri::HTML document object initialized from @html.



# File 'lib/wgit/document.rb', line 41

def doc
  @doc
end

#html ⇒ Object (readonly) Also known as: content

The content/HTML of the document, an instance of String.



# File 'lib/wgit/document.rb', line 38

def html
  @html
end

#score ⇒ Object (readonly)

The score is only used following a Database#search and records matches.



# File 'lib/wgit/document.rb', line 44

def score
  @score
end

#url ⇒ Object (readonly)

The URL of the webpage, an instance of Wgit::Url.



# File 'lib/wgit/document.rb', line 35

def url
  @url
end

Class Method Details

.define_extension(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol

Defines an extension, which is a way to serialise HTML elements into instance variables upon Document initialization. See the default extensions defined in 'document_extensions.rb' as examples.

Note that defined extensions work for both Documents initialized from HTML (via Wgit::Crawler methods) and from database objects. An extension once defined, initializes a private instance variable with the xpath or database object result(s).

When initialising from HTML, a singleton value of true will only ever return one result; otherwise all xpath results are returned in an Array. When initialising from a database object, the value is taken as is and singleton is only used to define the default empty value. If a value cannot be found (in either the HTML or database object), then a default will be used. The default value is: singleton ? nil : [].

Parameters:

  • var (Symbol)

    The name of the variable to be initialised.

  • xpath (String, #call)

    The xpath used to find the element(s) of the webpage. Only used when initializing from HTML.

    Pass a callable object (proc etc.) if you want the xpath value to be derived on Document initialisation (instead of when the extension is defined). The call method must return a valid xpath String.

  • opts (Hash) (defaults to: {})

    The options to define an extension with. The options are only used when initializing from HTML, not the database.

Options Hash (opts):

  • :singleton (Boolean)

    The singleton option determines whether or not the result(s) should be in an Array. If multiple results are found and singleton is true then the first result will be used. Defaults to true.

  • :text_content_only (Boolean)

    The text_content_only option if true will use the text content of the Nokogiri result object, otherwise the Nokogiri object itself is returned. Defaults to true.

Yield Parameters:

  • value (Object)

    The value to be assigned to the new var.

  • source (Wgit::Document, Object)

    The source of the value.

  • type (Symbol)

    The source type, either :document or (DB) :object.

Yield Returns:

  • (Object)

    The return value of the block becomes the new var value, unless nil. Return nil if you want to inspect but not change the var value. The block is executed when a Wgit::Document is initialized, regardless of the source.

Returns:

  • (Symbol)

    The given var Symbol if successful.

Raises:

  • (StandardError)

    If the var param isn't valid.



# File 'lib/wgit/document.rb', line 118

def self.define_extension(var, xpath, opts = {}, &block)
  var = var.to_sym
  defaults = { singleton: true, text_content_only: true }
  opts = defaults.merge(opts)

  raise "var must match #{REGEX_EXTENSION_NAME}" unless \
  var =~ REGEX_EXTENSION_NAME

  # Define the private init_*_from_html method for HTML.
  # Gets the HTML's xpath value and creates a var for it.
  func_name = Document.send(:define_method, "init_#{var}_from_html") do
    result = find_in_html(xpath, opts, &block)
    init_var(var, result)
  end
  Document.send(:private, func_name)

  # Define the private init_*_from_object method for a Database object.
  # Gets the Object's 'key' value and creates a var for it.
  func_name = Document.send(:define_method, "init_#{var}_from_object") do |obj|
    result = find_in_object(obj, var.to_s, singleton: opts[:singleton], &block)
    init_var(var, result)
  end
  Document.send(:private, func_name)

  @extensions << var
  var
end
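
The following is a minimal sketch of the same technique (defining a private init method per extension with define_method), assuming a hypothetical ExtensibleDoc class. The finder argument is a callable standing in for the xpath lookup, and the name check is anchored here so the whole name must match; neither detail is taken from Wgit itself.

```ruby
# Hypothetical, simplified analogue of Document.define_extension.
class ExtensibleDoc
  @extensions = []

  class << self
    attr_reader :extensions

    # finder: a callable taking the HTML String and returning an Array.
    def define_extension(var, finder, singleton: true, &block)
      var = var.to_sym
      raise "invalid var name: #{var}" unless var.match?(/\A[a-z0-9_]+\z/)

      name = define_method("init_#{var}_from_html") do
        results = finder.call(@html)
        # Default is nil for a singleton, [] otherwise.
        value = singleton ? results.first : results
        if block
          new_value = block.call(value, self, :document)
          value = new_value unless new_value.nil? # nil keeps the value as is.
        end
        instance_variable_set("@#{var}", value)
      end
      private name

      @extensions << var
      var
    end
  end

  def initialize(html)
    @html = html
    self.class.extensions.each { |var| send("init_#{var}_from_html") }
  end
end

h1_finder = ->(html) { html.scan(%r{<h1>(.*?)</h1>}).flatten }
ExtensibleDoc.define_extension(:headings, h1_finder, singleton: false)
ExtensibleDoc.define_extension(:first_heading, h1_finder) { |value| value&.upcase }

doc = ExtensibleDoc.new('<h1>One</h1><h1>Two</h1>')
headings      = doc.instance_variable_get(:@headings)
first_heading = doc.instance_variable_get(:@first_heading)
```

The block passed to the second extension shows the override rule: its non-nil return value replaces the extracted value, while returning nil would leave it untouched.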

.remove_extension(var) ⇒ Boolean

Removes the init_* methods created when an extension is defined. Therefore, this is the opposing method to Document.define_extension. Returns true if successful or false if the method(s) cannot be found.

Parameters:

  • var (Symbol)

    The extension variable already defined.

Returns:

  • (Boolean)

    True if the extension var was found and removed; otherwise false.



# File 'lib/wgit/document.rb', line 153

def self.remove_extension(var)
  Document.send(:remove_method, "init_#{var}_from_html")
  Document.send(:remove_method, "init_#{var}_from_object")

  @extensions.delete(var.to_sym)
  true
rescue NameError
  false
end
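
As an illustration of the true/false contract, here is a small sketch on a hypothetical RemovableDoc class: remove_method raises a NameError when the method is not defined on the class, which is what turns a second removal into false.

```ruby
# Hypothetical class with one pre-defined extension method.
class RemovableDoc
  @extensions = [:title]

  class << self
    attr_reader :extensions

    def remove_extension(var)
      remove_method("init_#{var}_from_html")
      @extensions.delete(var.to_sym)
      true
    rescue NameError
      false # The method was never defined or has already been removed.
    end
  end

  private

  def init_title_from_html; end
end

first_call  = RemovableDoc.remove_extension(:title)
second_call = RemovableDoc.remove_extension(:title)
```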

Instance Method Details

#==(other) ⇒ Boolean

Determines if both the url and html match. Use doc.object_id == other.object_id for exact object comparison.

Parameters:

  • other (Wgit::Document)

    The Document to compare with.

Returns:

  • (Boolean)

    True if @url and @html are equal, false if not.



# File 'lib/wgit/document.rb', line 170

def ==(other)
  return false unless other.is_a?(Wgit::Document)

  (@url == other.url) && (@html == other.html)
end

#[](range) ⇒ String

Is a shortcut for calling Document#html[range].

Parameters:

  • range (Range)

    The range of @html to return.

Returns:

  • (String)

    The given range of @html.



# File 'lib/wgit/document.rb', line 180

def [](range)
  @html[range]
end

#base_url(link: nil) ⇒ Wgit::Url

Returns the base URL of this Wgit::Document. The base URL is either the <base> element's href value or @url (if @base is nil). If @base is present and relative, then @url.to_base + @base is returned. Use this method instead of doc.url.to_base etc. when manually building absolute links from relative links; or use link.prefix_base(doc).

Provide the link: parameter to get the correct base URL for that type of link. For example, a link of #top would always return @url because it applies to that page, not a different one. Query strings work in the same way. Use this parameter if manually concatenating Urls e.g.

    relative_link = Wgit::Url.new('?q=hello')
    absolute_link = doc.base_url(link: relative_link).concat(relative_link)

This is similar to how Wgit::Document#internal_absolute_links works.

Parameters:

  • link (Wgit::Url, String) (defaults to: nil)

    The link to obtain the correct base URL for; must be relative, not absolute.

Returns:

  • (Wgit::Url)

    The base URL of this Document.

Raises:

  • (StandardError)

    If link isn't relative or if a base URL can't be established e.g. the doc @url is relative and <base> is nil.



# File 'lib/wgit/document.rb', line 206

def base_url(link: nil)
  raise "Document @url ('#{@url}') cannot be relative if <base> is nil" \
  if @url.relative? && @base.nil?
  raise "Document @url ('#{@url}') and <base> ('#{@base}') both can't be relative" \
  if @url.relative? && @base&.relative?

  get_base = -> { @base.relative? ? @url.to_base.concat(@base) : @base }

  if link
    link = Wgit::Url.new(link)
    raise "link must be relative: #{link}" unless link.relative?

    if link.is_fragment? || link.is_query?
      base_url = @base ? get_base.call : @url
      return base_url.omit_fragment.omit_query
    end
  end

  base_url = @base ? get_base.call : @url.to_base
  base_url.omit_fragment.omit_query
end
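
The resolution rules can be approximated with Ruby's stdlib URI, as in the sketch below. The helper sketch_base_url and its base_href: parameter are hypothetical, it works on plain Strings rather than Wgit::Url, and unlike the method above it simply requires the page URL to be absolute.

```ruby
require 'uri'

# Hypothetical helper approximating the rules above with stdlib URI.
def sketch_base_url(page_url, base_href: nil, link: nil)
  page = URI.parse(page_url)
  raise 'page_url must be absolute' unless page.absolute?
  raise "link must be relative: #{link}" if link && URI.parse(link).absolute?

  # The origin is the page URL without its path, query or fragment.
  origin = page.dup
  origin.path = ''
  origin.query = nil
  origin.fragment = nil

  # A relative <base> href is resolved against the origin.
  base = base_href && URI.join(origin.to_s, base_href)

  # Fragment and query links apply to the current page, not the site root.
  same_page = link&.start_with?('#', '?')

  result = (base || (same_page ? page : origin)).dup
  result.query = nil
  result.fragment = nil
  result.to_s
end

for_fragment = sketch_base_url('http://example.com/about?x=1#top', link: '#top')
for_default  = sketch_base_url('http://example.com/about?x=1#top')
with_base    = sketch_base_url('http://example.com/about', base_href: '/public', link: '?q=hello')
```

With no <base> present, a fragment link resolves against the page itself while any other link resolves against the origin; a <base> href, when given, takes precedence in both cases.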

#css(selector) ⇒ Nokogiri::XML::NodeSet

Uses Nokogiri's css method to search the doc's html and return the results.

Parameters:

  • selector (String)

    The CSS selector to search the @html with.

Returns:

  • (Nokogiri::XML::NodeSet)

    The result set of the CSS search.



# File 'lib/wgit/document.rb', line 308

def css(selector)
  @doc.css(selector)
end

#empty? ⇒ Boolean

Determine if this Document's HTML is empty or not.

Returns:

  • (Boolean)

    True if @html is nil/empty, false otherwise.



# File 'lib/wgit/document.rb', line 288

def empty?
  return true if @html.nil?

  @html.empty?
end

#external_links ⇒ Array<Wgit::Url>

Returns all external links from this Document in absolute form. External meaning a link to a different host.

Returns:

  • (Array<Wgit::Url>)

    Self's external Url's in absolute form.



# File 'lib/wgit/document.rb', line 349

def external_links
  return [] if @links.empty?

  links = @links
          .reject { |link| link.relative?(host: @url.to_base) }
          .map(&:omit_trailing_slash)

  Wgit::Utils.process_arr(links)
end

#find_in_html(xpath, singleton: true, text_content_only: true) {|value, source, type| ... } ⇒ String, Object (protected)

Returns a value/object from this Document's @html using the given xpath parameter.

Parameters:

  • xpath (String)

    Used to find the value/object in @html.

  • singleton (Boolean) (defaults to: true)

    singleton ? results.first (single Nokogiri Object) : results (Array).

  • text_content_only (Boolean) (defaults to: true)

    text_content_only ? result.content (String) : result (Nokogiri Object).

Yields:

  • (value, source, type)

    Given the value (String/Object) before it's set as an instance variable so that you can inspect/alter the value if desired. Return nil from the block if you don't want to override the value. Also given the source (this Document) and the source type (Symbol), which is always :document.

Returns:

  • (String, Object)

    The value found in the html or the default value (singleton ? nil : []).



# File 'lib/wgit/document.rb', line 466

def find_in_html(xpath, singleton: true, text_content_only: true)
  default = singleton ? nil : []
  xpath   = xpath.call if xpath.respond_to?(:call)
  results = @doc.xpath(xpath)

  return default if results.nil? || results.empty?

  result = if singleton
             text_content_only ? results.first.content : results.first
           else
             text_content_only ? results.map(&:content) : results
           end

  singleton ? Wgit::Utils.process_str(result) : Wgit::Utils.process_arr(result)

  if block_given?
    new_result = yield(result, self, :document)
    result = new_result unless new_result.nil?
  end

  result
end

#find_in_object(obj, key, singleton: true) {|value, source, type| ... } ⇒ String, Object (protected)

Returns a value from the obj using the given key via obj#fetch.

Parameters:

  • obj (#fetch)

    The object containing the key/value.

  • key (String)

    Used to find the value in the obj.

  • singleton (Boolean) (defaults to: true)

    True if a single value, false otherwise.

Yields:

  • (value, source, type)

    Given the value (String/Object) before it's set as an instance variable so that you can inspect/alter the value if desired. Return nil from the block if you don't want to override the value. Also given the source (the obj) and the source type (Symbol), which is always :object.

Returns:

  • (String, Object)

    The value found in the obj or the default value (singleton ? nil : []).



# File 'lib/wgit/document.rb', line 500

def find_in_object(obj, key, singleton: true)
  assert_respond_to(obj, :fetch)

  default = singleton ? nil : []
  result  = obj.fetch(key.to_s, default)

  singleton ? Wgit::Utils.process_str(result) : Wgit::Utils.process_arr(result)

  if block_given?
    new_result = yield(result, obj, :object)
    result = new_result unless new_result.nil?
  end

  result
end
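
Because the method is protected, it can't be called directly from outside a Document. The standalone sketch below mirrors its fetch-with-default and block-override behaviour on a plain Hash; the helper name sketch_find_in_object is hypothetical and the Wgit::Utils processing step is left out.

```ruby
# Hypothetical standalone version of the lookup, without Wgit::Utils.
def sketch_find_in_object(obj, key, singleton: true)
  raise ArgumentError, 'obj must respond to :fetch' unless obj.respond_to?(:fetch)

  default = singleton ? nil : []
  result  = obj.fetch(key.to_s, default)

  if block_given?
    new_result = yield(result, obj, :object)
    result = new_result unless new_result.nil? # nil means "keep the value".
  end

  result
end

record = { 'title' => 'Home', 'links' => ['/about'] }

title      = sketch_find_in_object(record, :title)
keywords   = sketch_find_in_object(record, :keywords, singleton: false)
downcased  = sketch_find_in_object(record, :title) { |value| value.downcase }
seen_type  = nil
unmodified = sketch_find_in_object(record, :title) do |_value, _source, type|
  seen_type = type
  nil
end
```

The last call inspects the value without changing it, since the block returns nil.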

#init_nokogiri ⇒ Nokogiri::HTML (protected)

Initializes the nokogiri object using @html, which cannot be nil. Override this method to custom configure the Nokogiri object returned. Gets called from Wgit::Document.new upon initialization.

Returns:

  • (Nokogiri::HTML)

    The initialised Nokogiri HTML object.

Raises:

  • (StandardError)

    If @html isn't set.



# File 'lib/wgit/document.rb', line 442

def init_nokogiri
  raise '@html must be set' unless @html

  Nokogiri::HTML(@html) do |config|
    # TODO: Remove #'s below when crawling in production.
    # config.options = Nokogiri::XML::ParseOptions::STRICT |
    #                 Nokogiri::XML::ParseOptions::NONET
  end
end

#internal_absolute_links ⇒ Array<Wgit::Url>

Returns all internal links from this Document in absolute form by appending them to self's #base_url. Also see Wgit::Document#internal_links.

Returns:

  • (Array<Wgit::Url>)

    Self's internal Url's in absolute form.



# File 'lib/wgit/document.rb', line 341

def internal_absolute_links
  internal_links.map { |link| link.prefix_base(self) }
end

#internal_links ⇒ Array<Wgit::Url>

Returns all internal links from this Document in relative form. Internal meaning a link to another document on the same host.

This Document's host is used to determine if an absolute URL is actually a relative link e.g. for a Document representing http://www.server.com/about, an absolute link to another page on http://www.server.com will be recognized and returned as an internal link because both Documents live on the same host. Also see Wgit::Document#internal_absolute_links.

Returns:

  • (Array<Wgit::Url>)

    Self's internal Url's in relative form.



# File 'lib/wgit/document.rb', line 323

def internal_links
  return [] if @links.empty?

  links = @links
          .select { |link| link.relative?(host: @url.to_base) }
          .map(&:omit_base)
          .map do |link| # Map @url.to_host into / as it's a duplicate.
    link.to_host == @url.to_host ? Wgit::Url.new('/') : link
  end

  Wgit::Utils.process_arr(links)
end
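
The internal/external split can be sketched with stdlib URI as follows. The helper split_links is hypothetical, the sample URLs are made up for illustration, and the normalisation (stripping slashes, de-duplicating) only approximates what Wgit::Url and Wgit::Utils do.

```ruby
require 'uri'

# Hypothetical helper: splits links into internal (relative form) and
# external (absolute form) relative to the page's host.
def split_links(page_url, links)
  page_host = URI.parse(page_url).host

  internal, external = links.partition do |link|
    uri = URI.parse(link)
    uri.relative? || uri.host == page_host
  end

  internal = internal.map do |link|
    uri = URI.parse(link)
    next link if uri.relative? # Already in relative form.

    # Same-host absolute link: keep only the path; the bare host maps to '/'.
    path = uri.path.gsub(%r{\A/|/\z}, '')
    path.empty? ? '/' : path
  end

  {
    internal: internal.uniq,
    external: external.map { |link| link.chomp('/') }.uniq
  }
end

links = [
  '/about',
  'http://www.server.com/sales',
  'http://www.server.com',
  'https://other.org/',
  'contact.html'
]
result = split_links('http://www.server.com/about', links)
```

Here the two absolute links on www.server.com are classified as internal and reduced to relative form, while the link to the other host stays absolute and is reported as external.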

#search(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ Array<String>

Searches the @text for the given query and returns the results.

The number of search hits for each sentence is recorded internally and used to rank/sort the search results before being returned. Where the Wgit::Database#search method searches all documents for the most hits, this method searches each document's @text for the most hits.

Each search result comprises a sentence of a given length. The length will be based on the sentence_limit parameter or the full length of the original sentence, whichever is less. The algorithm ensures that the search query is visible somewhere in the sentence.

Parameters:

  • query (String, #to_s)

    The value to search the document's @text for.

  • case_sensitive (Boolean) (defaults to: false)

    Whether character case must match.

  • whole_sentence (Boolean) (defaults to: true)

    Whether the query is searched for as a whole phrase. Set to false to search for each of its words separately.

  • sentence_limit (Integer) (defaults to: 80)

    The max length of each search result sentence.

Returns:

  • (Array<String>)

    A subset of @text, matching the query.



# File 'lib/wgit/document.rb', line 379

def search(
  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
)
  query = query.to_s
  raise 'A search query must be provided' if query.empty?
  raise 'The sentence_limit value must be even' if sentence_limit.odd?

  query   = query.gsub(' ', '|') unless whole_sentence
  regex   = Regexp.new(query, !case_sensitive)
  results = {}

  @text.each do |sentence|
    sentence = sentence.strip
    next if results[sentence]

    hits = sentence.scan(regex).count
    next unless hits.positive?

    index = sentence.index(regex) # Index of first match.
    Wgit::Utils.format_sentence_length(sentence, index, sentence_limit)

    results[sentence] = hits
  end

  return [] if results.empty?

  results = Hash[results.sort_by { |_k, v| v }]
  results.keys.reverse
end
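
The ranking step can be sketched on a plain Array of sentences. The helper rank_sentences is hypothetical; it escapes the query before building the Regexp (an addition not present in the method above) and omits the sentence_limit truncation.

```ruby
# Hypothetical helper: ranks sentences by the number of query hits.
def rank_sentences(sentences, query, case_sensitive: false, whole_sentence: true)
  query = query.to_s
  raise 'A search query must be provided' if query.empty?

  # Either match the phrase as a whole or any one of its words.
  pattern = if whole_sentence
              Regexp.escape(query)
            else
              query.split.map { |word| Regexp.escape(word) }.join('|')
            end
  regex = Regexp.new(pattern, case_sensitive ? 0 : Regexp::IGNORECASE)

  hits = {}
  sentences.each do |sentence|
    sentence = sentence.strip
    count = sentence.scan(regex).count
    hits[sentence] = count if count.positive?
  end

  # Most hits first.
  hits.sort_by { |_sentence, count| -count }.map(&:first)
end

text = [
  'Ruby is fun.',
  'Crawl the web with Ruby; index it with Ruby.',
  'Nothing here.'
]

ranked      = rank_sentences(text, 'ruby')
strict      = rank_sentences(text, 'ruby', case_sensitive: true)
phrase      = rank_sentences(text, 'web fun')
either_word = rank_sentences(text, 'web fun', whole_sentence: false)
```

The sentence mentioning Ruby twice ranks first; the phrase search for 'web fun' finds nothing, whereas searching for its words separately matches two sentences.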

#search!(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ Array<String>

Performs a text search (see Document#search for details) but assigns the results to the @text instance variable. This can be used for sub search functionality. The original text is returned; no other reference to it is kept thereafter.

Parameters:

  • query (String, #to_s)

    The value to search the document's @text for.

  • case_sensitive (Boolean) (defaults to: false)

    Whether character case must match.

  • whole_sentence (Boolean) (defaults to: true)

    Whether the query is searched for as a whole phrase. Set to false to search for each of its words separately.

  • sentence_limit (Integer) (defaults to: 80)

    The max length of each search result sentence.

Returns:

  • (Array<String>)

    This Document's original @text value.



# File 'lib/wgit/document.rb', line 422

def search!(
  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
)
  orig_text = @text
  @text = search(
    query, case_sensitive: case_sensitive,
           whole_sentence: whole_sentence, sentence_limit: sentence_limit
  )

  orig_text
end

#size ⇒ Integer

Determine the size of this Document's HTML.

Returns:

  • (Integer)

    The total number of @html bytes.



# File 'lib/wgit/document.rb', line 281

def size
  stats[:html]
end

#stats ⇒ Hash Also known as: statistics

Returns a Hash containing this Document's instance variables and their #length (if they respond to it). Works dynamically so that any user defined extensions (and their created instance vars) will appear in the returned Hash as well. The number of text snippets as well as total number of textual bytes are always included in the returned Hash.

Returns:

  • (Hash)

    Containing self's HTML page statistics.



# File 'lib/wgit/document.rb', line 260

def stats
  hash = {}
  instance_variables.each do |var|
    # Add up the total bytes of text as well as the length.
    if var == :@text
      hash[:text_snippets] = @text.length
      hash[:text_bytes]    = @text.sum(&:length)
    # Else take the var's #length method return value.
    else
      next unless instance_variable_get(var).respond_to?(:length)

      hash[var[1..-1].to_sym] = instance_variable_get(var).send(:length)
    end
  end

  hash
end
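
To show what the returned Hash looks like, here is a sketch on a hypothetical StatsDoc class holding a few instance variables. It follows the same logic as the method above; note that String#length counts characters.

```ruby
# Hypothetical class with a handful of instance variables.
class StatsDoc
  def initialize(html, text, title)
    @html  = html
    @text  = text  # An Array of text snippets.
    @title = title
    @score = 0.0   # Floats don't respond to #length, so this is skipped.
  end

  def stats
    instance_variables.each_with_object({}) do |var, hash|
      value = instance_variable_get(var)

      if var == :@text
        hash[:text_snippets] = value.length
        hash[:text_bytes]    = value.sum(&:length)
      elsif value.respond_to?(:length)
        hash[var.to_s.delete_prefix('@').to_sym] = value.length
      end
    end
  end
end

stats = StatsDoc.new('<p>Hi there</p>', ['Hi there'], 'Home').stats
```

Any further instance variable that responds to #length would appear in the Hash under its own name, which is how user defined extensions show up automatically.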

#to_h(include_html: false, include_score: true) ⇒ Hash

Returns a Hash containing this Document's instance vars. Used when storing the Document in a Database e.g. MongoDB etc. By default the @html var is excluded from the returned Hash.

Parameters:

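  • include_html (Boolean) (defaults to: false)

    Whether or not to include @html in the returned Hash.

  • include_score (Boolean) (defaults to: true)

    Whether or not to include @score in the returned Hash.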

Returns:

  • (Hash)

    Containing self's instance vars.



# File 'lib/wgit/document.rb', line 235

def to_h(include_html: false, include_score: true)
  ignore = include_html ? [] : ['@html']
  ignore << '@score' unless include_score
  ignore << '@doc' # Always ignore Nokogiri @doc.

  Wgit::Utils.to_h(self, ignore: ignore)
end

#to_json(include_html: false) ⇒ String

Converts this Document's #to_h return value to a JSON String.

Parameters:

  • include_html (Boolean) (defaults to: false)

    Whether or not to include @html in the returned JSON String.

Returns:

  • (String)

    This Document represented as a JSON String.



# File 'lib/wgit/document.rb', line 248

def to_json(include_html: false)
  h = to_h(include_html: include_html)
  JSON.generate(h)
end
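
The serialisation can be sketched without Wgit::Utils, whose to_h helper isn't shown here. The class SerializableDoc below is hypothetical and builds the Hash directly from its instance variables, using String keys as an assumption.

```ruby
require 'json'

# Hypothetical class mirroring the ignore-list approach of #to_h.
class SerializableDoc
  def initialize(url, html, score)
    @url   = url
    @html  = html
    @score = score
    @doc   = Object.new # Stand-in for the parsed Nokogiri document.
  end

  def to_h(include_html: false, include_score: true)
    ignore = [:@doc] # The parsed document is never serialised.
    ignore << :@html unless include_html
    ignore << :@score unless include_score

    (instance_variables - ignore).to_h do |var|
      [var.to_s.delete_prefix('@'), instance_variable_get(var)]
    end
  end

  def to_json(include_html: false)
    JSON.generate(to_h(include_html: include_html))
  end
end

doc       = SerializableDoc.new('http://example.com', '<p>Hi</p>', 1.5)
default_h = doc.to_h
html_only = doc.to_h(include_html: true, include_score: false)
json      = doc.to_json
```

By default the Hash carries the url and score only; the HTML is opt-in because it is usually the largest field.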

#xpath(xpath) ⇒ Nokogiri::XML::NodeSet

Uses Nokogiri's xpath method to search the doc's html and return the results.

Parameters:

  • xpath (String)

    The xpath to search the @html with.

Returns:

  • (Nokogiri::XML::NodeSet)

    The result set of the xpath search.



# File 'lib/wgit/document.rb', line 299

def xpath(xpath)
  @doc.xpath(xpath)
end