Class: Wgit::Document

Inherits:
Object
Includes:
Assertable
Defined in:
lib/wgit/document.rb

Overview

Class modeling/serialising an HTML web document, although other MIME types (e.g. images) will also work. It also doubles as a search result when Documents are loaded from the database via Wgit::Database::DatabaseAdapter#search.

The initialize method dynamically sets instance variables (e.g. text) from the Document's HTML or database object. This step is dynamic so that the Document class can be easily extended, allowing you to extract the parts of a webpage that matter to you. See Wgit::Document.define_extractor.

Constant Summary

REGEX_EXTRACTOR_NAME = /[a-z0-9_]+/

Regex for the allowed var names when defining an extractor.

Constants included from Assertable

Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::MIXED_ENUMERABLE_MSG, Assertable::NON_ENUMERABLE_MSG

Methods included from Assertable

#assert_arr_types, #assert_common_arr_types, #assert_required_keys, #assert_respond_to, #assert_types

Constructor Details

#initialize(url_or_obj, html = '', encode: true) ⇒ Document

Initialize takes either two strings (representing the URL and HTML) or an object representing a database record (of an HTTP crawled web page). This allows for initialisation from both crawled web pages and documents/web pages retrieved from the database.

During initialisation, the Document will call any private init_*_from_html and init_*_from_object methods it can find. See the Wgit::Document.define_extractor method for more details.



# File 'lib/wgit/document.rb', line 75

def initialize(url_or_obj, html = '', encode: true)
  if url_or_obj.is_a?(String)
    init_from_strings(url_or_obj, html, encode:)
  else
    init_from_object(url_or_obj, encode:)
  end
end
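
The dual-source initialisation described above can be sketched in plain Ruby. This is a minimal illustration only, not Wgit's implementation: the TinyDoc class, its regex-based title lookup, and the way it discovers init methods via private_methods are all assumptions made for the example.

```ruby
# Hypothetical stand-in for a document class; not part of Wgit.
class TinyDoc
  attr_reader :url, :title

  def initialize(url_or_obj, html = '')
    if url_or_obj.is_a?(String)
      # Two strings: a URL and its HTML.
      @url   = url_or_obj
      source = html
      suffix = 'from_html'
    else
      # A Hash-like database record.
      @url   = url_or_obj.fetch('url')
      source = url_or_obj
      suffix = 'from_object'
    end

    # Call every private init_*_<suffix> method that exists on this object.
    private_methods
      .grep(/\Ainit_.+_#{suffix}\z/)
      .each { |name| send(name, source) }
  end

  private

  def init_title_from_html(html)
    @title = html[%r{<title>(.*?)</title>}, 1]
  end

  def init_title_from_object(obj)
    @title = obj.fetch('title', nil)
  end
end

crawled = TinyDoc.new('http://example.com', '<title>Hi</title>')
stored  = TinyDoc.new({ 'url' => 'http://example.com', 'title' => 'Stored' })
puts crawled.title
puts stored.title
```

Both objects end up with the same instance variables, which is what lets one class represent a freshly crawled page and a database record alike.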

Class Attribute Details

.extractors ⇒ Object (readonly)

Set of Symbols representing the defined Document extractors. Read-only; use Wgit::Document.define_extractor to add a new extractor.



# File 'lib/wgit/document.rb', line 41

def extractors
  @extractors
end

.to_h_ignore_vars ⇒ Object (readonly)

Array of instance vars to ignore when Document#to_h and (in turn) Wgit::Model.document methods are called. Append your own defined extractor vars to omit them from the model (database object) when indexing. Each var should be a String starting with an '@' char e.g. "@data" etc.



# File 'lib/wgit/document.rb', line 37

def to_h_ignore_vars
  @to_h_ignore_vars
end

Instance Attribute Details

#html ⇒ Object (readonly) Also known as: content

The content/HTML of the document, an instance of String.



# File 'lib/wgit/document.rb', line 48

def html
  @html
end

#parser ⇒ Object (readonly)

The Nokogiri::HTML document object initialized from @html.



# File 'lib/wgit/document.rb', line 51

def parser
  @parser
end

#score ⇒ Object (readonly)

The score is set/used following a Database#search and records matches.



# File 'lib/wgit/document.rb', line 54

def score
  @score
end

#url ⇒ Object (readonly)

The URL of the webpage, an instance of Wgit::Url.



# File 'lib/wgit/document.rb', line 45

def url
  @url
end

Class Method Details

.define_extractor(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol

Defines a content extractor, which extracts HTML elements/content into instance variables upon Document initialization. See the default extractors defined in 'document_extractors.rb' as examples. Defining an extractor means that every subsequently crawled/initialized document will attempt to extract the xpath's content. Use #extract for a one-off content extraction on any document.

Note that defined extractors work for both Documents initialized from HTML (via Wgit::Crawler methods) and from database objects. Once defined, an extractor initializes a private instance variable with the xpath or database object result(s).

When initialising from HTML, a singleton value of true will only ever return the first result found; otherwise all the results are returned in an Enumerable. When initialising from a database object, the value is taken as is and singleton is only used to define the default empty value. If a value cannot be found (in either the HTML or database object), then a default will be used. The default value is: singleton ? nil : [].

Options Hash (opts):

  • :singleton (Boolean)

    The singleton option determines whether or not the result(s) should be in an Enumerable. If multiple results are found and singleton is true then the first result will be used. Defaults to true.

  • :text_content_only (Boolean)

    The text_content_only option if true will use the text #content of the Nokogiri result object, otherwise the Nokogiri object itself is returned. The type of Nokogiri object returned depends on the given xpath query. See the Nokogiri documentation for more information. Defaults to true.

Yields:

  • The block is executed when a Wgit::Document is initialized, regardless of the source. Use it (optionally) to process the result value.

Yield Parameters:

  • value (Object)

    The result value to be assigned to the new var.

  • source (Wgit::Document, Object)

    The source of the value.

  • type (Symbol)

    The source type, either :document or (DB) :object.

Yield Returns:

  • (Object)

    The return value of the block becomes the new var's value. Return the block's value param unchanged if you simply want to inspect it.

Raises:

  • (StandardError)

    If the var param isn't valid.



# File 'lib/wgit/document.rb', line 139

def self.define_extractor(var, xpath, opts = {}, &block)
  var = var.to_sym
  defaults = { singleton: true, text_content_only: true }
  opts = defaults.merge(opts)

  raise "var must match #{REGEX_EXTRACTOR_NAME}" unless \
  var =~ REGEX_EXTRACTOR_NAME

  # Define the private init_*_from_html method for HTML.
  # Gets the HTML's xpath value and creates a var for it.
  func_name = Document.send(:define_method, "init_#{var}_from_html") do
    result = extract_from_html(xpath, **opts, &block)
    init_var(var, result)
  end
  Document.send(:private, func_name)

  # Define the private init_*_from_object method for a Database object.
  # Gets the Object's 'key' value and creates a var for it.
  func_name = Document.send(
    :define_method, "init_#{var}_from_object"
  ) do |obj|
    result = extract_from_object(
      obj, var.to_s, singleton: opts[:singleton], &block
    )
    init_var(var, result)
  end
  Document.send(:private, func_name)

  @extractors << var
  var
end
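
To illustrate the mechanism, here is a minimal stdlib-only sketch of an extractor registry built on define_method. It is not Wgit's code: MiniDoc is hypothetical, a Regexp stands in for the xpath so that Nokogiri isn't needed, and the name regex is anchored here (unlike REGEX_EXTRACTOR_NAME above) purely as a choice for the example.

```ruby
# Hypothetical miniature of the extractor pattern; not part of Wgit.
class MiniDoc
  NAME_REGEX = /\A[a-z0-9_]+\z/ # Anchored in this sketch only.

  @extractors = []
  class << self
    attr_reader :extractors
  end

  # pattern is a Regexp with one capture group, standing in for an xpath.
  def self.define_extractor(var, pattern, singleton: true, &block)
    var = var.to_sym
    raise ArgumentError, "var must match #{NAME_REGEX}" unless var.match?(NAME_REGEX)

    # Define a private init_*_from_html method that sets @<var>.
    method_name = define_method("init_#{var}_from_html") do
      found = @html.scan(pattern).flatten
      value = singleton ? found.first : found
      value = block.call(value, self, :document) if block
      instance_variable_set("@#{var}", value)
    end
    private method_name

    @extractors << var
    var
  end

  def initialize(html)
    @html = html
    # Run every extractor defined so far.
    self.class.extractors.each { |var| send("init_#{var}_from_html") }
  end
end

MiniDoc.define_extractor(:heading, %r{<h1>(.*?)</h1>})
MiniDoc.define_extractor(:items, %r{<li>(.*?)</li>}, singleton: false) do |value, _source, _type|
  value.map(&:upcase)
end

doc = MiniDoc.new('<h1>Title</h1><ul><li>a</li><li>b</li></ul>')
p doc.instance_variable_get(:@heading)
p doc.instance_variable_get(:@items)
```

The singleton flag decides between the first match and all matches, and the optional block gets the last word on the stored value, mirroring the behaviour described for the real method.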

.remove_extractor(var) ⇒ Boolean

Removes the init_* methods created when an extractor is defined. Therefore, this is the opposing method to Document.define_extractor. Returns true if successful or false if the method(s) cannot be found.



# File 'lib/wgit/document.rb', line 178

def self.remove_extractor(var)
  Document.send(:remove_method, "init_#{var}_from_html")
  Document.send(:remove_method, "init_#{var}_from_object")

  @extractors.delete(var.to_sym)

  true
rescue NameError
  false
end

.remove_extractors ⇒ Object

Removes all default and defined extractors by calling Document.remove_extractor underneath. See its documentation.



# File 'lib/wgit/document.rb', line 191

def self.remove_extractors
  @extractors.each { |var| remove_extractor(var) }
end

Instance Method Details

#==(other) ⇒ Boolean

Determines if both the url and html match. Use doc.object_id == other.object_id for exact object comparison.



# File 'lib/wgit/document.rb', line 209

def ==(other)
  return false unless other.is_a?(Wgit::Document)

  (@url == other.url) && (@html == other.html)
end

#[](range) ⇒ String

Shortcut for calling Document#html[range].



# File 'lib/wgit/document.rb', line 219

def [](range)
  @html[range]
end

#at_css(selector) ⇒ Nokogiri::XML::Element

Uses Nokogiri's at_css method to search the doc's html and return the result. Use #css for returning several results.



# File 'lib/wgit/document.rb', line 369

def at_css(selector)
  @parser.at_css(selector)
end

#at_xpath(xpath) ⇒ Nokogiri::XML::Element

Uses Nokogiri's at_xpath method to search the doc's html and return the result. Use #xpath for returning several results.



# File 'lib/wgit/document.rb', line 351

def at_xpath(xpath)
  @parser.at_xpath(xpath)
end

#base_url(link: nil) ⇒ Wgit::Url

Returns the base URL of this Wgit::Document. The base URL is either the <base> element's href value or @url (if @base is nil). If @base is present and relative, then @url.to_origin + @base is returned. Use this method instead of doc.url.to_origin etc. when manually building absolute links from relative links; alternatively, use link.make_absolute(doc).

Provide the link: parameter to get the correct base URL for that type of link. For example, a link of #top would always return @url because it applies to that page, not a different one. Query strings work in the same way. Use this parameter if manually joining Urls e.g.

relative_link = Wgit::Url.new('?q=hello')
absolute_link = doc.base_url(link: relative_link).join(relative_link)

This is similar to how Wgit::Document#internal_absolute_links works.

Raises:

  • (StandardError)

    If link isn't relative or if a base URL can't be established e.g. the doc @url is relative and <base> is nil.



# File 'lib/wgit/document.rb', line 245

def base_url(link: nil)
  if @url.relative? && @base.nil?
    raise "Document @url ('#{@url}') cannot be relative if <base> is nil"
  end

  if @url.relative? && @base&.relative?
    raise "Document @url ('#{@url}') and <base> ('#{@base}') both can't \
be relative"
  end

  get_base = -> { @base.relative? ? @url.to_origin.join(@base) : @base }

  if link
    link = Wgit::Url.new(link)
    raise "link must be relative: #{link}" unless link.relative?

    if link.is_fragment? || link.is_query?
      base_url = @base ? get_base.call : @url
      return base_url.omit_fragment.omit_query
    end
  end

  base_url = @base ? get_base.call : @url.to_origin
  base_url.omit_fragment.omit_query
end
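
The resolution rules in the listing above can be approximated with Ruby's stdlib URI. This is a rough sketch under stated assumptions, not Wgit's API: sketch_base_url is a hypothetical helper, it requires an absolute document URL, and it ignores ports when building the origin.

```ruby
require 'uri'

# Hypothetical helper mirroring the base URL rules; not part of Wgit.
def sketch_base_url(url, base: nil, link: nil)
  url = URI.parse(url)
  raise ArgumentError, 'url must be absolute' unless url.absolute?
  raise ArgumentError, "link must be relative: #{link}" if link && URI.parse(link).absolute?

  origin      = "#{url.scheme}://#{url.host}" # Port is ignored in this sketch.
  page_scoped = link&.start_with?('#', '?')   # Fragments and queries apply to the page itself.

  chosen =
    if base
      # A relative <base> is joined onto the origin; an absolute one is used as is.
      URI.parse(base).absolute? ? base : URI.join(origin, base).to_s
    elsif page_scoped
      url.to_s
    else
      origin
    end

  # Drop any query string or fragment from the result.
  chosen.sub(/[?#].*\z/, '')
end

puts sketch_base_url('http://example.com/about?x=1#top')
puts sketch_base_url('http://example.com/about?x=1', link: '?q=hello')
puts sketch_base_url('http://example.com/about', base: '/docs/')
```

The middle call shows why link: matters: a query-only link resolves against the page itself rather than the site origin.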

#css(selector) ⇒ Nokogiri::XML::NodeSet

Uses Nokogiri's css method to search the doc's html and return the results. Use #at_css for returning the first result only.



# File 'lib/wgit/document.rb', line 360

def css(selector)
  @parser.css(selector)
end

#empty? ⇒ Boolean

Determine if this Document's HTML is empty or not.



# File 'lib/wgit/document.rb', line 331

def empty?
  return true if @html.nil?

  @html.empty?
end

#external_links ⇒ Array<Wgit::Url>

Returns all unique external links from this Document in absolute form. External meaning a link to a different host.



# File 'lib/wgit/document.rb', line 410

def external_links
  return [] if @links.empty?

  links = @links
          .map do |link|
            if link.scheme_relative?
              link.prefix_scheme(@url.to_scheme.to_sym)
            else
              link
            end
          end
          .reject { |link| link.relative?(host: @url.to_origin) }

  Wgit::Utils.sanitize(links)
end

#extract(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object

Extracts a value/object from this Document's @html using the given xpath parameter.

Yields:

  • (Optionally)

    Pass a block to read/write the result value before it's returned.

Yield Parameters:

  • value (Object)

    The result value to be returned.

  • source (Wgit::Document, Object)

    This Document instance.

  • type (Symbol)

    The source type, which is :document.

Yield Returns:

  • (Object)

    The return value of the block gets returned. Return the block's value param unchanged if you simply want to inspect it.



# File 'lib/wgit/document.rb', line 556

def extract(xpath, singleton: true, text_content_only: true, &block)
  send(:extract_from_html, xpath, singleton:, text_content_only:, &block)
end

#extract_from_html(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object (protected)

Extracts a value/object from this Document's @html using the given xpath parameter.

Yields:

  • (Optionally)

    Pass a block to read/write the result value before it's returned.

Yield Parameters:

  • value (Object)

    The result value to be returned.

  • source (Wgit::Document, Object)

    This Document instance.

  • type (Symbol)

    The source type, which is :document.

Yield Returns:

  • (Object)

    The return value of the block gets returned. Return the block's value param unchanged if you simply want to inspect it.



# File 'lib/wgit/document.rb', line 661

def extract_from_html(xpath, singleton: true, text_content_only: true)
  result = nil

  if xpath
    xpath  = xpath.call if xpath.respond_to?(:call)
    result = singleton ? at_xpath(xpath) : xpath(xpath)
  end

  if result && text_content_only
    result = singleton ? result.content : result.map(&:content)
  end

  result = Wgit::Utils.sanitize(result)
  result = yield(result, self, :document) if block_given?
  result
end

#extract_from_object(obj, key, singleton: true) {|value, source, type| ... } ⇒ String, Object (protected)

Returns a value from the obj using the given key via obj#fetch.

Yields:

  • The block is executed when a Wgit::Document is initialized, regardless of the source. Use it (optionally) to process the result value.

Yield Parameters:

  • value (Object)

    The result value to be returned.

  • source (Wgit::Document, Object)

    The source of the value.

  • type (Symbol)

    The source type, either :document or (DB) :object.

Yield Returns:

  • (Object)

    The return value of the block gets returned. Return the block's value param unchanged if you simply want to inspect it.



# File 'lib/wgit/document.rb', line 694

def extract_from_object(obj, key, singleton: true)
  assert_respond_to(obj, :fetch)

  default = singleton ? nil : []
  result  = obj.fetch(key.to_s, default)

  result = Wgit::Utils.sanitize(result)
  result = yield(result, obj, :object) if block_given?
  result
end
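
As a small stdlib-only illustration of the default handling, the sketch below applies the same fetch-with-default idea to a plain Hash. The fetch_field helper and the sample record are hypothetical, and the sanitising step from the listing is left out.

```ruby
# Hypothetical helper showing the singleton-dependent default; not part of Wgit.
def fetch_field(record, key, singleton: true)
  default = singleton ? nil : [] # Missing singleton values become nil, collections become [].
  value   = record.fetch(key.to_s, default)
  value   = yield(value, record, :object) if block_given?
  value
end

record = { 'title' => 'Hello' }

p fetch_field(record, :title)
p fetch_field(record, :links, singleton: false)
p fetch_field(record, :title) { |value, _source, _type| value.upcase }
```

A missing key never raises here; the caller gets nil or an empty Array depending on singleton, which keeps downstream code free of nil checks on collections.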

#init_nokogiri {|config| ... } ⇒ Nokogiri::HTML (protected)

Initializes the nokogiri object using @html, which cannot be nil. Override this method to custom configure the Nokogiri object returned. Gets called from Wgit::Document.new upon initialization.

Yields:

  • (config)

    The given block is passed to Nokogiri::HTML for initialisation.

Raises:

  • (StandardError)

    If @html isn't set.



# File 'lib/wgit/document.rb', line 637

def init_nokogiri(&block)
  raise '@html must be set' unless @html

  Nokogiri::HTML(@html, &block)
end

#inspect ⇒ String

Overrides Object#inspect to shorten the printed output of a Document.



# File 'lib/wgit/document.rb', line 200

def inspect
  "#<Wgit::Document url=\"#{@url}\" html_size=#{size}>"
end

#internal_absolute_links ⇒ Array<Wgit::Url>

Returns all unique internal links from this Document in absolute form by appending them to self's #base_url. Also see Wgit::Document#internal_links.



# File 'lib/wgit/document.rb', line 402

def internal_absolute_links
  internal_links.map { |link| link.make_absolute(self) }
end

#internal_links ⇒ Array<Wgit::Url>

Returns all unique internal links from this Document in relative form. Internal meaning a link to another document on the same host.

This Document's host is used to determine if an absolute URL is actually a relative link. For example, for a Document representing http://www.server.com/about, an absolute link to a page on that same host will be recognized and returned as an internal link because both Documents live on the same host. Also see Wgit::Document#internal_absolute_links.



# File 'lib/wgit/document.rb', line 384

def internal_links
  return [] if @links.empty?

  links = @links
          .select { |link| link.relative?(host: @url.to_origin) }
          .map(&:omit_base)
          .map do |link| # Map @url.to_host into / as it's a duplicate.
    link.to_host == @url.to_host ? Wgit::Url.new('/') : link
  end

  Wgit::Utils.sanitize(links)
end
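
The internal/external split can be sketched with stdlib URI alone. This is an approximation for illustration: partition_links is a hypothetical helper, and unlike the methods documented here it leaves internal links as written rather than converting them to relative form.

```ruby
require 'uri'

# Hypothetical helper; returns [internal, external] arrays of unique link strings.
def partition_links(page_url, links)
  page_host = URI.parse(page_url).host

  links.uniq.partition do |link|
    host = URI.parse(link).host
    # No host means a relative link; the same host means an internal absolute link.
    host.nil? || host == page_host
  end
end

internal, external = partition_links(
  'http://www.server.com/about',
  ['/contact', 'http://www.server.com/faq', 'https://other.org/', '/contact']
)

p internal
p external
```

Comparing hosts rather than checking for a leading slash is what lets an absolute URL on the same host count as internal.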

#nearest_fragment(el_text, el_type = "*") {|results| ... } ⇒ String?

Firstly finds the target element whose text contains el_text. Then finds the preceding fragment element nearest to the target element and returns its href value (starting with #). The search is performed against the @html, so Documents loaded from a DB will need to contain the 'html' field in the Wgit::Model. See the Wgit::Model#include_doc_html documentation for more info.

Yields:

  • (results)

    Given the results of the xpath query. Return the target you want or nil to use the default (first) target in results.

Raises:

  • (StandardError)

    Raises if no matching target element containing el_text can be found or if @html is empty.



# File 'lib/wgit/document.rb', line 595

def nearest_fragment(el_text, el_type = "*")
  raise "The @html is empty" if @html.empty?

  xpath_query = "//#{el_type}[text()[contains(.,\"#{el_text}\")]]"
  results = xpath(xpath_query)
  return nil if results.empty?

  target = results.first
  if block_given?
    result = yield(results)
    target = result if result
  end

  target_index = html_index(target)
  raise 'Failed to find target index' unless target_index

  fragment_h = fragment_indices(fragments)

  # Return the target href if the target is itself a fragment.
  return fragment_h[target_index] if fragment_h.keys.include?(target_index)

  # Find the target's nearest preceeding fragment href.
  closest_index = 0
  fragment_h.each do |fragment_index, href|
    if fragment_index.between?(closest_index, target_index)
      closest_index = fragment_index
    end
  end

  fragment_h[closest_index]
end
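
The final step of the listing, choosing the closest fragment at or before the target, can be shown in isolation. The sketch below assumes a Hash of HTML index to href, similar in spirit to fragment_h above; the nearest_preceding helper and the sample indices are hypothetical.

```ruby
# Hypothetical helper; picks the href whose index is closest to, but not after, the target.
def nearest_preceding(fragments, target_index)
  index = fragments.keys.select { |i| i <= target_index }.max
  index && fragments[index]
end

fragments = { 10 => '#intro', 120 => '#usage', 300 => '#faq' }

p nearest_preceding(fragments, 150)
p nearest_preceding(fragments, 300)
p nearest_preceding(fragments, 5)
```

A target that sits before every fragment has nothing to anchor to, so this sketch returns nil in that case.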

#no_index? ⇒ Boolean

Attempts to extract and check the HTML meta tags instructing Wgit not to index this document (save it to a Database).



# File 'lib/wgit/document.rb', line 565

def no_index?
  meta_robots = extract_from_html(
    '//meta[@name="robots"]/@content',
    singleton: true,
    text_content_only: true
  )
  meta_wgit = extract_from_html(
    '//meta[@name="wgit"]/@content',
    singleton: true,
    text_content_only: true
  )

  [meta_robots, meta_wgit].include?('noindex')
end

#search(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80, search_fields: Wgit::Model.search_fields) {|results_hash| ... } ⇒ Array<String>

Searches the Document's instance vars for the given query and returns the results. The Wgit::Model.search_fields denote the vars to be searched, unless overridden using the search_fields: param.

The number of matches for each search field is recorded internally and used to rank/sort the search results before being returned. Where the Wgit::Database::DatabaseAdapter#search method searches all documents for matches, this method searches each individual Document for matches.

Each search result comprises a sentence of a given length. The length will be based on the sentence_limit parameter or the full length of the original sentence, whichever is less. The algorithm ensures that the search query is visible somewhere in the sentence.

Yields:

  • (results_hash)

    Given the results_hash containing each search result (String) and its score (num_matches * weight).



# File 'lib/wgit/document.rb', line 455

def search(
  query, case_sensitive: false, whole_sentence: true,
  sentence_limit: 80, search_fields: Wgit::Model.search_fields
)
  raise 'The sentence_limit value must be even' if sentence_limit.odd?
  assert_type(search_fields, Hash)

  regex = Wgit::Utils.build_search_regex(
    query, case_sensitive:, whole_sentence:)
  results = {}

  search_fields.each do |field, weight|
    doc_field = instance_variable_get("@#{field}".to_sym)
    next unless doc_field

    Wgit::Utils.each(doc_field) do |text|
      assert_type(text, String)

      text = text.strip
      matches = text.scan(regex).count
      next unless matches.positive?

      index = text.index(regex) # Index of first match.
      Wgit::Utils.format_sentence_length(text, index, sentence_limit)

      # For duplicate matching text, total the text score.
      text_score = matches * weight
      existing_score = results[text]
      text_score += existing_score if existing_score

      results[text] = text_score
    end
  end

  return [] if results.empty?

  yield results if block_given?

  # Return only the matching text sentences, sorted by relevance.
  Hash[results.sort_by { |_, score| -score }].keys
end
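
The ranking idea (score = number of matches × field weight, highest first) can be sketched without Wgit. Everything here is illustrative: rank_matches is a hypothetical helper, the field names and weights are made up rather than taken from Wgit::Model.search_fields, and sentence trimming is omitted.

```ruby
# Hypothetical helper; returns matching texts ordered by descending score.
def rank_matches(fields, weights, query)
  regex  = /#{Regexp.escape(query)}/i
  scores = Hash.new(0)

  weights.each do |field, weight|
    # Array() lets a field hold either one String or many.
    Array(fields[field]).each do |text|
      matches = text.scan(regex).count
      scores[text] += matches * weight if matches.positive?
    end
  end

  scores.sort_by { |_, score| -score }.map(&:first)
end

fields = {
  title: 'Ruby crawler',
  text: ['Ruby is fun. Ruby is fast.', 'Nothing here', 'A ruby gem']
}
weights = { title: 3, text: 1 } # Made-up weights for the example.

ranked = rank_matches(fields, weights, 'ruby')
p ranked
```

Weighting lets a single hit in an important field outrank several hits in body text, which is the trade-off the search_fields Hash controls.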

#search_text(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ Array<String>

Performs a text only search of the Document, instead of searching all search fields defined in Wgit::Model.search_fields.



# File 'lib/wgit/document.rb', line 509

def search_text(
  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
)
  search(
    query, case_sensitive:, whole_sentence:,
    sentence_limit:, search_fields: { text: 1 })
end

#search_text!(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ String

Performs a text only search (see Document#search_text for details) but assigns the results to the @text instance variable. This can be used for sub search functionality. The original text is returned; no other reference to it is kept thereafter.



# File 'lib/wgit/document.rb', line 530

def search_text!(
  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
)
  orig_text = @text
  @text = search_text(query, case_sensitive:, whole_sentence:, sentence_limit:)

  orig_text
end

#size ⇒ Integer

Determine the size of this Document's HTML.



# File 'lib/wgit/document.rb', line 324

def size
  stats[:html]
end

#stats ⇒ Hash Also known as: statistics

Returns a Hash containing this Document's instance variables and their #length (if they respond to it). Works dynamically so that any user defined extractors (and their created instance vars) will appear in the returned Hash as well. The number of text snippets as well as total number of textual bytes are always included in the returned Hash.



# File 'lib/wgit/document.rb', line 303

def stats
  hash = {}
  instance_variables.each do |var|
    # Add up the total bytes of text as well as the length.
    if var == :@text
      hash[:text]       = @text.length
      hash[:text_bytes] = @text.sum(&:length)
    # Else take the var's #length method return value.
    else
      next unless instance_variable_get(var).respond_to?(:length)

      hash[var[1..].to_sym] = instance_variable_get(var).send(:length)
    end
  end

  hash
end
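
The reflection technique used by stats works on any Ruby object, so it can be tried in isolation. The Page class below is a hypothetical stand-in with made-up values, not a Wgit::Document.

```ruby
# Hypothetical class demonstrating length stats gathered via reflection.
class Page
  def initialize
    @title = 'Hello'
    @text  = %w[one three]
    @score = 1.5 # Floats have no #length, so this var is skipped.
  end

  def stats
    hash = instance_variables.each_with_object({}) do |var, acc|
      value = instance_variable_get(var)
      next unless value.respond_to?(:length)

      acc[var.to_s.delete_prefix('@').to_sym] = value.length
    end

    # Total characters across all text snippets, as well as the snippet count.
    hash[:text_bytes] = @text.sum(&:length)
    hash
  end
end

p Page.new.stats
```

Because the vars are discovered at call time, any instance variable added later (for example by a newly defined extractor) shows up without changing the method.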

#to_h(include_html: false, include_score: true) ⇒ Hash

Returns a Hash containing this Document's instance vars. Used when storing the Document in a Database e.g. MongoDB etc. By default the @html var is excluded from the returned Hash.



# File 'lib/wgit/document.rb', line 278

def to_h(include_html: false, include_score: true)
  ignore = Wgit::Document.to_h_ignore_vars.dup
  ignore << '@html' unless include_html
  ignore << '@score' unless include_score

  Wgit::Utils.to_h(self, ignore:)
end
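
The ignore-list behaviour can be sketched with plain Ruby reflection. This is an assumption-laden illustration: Record, vars_to_h and record_to_h are hypothetical, and the String keys are a choice made for the example rather than a statement about what Wgit::Utils.to_h returns.

```ruby
# Hypothetical record with the same kinds of vars a document might hold.
class Record
  def initialize
    @url   = 'http://example.com'
    @html  = '<p>hi</p>'
    @score = 0.0
  end
end

# Builds a Hash from an object's instance vars, skipping any named in ignore.
def vars_to_h(obj, ignore: [])
  obj.instance_variables.each_with_object({}) do |var, hash|
    next if ignore.include?(var.to_s)

    hash[var.to_s.delete_prefix('@')] = obj.instance_variable_get(var)
  end
end

# Mirrors the include_html / include_score switches described above.
def record_to_h(record, include_html: false, include_score: true)
  ignore = []
  ignore << '@html' unless include_html
  ignore << '@score' unless include_score
  vars_to_h(record, ignore: ignore)
end

p record_to_h(Record.new)
p record_to_h(Record.new, include_html: true, include_score: false)
```

Entries in the ignore list are written with a leading '@', matching the format described for to_h_ignore_vars.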

#to_json(include_html: false) ⇒ String

Converts this Document's #to_h return value to a JSON String.



# File 'lib/wgit/document.rb', line 291

def to_json(include_html: false)
  h = to_h(include_html:)
  JSON.generate(h)
end

#xpath(xpath) ⇒ Nokogiri::XML::NodeSet

Uses Nokogiri's xpath method to search the doc's html and return the results. Use #at_xpath for returning the first result only.



# File 'lib/wgit/document.rb', line 342

def xpath(xpath)
  @parser.xpath(xpath)
end