Class: Wgit::Document
- Inherits:
- Object
- Includes:
- Assertable
- Defined in:
- lib/wgit/document.rb
Overview
Class modeling/serialising an HTML web document, although other MIME types
will work e.g. images etc. Also doubles as a search result when
loading Documents from the database via
Wgit::Database::DatabaseAdapter#search.
The initialize method dynamically initializes instance variables from the
Document HTML / Database object e.g. text. This bit is dynamic so that the
Document class can be easily extended allowing you to extract the bits of
a webpage that are important to you. See Wgit::Document.define_extractor.
Constant Summary
- REGEX_EXTRACTOR_NAME =
Regex for the allowed var names when defining an extractor.
/[a-z0-9_]+/
Constants included from Assertable
Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::MIXED_ENUMERABLE_MSG, Assertable::NON_ENUMERABLE_MSG
Class Attribute Summary
-
.extractors ⇒ Object
readonly
Set of Symbols representing the defined Document extractors.
-
.to_h_ignore_vars ⇒ Object
readonly
Array of instance vars to ignore when Document#to_h and (in turn) Wgit::Model.document methods are called.
Instance Attribute Summary
-
#html ⇒ Object
(also: #content)
readonly
The content/HTML of the document, an instance of String.
-
#parser ⇒ Object
readonly
The Nokogiri::HTML document object initialized from @html.
-
#score ⇒ Object
readonly
The score is set/used following a Database#search and records matches.
-
#url ⇒ Object
readonly
The URL of the webpage, an instance of Wgit::Url.
Class Method Summary
-
.define_extractor(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
Defines a content extractor, which extracts HTML elements/content into instance variables upon Document initialization.
-
.remove_extractor(var) ⇒ Boolean
Removes the init_* methods created when an extractor is defined.
-
.remove_extractors ⇒ Object
Removes all default and defined extractors by calling Document.remove_extractor underneath.
Instance Method Summary
-
#==(other) ⇒ Boolean
Determines if both the url and html match.
-
#[](range) ⇒ String
Shortcut for calling Document#html[range].
-
#at_css(selector) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_css method to search the doc's html and return the result.
-
#at_xpath(xpath) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_xpath method to search the doc's html and return the result.
-
#base_url(link: nil) ⇒ Wgit::Url
Returns the base URL of this Wgit::Document.
-
#css(selector) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's css method to search the doc's html and return the results.
-
#empty? ⇒ Boolean
Determine if this Document's HTML is empty or not.
-
#external_links ⇒ Array<Wgit::Url>
(also: #external_urls)
Returns all unique external links from this Document in absolute form.
-
#extract(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object
Extracts a value/object from this Document's @html using the given xpath parameter.
-
#extract_from_html(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object
protected
Extracts a value/object from this Document's @html using the given xpath parameter.
-
#extract_from_object(obj, key, singleton: true) {|value, source, type| ... } ⇒ String, Object
protected
Returns a value from the obj using the given key via obj#fetch.
-
#init_nokogiri {|config| ... } ⇒ Nokogiri::HTML
protected
Initializes the nokogiri object using @html, which cannot be nil.
-
#initialize(url_or_obj, html = '', encode: true) ⇒ Document
constructor
Initialize takes either two strings (representing the URL and HTML) or an object representing a database record (of an HTTP crawled web page).
-
#inspect ⇒ String
Overrides Object#inspect to shorten the printed output of a Document.
-
#internal_absolute_links ⇒ Array<Wgit::Url>
(also: #internal_absolute_urls)
Returns all unique internal links from this Document in absolute form by appending them to self's #base_url.
-
#internal_links ⇒ Array<Wgit::Url>
(also: #internal_urls)
Returns all unique internal links from this Document in relative form.
-
#nearest_fragment(el_text, el_type = "*") {|results| ... } ⇒ String?
Firstly finds the target element whose text contains el_text.
-
#no_index? ⇒ Boolean
Attempts to extract and check the HTML meta tags instructing Wgit not to index this document (save it to a Database).
-
#search(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80, search_fields: Wgit::Model.search_fields) {|results_hash| ... } ⇒ Array<String>
Searches the Document's instance vars for the given query and returns the results.
-
#search_text(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ Array<String>
Performs a text only search of the Document, instead of searching all search fields defined in Wgit::Model.search_fields.
-
#search_text!(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ String
Performs a text only search (see Document#search_text for details) but assigns the results to the @text instance variable.
-
#size ⇒ Integer
Determine the size of this Document's HTML.
-
#stats ⇒ Hash
(also: #statistics)
Returns a Hash containing this Document's instance variables and their #length (if they respond to it).
-
#to_h(include_html: false, include_score: true) ⇒ Hash
Returns a Hash containing this Document's instance vars.
-
#to_json(include_html: false) ⇒ String
Converts this Document's #to_h return value to a JSON String.
-
#xpath(xpath) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's xpath method to search the doc's html and return the results.
Methods included from Assertable
#assert_arr_types, #assert_common_arr_types, #assert_required_keys, #assert_respond_to, #assert_types
Constructor Details
#initialize(url_or_obj, html = '', encode: true) ⇒ Document
Initialize takes either two strings (representing the URL and HTML) or an object representing a database record (of an HTTP crawled web page). This allows for initialisation from both crawled web pages and documents/web pages retrieved from the database.
During initialisation, the Document will call any private init_*_from_html
and init_*_from_object methods it can find. See the
Wgit::Document.define_extractor method for more details.
# File 'lib/wgit/document.rb', line 75
def initialize(url_or_obj, html = '', encode: true)
  if url_or_obj.is_a?(String)
    init_from_strings(url_or_obj, html, encode:)
  else
    init_from_object(url_or_obj, encode:)
  end
end
Class Attribute Details
.extractors ⇒ Object (readonly)
Set of Symbols representing the defined Document extractors. Is read-only. Use Wgit::Document.define_extractor for a new extractor.
# File 'lib/wgit/document.rb', line 41
def extractors
  @extractors
end
.to_h_ignore_vars ⇒ Object (readonly)
Array of instance vars to ignore when Document#to_h and (in turn) Wgit::Model.document methods are called. Append your own defined extractor vars to omit them from the model (database object) when indexing. Each var should be a String starting with an '@' char e.g. "@data" etc.
# File 'lib/wgit/document.rb', line 37
def to_h_ignore_vars
  @to_h_ignore_vars
end
Instance Attribute Details
#html ⇒ Object (readonly) Also known as: content
The content/HTML of the document, an instance of String.
# File 'lib/wgit/document.rb', line 48
def html
  @html
end
#parser ⇒ Object (readonly)
The Nokogiri::HTML document object initialized from @html.
# File 'lib/wgit/document.rb', line 51
def parser
  @parser
end
#score ⇒ Object (readonly)
The score is set/used following a Database#search and records matches.
# File 'lib/wgit/document.rb', line 54
def score
  @score
end
#url ⇒ Object (readonly)
The URL of the webpage, an instance of Wgit::Url.
# File 'lib/wgit/document.rb', line 45
def url
  @url
end
Class Method Details
.define_extractor(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
Defines a content extractor, which extracts HTML elements/content
into instance variables upon Document initialization. See the default
extractors defined in 'document_extractors.rb' as examples. Defining an
extractor means that every subsequently crawled/initialized document
will attempt to extract the xpath's content. Use #extract for a one-off
content extraction on any document.
Note that defined extractors work for both Documents initialized from HTML (via Wgit::Crawler methods) and from database objects. Once defined, an extractor initializes a private instance variable with the xpath or database object result(s).
When initialising from HTML, a singleton value of true will only
ever return the first result found; otherwise all the results are
returned in an Enumerable. When initialising from a database object, the
value is taken as is and singleton is only used to define the default
empty value. If a value cannot be found (in either the HTML or database
object), then a default will be used. The default value is:
singleton ? nil : []
# File 'lib/wgit/document.rb', line 139
def self.define_extractor(var, xpath, opts = {}, &block)
  var = var.to_sym
  defaults = { singleton: true, text_content_only: true }
  opts = defaults.merge(opts)

  raise "var must match #{REGEX_EXTRACTOR_NAME}" unless \
  var =~ REGEX_EXTRACTOR_NAME

  # Define the private init_*_from_html method for HTML.
  # Gets the HTML's xpath value and creates a var for it.
  func_name = Document.send(:define_method, "init_#{var}_from_html") do
    result = extract_from_html(xpath, **opts, &block)
    init_var(var, result)
  end
  Document.send(:private, func_name)

  # Define the private init_*_from_object method for a Database object.
  # Gets the Object's 'key' value and creates a var for it.
  func_name = Document.send(
    :define_method, "init_#{var}_from_object"
  ) do |obj|
    result = extract_from_object(
      obj, var.to_s, singleton: opts[:singleton], &block
    )
    init_var(var, result)
  end
  Document.send(:private, func_name)

  @extractors << var
  var
end
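The define_method pattern behind define_extractor can be tried in isolation. What follows is a minimal sketch, not Wgit's implementation: MiniDocument, its Regexp lookup (standing in for an xpath query) and the title/headings extractors are hypothetical names chosen purely for illustration.

```ruby
require 'set'

# Hypothetical stand-in for Wgit::Document; uses a Regexp instead of an xpath.
class MiniDocument
  @extractors = Set.new

  class << self
    attr_reader :extractors

    # Defines a private init_<var>_from_html method and records the var name.
    def define_extractor(var, regex, singleton: true)
      var = var.to_sym

      func_name = define_method("init_#{var}_from_html") do
        matches = @html.scan(regex).flatten
        result  = singleton ? matches.first : matches
        instance_variable_set("@#{var}", result)
      end
      private func_name

      @extractors << var
      var
    end
  end

  def initialize(html)
    @html = html
    # Call every private init_*_from_html method defined so far.
    self.class.extractors.each { |var| send("init_#{var}_from_html") }
  end
end

MiniDocument.define_extractor(:title, %r{<title>(.*?)</title>})
MiniDocument.define_extractor(:headings, %r{<h2>(.*?)</h2>}, singleton: false)

doc = MiniDocument.new('<title>Hi</title><h2>A</h2><h2>B</h2>')
doc.instance_variable_get(:@title)    # "Hi"
doc.instance_variable_get(:@headings) # ["A", "B"]
```

As in the description above, the singleton flag decides between the first match and all matches, and the extractor only affects documents initialized after it is defined.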
.remove_extractor(var) ⇒ Boolean
Removes the init_* methods created when an extractor is defined.
Therefore, this is the opposing method to Document.define_extractor.
Returns true if successful or false if the method(s) cannot be found.
# File 'lib/wgit/document.rb', line 178
def self.remove_extractor(var)
  Document.send(:remove_method, "init_#{var}_from_html")
  Document.send(:remove_method, "init_#{var}_from_object")

  @extractors.delete(var.to_sym)

  true
rescue NameError
  false
end
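The true/false return value relies on Module#remove_method raising NameError when the method does not exist. A small sketch of that behaviour, assuming a hypothetical Sample class and remove_init_method helper (neither is part of Wgit):

```ruby
# Hypothetical class with one init_* method to remove.
class Sample
  def init_title_from_html; end
end

# Returns true if the method was removed, false if it was never defined.
def remove_init_method(klass, name)
  klass.send(:remove_method, name)
  true
rescue NameError
  false
end

remove_init_method(Sample, :init_title_from_html) # true
remove_init_method(Sample, :init_title_from_html) # false, already removed
```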
.remove_extractors ⇒ Object
Removes all default and defined extractors by calling
Document.remove_extractor underneath. See its documentation.
# File 'lib/wgit/document.rb', line 191
def self.remove_extractors
  @extractors.each { |var| remove_extractor(var) }
end
Instance Method Details
#==(other) ⇒ Boolean
Determines if both the url and html match. Use doc.object_id == other.object_id for exact object comparison.
# File 'lib/wgit/document.rb', line 209
def ==(other)
  return false unless other.is_a?(Wgit::Document)

  (@url == other.url) && (@html == other.html)
end
#[](range) ⇒ String
Shortcut for calling Document#html[range].
# File 'lib/wgit/document.rb', line 219
def [](range)
  @html[range]
end
#at_css(selector) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_css method to search the doc's html and return the
result. Use #css for returning several results.
# File 'lib/wgit/document.rb', line 369
def at_css(selector)
  @parser.at_css(selector)
end
#at_xpath(xpath) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_xpath method to search the doc's html and return the
result. Use #xpath for returning several results.
# File 'lib/wgit/document.rb', line 351
def at_xpath(xpath)
  @parser.at_xpath(xpath)
end
#base_url(link: nil) ⇒ Wgit::Url
Returns the base URL of this Wgit::Document. The base URL is either the
<base> element's href value or @url.to_origin (if there is no <base>). If
<base> is present and relative, it is joined onto @url.to_origin. Use this
method instead of doc.url.to_origin etc. when manually building
absolute links from relative links; or use link.make_absolute(doc).
Provide the link: parameter to get the correct base URL for that type
of link. For example, a link of #top would always return @url because
it applies to that page, not a different one. Query strings work in the
same way. Use this parameter if manually joining Url's e.g.
relative_link = Wgit::Url.new('?q=hello')
absolute_link = doc.base_url(link: relative_link).join(relative_link)
This is similar to how Wgit::Document#internal_absolute_links works.
# File 'lib/wgit/document.rb', line 245
def base_url(link: nil)
  if @url.relative? && @base.nil?
    raise "Document @url ('#{@url}') cannot be relative if <base> is nil"
  end

  if @url.relative? && @base&.relative?
    raise "Document @url ('#{@url}') and <base> ('#{@base}') both can't \
be relative"
  end

  get_base = -> { @base.relative? ? @url.to_origin.join(@base) : @base }

  if link
    link = Wgit::Url.new(link)
    raise "link must be relative: #{link}" unless link.relative?

    if link.is_fragment? || link.is_query?
      base_url = @base ? get_base.call : @url
      return base_url.omit_fragment.omit_query
    end
  end

  base_url = @base ? get_base.call : @url.to_origin
  base_url.omit_fragment.omit_query
end
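To see why the link: parameter matters, the fragment/query rule can be sketched with Ruby's standard URI library instead of Wgit::Url. The base_for helper below is hypothetical, and as a simplification it ignores any <base> element and any non-default port.

```ruby
require 'uri'

# Hypothetical helper: picks the URL that a relative link should be joined to.
# Fragment-only and query-only links apply to the current page; everything
# else is resolved against the page's origin (scheme and host only here).
def base_for(page_url, link)
  page = URI.parse(page_url)
  base = if link.start_with?('#', '?')
           page.dup
         else
           URI.parse("#{page.scheme}://#{page.host}")
         end
  # Mirror omit_fragment.omit_query from the method above.
  base.fragment = nil
  base.query = nil
  base.to_s
end

base_for('http://example.com/about?x=1#team', '?q=hello') # "http://example.com/about"
base_for('http://example.com/about?x=1#team', 'contact')  # "http://example.com"
```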
#css(selector) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's css method to search the doc's html and return the
results. Use #at_css for returning the first result only.
# File 'lib/wgit/document.rb', line 360
def css(selector)
  @parser.css(selector)
end
#empty? ⇒ Boolean
Determine if this Document's HTML is empty or not.
# File 'lib/wgit/document.rb', line 331
def empty?
  return true if @html.nil?

  @html.empty?
end
#external_links ⇒ Array<Wgit::Url> Also known as: external_urls
Returns all unique external links from this Document in absolute form. External meaning a link to a different host.
# File 'lib/wgit/document.rb', line 410
def external_links
  return [] if @links.empty?

  links = @links
          .map do |link|
            if link.scheme_relative?
              link.prefix_scheme(@url.to_scheme.to_sym)
            else
              link
            end
          end
          .reject { |link| link.relative?(host: @url.to_origin) }

  Wgit::Utils.sanitize(links)
end
#extract(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object
Extracts a value/object from this Document's @html using the given xpath parameter.
# File 'lib/wgit/document.rb', line 556
def extract(xpath, singleton: true, text_content_only: true, &block)
  send(:extract_from_html, xpath, singleton:, text_content_only:, &block)
end
#extract_from_html(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object (protected)
Extracts a value/object from this Document's @html using the given xpath parameter.
# File 'lib/wgit/document.rb', line 661
def extract_from_html(xpath, singleton: true, text_content_only: true)
  result = nil

  if xpath
    xpath  = xpath.call if xpath.respond_to?(:call)
    result = singleton ? at_xpath(xpath) : xpath(xpath)
  end

  if result && text_content_only
    result = singleton ? result.content : result.map(&:content)
  end

  result = Wgit::Utils.sanitize(result)
  result = yield(result, self, :document) if block_given?
  result
end
#extract_from_object(obj, key, singleton: true) {|value, source, type| ... } ⇒ String, Object (protected)
Returns a value from the obj using the given key via obj#fetch.
# File 'lib/wgit/document.rb', line 694
def extract_from_object(obj, key, singleton: true)
  assert_respond_to(obj, :fetch)

  default = singleton ? nil : []
  result  = obj.fetch(key.to_s, default)

  result = Wgit::Utils.sanitize(result)
  result = yield(result, obj, :object) if block_given?
  result
end
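The fetch-with-default step is plain Ruby and can be shown with an ordinary Hash. This is only a sketch: fetch_value and the record Hash are hypothetical, and the sanitising and block steps of the real method are left out.

```ruby
# Hypothetical helper mirroring the fetch-with-default step described above.
def fetch_value(obj, key, singleton: true)
  default = singleton ? nil : []
  obj.fetch(key.to_s, default)
end

record = { 'title' => 'About us' }

fetch_value(record, :title)                   # "About us"
fetch_value(record, :links, singleton: false) # []
fetch_value(record, :author)                  # nil
```

Note that the key is converted with to_s, so the object is expected to use String keys.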
#init_nokogiri {|config| ... } ⇒ Nokogiri::HTML (protected)
Initializes the nokogiri object using @html, which cannot be nil. Override this method to custom configure the Nokogiri object returned. Gets called from Wgit::Document.new upon initialization.
# File 'lib/wgit/document.rb', line 637
def init_nokogiri(&block)
  raise '@html must be set' unless @html

  Nokogiri::HTML(@html, &block)
end
#inspect ⇒ String
Overrides Object#inspect to shorten the printed output of a Document.
# File 'lib/wgit/document.rb', line 200
def inspect
  "#<Wgit::Document url=\"#{@url}\" html_size=#{size}>"
end
#internal_absolute_links ⇒ Array<Wgit::Url> Also known as: internal_absolute_urls
Returns all unique internal links from this Document in absolute form by appending them to self's #base_url. Also see Wgit::Document#internal_links.
# File 'lib/wgit/document.rb', line 402
def internal_absolute_links
  internal_links.map { |link| link.make_absolute(self) }
end
#internal_links ⇒ Array<Wgit::Url> Also known as: internal_urls
Returns all unique internal links from this Document in relative form. Internal meaning a link to another document on the same host.
This Document's host is used to determine if an absolute URL is actually a relative link e.g. for a Document representing http://www.server.com/about, an absolute link to another page on www.server.com will be recognized and returned as an internal link because both Documents live on the same host. Also see Wgit::Document#internal_absolute_links.
# File 'lib/wgit/document.rb', line 384
def internal_links
  return [] if @links.empty?

  links = @links
          .select { |link| link.relative?(host: @url.to_origin) }
          .map(&:omit_base)
          .map do |link| # Map @url.to_host into / as it's a duplicate.
            link.to_host == @url.to_host ? Wgit::Url.new('/') : link
          end

  Wgit::Utils.sanitize(links)
end
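The internal/external distinction comes down to comparing hosts. A rough sketch using Ruby's standard URI library follows; partition_links and the example URLs are hypothetical, and unlike #internal_links this version does not convert internal links to relative form or remove duplicates.

```ruby
require 'uri'

# Hypothetical helper: splits links into [internal, external] for page_url.
# A link is internal if it has no host (relative) or shares the page's host.
def partition_links(page_url, links)
  page_host = URI.parse(page_url).host
  links.partition do |link|
    host = URI.parse(link).host
    host.nil? || host == page_host
  end
end

internal, external = partition_links(
  'http://www.server.com/about',
  ['/contact', 'http://www.server.com/team', 'http://other.com/']
)
internal # ["/contact", "http://www.server.com/team"]
external # ["http://other.com/"]
```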
#nearest_fragment(el_text, el_type = "*") {|results| ... } ⇒ String?
Firstly finds the target element whose text contains el_text.
Then finds the preceding fragment element nearest to the target
element and returns its href value (starting with #). The search is
performed against the @html so Documents loaded from a DB will need to
contain the 'html' field in the Wgit::Model. See the
Wgit::Model#include_doc_html documentation for more info.
# File 'lib/wgit/document.rb', line 595
def nearest_fragment(el_text, el_type = "*")
  raise "The @html is empty" if @html.empty?

  xpath_query = "//#{el_type}[text()[contains(.,\"#{el_text}\")]]"
  results = xpath(xpath_query)
  return nil if results.empty?

  target = results.first

  if block_given?
    result = yield(results)
    target = result if result
  end

  target_index = html_index(target)
  raise 'Failed to find target index' unless target_index

  fragment_h = fragment_indices(fragments)

  # Return the target href if the target is itself a fragment.
  return fragment_h[target_index] if fragment_h.keys.include?(target_index)

  # Find the target's nearest preceding fragment href.
  closest_index = 0
  fragment_h.each do |fragment_index, href|
    if fragment_index.between?(closest_index, target_index)
      closest_index = fragment_index
    end
  end

  fragment_h[closest_index]
end
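The final "nearest preceding fragment" step can be pictured as a lookup over a Hash of HTML index to href. The following is a simplified sketch with made-up indices; nearest_preceding is a hypothetical helper and is written differently from the loop in the listing above.

```ruby
# Hypothetical helper: given { html_index => href } and a target index,
# returns the href whose index is the closest one at or before the target.
def nearest_preceding(fragment_h, target_index)
  closest = fragment_h.keys.select { |index| index <= target_index }.max
  closest && fragment_h[closest]
end

fragments = { 10 => '#intro', 120 => '#usage', 400 => '#faq' }

nearest_preceding(fragments, 250) # "#usage"
nearest_preceding(fragments, 120) # "#usage" (the target is itself a fragment)
nearest_preceding(fragments, 5)   # nil (no fragment precedes the target)
```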
#no_index? ⇒ Boolean
Attempts to extract and check the HTML meta tags instructing Wgit not to index this document (save it to a Database).
# File 'lib/wgit/document.rb', line 565
def no_index?
  meta_robots = extract_from_html(
    '//meta[@name="robots"]/@content',
    singleton: true,
    text_content_only: true
  )
  meta_wgit = extract_from_html(
    '//meta[@name="wgit"]/@content',
    singleton: true,
    text_content_only: true
  )

  [meta_robots, meta_wgit].include?('noindex')
end
#search(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80, search_fields: Wgit::Model.search_fields) {|results_hash| ... } ⇒ Array<String>
Searches the Document's instance vars for the given query and returns
the results. The Wgit::Model.search_fields denote the vars to be
searched, unless overridden using the search_fields: param.
The number of matches for each search field is recorded internally and used to rank/sort the search results before being returned. Where the Wgit::Database::DatabaseAdapter#search method searches all documents for matches, this method searches each individual Document for matches.
Each search result consists of a sentence of a given length. The length will be based on the sentence_limit parameter or the full length of the original sentence, whichever is less. The algorithm ensures that the search query is visible somewhere in the sentence.
# File 'lib/wgit/document.rb', line 455
def search(
  query, case_sensitive: false, whole_sentence: true,
  sentence_limit: 80, search_fields: Wgit::Model.search_fields
)
  raise 'The sentence_limit value must be even' if sentence_limit.odd?
  assert_type(search_fields, Hash)

  regex = Wgit::Utils.build_search_regex(
    query, case_sensitive:, whole_sentence:)
  results = {}

  search_fields.each do |field, weight|
    doc_field = instance_variable_get("@#{field}".to_sym)
    next unless doc_field

    Wgit::Utils.each(doc_field) do |text|
      assert_type(text, String)

      text = text.strip
      matches = text.scan(regex).count
      next unless matches.positive?

      index = text.index(regex) # Index of first match.
      Wgit::Utils.format_sentence_length(text, index, sentence_limit)

      # For duplicate matching text, total the text score.
      text_score = matches * weight
      existing_score = results[text]
      text_score += existing_score if existing_score

      results[text] = text_score
    end
  end

  return [] if results.empty?

  yield results if block_given?

  # Return only the matching text sentences, sorted by relevance.
  Hash[results.sort_by { |_, score| -score }].keys
end
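The ranking idea (match count multiplied by field weight, summed for duplicate text, then sorted in descending order) can be sketched without Wgit. The rank helper, the field names and the weights of 2 and 1 below are assumptions for illustration only; sentence trimming and the whole_sentence option are omitted.

```ruby
# Hypothetical, simplified scoring: count matches per text, multiply by the
# field weight, total the score for duplicate text and sort by relevance.
def rank(fields, query, weights)
  regex = Regexp.new(Regexp.escape(query), Regexp::IGNORECASE)
  results = Hash.new(0)

  weights.each do |field, weight|
    # Array() lets a field hold either a single String or many Strings.
    Array(fields[field]).each do |text|
      matches = text.scan(regex).count
      results[text] += matches * weight if matches.positive?
    end
  end

  results.sort_by { |_, score| -score }.map(&:first)
end

fields = {
  title: 'Ruby crawler',
  text: ['A crawler written in Ruby', 'Ruby, Ruby and more Ruby', 'Unrelated']
}

rank(fields, 'ruby', { title: 2, text: 1 })
# ["Ruby, Ruby and more Ruby", "Ruby crawler", "A crawler written in Ruby"]
```

Here the three-match sentence scores 3, the title scores 1 x 2 = 2 and the remaining sentence scores 1, which gives the order shown.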
#search_text(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ Array<String>
Performs a text only search of the Document, instead of searching all search fields defined in Wgit::Model.search_fields.
# File 'lib/wgit/document.rb', line 509
def search_text(
  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
)
  search(
    query, case_sensitive:, whole_sentence:,
    sentence_limit:, search_fields: { text: 1 })
end
#search_text!(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ String
Performs a text only search (see Document#search_text for details) but assigns the results to the @text instance variable. This can be used for sub search functionality. The original text is returned; no other reference to it is kept thereafter.
# File 'lib/wgit/document.rb', line 530
def search_text!(
  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
)
  orig_text = @text
  @text = search_text(query, case_sensitive:, whole_sentence:, sentence_limit:)

  orig_text
end
#size ⇒ Integer
Determine the size of this Document's HTML.
# File 'lib/wgit/document.rb', line 324
def size
  stats[:html]
end
#stats ⇒ Hash Also known as: statistics
Returns a Hash containing this Document's instance variables and their #length (if they respond to it). Works dynamically so that any user defined extractors (and their created instance vars) will appear in the returned Hash as well. The number of text snippets as well as total number of textual bytes are always included in the returned Hash.
# File 'lib/wgit/document.rb', line 303
def stats
  hash = {}
  instance_variables.each do |var|
    # Add up the total bytes of text as well as the length.
    if var == :@text
      hash[:text] = @text.length
      hash[:text_bytes] = @text.sum(&:length)
    # Else take the var's #length method return value.
    else
      next unless instance_variable_get(var).respond_to?(:length)

      hash[var[1..].to_sym] = instance_variable_get(var).send(:length)
    end
  end

  hash
end
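The dynamic part of stats is ordinary Ruby reflection. A cut-down sketch with a hypothetical Page class shows how instance variables that do not respond to #length are skipped; the special handling of @text is left out.

```ruby
# Hypothetical class showing the reflection used by the method above.
class Page
  def initialize
    @html  = '<p>Hello</p>'
    @links = %w[/a /b]
    @score = 1.5 # Floats do not respond to #length, so this var is skipped.
  end

  # Maps each instance variable name (without the '@') to its length.
  def stats
    instance_variables.each_with_object({}) do |var, hash|
      value = instance_variable_get(var)
      hash[var[1..].to_sym] = value.length if value.respond_to?(:length)
    end
  end
end

Page.new.stats # { html: 12, links: 2 }
```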
#to_h(include_html: false, include_score: true) ⇒ Hash
Returns a Hash containing this Document's instance vars. Used when storing the Document in a Database e.g. MongoDB etc. By default the @html var is excluded from the returned Hash.
# File 'lib/wgit/document.rb', line 278
def to_h(include_html: false, include_score: true)
  ignore = Wgit::Document.to_h_ignore_vars.dup
  ignore << '@html' unless include_html
  ignore << '@score' unless include_score

  Wgit::Utils.to_h(self, ignore:)
end
#to_json(include_html: false) ⇒ String
Converts this Document's #to_h return value to a JSON String.
# File 'lib/wgit/document.rb', line 291
def to_json(include_html: false)
  h = to_h(include_html:)
  JSON.generate(h)
end
#xpath(xpath) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's xpath method to search the doc's html and return the
results. Use #at_xpath for returning the first result only.
# File 'lib/wgit/document.rb', line 342
def xpath(xpath)
  @parser.xpath(xpath)
end