Class: Wgit::Document
- Inherits:
- Object > Wgit::Document
- Includes:
- Assertable
- Defined in:
- lib/wgit/document.rb
Overview
Class modeling/serialising an HTML web document, although other MIME types
will work e.g. images. Also doubles as a search result when
loading Documents from the database via Wgit::Database#search.
The initialize method dynamically initializes instance variables from the
Document HTML / Database object e.g. text. This bit is dynamic so that the
Document class can be easily extended allowing you to extract the bits of
a webpage that are important to you. See Wgit::Document.define_extractor.
Constant Summary
- REGEX_EXTRACTOR_NAME = /[a-z0-9_]+/
Regex for the allowed var names when defining an extractor.
Constants included from Assertable
Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::NON_ENUMERABLE_MSG
Class Attribute Summary
-
.extractors ⇒ Object
readonly
Set of Symbols representing the defined Document extractors.
-
.text_elements ⇒ Object
readonly
Set of HTML elements that make up the visible text on a page.
-
.to_h_ignore_vars ⇒ Object
readonly
Array of instance vars to ignore when Document#to_h and in turn Model.document methods are called.
Instance Attribute Summary
-
#html ⇒ Object
(also: #content)
readonly
The content/HTML of the document, an instance of String.
-
#parser ⇒ Object
readonly
The Nokogiri::HTML document object initialized from @html.
-
#score ⇒ Object
readonly
The score is only used following a Database#search and records matches.
-
#url ⇒ Object
readonly
The URL of the webpage, an instance of Wgit::Url.
Class Method Summary
-
.define_extractor(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
Defines a content extractor, which extracts HTML elements/content into instance variables upon Document initialization.
-
.remove_extractor(var) ⇒ Boolean
Removes the init_* methods created when an extractor is defined.
-
.remove_extractors ⇒ Object
Removes all default and defined extractors by calling Document.remove_extractor underneath.
-
.text_elements_xpath ⇒ String
Uses Document.text_elements to build an xpath String, used to obtain all of the combined visual text on a webpage.
Instance Method Summary
-
#==(other) ⇒ Boolean
Determines if both the url and html match.
-
#[](range) ⇒ String
Shortcut for calling Document#html[range].
-
#at_css(selector) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_css method to search the doc's html and return the result.
-
#at_xpath(xpath) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_xpath method to search the doc's html and return the result.
-
#base_url(link: nil) ⇒ Wgit::Url
Returns the base URL of this Wgit::Document.
-
#css(selector) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's css method to search the doc's html and return the results.
-
#empty? ⇒ Boolean
Determine if this Document's HTML is empty or not.
-
#external_links ⇒ Array<Wgit::Url>
(also: #external_urls)
Returns all unique external links from this Document in absolute form.
-
#extract(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object
Extracts a value/object from this Document's @html using the given xpath parameter.
-
#extract_from_html(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object
protected
Extracts a value/object from this Document's @html using the given xpath parameter.
-
#extract_from_object(obj, key, singleton: true) {|value, source, type| ... } ⇒ String, Object
protected
Returns a value from the obj using the given key via obj#fetch.
-
#init_nokogiri {|config| ... } ⇒ Nokogiri::HTML
protected
Initializes the nokogiri object using @html, which cannot be nil.
-
#initialize(url_or_obj, html = '', encode: true) ⇒ Document
constructor
Initialize takes either two strings (representing the URL and HTML) or an object representing a database record (of an HTTP crawled web page).
-
#inspect ⇒ String
Overrides Object#inspect to shorten the printed output of a Document.
-
#internal_absolute_links ⇒ Array<Wgit::Url>
(also: #internal_absolute_urls)
Returns all unique internal links from this Document in absolute form by appending them to self's #base_url.
-
#internal_links ⇒ Array<Wgit::Url>
(also: #internal_urls)
Returns all unique internal links from this Document in relative form.
-
#no_index? ⇒ Boolean
Works with the default extractors to extract and check the HTML meta tags instructing Wgit not to index this document (save it to a Database).
-
#search(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ Array<String>
Searches the @text for the given query and returns the results.
-
#search!(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ String
Performs a text search (see Document#search for details) but assigns the results to the @text instance variable.
-
#size ⇒ Integer
Determine the size of this Document's HTML.
-
#stats ⇒ Hash
(also: #statistics)
Returns a Hash containing this Document's instance variables and their #length (if they respond to it).
-
#to_h(include_html: false, include_score: true) ⇒ Hash
Returns a Hash containing this Document's instance vars.
-
#to_json(include_html: false) ⇒ String
Converts this Document's #to_h return value to a JSON String.
-
#xpath(xpath) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's xpath method to search the doc's html and return the results.
Methods included from Assertable
#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types
Constructor Details
#initialize(url_or_obj, html = '', encode: true) ⇒ Document
Initialize takes either two strings (representing the URL and HTML) or an object representing a database record (of an HTTP crawled web page). This allows for initialisation from both crawled web pages and documents/web pages retrieved from the database.
During initialisation, the Document will call any private init_*_from_html and init_*_from_object methods it can find. See the Wgit::Document.define_extractor method for more details.
# File 'lib/wgit/document.rb', line 89
def initialize(url_or_obj, html = '', encode: true)
  if url_or_obj.is_a?(String)
    init_from_strings(url_or_obj, html, encode:)
  else
    init_from_object(url_or_obj, encode:)
  end
end
Class Attribute Details
.extractors ⇒ Object (readonly)
Set of Symbols representing the defined Document extractors. Is read-only. Use Wgit::Document.define_extractor for a new extractor.
# File 'lib/wgit/document.rb', line 55
def extractors
  @extractors
end
.text_elements ⇒ Object (readonly)
Set of HTML elements that make up the visible text on a page. These elements are used to initialize the Wgit::Document#text. See the README.md for how to add to this Set dynamically.
# File 'lib/wgit/document.rb', line 45
def text_elements
  @text_elements
end
.to_h_ignore_vars ⇒ Object (readonly)
Array of instance vars to ignore when Document#to_h and in turn Model.document methods are called. Append your own defined extractor vars to omit them from the model (database object) when indexing. Each var should be a String starting with an '@' char e.g. "@data" etc.
# File 'lib/wgit/document.rb', line 51
def to_h_ignore_vars
  @to_h_ignore_vars
end
Instance Attribute Details
#html ⇒ Object (readonly) Also known as: content
The content/HTML of the document, an instance of String.
# File 'lib/wgit/document.rb', line 62
def html
  @html
end
#parser ⇒ Object (readonly)
The Nokogiri::HTML document object initialized from @html.
# File 'lib/wgit/document.rb', line 65
def parser
  @parser
end
#score ⇒ Object (readonly)
The score is only used following a Database#search and records matches.
# File 'lib/wgit/document.rb', line 68
def score
  @score
end
#url ⇒ Object (readonly)
The URL of the webpage, an instance of Wgit::Url.
# File 'lib/wgit/document.rb', line 59
def url
  @url
end
Class Method Details
.define_extractor(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
Defines a content extractor, which extracts HTML elements/content
into instance variables upon Document initialization. See the default
extractors defined in 'document_extractors.rb' as examples. Defining an
extractor means that every subsequently crawled/initialized document
will attempt to extract the xpath's content. Use #extract for a one-off
content extraction on any document.
Note that defined extractors work for both Documents initialized from HTML (via Wgit::Crawler methods) and from database objects. Once defined, an extractor initializes a private instance variable with the xpath or database object result(s).
When initialising from HTML, a singleton value of true will only
ever return the first result found; otherwise all the results are
returned in an Enumerable. When initialising from a database object, the
value is taken as is and singleton is only used to define the default
empty value. If a value cannot be found (in either the HTML or database
object), then a default will be used. The default value is: singleton ? nil : [].
# File 'lib/wgit/document.rb', line 163
def self.define_extractor(var, xpath, opts = {}, &block)
  var = var.to_sym
  defaults = { singleton: true, text_content_only: true }
  opts = defaults.merge(opts)

  raise "var must match #{REGEX_EXTRACTOR_NAME}" unless \
  var =~ REGEX_EXTRACTOR_NAME

  # Define the private init_*_from_html method for HTML.
  # Gets the HTML's xpath value and creates a var for it.
  func_name = Document.send(:define_method, "init_#{var}_from_html") do
    result = extract_from_html(xpath, **opts, &block)
    init_var(var, result)
  end
  Document.send(:private, func_name)

  # Define the private init_*_from_object method for a Database object.
  # Gets the Object's 'key' value and creates a var for it.
  func_name = Document.send(
    :define_method, "init_#{var}_from_object"
  ) do |obj|
    result = extract_from_object(
      obj, var.to_s, singleton: opts[:singleton], &block
    )
    init_var(var, result)
  end
  Document.send(:private, func_name)

  @extractors << var
  var
end
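The dynamic method definition used by define_extractor can be hard to picture from the listing alone. The following is a minimal sketch of the same idea using only the Ruby standard library. MiniDocument, its single init_<var> method per extractor and the Hash source are all hypothetical stand-ins, not part of Wgit, and the name check here is anchored, which is stricter than REGEX_EXTRACTOR_NAME.

```ruby
# Hypothetical stand-in for Wgit::Document; this is not the library's class.
class MiniDocument
  @extractors = []

  class << self
    attr_reader :extractors

    # Defines a private init_<var> method that fills @<var> from a Hash.
    def define_extractor(var, key, singleton: true)
      var = var.to_sym
      raise 'var must match /[a-z0-9_]+/' unless var.match?(/\A[a-z0-9_]+\z/)

      method_name = define_method("init_#{var}") do |source|
        default = singleton ? nil : []
        instance_variable_set("@#{var}", source.fetch(key, default))
      end
      private method_name

      @extractors << var
      var
    end
  end

  # Calls every init_* method defined so far, mirroring the constructor docs.
  def initialize(source)
    self.class.extractors.each { |var| send("init_#{var}", source) }
  end
end

MiniDocument.define_extractor(:title, 'title')
MiniDocument.define_extractor(:tags, 'tags', singleton: false)

doc = MiniDocument.new({ 'title' => 'Home' })
title = doc.instance_variable_get(:@title) # the fetched value
tags = doc.instance_variable_get(:@tags)   # the default for singleton: false
```

Because the init methods are defined on the class, every MiniDocument created after define_extractor is called picks up the new variable, which matches the "every subsequently crawled/initialized document" behaviour described above.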
.remove_extractor(var) ⇒ Boolean
Removes the init_* methods created when an extractor is defined.
Therefore, this is the opposing method to Document.define_extractor.
Returns true if successful or false if the method(s) cannot be found.
# File 'lib/wgit/document.rb', line 202
def self.remove_extractor(var)
  Document.send(:remove_method, "init_#{var}_from_html")
  Document.send(:remove_method, "init_#{var}_from_object")

  @extractors.delete(var.to_sym)

  true
rescue NameError
  false
end
.remove_extractors ⇒ Object
Removes all default and defined extractors by calling Document.remove_extractor underneath. See its documentation.
# File 'lib/wgit/document.rb', line 215
def self.remove_extractors
  @extractors.each { |var| remove_extractor(var) }
end
.text_elements_xpath ⇒ String
Uses Document.text_elements to build an xpath String, used to obtain all of the combined visual text on a webpage.
# File 'lib/wgit/document.rb', line 103
def self.text_elements_xpath
  @text_elements.each_with_index.reduce('') do |xpath, (el, i)|
    xpath += ' | ' unless i.zero?
    xpath + format('//%s/text()', el)
  end
end
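To see what kind of String this reduce produces, here is a small standalone sketch that runs the same loop over a hypothetical three-element Set. The element names are illustrative only and are not the library's actual text_elements.

```ruby
require 'set'

# Hypothetical subset of text elements, for illustration only.
text_elements = Set.new(%i[p h1 a])

xpath = text_elements.each_with_index.reduce('') do |acc, (el, i)|
  acc += ' | ' unless i.zero?     # Separate each expression with an XPath union.
  acc + format('//%s/text()', el) # Select the text nodes of every <el>.
end

# The same String can be built with map and join.
joined = text_elements.map { |el| "//#{el}/text()" }.join(' | ')
```

Both xpath and joined hold `//p/text() | //h1/text() | //a/text()`, a union of text-node selectors in the Set's insertion order.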
Instance Method Details
#==(other) ⇒ Boolean
Determines if both the url and html match. Use doc.object_id == other.object_id for exact object comparison.
# File 'lib/wgit/document.rb', line 233
def ==(other)
  return false unless other.is_a?(Wgit::Document)

  (@url == other.url) && (@html == other.html)
end
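The difference between value equality and object identity is sketched below with a hypothetical Page class that defines == the same way. Page is a stand-in written for this example and is not part of Wgit.

```ruby
# Hypothetical stand-in that compares by url and html, like Document#==.
class Page
  attr_reader :url, :html

  def initialize(url, html)
    @url = url
    @html = html
  end

  def ==(other)
    return false unless other.is_a?(Page)

    url == other.url && html == other.html
  end
end

a = Page.new('http://example.com', '<p>Hi</p>')
b = Page.new('http://example.com', '<p>Hi</p>')

same_content = (a == b)                    # true: url and html both match
same_object = (a.object_id == b.object_id) # false: two separate instances
```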
#[](range) ⇒ String
Shortcut for calling Document#html[range].
# File 'lib/wgit/document.rb', line 243
def [](range)
  @html[range]
end
#at_css(selector) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_css method to search the doc's html and return the
result. Use #css for returning several results.
# File 'lib/wgit/document.rb', line 393
def at_css(selector)
  @parser.at_css(selector)
end
#at_xpath(xpath) ⇒ Nokogiri::XML::Element
Uses Nokogiri's at_xpath method to search the doc's html and return the
result. Use #xpath for returning several results.
# File 'lib/wgit/document.rb', line 375
def at_xpath(xpath)
  @parser.at_xpath(xpath)
end
#base_url(link: nil) ⇒ Wgit::Url
Returns the base URL of this Wgit::Document. The base URL is either the
<base> element's href value or, if there is no <base>, the origin of @url. A
relative <base> is joined onto @url.to_origin. Use this method instead of
doc.url.to_origin etc. when manually building absolute links from relative
links; or use link.make_absolute(doc).
Provide the link: parameter to get the correct base URL for that type
of link. For example, a link of #top would always return @url (when no
<base> is set) because it applies to that page, not a different one. Query
strings work in the same way. Use this parameter if manually joining Url's e.g.
relative_link = Wgit::Url.new('?q=hello')
absolute_link = doc.base_url(link: relative_link).join(relative_link)
This is similar to how Wgit::Document#internal_absolute_links works.
# File 'lib/wgit/document.rb', line 269
def base_url(link: nil)
  if @url.relative? && @base.nil?
    raise "Document @url ('#{@url}') cannot be relative if <base> is nil"
  end

  if @url.relative? && @base&.relative?
    raise "Document @url ('#{@url}') and <base> ('#{@base}') both can't \
be relative"
  end

  get_base = -> { @base.relative? ? @url.to_origin.join(@base) : @base }

  if link
    link = Wgit::Url.new(link)
    raise "link must be relative: #{link}" unless link.relative?

    if link.is_fragment? || link.is_query?
      base_url = @base ? get_base.call : @url
      return base_url.omit_fragment.omit_query
    end
  end

  base_url = @base ? get_base.call : @url.to_origin
  base_url.omit_fragment.omit_query
end
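The link-dependent choice of base can be illustrated with Ruby's standard URI library. This is a simplified sketch, not the Wgit implementation: base_for is a hypothetical helper, it takes plain Strings and it ignores any <base> element entirely.

```ruby
require 'uri'

# Hypothetical helper: picks the base a relative link should be joined onto.
# Fragment and query links apply to the current page; others to the origin.
def base_for(page_url, link = nil)
  uri = URI.parse(page_url)

  if link&.start_with?('#', '?')
    page = uri.dup
    page.fragment = nil # Drop any fragment from the page URL.
    page.query = nil    # Drop any query string from the page URL.
    return page.to_s
  end

  origin = "#{uri.scheme}://#{uri.host}"
  origin += ":#{uri.port}" unless uri.port == uri.default_port
  origin
end

page_base = base_for('http://example.com/about?x=1#top', '?q=hello')
origin_base = base_for('http://example.com/about', '/contact')
```

Here page_base is `http://example.com/about` and origin_base is `http://example.com`, so `?q=hello` resolves against the page while `/contact` resolves against the site root.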
#css(selector) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's css method to search the doc's html and return the
results. Use #at_css for returning the first result only.
# File 'lib/wgit/document.rb', line 384
def css(selector)
  @parser.css(selector)
end
#empty? ⇒ Boolean
Determine if this Document's HTML is empty or not.
# File 'lib/wgit/document.rb', line 355
def empty?
  return true if @html.nil?

  @html.empty?
end
#external_links ⇒ Array<Wgit::Url> Also known as: external_urls
Returns all unique external links from this Document in absolute form. External meaning a link to a different host.
# File 'lib/wgit/document.rb', line 434
def external_links
  return [] if @links.empty?

  links = @links
          .map do |link|
            if link.scheme_relative?
              link.prefix_scheme(@url.to_scheme.to_sym)
            else
              link
            end
          end
          .reject { |link| link.relative?(host: @url.to_origin) }

  Wgit::Utils.sanitize(links)
end
#extract(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object
Extracts a value/object from this Document's @html using the given xpath parameter.
# File 'lib/wgit/document.rb', line 543
def extract(xpath, singleton: true, text_content_only: true, &block)
  send(:extract_from_html, xpath, singleton:, text_content_only:, &block)
end
#extract_from_html(xpath, singleton: true, text_content_only: true) {|Optionally| ... } ⇒ String, Object (protected)
Extracts a value/object from this Document's @html using the given xpath parameter.
# File 'lib/wgit/document.rb', line 590
def extract_from_html(xpath, singleton: true, text_content_only: true)
  xpath = xpath.call if xpath.respond_to?(:call)
  result = singleton ? at_xpath(xpath) : xpath(xpath)

  if result && text_content_only
    result = singleton ? result.content : result.map(&:content)
  end

  result = Wgit::Utils.sanitize(result)
  result = yield(result, self, :document) if block_given?
  result
end
#extract_from_object(obj, key, singleton: true) {|value, source, type| ... } ⇒ String, Object (protected)
Returns a value from the obj using the given key via obj#fetch.
# File 'lib/wgit/document.rb', line 619
def extract_from_object(obj, key, singleton: true)
  assert_respond_to(obj, :fetch)

  default = singleton ? nil : []
  result = obj.fetch(key.to_s, default)

  result = Wgit::Utils.sanitize(result)
  result = yield(result, obj, :object) if block_given?
  result
end
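The fetch-with-default behaviour is easy to try against a plain Hash. The sketch below uses a hypothetical fetch_value helper that follows the same steps as the listing, minus the Wgit::Utils.sanitize call; the record contents are made up for illustration.

```ruby
# Hypothetical helper mirroring the listing above, without sanitization.
def fetch_value(obj, key, singleton: true)
  raise ArgumentError, 'obj must respond to #fetch' unless obj.respond_to?(:fetch)

  default = singleton ? nil : []         # Empty value depends on singleton.
  result = obj.fetch(key.to_s, default)  # Keys are looked up as Strings.
  result = yield(result, obj, :object) if block_given?
  result
end

# Made-up database record.
record = { 'title' => ' Home ', 'links' => ['/about'] }

title = fetch_value(record, :title) { |value, _source, _type| value.strip }
missing_one = fetch_value(record, :author)
missing_many = fetch_value(record, :keywords, singleton: false)
```

The block receives the fetched value along with its source and a type Symbol, so a single block can be shared between HTML and database object extraction and branch on the type.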
#init_nokogiri {|config| ... } ⇒ Nokogiri::HTML (protected)
Initializes the nokogiri object using @html, which cannot be nil. Override this method to custom configure the Nokogiri object returned. Gets called from Wgit::Document.new upon initialization.
# File 'lib/wgit/document.rb', line 567
def init_nokogiri(&block)
  raise '@html must be set' unless @html

  Nokogiri::HTML(@html, &block)
end
#inspect ⇒ String
Overrides Object#inspect to shorten the printed output of a Document.
# File 'lib/wgit/document.rb', line 224
def inspect
  "#<Wgit::Document url=\"#{@url}\" html_size=#{size}>"
end
#internal_absolute_links ⇒ Array<Wgit::Url> Also known as: internal_absolute_urls
Returns all unique internal links from this Document in absolute form by appending them to self's #base_url. Also see Wgit::Document#internal_links.
# File 'lib/wgit/document.rb', line 426
def internal_absolute_links
  internal_links.map { |link| link.make_absolute(self) }
end
#internal_links ⇒ Array<Wgit::Url> Also known as: internal_urls
Returns all unique internal links from this Document in relative form. Internal meaning a link to another document on the same host.
This Document's host is used to determine if an absolute URL is actually a relative link e.g. for a Document representing http://www.server.com/about, an absolute link to another page on www.server.com will be recognized and returned as an internal link because both Documents live on the same host. Also see Wgit::Document#internal_absolute_links.
# File 'lib/wgit/document.rb', line 408
def internal_links
  return [] if @links.empty?

  links = @links
          .select { |link| link.relative?(host: @url.to_origin) }
          .map(&:omit_base)
          .map do |link| # Map @url.to_host into / as it's a duplicate.
            link.to_host == @url.to_host ? Wgit::Url.new('/') : link
          end

  Wgit::Utils.sanitize(links)
end
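The internal/external split boils down to a host comparison. A rough sketch with the standard URI library follows; internal? is a hypothetical helper, the links are invented, and unlike the listings above it does no scheme prefixing, base stripping or sanitizing.

```ruby
require 'uri'

# Hypothetical helper: a link is internal if it has no host of its own
# or if its host matches the host of the page it was found on.
def internal?(link, page_url)
  link_host = URI.parse(link).host
  link_host.nil? || link_host == URI.parse(page_url).host
end

page_url = 'http://www.server.com/about'
links = [
  '/contact',                     # Relative path, so internal.
  'http://www.server.com/search', # Absolute but same host, so internal.
  'http://other.com/',            # Different host, so external.
  '//cdn.example.org/app.js'      # Scheme-relative to another host.
]

internal, external = links.partition { |link| internal?(link, page_url) }
```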
#no_index? ⇒ Boolean
Works with the default extractors to extract and check the HTML meta tags instructing Wgit not to index this document (save it to a Database). If the default extractors are removed, this method will always return false.
# File 'lib/wgit/document.rb', line 553
def no_index?
  [@meta_robots, @meta_wgit].include?('noindex')
end
#search(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ Array<String>
Searches the @text for the given query and returns the results.
The number of search hits for each sentence is recorded internally and used to rank/sort the search results before being returned. Where the Wgit::Database#search method searches all documents for the most hits, this method searches each document's @text for the most hits.
Each search result comprises a sentence of a given length. The length will be based on the sentence_limit parameter or the full length of the original sentence, whichever is less. The algorithm ensures that the search query is visible somewhere in the sentence.
# File 'lib/wgit/document.rb', line 470
def search(
  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
)
  raise 'The sentence_limit value must be even' if sentence_limit.odd?

  if query.is_a?(Regexp)
    regex = query
  else # query.respond_to? :to_s == true
    query = query.to_s
    query = query.gsub(' ', '|') unless whole_sentence
    regex = Regexp.new(query, !case_sensitive)
  end

  results = {}

  @text.each do |sentence|
    sentence = sentence.strip
    next if results[sentence]

    hits = sentence.scan(regex).count
    next unless hits.positive?

    index = sentence.index(regex) # Index of first match.
    Wgit::Utils.format_sentence_length(sentence, index, sentence_limit)

    results[sentence] = hits
  end

  return [] if results.empty?

  results = Hash[results.sort_by { |_k, v| v }]
  results.keys.reverse
end
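The hit-count ranking described above can be reproduced on a plain Array of Strings. This sketch is an approximation: rank_sentences is a hypothetical helper, the sentences are invented, and the sentence_limit trimming done by Wgit::Utils.format_sentence_length is left out.

```ruby
# Hypothetical helper: ranks sentences by how often the query matches.
def rank_sentences(sentences, query, case_sensitive: false, whole_sentence: true)
  # Without whole_sentence, each word of the query is matched on its own.
  pattern = whole_sentence ? query : query.gsub(' ', '|')
  regex = Regexp.new(pattern, case_sensitive ? 0 : Regexp::IGNORECASE)

  hits = {}
  sentences.each do |sentence|
    sentence = sentence.strip
    next if hits.key?(sentence) # Skip duplicate sentences.

    count = sentence.scan(regex).count
    hits[sentence] = count if count.positive?
  end

  # Most hits first.
  hits.sort_by { |_sentence, count| -count }.map(&:first)
end

text = ['Ruby is fun', 'ruby loves Ruby gems', 'Python too']

ranked = rank_sentences(text, 'ruby')
any_word = rank_sentences(text, 'fun python', whole_sentence: false)
strict = rank_sentences(text, 'ruby', case_sensitive: true)
```

With these inputs, ranked lists 'ruby loves Ruby gems' (two hits) ahead of 'Ruby is fun' (one hit), and strict only contains the sentence with a lowercase 'ruby'.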
#search!(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ String
Performs a text search (see Document#search for details) but assigns the results to the @text instance variable. This can be used for sub search functionality. The original text is returned; no other reference to it is kept thereafter.
# File 'lib/wgit/document.rb', line 517
def search!(
  query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
)
  orig_text = @text
  @text = search(query, case_sensitive:, whole_sentence:, sentence_limit:)

  orig_text
end
#size ⇒ Integer
Determine the size of this Document's HTML.
# File 'lib/wgit/document.rb', line 348
def size
  stats[:html]
end
#stats ⇒ Hash Also known as: statistics
Returns a Hash containing this Document's instance variables and their #length (if they respond to it). Works dynamically so that any user defined extractors (and their created instance vars) will appear in the returned Hash as well. The number of text snippets as well as total number of textual bytes are always included in the returned Hash.
# File 'lib/wgit/document.rb', line 327
def stats
  hash = {}
  instance_variables.each do |var|
    # Add up the total bytes of text as well as the length.
    if var == :@text
      hash[:text] = @text.length
      hash[:text_bytes] = @text.sum(&:length)
    # Else take the var's #length method return value.
    else
      next unless instance_variable_get(var).respond_to?(:length)

      hash[var[1..].to_sym] = instance_variable_get(var).send(:length)
    end
  end

  hash
end
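The dynamic nature of stats is easiest to see on a small object. The sketch below applies the same idea to a hypothetical Page class with made-up instance variables; it is a stand-in written for this example, not Wgit code.

```ruby
# Hypothetical stand-in with a few instance variables of different types.
class Page
  def initialize
    @title = 'Home'
    @text = ['Hello', 'World!']
    @score = 1.5 # Floats have no #length, so this var is skipped.
  end

  def stats
    instance_variables.each_with_object({}) do |var, hash|
      value = instance_variable_get(var)
      next unless value.respond_to?(:length)

      hash[var[1..].to_sym] = value.length # :@title becomes :title.
      # For @text, also total the length of every snippet.
      hash[:text_bytes] = value.sum(&:length) if var == :@text
    end
  end
end

page_stats = Page.new.stats
```

The resulting Hash has title: 4, text: 2 and text_bytes: 11, and any instance variable added later shows up automatically as long as it responds to #length.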
#to_h(include_html: false, include_score: true) ⇒ Hash
Returns a Hash containing this Document's instance vars. Used when storing the Document in a Database e.g. MongoDB etc. By default the @html var is excluded from the returned Hash.
# File 'lib/wgit/document.rb', line 302
def to_h(include_html: false, include_score: true)
  ignore = Wgit::Document.to_h_ignore_vars.dup
  ignore << '@html' unless include_html
  ignore << '@score' unless include_score

  Wgit::Utils.to_h(self, ignore:)
end
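The ignore list works by filtering instance variable names before they are copied into the Hash. The sketch below shows one way that could look. It assumes, without confirmation from the source, that Wgit::Utils.to_h strips the leading '@' and uses String keys; Page is a hypothetical stand-in class.

```ruby
require 'json'

# Hypothetical stand-in; the key format is an assumption, not Wgit's contract.
class Page
  def initialize
    @url = 'http://example.com'
    @html = '<p>Hi</p>'
    @title = 'Hi'
    @score = 0.0
  end

  def to_h(include_html: false, include_score: true)
    ignore = []
    ignore << '@html' unless include_html
    ignore << '@score' unless include_score

    instance_variables.each_with_object({}) do |var, hash|
      next if ignore.include?(var.to_s) # Names are compared with the '@'.

      hash[var.to_s.delete_prefix('@')] = instance_variable_get(var)
    end
  end

  def to_json(include_html: false)
    JSON.generate(to_h(include_html: include_html))
  end
end

page = Page.new
default_keys = page.to_h.keys
with_html = page.to_h(include_html: true, include_score: false).keys
json = page.to_json
```

Entries in the ignore list keep their '@' prefix, which is why to_h_ignore_vars expects Strings such as "@data".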
#to_json(include_html: false) ⇒ String
Converts this Document's #to_h return value to a JSON String.
# File 'lib/wgit/document.rb', line 315
def to_json(include_html: false)
  h = to_h(include_html:)
  JSON.generate(h)
end
#xpath(xpath) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's xpath method to search the doc's html and return the
results. Use #at_xpath for returning the first result only.
# File 'lib/wgit/document.rb', line 366
def xpath(xpath)
  @parser.xpath(xpath)
end