Class: Wgit::Document
- Inherits: Object
  - Object
  - Wgit::Document
- Includes: Assertable
- Defined in: lib/wgit/document.rb
Overview
Class primarily modeling an HTML web document, although other MIME types
(e.g. images) will also work. Also doubles as a search result when
loading Documents from the database via Wgit::Database#search.
The initialize method dynamically initializes instance variables from the
Document HTML / Database object e.g. text. This is dynamic so that the
Document class can be easily extended, allowing you to pull out the parts of
a webpage that are important to you. See Wgit::Document.define_extension.
Constant Summary
- REGEX_EXTENSION_NAME = /[a-z0-9_]+/.freeze
Regex for the allowed var names when defining an extension.
- TEXT_ELEMENTS_XPATH = '//*/text()'.freeze
The xpath used to extract the visible text on a page.
Constants included from Assertable
Assertable::DEFAULT_DUCK_FAIL_MSG, Assertable::DEFAULT_REQUIRED_KEYS_MSG, Assertable::DEFAULT_TYPE_FAIL_MSG, Assertable::WRONG_METHOD_MSG
Class Attribute Summary
-
.extensions ⇒ Object
readonly
Class level attr_reader for the Document defined extensions.
Instance Attribute Summary
-
#doc ⇒ Object
readonly
The Nokogiri::HTML document object initialized from @html.
-
#html ⇒ Object
(also: #content)
readonly
The content/HTML of the document, an instance of String.
-
#score ⇒ Object
readonly
The score is only used following a Database#search and records matches.
-
#url ⇒ Object
readonly
The URL of the webpage, an instance of Wgit::Url.
Class Method Summary
-
.define_extension(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
Defines an extension, which is a way to serialise HTML elements into instance variables upon Document initialization.
-
.remove_extension(var) ⇒ Boolean
Removes the init_* methods created when an extension is defined.
Instance Method Summary
-
#==(other) ⇒ Boolean
Determines if both the url and html match.
-
#[](range) ⇒ String
Is a shortcut for calling Document#html[range].
-
#base_url(link: nil) ⇒ Wgit::Url
Returns the base URL of this Wgit::Document.
-
#css(selector) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's css method to search the doc's html and return the results.
-
#empty? ⇒ Boolean
Determine if this Document's HTML is empty or not.
-
#external_links ⇒ Array<Wgit::Url>
(also: #external_urls)
Returns all external links from this Document in absolute form.
-
#find_in_html(xpath, singleton: true, text_content_only: true) {|value, source| ... } ⇒ String, Object
protected
Returns a value/object from this Document's @html using the given xpath parameter.
-
#find_in_object(obj, key, singleton: true) {|value, source| ... } ⇒ String, Object
protected
Returns a value from the obj using the given key via obj#fetch.
-
#init_nokogiri ⇒ Nokogiri::HTML
protected
Initializes the nokogiri object using @html, which cannot be nil.
-
#initialize(url_or_obj, html = '', encode: true) ⇒ Document
constructor
Initialize takes either two strings (representing the URL and HTML) or an object representing a database record (of an HTTP-crawled web page).
-
#internal_absolute_links ⇒ Array<Wgit::Url>
(also: #internal_absolute_urls)
Returns all internal links from this Document in absolute form by appending them to self's #base_url.
-
#internal_links ⇒ Array<Wgit::Url>
(also: #internal_urls)
Returns all internal links from this Document in relative form.
-
#search(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ Array<String>
Searches the @text for the given query and returns the results.
-
#search!(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ String
Performs a text search (see Document#search for details) but assigns the results to the @text instance variable.
-
#size ⇒ Integer
Determine the size of this Document's HTML.
-
#stats ⇒ Hash
(also: #statistics)
Returns a Hash containing this Document's instance variables and their #length (if they respond to it).
-
#to_h(include_html: false, include_score: true) ⇒ Hash
Returns a Hash containing this Document's instance vars.
-
#to_json(include_html: false) ⇒ String
Converts this Document's #to_h return value to a JSON String.
-
#xpath(xpath) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's xpath method to search the doc's html and return the results.
Methods included from Assertable
#assert_arr_types, #assert_required_keys, #assert_respond_to, #assert_types
Constructor Details
#initialize(url_or_obj, html = '', encode: true) ⇒ Document
Initialize takes either two strings (representing the URL and HTML) or an object representing a database record (of an HTTP-crawled web page). This allows for initialisation from both crawled web pages and documents/web pages retrieved from the database.
During initialisation, the Document will call any private
init_*_from_html and init_*_from_object methods it can find. See the
README.md and Wgit::Document.define_extension method for more details.
    # File 'lib/wgit/document.rb', line 65
    def initialize(url_or_obj, html = '', encode: true)
      if url_or_obj.is_a?(String)
        init_from_strings(url_or_obj, html, encode: encode)
      else
        init_from_object(url_or_obj, encode: encode)
      end
    end
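The two construction paths can be sketched without Wgit. This is a minimal stand-in, not the library's implementation: `SketchDocument` is a hypothetical class, and the record is assumed to be a Hash with String keys `'url'` and `'html'`.

```ruby
# Hypothetical stand-in for Wgit::Document; Wgit itself is not loaded.
class SketchDocument
  attr_reader :url, :html

  # Accepts either (url, html) Strings or a Hash-like database record.
  def initialize(url_or_obj, html = '')
    if url_or_obj.is_a?(String)
      @url  = url_or_obj
      @html = html
    else
      # Assumed record shape: String keys 'url' and 'html'.
      @url  = url_or_obj.fetch('url')
      @html = url_or_obj.fetch('html', '')
    end
  end
end

from_strings = SketchDocument.new('http://example.com', '<p>Hi</p>')
from_record  = SketchDocument.new({ 'url' => 'http://example.com', 'html' => '<p>Hi</p>' })
puts from_strings.url == from_record.url
```

Both calls should leave the object in the same state, which is the point of accepting either argument form.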
Class Attribute Details
.extensions ⇒ Object (readonly)
Class level attr_reader for the Document defined extensions.
    # File 'lib/wgit/document.rb', line 31
    def extensions
      @extensions
    end
Instance Attribute Details
#doc ⇒ Object (readonly)
The Nokogiri::HTML document object initialized from @html.
    # File 'lib/wgit/document.rb', line 41
    def doc
      @doc
    end
#html ⇒ Object (readonly) Also known as: content
The content/HTML of the document, an instance of String.
    # File 'lib/wgit/document.rb', line 38
    def html
      @html
    end
#score ⇒ Object (readonly)
The score is only used following a Database#search and records matches.
    # File 'lib/wgit/document.rb', line 44
    def score
      @score
    end
#url ⇒ Object (readonly)
The URL of the webpage, an instance of Wgit::Url.
    # File 'lib/wgit/document.rb', line 35
    def url
      @url
    end
Class Method Details
.define_extension(var, xpath, opts = {}) {|value, source, type| ... } ⇒ Symbol
Defines an extension, which is a way to serialise HTML elements into instance variables upon Document initialization. See the default extensions defined in 'document_extensions.rb' as examples.
Note that defined extensions work for both Documents initialized from HTML (via Wgit::Crawler methods) and from database objects. Once defined, an extension initializes a private instance variable with the xpath or database object result(s).
When initialising from HTML, a singleton value of true will only
ever return one result; otherwise all xpath results are returned in an
Array. When initialising from a database object, the value is taken as
is and singleton is only used to define the default empty value.
If a value cannot be found (in either the HTML or database object), then
a default will be used. The default value is: singleton ? nil : [].
    # File 'lib/wgit/document.rb', line 118
    def self.define_extension(var, xpath, opts = {}, &block)
      var = var.to_sym
      defaults = { singleton: true, text_content_only: true }
      opts = defaults.merge(opts)

      raise "var must match #{REGEX_EXTENSION_NAME}" unless \
      var =~ REGEX_EXTENSION_NAME

      # Define the private init_*_from_html method for HTML.
      # Gets the HTML's xpath value and creates a var for it.
      func_name = Document.send(:define_method, "init_#{var}_from_html") do
        result = find_in_html(xpath, opts, &block)
        init_var(var, result)
      end
      Document.send(:private, func_name)

      # Define the private init_*_from_object method for a Database object.
      # Gets the Object's 'key' value and creates a var for it.
      func_name = Document.send(:define_method, "init_#{var}_from_object") do |obj|
        result = find_in_object(obj, var.to_s, singleton: opts[:singleton], &block)
        init_var(var, result)
      end
      Document.send(:private, func_name)

      @extensions << var
      var
    end
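The mechanism described above (define a private init_* method per extension, then call every init_* method during initialization) can be illustrated with plain Ruby. This is a rough sketch, not Wgit's code: `ExtensibleDoc` is a hypothetical class, the extractor is a lambda rather than an xpath so no HTML parser is needed, and there is only one init method per extension instead of the `_from_html` / `_from_object` pair.

```ruby
# Hypothetical stand-in for Wgit::Document; Wgit and Nokogiri are not loaded.
class ExtensibleDoc
  @extensions = []

  class << self
    attr_reader :extensions

    # Defines a private init_<var> method and records the extension name.
    def define_extension(var, extractor, singleton: true)
      var = var.to_sym

      name = define_method("init_#{var}") do
        result = extractor.call(@html)
        # Fall back to the default empty value: singleton ? nil : [].
        result = singleton ? nil : [] if result.nil?
        instance_variable_set("@#{var}", result)
      end
      private name

      @extensions << var
      var
    end
  end

  def initialize(html)
    @html = html
    # Call every init_* method that has been defined so far.
    private_methods.grep(/\Ainit_/).each { |m| send(m) }
  end
end

ExtensibleDoc.define_extension(:title, ->(html) { html[/<title>(.*?)<\/title>/, 1] })
doc = ExtensibleDoc.new('<title>Hello</title>')
puts doc.instance_variable_get(:@title)
```

Because the init methods are looked up at construction time, an extension only affects documents created after it is defined.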
.remove_extension(var) ⇒ Boolean
Removes the init_* methods created when an extension is defined. Therefore, this is the opposing method to Document.define_extension. Returns true if successful or false if the method(s) cannot be found.
    # File 'lib/wgit/document.rb', line 153
    def self.remove_extension(var)
      Document.send(:remove_method, "init_#{var}_from_html")
      Document.send(:remove_method, "init_#{var}_from_object")

      @extensions.delete(var.to_sym)

      true
    rescue NameError
      false
    end
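The true/false return value relies on Module#remove_method raising NameError when the method is not defined on the class. A small standalone sketch of that behaviour, using a hypothetical `RemovableDoc` class and `remove_init` helper:

```ruby
# Hypothetical class used only to demonstrate Module#remove_method.
class RemovableDoc
  def init_title_from_html; end

  # Returns true if the method was removed, false if it was not defined.
  def self.remove_init(var)
    remove_method("init_#{var}_from_html")
    true
  rescue NameError
    false
  end
end

puts RemovableDoc.remove_init(:title) # the method existed
puts RemovableDoc.remove_init(:title) # the method is already gone
```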
Instance Method Details
#==(other) ⇒ Boolean
Determines if both the url and html match. Use doc.object_id == other.object_id for exact object comparison.
    # File 'lib/wgit/document.rb', line 170
    def ==(other)
      return false unless other.is_a?(Wgit::Document)

      (@url == other.url) && (@html == other.html)
    end
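The difference between this value comparison and object identity can be shown with a hypothetical `SketchEqDoc` class that compares the same two fields:

```ruby
# Hypothetical stand-in comparing url and html, as the method above does.
class SketchEqDoc
  attr_reader :url, :html

  def initialize(url, html)
    @url = url
    @html = html
  end

  def ==(other)
    return false unless other.is_a?(SketchEqDoc)

    url == other.url && html == other.html
  end
end

a = SketchEqDoc.new('http://example.com', '<p>Hi</p>')
b = SketchEqDoc.new('http://example.com', '<p>Hi</p>')
puts a == b                       # same url and html
puts a.object_id == b.object_id   # two distinct objects
```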
#[](range) ⇒ String
Is a shortcut for calling Document#html[range].
    # File 'lib/wgit/document.rb', line 180
    def [](range)
      @html[range]
    end
#base_url(link: nil) ⇒ Wgit::Url
Returns the base URL of this Wgit::Document. The base URL is either the
<base> element's href value or @url (if @base is nil). If @base is present
and relative, then @url.to_base + @base is returned. Use this method
instead of doc.url.to_base etc. when manually building
absolute links from relative links; or use link.prefix_base(doc).
Provide the link: parameter to get the correct base URL for that type
of link. For example, a link of #top would always return @url because
it applies to that page, not a different one. Query strings work in the
same way. Use this parameter if manually concatenating Urls e.g.
    relative_link = Wgit::Url.new('?q=hello')
    absolute_link = doc.base_url(link: relative_link).concat(relative_link)
This is similar to how Wgit::Document#internal_absolute_links works.
    # File 'lib/wgit/document.rb', line 206
    def base_url(link: nil)
      raise "Document @url ('#{@url}') cannot be relative if <base> is nil" \
      if @url.relative? && @base.nil?
      raise "Document @url ('#{@url}') and <base> ('#{@base}') both can't be relative" \
      if @url.relative? && @base&.relative?

      get_base = -> { @base.relative? ? @url.to_base.concat(@base) : @base }

      if link
        link = Wgit::Url.new(link)
        raise "link must be relative: #{link}" unless link.relative?

        if link.is_fragment? || link.is_query?
          base_url = @base ? get_base.call : @url
          return base_url.omit_fragment.omit_query
        end
      end

      base_url = @base ? get_base.call : @url.to_base
      base_url.omit_fragment.omit_query
    end
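The fragment/query rule can be approximated with Ruby's standard URI class. The helper below is a hypothetical sketch (`sketch_base_url` is not part of Wgit), it takes plain Strings rather than Wgit::Url objects, and it ignores the <base> element entirely.

```ruby
require 'uri'

# Hypothetical helper mirroring the fragment/query rule; no <base> handling.
def sketch_base_url(page_url, link)
  uri = URI.parse(page_url)
  uri.fragment = nil
  uri.query = nil

  # Fragment and query links apply to the current page, so keep its path.
  return uri.to_s if link.start_with?('#', '?')

  # Any other relative link resolves against scheme + host only.
  uri.path = ''
  uri.to_s
end

puts sketch_base_url('http://example.com/about?x=1#top', '?q=hello')
puts sketch_base_url('http://example.com/about?x=1#top', 'contact')
```

The first call keeps the `/about` path because the query string belongs to that page; the second drops it because `contact` is resolved from the site root in this sketch.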
#css(selector) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's css method to search the doc's html and return the results.
    # File 'lib/wgit/document.rb', line 308
    def css(selector)
      @doc.css(selector)
    end
#empty? ⇒ Boolean
Determine if this Document's HTML is empty or not.
    # File 'lib/wgit/document.rb', line 288
    def empty?
      return true if @html.nil?

      @html.empty?
    end
#external_links ⇒ Array<Wgit::Url> Also known as: external_urls
Returns all external links from this Document in absolute form. External meaning a link to a different host.
    # File 'lib/wgit/document.rb', line 349
    def external_links
      return [] if @links.empty?

      links = @links
              .reject { |link| link.relative?(host: @url.to_base) }
              .map(&:omit_trailing_slash)

      Wgit::Utils.process_arr(links)
    end
#find_in_html(xpath, singleton: true, text_content_only: true) {|value, source| ... } ⇒ String, Object (protected)
Returns a value/object from this Document's @html using the given xpath parameter.
    # File 'lib/wgit/document.rb', line 466
    def find_in_html(xpath, singleton: true, text_content_only: true)
      default = singleton ? nil : []
      xpath = xpath.call if xpath.respond_to?(:call)
      results = @doc.xpath(xpath)

      return default if results.nil? || results.empty?

      result = if singleton
                 text_content_only ? results.first.content : results.first
               else
                 text_content_only ? results.map(&:content) : results
               end

      singleton ? Wgit::Utils.process_str(result) : Wgit::Utils.process_arr(result)

      if block_given?
        new_result = yield(result, self, :document)
        result = new_result unless new_result.nil?
      end

      result
    end
#find_in_object(obj, key, singleton: true) {|value, source| ... } ⇒ String, Object (protected)
Returns a value from the obj using the given key via obj#fetch.
    # File 'lib/wgit/document.rb', line 500
    def find_in_object(obj, key, singleton: true)
      assert_respond_to(obj, :fetch)

      default = singleton ? nil : []
      result = obj.fetch(key.to_s, default)

      singleton ? Wgit::Utils.process_str(result) : Wgit::Utils.process_arr(result)

      if block_given?
        new_result = yield(result, obj, :object)
        result = new_result unless new_result.nil?
      end

      result
    end
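The fetch-with-default and block-override steps can be tried on a plain Hash. This is a simplified sketch: `sketch_find_in_object` is a hypothetical helper, it omits the Wgit::Utils processing step, and it yields two arguments rather than three.

```ruby
# Hypothetical helper mirroring the fetch-with-default and block-override steps.
def sketch_find_in_object(obj, key, singleton: true)
  raise ArgumentError, 'obj must respond to #fetch' unless obj.respond_to?(:fetch)

  default = singleton ? nil : []
  result = obj.fetch(key.to_s, default)

  if block_given?
    # A nil block result keeps the fetched value unchanged.
    new_result = yield(result, obj)
    result = new_result unless new_result.nil?
  end

  result
end

record = { 'title' => 'Home', 'keywords' => %w[a b] }
puts sketch_find_in_object(record, :title)
p sketch_find_in_object(record, :missing, singleton: false)
puts sketch_find_in_object(record, :title) { |value, _source| value.upcase }
```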
#init_nokogiri ⇒ Nokogiri::HTML (protected)
Initializes the nokogiri object using @html, which cannot be nil. Override this method to custom configure the Nokogiri object returned. Gets called from Wgit::Document.new upon initialization.
    # File 'lib/wgit/document.rb', line 442
    def init_nokogiri
      raise '@html must be set' unless @html

      Nokogiri::HTML(@html) do |config|
        # TODO: Remove #'s below when crawling in production.
        # config.options = Nokogiri::XML::ParseOptions::STRICT |
        #                  Nokogiri::XML::ParseOptions::NONET
      end
    end
#internal_absolute_links ⇒ Array<Wgit::Url> Also known as: internal_absolute_urls
Returns all internal links from this Document in absolute form by appending them to self's #base_url. Also see Wgit::Document#internal_links.
    # File 'lib/wgit/document.rb', line 341
    def internal_absolute_links
      internal_links.map { |link| link.prefix_base(self) }
    end
#internal_links ⇒ Array<Wgit::Url> Also known as: internal_urls
Returns all internal links from this Document in relative form. Internal meaning a link to another document on the same host.
This Document's host is used to determine if an absolute URL is actually a relative link e.g. for a Document representing http://www.server.com/about, an absolute link on that same host will be recognized and returned as an internal link because both Documents live on the same host. Also see Wgit::Document#internal_absolute_links.
    # File 'lib/wgit/document.rb', line 323
    def internal_links
      return [] if @links.empty?

      links = @links
              .select { |link| link.relative?(host: @url.to_base) }
              .map(&:omit_base)
              .map do |link| # Map @url.to_host into / as it's a duplicate.
        link.to_host == @url.to_host ? Wgit::Url.new('/') : link
      end

      Wgit::Utils.process_arr(links)
    end
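The host-based split between internal and external links can be sketched with the standard URI class. `sketch_partition_links` is a hypothetical helper, the URLs are made up for illustration, and unlike the method above it leaves internal links in their original form instead of stripping the base.

```ruby
require 'uri'

# Hypothetical helper: splits links into [internal, external] by host.
def sketch_partition_links(page_url, links)
  host = URI.parse(page_url).host
  links.partition do |link|
    link_host = URI.parse(link).host
    # Relative links have no host; absolute links must share the page's host.
    link_host.nil? || link_host == host
  end
end

internal, external = sketch_partition_links(
  'http://www.server.com/about',
  ['/contact', 'http://www.server.com/search', 'http://other.com/page']
)
p internal
p external
```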
#search(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ Array<String>
Searches the @text for the given query and returns the results.
The number of search hits for each sentence is recorded internally and used to rank/sort the search results before being returned. Whereas the Wgit::Database#search method searches all documents for the most hits, this method searches this document's @text for the most hits.
Each search result comprises a sentence of a given length. The length will be based on the sentence_limit parameter or the full length of the original sentence, whichever is less. The algorithm ensures that the search query is visible somewhere in the sentence.
    # File 'lib/wgit/document.rb', line 379
    def search(
      query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
    )
      query = query.to_s
      raise 'A search query must be provided' if query.empty?
      raise 'The sentence_limit value must be even' if sentence_limit.odd?

      query = query.gsub(' ', '|') unless whole_sentence
      regex = Regexp.new(query, !case_sensitive)
      results = {}

      @text.each do |sentence|
        sentence = sentence.strip
        next if results[sentence]

        hits = sentence.scan(regex).count
        next unless hits.positive?

        index = sentence.index(regex) # Index of first match.
        Wgit::Utils.format_sentence_length(sentence, index, sentence_limit)

        results[sentence] = hits
      end

      return [] if results.empty?

      results = Hash[results.sort_by { |_k, v| v }]
      results.keys.reverse
    end
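The ranking idea (count the matches in each sentence, discard sentences without any, return the rest with the most hits first) can be sketched on a plain Array of Strings. `sketch_search` is a hypothetical helper, not Wgit's implementation: it escapes the query before building the regex and it does not trim sentences to a sentence_limit.

```ruby
# Hypothetical helper, not Wgit's implementation. Unlike the source above it
# escapes the query and does not trim sentences to a sentence_limit.
def sketch_search(sentences, query, case_sensitive: false, whole_sentence: true)
  query = query.to_s
  raise ArgumentError, 'A search query must be provided' if query.empty?

  # Without whole_sentence, each word of the query becomes an alternative.
  words = whole_sentence ? [query] : query.split
  pattern = words.map { |word| Regexp.escape(word) }.join('|')
  regex = Regexp.new(pattern, case_sensitive ? 0 : Regexp::IGNORECASE)

  # Count hits per unique sentence, drop misses, then sort by most hits first.
  hits = sentences.map(&:strip).uniq.to_h { |s| [s, s.scan(regex).count] }
  hits.select { |_s, count| count.positive? }
      .sort_by { |_s, count| -count }
      .map(&:first)
end

text = ['Ruby is fun.', 'Ruby loves Ruby.', 'Nothing here.']
p sketch_search(text, 'ruby')
```

The order of sentences with an equal hit count is not specified in this sketch.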
#search!(query, case_sensitive: false, whole_sentence: true, sentence_limit: 80) ⇒ String
Performs a text search (see Document#search for details) but assigns the results to the @text instance variable. This can be used for sub search functionality. The original text is returned; no other reference to it is kept thereafter.
    # File 'lib/wgit/document.rb', line 422
    def search!(
      query, case_sensitive: false, whole_sentence: true, sentence_limit: 80
    )
      orig_text = @text
      @text = search(
        query, case_sensitive: case_sensitive,
               whole_sentence: whole_sentence, sentence_limit: sentence_limit
      )

      orig_text
    end
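The sub-search behaviour (each call narrows @text and hands back the previous contents) can be illustrated with a hypothetical `SketchTextDoc` class. Matching here is a plain case-insensitive substring test, which is an assumption made for brevity rather than Wgit's ranked search.

```ruby
# Hypothetical stand-in holding only a @text Array.
class SketchTextDoc
  attr_reader :text

  def initialize(text)
    @text = text
  end

  # Narrows @text to the matching sentences and returns the previous @text.
  def search!(query)
    orig_text = @text
    @text = @text.select { |sentence| sentence.downcase.include?(query.downcase) }
    orig_text
  end
end

doc = SketchTextDoc.new(['Ruby on Rails', 'Ruby gems', 'Python eggs'])
doc.search!('ruby') # @text now holds only the Ruby sentences
doc.search!('gems') # sub-search within the previous results
p doc.text
```

Keep the return value of the first call if the original text is needed later, since the object no longer references it.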
#size ⇒ Integer
Determine the size of this Document's HTML.
    # File 'lib/wgit/document.rb', line 281
    def size
      stats[:html]
    end
#stats ⇒ Hash Also known as: statistics
Returns a Hash containing this Document's instance variables and their #length (if they respond to it). Works dynamically so that any user defined extensions (and their created instance vars) will appear in the returned Hash as well. The number of text snippets as well as total number of textual bytes are always included in the returned Hash.
    # File 'lib/wgit/document.rb', line 260
    def stats
      hash = {}
      instance_variables.each do |var|
        # Add up the total bytes of text as well as the length.
        if var == :@text
          hash[:text_snippets] = @text.length
          hash[:text_bytes] = @text.sum(&:length)
        # Else take the var's #length method return value.
        else
          next unless instance_variable_get(var).respond_to?(:length)

          hash[var[1..-1].to_sym] = instance_variable_get(var).send(:length)
        end
      end

      hash
    end
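The dynamic walk over instance variables can be reproduced in a small hypothetical class. `SketchStatsDoc` and its four variables are assumptions chosen to show both branches plus a value (a Float) that has no #length and is therefore skipped.

```ruby
# Hypothetical stand-in; the instance variables are chosen for illustration.
class SketchStatsDoc
  def initialize(url, html, text)
    @url = url
    @html = html
    @text = text
    @score = 0.0 # Float has no #length, so it is left out of the stats.
  end

  def stats
    instance_variables.each_with_object({}) do |var, hash|
      value = instance_variable_get(var)
      if var == :@text
        hash[:text_snippets] = value.length
        hash[:text_bytes] = value.sum(&:length)
      elsif value.respond_to?(:length)
        hash[var.to_s.delete_prefix('@').to_sym] = value.length
      end
    end
  end
end

doc = SketchStatsDoc.new('http://example.com', '<p>Hi</p>', %w[Hi Bye])
p doc.stats
```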
#to_h(include_html: false, include_score: true) ⇒ Hash
Returns a Hash containing this Document's instance vars. Used when storing the Document in a Database e.g. MongoDB. By default the @html var is excluded from the returned Hash; the Nokogiri @doc var is always excluded.
    # File 'lib/wgit/document.rb', line 235
    def to_h(include_html: false, include_score: true)
      ignore = include_html ? [] : ['@html']
      ignore << '@score' unless include_score
      ignore << '@doc' # Always ignore Nokogiri @doc.

      Wgit::Utils.to_h(self, ignore: ignore)
    end
#to_json(include_html: false) ⇒ String
Converts this Document's #to_h return value to a JSON String.
    # File 'lib/wgit/document.rb', line 248
    def to_json(include_html: false)
      h = to_h(include_html: include_html)
      JSON.generate(h)
    end
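The to_h / to_json pairing can be sketched with the standard json library. `SketchSerialDoc` is a hypothetical class, and the conversion is written inline here as an assumption, since the behaviour of Wgit::Utils.to_h is not shown in this page.

```ruby
require 'json'

# Hypothetical stand-in; the Hash conversion is written inline.
class SketchSerialDoc
  def initialize(url, html, score)
    @url = url
    @html = html
    @score = score
  end

  # Builds a Hash of instance vars, skipping the ignored ones.
  def to_h(include_html: false, include_score: true)
    ignore = include_html ? [] : [:@html]
    ignore << :@score unless include_score

    (instance_variables - ignore).to_h do |var|
      [var.to_s.delete_prefix('@'), instance_variable_get(var)]
    end
  end

  def to_json(include_html: false)
    JSON.generate(to_h(include_html: include_html))
  end
end

doc = SketchSerialDoc.new('http://example.com', '<p>Hi</p>', 1.5)
puts doc.to_json
puts doc.to_json(include_html: true)
```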
#xpath(xpath) ⇒ Nokogiri::XML::NodeSet
Uses Nokogiri's xpath method to search the doc's html and return the results.
    # File 'lib/wgit/document.rb', line 299
    def xpath(xpath)
      @doc.xpath(xpath)
    end