Class: Omnivore::Document
- Inherits: Object
  - Object
  - Omnivore::Document
- Defined in: lib/omnivore/document.rb
Overview
A class encapsulating an HTML document.
Defined Under Namespace
Classes: Paragraph
Constant Summary
- BLOCK_TAGS = %w[div p frame]
  The HTML tags signaling the start of a block or paragraph.
Instance Attribute Summary
-
#model ⇒ Object
readonly
Returns the value of attribute model.
Class Method Summary
-
.from_html(html) ⇒ Document
Creates an Omnivore::Document object from a string containing HTML.
-
.from_url(url) ⇒ Document
Creates an Omnivore::Document object from a URL.
Instance Method Summary
-
#initialize(html) ⇒ Document
constructor
A new instance of Document.
-
#metadata ⇒ Hash
Extracts document metadata.
-
#title ⇒ String
Extracts the document title.
-
#to_html ⇒ String
An HTML representation of the document.
-
#to_paragraphs ⇒ Array
Splits the document into paragraphs, assuming that each <div> or <p> tag represents a paragraph.
-
#to_text ⇒ String
Returns the main content of the document as text, keeping only paragraphs whose text density is at least 0.25 and thereby dropping navigation, advertising, etc.
Constructor Details
#initialize(html) ⇒ Document
Returns a new instance of Document.
    # File 'lib/omnivore/document.rb', line 34
    def initialize(html)
      # Parse the HTML, discarding whitespace-only text nodes.
      @model = Nokogiri::HTML.parse(html) { |config|
        config.options = Nokogiri::XML::ParseOptions::NOBLANKS
      }
    end
Instance Attribute Details
#model ⇒ Object (readonly)
Returns the value of attribute model.
    # File 'lib/omnivore/document.rb', line 8
    def model
      @model
    end
Class Method Details
.from_html(html) ⇒ Document
Creates an Omnivore::Document object from a string containing HTML.
    # File 'lib/omnivore/document.rb', line 29
    def self.from_html(html)
      Document.new(html)
    end
.from_url(url) ⇒ Document
Creates an Omnivore::Document object from a URL.
    # File 'lib/omnivore/document.rb', line 21
    def self.from_url(url)
      Document.new(HttpClient.get(url))
    end
Instance Method Details
#metadata ⇒ Hash
Extracts document metadata.
    # File 'lib/omnivore/document.rb', line 57
    def metadata
      # Map each named <meta> tag to its content ("" when content is absent).
      @metadata ||= self.model.xpath("//meta").inject({ }) { |memo, el|
        memo[el.attr("name")] = el.attr("content") || "" if el.attr("name")
        memo
      }
    end
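The accumulation pattern used by #metadata can be tried without Nokogiri. The following is a minimal sketch, assuming a hypothetical `MetaTag` class that stands in for a parsed `<meta>` element and exposes `attr` in the same spirit; `MetaTag` and `collect_metadata` are illustrative names and not part of Omnivore.

```ruby
# Hypothetical stand-in for a parsed <meta> element (not part of Omnivore).
class MetaTag
  def initialize(attributes)
    @attributes = attributes
  end

  # Returns the attribute value, or nil when the attribute is missing.
  def attr(name)
    @attributes[name]
  end
end

# Same fold as in #metadata: only tags with a "name" attribute contribute,
# and a missing "content" attribute is stored as an empty string.
def collect_metadata(tags)
  tags.inject({}) do |memo, el|
    memo[el.attr("name")] = el.attr("content") || "" if el.attr("name")
    memo
  end
end

tags = [
  MetaTag.new({ "name" => "description", "content" => "A sample page" }),
  MetaTag.new({ "name" => "keywords" }),   # no content attribute
  MetaTag.new({ "charset" => "utf-8" })    # no name attribute, skipped
]

p collect_metadata(tags)
```

Tags without a `name` attribute, such as a charset declaration, never reach the resulting hash.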
#title ⇒ String
Extracts the document title.
    # File 'lib/omnivore/document.rb', line 50
    def title
      @title ||= self.model.xpath("/html/head/title").text.gsub(/\s+/, " ").strip
    end
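The whitespace normalization applied to the title can be illustrated on a plain string. This is a small sketch, assuming a hypothetical helper `normalize_title` and an invented raw title; the real method reads the text from the parsed document.

```ruby
# Hypothetical helper mirroring the gsub/strip chain used in #title.
def normalize_title(raw)
  # Collapse every run of whitespace (spaces, tabs, newlines) into a single
  # space, then trim leading and trailing whitespace.
  raw.gsub(/\s+/, " ").strip
end

raw_title = "\n   Breaking   News:\n\tOmnivore  released \n"
puts normalize_title(raw_title)
# prints "Breaking News: Omnivore released"
```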
#to_html ⇒ String
An HTML representation of the document.
    # File 'lib/omnivore/document.rb', line 43
    def to_html
      self.model.to_html
    end
#to_paragraphs ⇒ Array
Splits the document into paragraphs, assuming that each <div> or <p> tag represents a paragraph.
    # File 'lib/omnivore/document.rb', line 78
    def to_paragraphs
      self.model.xpath("//div|//p").map { |block|
        html = block.to_html.gsub(/\s+/, " ").strip
        text = flatten(block).inject([ ]) { |memo, node|
          memo << node.text.gsub(/\s+/, " ").strip if node.kind_of?(Nokogiri::XML::Text)
          memo
        }.join(" ")
        Paragraph.new(block.path.to_s, text, text.size / html.size.to_f)
      }
    end
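The third argument passed to Paragraph.new is a text density: the length of the visible text divided by the length of the block's HTML. The sketch below shows the ratio on two invented fragments. It assumes a crude regex-based `visible_text` helper as a stand-in for collecting Nokogiri text nodes; both helper names are hypothetical.

```ruby
# Ratio of visible text length to total markup length, as in to_paragraphs.
def text_density(html, text)
  text.size / html.size.to_f
end

# Hypothetical, simplified stand-in for walking text nodes: strip tags with a
# regex and normalize whitespace. The real method uses Nokogiri text nodes.
def visible_text(html)
  html.gsub(/<[^>]*>/, " ").gsub(/\s+/, " ").strip
end

article = "<p>Ruby makes text processing pleasant and concise.</p>"
nav     = '<div class="nav"><a href="/">Home</a><a href="/about">About</a></div>'

[article, nav].each do |html|
  text = visible_text(html)
  puts format("%.2f  %s", text_density(html, text), text)
end
```

A prose paragraph is mostly text and scores high, while a navigation block is mostly markup and scores low, which is what makes the ratio usable as a content filter.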
#to_text ⇒ String
Returns the main content of the document as text, keeping only paragraphs whose text density is at least 0.25 and thereby dropping navigation, advertising, etc.
    # File 'lib/omnivore/document.rb', line 67
    def to_text
      self.to_paragraphs.inject([ ]) { |buffer, p|
        buffer << p.text if p.text_density >= 0.25
        buffer
      }.join("\n")
    end
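The density filter in #to_text can be sketched on hand-made data. The example assumes a hypothetical `SketchParagraph` struct modeling only the `text` and `text_density` readers, with invented density values; it uses `select` and `map` in place of `inject`, which should yield the same result for this input.

```ruby
# Hypothetical stand-in for Omnivore::Document::Paragraph; only the two
# readers used by to_text are modeled.
SketchParagraph = Struct.new(:text, :text_density)

# The cutoff that to_text hard-codes.
DENSITY_THRESHOLD = 0.25

# Keep paragraphs at or above the threshold and join their text with newlines.
def content_text(paragraphs)
  paragraphs
    .select { |para| para.text_density >= DENSITY_THRESHOLD }
    .map(&:text)
    .join("\n")
end

paragraphs = [
  SketchParagraph.new("Home About Contact", 0.12),
  SketchParagraph.new("The first real paragraph of the article.", 0.81),
  SketchParagraph.new("Buy now!", 0.08),
  SketchParagraph.new("A second paragraph with actual content.", 0.77)
]

puts content_text(paragraphs)
```

Only the two high-density paragraphs survive, in their original order.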