Class: Omnivore::Document

Inherits:
Object
Defined in:
lib/omnivore/document.rb

Overview

A class encapsulating an HTML document.

Defined Under Namespace

Classes: Paragraph

Constant Summary

BLOCK_TAGS = %w[div p frame]

The HTML tags signaling the start of a block or paragraph.

Constructor Details

#initialize(html) ⇒ Document

Returns a new instance of Document.

# File 'lib/omnivore/document.rb', line 34

def initialize(html)
  @model = Nokogiri::HTML.parse(html) { |config|
    config.options = Nokogiri::XML::ParseOptions::NOBLANKS
  }
end

Instance Attribute Details

#model ⇒ Object (readonly)

Returns the value of attribute model: the Nokogiri document parsed from the HTML passed to the constructor.

# File 'lib/omnivore/document.rb', line 8

def model
  @model
end

Class Method Details

.from_html(html) ⇒ Document

Creates an Omnivore::Document object from a string containing HTML.

Parameters:

  • html (String)

    the HTML content

Returns:

  • (Document)

# File 'lib/omnivore/document.rb', line 29

def self.from_html(html)
  Document.new(html)
end

.from_url(url) ⇒ Document

Creates an Omnivore::Document object from a URL. The content is fetched with HttpClient.get and then parsed.

Parameters:

  • url (String)

    the document’s URL

Returns:

  • (Document)

# File 'lib/omnivore/document.rb', line 21

def self.from_url(url)
  Document.new(HttpClient.get(url))
end

Instance Method Details

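#metadata ⇒ Hash

Extracts document metadata. The result is memoized after the first call.

Returns:

  • (Hash)

    The metadata tags found in the document, mapping each <meta> tag's name attribute to its content attribute (an empty string if content is missing). Tags without a name attribute are skipped.

# File 'lib/omnivore/document.rb', line 57

def metadata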
  @metadata ||= self.model.xpath("//meta").inject({ }) { |memo, el|
    memo[el.attr("name")] = el.attr("content") || "" if el.attr("name")  
    memo
  }
end
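
The fold above can be tried without Nokogiri. The following is a minimal sketch, assuming each <meta> element is represented by a plain Hash of its attributes; meta_elements and its sample values are hypothetical stand-ins, not part of Omnivore.

```ruby
# Hypothetical stand-ins for <meta> elements: each Hash holds one tag's attributes.
meta_elements = [
  { "name" => "description", "content" => "A sample page" },
  { "name" => "keywords" },                    # name but no content
  { "charset" => "utf-8" },                    # no name: skipped
  { "name" => "author", "content" => "Jane Doe" }
]

# Same rule as #metadata: key by name, default a missing content to "".
metadata = meta_elements.each_with_object({}) do |attrs, memo|
  name = attrs["name"]
  memo[name] = attrs["content"] || "" if name
end
```

Under this rule, tags that carry only charset or http-equiv attributes never appear in the resulting Hash.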

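#title ⇒ String

Extracts the document title from /html/head/title, collapsing runs of whitespace into single spaces and trimming both ends. The result is memoized.

Returns:

  • (String)

    The document title.

# File 'lib/omnivore/document.rb', line 50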

def title
  @title ||= self.model.xpath("/html/head/title").text.gsub(/\s+/, " ").strip
end
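
The whitespace handling in #title can be seen in isolation. This is a small sketch; normalize_whitespace is a hypothetical helper name introduced for illustration, and the sample title is made up.

```ruby
# Hypothetical helper applying the same normalization as #title:
# collapse every run of whitespace to one space, then trim the ends.
def normalize_whitespace(raw)
  raw.gsub(/\s+/, " ").strip
end

normalized_title = normalize_whitespace("\n   Breaking   News:\n\tOmnivore Released  \n")
```

Titles that span several lines in the source HTML therefore come back as a single line.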

#to_html ⇒ String

An HTML representation of the document.

Returns:

  • (String)

    An HTML representation of the document.

# File 'lib/omnivore/document.rb', line 43

def to_html
  self.model.to_html
end

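#to_paragraphs ⇒ Array

Splits the document into paragraphs, assuming that each <div> or <p> tag represents a paragraph.

Returns:

  • (Array)

    An array of Paragraph objects. Each one records the block's XPath, its whitespace-normalized text, and its text density (the length of the text divided by the length of the block's HTML).

# File 'lib/omnivore/document.rb', line 78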

def to_paragraphs
  self.model.xpath("//div|//p").map { |block|
    html = block.to_html.gsub(/\s+/, " ").strip
    text = flatten(block).inject([ ]) { |memo, node|
      memo << node.text.gsub(/\s+/, " ").strip if node.kind_of?(Nokogiri::XML::Text) 
      memo
    }.join(" ")
    Paragraph.new(block.path.to_s, text, text.size / html.size.to_f)
  }
end
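
The text density computed above is the ratio of visible text length to markup length. The sketch below reproduces that arithmetic on plain strings. The Paragraph Struct, squish, and build_paragraph are hypothetical stand-ins (the real Paragraph class is not shown on this page), and the text is passed in directly instead of being collected from Nokogiri text nodes.

```ruby
# Hypothetical stand-in for Omnivore::Document::Paragraph.
Paragraph = Struct.new(:path, :text, :text_density)

# Collapse whitespace the same way #to_paragraphs does.
def squish(str)
  str.gsub(/\s+/, " ").strip
end

# Build a paragraph from a block's XPath, its HTML, and its visible text.
def build_paragraph(path, html, text)
  html = squish(html)
  text = squish(text)
  Paragraph.new(path, text, text.size / html.size.to_f)
end

prose = build_paragraph("/html/body/p", "<p>Hello world</p>", "Hello world")
nav   = build_paragraph("/html/body/div", '<div><a href="/">Home</a></div>', "Home")
```

In this example the prose block has 11 text characters in 18 characters of HTML, while the navigation block has 4 in 31, so markup-heavy blocks score lower.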

#to_text ⇒ String

Returns the actual content of the document, without navigation, advertising, etc. Only paragraphs with a text density of at least 0.25 are kept; their text is joined with newlines.

Returns:

  • (String)

    The document’s main content.

# File 'lib/omnivore/document.rb', line 67

def to_text
  self.to_paragraphs.inject([ ]) { |buffer, p| 
    buffer << p.text if p.text_density >= 0.25
    buffer
  }.join("\n")
end
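
The filtering step can be sketched on its own. This assumes paragraphs respond to text and text_density, as the method above requires. The Paragraph Struct, MIN_TEXT_DENSITY, main_text, and the sample data are hypothetical names and values introduced for illustration.

```ruby
# Hypothetical stand-in for Omnivore::Document::Paragraph.
Paragraph = Struct.new(:path, :text, :text_density)

# The threshold used by #to_text, given a name here for readability.
MIN_TEXT_DENSITY = 0.25

# Keep paragraphs at or above the threshold and join their text with newlines.
def main_text(paragraphs, threshold = MIN_TEXT_DENSITY)
  paragraphs
    .select { |para| para.text_density >= threshold }
    .map(&:text)
    .join("\n")
end

paragraphs = [
  Paragraph.new("/html/body/div[1]", "Home About Contact", 0.1),
  Paragraph.new("/html/body/p[1]", "First paragraph.", 0.8),
  Paragraph.new("/html/body/p[2]", "Second paragraph.", 0.25)
]
content = main_text(paragraphs)
```

Making the threshold a parameter is a design choice of this sketch only; the method documented above hard-codes 0.25.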