Class: Omnivore::Document

Inherits: Object

Defined in: lib/omnivore/document.rb

Overview

A class encapsulating an HTML document.

Defined Under Namespace

Classes: Paragraph

Constant Summary

BLOCK_TAGS = %w[div p frame]

The HTML tags signaling the start of a block or paragraph.

Instance Attribute Summary

#model ⇒ Object readonly

Returns the value of attribute model.

Class Method Summary

.from_html(html) ⇒ Document

Creates an Omnivore::Document object from a string containing HTML.

.from_url(url) ⇒ Document

Creates an Omnivore::Document object from a URL.

Instance Method Summary

#initialize(html) ⇒ Document constructor

A new instance of Document.

#metadata ⇒ Hash

Extracts document metadata.

#title ⇒ String

Extracts the document title.

#to_html ⇒ String

An HTML representation of the document.

#to_paragraphs ⇒ Array

Splits the document into paragraphs, assuming that each <div> or <p> tag represents a paragraph.

#to_text ⇒ String

Returns the actual content of the document, without navigation, advertising, etc.

Constructor Details

#initialize(html) ⇒ `Document`

Returns a new instance of Document.

# File 'lib/omnivore/document.rb', line 34

def initialize(html)
  @model = Nokogiri::HTML.parse(html) { |config|
    config.options = Nokogiri::XML::ParseOptions::NOBLANKS
  }
end

Instance Attribute Details

#model ⇒ `Object` (readonly)

Returns the value of attribute model.



# File 'lib/omnivore/document.rb', line 8

def model
  @model
end

Class Method Details

.from_html(html) ⇒ `Document`

Creates an Omnivore::Document object from a string containing HTML.

Parameters:

html (String) —

the HTML content

Returns:

(Document) —

A new Document object.



# File 'lib/omnivore/document.rb', line 29

def self.from_html(html)
  Document.new(html)
end

.from_url(url) ⇒ `Document`

Creates an Omnivore::Document object from a URL.

Parameters:

url (String) —

the document’s URL

Returns:

(Document) —

A new Document object.



# File 'lib/omnivore/document.rb', line 21

def self.from_url(url)
  Document.new(HttpClient.get(url))
end

Instance Method Details

#metadata ⇒ `Hash`

Extracts document metadata.

Returns:

(Hash) —

The metadata tags found in the document.

# File 'lib/omnivore/document.rb', line 57

def metadata
  @metadata ||= self.model.xpath("//meta").inject({ }) { |memo, el|
    memo[el.attr("name")] = el.attr("content") || "" if el.attr("name")  
    memo
  }
end
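
The fold in #metadata can be tried without Nokogiri. The sketch below is only an illustration: it swaps the parsed <meta> elements for plain hashes and uses a hypothetical helper, collect_metadata, neither of which is part of Omnivore. It keeps the same rule as the listing above: only elements with a name attribute are recorded, and a missing content attribute becomes an empty string.

```ruby
# Stand-ins for <meta> elements; in the real method these are Nokogiri nodes.
META_ELEMENTS = [
  { "name" => "description", "content" => "A sample page" },
  { "name" => "keywords" },                        # no content attribute
  { "charset" => "utf-8" },                        # no name attribute, skipped
  { "name" => "author", "content" => "Jane Doe" }
].freeze

# Hypothetical helper mirroring the fold in #metadata.
def collect_metadata(elements)
  elements.each_with_object({}) do |el, memo|
    name = el["name"]
    memo[name] = el["content"] || "" if name
  end
end

metadata = collect_metadata(META_ELEMENTS)
```

Because the hash is keyed by name, a later <meta> element with the same name overwrites an earlier one.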

#title ⇒ `String`

Extracts the document title.

Returns:

(String) —

The document title.



# File 'lib/omnivore/document.rb', line 50

def title
  @title ||= self.model.xpath("/html/head/title").text.gsub(/\s+/, " ").strip
end
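
As a small illustration of the whitespace handling in #title, the sketch below applies the same gsub and strip calls to a plain string. The helper name normalize_whitespace and the sample title are assumptions made for this example, not part of the library.

```ruby
# Hypothetical helper: the same normalization #title applies to the <title> text.
def normalize_whitespace(text)
  # Collapse every run of whitespace (spaces, tabs, newlines) into one space,
  # then trim the ends.
  text.gsub(/\s+/, " ").strip
end

raw_title = "\n   Omnivore \t documentation\n  "
clean_title = normalize_whitespace(raw_title)
```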

#to_html ⇒ `String`

An HTML representation of the document.

Returns:

(String) —

An HTML representation of the document.



# File 'lib/omnivore/document.rb', line 43

def to_html
  self.model.to_html
end

#to_paragraphs ⇒ `Array`

Splits the document into paragraphs, assuming that each <div> or <p> tag represents a paragraph.

Returns:

(Array) —

An array of Paragraph objects.

# File 'lib/omnivore/document.rb', line 78

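def to_paragraphs
  self.model.xpath("//div|//p").map { |block|
    # Whitespace-normalized markup of the block; its length is the density denominator.
    html = block.to_html.gsub(/\s+/, " ").strip
    # Join the normalized text of the text nodes returned by flatten.
    text = flatten(block).inject([ ]) { |memo, node|
      memo << node.text.gsub(/\s+/, " ").strip if node.kind_of?(Nokogiri::XML::Text)
      memo
    }.join(" ")
    # Text density is the text length divided by the markup length.
    Paragraph.new(block.path.to_s, text, text.size / html.size.to_f)
  }
end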
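
The third argument passed to Paragraph.new is a text density: the length of the extracted text divided by the length of the block's markup. A minimal sketch of that ratio on a hand-written block follows. The Struct is a stand-in for Omnivore's Paragraph class: text and text_density appear in the listings on this page, while the field name path and the helper text_density are assumptions.

```ruby
# Stand-in for Omnivore::Document::Paragraph; the field name "path" is assumed.
Paragraph = Struct.new(:path, :text, :text_density)

# Hypothetical helper: ratio of visible text length to markup length.
def text_density(text, html)
  text.size / html.size.to_f
end

html = '<p class="lead">Hello <b>world</b></p>' # 38 characters of markup
text = "Hello world"                            # 11 characters of text
paragraph = Paragraph.new("/html/body/p", text, text_density(text, html))
```

Blocks dominated by tags and attributes, such as navigation bars, score low on this ratio, while prose-heavy blocks score closer to 1.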

#to_text ⇒ `String`

Returns the actual content of the document, without navigation, advertising, etc.

Returns:

(String) —

The document’s main content.

# File 'lib/omnivore/document.rb', line 67

def to_text
  self.to_paragraphs.inject([ ]) { |buffer, p| 
    buffer << p.text if p.text_density >= 0.25
    buffer
  }.join("\n")
end
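
The filter in #to_text keeps every paragraph whose text density is at least 0.25 and joins the survivors with newlines. The sketch below replays that rule on hand-made data. The Struct stands in for the Paragraph class, and the paths, texts, and density values are invented for illustration, not output from the library.

```ruby
# Stand-in for Omnivore::Document::Paragraph; the field name "path" is assumed.
Paragraph = Struct.new(:path, :text, :text_density)

MIN_DENSITY = 0.25 # threshold used by #to_text

# Illustrative values only.
paragraphs = [
  Paragraph.new("/html/body/div[1]", "Home About Contact", 0.08),
  Paragraph.new("/html/body/div[2]/p[1]", "First paragraph of the article.", 0.71),
  Paragraph.new("/html/body/div[2]/p[2]", "Second paragraph.", 0.25),
  Paragraph.new("/html/body/div[3]", "Advertisement", 0.12)
]

# Keep paragraphs at or above the threshold, then join their text.
main_text = paragraphs
  .select { |para| para.text_density >= MIN_DENSITY }
  .map(&:text)
  .join("\n")
```

The comparison is inclusive, so a paragraph with a density of exactly 0.25 is kept.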
