Class: Distillery::Document

Inherits:
SimpleDelegator
  • Object
Defined in:
lib/distillery/document.rb

Overview

Wraps a Nokogiri document for the HTML page to be distilled and holds all the methods that clean and distill the document down to just its content element.

Constant Summary

UNLIKELY_TAGS =

HTML elements unlikely to contain the content element.

%w[head script link meta]
UNLIKELY_IDENTIFIERS =

HTML ids and classes that are unlikely to contain the content element.

/combx|comment|disqus|foot|header|meta|nav|rss|shoutbox|sidebar|sponsor/i
REMOVAL_WHITELIST =

Elements that are exempt from removal as unlikely elements.

%w[a body]
BLOCK_ELEMENTS =

“Block” elements whose presence signals that their parent is less likely to be the content element.

%w[a blockquote dl div img ol p pre table ul]
POSITIVE_IDENTIFIERS =

HTML ids and classes that are positive signals of the content element.

/article|body|content|entry|hentry|page|pagination|post|text/i
NEGATIVE_IDENTIFIERS =

HTML ids and classes that are negative signals of the content element.

/combx|comment|contact|foot|footer|footnote|link|media|promo|related|scroll|shoutbox|sponsor|tags|widget/i
UNRELATED_ELEMENTS =

HTML elements that are unrelated to the content in the content element.

%w[iframe form object]
POSSIBLE_UNRELATED_ELEMENTS =

HTML elements that are possibly unrelated to the content of the content element.

%w[table ul div a]
0.045
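The identifier patterns above are plain regular expressions, so their behavior can be checked against sample class or id strings without any HTML. The following is a minimal sketch: the `identifier_signal` helper is hypothetical and is not part of Distillery, and it only reports which patterns match, not how the library weighs them.

```ruby
# Patterns copied from the constant summary above.
POSITIVE_IDENTIFIERS = /article|body|content|entry|hentry|page|pagination|post|text/i
NEGATIVE_IDENTIFIERS = /combx|comment|contact|foot|footer|footnote|link|media|promo|related|scroll|shoutbox|sponsor|tags|widget/i

# Hypothetical helper: report which of the two patterns a class/id string matches.
def identifier_signal(idclass)
  positive = idclass.match?(POSITIVE_IDENTIFIERS)
  negative = idclass.match?(NEGATIVE_IDENTIFIERS)
  return :mixed    if positive && negative
  return :positive if positive
  return :negative if negative
  :neutral
end

puts identifier_signal("entry-content") # positive
puts identifier_signal("related-links") # negative
puts identifier_signal("post-footer")   # mixed
puts identifier_signal("wrapper")       # neutral
```

Because the patterns are unanchored, they match substrings anywhere in the string, which is why a compound name such as `post-footer` can trigger both.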

Instance Attribute Summary

Instance Method Summary

Constructor Details

#initialize(page_string) ⇒ Document

Create a new Document

Parameters:

  • page_string (String)

    The HTML document to distill as a string.



# File 'lib/distillery/document.rb', line 48

def initialize(page_string)
  @scores = Hash.new(0)
  super(::Nokogiri::HTML(page_string))
end
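The constructor relies on `SimpleDelegator` from Ruby's standard library: whatever is passed to `super` becomes the delegation target, so methods the wrapper does not define are forwarded to it. The sketch below illustrates the same pattern with an Array in place of a Nokogiri document so that it needs no gems; the `ScoredList` class is hypothetical.

```ruby
require 'delegate'

# Hypothetical stand-in for Distillery::Document. It wraps an Array
# rather than a Nokogiri document, but follows the same constructor shape.
class ScoredList < SimpleDelegator
  attr_reader :scores

  def initialize(items)
    @scores = Hash.new(0) # unknown keys read as a score of 0
    super(items)          # SimpleDelegator stores the wrapped object
  end
end

list = ScoredList.new(%w[a b c])
puts list.size             # forwarded to the Array: 3
puts list.scores[:missing] # default score: 0
puts list.__getobj__.class # the wrapped object: Array
```

This is why the methods further down can call `search` directly: the call is forwarded to the wrapped Nokogiri document.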

Instance Attribute Details

#doc ⇒ Object (readonly)

The Nokogiri document



# File 'lib/distillery/document.rb', line 40

def doc
  @doc
end

#scores ⇒ Object (readonly)

Hash of xpath => content score of elements in this document



# File 'lib/distillery/document.rb', line 43

def scores
  @scores
end

Instance Method Details

#clean_top_scoring_elements!(options = {}) ⇒ Object

Attempts to clean non-content items, such as advertisements and widgets, out of the top-scoring elements.



# File 'lib/distillery/document.rb', line 114

def clean_top_scoring_elements!(options = {})
  keep_images = !!options[:images]

  top_scoring_elements.each do |element|

    element.search("*").each do |node|
      next if contains_content_image?(node) && keep_images
      node.remove if has_empty_text?(node)
    end

    element.search("*").each do |node|
      next if contains_content_image?(node) && keep_images
      if UNRELATED_ELEMENTS.include?(node.name) ||
        (node.text.count(',') < 2 && unlikely_to_be_content?(node))
        node.remove
      end
    end
  end
end
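The removal condition in the second pass can be isolated as a small predicate. This is a sketch under stated assumptions: `removable?` is a hypothetical helper, and its `unlikely` argument stands in for the result of `unlikely_to_be_content?(node)`, whose implementation is not shown on this page.

```ruby
UNRELATED_ELEMENTS = %w[iframe form object]

# Hypothetical predicate mirroring the second pass: a node is removed if its
# tag is always unrelated, or if it has fewer than two commas in its text
# and is judged unlikely to be content.
def removable?(name, text, unlikely)
  UNRELATED_ELEMENTS.include?(name) || (text.count(',') < 2 && unlikely)
end

puts removable?('form', 'Name, email, message', false)    # true: forms always go
puts removable?('div', 'Share this', true)                # true: no commas, unlikely
puts removable?('div', 'First, second, and third', true)  # false: two commas protect it
puts removable?('div', 'Share this', false)               # false: not judged unlikely
```

The comma count acts as a cheap proxy for running prose: a node with at least two commas is kept even when other signals count against it.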

#distill!(options = {}) ⇒ Object

Distills the document down to just its content.

Parameters:

  • options (Hash) (defaults to: {})

    Distillation options

Options Hash (options):

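  • :clean (Boolean)

    Pass false to skip cleaning the content element HTML.

  • :images (Boolean)

    When truthy, nodes that contain a content image are kept during cleaning.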



# File 'lib/distillery/document.rb', line 102

def distill!(options = {})
  remove_irrelevant_elements!
  remove_unlikely_elements!

  score!

  clean_top_scoring_elements!(options) unless options.delete(:clean) == false
  top_scoring_elements.map(&:inner_html).join("\n")
end
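The guard on the cleaning step compares `options.delete(:clean)` with `false`, so only an explicit `false` skips cleaning. The sketch below isolates that guard in a hypothetical `cleans?` helper to show which option hashes trigger cleaning.

```ruby
# Hypothetical helper that reproduces the guard used in distill!:
# cleaning runs unless the :clean option is exactly false.
def cleans?(options = {})
  options.delete(:clean) != false
end

puts cleans?({})               # true: cleaning is the default
puts cleans?({ clean: false }) # false: the only way to skip cleaning
puts cleans?({ clean: nil })   # true: nil is not false
puts cleans?({ dirty: true })  # true: this guard never reads :dirty

# delete also mutates the hash that was passed in.
opts = { clean: false, images: true }
cleans?(opts)
puts opts.key?(:clean)  # false: the key has been removed
puts opts.key?(:images) # true: other options are untouched
```

Callers that reuse an options hash across several calls may want to pass a copy, since the `:clean` key is removed on the first call.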

#mark_scorable_elements! ⇒ Object

Marks elements that are suitable for scoring by setting their data-distillery attribute to "scorable".



# File 'lib/distillery/document.rb', line 72

def mark_scorable_elements!
  search('div', 'p').each do |element|
    if element.name == 'p' || scorable_div?(element)
      element['data-distillery'] = 'scorable'
    end
  end
end

#remove_irrelevant_elements!(tags = UNLIKELY_TAGS) ⇒ Object

Removes irrelevant elements from the document. These are usually things like <script>, <link>, and other page elements we don’t care about.



# File 'lib/distillery/document.rb', line 55

def remove_irrelevant_elements!(tags = UNLIKELY_TAGS)
  search(*tags).each(&:remove)
end

#remove_unlikely_elements! ⇒ Object

Removes unlikely elements from the document. These are elements whose classes or ids suggest they are comments, headers, footers, navigation, etc.



# File 'lib/distillery/document.rb', line 61

def remove_unlikely_elements!
  search('*').each do |element|
    idclass = "#{element['class']}#{element['id']}"

    if idclass =~ UNLIKELY_IDENTIFIERS && !REMOVAL_WHITELIST.include?(element.name)
      element.remove
    end
  end
end
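The per-element test above can be tried out on plain strings. In this sketch the `unlikely?` helper is hypothetical; it takes a tag name, a class value, and an id value in place of a Nokogiri element, and applies the same concatenation, pattern match, and whitelist check.

```ruby
UNLIKELY_IDENTIFIERS = /combx|comment|disqus|foot|header|meta|nav|rss|shoutbox|sidebar|sponsor/i
REMOVAL_WHITELIST = %w[a body]

# Hypothetical helper applying the same test as remove_unlikely_elements!.
def unlikely?(name, css_class, id)
  idclass = "#{css_class}#{id}" # a nil class or id interpolates as ""
  !!(idclass =~ UNLIKELY_IDENTIFIERS) && !REMOVAL_WHITELIST.include?(name)
end

puts unlikely?('div', 'sidebar', nil)         # true
puts unlikely?('body', 'has-sidebar', nil)    # false: body is whitelisted
puts unlikely?('div', 'article', 'main')      # false: no pattern matches
puts unlikely?('div', 'football-scores', nil) # true: "foot" matches as a substring
puts unlikely?('div', 'side', 'bar')          # true: class and id join into "sidebar"
```

The last two cases show side effects of the approach: the pattern is unanchored, and the class and id are joined without a separator, so matches can span the boundary between them.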

#score! ⇒ Object

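Scores the document elements to find those which hold page content. Each scorable element earns points from the number of comma-separated pieces in its text and from its text length; the points are also added in full to its parent and at half weight to its grandparent.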



# File 'lib/distillery/document.rb', line 82

def score!
  mark_scorable_elements!

  scorable_elements.each do |element|
    points = 1
    points += element.text.split(',').length
    points += [element.text.length / 100, 3].min

    scores[element.path] = points
    scores[element.parent.path] += points
    scores[element.parent.parent.path] += points.to_f/2
  end

  augment_scores_by_link_weight!
end
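As a worked example of the arithmetic in the loop above, the sketch below computes the points for a piece of text and propagates them the same way. The `points_for` helper and the XPath strings are hypothetical, and the final link-weight adjustment is not covered because its implementation is not shown here.

```ruby
# Points for one scorable element's text, following the loop in score!.
def points_for(text)
  points = 1                               # base point
  points += text.split(',').length         # one point per comma-separated piece
  points += [text.length / 100, 3].min     # one per 100 characters, capped at 3
  points
end

short = "One, two, three"
puts points_for(short)      # 1 + 3 + 0 = 4

long = "a" * 450
puts points_for(long)       # 1 + 1 + 3 = 5 (450 / 100 = 4, capped at 3)

# Propagation to the parent (full weight) and grandparent (half weight).
scores = Hash.new(0)
points = points_for(short)
scores['/html/body/div/p'] = points
scores['/html/body/div'] += points
scores['/html/body'] += points.to_f / 2

puts scores['/html/body/div/p'] # 4
puts scores['/html/body/div']   # 4
puts scores['/html/body']       # 2.0
```

Because `scores` is built with `Hash.new(0)`, the `+=` on a parent path works even when that path has not been scored before, and a container accumulates points from every scorable child it holds.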