Class: Distillery::Document
- Inherits: SimpleDelegator
  - Object
  - SimpleDelegator
  - Distillery::Document
- Defined in: lib/distillery/document.rb
Overview
Wraps a Nokogiri document for the HTML page to be distilled and holds all methods to clean and distill the document down to just its content element.
Constant Summary
- UNLIKELY_TAGS =
HTML elements unlikely to contain the content element.
%w[head script link meta]
- UNLIKELY_IDENTIFIERS =
HTML ids and classes that are unlikely to contain the content element.
/combx|comment|community|disqus|foot|header|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup/i
- REMOVAL_WHITELIST =
Elements that are whitelisted from being removed as unlikely elements
%w[a body]
- BLOCK_ELEMENTS =
“Block” elements whose presence signals that their parent is less likely to be the content element.
%w[a blockquote dl div img ol p pre table ul]
- POSITIVE_IDENTIFIERS =
HTML ids and classes that are positive signals of the content element.
/article|body|content|entry|hentry|page|pagination|post|text/i
- NEGATIVE_IDENTIFIERS =
HTML ids and classes that are negative signals of the content element.
/combx|comment|contact|foot|footer|footnote|link|media|promo|related|scroll|shoutbox|sponsor|tags|widget|related/i
- UNRELATED_ELEMENTS =
HTML elements that are unrelated to the content in the content element.
%w[iframe form object]
- POSSIBLE_UNRELATED_ELEMENTS =
HTML elements that are possibly unrelated to the content of the content element.
%w[table ul div a]
- RELATED_SCORE_RATIO =
The ratio of the top element’s score that a sibling with an identical class/id needs to reach in order to be considered related.
0.027
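As a rough illustration of how the identifier patterns behave, the sketch below copies POSITIVE_IDENTIFIERS and NEGATIVE_IDENTIFIERS from the summary above and runs them against a few made-up class/id strings. The `signals` helper and the sample strings are hypothetical; this page does not show how Distillery weights these signals when scoring.

```ruby
# Patterns copied verbatim from the constant summary above.
POSITIVE_IDENTIFIERS = /article|body|content|entry|hentry|page|pagination|post|text/i
NEGATIVE_IDENTIFIERS = /combx|comment|contact|foot|footer|footnote|link|media|promo|related|scroll|shoutbox|sponsor|tags|widget|related/i

# Hypothetical helper: report which kinds of signal a class/id string triggers.
def signals(identifier)
  {
    positive: !!(identifier =~ POSITIVE_IDENTIFIERS),
    negative: !!(identifier =~ NEGATIVE_IDENTIFIERS)
  }
end

p signals("post-body")      # matches only the positive pattern
p signals("related-links")  # matches only the negative pattern
p signals("post-footer")    # matches both patterns
```

Because both patterns are unanchored and case-insensitive, a single identifier can carry positive and negative signals at the same time.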
Instance Attribute Summary
-
#doc ⇒ Object
readonly
The Nokogiri document.
-
#scores ⇒ Object
readonly
Hash of xpath => content score of elements in this document.
Instance Method Summary
-
#clean_top_scoring_elements!(options = {}) ⇒ Object
Attempts to strip non-content items, such as advertisements and widgets, from the top-scoring elements.
-
#distill!(options = {}) ⇒ Object
Distills the document down to just its content.
-
#initialize(page_string) ⇒ Document
constructor
Create a new Document.
-
#mark_scorable_elements! ⇒ Object
Marks elements that are suitable for scoring with a special HTML attribute.
-
#remove_irrelevant_elements!(tags = UNLIKELY_TAGS) ⇒ Object
Removes irrelevant elements from the document.
-
#remove_unlikely_elements! ⇒ Object
Removes unlikely elements from the document.
-
#score! ⇒ Object
Scores the document elements based on an algorithm to find elements which hold page content.
Constructor Details
#initialize(page_string) ⇒ Document
Create a new Document
# File 'lib/distillery/document.rb', line 48
def initialize(page_string)
  @scores = Hash.new(0)
  super(::Nokogiri::HTML(page_string))
end
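The constructor combines two standard-library behaviors: `SimpleDelegator` forwards unknown methods to the wrapped object, and `Hash.new(0)` returns 0 for keys that have not been scored yet. The sketch below shows the same shape without Nokogiri; `FakeParsedPage` and `SketchDocument` are hypothetical stand-ins, not part of Distillery.

```ruby
require 'delegate'

# Hypothetical stand-in for a parsed page; the real class wraps the
# result of Nokogiri::HTML(page_string) instead.
class FakeParsedPage
  def initialize(source)
    @source = source
  end

  # Extract the text between <title> tags, or nil if there is none.
  def title
    @source[/<title>(.*?)<\/title>/m, 1]
  end
end

# Same shape as the constructor above: set up the scores hash, then
# hand the wrapped object to SimpleDelegator.
class SketchDocument < SimpleDelegator
  attr_reader :scores

  def initialize(page_string)
    @scores = Hash.new(0)
    super(FakeParsedPage.new(page_string))
  end
end

doc = SketchDocument.new("<html><title>Hello</title></html>")
puts doc.title             # forwarded to the wrapped FakeParsedPage
puts doc.scores["/html"]   # unseen keys default to 0
doc.scores["/html"] += 2.5 # so scores can be accumulated with +=
puts doc.scores["/html"]
```

The default value of 0 is what allows the scoring code to use `+=` on parent paths that have not been seen before.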
Instance Attribute Details
#doc ⇒ Object (readonly)
The Nokogiri document
# File 'lib/distillery/document.rb', line 40
def doc
  @doc
end
#scores ⇒ Object (readonly)
Hash of xpath => content score of elements in this document
# File 'lib/distillery/document.rb', line 43
def scores
  @scores
end
Instance Method Details
#clean_top_scoring_elements!(options = {}) ⇒ Object
Attempts to strip non-content items, such as advertisements and widgets, from the top-scoring elements.
# File 'lib/distillery/document.rb', line 114
def clean_top_scoring_elements!(options = {})
  keep_images = !!options[:images]

  top_scoring_elements.each do |element|
    element.search("*").each do |node|
      if cleanable?(node, keep_images)
        debugger if node.to_s =~ /maximum flavor/ # debugging hook left in the source
        node.remove
      end
    end
  end
end
#distill!(options = {}) ⇒ Object
Distills the document down to just its content.
# File 'lib/distillery/document.rb', line 102
def distill!(options = {})
  remove_irrelevant_elements!
  remove_unlikely_elements!
  score!
  clean_top_scoring_elements!(options) unless options.delete(:clean) == false
  top_scoring_elements.map(&:inner_html).join("\n")
end
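The guard in `distill!` means cleaning is on by default and is skipped only when the caller passes `clean: false`. The sketch below isolates that idiom in plain Ruby; `clean_requested?` is a hypothetical helper written for illustration, not a Distillery method.

```ruby
# Hypothetical helper isolating the guard used in distill!:
# cleaning runs unless the caller explicitly passes clean: false.
# Note that delete also removes :clean from the caller's hash.
def clean_requested?(options = {})
  options.delete(:clean) != false
end

opts = { clean: false, images: true }
p clean_requested?(opts)              # cleaning is skipped
p opts                                # :clean has been removed; :images remains
p clean_requested?({})                # a missing key leaves cleaning enabled
p clean_requested?({ clean: nil })    # only a literal false disables cleaning
```

Because the `unless` condition is evaluated first, `:clean` has already been deleted by the time the remaining options, such as `:images`, reach `clean_top_scoring_elements!`.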
#mark_scorable_elements! ⇒ Object
Marks elements that are suitable for scoring with a special HTML attribute.
# File 'lib/distillery/document.rb', line 72
def mark_scorable_elements!
  search('div', 'p').each do |element|
    if element.name == 'p' || scorable_div?(element)
      element['data-distillery'] = 'scorable'
    end
  end
end
#remove_irrelevant_elements!(tags = UNLIKELY_TAGS) ⇒ Object
Removes irrelevant elements from the document. These are usually things like <script>, <link> and other page elements we don’t care about.
# File 'lib/distillery/document.rb', line 55
def remove_irrelevant_elements!(tags = UNLIKELY_TAGS)
  search(*tags).each(&:remove)
end
#remove_unlikely_elements! ⇒ Object
Removes unlikely elements from the document. These are elements whose classes or ids seem to indicate they are comments, headers, footers, nav, etc.
# File 'lib/distillery/document.rb', line 61
def remove_unlikely_elements!
  search('*').each do |element|
    idclass = "#{element['class']}#{element['id']}"

    if idclass =~ UNLIKELY_IDENTIFIERS && !REMOVAL_WHITELIST.include?(element.name)
      element.remove
    end
  end
end
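The removal condition can be tried out on plain strings. The sketch below copies UNLIKELY_IDENTIFIERS and REMOVAL_WHITELIST from the constant summary and wraps the condition in a hypothetical `unlikely?` predicate that takes a tag name, class and id instead of a Nokogiri element.

```ruby
# Constants copied verbatim from the constant summary.
UNLIKELY_IDENTIFIERS = /combx|comment|community|disqus|foot|header|remark|rss|shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|popup/i
REMOVAL_WHITELIST = %w[a body]

# Hypothetical predicate mirroring the condition in remove_unlikely_elements!.
def unlikely?(name, css_class, id)
  idclass = "#{css_class}#{id}" # a nil class or id interpolates as ""
  !!(idclass =~ UNLIKELY_IDENTIFIERS) && !REMOVAL_WHITELIST.include?(name)
end

p unlikely?("div", "sidebar", nil)       # class matches the pattern
p unlikely?("div", "story", "main")      # nothing matches
p unlikely?("body", "has-sidebar", nil)  # matches, but body is whitelisted
p unlikely?("div", "s", "ponsor")        # class and id are joined without a separator
```

The last call shows a side effect of the concatenation: since class and id are joined with no separator, a match can span the boundary between them.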
#score! ⇒ Object
Scores the document elements based on an algorithm to find elements which hold page content.
# File 'lib/distillery/document.rb', line 82
def score!
  mark_scorable_elements!

  scorable_elements.each do |element|
    points = 1
    points += element.text.split(',').length
    points += [element.text.length / 100, 3].min

    scores[element.path] = points
    scores[element.parent.path] += points
    scores[element.parent.parent.path] += points.to_f/2
  end

  augment_scores_by_link_weight!
end
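To make the arithmetic in `score!` concrete, the sketch below reproduces the per-element points calculation and the propagation to parent and grandparent using plain strings. The `content_points` helper and the XPath strings are hypothetical, and the later `augment_scores_by_link_weight!` step is not modeled because its source is not shown here.

```ruby
# Hypothetical helper reproducing the per-element arithmetic in score!.
def content_points(text)
  points = 1
  points += text.split(',').length      # number of comma-separated segments
  points += [text.length / 100, 3].min  # one point per 100 characters, capped at 3
  points
end

scores = Hash.new(0)
parent      = "/html/body/div/div" # made-up paths for illustration
grandparent = "/html/body/div"

paragraphs = {
  "/html/body/div/div/p[1]" => "Short, simple sentence.", # 2 segments, 23 characters
  "/html/body/div/div/p[2]" => "a" * 450                  # 1 segment, 450 characters
}

paragraphs.each do |path, text|
  points = content_points(text)
  scores[path] = points
  scores[parent] += points               # the parent receives the full points
  scores[grandparent] += points.to_f / 2 # the grandparent receives half
end

p scores
```

Here the first paragraph earns 1 + 2 + 0 = 3 points and the second earns 1 + 1 + 3 = 5, so the shared parent accumulates 8 and the grandparent 4.0.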