Class: Treat::Workers::Formatters::Readers::HTML
- Inherits: Object
- Defined in: lib/treat/workers/formatters/readers/html.rb
Overview
This class is a wrapper for the 'ruby-readability' gem, which extracts the primary readable content of a web page using a set of handwritten rules.
Project homepage: github.com/iterationlabs/ruby-readability
Constant Summary
- DefaultOptions =
By default, do not back up the original HTML.
{ :keep_html => false, :tags => %w[p div h1 h2 h3 ul ol dl dt li img], }
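The read method listed in this page combines these defaults with the caller's options through DefaultOptions.merge, so caller-supplied keys take precedence and missing keys fall back to the defaults. A minimal sketch of that behaviour, assuming a local copy of the constant (DEFAULT_OPTIONS and merged_options are hypothetical names used only for illustration and are not part of Treat):

```ruby
# Local stand-in mirroring the documented DefaultOptions constant.
# (Assumption: copied from this page, not loaded from the Treat gem.)
DEFAULT_OPTIONS = {
  :keep_html => false,
  :tags => %w[p div h1 h2 h3 ul ol dl dt li img]
}.freeze

# Hypothetical helper: combine the defaults with caller-supplied options.
# Hash#merge returns a new hash in which the argument's keys win.
def merged_options(options = {})
  DEFAULT_OPTIONS.merge(options)
end

# Override :tags and add an option that has no default.
custom = merged_options(:tags => %w[p div], :remove_empty_nodes => true)
puts custom.inspect
```

Because Hash#merge returns a new hash, the defaults themselves are left unchanged between calls.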
Class Method Summary
- .read(document, options = {}) ⇒ Object
Read the HTML document and strip it of its markup.
Class Method Details
.read(document, options = {}) ⇒ Object
Read the HTML document and strip it of its markup.
Options:
- (Boolean) :keep_html => whether to back up the original HTML text when cleaning the document (default: false).
- (Boolean) :remove_empty_nodes => remove <p> tags that have no text content.
- (String) :encoding => if the page is of a known encoding, you can specify it; if left unspecified, the encoding will be guessed (only in Ruby 1.9.x).
- (String) :html_headers => in Ruby 1.9.x, these are passed to the guess_html_encoding gem to aid with guessing the HTML encoding.
- (Array of String) :tags => the base whitelist of tags to sanitize; also removes p tags that contain only images. ruby-readability itself defaults to %w[div p], but this reader's DefaultOptions supplies %w[p div h1 h2 h3 ul ol dl dt li img] unless the caller overrides it.
- (Array of String) :attributes => list of allowed attributes.
- (Array of String) :ignore_image_format => for use with images.
- (Numeric) :min_image_height => minimum image height, for use with images.
- (Numeric) :min_image_width => minimum image width, for use with images.
# File 'lib/treat/workers/formatters/readers/html.rb', line 38
def self.read(document, options = {})
  # set encoding with the guess_html_encoding
  options = DefaultOptions.merge(options)
  html = File.read(document.file)
  silence_warnings do
    # Strip comments
    html.gsub!(/<!--[^>]*-->/m, '')
    d = Readability::Document.new(html, options)
    document.value = "<h1>#{d.title}</h1>\n" + d.content
    document.set :format, 'html'
    document.set :images, d.images
  end
  document
end
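The comment-stripping step in the method above can be tried in isolation. The sketch below reuses the same regular expression on plain strings; COMMENT_PATTERN and strip_comments are hypothetical names introduced for illustration, and the non-mutating gsub is used here instead of gsub!. It also illustrates one consequence of the pattern: because [^>]* cannot cross a ">" character, a comment whose body contains ">" is left in place.

```ruby
# Same pattern as in the method source above.
COMMENT_PATTERN = /<!--[^>]*-->/m

# Hypothetical helper: return a copy of the HTML with matching comments removed.
def strip_comments(html)
  html.gsub(COMMENT_PATTERN, '')
end

# A single-line comment is removed.
simple = strip_comments("<p>a</p><!-- note --><p>b</p>")

# A comment spanning several lines is removed too, since a negated
# character class such as [^>] also matches newline characters.
multiline = strip_comments("<p>a</p><!--\n two lines\n--><p>b</p>")

# A comment containing '>' does not match the pattern and is kept.
with_gt = strip_comments("<p>a</p><!-- x > y --><p>b</p>")

puts simple
puts multiline
puts with_gt
```

If comments containing ">" matter for a given input, a different pattern or an HTML parser would be needed; that goes beyond what the method shown here does.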