Class: Html2rss::AttributePostProcessors::SanitizeHtml

Inherits:
Base
  • Object
Defined in:
lib/html2rss/attribute_post_processors/sanitize_html.rb

Overview

Returns sanitized HTML code as a String.

It sanitizes by using the [sanitize gem](github.com/rgrove/sanitize) with [Sanitize::Config::RELAXED](github.com/rgrove/sanitize#sanitizeconfigrelaxed).

Furthermore, it:

  • adds `rel="nofollow noopener noreferrer"` to <a> tags

  • adds `referrer-policy='no-referrer'` to <img> tags

  • wraps every <img> tag whose direct parent is not an <a> in an <a> linking to the <img>'s `src`.

Imagine this HTML structure:

<section>
  Lorem <b>ipsum</b> dolor...
  <iframe src="https://evil.corp/miner"></iframe>
  <script>alert();</script>
</section>

YAML usage example:

selectors:
  description:
    selector: 'section'
    extractor: html
    post_process:
      name: sanitize_html

Would return:

'<p>Lorem <b>ipsum</b> dolor ...</p>'

Instance Attribute Summary

Attributes inherited from Base

#context, #value

Class Method Summary

Instance Method Summary

Methods inherited from Base

assert_type, expect_options, #initialize

Constructor Details

This class inherits a constructor from Html2rss::AttributePostProcessors::Base

Class Method Details

.get(html, url) ⇒ Object

Shorthand method to get the sanitized HTML.

Parameters:

  • html (String)
  • url (String, Addressable::URI)

Raises:

  • (ArgumentError): if url is nil or empty


# File 'lib/html2rss/attribute_post_processors/sanitize_html.rb', line 50

def self.get(html, url)
  raise ArgumentError, 'url must be a String or Addressable::URI' if url.to_s.empty?
  return nil if html.to_s.empty?

  new(html, { config: Config::Channel.new({ url: }) }).get
end
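
The order of the two guard clauses matters: the url check runs first, so an empty url raises even when html is empty as well. A minimal sketch of that control flow follows. `get_sketch` is a hypothetical stand-in, not part of html2rss, and the block it yields to replaces the real sanitizer so the example runs without the gem.

```ruby
# Hypothetical stand-in for SanitizeHtml.get. Only the guard clauses are
# copied from the source above; the block replaces the real sanitizer.
def get_sketch(html, url)
  raise ArgumentError, 'url must be a String or Addressable::URI' if url.to_s.empty?
  return nil if html.to_s.empty?

  yield html, url
end

# Identity block: returns the HTML untouched (assumption, for illustration).
passthrough = ->(html, _url) { html }

# A nil or empty url is rejected before html is even looked at.
begin
  get_sketch('', nil, &passthrough)
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end

# With a url present, empty html short-circuits to nil.
empty_result = get_sketch('', 'https://example.com', &passthrough)
p empty_result

# Otherwise the work is delegated (here: to the block).
full_result = get_sketch('<b>hi</b>', 'https://example.com', &passthrough)
p full_result
```

Callers of `.get` should therefore be prepared for a `nil` return value as well as a String.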

.validate_args!(value, context) ⇒ Object



# File 'lib/html2rss/attribute_post_processors/sanitize_html.rb', line 42

def self.validate_args!(value, context)
  assert_type value, String, :value, context:
end
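
The source shown here does not include `assert_type` itself, which is inherited from Base. As a rough illustration of what such a type assertion typically does, here is a hypothetical sketch. The method name `assert_type_sketch`, the `TypeError` class, and the message format are all assumptions, and the real helper may behave differently.

```ruby
# Hypothetical sketch of a type assertion helper. The real Base.assert_type
# may raise a different error class and build a different message.
def assert_type_sketch(value, types, name, context: nil)
  types = Array(types) # accept a single class or a list of classes
  return if types.any? { |type| value.is_a?(type) }

  raise TypeError,
        "`#{name}` must be #{types.join(' or ')}, but is #{value.class} " \
        "(context: #{context.inspect})"
end

# Passes silently for a String value.
assert_type_sketch('<b>hi</b>', String, :value)

# Anything else is rejected before sanitizing starts.
begin
  assert_type_sketch(42, String, :value, context: { options: {} })
rescue TypeError => e
  puts e.message
end
```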

Instance Method Details

#get ⇒ String

Returns:

  • (String)


# File 'lib/html2rss/attribute_post_processors/sanitize_html.rb', line 59

def get
  sanitized_html = Sanitize.fragment(value, sanitize_config)
  sanitized_html.to_s.gsub(/\s+/, ' ').strip
end
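
The second line of `#get` is plain String processing and can be tried without the sanitize gem. In the sketch below, `collapse_whitespace` is a hypothetical helper that mirrors that line, and the input string is a made-up stand-in for what `Sanitize.fragment` might return.

```ruby
# Hypothetical helper mirroring the last line of #get.
def collapse_whitespace(html)
  html.to_s.gsub(/\s+/, ' ').strip
end

# Made-up stand-in for the String that Sanitize.fragment would return.
sanitized_html = "<p>Lorem\n    <b>ipsum</b>\t dolor ...\n</p>\n"

puts collapse_whitespace(sanitized_html)
# prints: <p>Lorem <b>ipsum</b> dolor ... </p>
```

Runs of whitespace are collapsed to a single space rather than removed, so a space can remain before a closing tag. Because the regular expression works on the whole string, whitespace inside elements such as `<pre>` is collapsed too.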