Class: Sanitize

Inherits:
Object
Defined in:
lib/sanitize.rb,
lib/sanitize/css.rb,
lib/sanitize/config.rb,
lib/sanitize/version.rb,
lib/sanitize/config/basic.rb,
lib/sanitize/config/default.rb,
lib/sanitize/config/relaxed.rb,
lib/sanitize/config/restricted.rb,
lib/sanitize/transformers/clean_css.rb,
lib/sanitize/transformers/clean_cdata.rb,
lib/sanitize/transformers/clean_comment.rb,
lib/sanitize/transformers/clean_doctype.rb,
lib/sanitize/transformers/clean_element.rb

Defined Under Namespace

Modules: Config, Transformers
Classes: CSS, Error

Constant Summary

REGEX_HTML_CONTROL_CHARACTERS =

Matches one or more control characters that should be removed from HTML before parsing, as defined by the HTML living standard.

/[\u0001-\u0008\u000b\u000e-\u001f\u007f-\u009f]+/u
REGEX_HTML_NON_CHARACTERS =

Matches one or more non-characters that should be removed from HTML before parsing, as defined by the HTML living standard.

/[\ufdd0-\ufdef\ufffe\uffff\u{1fffe}\u{1ffff}\u{2fffe}\u{2ffff}\u{3fffe}\u{3ffff}\u{4fffe}\u{4ffff}\u{5fffe}\u{5ffff}\u{6fffe}\u{6ffff}\u{7fffe}\u{7ffff}\u{8fffe}\u{8ffff}\u{9fffe}\u{9ffff}\u{afffe}\u{affff}\u{bfffe}\u{bffff}\u{cfffe}\u{cffff}\u{dfffe}\u{dffff}\u{efffe}\u{effff}\u{ffffe}\u{fffff}\u{10fffe}\u{10ffff}]+/u
REGEX_PROTOCOL =

Matches an attribute value that could be treated by a browser as a URL with a protocol prefix, such as "http:" or "javascript:". Any string of zero or more characters followed by a colon is considered a match, even if the colon is encoded as an entity and even if it's an incomplete entity (which IE6 and Opera will still parse).

/\A\s*([^\/#]*?)(?:\:|&#0*58|&#x0*3a)/i
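To make the matching rule concrete, here is a minimal sketch that copies the pattern above into a local constant (so it runs without loading the gem) and probes it with a few attribute values. The helper name `protocol_of` is hypothetical and not part of Sanitize.

```ruby
# Local copy of the pattern documented above, so the snippet runs standalone.
REGEX_PROTOCOL = /\A\s*([^\/#]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical helper: returns the downcased protocol prefix, or nil when the
# value has no protocol-like prefix (for example a relative URL).
def protocol_of(value)
  match = REGEX_PROTOCOL.match(value)
  match && match[1].downcase
end

protocol_of('http://example.com/')     # => "http"
protocol_of('  JavaScript:alert(1)')   # => "javascript"
protocol_of('javascript&#58;alert(1)') # => "javascript" (entity-encoded colon)
protocol_of('/path/with:colon')        # => nil (a "/" precedes the colon)
protocol_of('#fragment:colon')         # => nil (a "#" precedes the colon)
```

The capture group excludes `/` and `#`, so a colon that appears after a path separator or fragment marker is not mistaken for a protocol delimiter.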
REGEX_UNSUITABLE_CHARS =

Matches one or more characters that should be stripped from HTML before parsing. This is a combination of REGEX_HTML_CONTROL_CHARACTERS and REGEX_HTML_NON_CHARACTERS.

https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream

/(?:#{REGEX_HTML_CONTROL_CHARACTERS}|#{REGEX_HTML_NON_CHARACTERS})/u
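As a rough illustration of how the combined pattern behaves, the sketch below rebuilds it from local copies of the two component patterns and strips matches with `gsub`. The constant names here are local stand-ins, and the non-character pattern is deliberately abbreviated; it is an assumption of this sketch that stripping is done by replacing matches with an empty string.

```ruby
# Local copies of the two component patterns. The non-character pattern is
# abbreviated to its BMP ranges for brevity; the real constant also covers the
# U+nFFFE/U+nFFFF pairs in every supplementary plane.
CONTROL_CHARS  = /[\u0001-\u0008\u000b\u000e-\u001f\u007f-\u009f]+/u
NON_CHARACTERS = /[\ufdd0-\ufdef\ufffe\uffff]+/u
UNSUITABLE     = /(?:#{CONTROL_CHARS}|#{NON_CHARACTERS})/u

dirty = "a\u0001b\tc\u007fd\ufdd0e\n"
clean = dirty.gsub(UNSUITABLE, '')

# Tab and newline survive; U+0001, U+007F and U+FDD0 are removed.
clean == "ab\tcde\n" # => true
```

Note that ordinary whitespace such as tab (U+0009), line feed (U+000A), form feed (U+000C), and carriage return (U+000D) falls outside the control-character ranges and is left untouched.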
VERSION =
'6.1.0'

Constructor Details

#initialize(config = {}) ⇒ Sanitize

Returns a new Sanitize object initialized with the settings in config.



# File 'lib/sanitize.rb', line 92

def initialize(config = {})
  @config = Config.merge(Config::DEFAULT, config)

  @transformers = Array(@config[:transformers]).dup

  # Default transformers always run at the end of the chain, after any custom
  # transformers.
  @transformers << Transformers::CleanElement.new(@config)
  @transformers << Transformers::CleanComment unless @config[:allow_comments]

  if @config[:elements].include?('style')
    scss = Sanitize::CSS.new(config)
    @transformers << Transformers::CSS::CleanElement.new(scss)
  end

  if @config[:attributes].values.any? {|attr| attr.include?('style') }
    scss ||= Sanitize::CSS.new(config)
    @transformers << Transformers::CSS::CleanAttribute.new(scss)
  end

  @transformers << Transformers::CleanDoctype
  @transformers << Transformers::CleanCDATA

  @transformer_config = { config: @config }
end
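The order in which the constructor assembles the transformer chain can be modeled with a small standalone sketch. Symbols stand in for the real transformer objects, and the method name `transformer_chain` is hypothetical; this is a simplified illustration of the ordering, not the library's implementation.

```ruby
# Simplified model of the ordering logic in #initialize. Symbols stand in for
# the real transformer objects; this is not Sanitize's actual code path.
def transformer_chain(config)
  chain = Array(config[:transformers]).dup # custom transformers come first

  chain << :clean_element
  chain << :clean_comment unless config[:allow_comments]
  chain << :css_clean_element if config[:elements].include?('style')

  if config[:attributes].values.any? { |attrs| attrs.include?('style') }
    chain << :css_clean_attribute
  end

  chain << :clean_doctype << :clean_cdata
end

config = {
  transformers:   [:my_custom_transformer],
  allow_comments: false,
  elements:       %w[p style],
  attributes:     { 'p' => %w[class style] }
}

transformer_chain(config)
# => [:my_custom_transformer, :clean_element, :clean_comment,
#     :css_clean_element, :css_clean_attribute, :clean_doctype, :clean_cdata]
```

Because custom transformers are copied in first and the defaults are appended afterwards, a custom transformer always sees each node before the built-in cleaners do.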

Instance Attribute Details

#config ⇒ Object (readonly)

Returns the value of attribute config.



# File 'lib/sanitize.rb', line 20

def config
  @config
end

Class Method Details

.clean ⇒ Object

Deprecated. Use .fragment instead.

Returns a sanitized copy of the given html fragment, using the settings in config if specified.



# File 'lib/sanitize.rb', line 81

def self.fragment(html, config = {})
  Sanitize.new(config).fragment(html)
end

.clean_document ⇒ Object

Deprecated. Use .document instead.

Returns a sanitized copy of the given full html document, using the settings in config if specified.

When sanitizing a document, the <html> element must be allowlisted or an error will be raised. If this is undesirable, you should probably use #fragment instead.



# File 'lib/sanitize.rb', line 78

def self.document(html, config = {})
  Sanitize.new(config).document(html)
end

.clean_node! ⇒ Object

Deprecated. Use .node! instead.

Sanitizes the given Nokogiri::XML::Node instance and all its children.



# File 'lib/sanitize.rb', line 84

def self.node!(node, config = {})
  Sanitize.new(config).node!(node)
end

.document(html, config = {}) ⇒ Object

Returns a sanitized copy of the given full html document, using the settings in config if specified.

When sanitizing a document, the <html> element must be allowlisted or an error will be raised. If this is undesirable, you should probably use #fragment instead.



# File 'lib/sanitize.rb', line 60

def self.document(html, config = {})
  Sanitize.new(config).document(html)
end

.fragment(html, config = {}) ⇒ Object

Returns a sanitized copy of the given html fragment, using the settings in config if specified.



# File 'lib/sanitize.rb', line 66

def self.fragment(html, config = {})
  Sanitize.new(config).fragment(html)
end

.node!(node, config = {}) ⇒ Object

Sanitizes the given Nokogiri::XML::Node instance and all its children.



# File 'lib/sanitize.rb', line 71

def self.node!(node, config = {})
  Sanitize.new(config).node!(node)
end

Instance Method Details

#document(html) ⇒ Object

Also known as: clean_document

Returns a sanitized copy of the given html document.

When sanitizing a document, the <html> element must be allowlisted or an error will be raised. If this is undesirable, you should probably use #fragment instead.



# File 'lib/sanitize.rb', line 123

def document(html)
  return '' unless html

  doc = Nokogiri::HTML5.parse(preprocess(html), **@config[:parser_options])
  node!(doc)
  to_html(doc)
end

#fragment(html) ⇒ Object

Also known as: clean

Returns a sanitized copy of the given html fragment.



# File 'lib/sanitize.rb', line 135

def fragment(html)
  return '' unless html

  frag = Nokogiri::HTML5.fragment(preprocess(html), **@config[:parser_options])
  node!(frag)
  to_html(frag)
end

#node!(node) ⇒ Object

Also known as: clean_node!

Sanitizes the given Nokogiri::XML::Node and all its children, modifying it in place.

If node is a Nokogiri::XML::Document, the <html> element must be allowlisted or an error will be raised.

Raises:

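  • (ArgumentError) if node is not a Nokogiri::XML::Node
  • (Sanitize::Error) if node is a Nokogiri::XML::Document and the <html> element is not allowlisted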


# File 'lib/sanitize.rb', line 151

def node!(node)
  raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)

  if node.is_a?(Nokogiri::XML::Document)
    unless @config[:elements].include?('html')
      raise Error, 'When sanitizing a document, "<html>" must be allowlisted.'
    end
  end

  node_allowlist = Set.new

  traverse(node) do |n|
    transform_node!(n, node_allowlist)
  end

  node
end