Class: Sanitize

Inherits:
Object
Defined in:
lib/sanitize.rb,
lib/sanitize/css.rb,
lib/sanitize/config.rb,
lib/sanitize/version.rb,
lib/sanitize/config/basic.rb,
lib/sanitize/config/default.rb,
lib/sanitize/config/relaxed.rb,
lib/sanitize/config/restricted.rb,
lib/sanitize/transformers/clean_css.rb,
lib/sanitize/transformers/clean_cdata.rb,
lib/sanitize/transformers/clean_comment.rb,
lib/sanitize/transformers/clean_doctype.rb,
lib/sanitize/transformers/clean_element.rb
more...

Defined Under Namespace

Modules: Config, Transformers
Classes: CSS, Error

Constant Summary

REGEX_HTML_CONTROL_CHARACTERS =

Matches one or more control characters that should be removed from HTML before parsing, as defined by the HTML living standard.

/[\u0001-\u0008\u000b\u000e-\u001f\u007f-\u009f]+/u
REGEX_HTML_NON_CHARACTERS =

Matches one or more non-characters that should be removed from HTML before parsing, as defined by the HTML living standard.

/[\ufdd0-\ufdef\ufffe\uffff\u{1fffe}\u{1ffff}\u{2fffe}\u{2ffff}\u{3fffe}\u{3ffff}\u{4fffe}\u{4ffff}\u{5fffe}\u{5ffff}\u{6fffe}\u{6ffff}\u{7fffe}\u{7ffff}\u{8fffe}\u{8ffff}\u{9fffe}\u{9ffff}\u{afffe}\u{affff}\u{bfffe}\u{bffff}\u{cfffe}\u{cffff}\u{dfffe}\u{dffff}\u{efffe}\u{effff}\u{ffffe}\u{fffff}\u{10fffe}\u{10ffff}]+/u
REGEX_PROTOCOL =

Matches an attribute value that could be treated by a browser as a URL with a protocol prefix, such as "http:" or "javascript:". Any string of zero or more characters followed by a colon is considered a match, even if the colon is encoded as an entity and even if it's an incomplete entity (which IE6 and Opera will still parse).

/\A\s*([^\/#]*?)(?:\:|&#0*58|&#x0*3a)/i
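
As a rough illustration of what this pattern captures, the sketch below copies the regex from the constant above (so it runs without the gem) and wraps it in a hypothetical helper, `protocol_of`, which is not part of Sanitize. It assumes the first capture group is the candidate protocol name.

```ruby
# Copied from the constant above so the snippet runs without the gem.
REGEX_PROTOCOL = /\A\s*([^\/#]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical helper (not part of Sanitize): returns the lowercased text
# before the first colon, or nil when the value has no protocol-like prefix.
def protocol_of(value)
  match = REGEX_PROTOCOL.match(value)
  match && match[1].downcase
end

protocol_of('http://example.com/')       # => "http"
protocol_of('  JavaScript:alert(1)')     # => "javascript"
protocol_of('javascript&#58;alert(1)')   # => "javascript" (entity-encoded colon)
protocol_of('javascript&#x3a;alert(1)')  # => "javascript" (hex entity, no semicolon needed)
protocol_of('/path/with:colon')          # => nil ("/" may not precede the colon)
protocol_of('#fragment:part')            # => nil ("#" may not precede the colon)
protocol_of(':no-name')                  # => "" (zero characters before the colon)
```

A caller could then compare the returned prefix against an allowlist of protocols; values that return nil can be treated as relative URLs.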
REGEX_UNSUITABLE_CHARS =

Matches one or more characters that should be stripped from HTML before parsing. This is a combination of REGEX_HTML_CONTROL_CHARACTERS and REGEX_HTML_NON_CHARACTERS.

https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream

/(?:#{REGEX_HTML_CONTROL_CHARACTERS}|#{REGEX_HTML_NON_CHARACTERS})/u
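
The sketch below shows how the combined pattern could be used to strip unsuitable characters before parsing. It runs without the gem: the control-character regex is copied from above, the non-character regex is abbreviated to its BMP ranges for readability, and `strip_unsuitable` is a hypothetical helper rather than Sanitize's own preprocessing method.

```ruby
# Copied from REGEX_HTML_CONTROL_CHARACTERS above.
CONTROL_CHARS = /[\u0001-\u0008\u000b\u000e-\u001f\u007f-\u009f]+/u

# Abbreviated stand-in for REGEX_HTML_NON_CHARACTERS: only the BMP
# non-characters. The real constant also lists U+1FFFE through U+10FFFF.
NON_CHARS = /[\ufdd0-\ufdef\ufffe\uffff]+/u

# Built the same way as REGEX_UNSUITABLE_CHARS: interpolate both patterns.
UNSUITABLE_CHARS = /(?:#{CONTROL_CHARS}|#{NON_CHARS})/u

# Hypothetical helper (not part of Sanitize): deletes every match.
def strip_unsuitable(html)
  html.gsub(UNSUITABLE_CHARS, '')
end

strip_unsuitable("a\u0001b\u007fc\ufffed")  # => "abcd"
strip_unsuitable("a\tb\nc")                 # => "a\tb\nc" (tab and newline are kept)
```

Note that tab (U+0009), line feed (U+000A), form feed (U+000C), and carriage return (U+000D) fall outside the control-character ranges, so ordinary whitespace survives.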
VERSION =
'6.1.3'

Constructor Details

#initialize(config = {}) ⇒ Sanitize

Returns a new Sanitize object initialized with the settings in config.

# File 'lib/sanitize.rb', line 92

def initialize(config = {})
  @config = Config.merge(Config::DEFAULT, config)

  @transformers = Array(@config[:transformers]).dup

  # Default transformers always run at the end of the chain, after any custom
  # transformers.
  @transformers << Transformers::CleanElement.new(@config)
  @transformers << Transformers::CleanComment unless @config[:allow_comments]

  if @config[:elements].include?('style')
    scss = Sanitize::CSS.new(config)
    @transformers << Transformers::CSS::CleanElement.new(scss)
  end

  if @config[:attributes].values.any? {|attr| attr.include?('style') }
    scss ||= Sanitize::CSS.new(config)
    @transformers << Transformers::CSS::CleanAttribute.new(scss)
  end

  @transformers << Transformers::CleanDoctype
  @transformers << Transformers::CleanCDATA

  @transformer_config = { config: @config }
end
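
To make the resulting order easier to see, here is a simplified model of the chain-building logic above. It is only a sketch: `transformer_order` is a hypothetical function, symbols stand in for the real transformer objects, and the config hash is assumed to already contain `:elements` and `:attributes` (as it would after merging with the defaults).

```ruby
# Hypothetical model of the ordering in #initialize. Symbols stand in for
# the real transformer objects; nothing here touches the Sanitize gem.
def transformer_order(config)
  # Custom transformers come first; Array() also accepts nil or a single item.
  chain = Array(config[:transformers]).dup

  chain << :clean_element
  chain << :clean_comment unless config[:allow_comments]

  # CSS transformers are added only when 'style' is allowlisted.
  chain << :css_clean_element if config[:elements].include?('style')
  if config[:attributes].values.any? { |attrs| attrs.include?('style') }
    chain << :css_clean_attribute
  end

  chain << :clean_doctype
  chain << :clean_cdata
  chain
end

config = {
  transformers: [:my_custom_transformer],
  allow_comments: false,
  elements: ['p'],
  attributes: { 'p' => ['style'] }
}

transformer_order(config)
# => [:my_custom_transformer, :clean_element, :clean_comment,
#     :css_clean_attribute, :clean_doctype, :clean_cdata]
```

The practical consequence is that a custom transformer always sees a node before the built-in cleaners do.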

Instance Attribute Details

#config ⇒ Object (readonly)

Returns the value of attribute config.

# File 'lib/sanitize.rb', line 20

def config
  @config
end

Class Method Details

.clean ⇒ Object

Deprecated. Use .fragment instead.

Returns a sanitized copy of the given html fragment, using the settings in config if specified.

# File 'lib/sanitize.rb', line 81

def self.fragment(html, config = {})
  Sanitize.new(config).fragment(html)
end

.clean_document ⇒ Object

Deprecated. Use .document instead.

Returns a sanitized copy of the given full html document, using the settings in config if specified.

When sanitizing a document, the <html> element must be allowlisted or an error will be raised. If you do not want to allowlist <html>, use .fragment instead.

# File 'lib/sanitize.rb', line 78

def self.document(html, config = {})
  Sanitize.new(config).document(html)
end

.clean_node! ⇒ Object

Deprecated. Use .node! instead.

Sanitizes the given Nokogiri::XML::Node instance and all its children.

# File 'lib/sanitize.rb', line 84

def self.node!(node, config = {})
  Sanitize.new(config).node!(node)
end

.document(html, config = {}) ⇒ Object

Returns a sanitized copy of the given full html document, using the settings in config if specified.

When sanitizing a document, the <html> element must be allowlisted or an error will be raised. If you do not want to allowlist <html>, use .fragment instead.

# File 'lib/sanitize.rb', line 60

def self.document(html, config = {})
  Sanitize.new(config).document(html)
end

.fragment(html, config = {}) ⇒ Object

Returns a sanitized copy of the given html fragment, using the settings in config if specified.

# File 'lib/sanitize.rb', line 66

def self.fragment(html, config = {})
  Sanitize.new(config).fragment(html)
end

.node!(node, config = {}) ⇒ Object

Sanitizes the given Nokogiri::XML::Node instance and all its children.

# File 'lib/sanitize.rb', line 71

def self.node!(node, config = {})
  Sanitize.new(config).node!(node)
end

Instance Method Details

#document(html) ⇒ Object
Also known as: clean_document

Returns a sanitized copy of the given html document.

When sanitizing a document, the <html> element must be allowlisted or an error will be raised. If you do not want to allowlist <html>, use #fragment instead.

# File 'lib/sanitize.rb', line 123

def document(html)
  return '' unless html

  doc = Nokogiri::HTML5.parse(preprocess(html), **@config[:parser_options])
  node!(doc)
  to_html(doc)
end

#fragment(html) ⇒ Object
Also known as: clean

Returns a sanitized copy of the given html fragment.

# File 'lib/sanitize.rb', line 135

def fragment(html)
  return '' unless html

  frag = Nokogiri::HTML5.fragment(preprocess(html), **@config[:parser_options])
  node!(frag)
  to_html(frag)
end

#node!(node) ⇒ Object
Also known as: clean_node!

Sanitizes the given Nokogiri::XML::Node and all its children, modifying it in place.

If node is a Nokogiri::XML::Document, the <html> element must be allowlisted or an error will be raised.

Raises:

  • (ArgumentError) if node is not a Nokogiri::XML::Node
  • (Error) if node is a Nokogiri::XML::Document and <html> is not allowlisted

# File 'lib/sanitize.rb', line 151

def node!(node)
  raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)

  if node.is_a?(Nokogiri::XML::Document)
    unless @config[:elements].include?('html')
      raise Error, 'When sanitizing a document, "<html>" must be allowlisted.'
    end
  end

  node_allowlist = Set.new

  traverse(node) do |n|
    transform_node!(n, node_allowlist)
  end

  node
end
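
The two failure modes of #node! can be modeled without Nokogiri. In the sketch below every name is hypothetical: `FakeNode` and `FakeDocument` stand in for Nokogiri::XML::Node and Nokogiri::XML::Document, `FakeSanitizeError` stands in for Sanitize::Error, and `check_node!` reproduces only the guard clauses, not the traversal.

```ruby
# Hypothetical stand-ins; none of these classes exist in Sanitize or Nokogiri.
class FakeNode; end
class FakeDocument < FakeNode; end
class FakeSanitizeError < StandardError; end

# Mirrors only the guard clauses at the top of #node!.
def check_node!(node, allowed_elements)
  raise ArgumentError unless node.is_a?(FakeNode)

  if node.is_a?(FakeDocument) && !allowed_elements.include?('html')
    raise FakeSanitizeError,
          'When sanitizing a document, "<html>" must be allowlisted.'
  end

  node # the same object is returned, as in #node!
end

fragment_node = FakeNode.new
check_node!(fragment_node, ['p'])            # passes: plain nodes need no <html>
check_node!(FakeDocument.new, %w[html p])    # passes: <html> is allowlisted

begin
  check_node!(FakeDocument.new, ['p'])
rescue FakeSanitizeError => e
  e.message # => 'When sanitizing a document, "<html>" must be allowlisted.'
end
```

Callers that accept arbitrary input may therefore want to rescue both error types, or check the node's class before calling.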