Class: FeedNormalizer::HtmlCleaner
Inherits: Object
Defined in: lib/html-cleaner.rb
Overview
Various methods for cleaning up HTML and preparing it for safe public consumption.
Documents used for reference:
Constant Summary
- HTML_ELEMENTS =
Allowed HTML elements.
%w( a abbr acronym address area b bdo big blockquote br button caption center cite code col colgroup dd del dfn dir div dl dt em fieldset font h1 h2 h3 h4 h5 h6 hr i img ins kbd label legend li map menu ol optgroup p pre q s samp small span strike strong sub sup table tbody td tfoot th thead tr tt u ul var )
- HTML_ATTRS =
Allowed attributes.
%w( abbr accept accept-charset accesskey align alt axis border cellpadding cellspacing char charoff charset checked cite class clear cols colspan color compact coords datetime dir disabled for frame headers height href hreflang hspace id ismap label lang longdesc maxlength media method multiple name nohref noshade nowrap readonly rel rev rows rowspan rules scope selected shape size span src start summary tabindex target title type usemap valign value vspace width )
- HTML_URI_ATTRS =
Allowed attributes that can contain URIs; extra caution required. Note: this doesn't list all URI attributes, just the ones that are allowed.
%w( href src cite usemap longdesc )
- DODGY_URI_SCHEMES =
%w( javascript vbscript mocha livescript data )
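The whitelists above are plain Ruby arrays, so the cleaner can split the tags found in a document into "remove" and "keep" groups with ordinary array set arithmetic. A minimal sketch, using an abbreviated stand-in for HTML_ELEMENTS:

```ruby
# Abbreviated stand-in for HTML_ELEMENTS; the real whitelist is much longer.
allowed = %w( a b blockquote code em strong )

# Tag names discovered while walking a parsed document.
found = %w( a script b iframe )

to_remove = found - allowed   # tags to strip from the tree entirely
to_keep   = found & allowed   # tags whose attributes still need scrubbing

p to_remove  # => ["script", "iframe"]
p to_keep    # => ["a", "b"]
```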
Class Method Summary
-
.add_entities(str) ⇒ Object
Adds entities where possible.
-
.clean(str) ⇒ Object
Does this: - Unescape HTML - Parse HTML into tree - Find ‘body’ if present, and extract tree inside that tag, otherwise parse whole tree - Each tag: - remove tag if not whitelisted - escape HTML tag contents - remove all attributes not on whitelist - extra-scrub URI attrs; see dodgy_uri?.
-
.dodgy_uri?(uri) ⇒ Boolean
Returns true if the given string contains a suspicious URL, i.e. a javascript link.
-
.flatten(str) ⇒ Object
For all other feed elements: - Unescape HTML.
-
.unescapeHTML(str, xml = true) ⇒ Object
unescapes HTML.
Class Method Details
.add_entities(str) ⇒ Object
Adds entities where possible. Works like CGI.escapeHTML, but will not escape existing entities; i.e. &#123; will NOT become &amp;#123;
This method could be improved by adding a whitelist of html entities.
# File 'lib/html-cleaner.rb', line 152

def add_entities(str)
  str.to_s.gsub(/\"/n, '&quot;').gsub(/>/n, '&gt;').gsub(/</n, '&lt;').gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/nmi, '&amp;')
end
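As a standalone illustration, the same gsub chain can be run directly. This sketch is not the gem's code as-is: the method is defined at top level and the /n encoding flags are dropped. Note how existing entities survive while bare characters are escaped:

```ruby
# Simplified, top-level re-implementation of the add_entities gsub chain
# (the /n regexp encoding flags from the original are omitted here).
def add_entities(str)
  str.to_s
     .gsub(/"/, '&quot;')
     .gsub(/>/, '&gt;')
     .gsub(/</, '&lt;')
     .gsub(/&(?!(\#\d+|\#x([0-9a-f]+)|\w{2,8});)/mi, '&amp;') # '&' only when not already part of an entity
end

puts add_entities('x < y & &quot;q&quot; &amp; &#38;')
# => x &lt; y &amp; &quot;q&quot; &amp; &#38;
```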
.clean(str) ⇒ Object
Does this:
-
Unescape HTML
-
Parse HTML into tree
-
Find ‘body’ if present, and extract tree inside that tag, otherwise parse whole tree
-
Each tag:
-
remove tag if not whitelisted
-
escape HTML tag contents
-
remove all attributes not on whitelist
-
extra-scrub URI attrs; see dodgy_uri?
-
Extra (i.e. unmatched) ending tags and comments are removed.
# File 'lib/html-cleaner.rb', line 60

def clean(str)
  str = unescapeHTML(str)

  doc = Hpricot(str, :fixup_tags => true)
  doc = subtree(doc, :body)

  # get all the tags in the document
  # Somewhere near hpricot 0.4.92 "*" starting to return all elements,
  # including text nodes instead of just tagged elements.
  tags = (doc/"*").inject([]) { |m,e| m << e.name if(e.respond_to?(:name) && e.name =~ /^\w+$/) ; m }.uniq

  # Remove tags that aren't whitelisted.
  remove_tags!(doc, tags - HTML_ELEMENTS)
  tags = tags & HTML_ELEMENTS

  # Remove attributes that aren't on the whitelist, or are suspicious URLs.
  (doc/tags.join(",")).each do |element|
    next if element.raw_attributes.nil? || element.raw_attributes.empty?

    element.raw_attributes.reject! do |attr,val|
      !HTML_ATTRS.include?(attr) || (HTML_URI_ATTRS.include?(attr) && dodgy_uri?(val))
    end

    element.raw_attributes = element.raw_attributes.build_hash {|a,v| [a, add_entities(v)]}
  end unless tags.empty?

  doc.traverse_text do |t|
    t.swap(add_entities(t.to_html))
  end

  # Return the tree, without comments. Ugly way of removing comments,
  # but can't see a way to do this in Hpricot yet.
  doc.to_s.gsub(/<\!--.*?-->/mi, '')
end
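The attribute pass in the middle of clean can be illustrated without Hpricot: given a hash of raw attributes, anything off the whitelist is dropped, and whitelisted URI attributes are additionally screened. This sketch uses abbreviated whitelists and a deliberately trivial stand-in for dodgy_uri?:

```ruby
# Abbreviated stand-ins for HTML_ATTRS / HTML_URI_ATTRS.
ALLOWED_ATTRS = %w( href title class alt )
URI_ATTRS     = %w( href )

# Deliberately trivial stand-in for dodgy_uri?.
def suspicious?(val)
  val.to_s =~ /\Ajavascript:/i ? true : false
end

def scrub_attributes(raw_attributes)
  raw_attributes.reject do |attr, val|
    !ALLOWED_ATTRS.include?(attr) ||
      (URI_ATTRS.include?(attr) && suspicious?(val))
  end
end

p scrub_attributes(
  "href"    => "javascript:alert(1)",  # whitelisted name, but suspicious URI
  "onclick" => "steal()",              # attribute not on the whitelist
  "title"   => "ok"                    # survives
)
# => {"title"=>"ok"}
```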
.dodgy_uri?(uri) ⇒ Boolean
Returns true if the given string contains a suspicious URL, i.e. a javascript link.
This method rejects javascript, vbscript, livescript, mocha and data URLs. It could be refined to only deny dangerous data URLs, however.
# File 'lib/html-cleaner.rb', line 117

def dodgy_uri?(uri)
  uri = uri.to_s

  # special case for poorly-formed entities (missing ';')
  # if these occur *anywhere* within the string, then throw it out.
  return true if (uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/mi)

  # Try escaping as both HTML or URI encodings, and then trying
  # each scheme regexp on each
  [unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
    DODGY_URI_SCHEMES.each do |scheme|

      regexp = "#{scheme}:".gsub(/./) do |char|
        "([\000-\037\177\s]*)#{char}"
      end

      # regexp looks something like
      # /\A([\000-\037\177\s]*)j([\000-\037\177\s]*)a([\000-\037\177\s]*)v([\000-\037\177\s]*)a([\000-\037\177\s]*)s([\000-\037\177\s]*)c([\000-\037\177\s]*)r([\000-\037\177\s]*)i([\000-\037\177\s]*)p([\000-\037\177\s]*)t([\000-\037\177\s]*):/mi
      return true if (unesc_uri =~ %r{\A#{regexp}}mi)
    end
  end

  nil
end
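The scheme check can be exercised standalone with just Ruby's stdlib. In this sketch (not the gem's code verbatim), CGI.unescapeHTML stands in for the class's own unescapeHTML helper, false is returned instead of nil, and the octal escapes are written out as backslash sequences:

```ruby
require 'cgi'

DODGY_URI_SCHEMES = %w( javascript vbscript mocha livescript data )

# Standalone sketch of the dodgy-URI check: CGI.unescapeHTML stands in for
# the class's own unescapeHTML helper; returns false instead of nil.
def dodgy_uri?(uri)
  uri = uri.to_s

  # Poorly-formed entity (missing ';') anywhere in the string: reject outright.
  return true if uri =~ /&\#(\d+|x[0-9a-f]+)[^;\d]/mi

  # Undo both HTML and URI encoding, then try each scheme regexp on each form.
  [CGI.unescapeHTML(uri), CGI.unescape(uri)].each do |unesc_uri|
    DODGY_URI_SCHEMES.each do |scheme|
      # Allow control characters or whitespace between every char of "scheme:".
      regexp = "#{scheme}:".gsub(/./) { |char| "([\\000-\\037\\177\\s]*)#{char}" }
      return true if unesc_uri =~ /\A#{regexp}/mi
    end
  end

  false
end
```

This catches schemes split by whitespace (`"jav\tascript:"`) as well as HTML-entity-encoded ones (`"&#106;avascript:"`), since the string is unescaped before matching.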
.flatten(str) ⇒ Object
For all other feed elements:
-
Unescape HTML.
-
Parse HTML into tree (taking ‘body’ as root, if present)
-
Takes text out of each tag, and escapes HTML.
-
Returns all text concatenated.
# File 'lib/html-cleaner.rb', line 99

def flatten(str)
  str.gsub!("\n", " ")
  str = unescapeHTML(str)

  doc = Hpricot(str, :xhtml_strict => true)
  doc = subtree(doc, :body)

  out = []
  doc.traverse_text {|t| out << add_entities(t.to_html)}

  return out.join
end
.unescapeHTML(str, xml = true) ⇒ Object
unescapes HTML. If xml is true, also converts XML-only named entities to HTML.
# File 'lib/html-cleaner.rb', line 143

def unescapeHTML(str, xml = true)
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end
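The pre-pass exists because &apos; is defined in XML but was historically not handled by CGI.unescapeHTML, so it is first rewritten to the numeric form &#39;, which always unescapes. A standalone sketch (snake_case name, otherwise the same idea; recent Rubies unescape &apos; natively, so the gsub mainly matters on older versions):

```ruby
require 'cgi'

# Standalone sketch: convert the XML-only &apos; to its numeric form before
# handing the string to CGI.unescapeHTML.
def unescape_html(str, xml = true)
  CGI.unescapeHTML(xml ? str.gsub("&apos;", "&#39;") : str)
end

puts unescape_html("it&apos;s &lt;b&gt;")
# => it's <b>
```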