Class: DiscourseDiff::HtmlTokenizer
- Inherits: Nokogiri::XML::SAX::Document
- Ancestors: Object > Nokogiri::XML::SAX::Document > DiscourseDiff::HtmlTokenizer
- Defined in: lib/discourse_diff.rb
Constant Summary
- USELESS_TAGS = %w[html body]
- AUTOCLOSING_TAGS = %w[area base br col embed hr img input meta]
Instance Attribute Summary
- #tokens ⇒ Object
  Returns the value of attribute tokens.
Class Method Summary
- .tokenize(html) ⇒ Object
Instance Method Summary
- #characters(string) ⇒ Object
- #end_element(name) ⇒ Object
- #initialize ⇒ HtmlTokenizer (constructor)
  A new instance of HtmlTokenizer.
- #start_element(name, attributes = []) ⇒ Object
Constructor Details
#initialize ⇒ HtmlTokenizer
Returns a new instance of HtmlTokenizer.
# File 'lib/discourse_diff.rb', line 274
def initialize
  @tokens = []
end
Instance Attribute Details
#tokens ⇒ Object
Returns the value of attribute tokens.
# File 'lib/discourse_diff.rb', line 272
def tokens
  @tokens
end
Class Method Details
.tokenize(html) ⇒ Object
# File 'lib/discourse_diff.rb', line 278
def self.tokenize(html)
  me = new
  parser = Nokogiri::HTML::SAX::Parser.new(me)
  parser.parse("<html><body>#{html}</body></html>")
  me.tokens
end
Instance Method Details
#characters(string) ⇒ Object
# File 'lib/discourse_diff.rb', line 298
def characters(string)
  @tokens.concat string.scan(/\W|\w+[ \t]*/).map { |x| CGI.escapeHTML(x) }
end
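The splitting rule in #characters can be tried without Nokogiri. The sketch below is only an illustration: `split_text` is a hypothetical helper, not part of DiscourseDiff, that applies the same regular expression and escaping to a plain string. Each run of word characters keeps its trailing spaces or tabs, and every other character becomes its own token.

```ruby
require "cgi"

# Hypothetical helper (not in DiscourseDiff): mirrors the body of #characters
# so the splitting rule can be inspected on its own.
def split_text(string)
  # \W matches one non-word character; \w+[ \t]* matches a word plus any
  # spaces or tabs that directly follow it.
  string.scan(/\W|\w+[ \t]*/).map { |x| CGI.escapeHTML(x) }
end

tokens = split_text("Hello world, a<b")
p tokens
# => ["Hello ", "world", ",", " ", "a", "&lt;", "b"]
```

Because whitespace stays attached to the preceding word, joining the tokens gives back the input text with its HTML special characters escaped.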
#end_element(name) ⇒ Object
# File 'lib/discourse_diff.rb', line 293
def end_element(name)
  return if USELESS_TAGS.include?(name) || AUTOCLOSING_TAGS.include?(name)
  @tokens << "</#{name}>"
end
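As a rough illustration of the filter in #end_element, the sketch below copies the two constants into a hypothetical module, `EndElementSketch`, and returns the closing token instead of appending it to @tokens. The module and the `closing_token` name are assumptions made for the example.

```ruby
# Hypothetical stand-in for the tokenizer, for illustration only.
module EndElementSketch
  USELESS_TAGS = %w[html body]
  AUTOCLOSING_TAGS = %w[area base br col embed hr img input meta]

  # Returns the closing-tag token, or nil when the tag is skipped.
  def self.closing_token(name)
    return if USELESS_TAGS.include?(name) || AUTOCLOSING_TAGS.include?(name)
    "</#{name}>"
  end
end

closing = %w[p br body em].map { |name| EndElementSketch.closing_token(name) }
p closing
# => ["</p>", nil, nil, "</em>"]
```

No closing token is produced for `br` because it is a void element, or for `body` because it is part of the wrapper that .tokenize adds around the input.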
#start_element(name, attributes = []) ⇒ Object
# File 'lib/discourse_diff.rb', line 286
def start_element(name, attributes = [])
  return if USELESS_TAGS.include?(name)
  attrs = attributes.map { |a| " #{a[0]}=\"#{CGI.escapeHTML(a[1])}\"" }.join
  @tokens << "<#{name}#{attrs}>"
end
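The attribute serialization in #start_element can be sketched the same way. The example assumes attributes arrive as an array of `[name, value]` pairs, which is how the method body indexes them. `StartElementSketch` and `opening_token` are hypothetical names used only here.

```ruby
require "cgi"

# Hypothetical stand-in for the tokenizer, for illustration only.
module StartElementSketch
  USELESS_TAGS = %w[html body]

  # Returns the opening-tag token, or nil for the wrapper tags.
  def self.opening_token(name, attributes = [])
    return if USELESS_TAGS.include?(name)
    # Each [name, value] pair becomes ` name="escaped value"`.
    attrs = attributes.map { |a| " #{a[0]}=\"#{CGI.escapeHTML(a[1])}\"" }.join
    "<#{name}#{attrs}>"
  end
end

link = StartElementSketch.opening_token("a", [["href", "/t?x=1&y=2"], ["title", "say \"hi\""]])
puts link
# prints: <a href="/t?x=1&amp;y=2" title="say &quot;hi&quot;">
```

Unlike #end_element, this method does not consult AUTOCLOSING_TAGS, so a void element such as `br` still yields an opening token (`<br>`) and no closing one.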