taggie

The tiniest little HTML/XML parser…using regex

gem install taggie --pre

WTF, why regex?!?

Curiosity, regex practice, and proof that it could be done. If you’re interested, here’s the beast of a regex that parses arbitrarily nested tags:

/(<(\w+)[^>]*(?:\/>|>((?:<(\w+)[^>]*(?:\/>|>.*<\/\4>)|<!--.*?-->|<\?.*?\?>|[^>])*)<\/\2>)|<!--.*?-->|<\?.*?\?>|[^>]*)/m
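To see what the regex captures, here is a small standalone sketch that runs it against a nested snippet without going through taggie at all. The variable names `tag_pattern` and `match` are just for illustration. The group meanings in the comments are read off the pattern itself, not taken from taggie's source.

```ruby
# The regex from above, used directly (no taggie required).
tag_pattern = /(<(\w+)[^>]*(?:\/>|>((?:<(\w+)[^>]*(?:\/>|>.*<\/\4>)|<!--.*?-->|<\?.*?\?>|[^>])*)<\/\2>)|<!--.*?-->|<\?.*?\?>|[^>]*)/m

match = tag_pattern.match('<div id="a"><p>hi</p></div>')

puts match[1]   # the whole element: <div id="a"><p>hi</p></div>
puts match[2]   # the tag name:      div
puts match[3]   # the inner HTML:    <p>hi</p>

# A self-closing tag matches the "/>" branch, so there is no inner HTML group.
img = tag_pattern.match('<img src="logo.png" />')
puts img[2]          # img
puts img[3].inspect  # nil
```

The `\2` backreference is what ties a closing tag to its opening tag name, and the `/m` flag lets `.` span newlines.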

Examples (these may not all work yet - work in progress)

html = '<div id="header"><img src="logo.png" /><h1>Your Company</h1></div><div id="body"><p class="content">some <span>content</span> here</p></div>'.to_taggie
puts html.type                                # div
puts html.tag                                 # <div id="header">
puts html.inner_html                          # <img src="logo.png" /><h1>Your Company</h1>

puts html.children.first.src                  # logo.png
html.children.first.src = '/images/logo.png'
puts html.inner_html                          # <img src="/images/logo.png" /><h1>Your Company</h1>

p = html.siblings.first.children.first
puts p.tag                                    # <p class="content">

p.id = 'content'
puts html.siblings.first.children.first       # <p class="content" id="content">some <span>content</span> here</p>

p.class = nil
puts html.siblings.first.children.first       # <p id="content">some <span>content</span> here</p>

p.class = ''
puts html.siblings.first.children.first       # <p id="content" class="">some <span>content</span> here</p>
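The attribute writer semantics in the examples above (a string sets the attribute, `nil` removes it, `''` keeps it with an empty value) can be sketched in a few lines of plain Ruby. This is not taggie's implementation: `set_attribute` is a hypothetical helper that works on an opening tag string, and it assumes double-quoted attribute values.

```ruby
# Hypothetical helper, not part of taggie: rewrite one attribute in an
# opening tag string. Assumes attribute values are double-quoted.
def set_attribute(tag, name, value)
  existing = /\s+#{Regexp.escape(name)}="[^"]*"/
  stripped = tag.sub(existing, '')     # drop any existing copy of the attribute
  return stripped if value.nil?        # nil means "remove the attribute"
  # Re-insert the attribute just before the closing ">" or " />".
  stripped.sub(/(\s*\/?>)\z/) { %( #{name}="#{value}"#{$1}) }
end

puts set_attribute('<p class="content">', 'id', 'content')
# <p class="content" id="content">
puts set_attribute('<p class="content" id="content">', 'class', nil)
# <p id="content">
puts set_attribute('<p id="content">', 'class', '')
# <p id="content" class="">
```

A rewritten attribute moves to the end of the tag, and values are not escaped, so treat this as a starting point only.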

TODO

  • attribute writer is broken

  • lib/taggie_unabridged.rb

  • tests

Note on Patches/Pull Requests

  • Fork the project.

  • Make your feature addition or bug fix.

  • Add tests for it. This is important so I don’t break it in a future version unintentionally.

  • Commit, do not mess with rakefile, version, or history. (If you want to have your own version, that is fine, but bump the version in a commit by itself so I can ignore it when I pull.)
    
  • Send me a pull request. Bonus points for topic branches.