TagTreeScanner

Author

Gavin Kistner ([email protected])

Copyright

Copyright ©2005-2007 Gavin Kistner

License

MIT License

Version

0.8.0 (2007-November-24)

Overview

The TagTreeScanner class provides a generic framework for creating a nested hierarchy of tags and text (like XML or HTML) by parsing text. An example use (and the reason it was written) is to convert a wiki markup syntax into HTML.

Example Usage

require 'tagtreescanner'

class SimpleMarkup < TagTreeScanner
   @root_factory.allows_text = false

   @tag_genres[ :root ] = [ ]

   @tag_genres[ :root ] << TagFactory.new( :paragraph,
      # A line that doesn't have whitespace at the start
      :open_match => /(?=\S)/, :open_requires_bol => true,

      # Close when you see a double return
      :close_match => /\n[ \t]*\n/,
      :allows_text => true,
      :allowed_genre => :inline
   )

   @tag_genres[ :root ] << TagFactory.new( :preformatted,
      # Grab all lines that are indented up until a line that isn't
      :open_match => /((\s+).+?)\n+(?=\S)/m, :open_requires_bol => true,
      :setup => lambda{ |tag, scanner, tagtree|
         # Throw the contents I found into the tag
         # but remove leading whitespace
         tag << scanner[1].gsub( /^#{scanner[2]}/, '' )
      },
      :autoclose => true
   )

   @tag_genres[ :inline ] = [ ]

   @tag_genres[ :inline ] << TagFactory.new( :bold,
      # An asterisk followed by a letter or number
      :open_match => /\*(?=[a-z0-9])/i,

      # Close when I see an asterisk OR a newline coming up
      :close_match => /\*|(?=\n)/,
      :allows_text => true,
      :allowed_genre => :inline
   )

   @tag_genres[ :inline ] << TagFactory.new( :italic,
      # An underscore followed by a letter or number
      :open_match => /_(?=[a-z0-9])/i,

      # Close when I see an underscore OR a newline coming up
      :close_match => /_|(?=\n)/,
      :allows_text => true,
      :allowed_genre => :inline
   )
end

raw_text = <<ENDINPUT
Hello World! You're _soaking in_ my test.
This is a *subset* of markup that I allow.

Hi paragraph two. Yo! A code sample:

  def foo
    puts "Whee!"
  end

_That, as they say, is that._

ENDINPUT

markup = SimpleMarkup.new( raw_text ).to_xml
puts markup

#=> <paragraph>Hello World! You're <italic>soaking in</italic> my test.
#=> This is a <bold>subset</bold> of markup that I allow.</paragraph>
#=> <paragraph>Hi paragraph two. Yo! A code sample:</paragraph>
#=> <preformatted>def foo
#=>   puts "Whee!"
#=> end</preformatted>
#=> <paragraph><italic>That, as they say, is that.</italic></paragraph>

Details

TagFactories at 10,000 feet

Each possible output tag is described by a TagFactory, which specifies some or all of the following:

  • The name of the tags it creates (required)

  • The regular expression to look for to start the tag

  • The regular expression to look for to close the tag, or

  • Whether the tag is automatically closed after creation

  • What genre of tags are allowed within the tag

  • Whether the tag supports raw text inside it

  • Code to run when creating a tag

See the TagFactory class for more information on specifying factories.

Genres as a State Machine

As a new tag is opened, the scanner uses the Tag#allowed_genre property of that tag (set by the allowed_genre property on the TagFactory) to determine which tags to look for. A genre is defined by adding an array to the @tag_genres hash, using the genre name as the key. For example:

@tag_genres[ :inline ] = [ ]

adds a new genre named ‘inline’, with no tags in it. TagFactory instances should be pushed onto this array in the order that they should be looked for. For example:

@tag_genres[ :inline ] << TagFactory.new( :italic,
  # see the TagFactory#initialize for options
)

Note that the close_match regular expression of the current tag is always checked before looking to open/create any new tags.
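The close-before-open rule matters when one character could do either job. The following is a minimal sketch using only StringScanner from the standard library, not TagTreeScanner's actual implementation. It reuses the two :bold patterns from the example above; the depth counter and the events array are hypothetical names introduced for illustration.

```ruby
require 'strscan'

# Patterns copied from the :bold factory in the example above
open_match  = /\*(?=[a-z0-9])/i
close_match = /\*|(?=\n)/

scanner = StringScanner.new( "*a*b" )
events  = []
depth   = 0   # hypothetical counter: how many bold tags are open

until scanner.eos?
  if depth > 0 && scanner.scan( close_match )
    # The current tag's close_match gets the first look
    depth -= 1
    events << :close
  elsif scanner.scan( open_match )
    depth += 1
    events << :open
  else
    # Neither pattern matched here: treat one character as text
    events << scanner.getch
  end
end

p events  #=> [:open, "a", :close, "b"]
```

The second asterisk is followed by a letter, so it satisfies both patterns. Because the close pattern is tried first, it ends the bold run instead of opening a nested one.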

Consuming Text

As the text is being parsed, there will (probably) be many stretches of raw text that neither close the current tag nor open a new one. Whenever the scanner reaches this state, it runs the @text_match regexp against the text to move the pointer ahead. If the current tag has Tag#allows_text? set to true (through TagFactory#allows_text), then this text is added as contents of the tag. If not, the text is thrown away.

The safest regular expression consumes only one character at a time:

@text_match = /./m

It is vital that your regexp match newlines (the ‘m’ flag, which lets ‘.’ match a newline) unless every single one of your tags is set to close upon seeing a newline.
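The effect of the ‘m’ flag can be checked directly with StringScanner from the standard library, independently of TagTreeScanner. A minimal sketch (the variable names are illustrative only):

```ruby
require 'strscan'

# Without the 'm' flag, '.' refuses to match a newline,
# so the scan fails and the pointer does not move.
stuck      = StringScanner.new( "\nnext line" )
no_newline = stuck.scan( /./ )

# With the 'm' flag, '.' matches the newline as well.
moving       = StringScanner.new( "\nnext line" )
with_newline = moving.scan( /./m )

p no_newline,   stuck.pos    #=> nil, 0
p with_newline, moving.pos   #=> "\n", 1
```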

Unfortunately, the safest regular expression is also the slowest. If speed is an issue, your regexp should strive to eat as many characters as possible at once…while ensuring that it doesn’t eat characters that would signify the start of a new tag.

For example, setting a regexp like:

@text_match = /\w+|./m

allows the scanner to match a whole word at a time. However, if you have a tag factory set to look for “Hvv2vvO” to indicate a subscripted ‘2’, the entire string would be eaten as text and the subscript tag would never start.
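The difference can be reproduced with StringScanner alone. In the sketch below, subscript_open is a hypothetical opening pattern for the subscript tag (the source does not give one), and chunks is an illustrative helper that tries the tag pattern first and falls back to the text pattern. It is a simplification, not TagTreeScanner's own loop.

```ruby
require 'strscan'

# Hypothetical opening pattern: "vv" followed by a digit
subscript_open = /vv(?=\d)/
greedy_text    = /\w+|./m
safe_text      = /./m

# Try the tag pattern at each position; otherwise consume text.
def chunks( input, tag_match, text_match )
  scanner = StringScanner.new( input )
  found   = []
  until scanner.eos?
    if scanner.scan( tag_match )
      found << [ :tag, scanner.matched ]
    else
      found << [ :text, scanner.scan( text_match ) ]
    end
  end
  found
end

greedy = chunks( "Hvv2vvO", subscript_open, greedy_text )
safe   = chunks( "Hvv2vvO", subscript_open, safe_text )

p greedy  #=> [[:text, "Hvv2vvO"]]
p safe    #=> [[:text, "H"], [:tag, "vv"], [:text, "2"], [:text, "v"], [:text, "v"], [:text, "O"]]
```

With the word-at-a-time pattern the scanner never stands at the “vv” when it checks for tags, so the subscript opener is never seen. With the one-character pattern it is found on the second step.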

Using the Scanner

As shown in the example above, consumers of your class create an instance by passing in the string to be parsed, and then call #to_xml or #to_html on it.

(This two-step process allows the consumer to run other code after the tag parsing, before final conversion. Examples might include replacing special command tags with other input, or performing database lookups on special wiki-page-link tags and replacing with HTML anchors.)

Requirements

TagTreeScanner is built on top of the StringScanner library that is part of the standard Ruby installation.
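For readers unfamiliar with StringScanner, here is a short sketch of the standard-library calls that correspond to ideas used in the example above: numbered captures such as scanner[1], and a beginning-of-line test of the kind that :open_requires_bol implies. How TagTreeScanner uses these calls internally is an assumption here; the sketch shows only StringScanner itself.

```ruby
require 'strscan'

scanner = StringScanner.new( "  def foo\nbar" )

at_line_start = scanner.bol?                  # pointer is at position 0
matched       = scanner.scan( /(\s+)(\w+)/ )  # anchored at the pointer
indent        = scanner[1]                    # first capture group
word          = scanner[2]                    # second capture group
mid_line      = scanner.bol?                  # pointer now sits after "def"

p at_line_start  #=> true
p matched        #=> "  def"
p indent         #=> "  "
p word           #=> "def"
p mid_line       #=> false
```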

License

(The MIT License)

Copyright © 2005-2007 Gavin Kistner

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ‘Software’), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED ‘AS IS’, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.