= TagTreeScanner
Author:: Gavin Kistner (mailto:[email protected])
Copyright:: Copyright (c)2005-2007 Gavin Kistner
License:: MIT License
Version:: 0.8.1 (2007-November-25)

== Overview
The TagTreeScanner class provides a generic framework for creating a
nested hierarchy of tags and text (like XML or HTML) by parsing text. An
example use (and the reason it was written) is to convert a wiki markup
syntax into HTML.

== Example Usage
  require 'tagtreescanner'

  class SimpleMarkup < TagTreeScanner
    @root_factory.allows_text = false

    @tag_genres[ :root ] = [ ]

    @tag_genres[ :root ] << TagFactory.new( :paragraph,
      # A line that doesn't have whitespace at the start
      :open_match => /(?=\S)/, :open_requires_bol => true,

      # Close when you see a double return
      :close_match => /\n[ \t]*\n/,
      :allows_text => true,
      :allowed_genre => :inline
    )

    @tag_genres[ :root ] << TagFactory.new( :preformatted,
      # Grab all lines that are indented up until a line that isn't
      :open_match => /((\s+).+?)\n+(?=\S)/m, :open_requires_bol => true,
      :setup => lambda{ |tag, scanner, tagtree|
        # Throw the contents I found into the tag
        # but remove leading whitespace
        tag << scanner[1].gsub( /^#{scanner[2]}/, '' )
      },
      :autoclose => true
    )

    @tag_genres[ :inline ] = [ ]

    @tag_genres[ :inline ] << TagFactory.new( :bold,
      # An asterisk followed by a letter or number
      :open_match => /\*(?=[a-z0-9])/i,

      # Close when I see an asterisk OR a newline coming up
      :close_match => /\*|(?=\n)/,
      :allows_text => true,
      :allowed_genre => :inline
    )

    @tag_genres[ :inline ] << TagFactory.new( :italic,
      # An underscore followed by a letter or number
      :open_match => /_(?=[a-z0-9])/i,

      # Close when I see an underscore OR a newline coming up
      :close_match => /_|(?=\n)/,
      :allows_text => true,
      :allowed_genre => :inline
    )
  end

  raw_text = <<ENDINPUT
  Hello World! You're _soaking in_ my test.
  This is a *subset* of markup that I allow.

  Hi paragraph two. Yo! A code sample:

    def foo
      puts "Whee!"
    end

  _That, as they say, is that._

  ENDINPUT

  markup = SimpleMarkup.new( raw_text ).to_xml
  puts markup

  #=> <paragraph>Hello World! You're <italic>soaking in</italic> my test.
  #=> This is a <bold>subset</bold> of markup that I allow.</paragraph>
  #=> <paragraph>Hi paragraph two. Yo! A code sample:</paragraph>
  #=> <preformatted>def foo
  #=>   puts "Whee!"
  #=> end</preformatted>
  #=> <paragraph><italic>That, as they say, is that.</italic></paragraph>
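
To see what the +preformatted+ factory's +open_match+ captures, the regexp can be tried directly against StringScanner, outside of TagTreeScanner. The following is only a sketch: +extract_preformatted+ is a hypothetical helper (not part of the library), and the <tt>bol?</tt> check is an assumed stand-in for what +open_requires_bol+ does.

```ruby
require 'strscan'

# Hypothetical helper, not part of TagTreeScanner: applies the :preformatted
# open_match regexp and the indent-stripping step from its :setup lambda.
def extract_preformatted(text)
  scanner = StringScanner.new(text)
  return nil unless scanner.bol?  # assumed stand-in for :open_requires_bol
  return nil unless scanner.scan(/((\s+).+?)\n+(?=\S)/m)

  # scanner[1] is the whole indented block (without trailing newlines);
  # scanner[2] is the leading whitespace of its first line.
  scanner[1].gsub(/^#{scanner[2]}/, '')
end

sample = "  def foo\n    puts \"Whee!\"\n  end\n\n_That, as they say, is that._\n"
puts extract_preformatted(sample)
```

Only the first line's indentation is removed from each line, so any deeper indentation inside the block is preserved relative to it.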

== Details

=== TagFactories at 10,000 feet
Each possible output tag is described by a TagFactory, which specifies
some or all of the following:
* The name of the tags it creates <i>(required)</i>
* The regular expression to look for to start the tag
* The regular expression to look for to close the tag, or
* Whether the tag is automatically closed after creation
* What genre of tags are allowed within the tag
* Whether the tag supports raw text inside it
* Code to run when creating a tag

See the TagFactory class for more information on specifying factories.

=== Genres as a State Machine
As a new tag is opened, the scanner uses the Tag#allowed_genre property
of that tag (set by the +allowed_genre+ property on the TagFactory) to
determine which tags to look for. A genre is specified by adding
an array to the <tt>@tag_genres</tt> hash, whose key is the genre name.
For example:
  @tag_genres[ :inline ] = [ ]
adds a new genre named 'inline', with no tags in it. TagFactory instances
should be pushed onto this array <b>in the order that they should be looked
for</b>. For example:
  @tag_genres[ :inline ] << TagFactory.new( :italic,
    # see TagFactory#initialize for options
  )

Note that the +close_match+ regular expression of the current tag is
always checked before looking to open/create any new tags.
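
The lookup order described above can be illustrated with a toy model built only on StringScanner. This is a simplified sketch, not the library's implementation: +ToyFactory+, +TOY_GENRES+ and +toy_events+ are hypothetical names, and the model ignores options such as +autoclose+ and +allows_text+ and does not close tags left open at the end of the input.

```ruby
require 'strscan'

# Hypothetical stand-in for a tag factory: a name, open/close regexps,
# and the genre allowed inside the tags it creates.
ToyFactory = Struct.new(:name, :open_match, :close_match, :allowed_genre)

# Genre name => factories, in the order they should be tried.
TOY_GENRES = {
  inline: [
    ToyFactory.new(:bold,   /\*(?=[a-z0-9])/i, /\*|(?=\n)/, :inline),
    ToyFactory.new(:italic, /_(?=[a-z0-9])/i,  /_|(?=\n)/,  :inline)
  ]
}

def toy_events(text, root_genre = :inline)
  scanner = StringScanner.new(text)
  stack   = []  # factories of the currently open tags
  events  = []
  until scanner.eos?
    current = stack.last
    # 1. The current tag's close_match is always tried first.
    if current && scanner.scan(current.close_match)
      events << [:close, stack.pop.name]
      next
    end
    # 2. Otherwise try to open a tag from the genre the current tag allows.
    genre   = current ? current.allowed_genre : root_genre
    factory = TOY_GENRES.fetch(genre, []).find { |f| scanner.scan(f.open_match) }
    if factory
      stack << factory
      events << [:open, factory.name]
    else
      # 3. Nothing opened or closed: consume one character as text.
      events << [:text, scanner.scan(/./m)]
    end
  end
  events
end

p toy_events("*b*c")
# => [[:open, :bold], [:text, "b"], [:close, :bold], [:text, "c"]]
```

In <tt>"*b*c"</tt> the second asterisk is followed by a letter, so it would also satisfy the bold +open_match+. Because the close check runs first, it closes the open bold tag instead of nesting a new one.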

=== Consuming Text
As the text is being parsed, there will (probably) be many cases where
you have raw text that doesn't close or open any new tags. Whenever the
scanner reaches this state, it runs the <tt>@text_match</tt> regexp
against the text to move the pointer ahead. If the current tag has
<tt>Tag#allows_text?</tt> set to +true+ (through
<tt>TagFactory#allows_text</tt>), then this text is added as contents of
the tag. If not, the text is thrown away.

The safest regular expression consumes only one character at a time:
  @text_match = /./m

<b><i>It is vital that your regexp match newlines</i></b> (the 'm')
<b><i>unless every single one of your tags is set to close upon seeing
a newline.</i></b>
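
The effect of the 'm' flag can be checked with StringScanner alone. In this small sketch, +consumed_length+ is a hypothetical helper (not part of TagTreeScanner) that keeps applying a text regexp until it fails and reports where the scan pointer stopped.

```ruby
require 'strscan'

# Hypothetical helper: apply text_match repeatedly and return the position
# at which it first fails to match (or the end of the string).
def consumed_length(text, text_match)
  scanner = StringScanner.new(text)
  loop do
    break unless scanner.scan(text_match)
  end
  scanner.pos
end

p consumed_length("one\ntwo", /./)   # => 3 (stuck in front of the newline)
p consumed_length("one\ntwo", /./m)  # => 7 (reaches the end of the input)
```

Without the 'm' flag, the regexp cannot move past the newline, so a scanner with no tag waiting to close there would be unable to advance.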

Unfortunately, the safest regular expression is also the slowest. If
speed is an issue, your regexp should strive to eat as many characters as
possible at once...while ensuring that it doesn't eat characters that
would signify the start of a new tag.

For example, setting a regexp like:
  @text_match = /\w+|./m
allows the scanner to match a whole word at a time. However, if you have
a tag factory set to recognize markup like "Hvv2vvO" (indicating a
subscripted '2'), the entire string would be eaten as text and the
subscript tag would never start.
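
The trade-off can be seen by splitting the same input with both regexps. The sketch below uses StringScanner directly. +text_chunks+ is a hypothetical helper, not part of TagTreeScanner, and it ignores tag matching entirely.

```ruby
require 'strscan'

# Hypothetical helper: list the chunks a text regexp would consume,
# one scan at a time.
def text_chunks(text, text_match)
  scanner = StringScanner.new(text)
  chunks  = []
  while (chunk = scanner.scan(text_match))
    chunks << chunk
  end
  chunks
end

p text_chunks("Hvv2vvO", /\w+|./m)
# => ["Hvv2vvO"]
p text_chunks("Hvv2vvO", /./m)
# => ["H", "v", "v", "2", "v", "v", "O"]
```

With the word-at-a-time regexp there is a single chunk, so the scanner never stops at a position inside the word where a tag could open. The one-character regexp stops seven times and gives every position a chance.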

=== Using the Scanner
As shown in the example above, consumers of your class initialize it by
passing in the string to be parsed, and then calling #to_xml or #to_html
on it.

<i>(This two-step process allows the consumer to run other code after
the tag parsing, before final conversion. Examples might include
replacing special command tags with other input, or performing database
lookups on special wiki-page-link tags and replacing with HTML
anchors.)</i>

== Requirements
TagTreeScanner is built on top of the StringScanner library that is part
of the standard Ruby installation.

== License

(The MIT License)

Copyright (c) 2005-2007 Gavin Kistner

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
'Software'), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.