Class: TagTreeScanner

Inherits:
Object
Defined in:
lib/tagtreescanner.rb

Overview

The TagTreeScanner class provides a generic framework for creating a nested hierarchy of tags and text (like XML or HTML) by parsing text. An example use (and the reason it was written) is to convert a wiki markup syntax into HTML.

Example Usage

require 'tagtreescanner'

class SimpleMarkup < TagTreeScanner
   @root_factory.allows_text = false

   @tag_genres[ :root ] = [ ]

   @tag_genres[ :root ] << TagFactory.new( :paragraph,
      # A line that doesn't have whitespace at the start
      :open_match => /(?=\S)/, :open_requires_bol => true,

      # Close when you see a double return
      :close_match => /\n[ \t]*\n/,
      :allows_text => true,
      :allowed_genre => :inline
   )

   @tag_genres[ :root ] << TagFactory.new( :preformatted,
      # Grab all lines that are indented up until a line that isn't
      :open_match => /((\s+).+?)\n+(?=\S)/m, :open_requires_bol => true,
      :setup => lambda{ |tag, scanner, tagtree|
         # Throw the contents I found into the tag
         # but remove leading whitespace
         tag << scanner[1].gsub( /^#{scanner[2]}/, '' )
      },
      :autoclose => true
   )

   @tag_genres[ :inline ] = [ ]

   @tag_genres[ :inline ] << TagFactory.new( :bold,
      # An asterisk followed by a letter or number
      :open_match => /\*(?=[a-z0-9])/i,

      # Close when I see an asterisk OR a newline coming up
      :close_match => /\*|(?=\n)/,
      :allows_text => true,
      :allowed_genre => :inline
   )

   @tag_genres[ :inline ] << TagFactory.new( :italic,
      # An underscore followed by a letter or number
      :open_match => /_(?=[a-z0-9])/i,

      # Close when I see an underscore OR a newline coming up
      :close_match => /_|(?=\n)/,
      :allows_text => true,
      :allowed_genre => :inline
   )
end

raw_text = <<ENDINPUT
Hello World! You're _soaking in_ my test.
This is a *subset* of markup that I allow.

Hi paragraph two. Yo! A code sample:

  def foo
    puts "Whee!"
  end

_That, as they say, is that._

ENDINPUT

markup = SimpleMarkup.new( raw_text ).to_xml
puts markup

#=> <paragraph>Hello World! You're <italic>soaking in</italic> my test.
#=> This is a <bold>subset</bold> of markup that I allow.</paragraph>
#=> <paragraph>Hi paragraph two. Yo! A code sample:</paragraph>
#=> <preformatted>def foo
#=>   puts "Whee!"
#=> end</preformatted>
#=> <paragraph><italic>That, as they say, is that.</italic></paragraph>

Details

TagFactories at 10,000 feet

Each possible output tag is described by a TagFactory, which specifies some or all of the following:

  • The name of the tags it creates (required)

  • The regular expression to look for to start the tag

  • The regular expression to look for to close the tag, or

  • Whether the tag is automatically closed after creation

  • What genre of tags are allowed within the tag

  • Whether the tag supports raw text inside it

  • Code to run when creating a tag

See the TagFactory class for more information on specifying factories.

Genres as a State Machine

As a new tag is opened, the scanner uses the Tag#allowed_genre property of that tag (set by the allowed_genre property on the TagFactory) to determine which tags to look for. A genre is specified by adding an array to the @tag_genres hash, keyed by the genre name. For example:

@tag_genres[ :inline ] = [ ]

adds a new genre named ‘inline’, with no tags in it. TagFactory instances should be pushed onto this array in the order that they should be looked for. For example:

@tag_genres[ :inline ] << TagFactory.new( :italic,
  # see the TagFactory#initialize for options
)

Note that the close_match regular expression of the current tag is always checked before looking to open/create any new tags.
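
The ordering described above (close the current tag first, then try the factories of the allowed genre in array order) can be sketched without the library. This is only an illustration: the hashes below are hypothetical stand-ins for TagFactory instances, and the trace method is not part of TagTreeScanner.

```ruby
require 'strscan'

# Hypothetical, simplified stand-ins for TagFactory instances: plain hashes
# holding a name, an open pattern, a close pattern and the allowed genre.
genres = {
  :inline => [
    { :name => :bold,   :open => /\*(?=[a-z0-9])/i, :close => /\*|(?=\n)/, :genre => :inline },
    { :name => :italic, :open => /_(?=[a-z0-9])/i,  :close => /_|(?=\n)/,  :genre => :inline }
  ]
}

# Returns a list of events such as [:open, :bold] or [:text, "b"].
def trace( text, genres, start_genre )
  ss     = StringScanner.new( text )
  stack  = []   # currently open rules, innermost last
  events = []
  until ss.eos?
    current = stack.last
    # 1. The close pattern of the current tag is tried first.
    if current && ss.scan( current[ :close ] )
      events << [ :close, stack.pop[ :name ] ]
      next
    end
    # 2. Then the rules of the allowed genre, in array order.
    genre = current ? current[ :genre ] : start_genre
    rule  = ( genres[ genre ] || [] ).find{ |r| ss.scan( r[ :open ] ) }
    if rule
      stack  << rule
      events << [ :open, rule[ :name ] ]
      next
    end
    # 3. Nothing opened or closed: consume one character as text.
    events << [ :text, ss.getch ]
  end
  events
end

events = trace( "*b*c", genres, :inline )
p events
```

In the input "*b*c", the second asterisk is followed by a letter, so it would also satisfy the bold open pattern. Because the close pattern is checked first, it closes the open bold tag instead of starting a nested one.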

Consuming Text

As the text is being parsed, there will (probably) be many cases where you have raw text that doesn’t close or open any new tags. Whenever the scanner reaches this state, it runs the @text_match regexp against the text to move the pointer ahead. If the current tag has Tag#allows_text? set to true (through TagFactory#allows_text), then this text is added as contents of the tag. If not, the text is thrown away.

The safest regular expression consumes only one character at a time:

@text_match = /./m

It is vital that your regexp match newlines (the ‘m’) unless every single one of your tags is set to close upon seeing a newline.
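
The effect of the 'm' flag can be checked with StringScanner from Ruby's standard library, which the constructor also uses. A minimal sketch, where the consume_all helper and the sample string are made up for illustration:

```ruby
require 'strscan'

# Returns the chunks a text-matching regexp consumes, stopping as soon
# as the regexp fails to match at the current position.
def consume_all( text, text_match )
  ss     = StringScanner.new( text )
  chunks = []
  until ss.eos?
    chunk = ss.scan( text_match )
    break unless chunk   # no match here: the scanner cannot advance
    chunks << chunk
  end
  chunks
end

without_m = consume_all( "ab\ncd", /./ )
with_m    = consume_all( "ab\ncd", /./m )

p without_m   #=> ["a", "b"]
p with_m      #=> ["a", "b", "\n", "c", "d"]
```

Without the flag, the scan returns nil at the newline and nothing moves the pointer forward. The helper stops at that point; a loop that only tests for end-of-string would not.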

Unfortunately, the safest regular expression is also the slowest. If speed is an issue, your regexp should strive to eat as many characters as possible at once…while ensuring that it doesn’t eat characters that would signify the start of a new tag.

For example, setting a regexp like:

@text_match = /\w+|./m

allows the scanner to match a whole word at a time. However, if you have a tag factory set to look for “Hvv2vvO” to indicate a subscripted ‘2’, the entire string would be eaten as text and the subscript tag would never start.
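
The problem can be reproduced with StringScanner alone. In this sketch the subscript open pattern /vv(?=\d)/ and the tag_opens? helper are assumptions made for illustration; they are not taken from the library.

```ruby
require 'strscan'

# Walks the text the way a scanner might: try to open the tag first,
# otherwise consume text with text_match. Returns true if the tag opened.
def tag_opens?( text, open_match, text_match )
  ss = StringScanner.new( text )
  until ss.eos?
    return true if ss.scan( open_match )
    break unless ss.scan( text_match )
  end
  false
end

subscript_open = /vv(?=\d)/   # hypothetical open pattern for "vv2vv"

one_char   = tag_opens?( "Hvv2vvO", subscript_open, /./m )
whole_word = tag_opens?( "Hvv2vvO", subscript_open, /\w+|./m )

p one_char     #=> true
p whole_word   #=> false
```

With /\w+|./m, the first text match swallows all of "Hvv2vvO" because every character is a word character, so the scanner never stands at the "vv" that would open the tag.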

Using the Scanner

As shown in the example above, consumers of your class initialize it by passing in the string to be parsed, and then calling #to_xml or #to_html on it.

(This two-step process allows the consumer to run other code after the tag parsing, before final conversion. Examples might include replacing special command tags with other input, or performing database lookups on special wiki-page-link tags and replacing with HTML anchors.)

Defined Under Namespace

Classes: Tag, TagFactory, TextNode

Constant Summary

VERSION = "0.8.0"

Constructor Details

#initialize(string_to_parse) ⇒ TagTreeScanner

Scans through string_to_parse and builds a tree of tags based on the regular expressions and rules set by the TagFactory instances present in @tag_genres.

After parsing the tree, call #to_xml or #to_html to retrieve a string representation.



# File 'lib/tagtreescanner.rb', line 752

def initialize( string_to_parse )
  current = @root = self.class.root_factory.create
  tag_genres = self.class.tag_genres
  text_match = self.class.text_match

  ss = StringScanner.new( string_to_parse )
  while !ss.eos?
    # Keep popping off the current tag until we get to the root,
    # as long as the end criteria is met
    while ( current != @root ) && (!current.close_requires_bol? || ss.bol?) && ss.scan( current.close_match )
      current = current.parent_tag || @root
    end

    # No point in continuing if closing out tags consumed the rest of the string
    break if ss.eos?

    # Look for a tag to open
    if factories = tag_genres[ current.allowed_genre ]
      tag = nil
      factories.each{ |factory|
        if tag = factory.match( ss, self )
          current.append_child( tag )
          current = tag unless tag.autoclose?
          break
        end
      }
      #start at the top of the loop if we found one
      next if tag
    end

    # Couldn't find a valid tag at this spot
    # so we need to eat some characters
    consumed = ss.scan( text_match )
    current << consumed if current.allows_text?
  end
end

Class Attribute Details

.root_factory ⇒ Object

Returns the value of attribute root_factory.



# File 'lib/tagtreescanner.rb', line 715

def root_factory
  @root_factory
end

.tag_genres ⇒ Object

Returns the value of attribute tag_genres.



# File 'lib/tagtreescanner.rb', line 715

def tag_genres
  @tag_genres
end

.text_match ⇒ Object

Returns the value of attribute text_match.



# File 'lib/tagtreescanner.rb', line 715

def text_match
  @text_match
end

Class Method Details

.inherited(child_class) ⇒ Object

When a class inherits from TagTreeScanner, defaults are set for @tag_genres, @root_factory, and @text_match.



# File 'lib/tagtreescanner.rb', line 825

def self.inherited( child_class ) #:nodoc:
  child_class.tag_genres = @tag_genres
  child_class.root_factory = @root_factory
  child_class.text_match = @text_match
end
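
The pattern used here, class-level instance variables handed to each subclass through the inherited hook, can be tried in isolation. BaseScanner and WikiScanner below are hypothetical classes written for this sketch, not part of the library.

```ruby
# Hypothetical base class: settings live in class-level instance
# variables and are handed to each subclass when it is defined.
class BaseScanner
  class << self
    attr_accessor :tag_genres, :text_match
  end

  @tag_genres = {}
  @text_match = /./m

  def self.inherited( child_class )
    super
    child_class.tag_genres = @tag_genres
    child_class.text_match = @text_match
  end
end

class WikiScanner < BaseScanner
  # Inside the class body, these are the subclass's own variables,
  # already filled in by the inherited hook.
  @text_match = /\w+|./m
  @tag_genres[ :inline ] = []
end

p WikiScanner.text_match   #=> /\w+|./m
p BaseScanner.text_match   #=> /./m
p WikiScanner.tag_genres.equal?( BaseScanner.tag_genres )   #=> true
```

Reassigning @text_match in the subclass leaves the base class untouched. In this sketch the hash is assigned rather than copied, so the base class and the subclass refer to the same hash object, and a key added through one is visible through the other.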

Instance Method Details

#inspect ⇒ Object

Returns a hierarchical representation of the entire tag tree



# File 'lib/tagtreescanner.rb', line 818

def inspect #:nodoc:
  @root.to_hier
end

#tags ⇒ Object

Returns an array of all root-level tags found



# File 'lib/tagtreescanner.rb', line 807

def tags
  @root.child_tags
end

#tags_by_name(name) ⇒ Object

Returns an array of all tags in the tree whose Tag#name matches the supplied name.



# File 'lib/tagtreescanner.rb', line 813

def tags_by_name( name )
  @root.tags_by_type( name )
end

#to_html ⇒ Object

Returns an HTML representation of the tag tree.

This is the same as the #to_xml method except that empty tags use an explicit close tag, e.g. <div></div> versus <div />



# File 'lib/tagtreescanner.rb', line 793

def to_html
  @root.child_tags.inject(''){ |out, tag| out << tag.to_html }
end

#to_xml ⇒ Object

Returns an XML representation of the tag tree.

This method is the same as the #to_html method except that empty tags do not use an explicit close tag, e.g. <div /> versus <div></div>



# File 'lib/tagtreescanner.rb', line 802

def to_xml
  @root.child_tags.inject(''){ |out, tag| out << tag.to_xml }
end
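
The difference between the two output styles comes down to how a tag with no contents is written. A rough sketch of that rule follows, using a hypothetical hash-based node and a serialize helper rather than the real Tag class, and ignoring attributes and escaping:

```ruby
# Serializes a hypothetical node: either a String (text) or a Hash of
# the form { :name => ..., :children => [...] }.
def serialize( node, explicit_close )
  return node if node.is_a?( String )
  # Accumulate the children into one string, as the inject calls above do.
  inner = node[ :children ].inject( +'' ){ |out, child|
    out << serialize( child, explicit_close )
  }
  if inner.empty? && !explicit_close
    "<#{node[ :name ]} />"
  else
    "<#{node[ :name ]}>#{inner}</#{node[ :name ]}>"
  end
end

tree = { :name => :div, :children => [
  { :name => :span, :children => [] },
  'text'
] }

xml_style  = serialize( tree, false )
html_style = serialize( tree, true )

puts xml_style    #=> <div><span />text</div>
puts html_style   #=> <div><span></span>text</div>
```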