== Welcome to Tartan
Tartan is a general-purpose text parsing engine whose main target is wiki text
parsing (see c2.com[http://c2.com/cgi/wiki?WikiWikiWeb] and
Wikipedia[http://en.wikipedia.org/wiki/Wiki]). It doesn't implement one specific
mark-up; instead, it provides a way to specify a variety of mark-ups. That makes
Tartan a bit more "involved" than a purpose-built parser like
RedCloth[http://whytheluckystiff.net/ruby/redcloth/] or
BlueCloth[http://www.deveiate.org/projects/BlueCloth], but it provides the
following benefits:
1. separates the specific wiki syntax specification from the
implementation
2. allows layering and extension of parsing rules
3. allows multiple output formats from the same syntax specification
The current implementation of Tartan is in Ruby and includes a full Markdown[http://daringfireball.net/projects/markdown/]
parser (described in YAML). The format of the parsing specification has been
created with an eye to having a language independent definition of wiki (and
possibly other) mark-ups. That's a lofty goal, and Tartan hasn't quite gotten
there yet, but we think there's a clear path. In any case, even if it is only
available in Ruby it will hopefully be helpful for projects needing to do
something more than just convert wiki text directly into HTML.
== Usage
If all you want to do is generate HTML from Markdown[http://daringfireball.net/projects/markdown/] text, here's
how you do it:
  # require 'rubygems' # if you are pulling Tartan in as a gem
  require 'tartan_markdown'
  html = Tartan::Markdown::Parser.new("* howdy\n* doody").to_html
  # => "<ul>\n<li>howdy</li>\n<li>doody</li>\n</ul>"
Other parsers have similar names and the same usage. In
particular, you need to require the parser class file, create a
new instance of the parser and call <tt>to_html</tt> on that instance.
A parser can also have other output methods, say <tt>to_xml</tt>, which would be
called in the same way on the instance of the parser object.
=== Layering Parsers
You can add parsing syntax to existing parsers. This is done by building up a set of parser specifications that work together.
In the Tartan distribution you have a specification for Markdown[http://daringfireball.net/projects/markdown/] and you also
have a specification for table mark-up. You can combine them by creating a new
class that layers the tables onto the Markdown[http://daringfireball.net/projects/markdown/] definition as follows in a file
called <tt>tartan_markdown_tables.rb</tt>:
  require 'tartan/markdown/rules'
  require 'tartan/table/rules'
  module Tartan
    module MarkdownTables
      class Parser < Tartan::Parser
        include TartanMarkdownDef
        include TartanTableDef
      end
    end
  end
In another file you could use this new parser:
  require 'tartan_markdown_tables'
  html = Tartan::MarkdownTables::Parser.new("[|*happy*||**days**|]").to_html
  # => "<table class=\"\">
  #     <tr><td><em>happy</em></td><td><strong>days</strong></td></tr>
  #     </table>"
== The Parsing Specification
Each specific parser (Markdown[http://daringfireball.net/projects/markdown/] to HTML, Textile to HTML, your wiki to XML, etc.) needs a parsing specification to tell Tartan how to convert the text into HTML (or whatever other format you need).
=== Overall Structure
Each parser is made up of a parsing definition and optional helper methods. The specification is defined in YAML and the helper methods are defined in a parser definition class.
The parsing definition in YAML has the following general structure:
  block:
    - <parsing rule>
    - <parsing rule>
  <parsing context>:
    - <parsing rules>
So the parsing rules are defined as a set of contexts, and each context is a
list of parsing rules. The base context defaults to <tt>block</tt>; that is, the parser starts with the <tt>block</tt> context, which may point the parser off to other contexts to parse blocks of the parsed text. More on this after the explanation of the parsing rules.
==== Parsing Rules
The following is a simple parsing rule to match paragraphs and mark them up in HTML:
  title: paragraph
  match: /(^[^\n]+$\n)*^[^\n]+$/m
  html:
    start_mark: <p>
    end_mark: </p>
A paragraph, in this case, is any grouping of non-blank lines.
The parser repeatedly applies the <tt>match</tt> regular expression and, whenever it matches, the <tt>html</tt> output sub-rule puts the <tt>start_mark</tt>, <tt><p></tt>, and the <tt>end_mark</tt>, <tt></p></tt>, around the text that is matched as a paragraph.
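The behaviour of the <tt>match</tt> expression can be tried out in plain Ruby. The following is only a sketch that mimics the rule with <tt>String#gsub</tt>; it does not use Tartan, and the sample text is made up for illustration.

```ruby
# Plain-Ruby illustration of the paragraph rule; Tartan itself is not involved.
PARAGRAPH = /(^[^\n]+$\n)*^[^\n]+$/m

# Two groups of non-blank lines separated by a blank line (sample input).
text = "First line\nsecond line\n\nAnother paragraph"

# Wrap every run of non-blank lines in the start and end marks.
html = text.gsub(PARAGRAPH) { |para| "<p>#{para}</p>" }

puts html
# <p>First line
# second line</p>
#
# <p>Another paragraph</p>
```

The blank line between the two groups is not part of either match, so it is passed through unchanged.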
If we also wanted to mark off blocks of code that are indented by, say, two or more spaces at the beginning of the line, we could use the following rule:
  title: code
  match: /(^[ ]{2,}\S.+?$\n)+^[ ]{2,}\S.+?$/m
  html:
    start_mark: <pre><code>
    end_mark: </code></pre>
Once we add the <tt>code</tt> rule, the ordering becomes important. If we put the <tt>paragraph</tt> rule first, it will gobble up both the paragraphs and the code blocks, since it's just looking for groups of non-blank lines. To prevent this we need to put the <tt>code</tt> rule first. So the overall definition would be:
  block:
    - title: code
      match: /(^[ ]{2,}\S.+?$\n)+^[ ]{2,}\S.+?$/m
      html:
        start_mark: <pre><code>
        end_mark: </code></pre>
    - title: paragraph
      match: /(^[^\n]+$\n)*^[^\n]+$/m
      html:
        start_mark: <p>
        end_mark: </p>
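To see why the order matters, here is a minimal sketch in plain Ruby. It assumes a simplified model in which rules are applied in order and text matched by one rule is "claimed" so that later rules never see it. The <tt>Rule</tt> struct and <tt>apply_rules</tt> method are hypothetical names for this illustration, not Tartan's API, and Tartan's real engine may work differently.

```ruby
# Hypothetical stand-in for a parsing rule; not part of Tartan.
Rule = Struct.new(:title, :pattern, :start_mark, :end_mark)

CODE = Rule.new("code", /(^[ ]{2,}\S.+?$\n)+^[ ]{2,}\S.+?$/m,
                "<pre><code>", "</code></pre>")
PARAGRAPH = Rule.new("paragraph", /(^[^\n]+$\n)*^[^\n]+$/m, "<p>", "</p>")

# Apply the rules in order. Text matched by a rule is marked as claimed
# and is not offered to the rules that come after it.
def apply_rules(text, rules)
  segments = [[text, false]] # pairs of [string, claimed?]
  rules.each do |rule|
    segments = segments.flat_map do |str, claimed|
      next [[str, claimed]] if claimed
      pieces = []
      rest = str
      while (m = rule.pattern.match(rest))
        pieces << [m.pre_match, false] unless m.pre_match.empty?
        pieces << ["#{rule.start_mark}#{m[0]}#{rule.end_mark}", true]
        rest = m.post_match
      end
      pieces << [rest, false] unless rest.empty?
      pieces
    end
  end
  segments.map(&:first).join
end

text = "Some prose\nmore prose\n\n  x = 1\n  y = 2"

code_first = apply_rules(text, [CODE, PARAGRAPH])
paragraph_first = apply_rules(text, [PARAGRAPH, CODE])

puts code_first
puts "---"
puts paragraph_first
```

With the <tt>code</tt> rule first, the indented lines end up inside <tt><pre><code></tt>. With the <tt>paragraph</tt> rule first, they are wrapped in <tt><p></tt> and the <tt>code</tt> rule never gets a chance to match.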
Now, let's say we want to be able to mark up text with emphasis (HTML <tt><em></tt>) and strong emphasis (HTML <tt><strong></tt>) in paragraph text, but not in code. We'll use an asterisk (*) around text we want to have emphasis and a double asterisk (**) around text we want to have strong emphasis.
To do this, we set up a new parsing context for paragraph body text and "point" the parser to the context when it recognizes a paragraph.
First, we create the paragraph parsing context:
  paragraph:
    - title: strong
      match: /\*\*(.*?)\*\*/
      html:
        replace: <strong>\1</strong>
    - rescan
    - title: emphasis
      match: /\*(.*?)\*/
      html:
        replace: <em>\1</em>
The <tt>rescan</tt> directive between the <tt>strong</tt> and <tt>emphasis</tt> rules tells the parser to "start over". This is needed because otherwise the <tt>strong</tt> rule would "claim" all the text it matched and the <tt>emphasis</tt> rule wouldn't have a chance to parse any of it. This would come into play if we had a paragraph such as:
  Now listen to this **I want *you* to really hear me**.
This should get marked up as:
  <p>Now listen to this <strong>I want <em>you</em> to really hear me</strong>.</p>
but we would get the following without the rescan:
  <p>Now listen to this <strong>I want *you* to really hear me</strong>.</p>
You might also note that the ordering here, again, is important. If we put the <tt>emphasis</tt> rule before the <tt>strong</tt> rule, we would get the following output instead:
  <p>Now listen to this <em></em>I want <em>you</em> to really hear me<em></em>.</p>
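The effect of rule order on these two expressions can be checked with plain Ruby. This sketch simply chains <tt>String#gsub</tt> calls, so each substitution re-reads the whole string (loosely comparable to having a <tt>rescan</tt> between the rules). The <tt>apply_in_order</tt> helper is a made-up name for this illustration and is not part of Tartan.

```ruby
# The two inline rules as [pattern, replacement] pairs (illustration only).
STRONG   = [/\*\*(.*?)\*\*/, '<strong>\1</strong>']
EMPHASIS = [/\*(.*?)\*/, '<em>\1</em>']

# Apply each substitution in turn to the result of the previous one.
def apply_in_order(text, rules)
  rules.reduce(text) { |out, (pattern, replacement)| out.gsub(pattern, replacement) }
end

text = "Now listen to this **I want *you* to really hear me**."

strong_first = apply_in_order(text, [STRONG, EMPHASIS])
emphasis_first = apply_in_order(text, [EMPHASIS, STRONG])

puts strong_first
# Now listen to this <strong>I want <em>you</em> to really hear me</strong>.
puts emphasis_first
# Now listen to this <em></em>I want <em>you</em> to really hear me<em></em>.
```

When the emphasis pattern runs first, each <tt>**</tt> is read as an empty emphasis span, which is why the strong pattern has nothing left to match.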
Now, we also need to modify the paragraph rule in the <tt>block</tt> context to use the new <tt>paragraph</tt> context:
  # . . .
    - title: paragraph
      match: /(^[^\n]+$\n)*^[^\n]+$/m
      subparse: paragraph
      html:
        start_mark: <p>
        end_mark: </p>
  # . . .
To do this we use the <tt>subparse</tt> directive to tell the parser that the contents of the paragraph should be parsed by the <tt>paragraph</tt> context.
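The idea of handing matched text to another context can be sketched in plain Ruby as follows. This is a simplified model and not Tartan's implementation: <tt>BLOCK_RULES</tt>, <tt>INLINE_CONTEXTS</tt>, <tt>subparse</tt> and <tt>parse_blocks</tt> are hypothetical names, and the inline context is approximated with chained <tt>gsub</tt> calls.

```ruby
# Inline contexts: context name => ordered [pattern, replacement] pairs.
INLINE_CONTEXTS = {
  "paragraph" => [
    [/\*\*(.*?)\*\*/, '<strong>\1</strong>'],
    [/\*(.*?)\*/, '<em>\1</em>']
  ]
}.freeze

# Block rules in priority order; only the paragraph rule names a subparse context.
BLOCK_RULES = [
  { title: "code", match: /(^[ ]{2,}\S.+?$\n)+^[ ]{2,}\S.+?$/m,
    start_mark: "<pre><code>", end_mark: "</code></pre>" },
  { title: "paragraph", match: /(^[^\n]+$\n)*^[^\n]+$/m, subparse: "paragraph",
    start_mark: "<p>", end_mark: "</p>" }
].freeze

# Run the rules of the named inline context over a block's contents.
def subparse(text, context)
  INLINE_CONTEXTS.fetch(context).reduce(text) do |out, (pattern, replacement)|
    out.gsub(pattern, replacement)
  end
end

# The first rule that matches claims its text; the text on either side
# of the match is then parsed again with the full rule list.
def parse_blocks(text)
  return text if text.empty?
  BLOCK_RULES.each do |rule|
    m = rule[:match].match(text)
    next unless m
    body = rule[:subparse] ? subparse(m[0], rule[:subparse]) : m[0]
    return parse_blocks(m.pre_match) + rule[:start_mark] + body +
           rule[:end_mark] + parse_blocks(m.post_match)
  end
  text
end

html = parse_blocks("Some *emphasis* here\n\n  x = a * b\n  y = c * d")
puts html
# <p>Some <em>emphasis</em> here</p>
#
# <pre><code>  x = a * b
#   y = c * d</code></pre>
```

Because the code rule has no subparse context, the asterisks inside the code block are left alone, while those in the paragraph are turned into emphasis.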
==== Creating a Mix-in
It's possible to mix in or layer a parsing specification onto a base parser. This allows you to add markup to, or change the markup of, an existing syntax. You could use this to add table mark-up to Markdown[http://daringfireball.net/projects/markdown/] (in fact, this mix-in to Markdown is available as part of the Tartan code distribution).
To show how this works, we'll look at how to specify and then add character entity markup to the parser example we've been working with. We want to turn things like "<", "&" and "->" into "&lt;", "&amp;" and "&rarr;".
We want these transformations to be done in the context of parsing paragraphs, so we'll only want to add to the <tt>paragraph</tt> context in our previous example.
So, to add this syntax parsing, you would create the following specification:
  paragraph:
    - rescan
    - title: amp
      match: /&/
      html:
        replace: '&amp;'
      rescan: true
    - title: rightArrow
      match: /->/
      html:
        replace: '&rarr;'
      rescan: true
    - title: lessThan
      match: /</
      html:
        replace: '&lt;'
      rescan: true
    - title: greaterThan
      match: />/
      html:
        replace: '&gt;'
That's it for the mix-in specification. Now we add these to the previous set. We didn't touch on file naming of specifications before, but now we need to. Let's say that we put the previous specification in a file called <tt>example-parser.yml</tt> and we put the new spec in <tt>entities.yml</tt>. To combine them, we would create a new Ruby class like this:
  class ExampleParserWithEntities < Tartan::Parser
    yaml "example-parser.yml"
    yaml "entities.yml"
  end
By default, the rules of a mix-in are added to the end of any given context. So, the effective resulting specification once the two sets of rules are combined would be:
  block:
    - title: code
      match: /(^[ ]{2,}\S.+?$\n)+^[ ]{2,}\S.+?$/m
      html:
        start_mark: <pre><code>
        end_mark: </code></pre>
    - title: paragraph
      match: /(^[^\n]+$\n)*^[^\n]+$/m
      subparse: paragraph
      html:
        start_mark: <p>
        end_mark: </p>
  paragraph:
    - title: strong
      match: /\*\*(.*?)\*\*/
      html:
        replace: <strong>\1</strong>
    - rescan
    - title: emphasis
      match: /\*(.*?)\*/
      html:
        replace: <em>\1</em>
    - rescan
    - title: amp
      match: /&/
      html:
        replace: '&amp;'
      rescan: true
    - title: rightArrow
      match: /->/
      html:
        replace: '&rarr;'
      rescan: true
    - title: lessThan
      match: /</
      html:
        replace: '&lt;'
      rescan: true
    - title: greaterThan
      match: />/
      html:
        replace: '&gt;'
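The append-to-the-end behaviour can be pictured with Ruby's standard <tt>yaml</tt> library. This is only a sketch of the idea, using abbreviated rules (titles only) and a hypothetical <tt>layer</tt> helper; it is not how Tartan itself loads or combines specifications.

```ruby
require "yaml"

# Abbreviated base specification: titles only, to keep the example short.
base = YAML.load(<<~SPEC)
  block:
    - title: code
    - title: paragraph
  paragraph:
    - title: strong
    - rescan
    - title: emphasis
SPEC

# Abbreviated mix-in that only touches the paragraph context.
mixin = YAML.load(<<~SPEC)
  paragraph:
    - rescan
    - title: amp
    - title: rightArrow
SPEC

# Append the mix-in's rules to the end of each context it names.
# Contexts that appear in only one of the two specifications are kept as they are.
def layer(base, mixin)
  base.merge(mixin) { |_context, base_rules, mixin_rules| base_rules + mixin_rules }
end

combined = layer(base, mixin)
titles = combined["paragraph"].map { |rule| rule.is_a?(Hash) ? rule["title"] : rule }

p titles
# => ["strong", "rescan", "emphasis", "rescan", "amp", "rightArrow"]
```

The <tt>block</tt> context is untouched because the mix-in does not mention it.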
==== Going Further
This brief tutorial only covers the basics of Tartan. If you want to know more, for now, the best thing is to look at the Markdown[http://daringfireball.net/projects/markdown/] and table extension specifications in the code. They show a real-world example of how to create a base parser and a mix-in.
There will be additional documentation to follow, in particular a reference guide that covers all the parser rule directives one at a time.
If you need some help in getting Tartan to work for your project, please don't hesitate to post to the Tartan help forum[http://rubyforge.org/forum/forum.php?forum_id=8042] or write me directly at mailto:[email protected].
== The Name
Tartan is intended to weave together different parsing elements. It's intended
to be an alternative to both RedCloth[http://www.redcloth.org/] and BlueCloth[http://www.deveiate.org/projects/BlueCloth]. Tartan is a kind of cloth
that weaves different colors together in an interesting pattern.