Minilex
A little lexer toolkit, for basic lexing needs.
It's designed for the cases where parsers do the parsing, and all you need from your lexer is an array of simple tokens.
Usage
Expression = Minilex::Lexer.new do
skip :whitespace, /\s+/
tok :number, /\d+(?:\.\d+)?/
tok :operator, /[\+\=\/\*]/
end
Expression.lex('1 + 2.34')
# => [[:number, '1', 1, 0],
# [:operator, '+', 1, 3],
# [:number, '2.34', 1, 5],
# [:eos]]
To create a lexer, instantiate a Minilex::Lexer and define rules. There are two methods for defining rules, skip and tok:
skip takes an id and a pattern. The lexer will ignore all occurrences of the pattern in the input text. The id isn't strictly necessary, but it's nice for readability and is a required argument.
tok also takes an id and a pattern. The lexer will turn all occurrences of the pattern into a token of the form:
[id, value, line, offset]
# id - the id you provided
# value - the matched value
# line - line number
# offset - character position in the line
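The line and offset bookkeeping can be pictured with a small, self-contained scanner. This is an illustrative sketch only, not Minilex's actual implementation: the `lex` method, the rule-triple layout, and the zero-based offsets are all assumptions here, and Minilex's own offset convention may differ.

```ruby
require 'strscan'

# Sketch: produce [id, value, line, offset] tokens while tracking
# line numbers and per-line character offsets. Not Minilex internals.
def lex(rules, input)
  scanner    = StringScanner.new(input)
  tokens     = []
  line       = 1
  line_start = 0 # string index where the current line begins

  until scanner.eos?
    matched = rules.find { |_action, _id, pattern| scanner.scan(pattern) }
    raise "unexpected input at #{scanner.pos}" unless matched

    action, id, = matched
    value  = scanner.matched
    offset = scanner.pos - value.length - line_start

    tokens << [id, value, line, offset] if action == :tok

    # advance the line bookkeeping over any newlines in the match
    value.scan(/\n/) { line += 1 }
    if (nl = value.rindex("\n"))
      line_start = scanner.pos - value.length + nl + 1
    end
  end

  tokens << [:eos]
end

rules = [
  [:skip, :whitespace, /\s+/],
  [:tok,  :number,     /\d+(?:\.\d+)?/],
  [:tok,  :operator,   /[+=\/*]/]
]

p lex(rules, "1 +\n2.34")
# => [[:number, "1", 1, 0], [:operator, "+", 1, 2],
#     [:number, "2.34", 2, 0], [:eos]]
```

Note how the number on the second line gets line 2, offset 0: the offset resets at each newline rather than counting from the start of the input.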
Overriding the token format
If you'd like to customize the token format, override append_token:
Digits = Minilex::Lexer.new do
skip :whitespace, /\s+/
tok :digit, /\d/
# id - the id of the matched rule
# value - the value that was matched
#
# You have access to the array of tokens via `tokens` and the current
# token's position information via `pos`.
def append_token(id, value)
tokens << Integer(value)
end
# By default, the lexer will append an end-of-stream token to the end of
# the tokens array. You can override what the eos token is or even suppress
# it altogether with the append_eos callback.
#
# Here we'll suppress it by doing nothing
def append_eos
end
end
Digits.lex('1 2 3 4')
# => [1, 2, 3, 4]
Processing values
There's one more thing you can do. It's just for convenience, though I'm not sure it really belongs in something that's supposed to do as little as possible. I might remove it.
The tok method accepts a third optional processor argument, which should name a method on the lexer (you'll have to write the method, of course). What this will do is give you a chance to get at the matched text before it gets stuffed into a token:
DigitsConverter = Minilex::Lexer.new do
skip :whitespace, /\s+/
tok :digit, /\d/, :integer
def integer(str)
Integer(str)
end
end
DigitsConverter.lex('123')
# => [[:digit, 1, 1, 0], [:digit, 2, 1, 1], [:digit, 3, 1, 2], [:eos]]
#    the values 1, 2 and 3 are Integers (they would have been Strings)
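The processor hookup can be pictured as a simple `send` on the lexer: look up the named method and run the matched string through it. This is a hypothetical sketch of the idea, not Minilex's API; `TinyLexer` and `process` are made-up names for illustration.

```ruby
# Sketch: applying a named processor method to a matched value
# before it goes into a token. Illustrative only, not Minilex code.
class TinyLexer
  # the processor method you'd write yourself
  def integer(str)
    Integer(str)
  end

  # with a processor name, the matched string is run through that
  # method; without one, it passes through unchanged
  def process(value, processor = nil)
    processor ? send(processor, value) : value
  end
end

lexer = TinyLexer.new
p lexer.process('42', :integer) # => 42
p lexer.process('+')            # => "+"
```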