Minilex

A little lexer toolkit, for basic lexing needs.

It's designed for the cases where parsers do the parsing, and all you need from your lexer is an array of simple tokens.

Usage

Expression = Minilex::Lexer.new do
  skip :whitespace, /\s+/
  tok :number, /\d+(?:\.\d+)?/
  tok :operator, /[\+\=\/\*]/
end

Expression.lex('1 + 2.34')
# => [[:number, '1', 1, 0],
#     [:operator, '+', 1, 2],
#     [:number, '2.34', 1, 4],
#     [:eos]]

To create a lexer with Minilex, instantiate a Minilex::Lexer and define rules.

There are two methods for defining rules, skip and tok:

skip takes an id and a pattern. The lexer will ignore all occurrences of the pattern in the input text. The id isn't used in the output, but it's a required argument and makes the rule easier to read.

tok also takes an id and a pattern. The lexer will turn all occurrences of the pattern into a token of the form:

[id, value, line, offset]

# id     - the id you provided
# value  - the matched value
# line   - line number
# offset - character position in the line
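
For example, lexing input that spans two lines with the Expression lexer from above might produce something like this (a sketch, assuming 1-based line numbers and 0-based offsets that reset at each newline, as the earlier examples suggest):

Expression.lex("1 +\n2")
# => [[:number, '1', 1, 0],
#     [:operator, '+', 1, 2],
#     [:number, '2', 2, 0],
#     [:eos]]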

Overriding the token format

If you'd like to customize the token format, override append_token:

Digits = Minilex::Lexer.new do
  skip :whitespace, /\s+/
  tok :digit, /\d/

  # id    - the id of the matched rule
  # value - the value that was matched
  #
  # You have access to the array of tokens via `tokens` and the current
  # token's position information via `pos`.
  def append_token(id, value)
    tokens << Integer(value)
  end

  # By default, the lexer will append an end-of-stream token to the end of
  # the tokens array. You can override what the eos token is or even suppress
  # it altogether with the append_eos callback.
  #
  # Here we'll suppress it by doing nothing
  def append_eos
  end
end

Digits.lex('1 2 3 4')
# => [1, 2, 3, 4]
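
Instead of suppressing the end-of-stream token, you could replace it with a marker of your own from the same callback. Here's a sketch (the Terminated name is just for illustration) that relies only on the `tokens` accessor described above:

Terminated = Minilex::Lexer.new do
  skip :whitespace, /\s+/
  tok :digit, /\d/

  # push a custom end marker instead of the default [:eos]
  def append_eos
    tokens << :done
  end
end

Terminated.lex('1 2')
# => [[:digit, '1', 1, 0], [:digit, '2', 1, 2], :done]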

Processing values

There's one more thing you can do. It's just for convenience, though I'm not sure it really belongs in something that's supposed to do as little as possible. I might remove it.

The tok method accepts a third optional processor argument, which should name a method on the lexer (you'll have to write the method, of course).

This gives you a chance to get at the matched text before it gets stuffed into a token:

DigitsConverter = Minilex::Lexer.new do
  skip :whitespace, /\s+/
  tok :digit, /\d/, :integer

  def integer(str)
    Integer(str)
  end
end

DigitsConverter.lex('123')
# => [[:digit, 1, 1, 0], [:digit, 2, 1, 1], [:digit, 3, 1, 2], [:eos]]
#              ^                  ^                  ^
#              ^                  ^                  ^
#            These are Integers (would have been Strings)