Class: Dhaka::Tokenizer

Inherits:
Object
Defined in:
lib/dhaka/tokenizer/tokenizer.rb

Overview

This abstract class contains a DSL for hand-coding tokenizers. Subclass it to implement tokenizers for specific grammars.

Tokenizers are state machines. Each state of a tokenizer is identified by a Ruby symbol. The constant Dhaka::TOKENIZER_IDLE_STATE is reserved for the idle state of the tokenizer (the one that it starts in).

The following is a tokenizer for arithmetic expressions with integer terms. The tokenizer starts in the idle state, creating single-character tokens for all characters except digits and whitespace. When it encounters a digit, it pushes a new token onto the stack and switches to :get_integer_literal, where it accumulates the digits of the literal in that token's value. When it next encounters a non-digit character, it switches back to the idle state. Whitespace is treated as a delimiter but is not shifted as a token.

class ArithmeticPrecedenceTokenizer < Dhaka::Tokenizer

  digits = ('0'..'9').to_a
  parenths = ['(', ')']
  operators = ['-', '+', '/', '*', '^']
  functions = ['h', 'l']
  arg_separator = [',']
  whitespace = [' ']

  all_characters = digits + parenths + operators + functions + arg_separator + whitespace

  for_state Dhaka::TOKENIZER_IDLE_STATE do
    for_characters(all_characters - (digits + whitespace)) do
      create_token(curr_char, nil)
      advance
    end
    for_characters digits do
      create_token('n', '')
      switch_to :get_integer_literal
    end
    for_character whitespace do
      advance
    end
  end

  for_state :get_integer_literal do
    for_characters all_characters - digits do
      switch_to Dhaka::TOKENIZER_IDLE_STATE
    end
    for_characters digits do
      curr_token.value << curr_char
      advance
    end
  end

end
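
To make the state transitions concrete, here is a minimal standalone sketch that mirrors the two states of the example in plain Ruby, without using Dhaka. The method name sketch_tokenize and the [symbol_name, value] pair representation are assumptions made for illustration only; real Dhaka tokens are Dhaka::Token objects, and unlike this sketch the real tokenizer returns an error result for characters that have no action.

```ruby
# Standalone illustration only; none of these names come from Dhaka.
DIGITS     = ('0'..'9').to_a
WHITESPACE = [' ']

def sketch_tokenize(input)
  tokens = []      # the token "stack": the last element is the current token
  state  = :idle
  index  = 0
  while index < input.length
    char = input[index]
    case state
    when :idle
      if DIGITS.include?(char)
        tokens << ['n', ''.dup]      # start a literal; do not advance yet
        state = :get_integer_literal
      elsif WHITESPACE.include?(char)
        index += 1                   # delimiter: skip without a token
      else
        tokens << [char, nil]        # single-character token
        index += 1
      end
    when :get_integer_literal
      if DIGITS.include?(char)
        tokens.last[1] << char       # accumulate the literal's value
        index += 1
      else
        state = :idle                # let the idle state handle this character
      end
    end
  end
  tokens
end

p sketch_tokenize('2 + 34')
# => [["n", "2"], ["+", nil], ["n", "34"]]
```

Note that switching states never advances the input, so the character that triggered the switch is processed again by the new state.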

For languages where the lexical structure is very complicated, it may be too tedious to implement a Tokenizer by hand. In such cases, it’s a lot easier to write a LexerSpecification using regular expressions and create a Lexer from that.

Direct Known Subclasses

LexerSupport::RegexTokenizer

Constructor Details

#initialize(input) ⇒ Tokenizer

:nodoc:

# File 'lib/dhaka/tokenizer/tokenizer.rb', line 143

def initialize(input) #:nodoc:
  @input           = input
  @current_state   = self.class.states[TOKENIZER_IDLE_STATE]
  @curr_char_index = 0
  @tokens          = []
end

Instance Attribute Details

#tokens ⇒ Object (readonly)

The tokens shifted so far.

# File 'lib/dhaka/tokenizer/tokenizer.rb', line 141

def tokens
  @tokens
end

Class Method Details

.for_state(state_name, &blk) ⇒ Object

Define the action for the state named state_name.

# File 'lib/dhaka/tokenizer/tokenizer.rb', line 122

def for_state(state_name, &blk)
  states[state_name].instance_eval(&blk)
end
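
The instance_eval call means the block passed to for_state runs with the state object as self, which is why for_characters can be called inside it without a receiver. The following is a rough sketch of that mechanism under stated assumptions: SketchState and SketchTokenizer are hypothetical stand-ins, and the lazily populated states hash is a guess at how states come into existence, not a description of Dhaka's internals.

```ruby
# Hypothetical stand-in for the per-state object; not Dhaka's real class.
class SketchState
  attr_reader :actions

  def initialize
    @actions = {}
  end

  # Register the same block for every character in the list.
  def for_characters(characters, &blk)
    characters.each { |character| @actions[character] = blk }
  end
end

class SketchTokenizer
  # Assumption: a state is created the first time its name is referenced.
  def self.states
    @states ||= Hash.new { |hash, name| hash[name] = SketchState.new }
  end

  # Same shape as the method above: evaluate the block against the state.
  def self.for_state(state_name, &blk)
    states[state_name].instance_eval(&blk)
  end

  for_state :idle do
    # self is the SketchState for :idle here
    for_characters(['+', '-']) { :operator }
  end
end

idle = SketchTokenizer.states[:idle]
p idle.actions.keys        # => ["+", "-"]
p idle.actions['+'].call   # => :operator
```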

.tokenize(input) ⇒ Object

Tokenizes a string input and returns a TokenizerErrorResult on failure or a TokenizerSuccessResult on success.

# File 'lib/dhaka/tokenizer/tokenizer.rb', line 127

def tokenize(input)
  new(input).run
end

Instance Method Details

#advance ⇒ Object

Advance to the next character.

# File 'lib/dhaka/tokenizer/tokenizer.rb', line 156

def advance
  @curr_char_index += 1
end

#create_token(symbol_name, value) ⇒ Object

Push a new token onto the stack, with the symbol named by symbol_name and the given value. The token also records the current character index as its input position.

# File 'lib/dhaka/tokenizer/tokenizer.rb', line 170

def create_token(symbol_name, value)
  new_token = Dhaka::Token.new(symbol_name, value, @curr_char_index)
  tokens << new_token
end

#curr_char ⇒ Object

The character currently being processed.

# File 'lib/dhaka/tokenizer/tokenizer.rb', line 151

def curr_char
  @input[@curr_char_index] and @input[@curr_char_index].chr 
end
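
The guard in curr_char relies on String#[] returning nil for an out-of-range index, which is also what ends the loop in #run. A small sketch of that behavior on Ruby 1.9 and later follows; the helper name char_at is hypothetical. On Ruby 1.8, indexing a string with an integer returned a character code, which is presumably why the original calls .chr; on later versions String#chr simply returns the first character of the string.

```ruby
# Hypothetical helper mirroring the body of curr_char for any string and index.
def char_at(input, index)
  input[index] and input[index].chr
end

input = 'a+'
p char_at(input, 0)   # => "a"
p char_at(input, 1)   # => "+"
p char_at(input, 2)   # => nil (index 2 is past the end of the input)
```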

#curr_token ⇒ Object

The token currently on top of the stack.

# File 'lib/dhaka/tokenizer/tokenizer.rb', line 165

def curr_token
  tokens.last
end

#inspect ⇒ Object

# File 'lib/dhaka/tokenizer/tokenizer.rb', line 160

def inspect
  "<Dhaka::Tokenizer grammar : #{grammar}>"
end

#run ⇒ Object

:nodoc:

# File 'lib/dhaka/tokenizer/tokenizer.rb', line 180

def run #:nodoc:
  while curr_char
    blk = @current_state.actions[curr_char] || @current_state.default_action
    return TokenizerErrorResult.new(@curr_char_index) unless blk
    instance_eval(&blk)
  end
  tokens << Dhaka::Token.new(Dhaka::END_SYMBOL_NAME, nil, nil)
  TokenizerSuccessResult.new(tokens)
end
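
The lookup in run tries the action registered for the current character first, then falls back to the state's default action, and reports an error at the current index if neither exists. The sketch below isolates just that lookup rule. It is a simplification under assumptions: there is a single state, every action is taken to advance by exactly one character, and first_unhandled_index is a hypothetical helper rather than anything Dhaka provides.

```ruby
# Hypothetical helper: returns the index of the first character that has
# neither a specific action nor a default action, or nil if all are handled.
def first_unhandled_index(input, actions, default_action = nil)
  input.each_char.with_index do |char, index|
    blk = actions[char] || default_action
    return index unless blk
  end
  nil
end

actions = { '1' => proc {}, '+' => proc {} }

p first_unhandled_index('1+1', actions)            # => nil
p first_unhandled_index('1+x', actions)            # => 2
p first_unhandled_index('1+x', actions, proc {})   # => nil
```

In the real method the index of the unhandled character is what gets passed to TokenizerErrorResult.new.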

#switch_to(state_name) ⇒ Object

Change the active state of the tokenizer to the state identified by the symbol state_name.

# File 'lib/dhaka/tokenizer/tokenizer.rb', line 176

def switch_to state_name
  @current_state = self.class.states[state_name]
end