Class: Dhaka::Tokenizer

Inherits:
Object
Defined in:
lib/tokenizer/tokenizer.rb

Overview

This class contains a DSL for specifying tokenizers. Subclass it to implement tokenizers for specific grammars. Subclasses of this class may not be further subclassed.

Tokenizers are state machines that are specified largely by hand. Each state of a tokenizer is identified by a Ruby symbol. The constant Dhaka::TOKENIZER_IDLE_STATE is reserved for the idle state of the tokenizer (the state it starts in).

The following is a tokenizer for arithmetic expressions with integer terms. The tokenizer starts in the idle state, creating single-character tokens for all characters except digits and whitespace. When it encounters a digit character, it creates a token on the stack, on which it accumulates the value of the literal, and shifts to :get_integer_literal. When it again encounters a non-digit character, it shifts back to idle. Whitespace is treated as a delimiter, but not shifted as a token.

class ArithmeticPrecedenceTokenizer < Dhaka::Tokenizer

  digits = ('0'..'9').to_a
  parenths = ['(', ')']
  operators = ['-', '+', '/', '*', '^']
  functions = ['h', 'l']
  arg_separator = [',']
  whitespace = [' ']

  all_characters = digits + parenths + operators + functions + arg_separator + whitespace

  for_state Dhaka::TOKENIZER_IDLE_STATE do
    for_characters(all_characters - (digits + whitespace)) do
      create_token(curr_char, nil)
      advance
    end
    for_characters digits do
      create_token('n', '')
      switch_to :get_integer_literal
    end
    for_characters whitespace do
      advance
    end
  end

  for_state :get_integer_literal do
    for_characters all_characters - digits do
      switch_to Dhaka::TOKENIZER_IDLE_STATE
    end
    for_characters digits do
      curr_token.value << curr_char
      advance
    end
  end

end
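To make the two states easier to follow, here is a stand-alone sketch of the same state machine in plain Ruby. It does not use Dhaka at all: sketch_tokenize and its [symbol_name, value] pairs are hypothetical stand-ins for illustration. Unlike the tokenizer above, which has no action for characters outside all_characters, this sketch accepts any non-digit, non-space character as a single-character token.

```ruby
# Stand-alone sketch of the state machine described above; not Dhaka code.
# sketch_tokenize is a hypothetical helper, and each token is a plain
# [symbol_name, value] pair rather than a Dhaka::Token.
def sketch_tokenize(input)
  digits     = ('0'..'9').to_a
  whitespace = [' ']
  tokens     = []
  state      = :idle
  index      = 0

  while index < input.length
    char = input[index, 1]
    case state
    when :idle
      if digits.include?(char)
        tokens << ['n', '']           # start an empty integer literal
        state = :get_integer_literal  # no advance: the digit is read again
      elsif whitespace.include?(char)
        index += 1                    # delimiter only, no token
      else
        tokens << [char, nil]         # single-character token
        index += 1
      end
    when :get_integer_literal
      if digits.include?(char)
        tokens.last[1] << char        # accumulate the literal's value
        index += 1
      else
        state = :idle                 # no advance: idle handles this character
      end
    end
  end
  tokens
end

p sketch_tokenize('12 + 3')  # => [["n", "12"], ["+", nil], ["n", "3"]]
```

Note that neither state change consumes a character. The digit that triggers the switch to :get_integer_literal is appended by that state, and the non-digit that ends the literal is handled by the idle state.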


Constructor Details

#initialize(input) ⇒ Tokenizer

:nodoc:



# File 'lib/tokenizer/tokenizer.rb', line 136

def initialize(input) #:nodoc:
  @input           = input
  @current_state   = self.class.states[TOKENIZER_IDLE_STATE]
  @curr_char_index = 0
  @tokens          = []
end

Instance Attribute Details

#tokens ⇒ Object (readonly)

The tokens shifted so far.



# File 'lib/tokenizer/tokenizer.rb', line 134

def tokens
  @tokens
end

Class Method Details

.for_state(state_name, &blk) ⇒ Object

Define the action for the state named state_name.



# File 'lib/tokenizer/tokenizer.rb', line 115

def for_state(state_name, &blk)
  states[state_name].instance_eval(&blk)
end
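The call to instance_eval is what lets for_characters appear unqualified inside a for_state block. The sketch below illustrates the mechanism with hypothetical stand-in classes (SketchState, SketchTokenizer). The lazily populated states hash is an assumption, since the definition of states is not shown on this page.

```ruby
# Hypothetical stand-ins that illustrate the mechanism; not Dhaka's classes.
class SketchState
  attr_reader :actions

  def initialize
    @actions = {}  # character => block to run for that character
  end

  # Register the same block for every character in the list.
  def for_characters(characters, &blk)
    characters.each { |character| @actions[character] = blk }
  end
end

class SketchTokenizer
  # Assumption: a state object is created the first time its name is used.
  def self.states
    @states ||= Hash.new { |hash, name| hash[name] = SketchState.new }
  end

  def self.for_state(state_name, &blk)
    # instance_eval makes the state object `self` inside the block, so a
    # bare for_characters call resolves to SketchState#for_characters.
    states[state_name].instance_eval(&blk)
  end

  for_state :idle do
    for_characters(['+', '-']) { :operator_action }
  end
end

p SketchTokenizer.states[:idle].actions.keys  # => ["+", "-"]
```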

.tokenize(input) ⇒ Object

Tokenizes a string input and returns a TokenizerErrorResult on failure or a TokenizerSuccessResult on success.



# File 'lib/tokenizer/tokenizer.rb', line 120

def tokenize(input)
  new(input).run
end

Instance Method Details

#advance ⇒ Object

Advance to the next character.



# File 'lib/tokenizer/tokenizer.rb', line 149

def advance
  @curr_char_index += 1
end

#create_token(symbol_name, value) ⇒ Object

Push a new token onto the stack with the symbol corresponding to symbol_name and a value of value.



# File 'lib/tokenizer/tokenizer.rb', line 163

def create_token(symbol_name, value)
  new_token = Dhaka::Token.new(symbol_name, value, @curr_char_index)
  tokens << new_token
end

#curr_char ⇒ Object

The character currently being processed.



# File 'lib/tokenizer/tokenizer.rb', line 144

def curr_char
  @input[@curr_char_index] and @input[@curr_char_index].chr 
end
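The `and` guard means curr_char returns nil once the index moves past the end of the input, which is the condition that ends the loop in run. A minimal sketch of that behavior, using a hypothetical helper char_at that mirrors the method body:

```ruby
# char_at is a hypothetical helper that mirrors the body of curr_char.
def char_at(input, index)
  # String#[] returns nil past the end of the string, so the `and`
  # short-circuits and .chr is never called on nil.
  input[index] and input[index].chr
end

p char_at('a+b', 1)  # => "+"
p char_at('a+b', 3)  # => nil
```

The .chr call matters on Ruby 1.8, where indexing a string with an integer returns a character code rather than a one-character string. On later versions it returns the character unchanged.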

#curr_token ⇒ Object

The token currently on top of the stack.



# File 'lib/tokenizer/tokenizer.rb', line 158

def curr_token
  tokens.last
end

#inspect ⇒ Object



# File 'lib/tokenizer/tokenizer.rb', line 153

def inspect
  "<Dhaka::Tokenizer grammar : #{grammar}>"
end

#run ⇒ Object

:nodoc:



# File 'lib/tokenizer/tokenizer.rb', line 173

def run #:nodoc:
  while curr_char
    blk = @current_state.actions[curr_char]
    return TokenizerErrorResult.new(@curr_char_index) unless blk
    instance_eval(&blk)
  end
  tokens << Dhaka::Token.new(Dhaka::END_SYMBOL_NAME, nil, nil)
  TokenizerSuccessResult.new(tokens)
end
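The loop above is a table-driven dispatch: look up the action for the current character, fail with the character index if there is none, and append an end marker once the input is exhausted. A rough sketch of that shape follows, with hypothetical names (sketch_run, the :ok/:error pairs) in place of Dhaka's result classes. One simplification to keep in mind: this sketch advances the index itself, whereas in run the actions are responsible for calling advance or switch_to.

```ruby
# Hypothetical stand-in for the dispatch loop; not Dhaka's code.
# actions maps a single character to a callable that builds a token.
def sketch_run(input, actions)
  index  = 0
  tokens = []
  while index < input.length
    char   = input[index, 1]
    action = actions[char]
    return [:error, index] unless action  # no action defined for this character
    tokens << action.call(char)
    index += 1                            # simplification: the loop advances
  end
  tokens << :end                          # end-of-input marker
  [:ok, tokens]
end

upcase  = lambda { |char| char.upcase }
actions = { 'a' => upcase, 'b' => upcase }

p sketch_run('abba', actions)  # => [:ok, ["A", "B", "B", "A", :end]]
p sketch_run('ab?a', actions)  # => [:error, 2]
```

Because run itself never advances, an action that neither calls advance nor switches to a state that eventually does would leave curr_char unchanged and the loop would not terminate.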

#switch_to(state_name) ⇒ Object

Change the active state of the tokenizer to the state identified by the symbol state_name.



# File 'lib/tokenizer/tokenizer.rb', line 169

def switch_to state_name
  @current_state = self.class.states[state_name]
end