Class: Dhaka::Tokenizer

Inherits: Object

Defined in: lib/tokenizer/tokenizer.rb

Overview

This class contains a DSL for specifying tokenizers. Subclass it to implement tokenizers for specific grammars. Subclasses of this class may not be further subclassed.

Tokenizers are state machines whose states and transitions are specified by hand. Each state of a tokenizer is identified by a Ruby symbol. The constant Dhaka::TOKENIZER_IDLE_STATE is reserved for the idle state of the tokenizer (the one that it starts in).

The following is a tokenizer for arithmetic expressions with integer terms. The tokenizer starts in the idle state, creating single-character tokens for all characters except digits and whitespace. When it encounters a digit, it pushes a new token onto the stack and shifts to :get_integer_literal, where it accumulates the digits of the literal in that token's value. When it next encounters a non-digit character, it shifts back to idle. Whitespace is treated as a delimiter but is not shifted as a token.

class ArithmeticPrecedenceTokenizer < Dhaka::Tokenizer

  digits = ('0'..'9').to_a
  parenths = ['(', ')']
  operators = ['-', '+', '/', '*', '^']
  functions = ['h', 'l']
  arg_separator = [',']
  whitespace = [' ']

  all_characters = digits + parenths + operators + functions + arg_separator + whitespace

  for_state Dhaka::TOKENIZER_IDLE_STATE do
    for_characters(all_characters - (digits + whitespace)) do
      create_token(curr_char, nil)
      advance
    end
    for_characters digits do
      create_token('n', '')
      switch_to :get_integer_literal
    end
    for_character whitespace do
      advance
    end
  end

  for_state :get_integer_literal do
    for_characters all_characters - digits do
      switch_to Dhaka::TOKENIZER_IDLE_STATE
    end
    for_characters digits do
      curr_token.value << curr_char
      advance
    end
  end

end
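To make the state-machine behaviour described above concrete, here is a minimal, self-contained sketch that can be run without the library. MiniState, MiniTokenizer, SumTokenizer and the IDLE constant are hypothetical stand-ins written for this illustration; they mimic the DSL shown on this page but are not the Dhaka source, and they return a plain token array (or the failing character index) instead of result objects.

```ruby
# Hypothetical stand-ins for illustration only; this is NOT the Dhaka source.
class MiniState
  attr_reader :actions

  def initialize
    @actions = {}
  end

  # Register the same block for every character in the list.
  def for_characters(characters, &blk)
    characters.each { |c| @actions[c] = blk }
  end
end

class MiniTokenizer
  IDLE  = :idle
  Token = Struct.new(:symbol_name, :value, :input_position)

  # One state table per subclass; states are created on first lookup.
  def self.states
    @states ||= Hash.new { |hash, name| hash[name] = MiniState.new }
  end

  def self.for_state(name, &blk)
    states[name].instance_eval(&blk)
  end

  def self.tokenize(input)
    new(input).run
  end

  attr_reader :tokens

  def initialize(input)
    @input  = input
    @index  = 0
    @state  = self.class.states[IDLE]
    @tokens = []
  end

  def curr_char
    @input[@index]
  end

  def curr_token
    tokens.last
  end

  def advance
    @index += 1
  end

  def create_token(symbol_name, value)
    tokens << Token.new(symbol_name, value, @index)
  end

  def switch_to(name)
    @state = self.class.states[name]
  end

  # Returns the tokens on success, or the index of the first
  # character for which the active state has no action.
  def run
    while curr_char
      blk = @state.actions[curr_char]
      return @index unless blk
      instance_eval(&blk)
    end
    tokens
  end
end

class SumTokenizer < MiniTokenizer
  digits = ('0'..'9').to_a

  for_state IDLE do
    for_characters(['+']) { create_token(curr_char, nil); advance }
    for_characters(digits) { create_token('n', +''); switch_to :get_integer_literal }
    for_characters([' ']) { advance }
  end

  for_state :get_integer_literal do
    for_characters(['+', ' ']) { switch_to IDLE }
    for_characters(digits) { curr_token.value << curr_char; advance }
  end
end

p SumTokenizer.tokenize('12 + 3').map { |t| [t.symbol_name, t.value] }
# prints [["n", "12"], ["+", nil], ["n", "3"]]
p SumTokenizer.tokenize('1*2')
# prints 1 (no action for '*' while reading the literal)
```

Note that the digit action in the idle state does not call advance: the same character is handled again by :get_integer_literal, which appends it to the token value.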

Instance Attribute Summary

#tokens ⇒ Object readonly

The tokens shifted so far.

Class Method Summary

.for_state(state_name, &blk) ⇒ Object

Define the action for the state named state_name.
.tokenize(input) ⇒ Object

Tokenizes a string input and returns a TokenizerErrorResult on failure or a TokenizerSuccessResult on success.

Instance Method Summary

#advance ⇒ Object

Advance to the next character.
#create_token(symbol_name, value) ⇒ Object

Push a new token onto the stack with the symbol corresponding to symbol_name and a value of value.
#curr_char ⇒ Object

The character currently being processed.
#curr_token ⇒ Object

The token currently on top of the stack.
#initialize(input) ⇒ Tokenizer constructor

:nodoc:.
#inspect ⇒ Object
#run ⇒ Object

:nodoc:.
#switch_to(state_name) ⇒ Object

Change the active state of the tokenizer to the state identified by the symbol state_name.

Constructor Details

#initialize(input) ⇒ `Tokenizer`

:nodoc:

# File 'lib/tokenizer/tokenizer.rb', line 136

def initialize(input) #:nodoc:
  @input           = input
  @current_state   = self.class.states[TOKENIZER_IDLE_STATE]
  @curr_char_index = 0
  @tokens          = []
end

Instance Attribute Details

#tokens ⇒ `Object` (readonly)

The tokens shifted so far.



# File 'lib/tokenizer/tokenizer.rb', line 134

def tokens
  @tokens
end

Class Method Details

.for_state(state_name, &blk) ⇒ `Object`

Define the action for the state named state_name.



# File 'lib/tokenizer/tokenizer.rb', line 115

def for_state(state_name, &blk)
  states[state_name].instance_eval(&blk)
end
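Because for_state runs the block through instance_eval, calls such as for_characters inside the block are sent to the state object rather than to the tokenizer class. The sketch below illustrates this mechanism under two assumptions not shown on this page: that states is a hash that creates a state on first lookup, and that the state class stores one block per character. SketchState and STATES are hypothetical names.

```ruby
# Hypothetical stand-in; the real state class is not shown on this page.
class SketchState
  attr_reader :actions

  def initialize
    @actions = {}
  end

  # Map every listed character to the same action block.
  def for_characters(characters, &blk)
    characters.each { |c| @actions[c] = blk }
  end
end

# Assumption: states are created lazily, one per name.
STATES = Hash.new { |hash, name| hash[name] = SketchState.new }

# instance_eval makes the state object `self` inside the block,
# so the bare for_characters call below lands on SketchState.
STATES[:idle].instance_eval do
  for_characters(%w[a b]) { :letter }
end

p STATES[:idle].actions.keys   # prints ["a", "b"]
p STATES.keys                  # prints [:idle]
```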

.tokenize(input) ⇒ `Object`

Tokenizes a string input and returns a TokenizerErrorResult on failure or a TokenizerSuccessResult on success.



# File 'lib/tokenizer/tokenizer.rb', line 120

def tokenize(input)
  new(input).run
end

Instance Method Details

#advance ⇒ `Object`

Advance to the next character.



# File 'lib/tokenizer/tokenizer.rb', line 149

def advance
  @curr_char_index += 1
end

#create_token(symbol_name, value) ⇒ `Object`

Push a new token onto the stack with the symbol corresponding to symbol_name and a value of value.

# File 'lib/tokenizer/tokenizer.rb', line 163

def create_token(symbol_name, value)
  new_token = Dhaka::Token.new(symbol_name, value, @curr_char_index)
  tokens << new_token
end

#curr_char ⇒ `Object`

The character currently being processed.



# File 'lib/tokenizer/tokenizer.rb', line 144

def curr_char
  @input[@curr_char_index] and @input[@curr_char_index].chr 
end
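The nil check and the call to chr are easier to follow with a small experiment. In Ruby 1.8, indexing a string with an integer returned the character code, which is presumably why chr is applied here; on Ruby 1.9 and later the index already yields a one-character string and chr leaves it unchanged. In both cases an index past the end yields nil, which is what ends the loop in run. The helper name char_at below is hypothetical.

```ruby
# Same expression as curr_char, extracted into a hypothetical helper.
def char_at(input, index)
  input[index] and input[index].chr
end

p 'a+1'[0]            # prints "a" on Ruby 1.9+
p 'a+1'[3]            # prints nil (one past the last character)
p 97.chr              # prints "a" (Integer#chr, the Ruby 1.8 path)
p char_at('a+1', 1)   # prints "+"
p char_at('a+1', 3)   # prints nil
```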

#curr_token ⇒ `Object`

The token currently on top of the stack.



# File 'lib/tokenizer/tokenizer.rb', line 158

def curr_token
  tokens.last
end

#inspect ⇒ `Object`



# File 'lib/tokenizer/tokenizer.rb', line 153

def inspect
  "<Dhaka::Tokenizer grammar : #{grammar}>"
end

#run ⇒ `Object`

:nodoc:

# File 'lib/tokenizer/tokenizer.rb', line 173

def run #:nodoc:
  while curr_char
    blk = @current_state.actions[curr_char]
    return TokenizerErrorResult.new(@curr_char_index) unless blk
    instance_eval(&blk)
  end
  tokens << Dhaka::Token.new(Dhaka::END_SYMBOL_NAME, nil, nil)
  TokenizerSuccessResult.new(tokens)
end

#switch_to(state_name) ⇒ `Object`

Change the active state of the tokenizer to the state identified by the symbol state_name.



# File 'lib/tokenizer/tokenizer.rb', line 169

def switch_to state_name
  @current_state = self.class.states[state_name]
end
