Class: Dhaka::Tokenizer
- Inherits: Object
  - Object
    - Dhaka::Tokenizer
- Defined in: lib/dhaka/tokenizer/tokenizer.rb
Overview
This abstract class contains a DSL for hand-coding tokenizers. Subclass it to implement tokenizers for specific grammars.
Tokenizers are state machines. Each state of a tokenizer is identified by a Ruby symbol. The constant Dhaka::TOKENIZER_IDLE_STATE is reserved for the idle state of the tokenizer (the one that it starts in).
The following is a tokenizer for arithmetic expressions with integer terms. The tokenizer starts in the idle state, creating single-character tokens for all characters except digits and whitespace. When it encounters a digit, it creates a token on the stack and shifts to :get_integer_literal, where it accumulates the digits of the literal into that token's value. When it again encounters a non-digit character, it shifts back to idle. Whitespace is treated as a delimiter, but not shifted as a token.
class ArithmeticPrecedenceTokenizer < Dhaka::Tokenizer
  digits = ('0'..'9').to_a
  parenths = ['(', ')']
  operators = ['-', '+', '/', '*', '^']
  functions = ['h', 'l']
  arg_separator = [',']
  whitespace = [' ']
  all_characters = digits + parenths + operators + functions + arg_separator + whitespace

  for_state Dhaka::TOKENIZER_IDLE_STATE do
    for_characters(all_characters - (digits + whitespace)) do
      create_token(curr_char, nil)
      advance
    end
    for_characters digits do
      create_token('n', '')
      switch_to :get_integer_literal
    end
    for_character whitespace do
      advance
    end
  end

  for_state :get_integer_literal do
    for_characters all_characters - digits do
      switch_to Dhaka::TOKENIZER_IDLE_STATE
    end
    for_characters digits do
      curr_token.value << curr_char
      advance
    end
  end
end
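To make the state-machine mechanics concrete without requiring the gem, here is a minimal, self-contained sketch. MiniTokenizer, MiniState, Token and SumTokenizer are hypothetical stand-ins written for this illustration; they imitate the DSL described above but are not Dhaka's implementation, and the sketch returns a plain array (or nil on an unrecognized character) rather than Dhaka's result objects.

```ruby
# Hypothetical stand-ins for illustration only; not part of Dhaka.
Token = Struct.new(:symbol_name, :value, :input_position)

class MiniState
  attr_reader :actions

  def initialize
    @actions = {}
  end

  # Register the same block for every character in the list.
  def for_characters(characters, &blk)
    characters.each { |c| @actions[c] = blk }
  end
end

class MiniTokenizer
  IDLE = :idle

  # One table of states per tokenizer class, created on first access.
  def self.states
    @states ||= Hash.new { |hash, name| hash[name] = MiniState.new }
  end

  # Evaluate the block with the state as self, so for_characters resolves on it.
  def self.for_state(name, &blk)
    states[name].instance_eval(&blk)
  end

  def self.tokenize(input)
    new(input).run
  end

  attr_reader :tokens

  def initialize(input)
    @input = input
    @state = self.class.states[IDLE]
    @index = 0
    @tokens = []
  end

  def curr_char
    @input[@index]
  end

  def curr_token
    tokens.last
  end

  def advance
    @index += 1
  end

  def create_token(symbol_name, value)
    tokens << Token.new(symbol_name, value, @index)
  end

  def switch_to(name)
    @state = self.class.states[name]
  end

  # Dispatch on the current character until the input is exhausted.
  # Returns nil if the current state has no action for a character.
  def run
    while curr_char
      blk = @state.actions[curr_char]
      return nil unless blk
      instance_eval(&blk)
    end
    tokens
  end
end

class SumTokenizer < MiniTokenizer
  digits = ('0'..'9').to_a

  for_state IDLE do
    for_characters(['+']) { create_token(curr_char, nil); advance }
    # ''.dup gives a mutable string to accumulate digits into.
    for_characters(digits) { create_token('n', ''.dup); switch_to :get_integer_literal }
    for_characters([' ']) { advance }
  end

  for_state :get_integer_literal do
    for_characters(['+', ' ']) { switch_to IDLE }
    for_characters(digits) { curr_token.value << curr_char; advance }
  end
end

SumTokenizer.tokenize('2 + 34').each do |t|
  puts "#{t.symbol_name} #{t.value.inspect} @#{t.input_position}"
end
# prints:
#   n "2" @0
#   + nil @2
#   n "34" @4
```

Note that the digit action in the idle state does not call advance: it only creates the token and switches state, so the same digit is then consumed by the :get_integer_literal state.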
For languages where the lexical structure is very complicated, it may be too tedious to implement a Tokenizer by hand. In such cases, it’s a lot easier to write a LexerSpecification using regular expressions and create a Lexer from that.
Direct Known Subclasses
Instance Attribute Summary
- #tokens ⇒ Object (readonly)
  The tokens shifted so far.
Class Method Summary
- .for_state(state_name, &blk) ⇒ Object
  Define the action for the state named state_name.
- .tokenize(input) ⇒ Object
  Tokenizes a string input and returns a TokenizerErrorResult on failure or a TokenizerSuccessResult on success.
Instance Method Summary
- #advance ⇒ Object
  Advance to the next character.
- #create_token(symbol_name, value) ⇒ Object
  Push a new token on to the stack with symbol corresponding to symbol_name and a value of value.
- #curr_char ⇒ Object
  The character currently being processed.
- #curr_token ⇒ Object
  The token currently on top of the stack.
- #initialize(input) ⇒ Tokenizer (constructor)
  :nodoc:
- #inspect ⇒ Object
- #run ⇒ Object
  :nodoc:
- #switch_to(state_name) ⇒ Object
  Change the active state of the tokenizer to the state identified by the symbol state_name.
Constructor Details
#initialize(input) ⇒ Tokenizer
:nodoc:
# File 'lib/dhaka/tokenizer/tokenizer.rb', line 143
def initialize(input) #:nodoc:
  @input = input
  @current_state = self.class.states[TOKENIZER_IDLE_STATE]
  @curr_char_index = 0
  @tokens = []
end
Instance Attribute Details
#tokens ⇒ Object (readonly)
The tokens shifted so far.
# File 'lib/dhaka/tokenizer/tokenizer.rb', line 141
def tokens
  @tokens
end
Class Method Details
.for_state(state_name, &blk) ⇒ Object
Define the action for the state named state_name.
# File 'lib/dhaka/tokenizer/tokenizer.rb', line 122
def for_state(state_name, &blk)
  states[state_name].instance_eval(&blk)
end
.tokenize(input) ⇒ Object
Tokenizes a string input and returns a TokenizerErrorResult on failure or a TokenizerSuccessResult on success.
# File 'lib/dhaka/tokenizer/tokenizer.rb', line 127
def tokenize(input)
  new(input).run
end
Instance Method Details
#advance ⇒ Object
Advance to the next character.
# File 'lib/dhaka/tokenizer/tokenizer.rb', line 156
def advance
  @curr_char_index += 1
end
#create_token(symbol_name, value) ⇒ Object
Push a new token on to the stack with symbol corresponding to symbol_name and a value of value.
# File 'lib/dhaka/tokenizer/tokenizer.rb', line 170
def create_token(symbol_name, value)
  new_token = Dhaka::Token.new(symbol_name, value, @curr_char_index)
  tokens << new_token
end
#curr_char ⇒ Object
The character currently being processed.
# File 'lib/dhaka/tokenizer/tokenizer.rb', line 151
def curr_char
  @input[@curr_char_index] and @input[@curr_char_index].chr
end
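The `and ... .chr` form guards against the end of the input: indexing a string past its last character returns nil, and the `and` short-circuits so that .chr is never called on nil. On Ruby 1.8, indexing a string with an integer returned a character code, which .chr turned back into a one-character string; on later versions the index already yields a string and .chr leaves it unchanged. The sketch below uses a hypothetical helper, char_at, that mirrors the method body on a plain string.

```ruby
# Hypothetical helper mirroring the body of #curr_char for a plain string.
def char_at(input, index)
  input[index] and input[index].chr
end

inside = char_at('a+', 1) # a one-character String for an index inside the input
past   = char_at('a+', 2) # nil, because index 2 is past the end of 'a+'

puts inside.inspect
puts past.inspect
```

Because the result is nil at the end of the input, a loop of the form `while curr_char` terminates on its own once every character has been consumed.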
#curr_token ⇒ Object
The token currently on top of the stack.
# File 'lib/dhaka/tokenizer/tokenizer.rb', line 165
def curr_token
  tokens.last
end
#inspect ⇒ Object
# File 'lib/dhaka/tokenizer/tokenizer.rb', line 160
def inspect
  "<Dhaka::Tokenizer grammar : #{grammar}>"
end
#run ⇒ Object
:nodoc:
# File 'lib/dhaka/tokenizer/tokenizer.rb', line 180
def run #:nodoc:
  while curr_char
    blk = @current_state.actions[curr_char] || @current_state.default_action
    return TokenizerErrorResult.new(@curr_char_index) unless blk
    instance_eval(&blk)
  end
  tokens << Dhaka::Token.new(Dhaka::END_SYMBOL_NAME, nil, nil)
  TokenizerSuccessResult.new(tokens)
end
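The lookup in the loop above picks a character-specific action first and falls back to the state's default action; if neither exists, tokenizing stops with an error at the current index. The following simplified sketch isolates that rule. It assumes a single state and one character consumed per step, and first_unhandled_index is a hypothetical helper, not a Dhaka method.

```ruby
# Hypothetical helper: returns the index of the first character that has
# neither a specific action nor a default action, or nil if every
# character is handled. Single state only; no state switching is modelled.
def first_unhandled_index(input, actions, default_action = nil)
  input.each_char.with_index do |char, index|
    return index unless actions[char] || default_action
  end
  nil
end

actions = { '1' => proc {}, '+' => proc {} }

puts first_unhandled_index('1+1', actions).inspect          # nil: all handled
puts first_unhandled_index('1+x', actions).inspect          # 2: 'x' has no action
puts first_unhandled_index('1+x', actions, proc {}).inspect # nil: default covers 'x'
```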
#switch_to(state_name) ⇒ Object
Change the active state of the tokenizer to the state identified by the symbol state_name.
# File 'lib/dhaka/tokenizer/tokenizer.rb', line 176
def switch_to state_name
  @current_state = self.class.states[state_name]
end