Class: Dhaka::Tokenizer
- Inherits: Object
- Hierarchy: Object, Dhaka::Tokenizer
- Defined in: lib/tokenizer/tokenizer.rb
Overview
This class contains a DSL for specifying tokenizers. Subclass it to implement tokenizers for specific grammars. Subclasses of this class may not be further subclassed.
Tokenizers are state machines that are specified pretty much by hand. Each state of a tokenizer is identified by a Ruby symbol. The constant Dhaka::TOKENIZER_IDLE_STATE is reserved for the idle state of the tokenizer (the one that it starts in).
The following is a tokenizer for arithmetic expressions with integer terms. The tokenizer starts in the idle state, creating single-character tokens for all characters except digits and whitespace. When it encounters a digit character, it shifts to :get_integer_literal and pushes a new token onto the stack, in which it accumulates the value of the literal. When it next encounters a non-digit character, it shifts back to idle. Whitespace is treated as a delimiter, but not shifted as a token.
  class ArithmeticPrecedenceTokenizer < Dhaka::Tokenizer

    digits = ('0'..'9').to_a
    parenths = ['(', ')']
    operators = ['-', '+', '/', '*', '^']
    functions = ['h', 'l']
    arg_separator = [',']
    whitespace = [' ']

    all_characters = digits + parenths + operators + functions + arg_separator + whitespace

    for_state Dhaka::TOKENIZER_IDLE_STATE do
      for_characters(all_characters - (digits + whitespace)) do
        create_token(curr_char, nil)
        advance
      end
      for_characters digits do
        create_token('n', '')
        switch_to :get_integer_literal
      end
      for_character whitespace do
        advance
      end
    end

    for_state :get_integer_literal do
      for_characters all_characters - digits do
        switch_to Dhaka::TOKENIZER_IDLE_STATE
      end
      for_characters digits do
        curr_token.value << curr_char
        advance
      end
    end

  end
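To make the state transitions concrete, here is a minimal standalone sketch of the same two-state machine written without the DSL. It does not use Dhaka: SketchToken and sketch_tokenize are hypothetical names introduced only for illustration, and returning nil for an unrecognized character is an assumption of the sketch.

```ruby
# Standalone sketch of the two-state machine described above; it does not use Dhaka.
SketchToken = Struct.new(:symbol_name, :value)

def sketch_tokenize(input)
  digits = ('0'..'9').to_a
  single_chars = ['(', ')', '-', '+', '/', '*', '^', 'h', 'l', ',']
  tokens = []
  state = :idle
  index = 0

  while index < input.length
    char = input[index]
    case state
    when :idle
      if digits.include?(char)
        # Start an integer literal; the digit itself is consumed in the next state.
        tokens << SketchToken.new('n', ''.dup)
        state = :get_integer_literal
      elsif char == ' '
        index += 1                    # whitespace delimits but yields no token
      elsif single_chars.include?(char)
        tokens << SketchToken.new(char, nil)
        index += 1
      else
        return nil                    # sketch-only: no action for this character
      end
    when :get_integer_literal
      if digits.include?(char)
        tokens.last.value << char     # accumulate the literal on the top token
        index += 1
      else
        state = :idle                 # re-examine the same character in idle
      end
    end
  end
  tokens
end

tokens = sketch_tokenize('2 + 34*(5)')
puts tokens.map { |t| [t.symbol_name, t.value].inspect }.join(' ')
```

Note that switching state does not consume a character: the digit that triggers :get_integer_literal is appended by that state, and the non-digit that ends the literal is handled again by the idle state.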
Instance Attribute Summary
- #tokens ⇒ Object (readonly)
  The tokens shifted so far.
Class Method Summary
- .for_state(state_name, &blk) ⇒ Object
  Define the action for the state named state_name.
- .tokenize(input) ⇒ Object
  Tokenizes a string input and returns a TokenizerErrorResult on failure or a TokenizerSuccessResult on success.
Instance Method Summary
- #advance ⇒ Object
  Advance to the next character.
- #create_token(symbol_name, value) ⇒ Object
  Push a new token onto the stack with the symbol corresponding to symbol_name and a value of value.
- #curr_char ⇒ Object
  The character currently being processed.
- #curr_token ⇒ Object
  The token currently on top of the stack.
- #initialize(input) ⇒ Tokenizer (constructor)
  :nodoc:
- #inspect ⇒ Object
- #run ⇒ Object
  :nodoc:
- #switch_to(state_name) ⇒ Object
  Change the active state of the tokenizer to the state identified by the symbol state_name.
Constructor Details
#initialize(input) ⇒ Tokenizer
:nodoc:
  # File 'lib/tokenizer/tokenizer.rb', line 136
  def initialize(input) #:nodoc:
    @input = input
    @current_state = self.class.states[TOKENIZER_IDLE_STATE]
    @curr_char_index = 0
    @tokens = []
  end
Instance Attribute Details
#tokens ⇒ Object (readonly)
The tokens shifted so far.
  # File 'lib/tokenizer/tokenizer.rb', line 134
  def tokens
    @tokens
  end
Class Method Details
.for_state(state_name, &blk) ⇒ Object
Define the action for the state named state_name.
  # File 'lib/tokenizer/tokenizer.rb', line 115
  def for_state(state_name, &blk)
    states[state_name].instance_eval(&blk)
  end
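Because the block is evaluated with instance_eval, bare calls inside it (such as for_characters) resolve against the state object rather than the tokenizer class. The sketch below illustrates that mechanism in isolation. SketchState, SketchTokenizer and DigitTokenizer are hypothetical stand-ins, and the way states is populated (lazily, through a Hash default block) is an assumption, since the definition of states is not shown on this page.

```ruby
# Hypothetical stand-ins: SketchState and SketchTokenizer are not Dhaka classes.
class SketchState
  attr_reader :actions

  def initialize
    @actions = {}
  end

  # Register the same block for every listed character.
  def for_characters(characters, &blk)
    characters.each { |char| @actions[char] = blk }
  end
end

class SketchTokenizer
  # Assumed: states are created on first access, keyed by state name.
  def self.states
    @states ||= Hash.new { |hash, name| hash[name] = SketchState.new }
  end

  # Same shape as for_state above: the block runs with the state as self,
  # so bare calls such as for_characters resolve against the state object.
  def self.for_state(state_name, &blk)
    states[state_name].instance_eval(&blk)
  end
end

class DigitTokenizer < SketchTokenizer
  for_state :idle do
    for_characters(('0'..'9').to_a) { :digit_seen }
  end
end

idle_actions = DigitTokenizer.states[:idle].actions
puts idle_actions.keys.join
```

In this sketch the per-character blocks are only stored at class-definition time; nothing runs them until a tokenizer instance looks them up while processing input.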
.tokenize(input) ⇒ Object
Tokenizes a string input and returns a TokenizerErrorResult on failure or a TokenizerSuccessResult on success.
  # File 'lib/tokenizer/tokenizer.rb', line 120
  def tokenize(input)
    new(input).run
  end
Instance Method Details
#advance ⇒ Object
Advance to the next character.
  # File 'lib/tokenizer/tokenizer.rb', line 149
  def advance
    @curr_char_index += 1
  end
#create_token(symbol_name, value) ⇒ Object
Push a new token onto the stack with the symbol corresponding to symbol_name and a value of value.
  # File 'lib/tokenizer/tokenizer.rb', line 163
  def create_token(symbol_name, value)
    new_token = Dhaka::Token.new(symbol_name, value, @curr_char_index)
    tokens << new_token
  end
#curr_char ⇒ Object
The character currently being processed.
  # File 'lib/tokenizer/tokenizer.rb', line 144
  def curr_char
    @input[@curr_char_index] and @input[@curr_char_index].chr
  end
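The guard in curr_char means the method returns nil once the index moves past the end of the input, which is what lets a loop of the form "while curr_char" terminate. The call to chr matters on Ruby 1.8, where indexing a String with an integer returned a character code; on Ruby 1.9 and later the index already yields a one-character String and chr leaves it unchanged. A small sketch of the same expression, using a hypothetical char_at lambda in place of the instance method:

```ruby
input = 'a+'

# Same expression as curr_char, with the index passed in for illustration.
char_at = lambda do |index|
  input[index] and input[index].chr
end

p char_at.call(0)
p char_at.call(1)
p char_at.call(2)   # index past the end of the input
```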
#curr_token ⇒ Object
The token currently on top of the stack.
  # File 'lib/tokenizer/tokenizer.rb', line 158
  def curr_token
    tokens.last
  end
#inspect ⇒ Object
  # File 'lib/tokenizer/tokenizer.rb', line 153
  def inspect
    "<Dhaka::Tokenizer grammar : #{grammar}>"
  end
#run ⇒ Object
:nodoc:
  # File 'lib/tokenizer/tokenizer.rb', line 173
  def run #:nodoc:
    while curr_char
      blk = @current_state.actions[curr_char]
      return TokenizerErrorResult.new(@curr_char_index) unless blk
      instance_eval(&blk)
    end
    tokens << Dhaka::Token.new(Dhaka::END_SYMBOL_NAME, nil, nil)
    TokenizerSuccessResult.new(tokens)
  end
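The loop in run looks up an action for the current character, reports the character index if none is registered, and otherwise evaluates the action with the tokenizer instance as self. The following miniature reproduces only that dispatch pattern. MiniTokenizer, its ACTIONS table and the [:error, index] / [:success, ...] return values are hypothetical simplifications, not Dhaka's result classes.

```ruby
# Hypothetical miniature of the dispatch loop; MiniTokenizer is not a Dhaka class.
class MiniTokenizer
  # A single fixed state: one action per recognized character.
  ACTIONS = {
    'a' => proc { @seen << curr_char; @index += 1 },
    ' ' => proc { @index += 1 }
  }.freeze

  def initialize(input)
    @input = input
    @index = 0
    @seen = []
  end

  def curr_char
    @input[@index]
  end

  def run
    while curr_char
      blk = ACTIONS[curr_char]
      return [:error, @index] unless blk   # no action registered: report the index
      instance_eval(&blk)                  # the action runs with the tokenizer as self
    end
    [:success, @seen]
  end
end

p MiniTokenizer.new('a a').run
p MiniTokenizer.new('a b').run
```

One consequence of this design is that an action which neither advances nor switches state would loop forever, so each registered block is expected to do at least one of the two.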
#switch_to(state_name) ⇒ Object
Change the active state of the tokenizer to the state identified by the symbol state_name.
  # File 'lib/tokenizer/tokenizer.rb', line 169
  def switch_to state_name
    @current_state = self.class.states[state_name]
  end