Class: Sneaql::Core::Tokenizer

Inherits:
Object
Defined in:
lib/sneaql_lib/tokenizer.rb

Overview

Processes a command string into an array of tokens. The handling is deliberately basic and geared toward supporting string literals. A string literal is enclosed in single quotes, with backslash as the escape character; the only escapable characters are the single quote and the backslash. The tokenizer does not check whether a token is valid in any way; it only aims to split the input reliably. String literal tokens are returned with their enclosing single quotes and with escape characters left in place.

Instance Method Details

#classify(input_char) ⇒ Symbol

Classifies a single character during lexical analysis. The checks run in order and the first match wins.

Parameters:

  • input_char (String)

    single character to classify

Returns:

  • (Symbol)

    classification for character



# File 'lib/sneaql_lib/tokenizer.rb', line 97

def classify(input_char)
  # whitespace delimits tokens when not inside a string literal
  return :whitespace if input_char.match(/\s/)

  # escape character can escape itself
  return :escape if input_char.match(/\\/)

  # any word character
  # also includes - for use in negative numbers
  return :word if input_char.match(/\w|\-/)

  # colon is used to represent variables
  return :colon if input_char.match(/\:/)

  # indicates start of string literal
  return :singlequote if input_char.match(/\'/)

  # deprecated, old variable reference syntax
  return :openbrace if input_char.match(/\{/)
  return :closebrace if input_char.match(/\}/)

  # comparison operator chars
  return :operator if input_char.match(/\=|\>|\<|\!/)

  # any non-word characters
  return :nonword if input_char.match(/\W/)
end
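
The order of the checks matters because several characters match more than one pattern: a backslash, a single quote, a hyphen, and the operator characters would all match /\W/ if that check ran first. The sketch below restates the same checks as an ordered table to make that precedence visible. RULES and classify_char are illustrative names, not part of Sneaql, and the operator pattern is written as a character class on the assumption that it matches the same characters as the alternation above.

```ruby
# Illustrative restatement of #classify as an ordered rule table.
# RULES and classify_char are hypothetical names, not Sneaql API.
RULES = [
  [:whitespace,  /\s/],
  [:escape,      /\\/],
  [:word,        /\w|-/],   # hyphen counts as a word character
  [:colon,       /:/],
  [:singlequote, /'/],
  [:openbrace,   /\{/],
  [:closebrace,  /\}/],
  [:operator,    /[=><!]/],
  [:nonword,     /\W/]      # catch-all, so it must come last
].freeze

def classify_char(input_char)
  # the first matching rule wins, mirroring the early returns in #classify
  rule = RULES.find { |_name, pattern| input_char.match(pattern) }
  rule && rule.first
end

p classify_char('-')    # :word
p classify_char('=')    # :operator
p classify_char('\\')   # :escape
p classify_char(',')    # :nonword
p classify_char('')     # nil, since an empty string matches no rule
```

Note the last case: none of the patterns match an empty string, so the result is nil rather than a Symbol.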

#classify_all(string) ⇒ Array<Symbol>

Returns an array containing one classification for each character in the input string, in order.

Parameters:

  • string (String)

Returns:

  • (Array<Symbol>)

    array of classification symbols



# File 'lib/sneaql_lib/tokenizer.rb', line 129

def classify_all(string)
  # classify each character in order, keeping one symbol per character
  string.each_char.map { |x| classify(x) }
end
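
As a quick illustration of the per-character output, the sketch below classifies a short command fragment. It is self-contained rather than calling Sneaql: classify here is a stand-in written as a case expression that is assumed to behave like the method documented above.

```ruby
# Stand-in for Sneaql::Core::Tokenizer#classify (assumed equivalent).
def classify(input_char)
  case input_char
  when /\s/     then :whitespace
  when /\\/     then :escape
  when /\w|-/   then :word
  when /:/      then :colon
  when /'/      then :singlequote
  when /\{/     then :openbrace
  when /\}/     then :closebrace
  when /[=><!]/ then :operator
  when /\W/     then :nonword
  end
end

def classify_all(string)
  string.each_char.map { |x| classify(x) }
end

p classify_all("a = 'b'")
# [:word, :whitespace, :operator, :whitespace, :singlequote, :word, :singlequote]
```

The result always has the same length as the input, which is what lets #tokenize look up the original character by index.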

#tokenize(string) ⇒ Array<String>

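Splits a command string into an array of tokens. Each character is classified, then the actions listed in the tokenizer state map for that classification and the current state are applied in order.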

Parameters:

  • string (String)

    command string to tokenize

Returns:

  • (Array<String>)

    tokens in left to right order



# File 'lib/sneaql_lib/tokenizer.rb', line 140

def tokenize(string)
  # perform lexical analysis
  classified = classify_all(string)

  # set initial state
  state = :outside_word

  # array to collect tokens
  tokens = []

  # will be rebuilt for each token
  current_token = ''

  # iterate through each character
  classified.each_with_index do |c, i|
    # perform the actions appropriate to character
    # classification and current state
    Sneaql::Core.tokenizer_state_map[c][state].each do |action|
      case
      when action == :no_action then
        nil
      when action == :new_token then
        # rotate the current token if it is not empty string
        tokens << current_token unless current_token == ''
        current_token = ''
      when action == :concat then
        # concatenate current character to current token
        current_token += string[i]
      when action == :error then
        raise 'tokenization error'
      when Sneaql::Core.valid_tokenizer_states.include?(action)
        # if the action is a state name, set the state
        state = action
      end
    end
  end
  # close current token if not empty
  tokens << current_token unless current_token == ''

  # return array of tokens
  tokens
end
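
The method above is a table-driven state machine: the state map is indexed first by character classification and then by current state, and yields a list of actions. The real Sneaql::Core.tokenizer_state_map is not shown on this page, so the sketch below uses a much smaller hypothetical map (DEMO_STATE_MAP) with only four classifications and four states, purely to show the mechanism and the string literal behaviour described in the overview. All names prefixed with demo_ or DEMO_ are assumptions made for illustration.

```ruby
# Hypothetical, reduced state map: classification => state => actions.
# An action is :no_action, :new_token, :concat, :error, or a state name.
DEMO_STATE_MAP = {
  whitespace: {
    outside_word: [:no_action],
    in_word:      [:new_token, :outside_word],
    in_string:    [:concat],
    escaping:     [:error]
  },
  word: {
    outside_word: [:concat, :in_word],
    in_word:      [:concat],
    in_string:    [:concat],
    escaping:     [:error]
  },
  singlequote: {
    outside_word: [:concat, :in_string],
    in_word:      [:new_token, :concat, :in_string],
    in_string:    [:concat, :new_token, :outside_word],
    escaping:     [:concat, :in_string]
  },
  escape: {
    outside_word: [:error],
    in_word:      [:error],
    in_string:    [:concat, :escaping],
    escaping:     [:concat, :in_string]
  }
}.freeze

# Simplified classifier: anything that is not whitespace, a backslash,
# or a single quote is treated as a word character in this sketch.
def demo_classify(char)
  case char
  when /\s/ then :whitespace
  when /\\/ then :escape
  when /'/  then :singlequote
  else :word
  end
end

def demo_tokenize(string)
  state = :outside_word
  tokens = []
  current_token = ''
  string.each_char do |char|
    DEMO_STATE_MAP.fetch(demo_classify(char)).fetch(state).each do |action|
      case action
      when :no_action then next
      when :new_token
        tokens << current_token unless current_token == ''
        current_token = ''
      when :concat then current_token += char
      when :error then raise ArgumentError, "unexpected #{char.inspect} in state #{state}"
      else state = action # any other action is a state name
      end
    end
  end
  # close the final token if one is in progress
  tokens << current_token unless current_token == ''
  tokens
end

p demo_tokenize("say 'it\\'s ok' now")
# ["say", "'it\\'s ok'", "now"]
```

The string literal comes back as a single token with its quotes and its escape backslash intact, and whitespace inside the literal does not split it. In this reduced map a backslash outside a string literal raises an error; whether the real map does the same is not stated in the source.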