Class: Rly::Lex

Inherits: Object
Defined in:
lib/rly/lex.rb,
lib/rly/helpers.rb

Overview

Base class for your lexer.

Generally, you define a new lexer by subclassing Rly::Lex. Your code should use the Lex.token, Lex.ignore, Lex.literals, and Lex.on_error methods to configure the lexer (check each method's documentation for details).

Once your lexer is configured, you can create instances of it by passing a String to be tokenized. You can then use the #next method to get tokens. If you have more input to tokenize, you can append it to the input buffer at any time with #input.

Direct Known Subclasses

FileLex

Instance Attribute Summary

DSL Class Methods

Class Method Summary

Instance Method Summary

Constructor Details

#initialize(input = "") ⇒ Lex

Creates a new lexer instance for the given input.

Examples:

class MyLexer < Rly::Lex
  ignore " "
  token :LOWERS, /[a-z]+/
  token :UPPERS, /[A-Z]+/
end

lex = MyLexer.new("hello WORLD")
t = lex.next
puts "#{t.type} -> #{t.value}" #=> "LOWERS -> hello"
t = lex.next
puts "#{t.type} -> #{t.value}" #=> "UPPERS -> WORLD"
t = lex.next # => nil

Parameters:

  • input (String) (defaults to: "")

    a string to be tokenized



# File 'lib/rly/lex.rb', line 63

def initialize(input="")
  @input = input
  @pos = 0
  @lineno = 0
end

Instance Attribute Details

#lineno ⇒ Fixnum

Tracks the current line number for generated tokens

lineno’s value should be increased manually. Check the example for a demo rule.

Examples:

token /\n+/ do |t| t.lexer.lineno += t.value.count("\n"); t end

Returns:

  • (Fixnum)

    current line number



# File 'lib/rly/lex.rb', line 30

def lineno
  @lineno
end

#pos ⇒ Fixnum

Tracks the current position in the input string

Generally, it should only be used to skip a few characters in the error handler.

Examples:

on_error do |t|
  t.lexer.pos += 1
  nil # skip the bad character
end

Returns:

  • (Fixnum)

    index of a starting character for current token



# File 'lib/rly/lex.rb', line 44

def pos
  @pos
end

Class Method Details

.callables ⇒ Object



# File 'lib/rly/lex.rb', line 191

def callables
  @callables ||= {}
end

.ignore(ign) ⇒ Object

Specifies a list of one-char symbols to be ignored in input

This method allows the lexer to quickly skip over formatting symbols, like tabs and spaces.

Examples:

class MyLexer < Rly::Lex
  literals "+-"
  token :INT, /\d+/
  ignore " \t"
end

lex = MyLexer.new("2 + 2")
lex.each do |tok|
  puts "#{tok.type} -> #{tok.value}" #=> "INT -> 2"
                                     #=> "+ -> +"
                                     #=> "INT -> 2"
end

Parameters:

  • ign (String)

    the list of ignored symbols




# File 'lib/rly/lex.rb', line 346

def ignore(ign)
  @ignores = ign
  nil
end

.ignore_spaces_and_tabs ⇒ Object



# File 'lib/rly/helpers.rb', line 4

def self.ignore_spaces_and_tabs
  ignore " \t"
end

.lex_double_quoted_string_tokens ⇒ Object



# File 'lib/rly/helpers.rb', line 15

def self.lex_double_quoted_string_tokens
  token :STRING, /"[^"]*"/ do |t|
    t.value = t.value[1...-1]
    t
  end
end
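The STRING rule above strips the surrounding quotes by slicing the matched value. The slicing trick itself is plain Ruby and can be checked in isolation (the raw string below is a hypothetical lexeme, not actual rly output):

```ruby
# A matched double-quoted lexeme, as the STRING rule above would receive it.
raw = '"hello world"'

# `[1...-1]` takes characters from index 1 up to (but not including) the
# last one, i.e. everything between the opening and closing quotes.
stripped = raw[1...-1]
puts stripped  # => hello world
```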

.lex_number_tokens ⇒ Object



# File 'lib/rly/helpers.rb', line 8

def self.lex_number_tokens
  token :NUMBER, /\d+/ do |t|
    t.value = t.value.to_i
    t
  end
end

.literals(lit) ⇒ Object

Specifies a list of one-char literals

Literals may be used when you have several one-character tokens and you don’t want to define them one by one with the token method.

Examples:

class MyLexer < Rly::Lex
  literals "+-/*"
end

lex = MyLexer.new("+-")
lex.each do |tok|
  puts "#{tok.type} -> #{tok.value}" #=> "+ -> +"
                                     #=> "- -> -"
end

Parameters:

  • lit (String)

    the list of literals




# File 'lib/rly/lex.rb', line 320

def literals(lit)
  @literals = lit
  nil
end

.metatokens(*args) ⇒ Object



# File 'lib/rly/lex.rb', line 218

def metatokens(*args)
  @metatokens_list = args
end

.metatokens_list ⇒ Object



# File 'lib/rly/lex.rb', line 214

def metatokens_list
  @metatokens_list ||= []
end

.on_error(&block) ⇒ Object

Specifies a block that should be called on error

In case of a lexing error, the lexer first gives the developer a chance to look at the failing character. If this block is not provided, a lexing error always results in a Rly::LexError.

You must increment the lexer’s #pos as part of the action. You may also return a new Rly::LexToken, or nil to skip the input.

Examples:

class MyLexer < Rly::Lex
  token :INT, /\d+/
  on_error do |tok|
    tok.lexer.pos += 1 # just skip the offending character
  end
end

lex = MyLexer.new("123qwe")
lex.each do |tok|
  puts "#{tok.type} -> #{tok.value}" #=> "INT -> 123"
end




# File 'lib/rly/lex.rb', line 375

def on_error(&block)
  @error_block = block
  nil
end

.terminals ⇒ Object



# File 'lib/rly/lex.rb', line 187

def terminals
  self.tokens.map { |t,r,b| t }.compact + self.literals_list.chars.to_a + self.metatokens_list
end

.token(*args) {|tok| ... } ⇒ Object

Adds a token definition to a class

This method adds a token definition to be used later to tokenize input. It can be used to register normal tokens, and also functional tokens (the latter are processed as usual but are not returned).

Examples:

class MyLexer < Rly::Lex
  token :LOWERS, /[a-z]+/   # this would match 1+ lowercase letters as LOWERS

  token :INT, /\d+/ do |t|  # this would match on integers
    t.value = t.value.to_i  # additionally the value is converted to Fixnum
    t                       # the updated token is returned
  end

  token /\n/ do |t|        # this would match on newlines
    t.lexer.lineno += 1    # the block will be executed on match, but
  end                      # no token will be returned (as name is not specified)

end

Parameters:

  • type (Symbol)

    token type. It should be an all-caps symbol by convention

  • regex (Regexp)

    a regular expression to match the token

Yield Parameters:

  • tok (LexToken)

    a new token instance for processed input

Yield Returns:

  • (LexToken)

    the same or modified token instance. Return nil to ignore the input




# File 'lib/rly/lex.rb', line 290

def token(*args, &block)
  if args.length == 2
    self.tokens << [args[0], args[1], block]
  elsif args.length == 1
    self.tokens << [nil, args[0], block]
  else
    raise ArgumentError
  end
  nil
end

.token_regexps ⇒ Object



# File 'lib/rly/lex.rb', line 195

def token_regexps
  return @token_regexps if @token_regexps

  collector = []
  self.tokens.each do |name, rx, block|
    name = "__anonymous_#{block.hash}".to_sym unless name

    self.callables[name] = block
    
    rxs = rx.to_s
    named_rxs = "\\A(?<#{name}>#{rxs})"

    collector << named_rxs
  end

  rxss = collector.join('|')
  @token_regexps = Regexp.new(rxss)
end
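The combination strategy above can be sketched in plain Ruby, without rly: each rule becomes an anchored named capture group, and the groups are joined with `|` into a single regexp. The rule names and input below are illustrative, not part of the library:

```ruby
# Hypothetical token rules: [name, regexp] pairs, as stored in Lex.tokens.
rules = [[:NUMBER, /\d+/], [:WORD, /[a-z]+/]]

# Wrap each rule in an anchored named group and join them with '|'.
combined = Regexp.new(rules.map { |name, rx| "\\A(?<#{name}>#{rx})" }.join('|'))

# The first named group that captured a value identifies the winning rule.
m = combined.match("hello 42")
type = m.names.find { |n| m[n] }
puts "#{type} -> #{m[type]}"  # => WORD -> hello
```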

Instance Method Details

#build_token(type, value) ⇒ Object



# File 'lib/rly/lex.rb', line 178

def build_token(type, value)
  LexToken.new(type, value, self, @pos, @lineno)
end

#ignore_symbol ⇒ Object



# File 'lib/rly/lex.rb', line 182

def ignore_symbol
  @pos += 1
end

#input(input) ⇒ Object

Appends string to input buffer

The given string is appended to input buffer, further #next calls will tokenize it as usual.

Examples:

lex = MyLexer.new("hello")

t = lex.next
puts "#{t.type} -> #{t.value}" #=> "LOWERS -> hello"
t = lex.next # => nil
lex.input("WORLD")
t = lex.next
puts "#{t.type} -> #{t.value}" #=> "UPPERS -> WORLD"
t = lex.next # => nil


# File 'lib/rly/lex.rb', line 90

def input(input)
  @input << input
  nil
end

#inspect ⇒ Object



# File 'lib/rly/lex.rb', line 69

def inspect
  "#<#{self.class} pos=#{@pos} len=#{@input.length} lineno=#{@lineno}>"
end

#next ⇒ LexToken?

Processes the next token in input

This is the main interface to the lexer. It returns the next available token, or nil if there are no more tokens available in the input string.

Raises Rly::LexError if the input cannot be processed. This happens when no ‘token’ rules and no ‘literals’ rule match. If the on_error handler is not set, the exception is raised immediately; if the handler is set, the exception is raised only if #pos is still unchanged after the handler returns.

Examples:

lex = MyLexer.new("hello WORLD")

t = lex.next
puts "#{t.type} -> #{t.value}" #=> "LOWERS -> hello"
t = lex.next
puts "#{t.type} -> #{t.value}" #=> "UPPERS -> WORLD"
t = lex.next # => nil

Returns:

  • (LexToken)

    if the next chunk of input was processed successfully

  • (nil)

    if there are no more tokens available in input

Raises:

  • (LexError)

    if the input cannot be processed



# File 'lib/rly/lex.rb', line 119

def next
  while @pos < @input.length
    if self.class.ignores_list[@input[@pos]]
      ignore_symbol
      next
    end

    m = self.class.token_regexps.match(@input[@pos..-1])

    if m && ! m[0].empty?
      val = nil
      type = nil
      resolved_type = nil
      m.names.each do |n|
        if m[n]
          type = n.to_sym
          resolved_type = (n.start_with?('__anonymous_') ? nil : type)
          val = m[n]
          break
        end
      end

      if type
        tok = build_token(resolved_type, val)
        @pos += m.end(0)
        tok = self.class.callables[type].call(tok) if self.class.callables[type]

        if tok && tok.type
          return tok
        else
          next
        end
      end
    end
    
    if self.class.literals_list[@input[@pos]]
      tok = build_token(@input[@pos], @input[@pos])
      matched = true
      @pos += 1
      return tok
    end

    if self.class.error_hander
      pos = @pos
      tok = build_token(:error, @input[@pos])
      tok = self.class.error_hander.call(tok)
      if pos == @pos
        raise LexError.new("Illegal character '#{@input[@pos]}' at index #{@pos}")
      else
        return tok if tok && tok.type
      end
    else
      raise LexError.new("Illegal character '#{@input[@pos]}' at index #{@pos}")
    end

  end
  return nil
end
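Since token_regexps joins the rules with `|` and Ruby tries alternatives left to right, an earlier-defined rule wins whenever several rules match at the same position, regardless of match length. A plain-Ruby sketch with illustrative rule names (not rly itself):

```ruby
# The same two overlapping rules, combined in opposite orders.
word_first = /\A(?<WORD>[a-z]+)|\A(?<CHAR>[a-z])/
char_first = /\A(?<CHAR>[a-z])|\A(?<WORD>[a-z]+)/

puts word_first.match("abc")[:WORD]  # => abc
puts char_first.match("abc")[:CHAR]  # => a   (CHAR shadows WORD)
```

In practice this means you should define longer or more specific rules before shorter ones, so the shorter rule does not shadow them.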