Class: StyleScript::Lexer
- Inherits: Object
- Defined in: lib/style_script/lexer.rb
Overview
The lexer reads a stream of StyleScript and divvies it up into tagged tokens. A minor bit of ambiguity in the grammar has been avoided by pushing some extra smarts into the Lexer.
Constant Summary collapse
- KEYWORDS =
The list of keywords passed verbatim to the parser.
["if", "else", "then", "unless", "until", "true", "false", "yes", "no", "on", "off", "and", "or", "is", "isnt", "not", "new", "return", "try", "catch", "finally", "throw", "break", "continue", "for", "in", "of", "by", "where", "while", "delete", "instanceof", "typeof", "switch", "when", "super", "extends"]
- IDENTIFIER =
Token matching regexes.
/\A([a-zA-Z$_](\w|\$)*)/
- NUMBER =
/\A(\b((0(x|X)[0-9a-fA-F]+)|([0-9]+(\.[0-9]+)?(e[+\-]?[0-9]+)?)))\b/i
- STRING =
/\A(""|''|"(.*?)([^\\]|\\\\)"|'(.*?)([^\\]|\\\\)')/m
- HEREDOC =
/\A("{6}|'{6}|"{3}\n?(.*?)\n?([ \t]*)"{3}|'{3}\n?(.*?)\n?([ \t]*)'{3})/m
- JS =
/\A(``|`(.*?)([^\\]|\\\\)`)/m
- OPERATOR =
/\A([+\*&|\/\-%=<>:!?]+)/
- WHITESPACE =
/\A([ \t]+)/
- COMMENT =
/\A(((\n?[ \t]*)?#.*$)+)/
- CODE =
/\A((-|=)>)/
- REGEX =
/\A(\/(.*?)([^\\]|\\\\)\/[imgy]{0,4})/
- MULTI_DENT =
/\A((\n([ \t]*))+)(\.)?/
- LAST_DENT =
/\n([ \t]*)/
- ASSIGNMENT =
/\A(:|=)\Z/
- JS_CLEANER =
Token cleaning regexes.
/(\A`|`\Z)/
- MULTILINER =
/\n/
- STRING_NEWLINES =
/\n[ \t]*/
- COMMENT_CLEANER =
/(^[ \t]*#|\n[ \t]*$)/
- NO_NEWLINE =
/\A([+\*&|\/\-%=<>:!.\\][<>=&|]*|and|or|is|isnt|not|delete|typeof|instanceof)\Z/
- HEREDOC_INDENT =
/^[ \t]+/
- NOT_REGEX =
Tokens which a regular expression will never immediately follow, but which a division operator might. See: www.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions
[ :IDENTIFIER, :NUMBER, :REGEX, :STRING, ')', '++', '--', ']', '}', :FALSE, :NULL, :TRUE ]
- CALLABLE =
Tokens which could legitimately be invoked or indexed.
[:IDENTIFIER, :SUPER, ')', ']', '}', :STRING]
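The token-matching regexes above can be exercised directly. A quick sanity check (constants copied verbatim from the summary above):

```ruby
# Constants copied verbatim from the constant summary.
IDENTIFIER = /\A([a-zA-Z$_](\w|\$)*)/
NUMBER     = /\A(\b((0(x|X)[0-9a-fA-F]+)|([0-9]+(\.[0-9]+)?(e[+\-]?[0-9]+)?)))\b/i
STRING     = /\A(""|''|"(.*?)([^\\]|\\\\)"|'(.*?)([^\\]|\\\\)')/m
ASSIGNMENT = /\A(:|=)\Z/

# Each regex is anchored with \A, so it only matches at the head of the
# remaining chunk -- exactly how the lexer consumes its input.
puts "foo_bar = 10"[IDENTIFIER, 1]  # foo_bar
puts "0xFF + 1"[NUMBER, 1]          # 0xFF
puts "3.14e-2"[NUMBER, 1]           # 3.14e-2
puts '"hello" rest'[STRING, 1]      # "hello"
puts !!"=".match(ASSIGNMENT)        # true  ('=' alone is an assignment)
puts !!"==".match(ASSIGNMENT)       # false ('==' is an ordinary operator)
```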
Instance Method Summary collapse
-
#close_indentation ⇒ Object
Close up all remaining open blocks.
-
#comment_token ⇒ Object
Matches and consumes comments.
-
#extract_next_token ⇒ Object
At every position, run through this list of attempted matches, short-circuiting if any of them succeed.
-
#heredoc_token ⇒ Object
Matches heredocs, adjusting indentation to the correct level.
-
#identifier_token ⇒ Object
Matches identifying literals: variables, keywords, method names, etc.
-
#indent_token ⇒ Object
Record tokens for indentation differing from the previous line.
-
#js_token ⇒ Object
Matches interpolated JavaScript.
-
#last_tag ⇒ Object
Peek at the previous token’s tag.
-
#last_value ⇒ Object
Peek at the previous token’s value.
-
#literal_token ⇒ Object
We treat all other single characters as a token.
-
#newline_token(newlines) ⇒ Object
Multiple newlines get merged together.
-
#number_token ⇒ Object
Matches numbers, including decimals, hex, and exponential notation.
-
#outdent_token(move_out) ⇒ Object
Record an outdent token or tokens, if we’re moving back inwards past multiple recorded indents.
-
#regex_token ⇒ Object
Matches regular expression literals.
-
#string_token ⇒ Object
Matches strings, including multi-line strings.
-
#suppress_newlines(newlines) ⇒ Object
Tokens that explicitly escape newlines are removed once their job is done.
-
#tag_parameters ⇒ Object
A source of ambiguity in our grammar was parameter lists in function definitions (as opposed to argument lists in function calls).
-
#token(tag, value) ⇒ Object
Add a token to the results, taking note of the line number.
-
#tokenize(code) ⇒ Object
Scan by attempting to match tokens one character at a time.
-
#whitespace_token ⇒ Object
Matches and consumes non-meaningful whitespace.
Instance Method Details
#close_indentation ⇒ Object
Close up all remaining open blocks. If the first token is an indent, axe it.
# File 'lib/style_script/lexer.rb', line 267

def close_indentation
  outdent_token(@indent)
end
#comment_token ⇒ Object
Matches and consumes comments.
# File 'lib/style_script/lexer.rb', line 155

def comment_token
  return false unless comment = @chunk[COMMENT, 1]
  @line += comment.scan(MULTILINER).length
  token(:COMMENT, comment.gsub(COMMENT_CLEANER, '').split(MULTILINER))
  token("\n", "\n")
  @i += comment.length
end
#extract_next_token ⇒ Object
At every position, run through this list of attempted matches, short-circuiting if any of them succeed.
# File 'lib/style_script/lexer.rb', line 75

def extract_next_token
  return if identifier_token
  return if number_token
  return if heredoc_token
  return if string_token
  return if js_token
  return if regex_token
  return if indent_token
  return if comment_token
  return if whitespace_token
  return literal_token
end
#heredoc_token ⇒ Object
Matches heredocs, adjusting indentation to the correct level.
# File 'lib/style_script/lexer.rb', line 127

def heredoc_token
  return false unless match = @chunk.match(HEREDOC)
  doc = match[2] || match[4]
  indent = doc.scan(HEREDOC_INDENT).min
  doc.gsub!(/^#{indent}/, "")
  doc.gsub!("\n", "\\n")
  doc.gsub!('"', '\\"')
  token(:STRING, "\"#{doc}\"")
  @line += match[1].count("\n")
  @i += match[1].length
end
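The heredoc handling can be seen in isolation. A small sketch of the capture-and-outdent steps, using the HEREDOC and HEREDOC_INDENT constants from the summary above:

```ruby
# Constants copied from the constant summary.
HEREDOC        = /\A("{6}|'{6}|"{3}\n?(.*?)\n?([ \t]*)"{3}|'{3}\n?(.*?)\n?([ \t]*)'{3})/m
HEREDOC_INDENT = /^[ \t]+/

chunk  = %("""\n  first\n  second\n""" trailing code)
match  = chunk.match(HEREDOC)
doc    = match[2] || match[4]           # body of a double- or single-quoted heredoc
indent = doc.scan(HEREDOC_INDENT).min   # shallowest indentation in the body
doc    = doc.gsub(/^#{indent}/, "")     # outdent every line by that amount
puts doc.inspect  # "first\nsecond"
```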
#identifier_token ⇒ Object
Matches identifying literals: variables, keywords, method names, etc.
# File 'lib/style_script/lexer.rb', line 91

def identifier_token
  return false unless identifier = @chunk[IDENTIFIER, 1]
  # Keywords are special identifiers tagged with their own name;
  # 'if' will result in an [:IF, "if"] token.
  tag = KEYWORDS.include?(identifier) ? identifier.upcase.to_sym : :IDENTIFIER
  tag = :LEADING_WHEN if tag == :WHEN && [:OUTDENT, :INDENT, "\n"].include?(last_tag)
  @tokens[-1][0] = :PROTOTYPE_ACCESS if tag == :IDENTIFIER && last_value == '::'
  if tag == :IDENTIFIER && last_value == '.' && !(@tokens[-2] && @tokens[-2][1] == '.')
    if @tokens[-2][0] == "?"
      @tokens[-1][0] = :SOAK_ACCESS
      @tokens.delete_at(-2)
    else
      @tokens[-1][0] = :PROPERTY_ACCESS
    end
  end
  token(tag, identifier)
  @i += identifier.length
end
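The keyword-tagging step is a one-liner worth seeing on its own. A sketch using a few entries from the full KEYWORDS list (the `tag_for` helper is hypothetical, extracted here for illustration):

```ruby
KEYWORDS   = ["if", "else", "while", "super"]  # a few entries from the full list
IDENTIFIER = /\A([a-zA-Z$_](\w|\$)*)/

# Hypothetical helper mirroring the tagging line in identifier_token:
# keywords become their own upper-cased symbol; everything else is :IDENTIFIER.
def tag_for(identifier)
  KEYWORDS.include?(identifier) ? identifier.upcase.to_sym : :IDENTIFIER
end

puts tag_for("if")     # IF         (so 'if' yields an [:IF, "if"] token)
puts tag_for("total")  # IDENTIFIER
```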
#indent_token ⇒ Object
Record tokens for indentation differing from the previous line.
# File 'lib/style_script/lexer.rb', line 164

def indent_token
  return false unless indent = @chunk[MULTI_DENT, 1]
  @line += indent.scan(MULTILINER).size
  @i += indent.size
  next_character = @chunk[MULTI_DENT, 4]
  no_newlines = next_character == '.' ||
    (last_value.to_s.match(NO_NEWLINE) && @tokens[-2][0] != '.' && !last_value.match(CODE))
  return suppress_newlines(indent) if no_newlines
  size = indent.scan(LAST_DENT).last.last.length
  return newline_token(indent) if size == @indent
  if size > @indent
    token(:INDENT, size - @indent)
    @indents << (size - @indent)
  else
    outdent_token(@indent - size)
  end
  @indent = size
end
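The size computation above relies on only the last line of a newline run mattering. A sketch with the MULTI_DENT and LAST_DENT constants from the summary:

```ruby
# Constants copied from the constant summary.
MULTI_DENT = /\A((\n([ \t]*))+)(\.)?/
LAST_DENT  = /\n([ \t]*)/

chunk  = "\n\n    body"        # a blank line, then a four-space indent
indent = chunk[MULTI_DENT, 1]  # the full run of newlines and indentation
size   = indent.scan(LAST_DENT).last.last.length
puts size  # 4 -- only the final line's indentation decides the new level
```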
#js_token ⇒ Object
Matches interpolated JavaScript.
# File 'lib/style_script/lexer.rb', line 140

def js_token
  return false unless script = @chunk[JS, 1]
  token(:JS, script.gsub(JS_CLEANER, ''))
  @i += script.length
end
#last_tag ⇒ Object
Peek at the previous token’s tag.
# File 'lib/style_script/lexer.rb', line 242

def last_tag
  @tokens.last && @tokens.last[0]
end
#last_value ⇒ Object
Peek at the previous token’s value.
# File 'lib/style_script/lexer.rb', line 237

def last_value
  @tokens.last && @tokens.last[1]
end
#literal_token ⇒ Object
We treat all other single characters as a token, e.g. ( ) , . ! Multi-character operators are also literal tokens, so that Racc can assign the proper order of operations.
# File 'lib/style_script/lexer.rb', line 216

def literal_token
  value = @chunk[OPERATOR, 1]
  tag_parameters if value && value.match(CODE)
  value ||= @chunk[0,1]
  tag = value.match(ASSIGNMENT) ? :ASSIGN : value
  if !@spaced.equal?(last_value) && CALLABLE.include?(last_tag)
    tag = :CALL_START  if value == '('
    tag = :INDEX_START if value == '['
  end
  token(tag, value)
  @i += value.length
end
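The operator-grab and single-character fallback in literal_token can be demonstrated with the OPERATOR and CODE constants from the summary:

```ruby
# Constants copied from the constant summary.
OPERATOR = /\A([+\*&|\/\-%=<>:!?]+)/
CODE     = /\A((-|=)>)/

value = "=>x"[OPERATOR, 1]        # greedy run of operator characters
puts value                        # =>
puts !!value.match(CODE)          # true -- '=>' introduces a function body,
                                  # so tag_parameters fires for it
puts "(abc"[OPERATOR, 1].inspect  # nil -- '(' is not an operator character, so
                                  # it falls through to the @chunk[0, 1] fallback
```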
#newline_token(newlines) ⇒ Object
Multiple newlines get merged together. Use a trailing \ to escape newlines.
# File 'lib/style_script/lexer.rb', line 202

def newline_token(newlines)
  token("\n", "\n") unless last_value == "\n"
  true
end
#number_token ⇒ Object
Matches numbers, including decimals, hex, and exponential notation.
# File 'lib/style_script/lexer.rb', line 111

def number_token
  return false unless number = @chunk[NUMBER, 1]
  token(:NUMBER, number)
  @i += number.length
end
#outdent_token(move_out) ⇒ Object
Record an outdent token or tokens, if we’re moving back inwards past multiple recorded indents.
# File 'lib/style_script/lexer.rb', line 184

def outdent_token(move_out)
  while move_out > 0 && !@indents.empty?
    last_indent = @indents.pop
    token(:OUTDENT, last_indent)
    move_out -= last_indent
  end
  token("\n", "\n")
end
#regex_token ⇒ Object
Matches regular expression literals.
# File 'lib/style_script/lexer.rb', line 147

def regex_token
  return false unless regex = @chunk[REGEX, 1]
  return false if NOT_REGEX.include?(last_tag)
  token(:REGEX, regex)
  @i += regex.length
end
#string_token ⇒ Object
Matches strings, including multi-line strings.
# File 'lib/style_script/lexer.rb', line 118

def string_token
  return false unless string = @chunk[STRING, 1]
  escaped = string.gsub(STRING_NEWLINES, " \\\n")
  token(:STRING, escaped)
  @line += string.count("\n")
  @i += string.length
end
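The newline-escaping step in string_token is subtle: the replacement string `" \\\n"` is a space, a literal backslash, and a newline, so multi-line strings survive with escaped line breaks. A sketch using the STRING and STRING_NEWLINES constants from the summary:

```ruby
# Constants copied from the constant summary.
STRING          = /\A(""|''|"(.*?)([^\\]|\\\\)"|'(.*?)([^\\]|\\\\)')/m
STRING_NEWLINES = /\n[ \t]*/

chunk   = %("line one\n  line two" rest)
string  = chunk[STRING, 1]               # the quoted literal, newline and all
escaped = string.gsub(STRING_NEWLINES, " \\\n")
puts escaped.inspect  # "\"line one \\\nline two\"" -- the newline survives,
                      # preceded by a backslash, with the indentation eaten
```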
#suppress_newlines(newlines) ⇒ Object
Tokens that explicitly escape newlines are removed once their job is done.
# File 'lib/style_script/lexer.rb', line 208

def suppress_newlines(newlines)
  @tokens.pop if last_value == "\\"
  true
end
#tag_parameters ⇒ Object
A source of ambiguity in our grammar was parameter lists in function definitions (as opposed to argument lists in function calls). Tag parameter identifiers in order to avoid this. Also, parameter lists can make use of splats.
# File 'lib/style_script/lexer.rb', line 250

def tag_parameters
  return if last_tag != ')'
  i = 0
  loop do
    i -= 1
    tok = @tokens[i]
    return if !tok
    case tok[0]
    when :IDENTIFIER then tok[0] = :PARAM
    when ')'         then tok[0] = :PARAM_END
    when '('         then return tok[0] = :PARAM_START
    end
  end
end
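The backwards walk in tag_parameters can be replayed on a hand-built token stream. A sketch for the parameter list of `(x, y) ->`, just before tag_parameters would run (a standalone `break` replaces the method's `return`):

```ruby
# A hand-built token stream, each token a [tag, value] pair as in the lexer.
tokens = [['(', '('], [:IDENTIFIER, 'x'], [',', ','], [:IDENTIFIER, 'y'], [')', ')']]

# Walk backwards from the trailing ')', retagging as we go.
i = 0
loop do
  i -= 1
  tok = tokens[i]
  break if !tok
  case tok[0]
  when :IDENTIFIER then tok[0] = :PARAM
  when ')'         then tok[0] = :PARAM_END
  when '('         then tok[0] = :PARAM_START; break
  end
end

p tokens.map(&:first)  # [:PARAM_START, :PARAM, ",", :PARAM, :PARAM_END]
```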
#token(tag, value) ⇒ Object
Add a token to the results, taking note of the line number.
# File 'lib/style_script/lexer.rb', line 232

def token(tag, value)
  @tokens << [tag, Value.new(value, @line)]
end
#tokenize(code) ⇒ Object
Scan by attempting to match tokens one character at a time. Slow and steady.
# File 'lib/style_script/lexer.rb', line 56

def tokenize(code)
  @code    = code.chomp  # Strip the trailing newline, if present.
  @i       = 0           # Current character position we're parsing.
  @line    = 1           # The current line.
  @indent  = 0           # The current indent level.
  @indents = []          # The stack of all indent levels we are currently within.
  @tokens  = []          # Collection of all parsed tokens in the form [:TOKEN_TYPE, value].
  @spaced  = nil         # The last value that had a space following it.
  while @i < @code.length
    @chunk = @code[@i..-1]
    extract_next_token
  end
  puts "original stream: #{@tokens.inspect}" if ENV['VERBOSE']
  close_indentation
  Rewriter.new.rewrite(@tokens)
end
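The advance-by-match loop is the heart of the lexer. A stripped-down, self-contained sketch of the same idea (a hypothetical `mini_tokenize`, handling only identifiers, numbers, whitespace, and the single-character fallback; the real method dispatches through extract_next_token and hands the stream to Rewriter):

```ruby
# Constants copied from the constant summary.
IDENTIFIER = /\A([a-zA-Z$_](\w|\$)*)/
NUMBER     = /\A(\b((0(x|X)[0-9a-fA-F]+)|([0-9]+(\.[0-9]+)?(e[+\-]?[0-9]+)?)))\b/i
WHITESPACE = /\A([ \t]+)/

def mini_tokenize(code)
  i, tokens = 0, []
  while i < code.length
    chunk = code[i..-1]  # re-slice the remaining input at every step
    if    (m = chunk[IDENTIFIER, 1]) then tokens << [:IDENTIFIER, m]
    elsif (m = chunk[NUMBER, 1])     then tokens << [:NUMBER, m]
    elsif (m = chunk[WHITESPACE, 1])      # consumed, but produces no token
    else  m = chunk[0, 1]; tokens << [m, m]  # literal single-character fallback
    end
    i += m.length  # advance past whatever matched
  end
  tokens
end

p mini_tokenize("x = 10")
# [[:IDENTIFIER, "x"], ["=", "="], [:NUMBER, "10"]]
```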
#whitespace_token ⇒ Object
Matches and consumes non-meaningful whitespace.
# File 'lib/style_script/lexer.rb', line 194

def whitespace_token
  return false unless whitespace = @chunk[WHITESPACE, 1]
  @spaced = last_value
  @i += whitespace.length
end