Class: HumanQL::QueryParser

Inherits: Object

Defined in: lib/human-ql/query_parser.rb

Overview

Human-friendly, lenient query parser. Parses an arbitrary input string and outputs an abstract syntax tree (AST), which uses Ruby arrays as S-expressions.

Supported Syntax Summary

The table below reflects the default settings. Input string variations on the left are separated by ‘,’ and the output AST is shown on the right.

a                        --> 'a'
"a b c"                  --> [ :phrase, 'a', 'b', 'c' ]
a b c                    --> [ :and, 'a', 'b', 'c' ]
a OR b, a|b              --> [ :or, 'a', 'b' ]
a AND b, a&b             --> [ :and, 'a', 'b' ]
a b|c                    --> [ :and, 'a', [:or, 'b', 'c'] ]
(a b) OR (c d)           --> [ :or, [:and, 'a', 'b'], [:and, 'c', 'd'] ]
NOT expr, -expr          --> [ :not, expr ]
SCOPE:expr, SCOPE : expr --> [ 'SCOPE', expr ]

Where:

‘expr’ may be a simple term, phrase, or parenthetical expression.
SCOPEs must be specified. By default, no scopes are supported.

The AST output from #parse may contain various no-ops and redundancies. Run it through a TreeNormalizer to avoid seeing or needing to handle these cases.
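
Because the AST is made of plain nested arrays, it can be consumed with ordinary recursion. The following is a minimal sketch of a walker that renders an AST back into a readable string. `render` is a hypothetical helper, not part of HumanQL, and it assumes that any node whose first element is a String is a scope.

```ruby
# Hypothetical helper (not part of HumanQL): render an AST as a string.
def render(node)
  # Leaves are plain term strings.
  return node unless node.is_a?(Array)

  op, *args = node
  case op
  when :and, :or
    '(' + args.map { |a| render(a) }.join(" #{op.to_s.upcase} ") + ')'
  when :not
    "NOT #{render(args.first)}"
  when :phrase
    '"' + args.join(' ') + '"'
  else
    # Assumption: any other head is a scope name such as 'SCOPE'.
    "#{op}:#{render(args.first)}"
  end
end

puts render([:and, 'a', [:or, 'b', 'c']])
puts render(['SCOPE', [:not, [:phrase, 'a', 'b']]])
```

The same `case` on the first array element is a reasonable starting point for translating the AST into another query language.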

Customization

The lexing and token matching patterns, as well as other attributes used in the parser, may be adjusted via constructor options or attribute writer methods. Many of these attributes may be either String constants or Regexp patterns, supporting multiple values as needed. Some features may be disabled by setting these values to nil (i.e. so that they match no tokens). While accessors are defined, internally the instance variables are accessed directly for speed. Tests show this is as fast as using constants (which would be harder to modify) and faster than reader method calls.
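
The reason a String, a Regexp, or nil can all serve as a token attribute is Ruby's case equality (`===`), which the parser relies on in its `case`/`when` dispatch. The sketch below illustrates that mechanism in isolation. `token_kind` and its keyword arguments are hypothetical names chosen for this example, not the parser's API.

```ruby
# Hypothetical helper: classify a token the way a case/when dispatch would.
# String === token  -> exact equality
# Regexp === token  -> pattern match
# nil === token     -> always false, so the feature is effectively disabled
def token_kind(token, lparen: '(', scope: nil, or_token: /\A(OR|\|)\z/i)
  case token
  when lparen   then :lparen
  when scope    then :scope
  when or_token then :or
  else               :term
  end
end

p token_kind('(')                    # String attribute
p token_kind('or')                   # Regexp attribute, case-insensitive
p token_kind('FOO:')                 # scope is nil, so this is a plain term
p token_kind('FOO:', scope: 'FOO:')  # scope enabled as a String value
```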

Implementation Notes

The parser implementation adapts the infix precedence handling and operator stack of the Shunting Yard Algorithm originally described by Edsger Dijkstra. Attributes #default_op and #precedence control the handling of explicit or implied infix operators.
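
To make the operator-stack idea concrete, here is a heavily simplified sketch of precedence handling with an implied default operator. It is not the library's ParseState: `to_ast` and `apply_top_op` are hypothetical names, the input is assumed to be well-formed and already tokenized, parentheses, phrases, and scopes are omitted, and it builds strictly binary nodes rather than the n-ary nodes shown in the syntax table.

```ruby
# Precedence values copied from DEFAULT_PRECEDENCE.
PRECEDENCE = { not: 11, or: 2, and: 1 }.freeze

# Pop the top operator and combine it with its operand node(s).
def apply_top_op(ops, nodes)
  op = ops.pop
  if op == :not
    nodes.push([:not, nodes.pop])
  else
    right = nodes.pop
    left = nodes.pop
    nodes.push([op, left, right])
  end
end

# Hypothetical, simplified parser: tokens in, binary S-expression out.
def to_ast(tokens, default_op: :and)
  ops = []
  nodes = []
  last_was_term = false

  # Before pushing an infix operator, reduce any stacked operators that
  # bind at least as tightly (left associative).
  push_infix = lambda do |op|
    apply_top_op(ops, nodes) while ops.any? && PRECEDENCE[ops.last] >= PRECEDENCE[op]
    ops.push(op)
  end

  tokens.each do |t|
    case t
    when 'AND', '&'
      push_infix.call(:and)
      last_was_term = false
    when 'OR', '|'
      push_infix.call(:or)
      last_was_term = false
    when 'NOT', '-'
      push_infix.call(default_op) if last_was_term # implied operator
      ops.push(:not)
      last_was_term = false
    else
      push_infix.call(default_op) if last_was_term # implied operator
      nodes.push(t)
      last_was_term = true
    end
  end

  apply_top_op(ops, nodes) while ops.any?
  nodes.first
end

p to_ast(%w[a b | c])
p to_ast(%w[- a b])
```

Because :and has the lowest precedence, the implied operator between `a` and `b|c` ends up at the root of the tree, which matches the `a b|c` row of the syntax table.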

Direct Known Subclasses

PostgreSQLCustomParser

Defined Under Namespace

Classes: ParseState

Constant Summary

SP: String pattern for Unicode spaces.

"[[:space:]]".freeze

NSP: String pattern for Unicode non-spaces.

"[^#{SP}]".freeze

SPACES: Regex for 1-to-many Unicode spaces.

/#{SP}+/.freeze

DEFAULT_PRECEDENCE: Default precedence of supported operators.

{
  not: 11,
  or:  2,
  and: 1
}.freeze

Instance Attribute Summary

#and_token ⇒ Object

AND operator token pattern.
#default_op ⇒ Object

The default operator when none is otherwise given between parsed terms.
#infix_token ⇒ Object

Pattern used for lexing to treat certain punctuation characters as separate tokens, even if they are not space separated.
#lparen ⇒ Object

Left parenthesis pattern or value. Default: ‘(’.
#lquote ⇒ Object

Left quote pattern or value. Default: ‘"’.
#not_token ⇒ Object

NOT operator token pattern.
#or_token ⇒ Object

OR operator token pattern.
#precedence ⇒ Object

Hash of operators to precedence Integer value.
#prefix_token ⇒ Object

Pattern used for lexing to treat certain characters as separate tokens when appearing as a prefix only.
#rparen ⇒ Object

Right parenthesis pattern or value. Default: ‘)’.
#rquote ⇒ Object

Right quote pattern or value.
#scope ⇒ Object

Scope pattern or value matching post-normalized scope token, including trailing ‘:’ but without whitespace.
#scope_token ⇒ Object

SCOPE unary operator pattern used for lexing to treat a scope prefix, e.g. ‘SCOPE’ + ‘:’, with or without internal or trailing whitespace, as a single token.
#scope_upcase ⇒ Object

Should scope tokens be upcased in the AST? This would imply case-insensitive #scope and #scope_token, as generated via #scopes= with the ‘ignorecase: true’ option.
#spaces ⇒ Object

Pattern matching one or more characters to treat as whitespace. Default: SPACES.
#verbose ⇒ Object

If true, log parsing progress and state to $stderr.

Instance Method Summary

#initialize(opts = {}) ⇒ QueryParser constructor

Construct given options which are interpreted as attribute names to set.
#log(l = nil) ⇒ Object
#norm_infix(q) ⇒ Object

Treat various punctuation-form operators as always being separate tokens per #infix_token pattern.
#norm_phrase_tokens(tokens) ⇒ Object

Select which tokens survive in a phrase.
#norm_prefix(q) ⇒ Object

Split prefixes as separate tokens per #prefix_token pattern.
#norm_scope(q) ⇒ Object

If #scope_token is specified, normalize scopes as separate ‘SCOPE:’ tokens.
#norm_space(q) ⇒ Object

Normalize any whitespace to a single ASCII space character and strip leading/trailing whitespace.
#norm_term(t) ⇒ Object

No-op in this implementation but may be used to replace characters.
#normalize(q) ⇒ Object

Runs the suite of initial input norm_* functions.
#parse(q) ⇒ Object
#parse_tree(tokens) ⇒ Object
#rparen_index(tokens) ⇒ Object

Find token matching #rparen in remaining tokens.
#scope_op(token) ⇒ Object

Given scope token, return the name (minus trailing ‘:’), upcased if #scope_upcase.
#scopes=(scopes) ⇒ Object

Given one or an Array of scope prefixes, generate the #scope and #scope_token patterns.

Constructor Details

#initialize(opts = {}) ⇒ `QueryParser`

Construct given options which are interpreted as attribute names to set.

# File 'lib/human-ql/query_parser.rb', line 200

def initialize( opts = {} )
  @default_op = :and

  @precedence = Hash.new(10)
  @precedence.merge!( DEFAULT_PRECEDENCE )
  @precedence.freeze

  @spaces = SPACES
  @infix_token  = /[()|&"]/.freeze
  @prefix_token = /(?<=\A|#{SP})-(?=#{NSP})/.freeze
  @or_token  = /\A(OR|\|)\z/i.freeze
  @and_token = /\A(AND|\&)\z/i.freeze
  @not_token = /\A(NOT|\-)\z/i.freeze
  @lquote = @rquote = '"'.freeze
  @lparen = '('.freeze
  @rparen = ')'.freeze

  @scope = nil
  @scope_token = nil
  @scope_upcase = false

  @verbose = false

  opts.each do |name,val|
    send( name.to_s + '=', val )
  end
end
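
The final loop of the constructor is what lets any attribute be set by name: each option key is turned into a writer call. The pattern is sketched below on a stand-in class. `MiniParser` and its two attributes are hypothetical and only mirror the shape of the constructor above.

```ruby
# Hypothetical stand-in class showing the options-as-writers pattern.
class MiniParser
  attr_accessor :default_op, :verbose

  def initialize(opts = {})
    # Defaults first, so that options can override them.
    @default_op = :and
    @verbose = false

    # Each option key must name an existing writer method.
    opts.each do |name, val|
      send("#{name}=", val)
    end
  end
end

p MiniParser.new.default_op
p MiniParser.new(default_op: :or).default_op

begin
  MiniParser.new(bogus: 1)
rescue NoMethodError => e
  puts "rejected: #{e.name}"
end
```

A consequence of this design is that a misspelled option fails loudly with NoMethodError instead of being silently ignored.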

Instance Attribute Details

#and_token ⇒ `Object`

AND operator token pattern. Should match the entire token using the ‘\A’ and ‘\z’ anchors for beginning and end of string. Default: Pattern matching complete tokens ‘AND’, ‘and’, or ‘&’

# File 'lib/human-ql/query_parser.rb', line 121

def and_token
  @and_token
end

#default_op ⇒ `Object`

The default operator when none is otherwise given between parsed terms. Default: :and



# File 'lib/human-ql/query_parser.rb', line 88

def default_op
  @default_op
end

#infix_token ⇒ `Object`

Pattern used for lexing to treat certain punctuation characters as separate tokens, even if they are not space separated. Default: Pattern matching any of the characters ‘(’, ‘)’, ‘|’, ‘&’, ‘"’, as used for operator/parenthesis tokens in the defaults below.

# File 'lib/human-ql/query_parser.rb', line 106

def infix_token
  @infix_token
end

#lparen ⇒ `Object`

Left parenthesis pattern or value. Default: ‘(’

# File 'lib/human-ql/query_parser.rb', line 139

def lparen
  @lparen
end

#lquote ⇒ `Object`

Left quote pattern or value. Default: ‘"’

# File 'lib/human-ql/query_parser.rb', line 130

def lquote
  @lquote
end

#not_token ⇒ `Object`

NOT operator token pattern. Should match the entire token using the ‘\A’ and ‘\z’ anchors for beginning and end of string. Default: Pattern matching complete tokens ‘NOT’, ‘not’, or ‘-’

# File 'lib/human-ql/query_parser.rb', line 126

def not_token
  @not_token
end

#or_token ⇒ `Object`

OR operator token pattern. Should match the entire token using the ‘\A’ and ‘\z’ anchors for beginning and end of string. Default: Pattern matching complete tokens ‘OR’, ‘or’, or ‘|’

# File 'lib/human-ql/query_parser.rb', line 116

def or_token
  @or_token
end

#precedence ⇒ `Object`

Hash of operators to precedence Integer value. The hash should also provide a default value for unlisted operators like any supported scopes. To limit human surprise, the #default_op should have the lowest precedence. The default is as per DEFAULT_PRECEDENCE with a default value of 10, thus :not has the highest precedence at 11.



# File 'lib/human-ql/query_parser.rb', line 96

def precedence
  @precedence
end
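
The interaction between the listed operators and the hash default can be checked directly. This sketch rebuilds the default table exactly as the constructor does and also shows one possible custom table for use with `default_op: :or`. The custom values are an illustrative assumption, not a recommendation from the library.

```ruby
default_precedence = { not: 11, or: 2, and: 1 }.freeze

# Same construction as in the constructor: default value 10 for unlisted keys.
precedence = Hash.new(10)
precedence.merge!(default_precedence)
precedence.freeze

p precedence[:not]  # listed operator
p precedence[:and]  # listed operator, lowest
p precedence['FOO'] # unlisted (e.g. a scope), falls back to the default

# Assumed custom table: if :or is the default operator, give it the
# lowest precedence so that it binds most loosely.
or_first = Hash.new(10).merge!(not: 11, and: 2, or: 1).freeze
p or_first.min_by { |_op, prec| prec }.first
```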

#prefix_token ⇒ `Object`

Pattern used for lexing to treat certain characters as separate tokens when appearing as a prefix only. Default: ‘-’ (as used in the default #not_token)

# File 'lib/human-ql/query_parser.rb', line 111

def prefix_token
  @prefix_token
end

#rparen ⇒ `Object`

Right parenthesis pattern or value. Default: ‘)’

# File 'lib/human-ql/query_parser.rb', line 143

def rparen
  @rparen
end

#rquote ⇒ `Object`

Right quote pattern or value. It's fine if this is the same as #lquote. Default: ‘"’

# File 'lib/human-ql/query_parser.rb', line 135

def rquote
  @rquote
end

#scope ⇒ `Object`

Scope pattern or value matching post-normalized scope token, including trailing ‘:’ but without whitespace. Default: nil -> no scopes



# File 'lib/human-ql/query_parser.rb', line 178

def scope
  @scope
end

#scope_token ⇒ `Object`

SCOPE unary operator pattern used for lexing to treat a scope prefix, e.g. ‘SCOPE’ + ‘:’, with or without internal or trailing whitespace, as a single token. Used by #norm_scope, where it also treats a non-matching ‘:’ as whitespace. This would normally be set via #scopes=. Default: nil -> no scopes

# File 'lib/human-ql/query_parser.rb', line 186

def scope_token
  @scope_token
end

#scope_upcase ⇒ `Object`

Should scope tokens be upcased in the AST? This would imply case-insensitive #scope and #scope_token, as generated via #scopes= with the ‘ignorecase: true’ option. Default: false

# File 'lib/human-ql/query_parser.rb', line 192

def scope_upcase
  @scope_upcase
end

#spaces ⇒ `Object`

Pattern matching one or more characters to treat as whitespace. Default: SPACES

# File 'lib/human-ql/query_parser.rb', line 100

def spaces
  @spaces
end

#verbose ⇒ `Object`

If true, log parsing progress and state to $stderr. Default: false



# File 'lib/human-ql/query_parser.rb', line 196

def verbose
  @verbose
end

Instance Method Details

#log(l = nil) ⇒ `Object`

# File 'lib/human-ql/query_parser.rb', line 241

def log( l = nil )
  if @verbose
    l = yield if block_given?
    $stderr.puts( l )
  end
end
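
Accepting a block means the log message is only built when logging is enabled, which keeps calls such as `log { "AST: " + ast.inspect }` cheap in the normal case. The sketch below demonstrates that behavior with a stand-in class. `LazyLogger` is hypothetical, and it collects lines in an array instead of writing to $stderr so the effect is easy to observe.

```ruby
# Hypothetical stand-in demonstrating block-based lazy logging.
class LazyLogger
  attr_reader :lines

  def initialize(verbose)
    @verbose = verbose
    @lines = []
  end

  def log(l = nil)
    if @verbose
      l = yield if block_given? # the block runs only when verbose
      @lines << l               # the original writes to $stderr instead
    end
  end
end

built = 0
quiet = LazyLogger.new(false)
quiet.log { built += 1; 'expensive message' }

loud = LazyLogger.new(true)
loud.log { built += 1; 'expensive message' }

p built      # the block ran once, for the verbose logger only
p loud.lines
```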

#norm_infix(q) ⇒ `Object`

Treat various punctuation-form operators as always being separate tokens per #infix_token pattern. Note: #norm_space must always be called after this.

# File 'lib/human-ql/query_parser.rb', line 315

def norm_infix( q )
  q.gsub( @infix_token, ' \0 ' )
end
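
The effect of the `' \0 '` replacement is easiest to see on a few inputs. The sketch below reimplements the method as a standalone function using the default #infix_token pattern from the constructor. Passing the pattern as a parameter is an adaptation for this example.

```ruby
# Standalone copy of norm_infix using the default infix pattern.
def norm_infix(q, infix_token = /[()|&"]/)
  # '\0' is the whole match, so each operator gets a space on both sides.
  q.gsub(infix_token, ' \0 ')
end

p norm_infix('a|b')
p norm_infix('(a b)|c').split(' ')
```

The padding leaves redundant spaces behind (for example at the start of the string), which is why whitespace normalization has to follow.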

#norm_phrase_tokens(tokens) ⇒ `Object`

Select which tokens survive in a phrase. Also passes each token through #norm_term. Tokens matching #lparen and #rparen are dropped.

# File 'lib/human-ql/query_parser.rb', line 366

def norm_phrase_tokens( tokens )
  tokens.
    reject { |t| @lparen === t || @rparen === t }.
    map { |t| norm_term( t ) }
end

#norm_prefix(q) ⇒ `Object`

Split prefixes as separate tokens per #prefix_token pattern.

# File 'lib/human-ql/query_parser.rb', line 320

def norm_prefix( q )
  if @prefix_token
    q.gsub( @prefix_token, '\0 ' )
  else
    q
  end
end
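
The default #prefix_token only matches a ‘-’ that starts a word: the lookbehind requires the start of input or a space before it, and the lookahead requires a non-space after it. The sketch below applies the same pattern in a standalone function. The `sp`, `nsp`, and `prefix_token` locals are copies of the documented defaults.

```ruby
# Standalone copy of norm_prefix with the default pattern rebuilt locally.
def norm_prefix(q)
  sp  = '[[:space:]]'
  nsp = "[^#{sp}]"
  prefix_token = /(?<=\A|#{sp})-(?=#{nsp})/
  # Append a space after each matching prefix so it becomes its own token.
  q.gsub(prefix_token, '\0 ')
end

p norm_prefix('-a b-c -d')
p norm_prefix('a - b')
```

An interior hyphen, as in `b-c`, is left alone, so hyphenated terms survive as single tokens.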

#norm_scope(q) ⇒ `Object`

If #scope_token is specified, normalize scopes as separate ‘SCOPE:’ tokens. This expects the 2nd capture group of #scope_token to be the actual matching scope name, if present.

# File 'lib/human-ql/query_parser.rb', line 332

def norm_scope( q )
  if @scope_token
    q.gsub( @scope_token ) do
      if $2
        $2 + ': '
      else
        ' '
      end
    end
  else
    q
  end
end
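
The two branches of the block correspond to a recognized scope (capture group 2 is set) and a stray ‘:’ (it is not). The sketch below exercises both with a pattern of the same shape that #scopes= generates for a single scope. The scope name FOO is an arbitrary example, and passing the pattern as an argument is an adaptation for this example.

```ruby
# Standalone copy of norm_scope; the pattern is passed in explicitly.
def norm_scope(q, scope_token)
  return q unless scope_token

  q.gsub(scope_token) do
    name = Regexp.last_match(2)
    name ? name + ': ' : ' ' # known scope, or a stray ':' treated as space
  end
end

sp = '[[:space:]]'
# Same shape as the single-scope pattern built by #scopes=, for scope 'FOO'.
foo_token = /((?<=\A|#{sp})(FOO))?#{sp}*:/

p norm_scope('FOO:bar', foo_token)
p norm_scope('FOO : bar', foo_token).split(' ')
p norm_scope('a:b', foo_token)
p norm_scope('a:b', nil)
```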

#norm_space(q) ⇒ `Object`

Normalize any whitespace to a single ASCII space character and strip leading/trailing whitespace.

# File 'lib/human-ql/query_parser.rb', line 348

def norm_space( q )
  q.gsub(@spaces, ' ').strip
end
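
Since the default #spaces pattern is built from the POSIX bracket `[[:space:]]`, it also covers non-ASCII whitespace such as the no-break space (U+00A0) and the ideographic space (U+3000). A small sketch of this follows, with the method copied as a standalone function and the default pattern inlined as a parameter.

```ruby
# Standalone copy of norm_space with the default SPACES pattern inlined.
def norm_space(q, spaces = /[[:space:]]+/)
  q.gsub(spaces, ' ').strip
end

# Mixed ASCII and Unicode whitespace collapses to single ASCII spaces.
p norm_space("  a\u00A0b\u3000\tc \n")
```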

#norm_term(t) ⇒ `Object`

No-op in this implementation but may be used to replace characters. Should not receive nor return null or empty values.



# File 'lib/human-ql/query_parser.rb', line 374

def norm_term( t )
  t
end

#normalize(q) ⇒ `Object`

Runs the suite of initial input norm_* functions. Returns nil if the result is empty.

# File 'lib/human-ql/query_parser.rb', line 354

def normalize( q )
  q ||= ''
  q = norm_infix( q )
  q = norm_scope( q )
  q = norm_prefix( q )
  q = norm_space( q )
  q unless q.empty?
end

#parse(q) ⇒ `Object`

# File 'lib/human-ql/query_parser.rb', line 228

def parse( q )
  unless @default_op == :and || @default_op == :or
    raise( "QueryParser#default_op is (#{@default_op.inspect}) " +
           "(should be :and or :or)" )
  end
  q = normalize( q )
  tokens = q ? q.split(' ') : []
  log { "Parse: " + tokens.join( ' ' ) }
  ast = parse_tree( tokens )
  log { "AST: " + ast.inspect }
  ast
end

#parse_tree(tokens) ⇒ `Object`

# File 'lib/human-ql/query_parser.rb', line 248

def parse_tree( tokens )
  s = ParseState.new( self )
  while ( t = tokens.shift )
    case t
    when @lquote
      rqi = tokens.index { |tt| @rquote === tt }
      if rqi
        s.push_term( [ :phrase, *norm_phrase_tokens(tokens[0...rqi]) ] )
        tokens = tokens[rqi+1..-1]
      end # else ignore
    when @lparen
      rpi = rparen_index( tokens )
      if rpi
        s.push_term( parse_tree( tokens[0...rpi] ) )
        tokens = tokens[rpi+1..-1]
      end # else ignore
    when @rquote
    #ignore
    when @rparen
    #ignore
    when @scope
      s.push_op( scope_op( t ) )
    when @or_token
      s.push_op( :or )
    when @and_token
      s.push_op( :and )
    when @not_token
      s.push_op( :not )
    else
      s.push_term( norm_term( t ) )
    end
  end
  s.flush_tree
end

#rparen_index(tokens) ⇒ `Object`

Find token matching #rparen in remaining tokens.

# File 'lib/human-ql/query_parser.rb', line 292

def rparen_index( tokens )
  li = 1
  phrase = false
  tokens.index do |tt|
    if phrase
      phrase = false if @rquote === tt
    else
      case tt
      when @rparen
        li -= 1
      when @lparen
        li += 1
      when @lquote
        phrase = true
      end
    end
    (li == 0)
  end
end
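
The method is called with the tokens that follow an already consumed left parenthesis, which is why the depth counter starts at 1. The sketch below restates it as a standalone function with the default delimiters as keyword arguments (an adaptation for this example) and runs it on nested, quoted, and unbalanced input.

```ruby
# Standalone restatement of rparen_index with default delimiters.
def rparen_index(tokens, lparen: '(', rparen: ')', lquote: '"', rquote: '"')
  depth = 1      # one left paren has already been consumed by the caller
  phrase = false # inside a quoted phrase, parentheses are not counted
  tokens.index do |tt|
    if phrase
      phrase = false if rquote === tt
    else
      case tt
      when rparen then depth -= 1
      when lparen then depth += 1
      when lquote then phrase = true
      end
    end
    depth == 0
  end
end

p rparen_index(%w[a ( b ) c ) d])   # nested group is skipped
p rparen_index(['"', ')', '"', ')']) # paren inside a phrase is ignored
p rparen_index(%w[a b])              # no closing paren
```

A nil result is what lets the caller ignore an unmatched left parenthesis rather than fail.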

#scope_op(token) ⇒ `Object`

Given scope token, return the name (minus trailing ‘:’), upcased if #scope_upcase.

# File 'lib/human-ql/query_parser.rb', line 285

def scope_op( token )
  t = token[0...-1]
  t.upcase! if @scope_upcase
  t
end

#scopes=(scopes) ⇒ `Object`

Given one or an Array of scope prefixes, generate the #scope and #scope_token patterns. A trailing hash is interpreted as options, see below.

Options

:ignorecase: If true, generate case insensitive regexes and upcase the scope in AST output (per #scope_upcase)

# File 'lib/human-ql/query_parser.rb', line 153

def scopes=( scopes )
  scopes = Array( scopes )
  opts = scopes.last.is_a?( Hash ) && scopes.pop || {}
  ignorecase = !!(opts[:ignorecase])
  if scopes.empty?
    @scope = nil
    @scope_token = nil
  elsif scopes.length == 1 && !ignorecase
    s = scopes.first
    @scope = ( s + ':' ).freeze
    @scope_token = /((?<=\A|#{SP})(#{s}))?#{SP}*:/.freeze
  else
    opts = ignorecase ? Regexp::IGNORECASE : nil
    s = Regexp.union( *scopes ).source
    @scope = Regexp.new( '\A(' + s + '):\z', opts ).freeze
    @scope_token = Regexp.new( "((?<=\\A|#{SP})(#{s}))?#{SP}*:",
                               opts ).freeze
  end
  @scope_upcase = ignorecase
  nil
end
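
For the multi-scope branch, it can help to see what the generated patterns accept. The sketch below rebuilds both patterns in a standalone function for two example scopes. `build_scope_patterns` and the scope names FOO and BAR are hypothetical, and the function passes 0 instead of nil for "no flags".

```ruby
# Hypothetical helper mirroring the multi-scope branch of #scopes=.
def build_scope_patterns(names, ignorecase: false)
  sp = '[[:space:]]'
  flags = ignorecase ? Regexp::IGNORECASE : 0
  alt = Regexp.union(*names).source # e.g. "FOO|BAR"
  scope = Regexp.new('\A(' + alt + '):\z', flags)
  scope_token = Regexp.new("((?<=\\A|#{sp})(#{alt}))?#{sp}*:", flags)
  [scope, scope_token]
end

scope, scope_token = build_scope_patterns(%w[FOO BAR], ignorecase: true)

p scope === 'foo:'                  # complete normalized token, any case
p scope === 'baz:'                  # unknown scope
p scope_token.match('bar : x')[2]   # capture group 2 holds the scope name
```

With the real writer, the trailing options hash would be supplied as the last array element, along the lines of `parser.scopes = ['FOO', 'BAR', ignorecase: true]`.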
