Class: HumanQL::QueryParser
- Inherits: Object
  - Object
  - HumanQL::QueryParser
- Defined in: lib/human-ql/query_parser.rb
Overview
Human friendly, lenient query parser. Parses an arbitrary input string and outputs an abstract syntax tree (AST), which uses Ruby arrays as S-expressions.
Supported Syntax Summary
As per defaults. In the table below, input string variations on the left are separated by ‘,’ and the output AST is shown on the right.
a --> 'a'
"a b c" --> [ :phrase, 'a', 'b', 'c' ]
a b c --> [ :and, 'a', 'b', 'c' ]
a OR b, a|b --> [ :or, 'a', 'b' ]
a AND b, a&b --> [ :and, 'a', 'b' ]
a b|c --> [ :and, 'a', [:or, 'b', 'c'] ]
(a b) OR (c d) --> [ :or, [:and, 'a', 'b'], [:and, 'c', 'd'] ]
NOT expr, -expr --> [ :not, expr ]
SCOPE:expr, SCOPE : expr --> [ 'SCOPE', expr ]
Where:
- ‘expr’ may be a simple term, phrase, or parenthetical expression.
- SCOPEs must be specified. By default, no scopes are supported.
The AST output from #parse may have various no-ops and redundancies. Run it through a TreeNormalizer to avoid seeing or needing to handle these cases.
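To make the S-expression shape in the table concrete, here is a minimal sketch of a consumer that walks such an AST. The helper `ast_match?` and the idea of matching against an Array of words are assumptions for illustration only and are not part of HumanQL. Scope nodes and empty phrases are deliberately not handled.

```ruby
# Hypothetical helper (not part of HumanQL): evaluate an AST of the shape
# shown in the table above against an Array of document words.
def ast_match?(ast, words)
  # A bare String is a simple term.
  return words.include?(ast) if ast.is_a?(String)

  op, *args = ast
  case op
  when :and    then args.all? { |a| ast_match?(a, words) }
  when :or     then args.any? { |a| ast_match?(a, words) }
  when :not    then !ast_match?(args.first, words)
  when :phrase then words.each_cons(args.length).any? { |run| run == args }
  else raise ArgumentError, "unsupported node: #{op.inspect}"
  end
end

# AST as listed in the table for the input: a b|c
AST = [ :and, 'a', [ :or, 'b', 'c' ] ]

puts ast_match?(AST, %w[a c])   # prints "true"
puts ast_match?(AST, %w[a d])   # prints "false"
```

Because every node is either a String or an Array whose first element names the operator, a single `case` on that first element is enough to dispatch.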
Customization
The lexing and token matching patterns, as well as other attributes used in the parser, may be adjusted via constructor options or attribute writer methods. Many of these attributes may be either String constants or Regex patterns supporting multiple values as needed. Some features may be disabled by setting these values to nil (i.e. match no tokens). While accessors are defined, internally the instance variables are accessed directly for speed. Tests show this is as fast as using constants (which would be harder to modify) and faster than reader method calls.
Implementation Notes
The parser implementation adapts the infix precedence handling and operator stack of the Shunting Yard Algorithm originally described by Edsger Dijkstra. Attributes #default_op and #precedence control the handling of explicit or implied infix operators.
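The operator-stack idea can be sketched in a few lines. The following is a simplified illustration, not the library's ParseState: it handles only binary AND/OR with an implied default operator, assumes well-formed input, and builds nested binary nodes rather than the flattened n-ary nodes shown in the syntax table. The names `mini_parse`, `push_op`, `reduce_top`, `OPERATORS`, `PRECEDENCE` and `DEFAULT_OP` are hypothetical.

```ruby
# Precedence values taken from DEFAULT_PRECEDENCE (:not omitted in this sketch).
PRECEDENCE = { or: 2, and: 1 }.freeze
DEFAULT_OP = :and
OPERATORS  = { 'OR' => :or, '|' => :or, 'AND' => :and, '&' => :and }.freeze

# Pop one operator and its two operands, then push the combined node.
def reduce_top(ops, terms)
  op    = ops.pop
  right = terms.pop
  left  = terms.pop
  terms.push([op, left, right])
end

# Before stacking a new operator, reduce any stacked operator that binds
# at least as tightly (the core Shunting Yard step).
def push_op(op, ops, terms)
  reduce_top(ops, terms) while ops.any? && PRECEDENCE[ops.last] >= PRECEDENCE[op]
  ops.push(op)
end

def mini_parse(tokens)
  ops = []
  terms = []
  last_was_term = false
  tokens.each do |t|
    if (op = OPERATORS[t])
      push_op(op, ops, terms)
      last_was_term = false
    else
      # Two adjacent terms: insert the implied default operator.
      push_op(DEFAULT_OP, ops, terms) if last_was_term
      terms.push(t)
      last_was_term = true
    end
  end
  reduce_top(ops, terms) while ops.any?
  terms.first
end

p mini_parse(%w[a b | c])   # prints [:and, "a", [:or, "b", "c"]]
```

With :or ranked above :and, `a b|c` groups as `a AND (b OR c)`, which agrees with the corresponding row of the syntax table.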
Direct Known Subclasses
Defined Under Namespace
Classes: ParseState
Constant Summary
- SP = "[[:space:]]".freeze
  String pattern for Unicode spaces
- NSP = "[^#{SP}]".freeze
  String pattern for Unicode non-spaces
- SPACES = /#{SP}+/.freeze
  Regex for 1-to-many Unicode spaces
- DEFAULT_PRECEDENCE = { not: 11, or: 2, and: 1 }.freeze
  Default precedence of supported operators.
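As a quick illustration of why the space patterns are built on the POSIX bracket class rather than `\s`, the sketch below copies the three pattern constants from the listing above and applies them to a string containing a no-break space (U+00A0) and an ideographic space (U+3000). The sample string is an assumption chosen for the demonstration.

```ruby
# Constants copied from the summary above.
SP     = "[[:space:]]".freeze
NSP    = "[^#{SP}]".freeze
SPACES = /#{SP}+/.freeze

sample = "a\u00A0b\u3000 c"   # NBSP, then ideographic space plus ASCII space

p sample.split(SPACES)        # prints ["a", "b", "c"]
p "\u00A0x"[/#{NSP}+/]        # prints "x"
p("\u00A0".match?(/\s/))      # prints false; \s only covers ASCII whitespace
```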
Instance Attribute Summary
- #and_token ⇒ Object
  AND operator token pattern.
- #default_op ⇒ Object
  The default operator when none is otherwise given between parsed terms.
- #infix_token ⇒ Object
  Pattern used for lexing to treat certain punctuation characters as separate tokens, even if they are not space separated.
- #lparen ⇒ Object
  Left parenthesis pattern or value. Default: ‘(’.
- #lquote ⇒ Object
  Left quote pattern or value. Default: ‘"’.
- #not_token ⇒ Object
  NOT operator token pattern.
- #or_token ⇒ Object
  OR operator token pattern.
- #precedence ⇒ Object
  Hash of operators to precedence Integer value.
- #prefix_token ⇒ Object
  Pattern used for lexing to treat certain characters as separate tokens when appearing as a prefix only.
- #rparen ⇒ Object
  Right parenthesis pattern or value. Default: ‘)’.
- #rquote ⇒ Object
  Right quote pattern or value.
- #scope ⇒ Object
  Scope pattern or value matching post-normalized scope token, including trailing ‘:’ but without whitespace.
- #scope_token ⇒ Object
  SCOPE unary operator pattern used for lexing to treat a scope prefix (e.g. ‘SCOPE’ + ‘:’) as a single token.
- #scope_upcase ⇒ Object
  Should scope tokens be upcased in the AST? This would imply case-insensitive #scope and #scope_token, as generated via #scopes= with the ‘ignorecase: true’ option.
- #spaces ⇒ Object
  Pattern matching one or more characters to treat as whitespace. Default: SPACES.
- #verbose ⇒ Object
  If true, log parsing progress and state to $stderr.
Instance Method Summary
- #initialize(opts = {}) ⇒ QueryParser (constructor)
  Constructs the parser from the given options, which are interpreted as attribute names to set.
- #log(l = nil) ⇒ Object
- #norm_infix(q) ⇒ Object
  Treat various punctuation-form operators as always being separate tokens per the #infix_token pattern.
- #norm_phrase_tokens(tokens) ⇒ Object
  Select which tokens survive in a phrase.
- #norm_prefix(q) ⇒ Object
  Split prefixes as separate tokens per the #prefix_token pattern.
- #norm_scope(q) ⇒ Object
  If #scope_token is specified, normalize scopes as separate ‘SCOPE:’ tokens.
- #norm_space(q) ⇒ Object
  Normalize any whitespace to a single ASCII space character and strip leading/trailing whitespace.
- #norm_term(t) ⇒ Object
  No-op in this implementation but may be used to replace characters.
- #normalize(q) ⇒ Object
  Runs the suite of initial input norm_* functions.
- #parse(q) ⇒ Object
- #parse_tree(tokens) ⇒ Object
- #rparen_index(tokens) ⇒ Object
  Find token matching #rparen in remaining tokens.
- #scope_op(token) ⇒ Object
  Given scope token, return the name (minus trailing ‘:’), upcased if #scope_upcase.
- #scopes=(scopes) ⇒ Object
  Given one or an Array of scope prefixes, generate the #scope and #scope_token patterns.
Constructor Details
#initialize(opts = {}) ⇒ QueryParser
Constructs the parser from the given options, which are interpreted as attribute names to set.
# File 'lib/human-ql/query_parser.rb', line 200
def initialize( opts = {} )
  @default_op = :and

  @precedence = Hash.new(10)
  @precedence.merge!( DEFAULT_PRECEDENCE )
  @precedence.freeze

  @spaces = SPACES
  @infix_token  = /[()|&"]/.freeze
  @prefix_token = /(?<=\A|#{SP})-(?=#{NSP})/.freeze
  @or_token  = /\A(OR|\|)\z/i.freeze
  @and_token = /\A(AND|\&)\z/i.freeze
  @not_token = /\A(NOT|\-)\z/i.freeze
  @lquote = @rquote = '"'.freeze
  @lparen = '('.freeze
  @rparen = ')'.freeze

  @scope = nil
  @scope_token = nil
  @scope_upcase = false

  @verbose = false

  opts.each do |name,val|
    send( name.to_s + '=', val )
  end
end
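The option handling in the constructor listing dispatches each key to the attribute writer of the same name. The stand-in class below isolates that pattern so it can be run without the gem. `OptionsSketch` is a hypothetical name with only two of the attributes, and it is not the real parser.

```ruby
# Hypothetical stand-in class that mirrors only the option-dispatch loop.
class OptionsSketch
  attr_accessor :default_op, :verbose

  def initialize(opts = {})
    # Defaults first, then each option is sent to the matching writer.
    @default_op = :and
    @verbose = false
    opts.each { |name, val| send(name.to_s + '=', val) }
  end
end

p OptionsSketch.new(default_op: :or).default_op   # prints :or

begin
  OptionsSketch.new(bogus: 1)
rescue NoMethodError => e
  # An option without a matching writer fails loudly instead of being ignored.
  puts "rejected: #{e.name}"                       # prints "rejected: bogus="
end
```

One consequence of this design is that a writer with side effects, such as #scopes=, can also be driven from the options hash, since any name with a `name=` method is accepted.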
Instance Attribute Details
#and_token ⇒ Object
AND operator token pattern. Should match the entire token using the ‘\A’ and ‘\z’ anchors for beginning and end of string. Default: Pattern matching complete tokens ‘AND’, ‘and’, or ‘&’
# File 'lib/human-ql/query_parser.rb', line 121
def and_token
  @and_token
end
#default_op ⇒ Object
The default operator when none is otherwise given between parsed terms. Default: :and
# File 'lib/human-ql/query_parser.rb', line 88
def default_op
  @default_op
end
#infix_token ⇒ Object
Pattern used for lexing to treat certain punctuation characters as separate tokens, even if they are not space separated. Default: Pattern matching any of the characters ‘(’, ‘)’, ‘|’, ‘&’, ‘"’, as used for operator/parenthesis tokens in the defaults below.
# File 'lib/human-ql/query_parser.rb', line 106
def infix_token
  @infix_token
end
#lparen ⇒ Object
Left parenthesis pattern or value. Default: ‘(’
# File 'lib/human-ql/query_parser.rb', line 139
def lparen
  @lparen
end
#lquote ⇒ Object
Left quote pattern or value. Default: ‘"’
# File 'lib/human-ql/query_parser.rb', line 130
def lquote
  @lquote
end
#not_token ⇒ Object
NOT operator token pattern. Should match the entire token using the ‘\A’ and ‘\z’ anchors for beginning and end of string. Default: Pattern matching complete tokens ‘NOT’, ‘not’, or ‘-’
# File 'lib/human-ql/query_parser.rb', line 126
def not_token
  @not_token
end
#or_token ⇒ Object
OR operator token pattern. Should match the entire token using the ‘\A’ and ‘\z’ anchors for beginning and end of string. Default: Pattern matching complete tokens ‘OR’, ‘or’, or ‘|’
# File 'lib/human-ql/query_parser.rb', line 116
def or_token
  @or_token
end
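The reason the operator patterns should anchor with `\A` and `\z` is easiest to see by classifying a few tokens. This sketch copies the three default patterns from the constructor listing into constants. The `classify` helper is hypothetical and only mirrors the order of the `when` branches in #parse_tree.

```ruby
# Default patterns as set in the constructor.
OR_TOKEN  = /\A(OR|\|)\z/i.freeze
AND_TOKEN = /\A(AND|\&)\z/i.freeze
NOT_TOKEN = /\A(NOT|\-)\z/i.freeze

# Hypothetical helper: report how a single token would be treated.
def classify(token)
  case token
  when OR_TOKEN  then :or
  when AND_TOKEN then :and
  when NOT_TOKEN then :not
  else :term
  end
end

p classify('or')       # prints :or (the patterns are case-insensitive)
p classify('|')        # prints :or
p classify('ORANGE')   # prints :term (anchors prevent a partial match)
p classify('-')        # prints :not
```

A custom pattern without the anchors would treat any term containing the operator text, such as ‘ORANGE’, as an operator.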
#precedence ⇒ Object
Hash of operators to precedence Integer value. The hash should also provide a default value for unlisted operators like any supported scopes. To limit human surprise, the #default_op should have the lowest precedence. The default is as per DEFAULT_PRECEDENCE with a default value of 10, thus :not has the highest precedence at 11.
# File 'lib/human-ql/query_parser.rb', line 96
def precedence
  @precedence
end
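A replacement precedence Hash needs its own default value, since a plain Hash literal returns nil for unlisted operators such as scopes. The sketch below shows one way to build such a Hash, following the same steps as the constructor. The `build_precedence` helper and the ‘TITLE’ scope name are assumptions for illustration.

```ruby
DEFAULT_PRECEDENCE = { not: 11, or: 2, and: 1 }.freeze

# Hypothetical helper: defaults plus overrides, with 10 for unlisted operators.
def build_precedence(overrides = {})
  h = Hash.new(10)
  h.merge!(DEFAULT_PRECEDENCE)
  h.merge!(overrides)
  h.freeze
end

prec = build_precedence
p prec[:not]      # prints 11
p prec['TITLE']   # prints 10 (an unlisted scope falls back to the default)

# When switching the default operator to :or, the advice above suggests
# giving :or the lowest precedence.
swapped = build_precedence(and: 2, or: 1)
p swapped[:or]    # prints 1
```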
#prefix_token ⇒ Object
Pattern used for lexing to treat certain characters as separate tokens when appearing as a prefix only. Default: ‘-’ (as used in the default #not_token)
# File 'lib/human-ql/query_parser.rb', line 111
def prefix_token
  @prefix_token
end
#rparen ⇒ Object
Right parenthesis pattern or value. Default: ‘)’
# File 'lib/human-ql/query_parser.rb', line 143
def rparen
  @rparen
end
#rquote ⇒ Object
Right quote pattern or value. It's fine if this is the same as #lquote. Default: ‘"’
# File 'lib/human-ql/query_parser.rb', line 135
def rquote
  @rquote
end
#scope ⇒ Object
Scope pattern or value matching post-normalized scope token, including trailing ‘:’ but without whitespace. Default: nil -> no scopes
# File 'lib/human-ql/query_parser.rb', line 178
def scope
  @scope
end
#scope_token ⇒ Object
SCOPE unary operator pattern used for lexing to treat a scope prefix, e.g. ‘SCOPE’ + ‘:’, with or without internal or trailing whitespace, as a single token. Used by #norm_scope, which also treats a non-matching ‘:’ as whitespace. This would normally be set via #scopes=. Default: nil -> no scopes
# File 'lib/human-ql/query_parser.rb', line 186
def scope_token
  @scope_token
end
#scope_upcase ⇒ Object
Should scope tokens be upcased in the AST? This would imply case-insensitive #scope and #scope_token, as generated via #scopes= with the ‘ignorecase: true’ option. Default: false
# File 'lib/human-ql/query_parser.rb', line 192
def scope_upcase
  @scope_upcase
end
#spaces ⇒ Object
Pattern matching one or more characters to treat as white-space Default: SPACES
# File 'lib/human-ql/query_parser.rb', line 100
def spaces
  @spaces
end
#verbose ⇒ Object
If true, log parsing progress and state to $stderr. Default: false
# File 'lib/human-ql/query_parser.rb', line 196
def verbose
  @verbose
end
Instance Method Details
#log(l = nil) ⇒ Object
# File 'lib/human-ql/query_parser.rb', line 241
def log( l = nil )
  if @verbose
    l = yield if block_given?
    $stderr.puts( l )
  end
end
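The block form of #log means the message is only built when verbose logging is on, which keeps calls like `log { "AST: " + ast.inspect }` cheap otherwise. The following sketch copies the method body into a hypothetical `LogSketch` class so the behavior can be observed in isolation.

```ruby
# Hypothetical stand-in class carrying only the #log method and its flag.
class LogSketch
  def initialize(verbose)
    @verbose = verbose
  end

  def log(l = nil)
    if @verbose
      l = yield if block_given?
      $stderr.puts(l)
    end
  end
end

built = 0
LogSketch.new(false).log { built += 1; "expensive message" }
p built   # prints 0 (the block was never evaluated)

LogSketch.new(true).log { built += 1; "expensive message" }
p built   # prints 1 (the message also went to $stderr)
```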
#norm_infix(q) ⇒ Object
Treat various punctuation-form operators as always being separate tokens per the #infix_token pattern. Note: #norm_space must always be called after this.
# File 'lib/human-ql/query_parser.rb', line 315
def norm_infix( q )
  q.gsub( @infix_token, ' \0 ' )
end
#norm_phrase_tokens(tokens) ⇒ Object
Select which tokens survive in a phrase. Also passes each token through #norm_term. Tokens matching #lparen and #rparen are dropped.
# File 'lib/human-ql/query_parser.rb', line 366
def norm_phrase_tokens( tokens )
  tokens.
    reject { |t| @lparen === t || @rparen === t }.
    map { |t| norm_term( t ) }
end
#norm_prefix(q) ⇒ Object
Split prefixes as separate tokens per the #prefix_token pattern.
# File 'lib/human-ql/query_parser.rb', line 320
def norm_prefix( q )
  if @prefix_token
    q.gsub( @prefix_token, '\0 ' )
  else
    q
  end
end
#norm_scope(q) ⇒ Object
If #scope_token is specified, normalize scopes as separate ‘SCOPE:’ tokens. This expects the 2nd capture group of #scope_token to be the actual matching scope name, if present.
# File 'lib/human-ql/query_parser.rb', line 332
def norm_scope( q )
  if @scope_token
    q.gsub( @scope_token ) do
      if $2
        $2 + ': '
      else
        ' '
      end
    end
  else
    q
  end
end
#norm_space(q) ⇒ Object
Normalize any whitespace to a single ASCII space character and strip leading/trailing whitespace.
# File 'lib/human-ql/query_parser.rb', line 348
def norm_space( q )
  q.gsub(@spaces, ' ').strip
end
#norm_term(t) ⇒ Object
No-op in this implementation but may be used to replace characters. Should not receive nor return null or empty values.
# File 'lib/human-ql/query_parser.rb', line 374
def norm_term( t )
  t
end
#normalize(q) ⇒ Object
Runs the suite of initial input norm_* functions. Returns nil if the result is empty.
# File 'lib/human-ql/query_parser.rb', line 354
def normalize( q )
  q ||= ''
  q = norm_infix( q )
  q = norm_scope( q )
  q = norm_prefix( q )
  q = norm_space( q )
  q unless q.empty?
end
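To see what the normalization steps do to an input string before it is split on spaces, the sketch below chains the same substitutions using the default patterns from the constructor. It assumes no scopes are configured, so the #norm_scope step is a pass-through and is omitted. `sketch_normalize` is a hypothetical free-standing function, not the library method.

```ruby
SP           = "[[:space:]]".freeze
NSP          = "[^#{SP}]".freeze
SPACES       = /#{SP}+/.freeze
INFIX_TOKEN  = /[()|&"]/.freeze
PREFIX_TOKEN = /(?<=\A|#{SP})-(?=#{NSP})/.freeze

# Hypothetical helper: same steps and order as #normalize, defaults only.
def sketch_normalize(q)
  q = (q || '').gsub(INFIX_TOKEN, ' \0 ')   # norm_infix: pad punctuation operators
  q = q.gsub(PREFIX_TOKEN, '\0 ')           # norm_prefix: split a leading '-'
  q = q.gsub(SPACES, ' ').strip             # norm_space: collapse and trim
  q unless q.empty?
end

p sketch_normalize('a|b -c')   # prints "a | b - c"
p sketch_normalize('(a b)')    # prints "( a b )"
p sketch_normalize('a-b')      # prints "a-b" (an interior '-' is not a prefix)
p sketch_normalize('   ')      # prints nil
```

After this step every operator, parenthesis and quote is its own space-delimited token, which is why #parse can tokenize with a plain `split(' ')`.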
#parse(q) ⇒ Object
# File 'lib/human-ql/query_parser.rb', line 228
def parse( q )
  unless @default_op == :and || @default_op == :or
    raise( "QueryParser#default_op is (#{@default_op.inspect}) " +
           "(should be :and or :or)" )
  end
  q = normalize( q )
  tokens = q ? q.split(' ') : []
  log { "Parse: " + tokens.join( ' ' ) }
  ast = parse_tree( tokens )
  log { "AST: " + ast.inspect }
  ast
end
#parse_tree(tokens) ⇒ Object
# File 'lib/human-ql/query_parser.rb', line 248
def parse_tree( tokens )
  s = ParseState.new( self )
  while ( t = tokens.shift )
    case t
    when @lquote
      rqi = tokens.index { |tt| @rquote === tt }
      if rqi
        s.push_term( [ :phrase, *norm_phrase_tokens(tokens[0...rqi]) ] )
        tokens = tokens[rqi+1..-1]
      end # else ignore
    when @lparen
      rpi = rparen_index( tokens )
      if rpi
        s.push_term( parse_tree( tokens[0...rpi] ) )
        tokens = tokens[rpi+1..-1]
      end # else ignore
    when @rquote
      #ignore
    when @rparen
      #ignore
    when @scope
      s.push_op( scope_op( t ) )
    when @or_token
      s.push_op( :or )
    when @and_token
      s.push_op( :and )
    when @not_token
      s.push_op( :not )
    else
      s.push_term( norm_term( t ) )
    end
  end
  s.flush_tree
end
#rparen_index(tokens) ⇒ Object
Find token matching #rparen in remaining tokens.
# File 'lib/human-ql/query_parser.rb', line 292
def rparen_index( tokens )
  li = 1
  phrase = false
  tokens.index do |tt|
    if phrase
      phrase = false if @rquote === tt
    else
      case tt
      when @rparen
        li -= 1
      when @lparen
        li += 1
      when @lquote
        phrase = true
      end
    end
    (li == 0)
  end
end
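The search for the matching right parenthesis starts with a depth of 1, because the opening parenthesis has already been consumed by #parse_tree, and it ignores parentheses inside quoted phrases. Here is a free-standing sketch of the same loop with the default tokens passed as keyword arguments. The name `sketch_rparen_index` and the keyword parameters are assumptions for illustration.

```ruby
# Hypothetical stand-alone version of the loop, using the default token values.
def sketch_rparen_index(tokens, lparen: '(', rparen: ')', lquote: '"', rquote: '"')
  depth = 1          # the opening paren was already shifted off by the caller
  in_phrase = false
  tokens.index do |t|
    if in_phrase
      in_phrase = false if rquote === t
    else
      case t
      when rparen then depth -= 1
      when lparen then depth += 1
      when lquote then in_phrase = true
      end
    end
    depth == 0
  end
end

# Tokens remaining after the first '(' of: ( a ( b ) c ) d
p sketch_rparen_index(%w[a ( b ) c ) d])       # prints 5
# A ')' inside a quoted phrase does not close the group.
p sketch_rparen_index(['"', ')', '"', ')'])    # prints 3
# An unbalanced group yields nil, and #parse_tree then ignores the '('.
p sketch_rparen_index(%w[a b])                 # prints nil
```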
#scope_op(token) ⇒ Object
Given scope token, return the name (minus trailing ‘:’), upcased if #scope_upcase.
# File 'lib/human-ql/query_parser.rb', line 285
def scope_op( token )
  t = token[0...-1]
  t.upcase! if @scope_upcase
  t
end
#scopes=(scopes) ⇒ Object
Given one or an Array of scope prefixes, generate the #scope and #scope_token patterns. A trailing Hash is interpreted as options, see below.
Options
- :ignorecase
  If true, generate case-insensitive regexes and upcase the scope in AST output (per #scope_upcase)
# File 'lib/human-ql/query_parser.rb', line 153
def scopes=( scopes )
  scopes = Array( scopes )
  opts = scopes.last.is_a?( Hash ) && scopes.pop || {}
  ignorecase = !!(opts[:ignorecase])
  if scopes.empty?
    @scope = nil
    @scope_token = nil
  elsif scopes.length == 1 && !ignorecase
    s = scopes.first
    @scope = ( s + ':' ).freeze
    @scope_token = /((?<=\A|#{SP})(#{s}))?#{SP}*:/.freeze
  else
    opts = ignorecase ? Regexp::IGNORECASE : nil
    s = Regexp.union( *scopes ).source
    @scope = Regexp.new( '\A(' + s + '):\z', opts ).freeze
    @scope_token = Regexp.new( "((?<=\\A|#{SP})(#{s}))?#{SP}*:",
                               opts ).freeze
  end
  @scope_upcase = ignorecase
  nil
end
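The generated patterns are easier to follow when applied to sample input. The sketch below rebuilds the two regexes the same way as the multi-scope branch above and applies the #norm_scope substitution to them. The scope names ‘TITLE’ and ‘AUTHOR’, the helper names, and the use of `0` rather than `nil` for "no options" are assumptions for illustration.

```ruby
SP = "[[:space:]]".freeze

# Hypothetical helper mirroring the multi-scope branch of #scopes=.
def build_scope_patterns(names, ignorecase: false)
  opts = ignorecase ? Regexp::IGNORECASE : 0
  s = Regexp.union(*names).source
  scope       = Regexp.new('\A(' + s + '):\z', opts)
  scope_token = Regexp.new("((?<=\\A|#{SP})(#{s}))?#{SP}*:", opts)
  [scope, scope_token]
end

# Hypothetical helper mirroring #norm_scope: keep known scopes, blank other ':'.
def sketch_norm_scope(q, scope_token)
  q.gsub(scope_token) { $2 ? $2 + ': ' : ' ' }
end

SCOPE, SCOPE_TOKEN = build_scope_patterns(%w[TITLE AUTHOR], ignorecase: true)

p sketch_norm_scope('title : foo', SCOPE_TOKEN)   # prints "title:  foo"
p sketch_norm_scope('a:b', SCOPE_TOKEN)           # prints "a b"
p SCOPE === 'Title:'                              # prints true
```

The doubled space in the first result is harmless, because #norm_space runs later in #normalize and collapses it. The scope name keeps its input case at this stage. Per #scope_op, upcasing happens only when the AST node is built.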