Module: RPEG
Overview
This is intended to play the same role as LPEG’s lpeg module.
Top-level differences from LPEG:
-
AND patterns in LPEG are written as #patt (&patt in the first version) but +patt here
-
unary & apparently can’t be overloaded in Ruby
-
I tried using the “&” operator by overriding #to_proc but the Ruby parser rejects the &patt expression.
-
-
+patt doesn’t read well, even though -patt does. I think this is because binary plus is so much more common when buliding patterns than binary minus is.
-
There doesn’t seem to be another workable option. According to stackoverflow.com/a/21060235/1299011 the unary operators are !, ~, , and -. We use - already and ! needs to be avoided because of idiomiatic Ruby checks like !foo for existence. Operator ~ works, but that character is easy to mistake for a - when reading, which is bad. Ruby uses & as a unary (for to_proc) but the Ruby parser doesn’t allow its use in general. So I think we’re stuck with .
-
-
-
repeating patterns still use exponentiation, but it now looks like patt**n rather than patt^n because of Ruby’s syntax
-
grammars are represented by hashes or arrays. LPEG uses Lua tables (which are mashups of hashtables and arrays)
If an array is given then the nonterminals aren’t named and all open calls must use numeric indices. The first element of the array is either
-
a non-negative integer 0, 1, 2, … and specifies the (rule of the) initial nonterminal among the remaining elements with indices reckoned without that initial integer
-
something else, which is interpreted as the pattern for the initial nonterminal
Otherwise the grammar is defined with a Hash. The keys are the nonterminal symbols and the values are the rule patterns.
-
the keys must be symbols or strings (which are converted to symbols). No rule can use :initial or “initial” as nonterminal.
-
the open calls can refer either to the nonterminals (as strings or symbols) or to rule indices as they appear in the hash, ignoring the :initial key (if present)
-
:initial/“initial” can appear as a key in the hash and its value specifies the initial nonterminal.
-
if it is a non-zero integer it gives the index of the initial terminal’s rule, reckoned without the presence of the :initial key itself.
-
if it is a symbol or a string it specifies the initial nonterminal directly
-
-
-
“Table” captures return an instace of TableCapture, which impelements a little bit of a Lua’s table functionality
-
other formats haven’t worked out well
-
Function captures
Various kinds of captures involve calling a function (proc) provided by client code. For example, the construction (patt / fn) takes the captures made by patt and passes them as arguments to fn. Then the values returned by fn become the captures of the expression. Lua is better than Ruby at distinguishing between a function that returns multiple values and one that returns a single value that is an array. In RPEG, returns from function in contexts like this are treated as follows:
- [1, 2, 3]: multiple captures, 1, 2, 3.
- this is the natural interpretation as it's the standard way that a Ruby function returns multiple values
- [[1, 2, 3]]: a single capture that is an array
- nil: no captures
- even if the function says something like "return nil", the capture code has no way to distinguish between that and a
function that returns nothing
- [nil]: a single capture with value nil
- the weirdest case, but I don't see an alternative
TODO:
-
program generation optimations
-
other pattern-based optimizations: need to scan through the LPEG code again
-
I think I’ve done them all now
-
-
profiling
-
-
LPEG’s locale support
-
what would this look like in Ruby?
-
Defined Under Namespace
Modules: NonNumericOverloadExtension, RE Classes: Pattern
Instance Method Summary collapse
-
#B(patt) ⇒ Object
Returns a pattern that matches only if the input string at the current position is preceded by patt.
-
#C(pattern) ⇒ Object
LPEG: Creates a simple capture, which captures the substring of the subject that matches patt.
-
#Carg(num = nil) ⇒ Object
Capture the n-th extra argument provided to #match.
-
#Cb(name) ⇒ Object
From LPEG docs:.
-
#Cc(*values) ⇒ Object
LPEG: Creates a constant capture.
-
#Cf(pattern, lambda) ⇒ Object
From LPEG docs:.
-
#Cg(pattern, name = nil) ⇒ Object
From LPEG docs:.
-
#Cmt(patt, function) ⇒ Object
From LPEG:.
-
#Cp ⇒ Object
LPEG: Creates a position capture.
-
#Cs(patt) ⇒ Object
From LPEG:.
-
#Ct(patt) ⇒ Object
From LPEG:.
-
#match(thing, string, init = 0, *extra_args) ⇒ Object
See the instance method #match for the arguments.
-
#P(arg) ⇒ Object
Take argument and turn it into a pattern.
-
#R(*ranges) ⇒ Object
Given a 2-char string xy, the ASCII range x..y.
-
#S(charset) ⇒ Object
Match any character in string (regarded as a set of characters), range, or Set.
-
#V(ref) ⇒ Object
An “open call” reference to a rule in a grammar.
Instance Method Details
#B(patt) ⇒ Object
Returns a pattern that matches only if the input string at the current position is preceded by patt. Pattern patt must match only strings with some fixed length, and it cannot contain captures.
324 325 326 327 328 329 330 331 332 333 |
# File 'lib/rpeg/rpeg.rb', line 324 def B(patt) patt = P(patt) len = patt.fixed_len raise "Behind match: pattern may not have fixed length" unless len raise "Behind match: pattern has captures" if patt.has_captures? # LPEG puts an upper bound of MAXBEHIND = 255 on how large the match can be here. I think it is because the value is packed # into a byte of memory. We don't care about that Pattern.new(Pattern::BEHIND, patt, data: len) end |
#C(pattern) ⇒ Object
LPEG: Creates a simple capture, which captures the substring of the subject that matches patt. The captured value is a string. If patt has other captures, their values are returned after this one.
Note: it appears that when a simple capture is over another simple capture - like C(C(patt)) - we squeeze out the duplication. See this test at l.216 of test.lua:
assert(#m.match(m.C(m.C(i)), string.rep('a', i)) == i)
(In lua the # operator gives the size of a string or table)
176 177 178 179 180 181 |
# File 'lib/rpeg/rpeg.rb', line 176 def C(pattern) pattern = P(pattern) return pattern if pattern.type == Pattern::CAPTURE && pattern.capture == Capture::SIMPLE Pattern.new(Pattern::CAPTURE, pattern, capture: Capture::SIMPLE) end |
#Carg(num = nil) ⇒ Object
Capture the n-th extra argument provided to #match. The first extra argument is n=1, etc.
We accept a missing argument to match some LPEG test cases but an error is raised
186 187 188 189 190 |
# File 'lib/rpeg/rpeg.rb', line 186 def Carg(num = nil) raise "Invalid argument for Carg: #{num || 'nil'}" unless num.is_a?(Integer) && num&.positive? Pattern.new(Pattern::CAPTURE, P(true), data: num, capture: Capture::ARGUMENT) end |
#Cb(name) ⇒ Object
From LPEG docs:
Creates a back capture. This pattern matches the empty string and produces the values produced by the most recent group
capture named name (where name can be any Lua value).
Most recent means the last complete outermost group capture with the given name. A Complete capture means that the entire
pattern corresponding to the capture has matched. An Outermost capture means that the capture is not inside another complete
capture.
nil is not allowed as a name
202 203 204 205 206 |
# File 'lib/rpeg/rpeg.rb', line 202 def Cb(name) raise "Back capture must specify name of group" unless name Pattern.new(Pattern::CAPTURE, P(true), data: name, capture: Capture::BACKREF) end |
#Cc(*values) ⇒ Object
LPEG: Creates a constant capture. This pattern matches the empty string and produces all given values as its captured values.
No value at all - Cc() - adds nothing to the result, which is different from a value of nil.
We capture several values with individual captures.
213 214 215 216 217 218 219 220 221 222 223 224 |
# File 'lib/rpeg/rpeg.rb', line 213 def Cc(*values) return P(true) if values.empty? patt = Pattern.new(Pattern::CAPTURE, P(true), data: values.first, capture: Capture::CONST) return patt if values.size == 1 # Otherwise, follow LPEG and make an anonymous Group capture over a sequence of single-val const captures values[1...].each do |val| patt *= Cc(val) end Cg(patt) end |
#Cf(pattern, lambda) ⇒ Object
From LPEG docs:
Creates a fold capture. If patt produces a list of captures C1 C2 ... Cn, this capture will produce the value
func(...func(func(C1, C2), C3)..., Cn), that is, it will fold (or accumulate, or reduce) the captures from patt using
function func.
The second argument should be a lambda taking two arguments and returning one in the standard way.
The first nested capture is examined. If there isn’t one or it captures no values there is an error. If this capture contains more than one value all but the first are discarded. This is the initial value for the fold. Then we extract the remaining nested captures and use their values C2, …, Cn in the fold as described.
If Ci (i >= 2) produces k values then the lambda will receive k+1 arguments: the accumulator and the k captured values. an array.
240 241 242 243 244 |
# File 'lib/rpeg/rpeg.rb', line 240 def Cf(pattern, lambda) raise "Fold capture must have an accumulation lambda" unless lambda Pattern.new(Pattern::CAPTURE, P(pattern), data: lambda, capture: Capture::FOLD) end |
#Cg(pattern, name = nil) ⇒ Object
From LPEG docs:
Creates a group capture. It groups all values returned by patt into a single capture. The group may be anonymous (if no name
is given) or named with the given name (which can be any non-nil Lua value).
An anonymous group serves to join values from several captures into a single capture. A named group has a different
behavior. In most situations, a named group returns no values at all. Its values are only relevant for a following back
capture or when used inside a table capture.
The name doesn’t have to be string. It can be anything other than nil, because nil means it’s an anonymous group.
256 257 258 |
# File 'lib/rpeg/rpeg.rb', line 256 def Cg(pattern, name = nil) Pattern.new(Pattern::CAPTURE, P(pattern), data: name, capture: Capture::GROUP) end |
#Cmt(patt, function) ⇒ Object
From LPEG:
Creates a match-time capture. Unlike all other captures, this one is evaluated immediately when a match occurs (even if it
is part of a larger pattern that fails later). It forces the immediate evaluation of all its nested captures and then calls
function.
The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture
values produced by patt.
The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the
returned number becomes the new current position. (Assuming a subject s and current position i, the returned number must be
in the range [i, len(s) + 1].) If the call returns true, the match succeeds without consuming any input. (So, to return true
is equivalent to return i.) If the call returns false, nil, or no value, the match fails.
Any extra values returned by the function become the values produced by the capture.
314 315 316 317 318 319 320 |
# File 'lib/rpeg/rpeg.rb', line 314 def Cmt(patt, function) # LPEG uses a separate RUNTIME node type instead of CAPTURE because certain functions, like hascaptures and fixedlen, need # special behavior here. Note # # LPEG also uses "runtime" interally instead of "matchtime". We follow Pattern.new(Pattern::RUNTIME, P(patt), data: function) end |
#Cp ⇒ Object
LPEG: Creates a position capture. It matches the empty string and captures the position in the subject where the match occurs. The captured value is a number.
262 263 264 |
# File 'lib/rpeg/rpeg.rb', line 262 def Cp Pattern.new(Pattern::CAPTURE, P(true), capture: Capture::POSITION) end |
#Cs(patt) ⇒ Object
From LPEG:
Creates a substitution capture, which captures the substring of the subject that matches patt, with substitutions. For any
capture inside patt with a value, the substring that matched the capture is replaced by the capture value (which should be a
string). The final captured value is the string resulting from all replacements.
271 272 273 |
# File 'lib/rpeg/rpeg.rb', line 271 def Cs(patt) Pattern.new(Pattern::CAPTURE, P(patt), capture: Capture::SUBST) end |
#Ct(patt) ⇒ Object
From LPEG:
Creates a table capture. This capture returns a table with all values from all anonymous captures made by patt inside this
table in successive integer keys, starting at 1. Moreover, for each named capture group created by patt, the first value of
the group is put into the table with the group name as its key. The captured value is only the table.
For us the capture takes the form of a custom class, TableCapture. It is intended to mimic a little bit of the functionality of Lua’s tables, which are a combination Array/Hashtable
-
indexing is by 0, 1, 2, … for the anonmyous catpures
-
indexing by group names otherwise
-
#unpack gives an arry of the anonymous captures.
See the class definition (captures.rb) for more details.
Other things tried:
-
returning a hash when there are named captures and an array when there are not
-
in the hash, anonmyous captures are at keys 0, 1, 2, …
-
this turned out to be somewhat frustrating in unit tests.
-
-
returning a hash with group names as keys and a special key of :anon for the array of anonmyous captures
-
I thought this would work better, but actually turned out to be much worse to work with
-
295 296 297 |
# File 'lib/rpeg/rpeg.rb', line 295 def Ct(patt) Pattern.new(Pattern::CAPTURE, P(patt), capture: Capture::TABLE) end |
#match(thing, string, init = 0, *extra_args) ⇒ Object
See the instance method #match for the arguments
336 337 338 |
# File 'lib/rpeg/rpeg.rb', line 336 def match(thing, string, init = 0, *extra_args) P(thing).match(string, init, *extra_args) end |
#P(arg) ⇒ Object
Take argument and turn it into a pattern
102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 |
# File 'lib/rpeg/rpeg.rb', line 102 def P(arg) case arg when Pattern arg when String # a sequence of CHAR patterns patt = P(true) arg.chars.reverse_each do |ch| patt = Pattern.new(Pattern::CHAR, data: ch) * patt end patt when Integer # When n >= 0, match at least n chars. # When n < 0, there must not be n or more characters left return -P(-arg) if arg.negative? # In LPEG the ANY VM instruction takes no arg and matches a single char, unlike the description in the paper. I think it # makes certain code optimizations simpler to analyze. We do the same. patt = P(true) arg.times do patt = Pattern.new(Pattern::ANY) * patt end patt when FalseClass @false_tree ||= Pattern.new(Pattern::NFALSE) when TrueClass @true_tree ||= Pattern.new(Pattern::NTRUE) when Hash, Array Pattern.new(Pattern::GRAMMAR, data: arg) when Proc # a pattern equivalent to a match-time capture over the empty string. Pattern.new(Pattern::RUNTIME, P(true), data: arg) else raise "RPEG.P does not support argument #{arg}" end end |
#R(*ranges) ⇒ Object
Given a 2-char string xy, the ASCII range x..y. Each argument gives a range and we match on their union.
Always represent with a Set.
142 143 144 145 146 147 148 149 150 151 152 153 |
# File 'lib/rpeg/rpeg.rb', line 142 def R(*ranges) return P(false) if ranges.empty? check = lambda do |str| raise "Bad data #{str} for Pattern#R" unless str.is_a?(String) && str.size == 2 Set.new ((str[0])..(str[1])).to_a end result = ranges.map{ check.call(_1) }.reduce(:|) S(result) end |
#S(charset) ⇒ Object
Match any character in string (regarded as a set of characters), range, or Set
If the set is empty we have NFALSE, which always fails If the set has a single element we have CHAR pattern, which is a little faster in the VM Otherwise we have a CHARSET pattern
85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 |
# File 'lib/rpeg/rpeg.rb', line 85 def S(charset) case charset when Set size = charset.size return P(false) if size.zero? return P(charset.first) if size == 1 return P(1) if charset == Pattern::FULL_CHAR_SET Pattern.new(Pattern::CHARSET, data: charset) when String S(Set.new(charset.chars)) else raise "Cannot create a character set pattern from #{chars}" end end |
#V(ref) ⇒ Object
An “open call” reference to a rule in a grammar. As we don’t have the grammar yet - it is available in full only when we are ready to compile - we remember it this way.
ref should be either
- a non-negative integer n, referring to the n-th rule in the grammar (0-based) or
- a value that will be the key in the final grammar - a Hash or Array - of the rule being referenced
- strings are turned into symbols
162 163 164 165 |
# File 'lib/rpeg/rpeg.rb', line 162 def V(ref) ref = ref.to_sym if ref.is_a?(String) Pattern.new(Pattern::OPEN_CALL, data: ref) end |