Module: RPEG

Extended by:
RPEG
Included in:
RPEG
Defined in:
lib/rpeg/rpeg.rb,
lib/rpeg.rb

Overview

This is intended to play the same role as LPEG’s lpeg module.

Top-level differences from LPEG:

  • AND patterns in LPEG are written as #patt (&patt in the first version) but +patt here

    • unary & apparently can’t be overloaded in Ruby

      • I tried using the “&” operator by overriding #to_proc but the Ruby parser rejects the &patt expression.

    • +patt doesn’t read well, even though -patt does. I think this is because binary plus is so much more common when buliding patterns than binary minus is.

      • There doesn’t seem to be another workable option. According to stackoverflow.com/a/21060235/1299011 the unary operators are !, ~, , and -. We use - already and ! needs to be avoided because of idiomiatic Ruby checks like !foo for existence. Operator ~ works, but that character is easy to mistake for a - when reading, which is bad. Ruby uses & as a unary (for to_proc) but the Ruby parser doesn’t allow its use in general. So I think we’re stuck with .

  • repeating patterns still use exponentiation, but it now looks like patt**n rather than patt^n because of Ruby’s syntax

  • grammars are represented by hashes or arrays. LPEG uses Lua tables (which are mashups of hashtables and arrays)

    If an array is given then the nonterminals aren’t named and all open calls must use numeric indices. The first element of the array is either

    • a non-negative integer 0, 1, 2, … and specifies the (rule of the) initial nonterminal among the remaining elements with indices reckoned without that initial integer

    • something else, which is interpreted as the pattern for the initial nonterminal

    Otherwise the grammar is defined with a Hash. The keys are the nonterminal symbols and the values are the rule patterns.

    • the keys must be symbols or strings (which are converted to symbols). No rule can use :initial or “initial” as nonterminal.

    • the open calls can refer either to the nonterminals (as strings or symbols) or to rule indices as they appear in the hash, ignoring the :initial key (if present)

    • :initial/“initial” can appear as a key in the hash and its value specifies the initial nonterminal.

      • if it is a non-zero integer it gives the index of the initial terminal’s rule, reckoned without the presence of the :initial key itself.

      • if it is a symbol or a string it specifies the initial nonterminal directly

  • “Table” captures return an instace of TableCapture, which impelements a little bit of a Lua’s table functionality

    • other formats haven’t worked out well

Function captures

Various kinds of captures involve calling a function (proc) provided by client code. For example, the construction (patt / fn) takes the captures made by patt and passes them as arguments to fn. Then the values returned by fn become the captures of the expression. Lua is better than Ruby at distinguishing between a function that returns multiple values and one that returns a single value that is an array. In RPEG, returns from function in contexts like this are treated as follows:

- [1, 2, 3]: multiple captures, 1, 2, 3.
  - this is the natural interpretation as it's the standard way that a Ruby function returns multiple values
- [[1, 2, 3]]: a single capture that is an array
- nil: no captures
  - even if the function says something like "return nil", the capture code has no way to distinguish between that and a
    function that returns nothing
- [nil]: a single capture with value nil
  - the weirdest case, but I don't see an alternative

TODO:

  • program generation optimations

    • other pattern-based optimizations: need to scan through the LPEG code again

      • I think I’ve done them all now

    • profiling

  • LPEG’s locale support

    • what would this look like in Ruby?

Defined Under Namespace

Modules: NonNumericOverloadExtension, RE Classes: Pattern

Instance Method Summary collapse

Instance Method Details

#B(patt) ⇒ Object

Returns a pattern that matches only if the input string at the current position is preceded by patt. Pattern patt must match only strings with some fixed length, and it cannot contain captures.



324
325
326
327
328
329
330
331
332
333
# File 'lib/rpeg/rpeg.rb', line 324

def B(patt)
  patt = P(patt)
  len = patt.fixed_len
  raise "Behind match: pattern may not have fixed length" unless len
  raise "Behind match: pattern has captures" if patt.has_captures?

  # LPEG puts an upper bound of MAXBEHIND = 255 on how large the match can be here. I think it is because the value is packed
  # into a byte of memory. We don't care about that
  Pattern.new(Pattern::BEHIND, patt, data: len)
end

#C(pattern) ⇒ Object

LPEG: Creates a simple capture, which captures the substring of the subject that matches patt. The captured value is a string. If patt has other captures, their values are returned after this one.

Note: it appears that when a simple capture is over another simple capture - like C(C(patt)) - we squeeze out the duplication. See this test at l.216 of test.lua:

assert(#m.match(m.C(m.C(i)), string.rep('a', i)) == i)

(In lua the # operator gives the size of a string or table)



176
177
178
179
180
181
# File 'lib/rpeg/rpeg.rb', line 176

def C(pattern)
  pattern = P(pattern)
  return pattern if pattern.type == Pattern::CAPTURE && pattern.capture == Capture::SIMPLE

  Pattern.new(Pattern::CAPTURE, pattern, capture: Capture::SIMPLE)
end

#Carg(num = nil) ⇒ Object

Capture the n-th extra argument provided to #match. The first extra argument is n=1, etc.

We accept a missing argument to match some LPEG test cases but an error is raised



186
187
188
189
190
# File 'lib/rpeg/rpeg.rb', line 186

def Carg(num = nil)
  raise "Invalid argument for Carg: #{num || 'nil'}" unless num.is_a?(Integer) && num&.positive?

  Pattern.new(Pattern::CAPTURE, P(true), data: num, capture: Capture::ARGUMENT)
end

#Cb(name) ⇒ Object

From LPEG docs:

Creates a back capture. This pattern matches the empty string and produces the values produced by the most recent group
capture named name (where name can be any Lua value).

Most recent means the last complete outermost group capture with the given name. A Complete capture means that the entire
pattern corresponding to the capture has matched. An Outermost capture means that the capture is not inside another complete
capture.

nil is not allowed as a name



202
203
204
205
206
# File 'lib/rpeg/rpeg.rb', line 202

def Cb(name)
  raise "Back capture must specify name of group" unless name

  Pattern.new(Pattern::CAPTURE, P(true), data: name, capture: Capture::BACKREF)
end

#Cc(*values) ⇒ Object

LPEG: Creates a constant capture. This pattern matches the empty string and produces all given values as its captured values.

No value at all - Cc() - adds nothing to the result, which is different from a value of nil.

We capture several values with individual captures.



213
214
215
216
217
218
219
220
221
222
223
224
# File 'lib/rpeg/rpeg.rb', line 213

def Cc(*values)
  return P(true) if values.empty?

  patt = Pattern.new(Pattern::CAPTURE, P(true), data: values.first, capture: Capture::CONST)
  return patt if values.size == 1

  # Otherwise, follow LPEG and make an anonymous Group capture over a sequence of single-val const captures
  values[1...].each do |val|
    patt *= Cc(val)
  end
  Cg(patt)
end

#Cf(pattern, lambda) ⇒ Object

From LPEG docs:

Creates a fold capture. If patt produces a list of captures C1 C2 ... Cn, this capture will produce the value
func(...func(func(C1, C2), C3)..., Cn), that is, it will fold (or accumulate, or reduce) the captures from patt using
function func.

The second argument should be a lambda taking two arguments and returning one in the standard way.

The first nested capture is examined. If there isn’t one or it captures no values there is an error. If this capture contains more than one value all but the first are discarded. This is the initial value for the fold. Then we extract the remaining nested captures and use their values C2, …, Cn in the fold as described.

If Ci (i >= 2) produces k values then the lambda will receive k+1 arguments: the accumulator and the k captured values. an array.



240
241
242
243
244
# File 'lib/rpeg/rpeg.rb', line 240

def Cf(pattern, lambda)
  raise "Fold capture must have an accumulation lambda" unless lambda

  Pattern.new(Pattern::CAPTURE, P(pattern), data: lambda, capture: Capture::FOLD)
end

#Cg(pattern, name = nil) ⇒ Object

From LPEG docs:

Creates a group capture. It groups all values returned by patt into a single capture. The group may be anonymous (if no name
is given) or named with the given name (which can be any non-nil Lua value).

An anonymous group serves to join values from several captures into a single capture. A named group has a different
behavior. In most situations, a named group returns no values at all. Its values are only relevant for a following back
capture or when used inside a table capture.

The name doesn’t have to be string. It can be anything other than nil, because nil means it’s an anonymous group.



256
257
258
# File 'lib/rpeg/rpeg.rb', line 256

def Cg(pattern, name = nil)
  Pattern.new(Pattern::CAPTURE, P(pattern), data: name, capture: Capture::GROUP)
end

#Cmt(patt, function) ⇒ Object

From LPEG:

Creates a match-time capture. Unlike all other captures, this one is evaluated immediately when a match occurs (even if it
is part of a larger pattern that fails later). It forces the immediate evaluation of all its nested captures and then calls
function.

The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture
values produced by patt.

The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the
returned number becomes the new current position. (Assuming a subject s and current position i, the returned number must be
in the range [i, len(s) + 1].) If the call returns true, the match succeeds without consuming any input. (So, to return true
is equivalent to return i.) If the call returns false, nil, or no value, the match fails.

Any extra values returned by the function become the values produced by the capture.


314
315
316
317
318
319
320
# File 'lib/rpeg/rpeg.rb', line 314

def Cmt(patt, function)
  # LPEG uses a separate RUNTIME node type instead of CAPTURE because certain functions, like hascaptures and fixedlen, need
  # special behavior here. Note
  #
  # LPEG also uses "runtime" interally instead of "matchtime". We follow
  Pattern.new(Pattern::RUNTIME, P(patt), data: function)
end

#CpObject

LPEG: Creates a position capture. It matches the empty string and captures the position in the subject where the match occurs. The captured value is a number.



262
263
264
# File 'lib/rpeg/rpeg.rb', line 262

def Cp
  Pattern.new(Pattern::CAPTURE, P(true), capture: Capture::POSITION)
end

#Cs(patt) ⇒ Object

From LPEG:

Creates a substitution capture, which captures the substring of the subject that matches patt, with substitutions. For any
capture inside patt with a value, the substring that matched the capture is replaced by the capture value (which should be a
string). The final captured value is the string resulting from all replacements.


271
272
273
# File 'lib/rpeg/rpeg.rb', line 271

def Cs(patt)
  Pattern.new(Pattern::CAPTURE, P(patt), capture: Capture::SUBST)
end

#Ct(patt) ⇒ Object

From LPEG:

Creates a table capture. This capture returns a table with all values from all anonymous captures made by patt inside this
table in successive integer keys, starting at 1. Moreover, for each named capture group created by patt, the first value of
the group is put into the table with the group name as its key. The captured value is only the table.

For us the capture takes the form of a custom class, TableCapture. It is intended to mimic a little bit of the functionality of Lua’s tables, which are a combination Array/Hashtable

  • indexing is by 0, 1, 2, … for the anonmyous catpures

  • indexing by group names otherwise

  • #unpack gives an arry of the anonymous captures.

See the class definition (captures.rb) for more details.

Other things tried:

  • returning a hash when there are named captures and an array when there are not

    • in the hash, anonmyous captures are at keys 0, 1, 2, …

    • this turned out to be somewhat frustrating in unit tests.

  • returning a hash with group names as keys and a special key of :anon for the array of anonmyous captures

    • I thought this would work better, but actually turned out to be much worse to work with



295
296
297
# File 'lib/rpeg/rpeg.rb', line 295

def Ct(patt)
  Pattern.new(Pattern::CAPTURE, P(patt), capture: Capture::TABLE)
end

#match(thing, string, init = 0, *extra_args) ⇒ Object

See the instance method #match for the arguments



336
337
338
# File 'lib/rpeg/rpeg.rb', line 336

def match(thing, string, init = 0, *extra_args)
  P(thing).match(string, init, *extra_args)
end

#P(arg) ⇒ Object

Take argument and turn it into a pattern



102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
# File 'lib/rpeg/rpeg.rb', line 102

def P(arg)
  case arg
  when Pattern
    arg
  when String
    # a sequence of CHAR patterns
    patt = P(true)
    arg.chars.reverse_each do |ch|
      patt = Pattern.new(Pattern::CHAR, data: ch) * patt
    end
    patt
  when Integer
    # When n >= 0, match at least n chars.
    # When n < 0, there must not be n or more characters left
    return -P(-arg) if arg.negative?

    # In LPEG the ANY VM instruction takes no arg and matches a single char, unlike the description in the paper. I think it
    # makes certain code optimizations simpler to analyze. We do the same.
    patt = P(true)
    arg.times do
      patt = Pattern.new(Pattern::ANY) * patt
    end
    patt
  when FalseClass
    @false_tree ||= Pattern.new(Pattern::NFALSE)
  when TrueClass
    @true_tree ||= Pattern.new(Pattern::NTRUE)
  when Hash, Array
    Pattern.new(Pattern::GRAMMAR, data: arg)
  when Proc
    # a pattern equivalent to a match-time capture over the empty string.
    Pattern.new(Pattern::RUNTIME, P(true), data: arg)
  else
    raise "RPEG.P does not support argument #{arg}"
  end
end

#R(*ranges) ⇒ Object

Given a 2-char string xy, the ASCII range x..y. Each argument gives a range and we match on their union.

Always represent with a Set.



142
143
144
145
146
147
148
149
150
151
152
153
# File 'lib/rpeg/rpeg.rb', line 142

def R(*ranges)
  return P(false) if ranges.empty?

  check = lambda do |str|
    raise "Bad data #{str} for Pattern#R" unless str.is_a?(String) && str.size == 2

    Set.new ((str[0])..(str[1])).to_a
  end

  result = ranges.map{ check.call(_1) }.reduce(:|)
  S(result)
end

#S(charset) ⇒ Object

Match any character in string (regarded as a set of characters), range, or Set

If the set is empty we have NFALSE, which always fails If the set has a single element we have CHAR pattern, which is a little faster in the VM Otherwise we have a CHARSET pattern



85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
# File 'lib/rpeg/rpeg.rb', line 85

def S(charset)
  case charset
  when Set
    size = charset.size
    return P(false) if size.zero?
    return P(charset.first) if size == 1
    return P(1) if charset == Pattern::FULL_CHAR_SET

    Pattern.new(Pattern::CHARSET, data: charset)
  when String
    S(Set.new(charset.chars))
  else
    raise "Cannot create a character set pattern from #{chars}"
  end
end

#V(ref) ⇒ Object

An “open call” reference to a rule in a grammar. As we don’t have the grammar yet - it is available in full only when we are ready to compile - we remember it this way.

ref should be either

- a non-negative integer n, referring to the n-th rule in the grammar (0-based) or
- a value that will be the key in the final grammar - a Hash or Array - of the rule being referenced
  - strings are turned into symbols


162
163
164
165
# File 'lib/rpeg/rpeg.rb', line 162

def V(ref)
  ref = ref.to_sym if ref.is_a?(String)
  Pattern.new(Pattern::OPEN_CALL, data: ref)
end