Class: ParsingMachine

Inherits:
Object
  • Object
show all
Defined in:
lib/rpeg/parsing_machine.rb

Overview

The VM used to run the programs generated from the patterns.

See lpvm.c in the LPEG code.

Defined Under Namespace

Classes: Frame

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(program, subject, initial_pos, extra_args) ⇒ ParsingMachine

program: the program to run subject: the string to match against initial_pos: the position in subject to start the search at extra_args may have been supplied in the initial #match call. These are consumed by Argument Captures.



117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
# File 'lib/rpeg/parsing_machine.rb', line 117

def initialize(program, subject, initial_pos, extra_args)
  @program = program.clone.freeze.must_only_contain(Instruction)
  @prog_len = @program_size

  # When benchmarking with a large subject (4.5 MB) I found that the search took an enormous amount of time (1300 s), and 95% of
  # it was due to 3.6 million calls to String::[]. I don't know why this was so slow, as accessing a large string in a small
  # scratch loop is fast. I am very confused. Converting the string to an array of chars is much faster (1300 s became 9 s).
  # @original_subject = subject.clone.freeze
  @subject = subject.chars.freeze
  @subject_size = @subject.size

  @i_ptr = 0 # index in @program of the next instruction
  @subject_index = initial_pos
  @stack = []
  @breadcrumbs = [].must_only_ever_contain(Capture::Breadcrumb) # the records of the captures we make during parsing
  @bread_count = 0 # the number of breadcrumbs in @breadcrumbs (some in the array may be stale)

  @extra_args = extra_args.clone
end

Instance Attribute Details

#extra_argsObject (readonly)

These are internals that aren’t useful in the usual case



260
261
262
# File 'lib/rpeg/parsing_machine.rb', line 260

def extra_args
  @extra_args
end

#i_ptrObject (readonly)

These are internals that aren’t useful in the usual case



260
261
262
# File 'lib/rpeg/parsing_machine.rb', line 260

def i_ptr
  @i_ptr
end

#programObject (readonly)

These are internals that aren’t useful in the usual case



260
261
262
# File 'lib/rpeg/parsing_machine.rb', line 260

def program
  @program
end

#stackObject (readonly)

These are internals that aren’t useful in the usual case



260
261
262
# File 'lib/rpeg/parsing_machine.rb', line 260

def stack
  @stack
end

#subjectObject (readonly)

These are internals that aren’t useful in the usual case



260
261
262
# File 'lib/rpeg/parsing_machine.rb', line 260

def subject
  @subject
end

#subject_indexObject (readonly)

These are internals that aren’t useful in the usual case



260
261
262
# File 'lib/rpeg/parsing_machine.rb', line 260

def subject_index
  @subject_index
end

Instance Method Details



262
263
264
# File 'lib/rpeg/parsing_machine.rb', line 262

def breadcrumbs
  @breadcrumbs[0, @bread_count]
end

#capturesObject

Returns the captures obtained when we ran the machine.

If there are no captures we return the final index into the subject string. This is typically one past the matched section. If there is exactly one capture we return it. If there are multiple captures we return them in an array.

The capture code in LPEG (mostly in lpcap.c) looks complicated at first but it is made up of a bunch of pieces that each do one thing and coordinate well togehter. Some extra complexity comes from the manual memory management required in C and the need to interact with Lua values - this appears to be especially the case with the Runtime capture code, which is bewildering at first view. Porting it one capture kind at a time let me understand it at some level as I went.

Basic model:

  • We push Breadcrumb objects onto the stack as we run the VM based on the instructions generated from the patterns. We never pop anything from the stack: the Captures are breadcrumbs that let us work out after the fact what happend. Things do get removed from the Capture stack but only at backtrack points because a match has failed.

  • The End instruction tacks on an unbalanced CloseCapture. This appears to be simply an end-marker like the null string terminator. We don’t do this

  • After the VM runs we analyze the Breadcrumbs to calculate the captures. We go back and forth through the data. So isn’t not a stack, but an array.

This method plays the same role as LPEG’s getcaptures (lpcap.c)



428
429
430
431
432
433
434
435
436
437
438
439
440
# File 'lib/rpeg/parsing_machine.rb', line 428

def captures
  raise "Cannot call #captures unless machine ran sucessfully" unless done? && success?

  @capture_state = new_capture_state
  @capture_state.capture_all

  result = @capture_state.captures

  return @subject_index if result.empty?
  return result.first if result.size == 1

  result
end

#handle_run_time_capture_result(results) ⇒ Object

From the ICloseRuntIme main-loop switch statement in lpvm.c



372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
# File 'lib/rpeg/parsing_machine.rb', line 372

def handle_run_time_capture_result(results)
  directive, *dyn_captures = results
  unless directive
    handle_fail_ptr
    return
  end

  @subject_index = if directive == true
                     @subject_index
                   else
                     directive.must_be_a(Integer)
                     if directive < @subject_index || directive > @subject_size
                       raise 'invalid position returned by match-time capture'
                     end

                     directive
                   end

  if dyn_captures.empty?
    # no dynamic captures. Just get rid of the OPEN capture we still have
    @bread_count -= 1
  else
    # This is LPEG's adddyncaptures in lpvm.c
    @breadcrumbs[@bread_count - 1].data = nil # make the group capture an anonymous group
    dyn_captures.each do |cap_val|
      # LPEG uses a special RUNTIME capture kind here to help find these things later if they need to be removed. We don't appear
      # to need it - we could just use a CONST capture. But let's follow LPEG just in case.
      add_capture Capture::Breadcrumb.new(1, @subject_index, cap_val, Capture::RUNTIME)
    end
    add_capture Capture::Breadcrumb.new(1, @subject_index, nil, Capture::CLOSE) # close the group
  end
  @i_ptr += 1
end

#new_capture_state(starting_index = nil) ⇒ Object



454
455
456
# File 'lib/rpeg/parsing_machine.rb', line 454

def new_capture_state(starting_index = nil)
  CaptureState.new(@breadcrumbs[0, @bread_count], @subject, @subject_index, @extra_args, starting_index:)
end

#runObject



149
150
151
# File 'lib/rpeg/parsing_machine.rb', line 149

def run
  step until @done
end

#run_time_captureObject

This stub needs to be in ParsingMachine and not CaptureState because it must modify @bread_count



443
444
445
446
447
448
449
450
451
452
# File 'lib/rpeg/parsing_machine.rb', line 443

def run_time_capture
  # We need point to the close capture we just hit. LPEG is tricksy here: there isn't actually a CLOSE capture/breadcrumb yet, but
  # the data structure - an array of Capture objects - means that the "next capture" memory is interpreted as a Capture.  We have
  # to do something manually that in the C code happens "automatically"
  add_capture Capture::Breadcrumb.new(0, @subject_index, nil, Capture::CLOSE)
  capture_state = new_capture_state(@bread_count - 1) # start on the CLOSE we just tacked on

  @bread_count, result = capture_state.run_time_capture
  result
end

#stepObject



153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
# File 'lib/rpeg/parsing_machine.rb', line 153

def step
  instr = @program[@i_ptr]

  case instr.op_code
  when Instruction::TEST_CHARSET
    test_char(instr.data.include?(@subject[@subject_index]), instr.offset)
  when Instruction::TEST_CHAR
    test_char(instr.data == @subject[@subject_index], instr.offset)
  when Instruction::TEST_ANY
    test_char(@subject_index < @subject_size, instr.offset)
  when Instruction::ANY
    check_char(@subject_index < @subject_size)
  when Instruction::CHARSET
    check_char(instr.data.include?(@subject[@subject_index]))
  when Instruction::CHAR
    check_char(instr.data == @subject[@subject_index])
  when Instruction::JUMP
    @i_ptr += instr.offset
  when Instruction::CHOICE
    # We push the offset for the other side of the choice
    push(:state, instr.offset)
    @i_ptr += 1
  when Instruction::CALL
    # Call is like jump, but we push the return address onto the stack first
    push(:instruction, 1)
    @i_ptr += instr.offset
  when Instruction::RETURN
    @i_ptr = pop(:instruction).i_ptr
  when Instruction::COMMIT
    # we pop and discard the top of the stack (which must be a full state) and then do the jump given by arg1. Even though we
    # are discarding it check that it was a full state for sanity.
    _ = pop(:state)
    @i_ptr += instr.offset
  when Instruction::PARTIAL_COMMIT
    # Sort of a combination of commit (which pops) and choice (which pushes), but we just tweak the top of the stack. See
    # Ierusalimschy, sec 4.3
    stack_top = peek(:state)
    raise "Empty stack for partial commit!" unless stack_top

    stack_top.subject_index = @subject_index
    stack_top.bread_count = @bread_count
    @i_ptr += instr.offset
  when Instruction::BACK_COMMIT
    # A combination of a fail and a commit. We backtrack, but then jump to the specified instruction rather than using the
    # backtrack label. It's used for the AND pattern. See Ierusalimschy, 4.4
    stack_top = pop(:state)
    @subject_index = stack_top.subject_index
    @bread_count = stack_top.bread_count
    @i_ptr += instr.offset
  when Instruction::SPAN
    # Special instruction for when we are repeating over a charset, which is common. We just consume as many maching characters
    # as there are. This never fails as we can always match at least zero.
    @subject_index += 1 while instr.data.include?(@subject[@subject_index])
    @i_ptr += 1
  when Instruction::BEHIND
    n = instr.aux # the (fixed) length of the pattern we want to match.
    if n > @subject_index
      # We can't jump so far back in the subject
      handle_fail_ptr
    else
      @subject_index -= n
      @i_ptr += 1
    end
  when Instruction::FAIL
    handle_fail_ptr
  when Instruction::FAIL_TWICE
    # An optimization for the NOT implementation. We pop the top of the stack and discard it, and then enter the fail routine
    # again. For sanity's sake we'll check that the thing we are popping is a :state entry. See Ierusalimschy, 4.4
    _ = pop(:state)
    handle_fail_ptr
  when Instruction::CLOSE_RUN_TIME
    # The LPEG code for runtime captures is very complicated. Reading through it, it appears that the complexity comes from
    # needing to carefully manage the capture breadcrumbs wrt to the Lua values living on the Lua stack to avoid memory
    # leaks. We don't have to worry about that here, as everything is in Ruby and we can leave the hard stuff to the garbage
    # collector. The remaining work is little more than we have with a function capture.
    result = run_time_capture
    handle_run_time_capture_result(result)
  when Instruction::OPEN_CAPTURE
    record_capture(instr, size: 0, subject_index: @subject_index)
  when Instruction::CLOSE_CAPTURE
    # As in LPEG: "if possible, turn capture into a full capture"
    raise "Close capture without an open" unless @bread_count.positive?

    lc = @breadcrumbs[@bread_count - 1].must_be # still on the breadcrumb list
    if lc.size.zero? && (@subject_index - lc.subject_index) < 255 # TODO: should we care about an upper bound here?
      # The previous breadcrumb was an OPEN, and we are closing it
      lc.size = @subject_index - lc.subject_index + 1
      @i_ptr += 1
    else
      record_capture(instr, size: 1, subject_index: @subject_index)
    end
  when Instruction::FULL_CAPTURE
    # We have an all-in-one match, and the "capture length" tells us how far back in the subject the match started.
    len = (instr.aux[:capture_length] || 0).must_be(Integer)
    record_capture(instr, size: 1 + len, subject_index: @subject_index - len)
  when Instruction::OP_END
    @success = true
    done!
  else
    raise "Unhandled op code #{instr.op_code}"
  end
end

#success?Boolean

Returns:

  • (Boolean)


137
138
139
# File 'lib/rpeg/parsing_machine.rb', line 137

def success?
  @success
end