Class: ParsingMachine
- Inherits:
-
Object
- Object
- ParsingMachine
- Defined in:
- lib/rpeg/parsing_machine.rb
Overview
The VM used to run the programs generated from the patterns.
See lpvm.c in the LPEG code.
Defined Under Namespace
Classes: Frame
Instance Attribute Summary collapse
-
#extra_args ⇒ Object
readonly
These are internals that aren’t useful in the usual case.
-
#i_ptr ⇒ Object
readonly
These are internals that aren’t useful in the usual case.
-
#program ⇒ Object
readonly
These are internals that aren’t useful in the usual case.
-
#stack ⇒ Object
readonly
These are internals that aren’t useful in the usual case.
-
#subject ⇒ Object
readonly
These are internals that aren’t useful in the usual case.
-
#subject_index ⇒ Object
readonly
These are internals that aren’t useful in the usual case.
Instance Method Summary collapse
- #breadcrumbs ⇒ Object
-
#captures ⇒ Object
Returns the captures obtained when we ran the machine.
-
#handle_run_time_capture_result(results) ⇒ Object
From the ICloseRuntIme main-loop switch statement in lpvm.c.
-
#initialize(program, subject, initial_pos, extra_args) ⇒ ParsingMachine
constructor
program: the program to run subject: the string to match against initial_pos: the position in subject to start the search at extra_args may have been supplied in the initial #match call.
- #new_capture_state(starting_index = nil) ⇒ Object
- #run ⇒ Object
-
#run_time_capture ⇒ Object
This stub needs to be in ParsingMachine and not CaptureState because it must modify @bread_count.
- #step ⇒ Object
- #success? ⇒ Boolean
Constructor Details
#initialize(program, subject, initial_pos, extra_args) ⇒ ParsingMachine
program: the program to run subject: the string to match against initial_pos: the position in subject to start the search at extra_args may have been supplied in the initial #match call. These are consumed by Argument Captures.
117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 |
# File 'lib/rpeg/parsing_machine.rb', line 117 def initialize(program, subject, initial_pos, extra_args) @program = program.clone.freeze.must_only_contain(Instruction) @prog_len = @program_size # When benchmarking with a large subject (4.5 MB) I found that the search took an enormous amount of time (1300 s), and 95% of # it was due to 3.6 million calls to String::[]. I don't know why this was so slow, as accessing a large string in a small # scratch loop is fast. I am very confused. Converting the string to an array of chars is much faster (1300 s became 9 s). # @original_subject = subject.clone.freeze @subject = subject.chars.freeze @subject_size = @subject.size @i_ptr = 0 # index in @program of the next instruction @subject_index = initial_pos @stack = [] @breadcrumbs = [].must_only_ever_contain(Capture::Breadcrumb) # the records of the captures we make during parsing @bread_count = 0 # the number of breadcrumbs in @breadcrumbs (some in the array may be stale) @extra_args = extra_args.clone end |
Instance Attribute Details
#extra_args ⇒ Object (readonly)
These are internals that aren’t useful in the usual case
260 261 262 |
# File 'lib/rpeg/parsing_machine.rb', line 260 def extra_args @extra_args end |
#i_ptr ⇒ Object (readonly)
These are internals that aren’t useful in the usual case
260 261 262 |
# File 'lib/rpeg/parsing_machine.rb', line 260 def i_ptr @i_ptr end |
#program ⇒ Object (readonly)
These are internals that aren’t useful in the usual case
260 261 262 |
# File 'lib/rpeg/parsing_machine.rb', line 260 def program @program end |
#stack ⇒ Object (readonly)
These are internals that aren’t useful in the usual case
260 261 262 |
# File 'lib/rpeg/parsing_machine.rb', line 260 def stack @stack end |
#subject ⇒ Object (readonly)
These are internals that aren’t useful in the usual case
260 261 262 |
# File 'lib/rpeg/parsing_machine.rb', line 260 def subject @subject end |
#subject_index ⇒ Object (readonly)
These are internals that aren’t useful in the usual case
260 261 262 |
# File 'lib/rpeg/parsing_machine.rb', line 260 def subject_index @subject_index end |
Instance Method Details
#breadcrumbs ⇒ Object
262 263 264 |
# File 'lib/rpeg/parsing_machine.rb', line 262 def @breadcrumbs[0, @bread_count] end |
#captures ⇒ Object
Returns the captures obtained when we ran the machine.
If there are no captures we return the final index into the subject string. This is typically one past the matched section. If there is exactly one capture we return it. If there are multiple captures we return them in an array.
The capture code in LPEG (mostly in lpcap.c) looks complicated at first but it is made up of a bunch of pieces that each do one thing and coordinate well togehter. Some extra complexity comes from the manual memory management required in C and the need to interact with Lua values - this appears to be especially the case with the Runtime capture code, which is bewildering at first view. Porting it one capture kind at a time let me understand it at some level as I went.
Basic model:
-
We push Breadcrumb objects onto the stack as we run the VM based on the instructions generated from the patterns. We never pop anything from the stack: the Captures are breadcrumbs that let us work out after the fact what happend. Things do get removed from the Capture stack but only at backtrack points because a match has failed.
-
The End instruction tacks on an unbalanced CloseCapture. This appears to be simply an end-marker like the null string terminator. We don’t do this
-
After the VM runs we analyze the Breadcrumbs to calculate the captures. We go back and forth through the data. So isn’t not a stack, but an array.
This method plays the same role as LPEG’s getcaptures (lpcap.c)
428 429 430 431 432 433 434 435 436 437 438 439 440 |
# File 'lib/rpeg/parsing_machine.rb', line 428 def captures raise "Cannot call #captures unless machine ran sucessfully" unless done? && success? @capture_state = new_capture_state @capture_state.capture_all result = @capture_state.captures return @subject_index if result.empty? return result.first if result.size == 1 result end |
#handle_run_time_capture_result(results) ⇒ Object
From the ICloseRuntIme main-loop switch statement in lpvm.c
372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 |
# File 'lib/rpeg/parsing_machine.rb', line 372 def handle_run_time_capture_result(results) directive, *dyn_captures = results unless directive handle_fail_ptr return end @subject_index = if directive == true @subject_index else directive.must_be_a(Integer) if directive < @subject_index || directive > @subject_size raise 'invalid position returned by match-time capture' end directive end if dyn_captures.empty? # no dynamic captures. Just get rid of the OPEN capture we still have @bread_count -= 1 else # This is LPEG's adddyncaptures in lpvm.c @breadcrumbs[@bread_count - 1].data = nil # make the group capture an anonymous group dyn_captures.each do |cap_val| # LPEG uses a special RUNTIME capture kind here to help find these things later if they need to be removed. We don't appear # to need it - we could just use a CONST capture. But let's follow LPEG just in case. add_capture Capture::Breadcrumb.new(1, @subject_index, cap_val, Capture::RUNTIME) end add_capture Capture::Breadcrumb.new(1, @subject_index, nil, Capture::CLOSE) # close the group end @i_ptr += 1 end |
#new_capture_state(starting_index = nil) ⇒ Object
454 455 456 |
# File 'lib/rpeg/parsing_machine.rb', line 454 def new_capture_state(starting_index = nil) CaptureState.new(@breadcrumbs[0, @bread_count], @subject, @subject_index, @extra_args, starting_index:) end |
#run ⇒ Object
149 150 151 |
# File 'lib/rpeg/parsing_machine.rb', line 149 def run step until @done end |
#run_time_capture ⇒ Object
This stub needs to be in ParsingMachine and not CaptureState because it must modify @bread_count
443 444 445 446 447 448 449 450 451 452 |
# File 'lib/rpeg/parsing_machine.rb', line 443 def run_time_capture # We need point to the close capture we just hit. LPEG is tricksy here: there isn't actually a CLOSE capture/breadcrumb yet, but # the data structure - an array of Capture objects - means that the "next capture" memory is interpreted as a Capture. We have # to do something manually that in the C code happens "automatically" add_capture Capture::Breadcrumb.new(0, @subject_index, nil, Capture::CLOSE) capture_state = new_capture_state(@bread_count - 1) # start on the CLOSE we just tacked on @bread_count, result = capture_state.run_time_capture result end |
#step ⇒ Object
153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 |
# File 'lib/rpeg/parsing_machine.rb', line 153 def step instr = @program[@i_ptr] case instr.op_code when Instruction::TEST_CHARSET test_char(instr.data.include?(@subject[@subject_index]), instr.offset) when Instruction::TEST_CHAR test_char(instr.data == @subject[@subject_index], instr.offset) when Instruction::TEST_ANY test_char(@subject_index < @subject_size, instr.offset) when Instruction::ANY check_char(@subject_index < @subject_size) when Instruction::CHARSET check_char(instr.data.include?(@subject[@subject_index])) when Instruction::CHAR check_char(instr.data == @subject[@subject_index]) when Instruction::JUMP @i_ptr += instr.offset when Instruction::CHOICE # We push the offset for the other side of the choice push(:state, instr.offset) @i_ptr += 1 when Instruction::CALL # Call is like jump, but we push the return address onto the stack first push(:instruction, 1) @i_ptr += instr.offset when Instruction::RETURN @i_ptr = pop(:instruction).i_ptr when Instruction::COMMIT # we pop and discard the top of the stack (which must be a full state) and then do the jump given by arg1. Even though we # are discarding it check that it was a full state for sanity. _ = pop(:state) @i_ptr += instr.offset when Instruction::PARTIAL_COMMIT # Sort of a combination of commit (which pops) and choice (which pushes), but we just tweak the top of the stack. See # Ierusalimschy, sec 4.3 stack_top = peek(:state) raise "Empty stack for partial commit!" unless stack_top stack_top.subject_index = @subject_index stack_top.bread_count = @bread_count @i_ptr += instr.offset when Instruction::BACK_COMMIT # A combination of a fail and a commit. We backtrack, but then jump to the specified instruction rather than using the # backtrack label. It's used for the AND pattern. See Ierusalimschy, 4.4 stack_top = pop(:state) @subject_index = stack_top.subject_index @bread_count = stack_top.bread_count @i_ptr += instr.offset when Instruction::SPAN # Special instruction for when we are repeating over a charset, which is common. We just consume as many maching characters # as there are. This never fails as we can always match at least zero. @subject_index += 1 while instr.data.include?(@subject[@subject_index]) @i_ptr += 1 when Instruction::BEHIND n = instr.aux # the (fixed) length of the pattern we want to match. if n > @subject_index # We can't jump so far back in the subject handle_fail_ptr else @subject_index -= n @i_ptr += 1 end when Instruction::FAIL handle_fail_ptr when Instruction::FAIL_TWICE # An optimization for the NOT implementation. We pop the top of the stack and discard it, and then enter the fail routine # again. For sanity's sake we'll check that the thing we are popping is a :state entry. See Ierusalimschy, 4.4 _ = pop(:state) handle_fail_ptr when Instruction::CLOSE_RUN_TIME # The LPEG code for runtime captures is very complicated. Reading through it, it appears that the complexity comes from # needing to carefully manage the capture breadcrumbs wrt to the Lua values living on the Lua stack to avoid memory # leaks. We don't have to worry about that here, as everything is in Ruby and we can leave the hard stuff to the garbage # collector. The remaining work is little more than we have with a function capture. result = run_time_capture handle_run_time_capture_result(result) when Instruction::OPEN_CAPTURE record_capture(instr, size: 0, subject_index: @subject_index) when Instruction::CLOSE_CAPTURE # As in LPEG: "if possible, turn capture into a full capture" raise "Close capture without an open" unless @bread_count.positive? lc = @breadcrumbs[@bread_count - 1].must_be # still on the breadcrumb list if lc.size.zero? && (@subject_index - lc.subject_index) < 255 # TODO: should we care about an upper bound here? # The previous breadcrumb was an OPEN, and we are closing it lc.size = @subject_index - lc.subject_index + 1 @i_ptr += 1 else record_capture(instr, size: 1, subject_index: @subject_index) end when Instruction::FULL_CAPTURE # We have an all-in-one match, and the "capture length" tells us how far back in the subject the match started. len = (instr.aux[:capture_length] || 0).must_be(Integer) record_capture(instr, size: 1 + len, subject_index: @subject_index - len) when Instruction::OP_END @success = true done! else raise "Unhandled op code #{instr.op_code}" end end |
#success? ⇒ Boolean
137 138 139 |
# File 'lib/rpeg/parsing_machine.rb', line 137 def success? @success end |