Class: CaptureState

Inherits:
Object
  • Object
show all
Defined in:
lib/rpeg/captures.rb

Overview

VM and post-match capture support

q.v. LPEG’s CaptureState, lpcap.h

As a Ruby class it contains as much of the capture code (lpcap.c) as makes sense. Because of this, it needs to know the

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(breadcrumbs, subject, subject_index, extra_args, starting_index:) ⇒ CaptureState

Returns a new instance of CaptureState.



148
149
150
151
152
153
154
155
# File 'lib/rpeg/captures.rb', line 148

def initialize(breadcrumbs, subject, subject_index, extra_args, starting_index:)
  @breadcrumbs = breadcrumbs
  @breadcrumb_idx = starting_index || 0
  @subject = subject.freeze
  @subject_index = subject_index
  @extra_args = extra_args.freeze
  @captures = []
end

Instance Attribute Details

#capturesObject (readonly)

Returns the value of attribute captures.



146
147
148
# File 'lib/rpeg/captures.rb', line 146

def captures
  @captures
end

Instance Method Details

#advanceObject



194
195
196
# File 'lib/rpeg/captures.rb', line 194

def advance
  @breadcrumb_idx += 1
end

#capture_allObject



157
158
159
# File 'lib/rpeg/captures.rb', line 157

def capture_all
  push_capture until done?
end

#current_breadcrumbObject



188
189
190
191
192
# File 'lib/rpeg/captures.rb', line 188

def current_breadcrumb
  raise "No available breadcrumb" if done?

  @breadcrumbs[@breadcrumb_idx]
end

#done?Boolean

Returns:

  • (Boolean)


184
185
186
# File 'lib/rpeg/captures.rb', line 184

def done?
  @breadcrumb_idx == @breadcrumbs.size
end

#extract_one_string(what) ⇒ Object

This is LPEG’s addonestring (lpcap.c)

/* ** Evaluates a capture and adds its first value to buffer ‘b’; returns ** whether there was a value */

We just return the value, or nil if there isn’t one



533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
# File 'lib/rpeg/captures.rb', line 533

def extract_one_string(what)
  case current_breadcrumb.kind
  when Capture::STRING
    extract_string_capture
  when Capture::SUBST
    extract_subst_capture
  else
    n = push_capture
    return nil if n.zero?

    pop(n - 1) # just leave one
    res = pop
    # LPEG tests the type of this value with lua_isstring, which returns 1 if the value is a string or a number.
    raise "invalid #{what} value (a #{res.class})" unless res.is_a?(String) || res.is_a?(Numeric)

    res.to_s
  end
end

#extract_string_captureObject

This is LPEG’s stringcap (lpcap.c)

We return the result



485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
# File 'lib/rpeg/captures.rb', line 485

def extract_string_capture
  fmt = current_breadcrumb.data.must_be_a(String)
  the_str_caps = str_caps
  result = +""
  idx = -1
  loop do
    idx += 1
    break if idx >= fmt.length

    if fmt[idx] != "%"
      result << fmt[idx]
      next
    end

    idx += 1
    unless ('0'..'9').cover?(fmt[idx])
      result << fmt[idx]
      next
    end

    capture_index = fmt[idx].to_i
    raise "invalid capture index (#{capture_index})" if capture_index > the_str_caps.size - 1

    str_cap = the_str_caps[capture_index]
    if str_cap.isstring
      result << @subject[(str_cap.subject_start)...(str_cap.subject_end)].join
      next
    end

    cs_index = @breadcrumb_idx
    @breadcrumb_idx = the_str_caps[capture_index].breadcrumb_idx
    val = extract_one_string("capture") # lpeg's addonestring, but return instead of appending to b
    raise "no values in capture index #{capture_index}" unless val

    result << val
    @breadcrumb_idx = cs_index
  end
  result
end

#extract_subst_captureObject

This is LPEG’s substcap (lpcap.c)



457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
# File 'lib/rpeg/captures.rb', line 457

def extract_subst_capture
  breadcrumb = current_breadcrumb
  curr = breadcrumb.subject_index
  result = +""
  if breadcrumb.full?
    result = @subject[curr, breadcrumb.size - 1].join
  else
    advance # skip open
    until current_breadcrumb.close?
      nxt = current_breadcrumb.subject_index
      result << @subject[curr, nxt - curr].join
      if (match = extract_one_string("replacement"))
        result << match
        curr = prev_end_index
      else
        # no capture index
        curr = nxt
      end
    end
    result << @subject[curr, current_breadcrumb.subject_index - curr].join
  end
  advance
  result
end

#indexObject



198
199
200
# File 'lib/rpeg/captures.rb', line 198

def index
  @breadcrumb_idx
end

#index=(val) ⇒ Object



202
203
204
# File 'lib/rpeg/captures.rb', line 202

def index=(val)
  @breadcrumb_idx = val
end

#munge_last!(num) ⇒ Object

partially rotate the captures to make what is currently the final value the n-th from last value. For example, if @captures is currently [0, 1, 2, 3, 4], then calling munge_last(3) makes it [0, 1, 4, 2, 3]. Now 4 (previously the last value) is third from last. When n == 1 this is a no-op



692
693
694
695
696
697
698
699
700
701
# File 'lib/rpeg/captures.rb', line 692

def munge_last!(num)
  return if num == 1
  raise "Bad munge argument" unless num.positive?
  raise "Not enough values in array to munge it" if num > @captures.size

  tail = @captures.pop(num)
  last = tail.pop
  @captures << last
  @captures += tail
end

#pop(num = nil) ⇒ Object

Pop the top capture off and return it, erroring if there isn’t one.

If the argument is given we pop off the last n captures as a chunk (not one by one) and return them in an array.

nil might be in the stack, so we need to count rather than simply check pop for truthiness



171
172
173
174
175
176
177
178
179
180
181
182
# File 'lib/rpeg/captures.rb', line 171

def pop(num = nil)
  if num
    raise "Cannot pop off a negative number of elements" if num.negative?
    raise "There are not #{num} captures to pop" if num > @captures.size

    @captures.pop(num)
  else
    raise "There is not a capture to pop" unless @captures.size.positive?

    @captures.pop
  end
end

#prev_end_indexObject

The ending subject index of the previous breadcrumb



207
208
209
210
211
# File 'lib/rpeg/captures.rb', line 207

def prev_end_index
  raise "No previous breadcrumb" unless @breadcrumb_idx.positive?

  @breadcrumbs[@breadcrumb_idx - 1].end_index
end

#push(cap) ⇒ Object

push a captured value



162
163
164
# File 'lib/rpeg/captures.rb', line 162

def push(cap)
  @captures << cap
end

#push_fold_captureObject

This is LPEG’s foldcap (lpcap.c)



366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
# File 'lib/rpeg/captures.rb', line 366

def push_fold_capture
  raise "no initial value for fold capture" if current_breadcrumb.full?

  fn = current_breadcrumb.data.must_be
  advance

  if current_breadcrumb.close? || (n = push_capture).zero?
    raise "no initial value for fold capture"
  end

  # discard all but one capture. This is the first value for the fold accumulator
  pop(n - 1)
  acc = pop
  until current_breadcrumb.close?
    n = push_capture
    acc = fn.call(acc, *pop(n))
  end
  advance # skip close
  push acc
  1
end

#push_function_captureObject

This is LPEG’s functioncap (lpcap.c)



412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
# File 'lib/rpeg/captures.rb', line 412

def push_function_capture
  proc = current_breadcrumb.data.must_be_a(Proc) # get the proc to call
  n = push_nested_captures # get the nested captures...
  args = pop(n) # ...pop them
  result = proc.call(*args) # ... and pass them to the proc
  # the results, if any, are the capture values
  #
  # The natural thing to do here would be result = Array(result) and just enumerate them to push onto the capture stack. BUT,
  # sometimes proc will return a Hash (when building a grammar in RE, for example) and Array({x: 1}) = [[:x, 1]], which is not
  # what we want. At root, the issue is that Lua is better than Ruby at distinguishing between a function that returns multiple
  # value and one that returns a single value that is an array. The following appears to be what we want, and remember that we
  # need to write capture functions that are careful to distinguish between returning [1,2,3] (multiple captures) and [[1,2,3]]
  # (single capture that is an array).
  #
  # Another gotcha: a function that returns nil does not give a capture, while one that returns [nil] has captured the single
  # value nil.
  #
  # TODO: consider whether the Maybe monad would help here. It would help distinguish between [1,2,3] and [[1,2,3]] more cleanly,
  # but not with "no capture" vs "nil was captured", since typically Maybe(nil) = None().
  if result.is_a?(Array)
    result.each { |cap| push cap }
    result.size
  elsif result
    push result
    1
  else
    0
  end
end

#push_nested_captures(add_extra: false) ⇒ Object

See pushnestedcaptures in lpcap.c

/* ** Push on the Lua stack all values generated by nested captures inside ** the current capture. Returns number of values pushed. ‘addextra’ ** makes it push the entire match after all captured values. The ** entire match is pushed also if there are no other nested values, ** so the function never returns zero. */

We append what we find to the capture state and return their number.

Code is closely based on the LPEG code.



292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
# File 'lib/rpeg/captures.rb', line 292

def push_nested_captures(add_extra: false)
  open_capture = current_breadcrumb
  advance

  if open_capture.full?
    cpos = open_capture.subject_index
    match_len = open_capture.size - 1
    match_range = cpos...(cpos + match_len)
    push @subject[match_range].join
    return 1
  end

  count = 0
  count += push_capture until current_breadcrumb.close? # Nested captures

  # We have reached our matching close
  close_capture = current_breadcrumb
  advance
  if add_extra || count.zero?
    match_range = (open_capture.subject_index)...(close_capture.subject_index)
    push @subject[match_range].join
    count += 1
  end
  count.must_be.positive?
  count
end

#push_num_captureObject

This is LPEG’s numcap (lpcap.c)



395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
# File 'lib/rpeg/captures.rb', line 395

def push_num_capture
  idx = current_breadcrumb.data
  if idx.zero?
    # skip them all
    seek_next!
    return 0
  end

  n = push_nested_captures
  raise "no capture '#{idx}" if n < idx

  vals = pop(n) # pop them off
  push vals[idx - 1] # push back the one we want
  1
end

#push_one_nested_valueObject

Push nested values and then pop off all but one



389
390
391
392
# File 'lib/rpeg/captures.rb', line 389

def push_one_nested_value
  n = push_nested_captures
  pop(n - 1)
end

#push_query_captureObject

This is LPEG’s querycap (lpcap.c)



443
444
445
446
447
448
449
450
451
452
453
454
# File 'lib/rpeg/captures.rb', line 443

def push_query_capture
  hash = current_breadcrumb.data.must_be_a(Hash)
  push_one_nested_value
  query_key = pop # pop it
  result = hash[query_key]
  if result
    push(result)
    1
  else
    0 # no result
  end
end

#push_table_captureObject

This is LPEG’s tablecap (lpcap.h)

Instead of always producing a Hash, potentially with integer keys 0, 1, 2, …, we produce an array when there are no named groups to worry about.

  • TODO: reconsider this. Arrays are nicer than Hashes in this case but client code might not know which class to expect, especially when getting captures from a complicated pattern.

    • At first I always returned a Hash, with numeric keys 0, 1, 2, … for the anonymous captures and name keys for the others. But this felt clunky, especially when we want to, say, join the anonymous arguments into a string.

    • Maybe we should return a hash with an :anonymous key giving the array of anonymous captures, or something like that.

Experimental: return a TableCapture instance



330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
# File 'lib/rpeg/captures.rb', line 330

def push_table_capture
  if current_breadcrumb.full?
    # Empty table
    push []
    advance
    return 1
  end

  advance # move past the open capture
  named_results = {}
  indexed_results = []
  next_index = 0
  until current_breadcrumb.close?
    breadcrumb = current_breadcrumb
    if breadcrumb.kind == Capture::GROUP && breadcrumb.data
      # named group. We only keep track of the *first* value in the group
      push_one_nested_value
      value = pop
      named_results[breadcrumb.data] = value
    else
      # not a named group
      # k is the number we just got. We pop them back and put them in our result object
      k = push_capture
      (0..(k - 1)).to_a.reverse_each do |i|
        indexed_results[next_index + i] = pop
      end
      next_index += k
    end
  end
  advance # skip the close entry

  push Capture::TableCapture.new(named_results, indexed_results)
  1
end

#run_time_captureObject

In LPEG this logic is split between the main VM loop (lpvm.c) and the runtimecap function (lpcap.c). As noted above, the LPEG code is complicated by the need to manage references to objects living on the Lua stack to avoid C-side memory leaks. We don’t have to worry about such things

We start at the CLOSE_RUN_TIME breadcrumb and

- change the CLOSE_RUN_TIME to a regular CLOSE
- find the matching OPEN and grab the Proc that we need to call
- push the nested captures and immediately pop them
- pass the necessary arguments to proc: the subject, the current posistion, and the just-popped captures
- clear out all the breadcrumbs after the OPEN

We return a [bc, result] pair.

- bc is the new "breadcrumb count" for the VM, as we discard the existing captures for the RunTime grouping.
- result is the result of the Proc call


566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
# File 'lib/rpeg/captures.rb', line 566

def run_time_capture
  seek_matching_open!
  current_breadcrumb.kind.must_be(Capture::GROUP)
  open_cap_idx = index

  proc = current_breadcrumb.data.must_be_a(Proc) # get the proc to call

  @subject_as_str ||= @subject.join
  args = [@subject_as_str, @subject_index]
  n = push_nested_captures
  args += pop(n) # prepare arguments for the function
  result = Array(proc.call(*args)) # ... and pass them to the proc

  [open_cap_idx + 1, result]
end

#seek_back_ref!(group_name) ⇒ Object

Search backwards from the current breadcrumb for the start of the group capture with the given name.

If we find it the state index is updated appropriately. If we don’t find it we raise an exception.

This is LPEG’s findback() (lpcap.c)



627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
# File 'lib/rpeg/captures.rb', line 627

def seek_back_ref!(group_name)
  group_name.must_be
  while @breadcrumb_idx.positive?
    @breadcrumb_idx -= 1
    # Skip nested captures
    if current_breadcrumb.close?
      seek_matching_open!
    else
      # The opening of a capture that encloses the BACKREF. Skip it and keep going backwards
      next unless current_breadcrumb.full?
    end
    # We are at an open capture that was closed before our BACKREF
    next unless current_breadcrumb.kind == Capture::GROUP # is it a group?
    next unless current_breadcrumb.data == group_name # does it have the right name?

    # We found it!
    return
  end
  raise "back reference '#{group_name}' not found"
end

#seek_matching_open!Object

This is LPEG’s findopen (lpcap.c)

Assume we are starting from a close capture. We go back to the matching open capture.



651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
# File 'lib/rpeg/captures.rb', line 651

def seek_matching_open!
  n = 0 # number of nested closes waiting for an open
  loop do
    @breadcrumb_idx -= 1
    raise "subject index underflow in seek_open!" if @breadcrumb_idx.negative?

    if current_breadcrumb.close?
      n += 1
    elsif !current_breadcrumb.full?
      # It's an open of some sort
      return if n.zero?

      n -= 1
    end
  end
end

#seek_next!Object

This is LPEG’s nextcap (lpcap.c)

Move to the next capture



671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
# File 'lib/rpeg/captures.rb', line 671

def seek_next!
  unless current_breadcrumb.full?
    n = 0 # number of nested opens waiting for a close
    loop do
      @breadcrumb_idx += 1
      if current_breadcrumb.close?
        break if n.zero?

        n -= 1
      elsif !current_breadcrumb.full?
        n += 1
      end
    end
  end

  @breadcrumb_idx += 1
end

#str_capsObject

This is LPEG’s getstrcaps (lpcap.c) /* ** Collect values from current capture into array ‘cps’. Current ** capture must be Cstring (first call) or Csimple (recursive calls). ** (In first call, fills %0 with whole match for Cstring.) ** Returns number of elements in the array that were filled. */

We simply return the array of StrAux elements



591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
# File 'lib/rpeg/captures.rb', line 591

def str_caps
  result = []
  first_aux = StrAux.new
  first_aux.isstring = true
  first_aux.subject_start = current_breadcrumb.subject_index
  result << first_aux

  first_is_full = current_breadcrumb.full?
  unless first_is_full
    advance # move past the Open
    until current_breadcrumb.close?
      if result.size > MAX_STR_CAPS
        seek_next! # just skip it
      elsif current_breadcrumb.kind == Capture::SIMPLE
        result += str_caps # get the matches recursively
      else
        # Not a string
        aux = StrAux.new
        aux.isstring = false
        aux.breadcrumb_idx = index
        seek_next!
        result << aux
      end
    end
  end
  result[0].subject_end = current_breadcrumb.end_index
  advance # skip capture close/full capture
  result
end