Class: RPEG::Pattern

Inherits:
Object
  • Object
show all
Defined in:
lib/rpeg/rpeg.rb

Overview

The class representing “patterns” and containing the logic to turn them into programs for the virtual machine.

Very roughly, this is where the LPEG code in lptree.c and lpcode.c lives

Constant Summary collapse

NODE_TYPES =
%i[
  charset char any seq ordered_choice repeated not and
  ntrue nfalse grammar open_call rule call capture runtime behind
].each do |op|
  const_set op.upcase, op
end
FULL_CHAR_SET =

We assume we have UTF-8 input with no multibyte characters.

Set.new((0..255).map{ _1.chr(Encoding::UTF_8) })

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(type, left = nil, right = nil, data: nil, capture: nil) ⇒ Pattern

Left and right are subpatterns. data is other relevant data capture is used for Capture patterns



1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
# File 'lib/rpeg/rpeg.rb', line 1413

def initialize(type, left = nil, right = nil, data: nil, capture: nil)
  raise "Bad node type #{type}" unless NODE_TYPES.include?(type)

  @type = type
  @left = left
  @right = right
  @data = data
  @capture = capture
  sanity_check
  return unless type == GRAMMAR

  fix_up_grammar
  verify_grammar
end

Instance Attribute Details

#captureObject (readonly)

Returns the value of attribute capture.



354
355
356
# File 'lib/rpeg/rpeg.rb', line 354

def capture
  @capture
end

#dataObject

sometimes we need to tweak this



355
356
357
# File 'lib/rpeg/rpeg.rb', line 355

def data
  @data
end

#leftObject (readonly)

Returns the value of attribute left.



354
355
356
# File 'lib/rpeg/rpeg.rb', line 354

def left
  @left
end

#rightObject (readonly)

Returns the value of attribute right.



354
355
356
# File 'lib/rpeg/rpeg.rb', line 354

def right
  @right
end

#typeObject (readonly)

Returns the value of attribute type.



354
355
356
# File 'lib/rpeg/rpeg.rb', line 354

def type
  @type
end

Instance Method Details

#*(other) ⇒ Object

p1 * p2 means p1 followed by p2



390
391
392
393
394
395
396
397
398
399
400
401
# File 'lib/rpeg/rpeg.rb', line 390

def *(other)
  other = fix_type(other)

  # true is the identity for *
  return self if other.type == NTRUE
  return other if type == NTRUE

  # rejigger to make SEQ right-associative. I don't know that it makes a difference, but LPEG does it.
  return left * (right * other) if type == SEQ

  Pattern.new(SEQ, self, other)
end

#**(other) ⇒ Object

pat ** n means “n or more occurrences of pat” when n is non-negative, and “up to -n occurrences of pat” if n is negative.



423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
# File 'lib/rpeg/rpeg.rb', line 423

def **(other)
  n = other.must_be_a(Integer)

  if n >= 0
    raise "Pattern may match 0-length string so repetition may lead to an infinite loop" if nullable?

    patt = Pattern.new(REPEATED, self) # this repeats 0 or more times
    while n.positive?
      patt = self * patt
      n -= 1
    end
  else
    n = -n
    patt = RPEG.P(true)
    while n.positive?
      patt = self * patt + true
      n -= 1
    end
  end
  patt
end

#+(other) ⇒ Object

p1 + p2 is ordered choice: if p1 matches we match and never consider p2, otherwise try matching on p2



404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
# File 'lib/rpeg/rpeg.rb', line 404

def +(other)
  other = fix_type(other)

  if charsetlike? && other.charsetlike?
    # Take the union of the charsets
    RPEG.S(charset + other.charset)
  elsif type == NFALSE
    other
  elsif other.type == NFALSE
    self
  elsif type == ORDERED_CHOICE
    # rejigger to make this operation right-associative, giving more efficient compiled code. See Ierusalimschy 4.2
    left + (right + other)
  else
    Pattern.new(ORDERED_CHOICE, self, other)
  end
end

#+@Object

Unary “and”: pattern matches here (without consuming any input)

Ierusalimschy points out that &patt can be implemented as –patt, but there is an optimization for the VM, so we preserve it



454
455
456
# File 'lib/rpeg/rpeg.rb', line 454

def +@
  Pattern.new(AND, self)
end

#-(other) ⇒ Object

Difference is “this but not that”. So p1 - p2 matches if p1 does and p2 doesn’t. Here, p2 doesn’t consume any input but p1 does. The pattern is equivalent to -p2 * p1.

Special case: if both patterns are charsets we replace with a single charset



462
463
464
465
466
467
468
469
# File 'lib/rpeg/rpeg.rb', line 462

def -(other)
  other = fix_type(other)

  return RPEG.S(charset - other.charset) if charsetlike? && other.charsetlike?

  # Otherwise we use -p2 * p1, i.e., "p2 doesn't match here" followed by "try to match and consume p1"
  -other * self
end

#-@Object

Unary negation represents “does not match”. So -patt says that there is no match at the current position and we don’t consume any of the string



447
448
449
# File 'lib/rpeg/rpeg.rb', line 447

def -@
  Pattern.new(NOT, self)
end

#/(other) ⇒ Object

Replacement captures of various kinds

From the LPEG docs:

patt / string

  Creates a string capture. It creates a capture string based on string. The captured value is a copy of string, except that
  the character % works as an escape character: any sequence in string of the form %n, with n between 1 and 9, stands for
  the match of the n-th capture in patt. The sequence %0 stands for the whole match. The sequence %% stands for a single %.

patt / number

  Creates a numbered capture. For a non-zero number, the captured value is the n-th value captured by patt. When number is
  zero, there are no captured values.

patt / table  [for this we accept a Hash]

  Creates a query capture. It indexes the given table using as key the first value captured by patt, or the whole match if
  patt produced no value. The value at that index is the final value of the capture. If the table does not have that key,
  there is no captured value.

patt / function

  Creates a function capture. It calls the given function passing all captures made by patt as arguments, or the whole match
  if patt made no capture. The values returned by the function are the final values of the capture. In particular, if
  function returns no value, there is no captured value.


497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
# File 'lib/rpeg/rpeg.rb', line 497

def /(other)
  case other
  when String
    Pattern.new(CAPTURE, self, data: other, capture: Capture::STRING)
  when Integer
    raise "Cannot use negative number for numbered capture" if other.negative?

    Pattern.new(CAPTURE, self, data: other, capture: Capture::NUM)
  when Hash
    Pattern.new(CAPTURE, self, data: other, capture: Capture::QUERY)
  when Proc
    Pattern.new(CAPTURE, self, data: other, capture: Capture::FUNCTION)
  else
    raise "Replacement capture is not supported for #{other}"
  end
end

#call_recursive(func, default) ⇒ Object

From callrecursive in LPEG’s lpcode.c

/* ** Visit a TCall node taking care to stop recursion. If node not yet ** visited, return ‘f(sib2(tree))’, otherwise return ‘def’ (default ** value) */

This method acts as a circuit breaker for structural recursion that might otherwise get in a loop among mutually recursive grammar rules.

It’s janky, but we follow LPEG’s approach of hijacking the key field (which we call data) to keep track of the recursion



738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
# File 'lib/rpeg/rpeg.rb', line 738

def call_recursive(func, default)
  type.must_be CALL
  child.type.must_be RULE

  already_visited = :already_visited

  saved_data = @data

  if saved_data == already_visited
    default
  else
    # first time we've been here
    @data = already_visited
    result = send(func)
    @data = saved_data
    result
  end
end

#charsetObject



587
588
589
590
591
592
593
594
595
596
597
598
# File 'lib/rpeg/rpeg.rb', line 587

def charset
  raise "Pattern #{type} isn't charset-like" unless charsetlike?

  case type
  when CHARSET
    data
  when CHAR
    Set.new([data])
  when ANY
    FULL_CHAR_SET
  end
end

#charsetlike?Boolean

Pattern properties

Returns:

  • (Boolean)


583
584
585
# File 'lib/rpeg/rpeg.rb', line 583

def charsetlike?
  type == CHARSET || type == CHAR || type == ANY
end

#check_pred(pred) ⇒ Object

The is lpeg’s checkaux from lpcode.c. Comment from that function (reformatted):

/* ** Checks how a pattern behaves regarding the empty string, in one of two different ways:

** - A pattern is nullable if it can match without consuming any character; ** - A pattern is nofail if it never fails for any string (including the empty string).

** The difference is only for predicates and run-time captures; for other patterns, the two properties are equivalent. (With ** predicates, &‘a’ is nullable but not nofail. Of course, nofail => nullable.)

** These functions are all convervative in the following way: ** p is nullable => nullable(p) ** nofail(p) => p cannot fail

** The function assumes that TOpenCall is not nullable; this will be checked again when the grammar is fixed.

** Run-time captures can do whatever they want, so the result is conservative. */



631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
# File 'lib/rpeg/rpeg.rb', line 631

def check_pred(pred)
  raise "Bad check predicate #{pred}" unless %i[nullable nofail].include?(pred)

  case type
  when CHAR, CHARSET, ANY, OPEN_CALL, NFALSE
    # Not nullable; for open_call this is a blind assumption
    false
  when NTRUE, REPEATED
    true
  when NOT, BEHIND
    # can match empty, but can fail
    pred != :nofail
  when AND
    # can match empty; can fail exactly when body can
    return true if pred == :nullable

    child.check_pred(pred)
  when RUNTIME
    # can fail; match empty iff body does
    return false if pred == :nofail

    child.check_pred(pred)
  when SEQ
    left.check_pred(pred) && right.check_pred(pred)
  when ORDERED_CHOICE
    left.check_pred(pred) || right.check_pred(pred)
  when GRAMMAR
    # Strings are matched by the initial nonterminal
    child.first.check_pred(pred)
  when CALL, RULE, CAPTURE
    child.check_pred(pred)
  else
    raise "Unhandled pattern type #{type}"
  end
end

#childObject

If left is defined and right is nil - so we have a unary op - we can get child here



373
374
375
376
377
# File 'lib/rpeg/rpeg.rb', line 373

def child
  raise 'Pattern is not unary' if right

  left
end

#code(follow_set: FULL_CHAR_SET, dominating_test: nil, active_choice: false) ⇒ Object

follow_set

  • the set of first characters accepted by whatever comes after us, or the full set of characters if nothing follows us

dominating_test

  • a TEST_CHAR, TEST_CHARSET, or TEST_ANY instruction that we can assume has succeeded and which might save us a little time

  • see the tt argument sprinkled through the functions in LPEG’s lpcode.c

active_option

  • there is a CHOICE instruction still “active” in the code already generated

  • certain pattern types can take advantange of this to avoid another CHOICE instruction

  • this is called ‘opt’ in lpcode.c

NOTE: don’t cache the results as we did before, because the code depends on the arguments



932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
# File 'lib/rpeg/rpeg.rb', line 932

def code(follow_set: FULL_CHAR_SET, dominating_test: nil, active_choice: false)
  code = []
  case type
  when CHARSET
    code << charset_code(data, dominating_test)
  when CHAR
    code << char_code(data, dominating_test)
  when ANY
    code << Instruction.new(i::ANY)
  when SEQ
    code.concat seq_code(follow_set, dominating_test)
  when NTRUE
  # we always succeed, which means we don't have to do anything at all
  when NFALSE
    code << Instruction.new(i::FAIL)
  when OPEN_CALL
    # we resolved these to CALL when the grammar node was created. So if we see one now it is because it was not contained in a
    # grammar.
    raise 'OPEN_CALL node appears outside of a grammar'
  when CALL
    # This is symbolic target for now. It will be converted to a numeric offset during GRAMMAR analysis
    code << Instruction.new(i::CALL, offset: data)
  when ORDERED_CHOICE
    code.concat choice_code(follow_set, active_choice)
  when REPEATED
    code.concat repeated_code(follow_set, active_choice)
  when NOT
    code.concat not_code
  when AND
    code.concat and_code(dominating_test)
  when BEHIND
    code << Instruction.new(i::BEHIND, aux: data) if data.positive?
    code.concat child.code
  when CAPTURE
    c = child.code(follow_set:, dominating_test:)
    len = fixed_len
    if len && !child.has_captures?
      code.concat c
      code << Instruction.new(i::FULL_CAPTURE, data:, aux: { capture_length: len, kind: capture })
    else
      code << Instruction.new(i::OPEN_CAPTURE, data:, aux: { kind: capture })
      code.concat c
      code << Instruction.new(i::CLOSE_CAPTURE, aux: { kind: Capture::CLOSE })
    end
  when RUNTIME
    code << Instruction.new(i::OPEN_CAPTURE, data:, aux: { kind: Capture::GROUP })
    code.concat child.code(follow_set: FULL_CHAR_SET, dominating_test:)
    code << Instruction.new(i::CLOSE_RUN_TIME, aux: { kind: Capture::CLOSE })
  when RULE
    code = child.code(follow_set:)
    code[0] = code.first.clone
    code.first.dec = data # decorate with the nonterminal, but clone first to avoid unexpected mutations
  when GRAMMAR
    code.concat grammar_code
  else
    raise "Unhandled pattern type #{type}"
  end

  code
end

#coerce(other) ⇒ Object

This only happens if other is a Numeric type, which is annoying. See the monkeypatching below for other cases.



385
386
387
# File 'lib/rpeg/rpeg.rb', line 385

def coerce(other)
  [RPEG.P(other), self]
end

#convert_open_call_to_call!(rule, ref) ⇒ Object

Special operation when closing open calls



1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
# File 'lib/rpeg/rpeg.rb', line 1429

def convert_open_call_to_call!(rule, ref)
  raise "Cannot convert pattern to CALL" unless type == OPEN_CALL
  raise "Must give rule and nonterminal symbol to CALL pattern" unless rule && ref
  raise "Rule for CALL pattern must be a rule, got #{rule.type}" unless rule.type == RULE

  @type = CALL
  @left = rule
  @data = ref

  # We must check these again when needed rather than use the memoized values
  %i[@nullable @nofail].each do |ivar|
    remove_instance_variable(ivar) if instance_variable_defined?(ivar)
  end
end

#first_set(follow_set = FULL_CHAR_SET) ⇒ Object

LPEG’s getfirst

static int getfirst (TTree *tree, const Charset *follow, Charset *firstset) {

/* ** Computes the ‘first set’ of a pattern. ** The result is a conservative aproximation: ** match p ax -> x (for some x) ==> a belongs to first(p) ** or ** a not in first(p) ==> match p ax -> fail (for all x)

So we want to know the set of characters that can make the pattern succeed, at least on the first characters

** ** The set ‘follow’ is the first set of what follows the ** pattern (full set if nothing follows it). ** ** The function returns 0 when this resulting set can be used for ** test instructions that avoid the pattern altogether. ** A non-zero return can happen for two reasons: ** 1) match p ” -> ” ==> return has bit 1 set ** (tests cannot be used because they would always fail for an empty input); ** 2) there is a match-time capture ==> return has bit 2 set ** (optimizations should not bypass match-time captures). */

I don’t really understand what is going on here. I’m hoping it will make more sense as I port it. I think we pass in follow and return the int and firstset.



807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
# File 'lib/rpeg/rpeg.rb', line 807

def first_set(follow_set = FULL_CHAR_SET)
  case type
  when CHAR, CHARSET, ANY
    [0, charset]
  when NTRUE
    [1, follow_set.clone] # /* accepts the empty string */
  when NFALSE
    [0, Set.new]
  when ORDERED_CHOICE
    e1, first1 = left.first_set(follow_set)
    e2, first2 = right.first_set(follow_set)
    [e1 | e2, first1 | first2]
  when SEQ
    if !left.nullable?
      # /* when p1 is not nullable, p2 has nothing to contribute;
      #  return getfirst(sib1(tree), fullset, firstset); */
      left.first_set(FULL_CHAR_SET)
    else
      e2, first2 = right.first_set(follow_set)
      e1, first1 = left.first_set(first2)
      return [0, first1] if e1.zero? # /* 'e1' ensures that first can be used */
      return [2, first1] if (e1 | e2) & 2 == 2 # /* one of the children has a matchtime? */

      [e2, first1] # /* else depends on 'e2' */
    end
  when REPEATED
    _, first_cs = child.first_set(follow_set)
    [1, first_cs | follow_set] # /* accept the empty string */
  when CAPTURE, RULE
    child.first_set(follow_set)
  when GRAMMAR
    child.first.first_set(follow_set)
  when RUNTIME
    # NOTE: I don't understand this
    #
    # /* function invalidates any follow info. */
    e, first_set = child.first_set(FULL_CHAR_SET)
    if e.positive?
      [2, first_set] # /* function is not "protected"? */
    else
      [0, first_set] # /* pattern inside capture ensures first can be used */
    end
  when CALL
    child.first_set(follow_set)
  when AND
    e, first_set = child.first_set(follow_set)
    [e, first_set & follow_set]
  when NOT, BEHIND
    if type == NOT && child.charsetlike?
      [1, FULL_CHAR_SET - child.charset]
    else
      # /* instruction gives no new information */
      # /* call 'getfirst' only to check for math-time captures */
      e, = child.first_set(follow_set)
      [e | 1, follow_set] # /* always can accept the empty string */
    end
  else
    raise "Unhandled node type #{type}"
  end
end

#fix_type(other) ⇒ Object



514
515
516
517
518
# File 'lib/rpeg/rpeg.rb', line 514

def fix_type(other)
  return other if other.is_a?(Pattern)

  RPEG.P(other) # see what we can do
end

#fixed_lenObject

fixedlen from LPEG’s lpcode.h

/* ** number of characters to match a pattern (or -1 if variable) */

We return nil if the node’s matches are not all of the same length



674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
# File 'lib/rpeg/rpeg.rb', line 674

def fixed_len
  case type
  when CHARSET, CHAR, ANY
    1
  when NOT, AND, NTRUE, NFALSE, BEHIND
    0
  when REPEATED, OPEN_CALL, RUNTIME
    nil
  when CAPTURE, RULE
    child.fixed_len
  when GRAMMAR
    child.first.fixed_len # the first rule is the initial nonterminal
  when CALL
    call_recursive(:fixed_len, nil)
  when SEQ
    left_len = left.fixed_len
    return nil unless left_len

    right_len = right.fixed_len
    return nil unless right_len

    left_len + right_len
  when ORDERED_CHOICE
    left_len = left.fixed_len
    return nil unless left_len

    right_len = right.fixed_len
    right_len == left_len ? right_len : nil
  else
    raise "Unhandled node type #{type}"
  end
end

#has_captures?Boolean

From hascaptures in LPEG’s lpcode.c /* ** Check whether a pattern tree has captures */

Returns:

  • (Boolean)


761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
# File 'lib/rpeg/rpeg.rb', line 761

def has_captures?
  case type
  when CAPTURE, RUNTIME
    true
  when CALL
    call_recursive(:has_captures?, false)
  when GRAMMAR
    child.any?(&:has_captures?)
  else
    case num_children
    when 0
      false
    when 1
      child.has_captures?
    when 2
      left.has_captures? || right.has_captures?
    end
  end
end

#head_fail?Boolean

LPEG’s headfail

/* ** If ‘headfail(tree)’ true, then ‘tree’ can fail only depending on the ** next character of the subject. */

Returns:

  • (Boolean)


874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
# File 'lib/rpeg/rpeg.rb', line 874

def head_fail?
  case type
  when CHAR, CHARSET, ANY, NFALSE
    true
  when NTRUE, REPEATED, RUNTIME, NOT
    false
  when CAPTURE, RULE, AND, CALL
    child.head_fail?
  when GRAMMAR
    child.first.head_fail?
  when SEQ
    return false unless right.nofail?

    left.head_fail?
  when ORDERED_CHOICE
    left.head_fail? && right.head_fail?
  else
    raise "Unhandled node type #{type}"
  end
end

#iObject

shorthand



915
916
917
# File 'lib/rpeg/rpeg.rb', line 915

def i
  Instruction
end

#loops?Boolean

From checkloops in lptree.c

/* ** Check whether a tree has potential infinite loops */

Returns:

  • (Boolean)


712
713
714
715
716
717
718
719
720
721
722
723
724
# File 'lib/rpeg/rpeg.rb', line 712

def loops?
  return true if type == REPEATED && child.nullable?

  # /* sub-grammars already checked */, i.e., by verify_grammar
  return false if type == GRAMMAR || type == CALL

  case num_children
  when 1
    child.loops?
  when 2
    left.loops? || right.loops?
  end
end

#match(str, init = 0, *extra_args) ⇒ Object

Return the index just after the matching prefix of str or nil if there is no match

str: the string the match against init: the string index to start at, defaulting to 0 extra_args: used by Argument Captures



362
363
364
365
366
367
368
369
370
# File 'lib/rpeg/rpeg.rb', line 362

def match(str, init = 0, *extra_args)
  # Note that the program doesn't depend on the arguments so we can cache it
  @program ||= optimize_jumps(code(follow_set: FULL_CHAR_SET) + [Instruction.new(i::OP_END)])

  machine = ParsingMachine.new(@program, str, init, extra_args)
  machine.run

  return machine.captures if machine.success?
end

#need_follow?Boolean

LPEG’s needfollow (lpcode.c)

/* ** Check whether the code generation for the given tree can benefit ** from a follow set (to avoid computing the follow set when it is ** not needed) */

Returns:

  • (Boolean)


1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
# File 'lib/rpeg/rpeg.rb', line 1305

def need_follow?
  case type
  when CHAR, CHARSET, ANY, NFALSE, NTRUE, AND, NOT, RUNTIME, GRAMMAR, CALL, BEHIND
    false
  when ORDERED_CHOICE, REPEATED
    true
  when CAPTURE
    child.need_follow?
  when SEQ
    right.need_follow?
  else
    raise "Unhandled case #{type} in need_follow?"
  end
end

#nofail?Boolean

Returns:

  • (Boolean)


606
607
608
609
610
# File 'lib/rpeg/rpeg.rb', line 606

def nofail?
  return @nofail if defined? @nofail

  @nofail = check_pred(:nofail)
end

#nullable?Boolean

Returns:

  • (Boolean)


600
601
602
603
604
# File 'lib/rpeg/rpeg.rb', line 600

def nullable?
  return @nullable if defined? @nullable

  @nullable = check_pred(:nullable)
end

#num_childrenObject

TODO: consider recurising via a #children method. Then we can handle the necessary rules in a GRAMMAR just once.



896
897
898
899
900
901
902
903
904
905
906
907
908
909
# File 'lib/rpeg/rpeg.rb', line 896

def num_children
  case type
  when CHARSET, CHAR, ANY, NTRUE, NFALSE, OPEN_CALL
    0
  when REPEATED, AND, NOT, CALL, RULE, CAPTURE, RUNTIME, BEHIND
    1
  when SEQ, ORDERED_CHOICE
    2
  when GRAMMAR
    raise "#num_children isn't meaningful for GRAMMAR nodes"
  else
    raise "Unhandled pattern type #{type}"
  end
end

#to_sObject



520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
# File 'lib/rpeg/rpeg.rb', line 520

def to_s
  return @to_s if @to_s

  result = []
  do_sub_pattern = lambda do |sub_patt|
    sub_patt.to_s.split("\n").each do |line|
      result << "|  #{line}"
    end
  end

  type_s = type.to_s.capitalize

  case type
  when CHARSET
    result << "Charset: #{data.join.dump}"
  when ANY
    result << "#{type_s}: #{data}"
  when CHAR
    result << "#{type_s}: #{data.dump}"
  when NTRUE
    result << "TRUE"
  when NFALSE
    result << "FALSE"
  when OPEN_CALL
    result << "OpenCall: #{data}"
  when CALL
    result << "Call: #{data}"
  when SEQ, ORDERED_CHOICE
    result << (type == SEQ ? "Seq:" : "Ordered Choice:")
    do_sub_pattern.call(left)
    do_sub_pattern.call(right)
  when RULE
    result << "nonterminal: #{data}"
  when REPEATED, NOT, AND, BEHIND
    result << "#{type_s}: "
    do_sub_pattern.call(child)
  when CAPTURE
    result << "Capture: #{capture} #{data.inspect}"
    do_sub_pattern.call(child)
  when RUNTIME
    result << "Runtime: #{capture} #{data.inspect}"
    do_sub_pattern.call(child)
  when GRAMMAR
    result << "Grammar:"
    first = true
    child.each do |nonterminal, rule_pattern|
      prefix = "  #{nonterminal}: "
      first = true
      rule_pattern.to_s.split("\n").each do |line|
        line_prefix = first ? prefix : (" " * prefix.len)
        result << "#{line_prefix}#{line}"
      end
    end
  else
    raise "Unhandled type for to_s: #{type}"
  end

  @to_s = result.join("\n")
end

#verify_ruleObject

We check if a rule can be left-recursive, i.e., whether we can return to the rule without consuming any input. The plan is to walk the tree into subtrees whenever we see we can do so without consuming any input.

LPEG comment follows. Note that we check for nullability directly for sanity’s sake.

/* ** Check whether a rule can be left recursive; raise an error in that ** case; otherwise return 1 iff pattern is nullable. ** The return value is used to check sequences, where the second pattern ** is only relevant if the first is nullable. ** Parameter ‘nb’ works as an accumulator, to allow tail calls in ** choices. (‘nb’ true makes function returns true.) ** Parameter ‘passed’ is a list of already visited rules, ‘npassed’ ** counts the elements in ‘passed’. ** Assume ktable at the top of the stack. */



1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
# File 'lib/rpeg/rpeg.rb', line 1346

def verify_rule
  raise "verify_rule called on something that isn't a rule" unless type == RULE

  rules_seen = []

  local_rec = lambda do |pattern, num_rules_seen|
    case pattern.type
    when CHAR, CHARSET, ANY, NTRUE, NFALSE, BEHIND
    # no op
    when NOT, AND, REPEATED, CAPTURE, RUNTIME
      # nullable, so keep going
      local_rec.call(pattern.child, num_rules_seen)
    when CALL
      local_rec.call(pattern.child, num_rules_seen)
    when SEQ
      local_rec.call(pattern.left, num_rules_seen)
      # only check 2nd child if first is nullable
      local_rec.call(pattern.right, num_rules_seen) if pattern.left.nullable?
    when ORDERED_CHOICE
      # must check both children
      local_rec.call(pattern.left, num_rules_seen)
      local_rec.call(pattern.right, num_rules_seen)
    when RULE
      raise "rule '#{pattern.data}' may be left-recursive" if rules_seen[0...num_rules_seen].include?(pattern)

      num_rules_seen += 1
      rules_seen[num_rules_seen] = pattern
      local_rec.call(pattern.child, num_rules_seen)
    when GRAMMAR
    # LPEG says: /* sub-grammar cannot be left recursive */
    # But why? I guess because we would have rejected it at creation.
    else
      raise "Unhandled case #{pattern.type} in verify_rule"
    end
  end

  local_rec.call(self, 0)
end