Class: Calyx::PrefixTree
- Inherits:
-
Object
- Object
- Calyx::PrefixTree
- Defined in:
- lib/calyx/prefix_tree.rb
Instance Method Summary collapse
- #add(label, index) ⇒ Object
- #add_all(elements) ⇒ Object
- #common_prefix(a, b) ⇒ Object
-
#initialize ⇒ PrefixTree
constructor
A new instance of PrefixTree.
- #insert(label, index) ⇒ Object
-
#lookup(label) ⇒ Object
This was basically ported from the pseudocode found on Wikipedia to Ruby, with a lot of extra internal state tracking that is totally absent from most algorithmic descriptions.
Constructor Details
#initialize ⇒ PrefixTree
Returns a new instance of PrefixTree.
7 8 9 |
# File 'lib/calyx/prefix_tree.rb', line 7 def initialize @root = PrefixNode.new([], nil) end |
Instance Method Details
#add(label, index) ⇒ Object
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 |
# File 'lib/calyx/prefix_tree.rb', line 21 def add(label, index) parts = label.split(/(%)/).reject { |p| p.empty? } parts_count = parts.count # Can’t use more than one capture symbol which gives the following splits: # - ["literal"] # - ["%", "literal"] # - ["literal", "%"] # - ["literal", "%", "literal"] if parts_count > 3 raise "Too many capture patterns: #{label}" end current_node = @root parts.each_with_index do |part, i| index_slot = (i == parts_count - 1) ? index : nil is_wildcard = part == "%" matched_prefix = false current_node.children.each_with_index do |edge, j| prefix = common_prefix(edge.label, part) unless prefix.empty? matched_prefix = true if prefix == edge.label # Current prefix matches the edge label so we can continue down the # tree without mutating the current branch next_node = PrefixNode.new([], index_slot) current_node.children << PrefixEdge.new(next_node, label.delete_prefix(prefix), is_wildcard) else # We have a partial match on current edge so replace it with the new # prefix then rejoin the remaining suffix to the existing branch edge.label = edge.label.delete_prefix(prefix) prefix_node = PrefixNode.new([edge], nil) next_node = PrefixNode.new([], index_slot) prefix_node.children << PrefixEdge.new(next_node, label.delete_prefix(prefix), is_wildcard) current_node.children[j] = PrefixEdge.new(prefix_node, prefix, is_wildcard) end current_node = next_node break end end # No existing edges have a common prefix so push a new branch onto the tree # at the current level unless matched_prefix next_edge = PrefixEdge.new(PrefixNode.new([], index_slot), part, is_wildcard) current_node.children << next_edge current_node = next_edge.node end end end |
#add_all(elements) ⇒ Object
17 18 19 |
# File 'lib/calyx/prefix_tree.rb', line 17 def add_all(elements) elements.each_with_index { |el, i| add(el, i) } end |
#common_prefix(a, b) ⇒ Object
177 178 179 180 181 182 183 184 185 186 187 188 189 |
# File 'lib/calyx/prefix_tree.rb', line 177 def common_prefix(a, b) selected_prefix = "" min_index_length = a < b ? a.length : b.length index = 0 until index == min_index_length return selected_prefix if a[index] != b[index] selected_prefix += a[index] index += 1 end selected_prefix end |
#insert(label, index) ⇒ Object
11 12 13 14 15 |
# File 'lib/calyx/prefix_tree.rb', line 11 def insert(label, index) if @root.children.empty? @root.children << PrefixEdge.new(PrefixNode.new([], index), label, false) end end |
#lookup(label) ⇒ Object
This was basically ported from the pseudocode found on Wikipedia to Ruby, with a lot of extra internal state tracking that is totally absent from most algorithmic descriptions. This ends up making a real mess of the expression of the algorithm, mostly due to choices and conflicts between whether to go with the standard iterative and procedural flow of statements or use a more functional style. A mangle that speaks to the questions around portability between different languages. Is this codebase a design prototype? Is it an evolving example that should guide implementations in other languages?
The problem with code like this is that it’s a bit of a maintenance burden if not structured compactly and precisely enough to not matter and having enough tests passing that it lasts for a few years without becoming a nuisance or leading to too much nonsense.
There are several ways to implement this, some of these may work better or worse, and this might be quite different across multiple languages so what goes well in one place could suck in other places. The only way to make a good decision around it is to learn via testing and experiments.
Alternative possible implementations:
-
Regex compilation on registration, use existing legacy mapping code
-
Prefix tree, trie, radix tree/trie, compressed bitpatterns, etc
-
Split string flip, imperative list processing hacks (easier for more people to contribute?)
101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 |
# File 'lib/calyx/prefix_tree.rb', line 101 def lookup(label) current_node = @root chars_consumed = 0 chars_captured = nil label_length = label.length # Traverse the tree until reaching a leaf node or all input characters are consumed while current_node != nil && !current_node.children.empty? && chars_consumed < label_length # Candidate edge pointing to the next node to check candidate_edge = nil # Traverse from the current node down the tree looking for candidate edges current_node.children.each do |edge| # Generate a suffix based on the prefix already consumed sub_label = label[chars_consumed, label_length] # If this edge is a wildcard we check the next level of the tree if edge.wildcard? # Wildcard pattern is anchored to the end of the string so we can # consume all remaining characters and pick this as an edge candidate if edge.node.children.empty? chars_captured = label[chars_consumed, sub_label.length] chars_consumed += sub_label.length candidate_edge = edge break end # The wildcard is anchored to the start or embedded in the middle of # the string so we traverse this edge and scan the next level of the # tree with a greedy lookahead. This means we will always match as # much of the wildcard string as possible when there is a trailing # suffix that could be repeated several times within the characters # consumed by the wildcard pattern. # # For example, we expect `"te%s"` to match on `"tests"` rather than # bail out after matching the first three characters `"tes"`. edge.node.children.each do |lookahead_edge| prefix = sub_label.rindex(lookahead_edge.label) if prefix chars_captured = label[chars_consumed, prefix] chars_consumed += prefix + lookahead_edge.label.length candidate_edge = lookahead_edge break end end # We found a candidate so no need to continue checking edges break if candidate_edge else # Look for a common prefix on this current edge label and the remaining suffix if edge.label == common_prefix(edge.label, sub_label) chars_consumed += edge.label.length candidate_edge = edge break end end end if candidate_edge # Traverse to the node our edge candidate points to current_node = candidate_edge.node else # We didn’t find a possible edge candidate so bail out of the loop current_node = nil end end # In order to return a match, the following postconditions must be true: # - We are pointing to a leaf node # - We have consumed all the input characters if current_node != nil and current_node.index != nil and chars_consumed == label_length PrefixMatch.new(label, current_node.index, chars_captured) else nil end end |