Class: KeywordProspector
- Inherits:
-
Object
- Object
- KeywordProspector
- Defined in:
- lib/keyword_prospector.rb
Overview
KeywordProspector takes a collection of words, and optionally their associated outputs, and builds a match tree for running matches of the keywords against provided text. While construction of the Aho-Corasick tree takes a long time when there are many keywords, matching runs in time proportional to the length of the text provided. So, even if you have tens of thousands of keywords to match against, matching will still be very fast.
Class Method Summary collapse
-
.filter_overlaps(results) ⇒ Object
Filters overlaps from an array of results.
Instance Method Summary collapse
-
#add(entry, output = nil) ⇒ Object
Add an entry to the tree.
-
#construct_fail ⇒ Object
Call once after adding all entries.
- #each_pair(&block) ⇒ Object
-
#initialize(words = nil) ⇒ KeywordProspector
constructor
If words is provided, each word is added to the tree and the tree is initialized.
- #inspect ⇒ Object
-
#process(bytes, options = {}) ⇒ Object
Process the provided text for matches.
- #to_hash ⇒ Object
Constructor Details
#initialize(words = nil) ⇒ KeywordProspector
If words is provided, each word is added to the tree and the tree is initialized. Otherwise, call add for each word to place in the dictionary.
22 23 24 25 26 27 28 29 |
# File 'lib/keyword_prospector.rb', line 22 def initialize(words=nil) @start = State.new(0, 0, []) if(words) words.each{|word| add word} construct_fail end end |
Class Method Details
.filter_overlaps(results) ⇒ Object
Filters overlaps from an array of results. If two results overlap, the shorter result is removed. If both results have the same length, the second result is removed.
122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 |
# File 'lib/keyword_prospector.rb', line 122 def self.filter_overlaps(results) i = 0 while (i < results.size-1) a = results[i] b = results[i+1] if a.overlap?(b) if (a.length < b.length) results.delete_at(i) else results.delete_at(i+1) end end i += 1 end end |
Instance Method Details
#add(entry, output = nil) ⇒ Object
Add an entry to the tree. The optional output parameter can be any object, and will be returned when this keyword is matched. If the entry has a keywords method, it should return a collection of keywords. In this case, the output will be added for each keyword provided. If output is not provided, the entry is returned.
36 37 38 39 40 41 42 43 44 45 46 |
# File 'lib/keyword_prospector.rb', line 36 def add(entry, output=nil) output ||= entry if (entry.respond_to?(:keywords)) entry.keywords.each do |keyword| add_internal(keyword, output) end else add_internal(entry, output) end end |
#construct_fail ⇒ Object
Call once after adding all entries. This constructs failure links in the tree, which allow the state machine to move a single step with every input character instead of backtracking back to the beginning of the tree when a partial match fails to match the next character.
52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 |
# File 'lib/keyword_prospector.rb', line 52 def construct_fail queue = Queue.new @start.values.each do |value| value.fail = @start value.fail_increment = 1 queue.push value end prepare_root while !queue.empty? r = queue.pop r.keys.each do |char| s = r[char] queue.push s state = r.fail increment = 0 while !state[char] increment += state.fail_increment state = state.fail end s.fail = state[char] s.fail_increment = increment end end end |
#each_pair(&block) ⇒ Object
148 149 150 151 152 153 154 |
# File 'lib/keyword_prospector.rb', line 148 def each_pair(&block) @start.keys.each do |key| next if @start[key] == @start @start[key].walk([key.chr], &block) end end |
#inspect ⇒ Object
156 157 158 |
# File 'lib/keyword_prospector.rb', line 156 def inspect to_hash.inspect end |
#process(bytes, options = {}) ⇒ Object
Process the provided text for matches. Returns an array of Match objects. Each Match object contains the keyword matched, the start and end position in the text, and the output object specified when the keyword was added to the search tree. The end position is actually the position of the character immediately following the end of the keyword, such that end position minus start position equals the length of the keyword string.
Options:
-
:filter_overlaps - When multiple keywords overlap, filter out overlaps by choosing the longest match.
89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 |
# File 'lib/keyword_prospector.rb', line 89 def process(bytes, ={}) retval = [] unless block_given? state = @start position = Position.new(0, 0) bytes.each_byte do |a| state = state.transition(a, position) if state.keyword && ((position.begin == 0 || KeywordProspector.word_delimiter?(bytes[position.begin-1])) && (position.end == bytes.length || KeywordProspector.word_delimiter?(bytes[position.end]))) match = Match.new(state.keyword, position.begin, position.end, state.output) # do something with the found item if block_given? yield match else retval << match end end end if retval if ([:filter_overlaps]) KeywordProspector.filter_overlaps(retval) end end return retval end |
#to_hash ⇒ Object
138 139 140 141 142 143 144 145 146 |
# File 'lib/keyword_prospector.rb', line 138 def to_hash retval = {} each_pair do |x,y| retval[x] = y end retval end |