Class: KeywordProspector

Inherits:
Object
  • Object
show all
Defined in:
lib/keyword_prospector.rb

Overview

KeywordProspector takes a collection of words, and optionally their associated outputs, and builds a match tree for running matches of the keywords against provided text. While construction of the Aho-Corasick tree takes a long time when there are many keywords, matching runs in time proportional to the length of the text provided. So, even if you have tens of thousands of keywords to match against, matching will still be very fast.

Class Method Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(words = nil) ⇒ KeywordProspector

If words is provided, each word is added to the tree and the tree is initialized. Otherwise, call add for each word to place in the dictionary.



22
23
24
25
26
27
28
29
# File 'lib/keyword_prospector.rb', line 22

def initialize(words=nil)
  @start = State.new(0, 0, [])
  if(words)
    words.each{|word| add word}

    construct_fail
  end
end

Class Method Details

.filter_overlaps(results) ⇒ Object

Filters overlaps from an array of results. If two results overlap, the shorter result is removed. If both results have the same length, the second result is removed.



122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
# File 'lib/keyword_prospector.rb', line 122

def self.filter_overlaps(results)
  i = 0
  while (i < results.size-1)
    a = results[i]
    b = results[i+1]
    if a.overlap?(b)
      if (a.length < b.length)
        results.delete_at(i)
      else
        results.delete_at(i+1)
      end
    end
    i += 1
  end
end

Instance Method Details

#add(entry, output = nil) ⇒ Object

Add an entry to the tree. The optional output parameter can be any object, and will be returned when this keyword is matched. If the entry has a keywords method, it should return a collection of keywords. In this case, the output will be added for each keyword provided. If output is not provided, the entry is returned.



36
37
38
39
40
41
42
43
44
45
46
# File 'lib/keyword_prospector.rb', line 36

def add(entry, output=nil)
  output ||= entry

  if (entry.respond_to?(:keywords))
    entry.keywords.each do |keyword|
      add_internal(keyword, output)
    end
  else
    add_internal(entry, output)
  end
end

#construct_failObject

Call once after adding all entries. This constructs failure links in the tree, which allow the state machine to move a single step with every input character instead of backtracking back to the beginning of the tree when a partial match fails to match the next character.



52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
# File 'lib/keyword_prospector.rb', line 52

def construct_fail
  queue = Queue.new
  @start.values.each do |value|
    value.fail = @start
    value.fail_increment = 1
    queue.push value
  end
  
  prepare_root
  
  while !queue.empty?
    r = queue.pop
    r.keys.each do |char|
      s = r[char]
      queue.push s
      state = r.fail
      increment = 0
      while !state[char]
        increment += state.fail_increment
        state = state.fail
      end
      s.fail = state[char]
      s.fail_increment = increment
    end
  end
end

#each_pair(&block) ⇒ Object



148
149
150
151
152
153
154
# File 'lib/keyword_prospector.rb', line 148

def each_pair(&block)
  @start.keys.each do |key|
    next if @start[key] == @start

    @start[key].walk([key.chr], &block)
  end
end

#inspectObject



156
157
158
# File 'lib/keyword_prospector.rb', line 156

def inspect
  to_hash.inspect
end

#process(bytes, options = {}) ⇒ Object

Process the provided text for matches. Returns an array of Match objects. Each Match object contains the keyword matched, the start and end position in the text, and the output object specified when the keyword was added to the search tree. The end position is actually the position of the character immediately following the end of the keyword, such that end position minus start position equals the length of the keyword string.

Options:

  • :filter_overlaps - When multiple keywords overlap, filter out overlaps by choosing the longest match.



89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
# File 'lib/keyword_prospector.rb', line 89

def process(bytes, options={})
  retval = [] unless block_given?
  state = @start
  position = Position.new(0, 0)
  bytes.each_byte do |a|
    state = state.transition(a, position)
    if state.keyword && ((position.begin == 0 ||
                 KeywordProspector.word_delimiter?(bytes[position.begin-1])) &&
                 (position.end == bytes.length ||
                  KeywordProspector.word_delimiter?(bytes[position.end])))
      match = Match.new(state.keyword, position.begin, position.end, state.output)

      # do something with the found item
      if block_given?
        yield match
      else
        retval << match
      end
    end
  end

  if retval
    if (options[:filter_overlaps])
      KeywordProspector.filter_overlaps(retval)
    end
  end

  return retval
end

#to_hashObject



138
139
140
141
142
143
144
145
146
# File 'lib/keyword_prospector.rb', line 138

def to_hash
  retval = {}

  each_pair do |x,y|
    retval[x] = y
  end

  retval
end