Module: Jaccard
Defined in: lib/crosslanguagespotter/jaccard.rb
Overview
Helpers to calculate the Jaccard Coefficient Index and related metrics easily.
(from Wikipedia): The Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.
The closer to 1.0 this number is, the more similar two items are.
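As a quick illustration of the definition above, the following standalone sketch computes the coefficient for two small arrays using only core Ruby. The helper name `jaccard_sketch` is made up for this example and is not part of the module; it uses `Array#|` for the union, which may differ from how the module builds it.

```ruby
# Hypothetical helper for illustration only; not part of the Jaccard module.
def jaccard_sketch(a, b)
  intersection = a & b   # elements present in both collections
  union = a | b          # elements present in either, duplicates removed
  intersection.length.to_f / union.length
end

a = %w[ruby python java]
b = %w[ruby python go rust]

puts jaccard_sketch(a, b)  # 2 shared elements out of 5 distinct ones
puts jaccard_sketch(a, a)  # identical collections
```

Two shared elements out of five distinct ones give 2/5, and comparing a collection with itself gives 1.0.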
Class Method Summary
- .best_match(items) ⇒ Array<a, b>
  Returns the pair of items whose distance is minimized.
- .closest_to(a, others) ⇒ Object
  Determines which member of others has the smallest distance vs a.
- .coefficient(a, b) ⇒ Float
  Calculates the Jaccard Coefficient Index.
- .distance(a, b) ⇒ Float
  Calculates the Jaccard distance, the complement of the Jaccard coefficient (1.0 minus the coefficient).
Class Method Details
.best_match(items) ⇒ Array<a, b>
Returns the pair of items whose distance is minimized.
# File 'lib/crosslanguagespotter/jaccard.rb', line 99

def self.best_match(items)
  seen = Set.new
  matches = []

  items.each do |row|
    items.each do |col|
      next if row == col
      next if seen.include?([row, col]) || seen.include?([col, row])
      seen << [row, col]

      matches << [distance(row, col), [row, col]]
    end
  end

  matches.sort.first.last
end
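The idea behind best_match can be sketched on its own as follows. This is not the module's implementation: it swaps the nested loops and the seen set for `Array#combination` and `min_by`, so tie-breaking and the handling of equal items may differ. The names `distance_sketch` and `best_match_sketch` are hypothetical stand-ins so the example runs without the library.

```ruby
# Stand-in for Jaccard.distance, written out so the example runs on its own.
def distance_sketch(a, b)
  1.0 - (a & b).length.to_f / (a | b).length
end

# Hypothetical alternative to best_match: visit each unordered pair once
# and keep the pair with the smallest distance.
def best_match_sketch(items)
  items.combination(2).min_by { |x, y| distance_sketch(x, y) }
end

docs = [
  %w[a b c d],
  %w[a b c e],
  %w[x y z],
]

pair = best_match_sketch(docs)
p pair
```

Here the first two arrays share three of five distinct elements, while the third shares nothing with either, so the first two form the closest pair.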
.closest_to(a, others) ⇒ Object
Determines which member of others has the smallest distance vs a.
Because of the implementation, if multiple items from others have the same distance, the last one will be returned. If this is undesirable, reverse others before calling #closest_to.
# File 'lib/crosslanguagespotter/jaccard.rb', line 78

def self.closest_to(a, others)
  others.inject([2.0, nil]) do |memo, other|
    dist = distance(a, other)
    next memo if memo.first < dist

    [dist, other]
  end.last
end
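The tie-breaking behavior described above can be observed with a small standalone sketch. The inject logic is copied from the listing into a hypothetical method, `closest_to_sketch`, and `distance_sketch` is an assumed stand-in for the module's distance method so that the example runs without the library.

```ruby
# Stand-in for Jaccard.distance, written out so the example runs on its own.
def distance_sketch(a, b)
  1.0 - (a & b).length.to_f / (a | b).length
end

# Same inject logic as the listing, copied into a standalone method.
def closest_to_sketch(a, others)
  others.inject([2.0, nil]) do |memo, other|
    dist = distance_sketch(a, other)
    next memo if memo.first < dist  # keep memo only when strictly closer

    [dist, other]                   # on a tie, the later item replaces it
  end.last
end

target = %w[a b]
first  = %w[a c]  # shares one of three distinct elements with target
second = %w[b d]  # same distance to target as first

winner_default  = closest_to_sketch(target, [first, second])
winner_reversed = closest_to_sketch(target, [first, second].reverse)

p winner_default
p winner_reversed
```

With the candidates in their original order the later item wins the tie; reversing the list makes the earlier one win instead.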
.coefficient(a, b) ⇒ Float
Calculates the Jaccard Coefficient Index.
a must implement the set intersection and set union operators: #& and #+. Array and Set both implement these methods natively. It is expected that the results of + will either return a unique set or that it returns an object that responds to #uniq!. The results of #coefficient will be wrong if the union contains duplicate elements.
Also note that the individual items in a and b must implement a sane #eql? method. ActiveRecord::Base, String, Fixnum (but not Float), Array and Hash instances all implement a correct notion of equality. Other instances might have to be checked to ensure correct behavior.
# File 'lib/crosslanguagespotter/jaccard.rb', line 34

def self.coefficient(a, b)
  raise ArgumentError, "#{a.inspect} does not implement #&" unless a.respond_to?(:&)
  raise ArgumentError, "#{a.inspect} does not implement #+" unless a.respond_to?(:+)

  intersection = a & b
  union = a + b

  # Set does not implement #uniq or #uniq! since elements are
  # always guaranteed to be present only once. That's the only
  # reason we need to guard against that here.
  union.uniq! if union.respond_to?(:uniq!)

  intersection.length.to_f / union.length.to_f
end
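To see why the #uniq! guard matters, the sketch below copies the core of the listing into a hypothetical standalone method, `coefficient_sketch` (argument checks omitted), and runs it on Arrays and Sets. For Arrays, + concatenates and the duplicates are removed afterwards; for Sets, + already returns a union. The empty-input case is included as an observation about the arithmetic, not as documented behavior of the module.

```ruby
require 'set'

# Standalone copy of the logic in the listing, minus the argument checks.
def coefficient_sketch(a, b)
  intersection = a & b
  union = a + b
  union.uniq! if union.respond_to?(:uniq!)  # Array needs this, Set does not
  intersection.length.to_f / union.length.to_f
end

arrays = coefficient_sketch([1, 2, 3], [2, 3, 4])
sets   = coefficient_sketch(Set[1, 2, 3], Set[2, 3, 4])
empty  = coefficient_sketch([], [])

puts arrays
puts sets
puts empty.nan?
```

Both the Array and Set calls share two of four distinct elements. Two empty collections divide 0.0 by 0.0, which yields NaN in Ruby, so callers may want to handle that case before comparing results.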
.distance(a, b) ⇒ Float
Calculates the Jaccard distance, the complement of the Jaccard coefficient (1.0 minus the coefficient).
The closer to 0.0 the distance is, the more similar two items are.
# File 'lib/crosslanguagespotter/jaccard.rb', line 56

def self.distance(a, b)
  1.0 - coefficient(a, b)
end