Module: Jaccard

Defined in:
lib/crosslanguagespotter/jaccard.rb

Overview

Helpers to calculate the Jaccard Coefficient Index and related metrics easily.

(from Wikipedia): The Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.

The closer to 1.0 this number is, the more similar two items are.
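As a rough illustration of the definition above, the following sketch computes the coefficient by hand on plain Ruby arrays. The helper name `jaccard_by_hand` is hypothetical and is not part of this module; it uses `Array#|` for the union, which is an assumption made for brevity.

```ruby
# Hypothetical helper, for illustration only; not part of the Jaccard module.
def jaccard_by_hand(a, b)
  intersection = a & b   # elements present in both collections
  union        = a | b   # elements present in either, without duplicates
  intersection.length.to_f / union.length
end

jaccard_by_hand([1, 2, 3], [2, 3, 4]) # 2 shared of 4 distinct => 0.5
jaccard_by_hand([1, 2], [1, 2])       # identical collections  => 1.0
jaccard_by_hand([1, 2], [3, 4])       # nothing in common      => 0.0
```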

Class Method Details

.best_match(items) ⇒ Array<a, b>

Returns the pair of items whose distance is minimized.

Examples:


a = [1, 2, 3]
b = [1, 2]
c = [1, 3]
Jaccard.best_match([a, b, c]) #=> [[1, 2, 3], [1, 2]]

Parameters:

  • items (#each)

    A collection of sets of attributes.

Returns:

  • (Array<a, b>)

    The pair of attribute sets with the smallest Jaccard distance among all pairs in the input.



# File 'lib/crosslanguagespotter/jaccard.rb', line 99

def self.best_match(items)
  seen = Set.new
  matches = []

  items.each do |row|
    items.each do |col|
      next if row == col
      next if seen.include?([row, col]) || seen.include?([col, row])
      seen << [row, col]
      matches << [distance(row, col), [row, col]]
    end
  end

  matches.sort.first.last
end

.closest_to(a, others) ⇒ Object

Determines which member of others has the smallest distance from a.

Because of the implementation, if several items in others are at the same distance, the last one is returned. If this is undesirable, reverse others before calling .closest_to.
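The tie-breaking behavior described above can be sketched as follows. `JaccardSketch` is a hypothetical stand-in that copies the method bodies shown on this page (minus the argument checks) so the snippet runs on its own; it is not the library module itself.

```ruby
# Hypothetical stand-in mirroring the source listings on this page.
module JaccardSketch
  def self.coefficient(a, b)
    union = a + b
    union.uniq! if union.respond_to?(:uniq!)
    (a & b).length.to_f / union.length.to_f
  end

  def self.distance(a, b)
    1.0 - coefficient(a, b)
  end

  def self.closest_to(a, others)
    others.inject([2.0, nil]) do |memo, other|
      dist = distance(a, other)
      next memo if memo.first < dist   # strict <, so a tie replaces memo
      [dist, other]
    end.last
  end
end

target = [1, 3]
first  = [1, 2, 3]
second = [1, 2, 3]   # same contents as first, but a different object

last_wins  = JaccardSketch.closest_to(target, [first, second])
first_wins = JaccardSketch.closest_to(target, [first, second].reverse)

last_wins.equal?(second)  # true: the later of the tied candidates is kept
first_wins.equal?(first)  # true: after reversing, the earlier one is kept
```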

Examples:


a = [1, 2, 3]
b = [1, 3]
c = [1, 2, 3]
Jaccard.closest_to(b, [a, c]) #=> [1, 2, 3]
# Note that the actual instance returned will be c

Parameters:

  • a (#&, #+)

    A set of attributes

  • others (#inject)

    A collection of sets of attributes

Returns:

  • The item from others with the smallest distance to a, or nil if others is empty.



# File 'lib/crosslanguagespotter/jaccard.rb', line 78

def self.closest_to(a, others)
  others.inject([2.0, nil]) do |memo, other|
    dist = distance(a, other)
    next memo if memo.first < dist

    [dist, other]
  end.last
end

.coefficient(a, b) ⇒ Float

Calculates the Jaccard Coefficient Index.

a must implement the set intersection and union operators, #& and #+. Array and Set both implement these methods natively. The result of #+ must either already contain each element only once (as with Set) or respond to #uniq! (as with Array, whose #+ concatenates). If the union still contains duplicate elements, the result of .coefficient will be wrong.
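To see why duplicates in the union matter, here is a minimal sketch using only core Ruby and the standard `set` library. The variable names are illustrative.

```ruby
require "set"

a = [1, 2, 3, 4]
b = [1, 3, 4]

concatenated = a + b   # Array#+ concatenates: [1, 2, 3, 4, 1, 3, 4]
naive = (a & b).length.to_f / concatenated.length   # 3 / 7, too small

concatenated.uniq!     # removes duplicates in place: [1, 2, 3, 4]
fixed = (a & b).length.to_f / concatenated.length   # 3 / 4 => 0.75

# Set#+ is a true union, so no deduplication step is needed.
set_union = Set.new(a) + Set.new(b)
set_union.length       # 4
```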

Also note that the individual items in a and b must implement a sane #eql? method. ActiveRecord::Base, String, Fixnum (but not Float), Array and Hash instances all implement a correct notion of equality. Other instances might have to be checked to ensure correct behavior.
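The effect of #eql? on the intersection can be sketched with two hypothetical classes, `PlainTag` and `ValueTag`, which are not part of this library. `Array#&` compares elements using #hash and #eql?, so objects that keep the default identity-based versions never match each other.

```ruby
# Hypothetical classes, for illustration only.
class PlainTag
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

class ValueTag < PlainTag
  # Two tags count as the same element when their names match.
  def eql?(other)
    other.is_a?(ValueTag) && name == other.name
  end

  # Equal elements must also produce equal hash values.
  def hash
    name.hash
  end
end

plain = [PlainTag.new("ruby")] & [PlainTag.new("ruby")]
value = [ValueTag.new("ruby")] & [ValueTag.new("ruby")]

plain.length # 0: the default eql? compares object identity
value.length # 1: the overridden eql? and hash treat equal names as one element
```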

Examples:


a = [1, 2, 3, 4]
b = [1, 3, 4]
Jaccard.coefficient(a, b) #=> 0.75

Parameters:

  • a (#&, #+)

    A set of items

  • b (#&, #+)

    A second set of items

Returns:

  • (Float)

    The Jaccard Coefficient Index between a and b.

Raises:

  • (ArgumentError)

    If a does not implement #& or #+.

# File 'lib/crosslanguagespotter/jaccard.rb', line 34

def self.coefficient(a, b)
  raise ArgumentError, "#{a.inspect} does not implement #&" unless a.respond_to?(:&)
  raise ArgumentError, "#{a.inspect} does not implement #+" unless a.respond_to?(:+)

  intersection = a & b
  union        = a + b

  # Set does not implement #uniq or #uniq! since elements are
  # always guaranteed to be present only once. That's the only
  # reason we need to guard against that here.
  union.uniq! if union.respond_to?(:uniq!)

  intersection.length.to_f / union.length.to_f
end

.distance(a, b) ⇒ Float

Calculates the Jaccard distance, the complement of the Jaccard coefficient (1.0 minus the coefficient).

The closer to 0.0 the distance is, the more similar two items are.
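A small sketch of the relationship between the two measures follows. The helpers `sketch_coefficient` and `sketch_distance` are hypothetical stand-ins that work on arrays only; they are not the module's own methods.

```ruby
# Hypothetical stand-ins, for illustration only (arrays only).
def sketch_coefficient(a, b)
  (a & b).length.to_f / (a | b).length
end

def sketch_distance(a, b)
  1.0 - sketch_coefficient(a, b)
end

sketch_distance([1, 2, 3, 4], [1, 3, 4]) # 1.0 - 0.75 => 0.25
sketch_distance([1, 2], [1, 2])          # identical   => 0.0
sketch_distance([1, 2], [3, 4])          # disjoint    => 1.0
```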

Returns:

  • (Float)

    1.0 - #coefficient(a, b)

# File 'lib/crosslanguagespotter/jaccard.rb', line 56

def self.distance(a, b)
  1.0 - coefficient(a, b)
end