# Module: Jaccard

Defined in:
lib/jaccard.rb

## Overview

Helpers to calculate the Jaccard Coefficient Index and related metrics easily.

(from Wikipedia): The Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets.

The closer to 1.0 this number is, the more similar two items are.
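The definition above can be written directly with Ruby's standard `Set`. This is only a sketch of the formula; `jaccard_sketch` is a hypothetical helper name and not part of this module, whose own implementation is `Jaccard.coefficient`.

```ruby
require 'set'

# Hypothetical helper for illustration only: size of the intersection
# divided by the size of the union.
def jaccard_sketch(a, b)
  a = a.to_set
  b = b.to_set
  (a & b).size.to_f / (a | b).size
end

p jaccard_sketch([1, 2, 3, 4], [1, 3, 4]) #=> 0.75 (3 shared of 4 total)
p jaccard_sketch([1, 2], [1, 2])          #=> 1.0  (identical sets)
p jaccard_sketch([1, 2], [3, 4])          #=> 0.0  (disjoint sets)
```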

## Class Method Summary

• .best_match(items): Returns the pair of items whose distance is minimized.

• .closest_to(a, others): Determines which member of `others` has the smallest distance vs `a`.

• .coefficient(a, b): Calculates the Jaccard Coefficient Index.

• .distance(a, b): Calculates the Jaccard distance (1.0 minus the Jaccard coefficient).

## Class Method Details

### .best_match(items) ⇒ Array<a, b>

Returns the pair of items whose distance is minimized.

Examples:

``````ruby
a = [1, 2, 3]
b = [1, 2]
c = [1, 3]
Jaccard.best_match([a, b, c]) #=> [[1, 2, 3], [1, 2]]
``````

Parameters:

• items (#each)

A collection of attribute sets.

Returns:

• (Array<a, b>)

The pair of attribute sets with the smallest Jaccard distance among all pairs in `items`.

```ruby
# File 'lib/jaccard.rb', line 99

def self.best_match(items)
  seen = Set.new
  matches = []

  items.each do |row|
    items.each do |col|
      next if row == col
      next if seen.include?([row, col]) || seen.include?([col, row])

      seen << [row, col]
      matches << [distance(row, col), [row, col]]
    end
  end

  matches.sort.first.last
end
```
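Going by the listing above, the `next if row == col` guard appears to skip any two members that compare equal, not only an item paired with itself. The sketch below illustrates that reading. The module body is a stand-in abridged from the listings on this page (argument checks omitted) so the example runs on its own; where `lib/jaccard.rb` is available, require it instead.

```ruby
require 'set'

# Stand-in abridged from the source listings on this page.
module Jaccard
  def self.coefficient(a, b)
    union = a + b
    union.uniq! if union.respond_to?(:uniq!)
    (a & b).length.to_f / union.length.to_f
  end

  def self.distance(a, b)
    1.0 - coefficient(a, b)
  end

  def self.best_match(items)
    seen = Set.new
    matches = []
    items.each do |row|
      items.each do |col|
        next if row == col
        next if seen.include?([row, col]) || seen.include?([col, row])
        seen << [row, col]
        matches << [distance(row, col), [row, col]]
      end
    end
    matches.sort.first.last
  end
end

# The two [1, 2] members are at distance 0.0 from each other, but they
# compare equal, so they are never considered as a pair.
p Jaccard.best_match([[1, 2], [1, 2], [1, 2, 3]]) #=> [[1, 2], [1, 2, 3]]
```

If duplicates in the input should count as the best match, they would need to be detected before calling `.best_match`.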

### .closest_to(a, others) ⇒ Object

Determines which member of `others` has the smallest distance vs `a`.

Because a later item replaces the current best whenever its distance is equal, ties go to the last tied item in `others`. If this is undesirable, reverse `others` before calling `.closest_to`.

Examples:

``````ruby
a = [1, 2, 3]
b = [1, 3]
c = [1, 2, 3]
Jaccard.closest_to(b, [a, c]) #=> [1, 2, 3]
# Note that the actual instance returned will be c
``````
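The tie-breaking rule and the `reverse` workaround can be seen with two candidates that differ in content but sit at the same distance. This is a sketch: the module body is a stand-in abridged from the listings on this page (argument checks omitted) so the example runs on its own, and the variable names are illustrative.

```ruby
# Stand-in abridged from the source listings on this page;
# require lib/jaccard.rb instead where it is available.
module Jaccard
  def self.coefficient(a, b)
    union = a + b
    union.uniq! if union.respond_to?(:uniq!)
    (a & b).length.to_f / union.length.to_f
  end

  def self.distance(a, b)
    1.0 - coefficient(a, b)
  end

  def self.closest_to(a, others)
    others.inject([2.0, nil]) do |memo, other|
      dist = distance(a, other)
      next memo if memo.first < dist
      [dist, other]
    end.last
  end
end

target = [1, 2, 3]
first  = [1, 2] # shares 2 of 3 items with target
second = [1, 3] # also shares 2 of 3 items with target

p Jaccard.closest_to(target, [first, second])         #=> [1, 3]
p Jaccard.closest_to(target, [first, second].reverse) #=> [1, 2]
```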

Parameters:

• a (#&, #+)

A set of attributes

• others (#inject)

A collection of attribute sets

Returns:

• The item from `others` whose distance to `a` is smallest (closest to 0.0), or `nil` if `others` is empty.

```ruby
# File 'lib/jaccard.rb', line 78

def self.closest_to(a, others)
  others.inject([2.0, nil]) do |memo, other|
    dist = distance(a, other)
    next memo if memo.first < dist

    [dist, other]
  end.last
end
```

### .coefficient(a, b) ⇒ Float

Calculates the Jaccard Coefficient Index.

`a` must implement the set intersection and set union operators, `#&` and `#+`. Array and Set both provide these methods natively. The result of `#+` is expected either to contain no duplicates already (as with Set) or to respond to `#uniq!` (as with Array, where `#+` concatenates). The result of `.coefficient` will be wrong if the union contains duplicate elements.

Also note that the individual items in `a` and `b` must implement a sane #eql? method. ActiveRecord::Base, String, Fixnum (but not Float), Array and Hash instances all implement a correct notion of equality. Other instances might have to be checked to ensure correct behavior.

Examples:

``````ruby
a = [1, 2, 3, 4]
b = [1, 3, 4]
Jaccard.coefficient(a, b) #=> 0.75
``````
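The equality requirement on individual items can be illustrated with two small classes. This is a sketch only: `PlainTag`, `ValueTag` and `overlap` are hypothetical names, and `overlap` inlines the same arithmetic as `.coefficient` for arrays so the example runs without the library.

```ruby
# Hypothetical class for illustration; it inherits Object#eql? and
# Object#hash, so two instances are equal only if they are the same object.
class PlainTag
  attr_reader :name

  def initialize(name)
    @name = name
  end
end

# Struct generates ==, eql? and hash from its members.
ValueTag = Struct.new(:name)

# Same arithmetic as .coefficient for arrays, inlined so this runs alone.
def overlap(a, b)
  (a & b).length.to_f / (a + b).uniq.length
end

# Array#& and Array#uniq compare elements with #hash and #eql?.
p overlap([PlainTag.new('ruby')], [PlainTag.new('ruby')]) #=> 0.0
p overlap([ValueTag.new('ruby')], [ValueTag.new('ruby')]) #=> 1.0
```

Both pairs hold the same data, but only the value-comparable items are recognized as shared.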

Parameters:

• a (#&, #+)

A set of items

• b (#&, #+)

A second set of items

Returns:

• (Float)

The Jaccard Coefficient Index between `a` and `b`.

Raises:

• (ArgumentError)

If `a` does not implement `#&` or `#+`.

```ruby
# File 'lib/jaccard.rb', line 34

def self.coefficient(a, b)
  raise ArgumentError, "#{a.inspect} does not implement #&" unless a.respond_to?(:&)
  raise ArgumentError, "#{a.inspect} does not implement #+" unless a.respond_to?(:+)

  intersection = a & b
  union = a + b

  # Set does not implement #uniq or #uniq! since elements are
  # always guaranteed to be present only once. That's the only
  # reason we need to guard against that here.
  union.uniq! if union.respond_to?(:uniq!)

  intersection.length.to_f / union.length.to_f
end
```

### .distance(a, b) ⇒ Float

Calculates the Jaccard distance, the complement of the Jaccard coefficient.

The closer to 0.0 the distance is, the more similar two items are.

Returns:

• (Float)

`1.0 - #coefficient(a, b)`

```ruby
# File 'lib/jaccard.rb', line 56

def self.distance(a, b)
  1.0 - coefficient(a, b)
end
```
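A short usage sketch for `.distance`, covering the two extremes and one partial overlap. The module body is a stand-in abridged from the listings on this page (argument checks omitted) so the example runs on its own; where `lib/jaccard.rb` is available, require it instead.

```ruby
# Stand-in abridged from the source listings on this page.
module Jaccard
  def self.coefficient(a, b)
    union = a + b
    union.uniq! if union.respond_to?(:uniq!)
    (a & b).length.to_f / union.length.to_f
  end

  def self.distance(a, b)
    1.0 - coefficient(a, b)
  end
end

p Jaccard.distance([1, 2, 3], [1, 2, 3])    #=> 0.0  (identical)
p Jaccard.distance([1, 2, 3, 4], [1, 3, 4]) #=> 0.25 (coefficient 0.75)
p Jaccard.distance([1, 2], [3, 4])          #=> 1.0  (disjoint)
```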