Module: Edits::JaroWinkler

Defined in:
lib/edits/jaro_winkler.rb

Overview

Implements the Jaro-Winkler similarity algorithm.

Constant Summary

WINKLER_PREFIX_WEIGHT = 0.1

Prefix scaling factor for the Jaro-Winkler metric. Default is 0.1. Should not exceed 0.25 (the reciprocal of the maximum prefix length of 4), or the metric will leave the range 0..1.

WINKLER_THRESHOLD = 0.7

Threshold for boosting the Jaro similarity with the Winkler prefix multiplier. Default is 0.7.
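
The 0.25 ceiling follows from the maximum prefix length of 4: the boost term is l * p * (1 - Sj), so l * p has to stay at or below 1 for the result to remain within 0..1. A minimal standalone sketch, where `boost` is a hypothetical lambda written for illustration and not part of the Edits API:

```ruby
# Hypothetical helper mirroring the documented formula; not part of Edits.
boost = ->(sj, l, p) { sj + (l * p * (1 - sj)) }

sj = 0.8 # an arbitrary Jaro similarity, chosen only for illustration
l = 4    # maximum prefix length considered

at_limit = boost.call(sj, l, 0.25)  # l * p == 1, so the result reaches the top of the range
over_limit = boost.call(sj, l, 0.3) # l * p > 1, so the result overshoots the range

puts at_limit   # approximately 1.0
puts over_limit # greater than 1.0
```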

Class Method Summary

Class Method Details

.distance(seq1, seq2, threshold: WINKLER_THRESHOLD, weight: WINKLER_PREFIX_WEIGHT) ⇒ Float

Note:

Not a true distance metric: it fails to satisfy the triangle inequality.

Calculate Jaro-Winkler distance

Examples:

Edits::JaroWinkler.distance("information", "informant")
# => 0.05858585858585863

Returns:

  • (Float)

    distance, between 0.0 (identical) and 1.0 (distant)


# File 'lib/edits/jaro_winkler.rb', line 66

def self.distance(
  seq1, seq2,
  threshold: WINKLER_THRESHOLD,
  weight: WINKLER_PREFIX_WEIGHT
)
  1.0 - similarity(seq1, seq2, threshold: threshold, weight: weight)
end

.similarity(seq1, seq2, threshold: WINKLER_THRESHOLD, weight: WINKLER_PREFIX_WEIGHT) ⇒ Float

Calculate Jaro-Winkler similarity of the given sequences

Adds weight to the Jaro similarity according to the length of the common prefix, up to 4 elements, where one exists. The additional weighting is applied only when the original similarity exceeds the threshold.

Sw = Sj + (l * p * (1 - Sj))

Where Sj is the Jaro similarity, l is the common prefix length (at most 4), and p is the prefix weight
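
As a worked instance of the formula, the sketch below reproduces the documented result for "information" and "informant". The Jaro value is an assumption worked out by hand (9 matching characters, 1 transposition) rather than output taken from the library, and `winkler` is a hypothetical helper, not part of the Edits API.

```ruby
# Hypothetical helper applying the Winkler boost; not part of Edits.
def winkler(sj, prefix_length, weight: 0.1, threshold: 0.7)
  # Scores at or below the threshold are returned unchanged.
  return sj unless sj > threshold

  sj + (prefix_length * weight * (1 - sj))
end

# Jaro similarity worked out by hand: 9 matches, 1 transposition,
# sequence lengths 11 and 9.
sj = (9 / 11.0 + 9 / 9.0 + (9 - 1) / 9.0) / 3

# The two words share a longer prefix, but only 4 elements are counted.
sw = winkler(sj, 4)

puts sj.round(4) # 0.9024
puts sw.round(4) # 0.9414
```

With a Jaro score at or below the threshold, for example `winkler(0.5, 4)`, the helper returns the score unchanged.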

Examples:

Edits::JaroWinkler.similarity("information", "informant")
# => 0.9414141414141414

Parameters:

  • seq1 (String, Array)
  • seq2 (String, Array)
  • threshold (Float) (defaults to: WINKLER_THRESHOLD)

    threshold for applying Winkler prefix weighting

  • weight (Float) (defaults to: WINKLER_PREFIX_WEIGHT)

    weighting for common prefix, should not exceed 0.25

Returns:

  • (Float)

    similarity, between 0.0 (none) and 1.0 (identical)


# File 'lib/edits/jaro_winkler.rb', line 35

def self.similarity(
  seq1, seq2,
  threshold: WINKLER_THRESHOLD,
  weight: WINKLER_PREFIX_WEIGHT
)
  dj = Jaro.similarity(seq1, seq2)

  # Only boost scores that already exceed the threshold
  if dj > threshold
    # Size of the common prefix, capped at 4 and at the shorter sequence
    max_bound = [seq1.length, seq2.length, 4].min

    l = 0
    l += 1 until seq1[l] != seq2[l] || l >= max_bound

    # No common prefix means no boost
    l < 1 ? dj : dj + (l * weight * (1 - dj))
  else
    dj
  end
end
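
The prefix loop above only uses `length` and `[]`, which is why both Strings and Arrays are accepted. The sketch below isolates that step in a hypothetical `common_prefix_length` helper (an illustration, not part of the Edits API) to show how the cap of 4 and the shorter sequence bound the result.

```ruby
# Hypothetical standalone helper mirroring the prefix loop; not part of Edits.
def common_prefix_length(seq1, seq2, max: 4)
  # Never look past the shorter sequence or the cap.
  bound = [seq1.length, seq2.length, max].min

  # Count leading positions where both sequences hold equal elements.
  (0...bound).take_while { |i| seq1[i] == seq2[i] }.length
end

puts common_prefix_length("information", "informant") # 4 (capped)
puts common_prefix_length("abc", "abd")               # 2
puts common_prefix_length(%w[a b c], %w[a x c])       # 1
puts common_prefix_length("abc", "xyz")               # 0
```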