Module: Edits::JaroWinkler

Defined in:
lib/edits/jaro_winkler.rb

Overview

Implements the Jaro-Winkler similarity algorithm.

Constant Summary

WINKLER_PREFIX_WEIGHT =

Prefix scaling factor for the Jaro-Winkler metric. Default is 0.1. Should not exceed 0.25, or the metric will leave the range 0..1.

0.1
WINKLER_THRESHOLD =

Jaro similarity threshold above which the Winkler prefix boost is applied. Default is 0.7.

0.7
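
The 0.25 limit on the prefix weight follows from the prefix length being capped at 4: at a weight of 0.25 the product of the two reaches 1, so the boosted score lands exactly on 1.0, and any larger weight pushes it past 1.0. A minimal sketch of that arithmetic, using a hypothetical `boosted` helper that is not part of this module:

```ruby
# Hypothetical helper (an assumption for illustration, not part of Edits):
# applies the Winkler adjustment to a Jaro score `sj`.
def boosted(sj, prefix_len, weight)
  sj + (prefix_len * weight * (1 - sj))
end

boosted(0.75, 4, 0.1)   # default weight: stays below 1.0
boosted(0.75, 4, 0.25)  # 4 * 0.25 = 1, so the result is exactly 1.0
boosted(0.75, 4, 0.3)   # weight too large: result exceeds 1.0
```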

Class Method Summary

Class Method Details

.distance(seq1, seq2, threshold: WINKLER_THRESHOLD, weight: WINKLER_PREFIX_WEIGHT) ⇒ Float

Note:

Not a true distance metric: it fails to satisfy the triangle inequality.

Calculate Jaro-Winkler distance, defined as 1.0 minus the Jaro-Winkler similarity.

Examples:

Edits::JaroWinkler.distance("information", "informant")
# => 0.05858585858585863

Returns:

  • (Float)

    distance, from 0.0 (identical) to 1.0 (distant)


# File 'lib/edits/jaro_winkler.rb', line 63

def self.distance(
  seq1, seq2,
  threshold: WINKLER_THRESHOLD,
  weight: WINKLER_PREFIX_WEIGHT
)
  1.0 - similarity(seq1, seq2, threshold: threshold, weight: weight)
end

.similarity(seq1, seq2, threshold: WINKLER_THRESHOLD, weight: WINKLER_PREFIX_WEIGHT) ⇒ Float

Calculate Jaro-Winkler similarity of given strings

Adds weight to the Jaro similarity according to the length of the common prefix, if any, counting at most 4 letters. The additional weighting is applied only when the Jaro similarity exceeds the threshold.

Sw = Sj + (l * p * (1 - Sj))

Where Sj is the Jaro similarity, l is the common prefix length (at most 4), and p is the prefix weight
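
As an illustration of the formula, the sketch below applies the prefix boost to a Jaro score supplied by the caller. The helper name `winkler_boost` is hypothetical and not part of this module; the Jaro score for the sample pair is worked by hand in the comments rather than computed by the library.

```ruby
# Hypothetical helper, not part of the Edits API: applies the Winkler
# prefix boost to a Jaro similarity `sj` that was computed elsewhere.
def winkler_boost(sj, seq1, seq2, threshold: 0.7, weight: 0.1)
  # Scores at or below the threshold are returned unchanged.
  return sj unless sj > threshold

  # Length of the common prefix, capped at 4 elements.
  max_prefix = [seq1.length, seq2.length, 4].min
  l = 0
  l += 1 while l < max_prefix && seq1[l] == seq2[l]

  sj + (l * weight * (1 - sj))
end

# Jaro similarity of "information" and "informant", worked by hand:
# 9 matches, 1 transposition => (9/11 + 9/9 + 8/9) / 3 = 268/297
sj = 268.0 / 297
winkler_boost(sj, "information", "informant") # approximately 0.9414
```

Here the common prefix is "info" (l = 4), so the boost is 4 * 0.1 * (1 - Sj), which agrees with the documented result for this pair. Because the helper only indexes and compares elements, it works for arrays as well as strings.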

Examples:

Edits::JaroWinkler.similarity("information", "informant")
# => 0.9414141414141414

Parameters:

  • seq1 (String, Array)
  • seq2 (String, Array)
  • threshold (Float) (defaults to: WINKLER_THRESHOLD)

    threshold for applying Winkler prefix weighting

  • weight (Float) (defaults to: WINKLER_PREFIX_WEIGHT)

    weighting for common prefix, should not exceed 0.25

Returns:

  • (Float)

    similarity, from 0.0 (none) to 1.0 (identical)


# File 'lib/edits/jaro_winkler.rb', line 35

def self.similarity(
  seq1, seq2,
  threshold: WINKLER_THRESHOLD,
  weight: WINKLER_PREFIX_WEIGHT
)

  sj = Jaro.similarity(seq1, seq2)
  return sj unless sj > threshold

  # size of common prefix, max 4
  max_bound = seq1.length > seq2.length ? seq2.length : seq1.length
  max_bound = 4 if max_bound > 4

  l = 0
  l += 1 until seq1[l] != seq2[l] || l >= max_bound

  l < 1 ? sj : sj + (l * weight * (1 - sj))
end