Module: Edits::DamerauLevenshtein

Defined in:
lib/edits/damerau_levenshtein.rb

Overview

Implements the Damerau/Levenshtein distance algorithm.

Determines distance between two strings by counting edits, identifying:

  • Insertion
  • Deletion
  • Substitution
  • Adjacent transposition

Class Method Summary collapse

Class Method Details

.distance(seq1, seq2) ⇒ Integer

Calculate the Damerau/Levenshtein distance of two sequences.

Examples:

DamerauLevenshtein.distance("acer", "earn")# => 3

Parameters:

  • seq1 (String, Array)
  • seq2 (String, Array)

Returns:

  • (Integer)

20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
# File 'lib/edits/damerau_levenshtein.rb', line 20

def self.distance(seq1, seq2)
  seq1, seq2 = seq2, seq1 if seq1.length > seq2.length

  # array of codepoints outperforms String
  seq1 = seq1.codepoints if seq1.is_a? String
  seq2 = seq2.codepoints if seq2.is_a? String

  rows = seq1.length
  cols = seq2.length
  return cols if rows.zero?
  return rows if cols.zero?

  # 'infinite' edit distance for padding cost matrix.
  # Can be any value > max[rows, cols]
  inf = rows + cols

  # Initialize first two rows of cost matrix.
  # The full initial state where cols=3, rows=2 (inf=5) would be:
  #   [[5, 5, 5, 5, 5],
  #    [5, 0, 1, 2, 3],
  #    [5, 1, 0, 0, 0],
  #    [5, 2, 0, 0, 0]]
  matrix = [Array.new(cols + 2, inf)]
  matrix << 0.upto(cols).to_a.unshift(inf)

  # element => last row seen
  item_history = Hash.new(0)

  1.upto(rows) do |row|
    # generate next row of cost matrix
    new_row = Array.new(cols + 2, 0)
    new_row[0] = inf
    new_row[1] = row
    matrix << new_row

    last_match_col = 0
    seq1_item = seq1[row - 1]

    1.upto(cols) do |col|
      seq2_item = seq2[col - 1]
      last_match_row = item_history[seq2_item]

      sub_cost = seq1_item == seq2_item ? 0 : 1

      transposition = 1 + matrix[last_match_row][last_match_col]
      transposition += row - last_match_row - 1
      transposition += col - last_match_col - 1

      # TODO: do insertion/deletion need to be considered when
      # seq1_item == seq2_item ?
      #
      # substitution, deletion, insertion, transposition
      cost = [
        matrix[row][col] + sub_cost,
        matrix[row][col + 1] + 1,
        matrix[row + 1][col] + 1,
        transposition
      ].min

      matrix[row + 1][col + 1] = cost

      last_match_col = col if sub_cost.zero?
    end

    item_history[seq1_item] = row
  end

  matrix[rows + 1][cols + 1]
end