Module: Edits::RestrictedEdit

Defined in:
lib/edits/restricted_edit.rb

Overview

Implements Restricted Damerau-Levenshtein distance (Optimal Alignment) algorithm.

Determines distance between two strings by counting edits, identifying:

  • Insertion
  • Deletion
  • Substitution
  • Swapped items

Class Method Summary collapse

Class Method Details

.distance(seq1, seq2) ⇒ Integer

Note:

Not a true distance metric, fails to satisfy triangle inequality.

Calculate the Restricted Damerau-Levenshtein distance (Optimal Alignment) of two sequences.

Examples:

RestrictedEdit.distance("iota", "atom")
# => 3

Parameters:

  • seq1 (String, Array)
  • seq2 (String, Array)

Returns:

  • (Integer)


23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
# File 'lib/edits/restricted_edit.rb', line 23

def self.distance(seq1, seq2)
  if seq1.length > seq2.length
    temp = seq1
    seq1 = seq2
    seq2 = temp
  end

  # array of Integer codepoints outperforms String
  seq1 = seq1.codepoints if seq1.is_a? String
  seq2 = seq2.codepoints if seq2.is_a? String

  rows = seq1.length
  cols = seq2.length
  return cols if rows.zero?
  return rows if cols.zero?

  # previous two rows of cost matrix are retained
  lastlast_row = []
  last_row = []
  # Initialize first row of cost matrix.
  # The full initial state where cols=3, rows=2 would be:
  #   [[0, 1, 2, 3],
  #    [1, 0, 0, 0],
  #    [2, 0, 0, 0]]
  curr_row = 0.upto(cols).to_a

  rows.times do |row|
    lastlast_row = last_row
    last_row = curr_row

    # generate next row of cost matrix
    curr_row = Array.new(cols + 1, 0)
    curr_row[0] = row + 1

    curr_item = seq1[row]

    cols.times do |col|
      sub_cost = curr_item == seq2[col] ? 0 : 1
      is_swap = sub_cost == 1 &&
        row.positive? && col.positive? &&
        curr_item == seq2[col - 1] &&
        seq1[row - 1] == seq2[col]

      deletion = last_row[col + 1] + 1
      insertion = curr_row[col] + 1
      substitution = last_row[col] + sub_cost

      # step cost is min of operation costs
      cost = deletion < insertion ? deletion : insertion
      cost = substitution if substitution < cost

      if is_swap
        swap = lastlast_row[col - 1] + 1
        cost = swap if swap < cost
      end

      curr_row[col + 1] = cost
    end
  end

  curr_row[cols]
end