Module: Edits::RestrictedEdit

Extended by:
Compare
Defined in:
lib/edits/restricted_edit.rb

Overview

Implements Restricted Damerau-Levenshtein distance (Optimal Alignment) algorithm.

Determines distance between two strings by counting edits, identifying:

  • Insertion
  • Deletion
  • Substitution
  • Adjacent transposition

This variant is restricted by the condition that no sub-string is edited more than once.

Class Method Summary collapse

Methods included from Compare

most_similar

Class Method Details

.distance(seq1, seq2) ⇒ Integer

Note:

Not a true distance metric, fails to satisfy triangle inequality.

Calculate the Restricted Damerau-Levenshtein distance (Optimal Alignment) of two sequences.

Examples:

RestrictedEdit.distance("iota", "atom")
# => 3

Parameters:

  • seq1 (String, Array)
  • seq2 (String, Array)

Returns:

  • (Integer)

28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
# File 'lib/edits/restricted_edit.rb', line 28

def self.distance(seq1, seq2)
  seq1, seq2 = seq2, seq1 if seq1.length > seq2.length

  # array of codepoints outperforms String
  seq1 = seq1.codepoints if seq1.is_a? String
  seq2 = seq2.codepoints if seq2.is_a? String

  rows = seq1.length
  cols = seq2.length
  return cols if rows.zero?
  return rows if cols.zero?

  # retain previous two rows of cost matrix
  lastlast_row = []
  last_row = []

  # Initialize first row of cost matrix.
  # The full initial state where cols=3, rows=2 would be:
  #   [[0, 1, 2, 3],
  #    [1, 0, 0, 0],
  #    [2, 0, 0, 0]]
  curr_row = 0.upto(cols).to_a

  rows.times do |row|
    # rotate row arrays
    curr_row, last_row, lastlast_row = lastlast_row, curr_row, last_row

    curr_row[0] = row + 1
    seq1_item = seq1[row]

    cols.times do |col|
      sub_cost = seq1_item == seq2[col] ? 0 : 1
      is_swap = sub_cost.positive? &&
        row.positive? && col.positive? &&
        seq1_item == seq2[col - 1] &&
        seq1[row - 1] == seq2[col]

      # | Xt |    |    |
      # |    | Xs | Xd |
      # |    | Xi | ?  |
      # step cost is min of operation costs
      # substitution, deletion, insertion, transposition
      cost = [
        last_row[col] + sub_cost,
        last_row[col + 1] + 1,
        curr_row[col] + 1
      ].min

      if is_swap
        swap = lastlast_row[col - 1] + 1
        cost = swap if swap < cost
      end

      curr_row[col + 1] = cost
    end
  end

  curr_row[cols]
end

.distance_with_max(seq1, seq2, max) ⇒ Integer

Calculate the Restricted Damerau-Levenshtein distance (Optimal Alignment) of two sequences, bounded by a maximum value.

Examples:

Edits::RestrictedEdit.distance("cloud", "crayon")
# => 5
Edits::RestrictedEdit.distance_with_max("cloud", "crayon", 2)
# => 2

Parameters:

  • seq1 (String, Array)
  • seq2 (String, Array)
  • max (Integer)

    maximum distance

Returns:

  • (Integer)

100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
# File 'lib/edits/restricted_edit.rb', line 100

def self.distance_with_max(seq1, seq2, max)
  seq1, seq2 = seq2, seq1 if seq1.length > seq2.length

  rows = seq1.length
  cols = seq2.length
  return cols > max ? max : cols if rows.zero?
  return rows > max ? max : rows if cols.zero?
  return max if (cols - rows) >= max

  # array of codepoints outperforms String
  seq1 = seq1.codepoints if seq1.is_a? String
  seq2 = seq2.codepoints if seq2.is_a? String

  # 'infinite' edit distance for padding cost matrix.
  # Can be any value > max[rows, cols]
  inf = cols + 1

  # retain previous two rows of cost matrix,
  # padded with "inf" as matrix is not fully evaluated
  lastlast_row = Array.new(inf, inf)
  last_row = Array.new(inf, inf)
  curr_row = 0.upto(cols).to_a

  rows.times do |row|
    # rotate row arrays
    curr_row, last_row, lastlast_row = lastlast_row, curr_row, last_row

    # Ukkonen cut-off
    min_col = row > max ? row - max : 0
    max_col = row + max
    max_col = cols - 1 if max_col > cols - 1

    curr_row[min_col] = min_col.zero? ? row + 1 : inf
    seq1_item = seq1[row]
    diagonal = cols - rows + row

    min_col.upto(max_col) do |col|
      return max if diagonal == col && last_row[col] >= max

      sub_cost = seq1_item == seq2[col] ? 0 : 1
      is_swap = sub_cost.positive? &&
        row.positive? && col.positive? &&
        seq1_item == seq2[col - 1] &&
        seq1[row - 1] == seq2[col]

      # | Xt |    |    |
      # |    | Xs | Xd |
      # |    | Xi | ?  |
      # substitution, deletion, insertion, transposition
      cost = [
        last_row[col] + sub_cost,
        last_row[col + 1] + 1,
        curr_row[col] + 1
      ].min

      if is_swap
        swap = lastlast_row[col - 1] + 1
        cost = swap if swap < cost
      end

      curr_row[col + 1] = cost
    end
  end

  curr_row[cols] > max ? max : curr_row[cols]
end