Module: Edits::Levenshtein

Extended by:
Compare
Defined in:
lib/edits/levenshtein.rb

Overview

Implements Levenshtein distance algorithm.

Determines distance between two string by counting edits, identifying:

  • Insertion
  • Deletion
  • Substitution

Class Method Summary collapse

Methods included from Compare

most_similar

Class Method Details

.distance(seq1, seq2) ⇒ Integer

Note:

A true distance metric, satisfies triangle inequality.

Calculate the Levenshtein (edit) distance of two sequences.

Examples:

Levenshtein.distance("sand", "hands")
# => 2

Parameters:

  • seq1 (String, Array)
  • seq2 (String, Array)

Returns:

  • (Integer)

    distance, 0 (identical) or greater (more distant)


24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
# File 'lib/edits/levenshtein.rb', line 24

def self.distance(seq1, seq2)
  seq1, seq2 = seq2, seq1 if seq1.length > seq2.length

  # array of codepoints outperforms String
  seq1 = seq1.codepoints if seq1.is_a? String
  seq2 = seq2.codepoints if seq2.is_a? String

  rows = seq1.length
  cols = seq2.length
  return cols if rows == 0
  return rows if cols == 0

  # Initialize first row of cost matrix.
  # The full initial state where cols=3, rows=2 would be:
  #   [[0, 1, 2, 3],
  #    [1, 0, 0, 0],
  #    [2, 0, 0, 0]]
  last_row = 0.upto(cols).to_a

  rows.times do |row|
    prev_col_cost = row + 1
    seq1_item = seq1[row]

    cols.times do |col|
      # | Xs | Xd |
      # | Xi | ?  |
      # step cost is min of operation costs
      # substitution, deletion, insertion
      cost = [
        last_row[col] + (seq1_item == seq2[col] ? 0 : 1),
        last_row[col + 1] + 1,
        prev_col_cost + 1
      ].min

      # overwrite previous row as we progress
      last_row[col] = prev_col_cost
      prev_col_cost = cost
    end
    last_row[cols] = prev_col_cost
  end

  last_row[cols]
end

.distance_with_max(seq1, seq2, max) ⇒ Integer

Calculate the Levenshtein (edit) distance of two sequences, bounded by a maximum value.

Examples:

Edits::Levenshtein.distance("cloud", "crayon")
# => 5
Edits::Levenshtein.distance_with_max("cloud", "crayon", 2)
# => 2

Parameters:

  • seq1 (String, Array)
  • seq2 (String, Array)
  • max (Integer)

    maximum distance

Returns:

  • (Integer)

    distance, from 0 (identical) to max (more distant)


80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
# File 'lib/edits/levenshtein.rb', line 80

def self.distance_with_max(seq1, seq2, max)
  seq1, seq2 = seq2, seq1 if seq1.length > seq2.length

  rows = seq1.length
  cols = seq2.length
  return cols > max ? max : cols if rows == 0
  return rows > max ? max : rows if cols == 0
  return max if (cols - rows) >= max

  # array of codepoints outperforms String
  seq1 = seq1.codepoints if seq1.is_a? String
  seq2 = seq2.codepoints if seq2.is_a? String

  # 'infinite' edit distance for padding cost matrix.
  # Can be any value > max[rows, cols]
  inf = cols + 1

  # retain previous row of cost matrix
  last_row = 0.upto(cols).to_a

  rows.times do |row|
    # Ukkonen cut-off
    min_col = row > max ? row - max : 0
    max_col = row + max
    max_col = cols - 1 if max_col > cols - 1

    prev_col_cost = min_col == 0 ? row + 1 : inf
    seq1_item = seq1[row]
    diagonal = cols - rows + row

    min_col.upto(max_col) do |col|
      return max if diagonal == col && last_row[col] >= max

      # | Xs | Xd |
      # | Xi | ?  |
      # substitution, deletion, insertion
      cost = [
        last_row[col] + (seq1_item == seq2[col] ? 0 : 1),
        last_row[col + 1] + 1,
        prev_col_cost + 1
      ].min

      # overwrite previous row as we progress
      last_row[col] = prev_col_cost
      prev_col_cost = cost
    end

    last_row[cols] = prev_col_cost
  end

  last_row[cols] > max ? max : last_row[cols]
end