Module: Edits::DamerauLevenshtein

Defined in:
lib/edits/damerau_levenshtein.rb

Overview

Implements the Damerau/Levenshtein distance algorithm.

Determines distance between two strings by counting edits, identifying:

• Insertion
• Deletion
• Substitution

Class Method Summary collapse

• Calculate the Damerau/Levenshtein distance of two sequences.

Class Method Details

.distance(seq1, seq2) ⇒ Integer

Calculate the Damerau/Levenshtein distance of two sequences.

Examples:

``````DamerauLevenshtein.distance("acer", "earn")# => 3
``````

Parameters:

• seq1 (String, Array)
• seq2 (String, Array)

Returns:

• (Integer)
 ``` 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88``` ```# File 'lib/edits/damerau_levenshtein.rb', line 20 def self.distance(seq1, seq2) seq1, seq2 = seq2, seq1 if seq1.length > seq2.length # array of codepoints outperforms String seq1 = seq1.codepoints if seq1.is_a? String seq2 = seq2.codepoints if seq2.is_a? String rows = seq1.length cols = seq2.length return cols if rows.zero? return rows if cols.zero? # 'infinite' edit distance for padding cost matrix. # Can be any value > max[rows, cols] inf = rows + cols # Initialize first two rows of cost matrix. # The full initial state where cols=3, rows=2 (inf=5) would be: # [[5, 5, 5, 5, 5], # [5, 0, 1, 2, 3], # [5, 1, 0, 0, 0], # [5, 2, 0, 0, 0]] matrix = [Array.new(cols + 2, inf)] matrix << 0.upto(cols).to_a.unshift(inf) # element => last row seen item_history = Hash.new(0) 1.upto(rows) do |row| # generate next row of cost matrix new_row = Array.new(cols + 2, 0) new_row[0] = inf new_row[1] = row matrix << new_row last_match_col = 0 seq1_item = seq1[row - 1] 1.upto(cols) do |col| seq2_item = seq2[col - 1] last_match_row = item_history[seq2_item] sub_cost = seq1_item == seq2_item ? 0 : 1 transposition = 1 + matrix[last_match_row][last_match_col] transposition += row - last_match_row - 1 transposition += col - last_match_col - 1 # TODO: do insertion/deletion need to be considered when # seq1_item == seq2_item ? # # substitution, deletion, insertion, transposition cost = [ matrix[row][col] + sub_cost, matrix[row][col + 1] + 1, matrix[row + 1][col] + 1, transposition ].min matrix[row + 1][col + 1] = cost last_match_col = col if sub_cost.zero? end item_history[seq1_item] = row end matrix[rows + 1][cols + 1] end```