Module: Babushka::Levenshtein
 Defined in:
 lib/levenshtein/levenshtein.rb
Overview
The Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so called edit distance). The Levenshtein distance between two sequences is given by the minimum number of operations needed to transform one sequence into the other, where an operation is an insertion, deletion, or substitution of a single element.
More information about the Levenshtein distance algorithm: en.wikipedia.org/wiki/Levenshtein_distance .
Constant Summary collapse
 VERSION =
"0.2.0"
Class Method Summary collapse

.distance(s1, s2, threshold = nil) ⇒ Object
Returns the Levenshtein distance between two sequences.

.distance_fast_or_slow(s1, s2, threshold) ⇒ Object
:nodoc:.

.levenshtein_distance_slow(s1, s2, threshold) ⇒ Object
:nodoc:.

.normalized_distance(s1, s2, threshold = nil) ⇒ Object
Returns the Levenshtein distance as a number between 0.0 and 1.0.
Class Method Details
.distance(s1, s2, threshold = nil) ⇒ Object
Returns the Levenshtein distance between two sequences.
The two sequences can be two strings, two arrays, or two other objects. Strings, arrays and arrays of strings are handled with optimized (very fast) C code. All other sequences are handled with generic (fast) C code.
The sequences should respond to :length and :[] and all objects in the sequences (as returned by []) should response to :==.
71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 
# File 'lib/levenshtein/levenshtein.rb', line 71 def self.distance(s1, s2, threshold=nil) s1, s2 = s2, s1 if s1.length > s2.length # s1 is the short one; s2 is the long one. # Handle some basic circumstances. return 0 if s1 == s2 return s2.length if s1.length == 0 if threshold return nil if (s2.lengths1.length) >= threshold a1, a2 = nil, nil a1, a2 = s1, s2 if s1.respond_to?(:) and s2.respond_to?(:) a1, a2 = s1.scan(/./), s2.scan(/./) if s1.respond_to?(:scan) and s2.respond_to?(:scan) if a1 and a2 return nil if (a1a2).length >= threshold return nil if (a2a1).length >= threshold end end distance_fast_or_slow(s1, s2, threshold) end 
.distance_fast_or_slow(s1, s2, threshold) ⇒ Object
:nodoc:
95 96 97 98 99 100 101 
# File 'lib/levenshtein/levenshtein.rb', line 95 def self.distance_fast_or_slow(s1, s2, threshold) # :nodoc: if respond_to?(:levenshtein_distance_fast) levenshtein_distance_fast(s1, s2, threshold) # Implemented in C. else levenshtein_distance_slow(s1, s2, threshold) # Implemented in Ruby. end end 
.levenshtein_distance_slow(s1, s2, threshold) ⇒ Object
:nodoc:
103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 
# File 'lib/levenshtein/levenshtein.rb', line 103 def self.levenshtein_distance_slow(s1, s2, threshold) # :nodoc: row = (0..s1.length).to_a 1.upto(s2.length) do y prow = row row = [y] 1.upto(s1.length) do x row[x] = [prow[x]+1, row[x1]+1, prow[x1]+(s1[x1]==s2[y1] ? 0 : 1)].min end # Stop analysing this sequence as soon as the best possible # result for this sequence is bigger than the best result so far. # (The minimum value in the next row will be equal to or greater # than the minimum value in this row.) return nil if threshold and row.min >= threshold end row[1] end 
.normalized_distance(s1, s2, threshold = nil) ⇒ Object
Returns the Levenshtein distance as a number between 0.0 and 1.0. It's basically the Levenshtein distance divided by the length of the longest sequence.
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 
# File 'lib/levenshtein/levenshtein.rb', line 43 def self.normalized_distance(s1, s2, threshold=nil) s1, s2 = s2, s1 if s1.length > s2.length # s1 is the short one; s2 is the long one. if s2.length == 0 0.0 # Since s1.length < s2.length, s1 must be empty as well. else if threshold if d = self.distance(s1, s2, (threshold*s2.length+1).to_i) d.to_f/s2.length else nil end else self.distance(s1, s2).to_f/s2.length end end end 