Class: TextRank::Fingerprint

Inherits:
Object
  • Object
show all
Defined in:
lib/text_rank/fingerprint.rb

Overview

Class used to compare documents according to TextRank. A "fingerprint" represents the first N keywords (in order from most significant to least) from applying the TextRank algorithm. To compare two "fingerprints" we apply an algorithm that looks at each of the N prefixes and counts the overlap. This rewards matches of significant keywords much higher than matches of less significant keywords. But to prevent less significant keywords from being completely ignored we apply an inverse log linear transformation to each of the N prefixes.

For example, consider the following comparison:

town man empty found vs. general empty found jar

The first pass considers just the first keywords: town vs. general. As these are different, they contribute 0.

The second pass considers the first two keywords: town man vs general empty. Again, no overlap, so they contribute 0.

The third pass considers the first three keywords: town man empty vs general empty found. Here we have one overlap: empty. This contributes 1.

The fourth pass considers all, and there is two overlaps: empty & found. This contributes 2.

We can represent the overlaps as the vector [0, 0, 1, 2]. Then we will apply the inverse log linear transformation defined by:

f(x_i) = x_i / ln(i + 1) = [0, 0, 1 / ln(4), 2 / ln(5)] = [0, 0, 0.7213475204444817, 1.2426698691192237]

Finally we take the average of the transformed vector and normalize it (to ensure a final value between 0.0 and 1.0):

norm(avg(SUM f(x_i))) = norm( avg(1.9640173895637054) ) = norm( 0.49100434739092635 ) = 0.49100434739092635 / avg(SUM f(1, 2, 3, 4)) = 0.49100434739092635 / avg(7.912555793714532) = 0.49100434739092635 / 1.978138948428633 = 0.24821529740414025

Instance Attribute Summary collapse

Instance Method Summary collapse

Constructor Details

#initialize(*values) ⇒ Fingerprint

Creates a new fingerprint for comparison with another fingerprint

Parameters:

  • values (Array)

    An array of fingerprint values of any hashable type.



54
55
56
57
# File 'lib/text_rank/fingerprint.rb', line 54

def initialize(*values)
  @size = values.size
  @values = values
end

Instance Attribute Details

#sizeObject (readonly)

Returns the value of attribute size.



49
50
51
# File 'lib/text_rank/fingerprint.rb', line 49

def size
  @size
end

#valuesObject (readonly)

Returns the value of attribute values.



49
50
51
# File 'lib/text_rank/fingerprint.rb', line 49

def values
  @values
end

Instance Method Details

#similarity(other) ⇒ Number

Calculates the "similarity" between this fingerprint and another

Parameters:

  • other (Fingerprint)

    A second fingerprint to compare

Returns:

  • (Number)

    A number between 0.0 (different) and 1.0 (same)



62
63
64
65
66
67
68
69
70
# File 'lib/text_rank/fingerprint.rb', line 62

def similarity(other)
  return 1.0 if values == other.values # Short-circuit for efficiency

  sum = 0
  overlap(other).each_with_index do |overlap_value, i|
    sum += overlap_value * linear_transform[i]
  end
  sum
end