Class: TfIdf

Inherits:
Object
Defined in:
lib/tf-idf.rb

Overview

Tf-idf (term frequency-inverse document frequency) class implementing the weighting described at en.wikipedia.org/wiki/Tf-idf.

The library constructs an IDF corpus and stopword list either from documents specified by the client, or by reading from input files. It computes IDF for a specified term based on the corpus, or generates keywords ordered by tf-idf for a specified document.

Constant Summary

DEFAULT_IDF = 1.5

Constructor Details

#initialize(corpus_filename = nil, stopword_filename = nil, default_idf = DEFAULT_IDF) ⇒ TfIdf

Initialize the tf-idf dictionary.

If a corpus file is supplied, reads the idf dictionary from it, in the format of:

# of total documents
term: # of documents containing the term

If a stopword file is specified, reads the stopword list from it, in the format of one stopword per line.

The default_idf value (DEFAULT_IDF unless overridden) is returned when a query term is not found in the IDF corpus.
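As a rough illustration of the two file formats described above, the following sketch writes a tiny corpus file and stopword file to a temporary directory and reads them back. The file names, the sample terms and counts, and the parsing code are illustrative assumptions, not the library's own implementation.

```ruby
require "tmpdir"

# Sample contents in the documented formats (terms and counts are made up).
corpus_text = "3\nspoon: 1\nthe: 3\n"
stopword_text = "the\nand\n"

num_docs = nil
term_num_docs = {}
stopwords = []

Dir.mktmpdir do |dir|
  corpus_path = File.join(dir, "corpus.txt")       # hypothetical file name
  stopword_path = File.join(dir, "stopwords.txt")  # hypothetical file name
  File.write(corpus_path, corpus_text)
  File.write(stopword_path, stopword_text)

  # First line: total number of documents; remaining lines: "term: count".
  lines = File.readlines(corpus_path).map(&:strip)
  num_docs = lines.shift.to_i
  lines.each do |line|
    term, count = line.split(":")
    term_num_docs[term.strip] = count.strip.to_i
  end

  # Stopword file: one stopword per line.
  stopwords = File.readlines(stopword_path).map(&:strip)
end
```

After this runs, num_docs holds the document count from the first line, term_num_docs maps each term to its document count, and stopwords is a plain array of strings.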

Parameters:

  • corpus_filename (String) (defaults to: nil)

    The disk location of the IDF corpus.

  • stopword_filename (String) (defaults to: nil)

    The disk location of the stopword list.

  • default_idf (Float) (defaults to: DEFAULT_IDF)

    The value returned when a term is not found in the IDF corpus.

Raises:

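  • (RuntimeError)

    Raised with the message "Corpus not found" when corpus_filename is given but the file does not exist.

  • (RuntimeError)

    Raised with the message "Stopwords not found" when stopword_filename is given but the file does not exist.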



# File 'lib/tf-idf.rb', line 46

def initialize(corpus_filename = nil, stopword_filename = nil, default_idf = DEFAULT_IDF)
  self.num_docs = 0
  self.term_num_docs = {}
  self.stopwords = []
  self.idf_default = default_idf

  raise "Corpus not found" if corpus_filename && !File.exist?(corpus_filename)
  if corpus_filename
    # First line holds the document count; each remaining line is "term: count".
    entries = File.readlines(corpus_filename)
    self.num_docs = entries.shift.strip.to_i
    entries.each do |line|
      tokens = line.split(":")
      term = tokens[0].strip
      frequency = tokens[1].strip.to_i
      self.term_num_docs[term] = frequency
    end
  end

  raise "Stopwords not found" if stopword_filename && !File.exist?(stopword_filename)
  if stopword_filename
    # One stopword per line.
    self.stopwords = File.readlines(stopword_filename).collect { |x| x.strip }
  end
end

Instance Attribute Details

#idf_default ⇒ Float

Returns the default value returned when a term is not found in the tf-idf corpus.

Returns:

  • (Float)

    The default value returned when a term is not found in the tf-idf corpus.



# File 'lib/tf-idf.rb', line 22

def idf_default
  @idf_default
end

#num_docs ⇒ Integer

Returns the total number of documents in the tf-idf corpus.

Returns:

  • (Integer)

    The total number of documents in the tf-idf corpus.



# File 'lib/tf-idf.rb', line 13

def num_docs
  @num_docs
end

#stopwords ⇒ Array<String>

Returns an array of stopwords.

Returns:

  • (Array<String>)

    An array of stopwords.



# File 'lib/tf-idf.rb', line 19

def stopwords
  @stopwords
end

#term_num_docs ⇒ Hash<String, Integer>

Returns a map from each term to the number of documents containing it.

Returns:

  • (Hash<String, Integer>)

    A map from each term to the number of documents containing it.



# File 'lib/tf-idf.rb', line 16

def term_num_docs
  @term_num_docs
end

Class Method Details

.from_corpus(corpus_filename, default_idf = DEFAULT_IDF) ⇒ TfIdf

Convenience method for creating a TfIdf instance from a corpus file, without a stopword list.

Parameters:

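  • corpus_filename (String)

    The disk location of the IDF corpus.

  • default_idf (Float) (defaults to: DEFAULT_IDF)

    The value returned when a term is not found in the IDF corpus.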

Returns:

  • (TfIdf)

    A TfIdf instance loaded with the corpus.



# File 'lib/tf-idf.rb', line 75

def self.from_corpus(corpus_filename, default_idf = DEFAULT_IDF)
  self.new(corpus_filename, nil, default_idf)
end

Instance Method Details

#add_input_document(input) ⇒ Object

Add terms in the specified document to the IDF corpus.
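A minimal sketch of the bookkeeping this involves, assuming plain whitespace tokenization in place of get_tokens: each document increments the document count once and increments the count of every distinct term it contains once. The sample documents and variable names are illustrative only.

```ruby
num_docs = 0
doc_freq = Hash.new(0)  # term => number of documents containing the term

["the spoon the", "the fork"].each do |doc|
  num_docs += 1
  # uniq ensures a term repeated within one document is counted only once.
  doc.split.uniq.each { |term| doc_freq[term] += 1 }
end

# doc_freq is now {"the" => 2, "spoon" => 1, "fork" => 1}
```

Note that "the" appears twice in the first document but contributes only one to its document count.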

Parameters:

  • input (String)

    String representation of a document.



# File 'lib/tf-idf.rb', line 95

def add_input_document(input)
  self.num_docs += 1
  token_set = get_tokens(input).uniq
  token_set.each do |term|
    if self.term_num_docs[term]
      self.term_num_docs[term] += 1
    else
      self.term_num_docs[term] = 1
    end
  end
end

#doc_keywords(curr_doc) ⇒ Array

Retrieve terms and corresponding tf-idf for the specified document.

The returned terms are ordered by decreasing tf-idf.
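The ranking can be sketched outside the class as follows. This is a hedged, standalone approximation: the rank_keywords helper, its arguments, and the sample data are hypothetical, and whitespace splitting stands in for get_tokens. It follows the source listing in dividing each term count by the number of distinct tokens in the document.

```ruby
# Hypothetical standalone helper mirroring the steps in the source listing.
def rank_keywords(doc, num_docs, term_num_docs, default_idf = 1.5)
  tokens = doc.split
  distinct = tokens.uniq

  scores = distinct.map do |term|
    # Term frequency, normalized by the number of distinct tokens.
    tf = tokens.count(term).to_f / distinct.size
    df = term_num_docs[term]
    # Unknown terms fall back to the default IDF.
    idf = df ? Math.log((1 + num_docs).to_f / (1 + df)) : default_idf
    [term, tf * idf]
  end

  # Highest tf-idf first.
  scores.sort_by { |_, score| -score }
end

ranked = rank_keywords("the spoon the fork", 3, { "the" => 3, "fork" => 1 })
```

Here "spoon" is absent from the corpus and so receives the default IDF, which puts it first; "the" occurs in every document, so its IDF and tf-idf are 0 and it ranks last.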

Parameters:

  • curr_doc (String)

    String representation of an existing document.

Returns:

  • (Array<Array<String, Float>>)

    Term/tf-idf pairs ordered by decreasing tf-idf.



# File 'lib/tf-idf.rb', line 164

def doc_keywords(curr_doc)
  tfidf = {}

  tokens = self.get_tokens(curr_doc)
  token_set = tokens.uniq
  token_set_sz = token_set.count
  
  token_set.each do |term|
    mytf = tokens.count(term).to_f / token_set_sz
    myidf = self.idf(term)
    tfidf[term] = mytf * myidf
  end

  sort_by_tfidf(tfidf)
end

#get_tokens(input) ⇒ Array<String>

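Breaks a string into tokens. This implementation splits the input on whitespace and keeps each chunk that contains an HTML-like tag or at least one word character, apostrophe, @, or #. Clients may wish to override this behaviour with their own tokenization strategy.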
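To make the behaviour concrete, the sketch below reuses the pattern from the source listing in a standalone method and contrasts it with one possible replacement strategy. Both method names are hypothetical, and the alternative tokenizer is only an example of what a client override might do.

```ruby
# Same approach as the source listing: split on whitespace, then keep the
# chunks that match the pattern anywhere. Punctuation stays attached.
def whitespace_tokens(input)
  input.split.select { |chunk| chunk =~ /<a.*?\/a>|<[^\>]*>|[\w'@#]+/ }
end

# Hypothetical alternative: lowercase the text and extract only word-like runs.
def word_tokens(input)
  input.downcase.scan(/[\w'@#]+/)
end

text = "Hello, world! -- @ruby"
kept = whitespace_tokens(text)   # => ["Hello,", "world!", "@ruby"]
words = word_tokens(text)        # => ["hello", "world", "@ruby"]
```

The first form keeps "Hello," and "world!" with their punctuation, so "Hello," and "hello" would count as different terms; a stricter tokenizer such as the second avoids that.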

Parameters:

  • input (String)

    String representation of a document.

Returns:

  • (Array<String>)

    A list of tokens.



# File 'lib/tf-idf.rb', line 86

def get_tokens(input)
  # str.split().collect{|x| x if x =~ /[A-Za-z]+/}.compact
  input.split.select{|x| x =~ /<a.*?\/a>|<[^\>]*>|[\w'@#]+/}
end

#idf(term) ⇒ Float

Retrieves the IDF for the specified term.

This is computed with:

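natural logarithm of ((1 + number of documents in corpus) divided by
                      (1 + number of documents containing this term)).

Stopwords always get an IDF of 0, and terms missing from the corpus get idf_default.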
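As a small worked example, the expression used in the method's source (which adds 1 to both counts before dividing) can be evaluated on its own. The smoothed_idf helper and the sample counts are hypothetical and exist only for illustration.

```ruby
# Hypothetical helper: the same expression as in the source listing.
def smoothed_idf(num_docs, docs_with_term)
  Math.log((1 + num_docs).to_f / (1 + docs_with_term))
end

common = smoothed_idf(9, 9)  # term in every document: log(10 / 10) = 0.0
rare   = smoothed_idf(9, 1)  # term in one document:   log(10 / 2) = log(5)
```

A term that appears in every document scores 0, and the score grows as the term becomes rarer.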

Parameters:

  • term (String)

    A term in the IDF corpus.

Returns:

  • (Float)

    The IDF for the specified term.



# File 'lib/tf-idf.rb', line 144

def idf(term)
  if self.stopwords.include?(term)
    return 0.0
  end
          
  if self.term_num_docs[term].nil?
    return self.idf_default
  end

  return Math.log((1 + self.num_docs).to_f / 
                  (1 + self.term_num_docs[term]))
end

#save_corpus_to_file(idf_filename, stopword_filename, stopword_percentage_threshold = 0.01) ⇒ Object

Saves the tf-idf corpus and the stopword list to the specified files.

A word is written as a stopword if it occurs in more than stopword_percentage_threshold * num_docs documents. A threshold of 0.4 means that the word must occur in more than 40% of the documents.
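The selection rule can be sketched on an in-memory hash. The counts and the threshold value below are made up for illustration, and the variable names are assumptions rather than part of the library.

```ruby
num_docs = 10
term_num_docs = { "the" => 10, "spoon" => 2, "and" => 5 }
threshold = 0.4  # fraction of documents, not a percentage

# Keep the terms whose document count exceeds threshold * num_docs (here 4.0).
stopwords = term_num_docs.select { |_, count| count > threshold * num_docs }.keys
# stopwords is ["the", "and"]
```

With the default threshold of 0.01 and a small corpus, nearly every term exceeds the cutoff, so a larger value is likely more useful when only a handful of documents have been added.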

Parameters:

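  • idf_filename (String)

    Path of the file the IDF corpus is written to.

  • stopword_filename (String)

    Path of the file the stopword list is written to.

  • stopword_percentage_threshold (Float) (defaults to: 0.01)

    Fraction of documents (between 0 and 1) a term must exceed to be written as a stopword. A lower threshold marks more terms as stopwords.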



# File 'lib/tf-idf.rb', line 116

def save_corpus_to_file(idf_filename, stopword_filename, stopword_percentage_threshold = 0.01)
  File.open(idf_filename, "w") do |file|
    file.write("#{self.num_docs}\n")
    self.term_num_docs.each do |term, num_docs|
      file.write("#{term}: #{num_docs}\n")
    end
  end
  
  File.open(stopword_filename, "w") do |file|
    sorted_term_num_docs = sort_by_tfidf(self.term_num_docs)
    sorted_term_num_docs.each do |term, num_docs|
      # pp [term, num_docs, stopword_percentage_threshold, self.num_docs, stopword_percentage_threshold * self.num_docs, ]
      if num_docs > stopword_percentage_threshold * self.num_docs
        file.write("#{term}\n")
      end
    end
  end
end

#sort_by_tfidf(tfidf) ⇒ Array<Array<String, Float>>

Sorts terms by decreasing tf-idf.

Examples:

Sort by tf-idf

"{'and'=>0.0025, 'fork'=>0.0025, 'the'=>0.37688590118819, 'spoon'=>1.0025}" #=>
"[['spoon', 1.0025], ['the', 0.37688590118819], ['fork', 0.0025], ['and', 0.0025]]"

Returns:

  • (Array<Array<String, Float>>)

    An array of [term, tf-idf] pairs, ordered by decreasing tf-idf.



# File 'lib/tf-idf.rb', line 195

def sort_by_tfidf(tfidf)
  tfidf.sort{|a, b| b[1] <=> a[1]}
end

#to_s ⇒ String

Returns a string representation of the tf-idf corpus.

Returns:

  • (String)

    Contains the number of documents and the number of distinct terms in the corpus.



# File 'lib/tf-idf.rb', line 184

def to_s
  {:num_docs => self.num_docs, :term_num_docs => self.term_num_docs.size}.inspect
end