Class: TfIdf
- Inherits: Object
- Defined in: lib/tf-idf.rb
Overview
Tf-idf class implementing en.wikipedia.org/wiki/Tf-idf.
The library constructs an IDF corpus and stopword list either from documents specified by the client, or by reading from input files. It computes IDF for a specified term based on the corpus, or generates keywords ordered by tf-idf for a specified document.
Constant Summary
- DEFAULT_IDF = 1.5
Instance Attribute Summary
-
#idf_default ⇒ Float
The default value returned when a term is not found in the tf-idf corpus.
-
#num_docs ⇒ Integer
The total number of documents in the tf-idf corpus.
-
#stopwords ⇒ Array<String>
An array of stopwords.
-
#term_num_docs ⇒ Hash<String, Integer>
A histogram mapping each term to the number of documents that contain it.
Class Method Summary
-
.from_corpus(corpus_filename, default_idf = DEFAULT_IDF) ⇒ TfIdf
Convenience method for creating a TfIdf instance.
Instance Method Summary
-
#add_input_document(input) ⇒ Object
Add terms in the specified document to the IDF corpus.
-
#doc_keywords(curr_doc) ⇒ Array
Retrieve terms and corresponding tf-idf for the specified document.
-
#get_tokens(input) ⇒ Array<String>
Breaks a string into tokens.
-
#idf(term) ⇒ Float
Retrieves the IDF for the specified term.
-
#initialize(corpus_filename = nil, stopword_filename = nil, default_idf = DEFAULT_IDF) ⇒ TfIdf
constructor
Initialize the tf-idf dictionary.
-
#save_corpus_to_file(idf_filename, stopword_filename, stopword_percentage_threshold = 0.01) ⇒ Object
Saves the tf-idf corpus and the stopword list to the two specified files.
-
#sort_by_tfidf(tfidf) ⇒ Array<Array<String, Float>>
Sorts terms by decreasing tf-idf.
-
#to_s ⇒ String
Returns a string representation of the tf-idf corpus.
Constructor Details
#initialize(corpus_filename = nil, stopword_filename = nil, default_idf = DEFAULT_IDF) ⇒ TfIdf
Initialize the tf-idf dictionary.
If a corpus file is supplied, reads the idf dictionary from it, in the format of:
# of total documents
term: # of documents containing the term
If a stopword file is specified, reads the stopword list from it, in the format of one stopword per line.
The default_idf value (DEFAULT_IDF unless overridden) is returned by #idf when a query term is not found in the IDF corpus.
# File 'lib/tf-idf.rb', line 46
def initialize(corpus_filename = nil, stopword_filename = nil, default_idf = DEFAULT_IDF)
  self.num_docs = 0
  self.term_num_docs = {}
  self.stopwords = []
  self.idf_default = default_idf

  # File.exist? replaces File.exists?, which was removed in Ruby 3.2.
  raise "Corpus not found" if corpus_filename && !File.exist?(corpus_filename)
  if corpus_filename
    # File.readlines replaces File.read(...).entries; String has not been
    # Enumerable since Ruby 1.9, so String#entries no longer exists.
    entries = File.readlines(corpus_filename)
    self.num_docs = entries.shift.strip.to_i
    entries.each do |line|
      tokens = line.split(":")
      term = tokens[0].strip
      frequency = tokens[1].strip.to_i
      self.term_num_docs[term] = frequency
    end
  end

  raise "Stopwords not found" if stopword_filename && !File.exist?(stopword_filename)
  if stopword_filename
    self.stopwords = File.readlines(stopword_filename).collect{|x| x.strip}
  end
end
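As an illustration of the corpus file format described above, the following sketch parses the same layout from an in-memory string instead of a file. The helper name parse_corpus and the sample data are hypothetical and are not part of the library.

```ruby
# Hypothetical helper (not part of TfIdf): parses the corpus format
# "first line = total documents, then one 'term: count' line per term".
def parse_corpus(text)
  lines = text.lines.map(&:strip).reject(&:empty?)
  num_docs = lines.shift.to_i
  term_num_docs = {}
  lines.each do |line|
    term, count = line.split(":", 2)
    term_num_docs[term.strip] = count.strip.to_i
  end
  [num_docs, term_num_docs]
end

# Assumed sample corpus: 3 documents, "apple" in 2 of them, "banana" in 1.
sample = "3\napple: 2\nbanana: 1\n"
num_docs, term_num_docs = parse_corpus(sample)
puts num_docs               # 3
puts term_num_docs.inspect  # term => document count
```

Like the constructor, this sketch splits each line on a colon, so a term that itself contains a colon would not round-trip through this format.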
Instance Attribute Details
#idf_default ⇒ Float
Returns the default value returned when a term is not found in the tf-idf corpus.
# File 'lib/tf-idf.rb', line 22
def idf_default
  @idf_default
end
#num_docs ⇒ Integer
Returns the total number of documents in the tf-idf corpus.
# File 'lib/tf-idf.rb', line 13
def num_docs
  @num_docs
end
#stopwords ⇒ Array<String>
Returns an array of stopwords.
# File 'lib/tf-idf.rb', line 19
def stopwords
  @stopwords
end
#term_num_docs ⇒ Hash<String, Integer>
Returns a histogram mapping each term to the number of documents that contain it.
# File 'lib/tf-idf.rb', line 16
def term_num_docs
  @term_num_docs
end
Class Method Details
.from_corpus(corpus_filename, default_idf = DEFAULT_IDF) ⇒ TfIdf
Convenience method for creating a TfIdf instance from a corpus file, without a stopword file.
# File 'lib/tf-idf.rb', line 75
def self.from_corpus(corpus_filename, default_idf = DEFAULT_IDF)
  self.new(corpus_filename, nil, default_idf)
end
Instance Method Details
#add_input_document(input) ⇒ Object
Add terms in the specified document to the IDF corpus.
# File 'lib/tf-idf.rb', line 95
def add_input_document(input)
  self.num_docs += 1
  token_set = get_tokens(input).uniq
  token_set.each do |term|
    if self.term_num_docs[term]
      self.term_num_docs[term] += 1
    else
      self.term_num_docs[term] = 1
    end
  end
end
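The counting step in add_input_document can be sketched in isolation as follows. The function add_document and the plain Hash it updates are hypothetical stand-ins for the instance state, and the token arrays are assumed to be already tokenized.

```ruby
# Hypothetical stand-in for the document-frequency update performed by
# add_input_document: each distinct term in a document counts once.
def add_document(term_num_docs, tokens)
  tokens.uniq.each do |term|
    term_num_docs[term] = term_num_docs.fetch(term, 0) + 1
  end
  term_num_docs
end

counts = {}
add_document(counts, %w[the cat sat on the mat])
add_document(counts, %w[the dog sat])
# "the" occurs twice in the first document but is counted once per
# document, so its document frequency is 2.
puts counts["the"]
```

Because tokens are de-duplicated before counting, the histogram records how many documents contain a term, not how often the term occurs overall.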
#doc_keywords(curr_doc) ⇒ Array
Retrieve terms and corresponding tf-idf for the specified document.
The returned terms are ordered by decreasing tf-idf. Term frequency is computed as the term's occurrence count divided by the number of distinct tokens in the document.
# File 'lib/tf-idf.rb', line 164
def doc_keywords(curr_doc)
  tfidf = {}
  tokens = self.get_tokens(curr_doc)
  token_set = tokens.uniq
  token_set_sz = token_set.count
  token_set.each do |term|
    mytf = tokens.count(term).to_f / token_set_sz
    myidf = self.idf(term)
    tfidf[term] = mytf * myidf
  end
  sort_by_tfidf(tfidf)
end
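A minimal sketch of the scoring loop in doc_keywords, assuming the IDF values are supplied as a plain Hash rather than computed from a corpus. The function keyword_scores and the sample IDF values are hypothetical.

```ruby
# Hypothetical sketch of the doc_keywords scoring loop.
# idf_for is a caller-supplied Hash standing in for calls to #idf.
def keyword_scores(tokens, idf_for, default_idf = 1.5)
  distinct = tokens.uniq
  scores = {}
  distinct.each do |term|
    # As in the listing above, tf is normalized by the number of
    # distinct tokens, not by the total token count.
    tf = tokens.count(term).to_f / distinct.size
    scores[term] = tf * idf_for.fetch(term, default_idf)
  end
  scores.sort { |a, b| b[1] <=> a[1] }
end

ranked = keyword_scores(%w[ruby ruby gem], { "ruby" => 2.0, "gem" => 1.0 })
# "ruby": (2 occurrences / 2 distinct tokens) * 2.0 = 2.0
# "gem":  (1 occurrence  / 2 distinct tokens) * 1.0 = 0.5
puts ranked.inspect
```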
#get_tokens(input) ⇒ Array<String>
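Breaks a string into tokens. This implementation splits the input on whitespace and keeps each piece that contains a word-like run (word characters, apostrophes, @ or #) or HTML tag markup. Clients may wish to override this behaviour with their own tokenization strategy.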
# File 'lib/tf-idf.rb', line 86
def get_tokens(input)
  # str.split().collect{|x| x if x =~ /[A-Za-z]+/}.compact
  input.split.select{|x| x =~ /<a.*?\/a>|<[^\>]*>|[\w'@#]+/}
end
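To make the tokenizer's behaviour concrete, the sketch below wraps the same expression in a free function. The name tokenize and the sample sentence are hypothetical.

```ruby
# Hypothetical free-function copy of the get_tokens expression.
def tokenize(input)
  input.split.select { |x| x =~ /<a.*?\/a>|<[^\>]*>|[\w'@#]+/ }
end

tokens = tokenize("Hello, world! -- it's @home")
# "--" is dropped because nothing in it matches the pattern.
# Punctuation attached to a word (as in "Hello,") is kept, because
# select filters whole whitespace-separated pieces and does not trim them.
puts tokens.inspect
```

Since attached punctuation survives, "Hello," and "Hello" would be counted as different terms; a client that needs them merged would have to normalize tokens in its own override.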
#idf(term) ⇒ Float
Retrieves the IDF for the specified term.
This is computed with:
logarithm of ((1 + number of documents in corpus) divided by
(1 + number of documents containing this term)).
Stopwords always yield 0, and terms that are not in the corpus yield idf_default.
# File 'lib/tf-idf.rb', line 144
def idf(term)
  if self.stopwords.include?(term)
    return 0
  end

  if self.term_num_docs[term].nil?
    return self.idf_default
  end

  return Math.log((1 + self.num_docs).to_f / (1 + self.term_num_docs[term]))
end
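The three cases handled by #idf (stopword, unseen term, known term) can be worked through with a small standalone sketch. The function smoothed_idf and the sample corpus are hypothetical; the arithmetic follows the listing above.

```ruby
# Hypothetical free-function version of the three cases in #idf.
def smoothed_idf(term, num_docs, term_num_docs, stopwords = [], default_idf = 1.5)
  return 0 if stopwords.include?(term)
  doc_freq = term_num_docs[term]
  return default_idf if doc_freq.nil?
  # Adding 1 to both counts keeps the ratio finite and positive.
  Math.log((1 + num_docs).to_f / (1 + doc_freq))
end

corpus = { "apple" => 1, "the" => 3 }
puts smoothed_idf("apple", 3, corpus)         # Math.log(4.0 / 2), about 0.693
puts smoothed_idf("the", 3, corpus, ["the"])  # 0, because "the" is a stopword
puts smoothed_idf("kiwi", 3, corpus)          # 1.5, the default for unseen terms
```

Note that a term present in every document gets log(1) = 0 under this formula, the same score as a stopword.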
#save_corpus_to_file(idf_filename, stopword_filename, stopword_percentage_threshold = 0.01) ⇒ Object
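Saves the tf-idf corpus to idf_filename and the stopword list to stopword_filename.
A word is written as a stopword if it occurs in more than stopword_percentage_threshold * num_docs documents. The threshold is a fraction, not a percentage: 0.4 means that the word must occur in more than 40% of the documents, and the default of 0.01 means more than 1%.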
# File 'lib/tf-idf.rb', line 116
def save_corpus_to_file(idf_filename, stopword_filename, stopword_percentage_threshold = 0.01)
  File.open(idf_filename, "w") do |file|
    file.write("#{self.num_docs}\n")
    self.term_num_docs.each do |term, num_docs|
      file.write("#{term}: #{num_docs}\n")
    end
  end

  File.open(stopword_filename, "w") do |file|
    sorted_term_num_docs = sort_by_tfidf(self.term_num_docs)
    sorted_term_num_docs.each do |term, num_docs|
      # pp [term, num_docs, stopword_percentage_threshold, self.num_docs, stopword_percentage_threshold * self.num_docs, ]
      if num_docs > stopword_percentage_threshold * self.num_docs
        file.write("#{term}\n")
      end
    end
  end
end
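The stopword selection rule can be sketched without any file I/O. The helper stopword_candidates and the sample counts are hypothetical; the comparison mirrors the one in the listing above.

```ruby
# Hypothetical helper: returns the terms that would be written to the
# stopword file, most frequent first.
def stopword_candidates(term_num_docs, num_docs, threshold = 0.01)
  sorted = term_num_docs.sort { |a, b| b[1] <=> a[1] }
  selected = sorted.select { |_term, count| count > threshold * num_docs }
  selected.map { |term, _count| term }
end

counts = { "gem" => 2, "the" => 9, "ruby" => 5 }
# With 10 documents and a threshold of 0.4, the cutoff is 4.0 documents:
# "the" (9) and "ruby" (5) qualify, "gem" (2) does not.
puts stopword_candidates(counts, 10, 0.4).inspect
```

The comparison is strict, so a term that occurs in exactly threshold * num_docs documents is not treated as a stopword.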
#sort_by_tfidf(tfidf) ⇒ Array<Array<String, Float>>
Sorts terms by decreasing tf-idf.
# File 'lib/tf-idf.rb', line 195
def sort_by_tfidf(tfidf)
  tfidf.sort{|a, b| b[1] <=> a[1]}
end
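For reference, sorting a Hash this way yields an array of [term, score] pairs. The sketch below uses a hypothetical function name, rank_desc, and made-up scores.

```ruby
# Hypothetical copy of the sort_by_tfidf expression as a free function.
def rank_desc(scores)
  # Hash#sort passes [key, value] pairs to the block; comparing b to a
  # on the value gives descending order.
  scores.sort { |a, b| b[1] <=> a[1] }
end

puts rank_desc({ "gem" => 0.5, "ruby" => 2.0, "rake" => 1.25 }).inspect
```

Ruby's sort is not guaranteed to be stable, so terms with equal scores may appear in either order.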
#to_s ⇒ String
Returns a string representation of the tf-idf corpus.
# File 'lib/tf-idf.rb', line 184
def to_s
  {:num_docs => self.num_docs, :term_num_docs => self.term_num_docs.size}.inspect
end