Module: RetrievalLite::TfIdfRetrieval
Defined in: lib/retrieval_lite/tfidf_retrieval.rb
Overview
Class Method Summary
- .evaluate(corpus, query) ⇒ Array<Document>
  Queries a corpus using the tf-idf ranking algorithm and cosine similarity.
- .evaluate_with_scores(corpus, query) ⇒ Hash<Document, Float>
  Queries a corpus using the tf-idf ranking algorithm and cosine similarity.
- .normalized_tfidf_weight(corpus, document, term) ⇒ Float
  Computes the normalized tf-idf weight of a term in a document.
- .tfidf_weight(corpus, document, term) ⇒ Float
  Computes the tf-idf weight of a term in a document.
Class Method Details
.evaluate(corpus, query) ⇒ Array<Document>
Queries a corpus using the tf-idf ranking algorithm and cosine similarity. Returns the documents that contain at least one query term, ordered from highest to lowest cosine similarity score.
    # File 'lib/retrieval_lite/tfidf_retrieval.rb', line 9
    def self.evaluate(corpus, query)
      evaluate_with_scores(corpus, query).keys
    end
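The ordering that .evaluate returns comes entirely from the hash built by .evaluate_with_scores. The sketch below reproduces only that ordering step on made-up data. The document names and scores are hypothetical, and plain strings stand in for RetrievalLite::Document objects.

```ruby
# Hypothetical scores; strings stand in for RetrievalLite::Document objects.
scores = { "doc_a" => 0.12, "doc_b" => 0.87, "doc_c" => 0.45 }

# Same ordering step as evaluate_with_scores: sort ascending, then reverse.
ranked = Hash[scores.sort_by { |_doc, score| score }.reverse]

# evaluate keeps only the keys; a Ruby Hash preserves insertion order,
# so the keys come back from highest to lowest score.
ranked_docs = ranked.keys
puts ranked_docs.inspect # prints ["doc_b", "doc_c", "doc_a"]
```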
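.evaluate_with_scores(corpus, query) ⇒ Hash<Document, Float>
Queries a corpus using the tf-idf ranking algorithm and cosine similarity. Same as .evaluate, but returns a hash whose keys are documents and whose values are their scores: the cosine similarity between the query's term-frequency vector and each document's tf-idf vector. Entries are ordered from highest to lowest score.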
    # File 'lib/retrieval_lite/tfidf_retrieval.rb', line 20
    def self.evaluate_with_scores(corpus, query)
      query_document = RetrievalLite::Document.new(query)
      terms = query_document.term_frequencies.keys
      query_vector = query_document.term_frequencies.values # should be in same order as keys
      documents = Set.new # ordering of documents doesn't matter right now

      # gathering only the documents that contain at least one of those terms
      terms.each do |t|
        docs_with_term = corpus.documents_with(t)
        if docs_with_term
          docs_with_term.each do |d|
            if !documents.include?(d)
              documents << d
            end
          end
        end
      end

      scores = {}
      documents.each do |document|
        document_vector = Array.new(terms.size)
        terms.each_with_index do |term, index|
          document_vector[index] = tfidf_weight(corpus, document, term)
        end
        scores[document] = RetrievalLite::Vector.cosine_similarity(query_vector, document_vector)
      end

      # order it by score in descending order
      return Hash[scores.sort_by{|key, value| value}.reverse]
    end
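The listing above delegates scoring to RetrievalLite::Vector.cosine_similarity, whose source is not shown on this page. As a rough illustration of what such a function usually computes, here is a minimal standalone sketch. The method below is a hypothetical stand-in rather than the library's implementation, and the zero-vector guard is an assumption.

```ruby
# Hypothetical stand-in for RetrievalLite::Vector.cosine_similarity.
# Returns the cosine of the angle between two equal-length numeric arrays.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.size == b.size

  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })

  # Assumption: treat a zero-length vector as having no similarity.
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end

query_vector    = [1, 1]       # raw term frequencies of a two-term query
document_vector = [2.0, 0.0]   # made-up tf-idf weights for one document

puts cosine_similarity(query_vector, document_vector)
```

With these made-up vectors the result is 1/√2 (about 0.707), because the document matches only one of the two query terms.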
.normalized_tfidf_weight(corpus, document, term) ⇒ Float
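Computes the tf-idf weight of a term in a document, divided by the Euclidean length of that term's tf-idf weights across all documents in the corpus that contain it.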
    # File 'lib/retrieval_lite/tfidf_retrieval.rb', line 74
    def self.normalized_tfidf_weight(corpus, document, term)
      length_of_vector = 0

      corpus.documents_with(term).each do |d|
        weight = tfidf_weight(corpus, d, term)
        length_of_vector += weight * weight
      end

      tfidf_weight(corpus, document, term) / Math.sqrt(length_of_vector)
    end
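To make the normalization concrete, the sketch below applies the same arithmetic to made-up numbers. The weights hash is hypothetical: it stands in for the raw tf-idf weights of a single term in each document that contains it.

```ruby
# Hypothetical raw tf-idf weights of one term, keyed by document name.
weights = { "doc_a" => 3.0, "doc_b" => 4.0 }

# Euclidean length of the term's weight vector: sqrt(3^2 + 4^2) = 5.0
length_of_vector = Math.sqrt(weights.values.sum { |w| w * w })

# Divide each raw weight by that length, as normalized_tfidf_weight does.
normalized = weights.transform_values { |w| w / length_of_vector }

puts normalized.inspect
```

After normalization the squared weights for the term sum to 1, so doc_a ends up near 0.6 and doc_b near 0.8.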
.tfidf_weight(corpus, document, term) ⇒ Float
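Computes the tf-idf weight of a term in a document: the term's frequency in the document multiplied by log(N / n_j), where N is the number of documents in the corpus and n_j is the number of documents containing term j. When n_j is 0, the method returns 0 to avoid dividing by zero.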
    # File 'lib/retrieval_lite/tfidf_retrieval.rb', line 59
    def self.tfidf_weight(corpus, document, term)
      if corpus.document_frequency(term) == 0
        return 0
      else
        return document.frequency_of(term) * Math.log(1.0 * corpus.size/(corpus.document_frequency(term)))
      end
    end
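The formula in the listing above can be tried on plain numbers without a corpus object. This is a minimal sketch: the three-argument signature is hypothetical and replaces the corpus, document, and term lookups with their numeric results.

```ruby
# Hypothetical numeric version of tfidf_weight:
#   term_frequency     - occurrences of the term in the document
#   corpus_size        - number of documents in the corpus (N)
#   document_frequency - number of documents containing the term (n_j)
def tfidf_weight(term_frequency, corpus_size, document_frequency)
  return 0 if document_frequency == 0

  # Math.log is the natural logarithm, as in the listing.
  term_frequency * Math.log(corpus_size.to_f / document_frequency)
end

puts tfidf_weight(3, 10, 2)   # 3 * ln(10 / 2)
puts tfidf_weight(3, 10, 10)  # term in every document: ln(1) = 0
puts tfidf_weight(3, 10, 0)   # term in no document: guarded, returns 0
```

Note that a term appearing in every document receives a weight of 0 regardless of how often it occurs, because log(N / N) is 0.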