Class: TfIdf

Inherits:

Object

Defined in: lib/tf-idf.rb

Overview

Tf-idf class implementing en.wikipedia.org/wiki/Tf-idf.

The library constructs an IDF corpus and stopword list either from documents specified by the client, or by reading from input files. It computes IDF for a specified term based on the corpus, or generates keywords ordered by tf-idf for a specified document.

Constant Summary

DEFAULT_IDF = 1.5

Instance Attribute Summary

#idf_default ⇒ Float

The default value returned when a term is not found in the tf-idf corpus.
#num_docs ⇒ Integer

The total number of documents in the tf-idf corpus.
#stopwords ⇒ Array<String>

An array of stopwords.
#term_num_docs ⇒ Hash<String, Integer>

A histogram mapping each term to the number of documents that contain it.

Class Method Summary

.from_corpus(corpus_filename, default_idf = DEFAULT_IDF) ⇒ TfIdf

Convenience method for creating a TfIdf instance.

Instance Method Summary

#add_input_document(input) ⇒ Object

Add terms in the specified document to the IDF corpus.
#doc_keywords(curr_doc) ⇒ Array

Retrieve terms and corresponding tf-idf for the specified document.
#get_tokens(input) ⇒ Array<String>

Breaks a string into tokens.
#idf(term) ⇒ Float

Retrieves the IDF for the specified term.
#initialize(corpus_filename = nil, stopword_filename = nil, default_idf = DEFAULT_IDF) ⇒ TfIdf constructor

Initialize the tf-idf dictionary.
#save_corpus_to_file(idf_filename, stopword_filename, stopword_percentage_threshold = 0.01) ⇒ Object

Saves the tf-idf corpus and stopword list to the specified files.
#sort_by_tfidf(tfidf) ⇒ Array<Array<String, Float>>

Sorts terms by decreasing tf-idf.
#to_s ⇒ String

Returns a string representation of the tf-idf corpus.

Constructor Details

#initialize(corpus_filename = nil, stopword_filename = nil, default_idf = DEFAULT_IDF) ⇒ `TfIdf`

Initialize the tf-idf dictionary.

If a corpus file is supplied, reads the idf dictionary from it, in the format of:

# of total documents
term: # of documents containing the term

If a stopword file is specified, reads the stopword list from it, in the format of one stopword per line.

The default_idf value (DEFAULT_IDF unless overridden) is returned when a query term is not found in the IDF corpus.

Parameters:

corpus_filename (String) (defaults to: nil) —

The disk location of the IDF corpus.
stopword_filename (String) (defaults to: nil) —

The disk location of the stopword list.
default_idf (Float) (defaults to: DEFAULT_IDF) —

The value returned when a term is not found in the IDF corpus.

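Raises:

(RuntimeError) —

Raised with the message "Corpus not found" when corpus_filename is given but the file does not exist.
(RuntimeError) —

Raised with the message "Stopwords not found" when stopword_filename is given but the file does not exist.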

# File 'lib/tf-idf.rb', line 46

def initialize(corpus_filename = nil, stopword_filename = nil, default_idf = DEFAULT_IDF)
  self.num_docs = 0
  self.term_num_docs = {}
  self.stopwords = []
  self.idf_default = default_idf

  raise "Corpus not found" if corpus_filename && !File.exist?(corpus_filename)
  if corpus_filename
    # First line holds the document count; the rest are "term: count" lines.
    entries = File.readlines(corpus_filename)
    self.num_docs = entries.shift.strip.to_i
    entries.each do |line|
      tokens = line.split(":")
      term = tokens[0].strip
      frequency = tokens[1].strip.to_i
      self.term_num_docs[term] = frequency
    end
  end

  raise "Stopwords not found" if stopword_filename && !File.exist?(stopword_filename)
  if stopword_filename
    # One stopword per line.
    self.stopwords = File.readlines(stopword_filename).collect { |x| x.strip }
  end
end
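
As a rough illustration of the corpus file format described above, the sketch below writes a small corpus to a temporary file and parses it back. `parse_corpus` is a hypothetical helper, not part of the library. It splits each line on the last colon so that terms containing ':' survive, which is an assumption that goes beyond what the constructor does.

```ruby
require "tempfile"

# Hypothetical helper (not part of TfIdf): parses the format
#   <total documents>
#   <term>: <documents containing the term>
# and returns [num_docs, counts].
def parse_corpus(path)
  lines = File.readlines(path, chomp: true).reject(&:empty?)
  num_docs = lines.shift.to_i
  counts = lines.each_with_object({}) do |line, acc|
    # rpartition splits on the LAST colon (assumption, see lead-in).
    term, _, count = line.rpartition(":")
    acc[term.strip] = count.strip.to_i
  end
  [num_docs, counts]
end

# Write a tiny sample corpus, then read it back.
corpus = Tempfile.new("corpus")
corpus.write("3\nspoon: 1\nthe: 3\n")
corpus.close

num_docs, counts = parse_corpus(corpus.path)
puts num_docs
puts counts.inspect
corpus.unlink
```

Here the file declares three documents, with "spoon" present in one of them and "the" in all three.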

Instance Attribute Details

#idf_default ⇒ `Float`

Returns the default value returned when a term is not found in the tf-idf corpus.

Returns:

(Float) —

The default value returned when a term is not found in the tf-idf corpus.



# File 'lib/tf-idf.rb', line 22

def idf_default
  @idf_default
end

#num_docs ⇒ `Integer`

Returns the total number of documents in the tf-idf corpus.

Returns:

(Integer) —

The total number of documents in the tf-idf corpus.



# File 'lib/tf-idf.rb', line 13

def num_docs
  @num_docs
end

#stopwords ⇒ `Array<String>`

Returns an array of stopwords.

Returns:

(Array<String>) —

An array of stopwords.



# File 'lib/tf-idf.rb', line 19

def stopwords
  @stopwords
end

#term_num_docs ⇒ `Hash<String, Integer>`

Returns a histogram mapping each term to the number of documents that contain it.

Returns:

(Hash<String, Integer>) —

A histogram mapping each term to the number of documents that contain it.



# File 'lib/tf-idf.rb', line 16

def term_num_docs
  @term_num_docs
end

Class Method Details

.from_corpus(corpus_filename, default_idf = DEFAULT_IDF) ⇒ `TfIdf`

Convenience method for creating a TfIdf instance.

Parameters:

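corpus_filename (String) —

The disk location of the IDF corpus.
default_idf (Float) (defaults to: DEFAULT_IDF) —

The value returned when a term is not found in the IDF corpus.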

Returns:

(TfIdf) —

A TfIdf instance loaded with the corpus.



# File 'lib/tf-idf.rb', line 75

def self.from_corpus(corpus_filename, default_idf = DEFAULT_IDF)
  self.new(corpus_filename, nil, default_idf)
end

Instance Method Details

#add_input_document(input) ⇒ `Object`

Add terms in the specified document to the IDF corpus.

Parameters:

input (String) —

String representation of a document.

# File 'lib/tf-idf.rb', line 95

def add_input_document(input)
  self.num_docs += 1
  token_set = get_tokens(input).uniq
  token_set.each do |term|
    if self.term_num_docs[term]
      self.term_num_docs[term] += 1
    else
      self.term_num_docs[term] = 1
    end
  end
end
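
The counting rule above can be sketched outside the class. This is a standalone illustration, not the library code: `count_document_frequency` is a hypothetical name, and plain whitespace splitting stands in for `get_tokens`. The point is that each document adds at most 1 to a term's count, however often the term repeats inside it.

```ruby
# Hypothetical standalone helper: counts, for every term, the number of
# documents that contain it at least once.
def count_document_frequency(documents)
  doc_freq = Hash.new(0)
  documents.each do |doc|
    # uniq ensures a repeated term is counted once per document.
    doc.split.uniq.each { |term| doc_freq[term] += 1 }
  end
  doc_freq
end

docs = ["the spoon the fork", "the knife"]
freq = count_document_frequency(docs)
puts freq["the"]    # 2: present in both documents
puts freq["spoon"]  # 1
```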

#doc_keywords(curr_doc) ⇒ `Array`

Retrieve terms and corresponding tf-idf for the specified document.

The returned terms are ordered by decreasing tf-idf.

Parameters:

curr_doc (String) —

String representation of an existing document.

Returns:

(Array) —

Terms ordered by decreasing tf-idf rank.

# File 'lib/tf-idf.rb', line 164

def doc_keywords(curr_doc)
  tfidf = {}

  tokens = self.get_tokens(curr_doc)
  token_set = tokens.uniq
  token_set_sz = token_set.count
  
  token_set.each do |term|
    mytf = tokens.count(term).to_f / token_set_sz
    myidf = self.idf(term)
    tfidf[term] = mytf * myidf
  end

  sort_by_tfidf(tfidf)
end
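
A small worked example may help with the arithmetic in `doc_keywords`. The sketch below is standalone and uses made-up IDF values (the `idf` hash is hypothetical, not produced by the library). Note that, as in the method above, the term count is divided by the number of distinct tokens rather than the total token count.

```ruby
# Hypothetical IDF values, chosen only for illustration.
idf = { "the" => 0.0, "spoon" => 1.2, "bends" => 0.8 }

tokens = %w[the spoon bends the spoon]
unique = tokens.uniq # ["the", "spoon", "bends"]

# tf = occurrences of the term / number of distinct tokens,
# mirroring the division used in doc_keywords.
scores = unique.to_h do |term|
  tf = tokens.count(term).to_f / unique.size
  [term, tf * idf.fetch(term)]
end

# Highest score first.
ranked = scores.sort_by { |_term, score| -score }
ranked.each { |term, score| puts format("%-6s %.4f", term, score) }
```

With these inputs "spoon" scores 2/3 × 1.2 = 0.8, "bends" scores 1/3 × 0.8 ≈ 0.2667, and "the" scores 0, so the ranking is spoon, bends, the.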

#get_tokens(input) ⇒ `Array<String>`

Breaks a string into tokens. This implementation splits the input on whitespace and keeps every chunk that contains a word, an HTML tag, or an anchor element. Clients may wish to override this behaviour with their own tokenization strategy.

Parameters:

input (String) —

String representation of a document.

Returns:

(Array<String>) —

A list of tokens.

# File 'lib/tf-idf.rb', line 86

def get_tokens(input)
  # str.split().collect{|x| x if x =~ /[A-Za-z]+/}.compact
  input.split.select{|x| x =~ /<a.*?\/a>|<[^\>]*>|[\w'@#]+/}
end
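
To see what this tokenizer keeps and drops, the same expression can be run on a sample string. `tokenize` below is a hypothetical top-level wrapper around the expression from `get_tokens`, used only so the snippet runs outside the class.

```ruby
# Hypothetical wrapper: same split/select expression as get_tokens.
def tokenize(input)
  input.split.select { |x| x =~ /<a.*?\/a>|<[^\>]*>|[\w'@#]+/ }
end

tokens = tokenize("Hello, world! -- it's @ruby #tfidf")
p tokens
```

This yields `["Hello,", "world!", "it's", "@ruby", "#tfidf"]`: the punctuation-only chunk `--` is dropped, while punctuation attached to a word stays on the token and case is preserved. Callers who need normalized terms would have to lowercase and strip punctuation themselves.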

#idf(term) ⇒ `Float`

Retrieves the IDF for the specified term.

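Stopwords always have an IDF of 0, and terms missing from the corpus return idf_default. For all other terms the IDF is computed with add-one smoothing, using the natural logarithm:

logarithm of ((1 + number of documents in corpus) divided by
              (1 + number of documents containing this term)).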

Parameters:

term (String) —

A term in the IDF corpus.

Returns:

(Float) —

The IDF for the specified term.

# File 'lib/tf-idf.rb', line 144

def idf(term)
  if self.stopwords.include?(term)
    return 0
  end
          
  if self.term_num_docs[term].nil?
    return self.idf_default
  end

  return Math.log((1 + self.num_docs).to_f / 
                  (1 + self.term_num_docs[term]))
end
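
As a quick numeric check of the computation in `idf`, the sketch below evaluates the same expression for two cases. `smoothed_idf` is a hypothetical standalone function; it covers only the final `Math.log` branch, not the stopword or missing-term branches.

```ruby
# Hypothetical standalone version of the last branch of idf:
# natural log of (1 + total documents) / (1 + documents containing the term).
def smoothed_idf(num_docs, docs_with_term)
  Math.log((1 + num_docs).to_f / (1 + docs_with_term))
end

puts smoothed_idf(9, 4) # log(10 / 5) = log(2), about 0.6931
puts smoothed_idf(9, 9) # log(10 / 10) = 0.0: term occurs in every document
```

The added 1 in numerator and denominator keeps the ratio defined even for a count of zero, and a term found in every document gets an IDF of exactly 0.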

#save_corpus_to_file(idf_filename, stopword_filename, stopword_percentage_threshold = 0.01) ⇒ `Object`

Saves the tf-idf corpus and stopword list to the specified files.

A word is a stopword if it occurs in more than stopword_percentage_threshold × num_docs documents. The threshold is a fraction, not a percentage: a threshold of 0.4 means that the word must occur in more than 40% of the documents.

Parameters:

idf_filename (String) —

Path of the file the IDF corpus is written to.
stopword_filename (String) —

Path of the file the stopword list is written to.
stopword_percentage_threshold (Float) (defaults to: 0.01) —

Fraction of documents above which a term is treated as a stopword. A lower threshold marks more terms as stopwords.

# File 'lib/tf-idf.rb', line 116

def save_corpus_to_file(idf_filename, stopword_filename, stopword_percentage_threshold = 0.01)
  File.open(idf_filename, "w") do |file|
    file.write("#{self.num_docs}\n")
    self.term_num_docs.each do |term, num_docs|
      file.write("#{term}: #{num_docs}\n")
    end
  end
  
  File.open(stopword_filename, "w") do |file|
    sorted_term_num_docs = sort_by_tfidf(self.term_num_docs)
    sorted_term_num_docs.each do |term, num_docs|
      # pp [term, num_docs, stopword_percentage_threshold, self.num_docs, stopword_percentage_threshold * self.num_docs, ]
      if num_docs > stopword_percentage_threshold * self.num_docs
        file.write("#{term}\n")
      end
    end
  end
end
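
The stopword rule in the second half of this method can be sketched on its own. `select_stopwords` is a hypothetical helper that applies the same comparison (`count > threshold * num_docs`) without writing any files.

```ruby
# Hypothetical helper: returns the terms whose document count exceeds
# threshold * num_docs, the same test save_corpus_to_file applies.
def select_stopwords(term_num_docs, num_docs, threshold)
  term_num_docs.select { |_term, count| count > threshold * num_docs }.keys
end

counts = { "the" => 10, "spoon" => 4, "fork" => 1 }

stop = select_stopwords(counts, 10, 0.4)
p stop # ["the"]: "spoon" sits exactly at 4 of 10 and is not strictly above

loose = select_stopwords(counts, 10, 0.01)
p loose.sort
```

With the default threshold of 0.01 and a 10-document corpus, every term in this sample counts as a stopword, so small corpora probably call for a higher threshold.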

#sort_by_tfidf(tfidf) ⇒ `Array<Array<String, Float>>`

Sorts terms by decreasing tf-idf.

Examples:

Sort by tf-idf

"{'and'=>0.0025, 'fork'=>0.0025, 'the'=>0.37688590118819, 'spoon'=>1.0025}" #=>
"[['spoon', 1.0025], ['the', 0.37688590118819], ['fork', 0.0025], ['and', 0.0025]]"

Returns:

(Array<Array<String, Float>>) —

An array of [term, tf-idf] pairs, highest tf-idf first.



# File 'lib/tf-idf.rb', line 195

def sort_by_tfidf(tfidf)
  tfidf.sort{|a, b| b[1] <=> a[1]}
end

#to_s ⇒ `String`

Returns a string representation of the tf-idf corpus.

Returns:

(String) —

The inspected form of a hash holding the number of documents (:num_docs) and the number of distinct terms (:term_num_docs) in the corpus.



# File 'lib/tf-idf.rb', line 184

def to_s
  {:num_docs => self.num_docs, :term_num_docs => self.term_num_docs.size}.inspect
end
