Class: Topical::Extractors::TermExtractor

Inherits: Object

Defined in: lib/topical/extractors/term_extractor.rb

Overview

Extracts the terms that best distinguish a topic's documents from the rest of the corpus, scored with c-TF-IDF (class-based TF-IDF).

Constant Summary

DEFAULT_STOP_WORDS =

Default English stop words.

Set.new(%w[
  the be to of and a in that have i it for not on with he as you do at
  this but his by from they we say her she or an will my one all would
  there their what so up out if about who get which go me when make can
  like time no just him know take people into year your good some could
  them see other than then now look only come its over think also back
  after use two how our work first well way even new want because any
  these give day most us is was are been has had were said did get may
])

Instance Method Summary

#extract_distinctive_terms(topic_docs:, all_docs:, top_n: 20) ⇒ Array<String>

Extract distinctive terms using c-TF-IDF.

#initialize(stop_words: DEFAULT_STOP_WORDS, min_word_length: 3, max_word_length: 20) ⇒ TermExtractor (constructor)

A new instance of TermExtractor.

Constructor Details

#initialize(stop_words: DEFAULT_STOP_WORDS, min_word_length: 3, max_word_length: 20) ⇒ `TermExtractor`

Returns a new instance of TermExtractor.

# File 'lib/topical/extractors/term_extractor.rb', line 20

def initialize(stop_words: DEFAULT_STOP_WORDS, min_word_length: 3, max_word_length: 20)
  @stop_words = stop_words
  @min_word_length = min_word_length
  @max_word_length = max_word_length
end
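
The constructor only stores its three settings; the private tokenizer that consumes them is not shown on this page. As a rough sketch of how such settings are typically applied, the hypothetical `keep_token?` helper below drops stop words and any token whose length falls outside the bounds. The helper name, the inclusive bounds, and the small stand-in stop-word set are all assumptions for illustration, not the gem's actual code.

```ruby
require "set"

# Hypothetical filter (not part of Topical): keeps a token only if it is
# not a stop word and its length is within the inclusive bounds.
def keep_token?(token, stop_words:, min_word_length: 3, max_word_length: 20)
  token.length.between?(min_word_length, max_word_length) &&
    !stop_words.include?(token)
end

# Small stand-in for DEFAULT_STOP_WORDS, used here for illustration only.
stop_words = Set.new(%w[the and with])

tokens = %w[the cat sat with an extraordinarily supercalifragilisticexpialidocious hat]
kept = tokens.select { |t| keep_token?(t, stop_words: stop_words) }
puts kept.inspect
```

Under these assumptions, "the" and "with" are removed as stop words, "an" is too short, and the 34-character token is too long.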

Instance Method Details

#extract_distinctive_terms(topic_docs:, all_docs:, top_n: 20) ⇒ `Array<String>`

Extract the top_n highest-scoring terms for a topic using c-TF-IDF.

Parameters:

topic_docs (Array<String>) — Documents in the topic
all_docs (Array<String>) — All documents in the corpus
top_n (Integer) (defaults to: 20) — Number of top terms to return

Returns:

(Array<String>) — Top distinctive terms

# File 'lib/topical/extractors/term_extractor.rb', line 31

def extract_distinctive_terms(topic_docs:, all_docs:, top_n: 20)
  # Tokenize and count terms in topic
  topic_terms = count_terms(topic_docs)
  
  # Tokenize and count document frequency across all docs
  doc_frequencies = compute_document_frequencies(all_docs)
  
  # Compute c-TF-IDF scores
  scores = {}
  total_docs = all_docs.length.to_f
  
  topic_terms.each do |term, tf|
    # Score: topic-level tf * log(N / df); a term missing from doc_frequencies falls back to df = 1
    df = doc_frequencies[term] || 1
    idf = Math.log(total_docs / df)
    scores[term] = tf * idf
  end
  
  # Return top scoring terms
  scores.sort_by { |_, score| -score }
         .first(top_n)
         .map(&:first)
end
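
The method above relies on two private helpers, `count_terms` and `compute_document_frequencies`, whose bodies are not shown on this page. The sketch below is a standalone approximation of the same scoring so it can be run without the gem. The tokenizer, the tiny stop-word list, and both helper bodies are assumptions, and `Array#tally` requires Ruby 2.7 or later.

```ruby
require "set"

# Stand-in stop words and tokenizer; both are assumptions for this sketch.
STOP_WORDS = Set.new(%w[the a of and is to in on])

def tokenize(text)
  text.downcase.scan(/[a-z]+/).reject { |w| w.length < 3 || STOP_WORDS.include?(w) }
end

# Hypothetical helper: term frequency summed over all documents in the topic.
def count_terms(docs)
  docs.flat_map { |doc| tokenize(doc) }.tally
end

# Hypothetical helper: number of documents containing each term at least once.
def compute_document_frequencies(docs)
  docs.flat_map { |doc| tokenize(doc).uniq }.tally
end

# Same scoring as the documented method: tf * log(N / df), df defaulting to 1.
def distinctive_terms(topic_docs:, all_docs:, top_n: 20)
  tf = count_terms(topic_docs)
  df = compute_document_frequencies(all_docs)
  n = all_docs.length.to_f

  tf.map { |term, count| [term, count * Math.log(n / (df[term] || 1))] }
    .sort_by { |_, score| -score }
    .first(top_n)
    .map(&:first)
end

topic = ["ruby gems ship ruby code", "ruby code runs fast"]
others = ["python code runs everywhere", "coffee beans roast slowly"]
terms = distinctive_terms(topic_docs: topic, all_docs: topic + others, top_n: 3)
puts terms.inspect
```

In this toy corpus "ruby" ranks first: it occurs three times in the topic and in only two of the four documents, so it scores 3 * log(4 / 2). "code" is frequent in the topic but appears in three of the four documents, so its score of 2 * log(4 / 3) leaves it outside the top three. Terms with equal scores may come back in any order, because `sort_by` is not guaranteed to be stable.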