Class: Topical::Extractors::TermExtractor

Inherits:
Object
Defined in:
lib/topical/extractors/term_extractor.rb

Overview

Extracts the terms that are most distinctive of a topic's documents relative to the full corpus, scoring each term with c-TF-IDF (class-based TF-IDF).

Constant Summary

DEFAULT_STOP_WORDS =

Default English stop words

Set.new(%w[
  the be to of and a in that have i it for not on with he as you do at
  this but his by from they we say her she or an will my one all would
  there their what so up out if about who get which go me when make can
  like time no just him know take people into year your good some could
  them see other than then now look only come its over think also back
  after use two how our work first well way even new want because any
  these give day most us is was are been has had were said did get may
])

Instance Method Summary

Constructor Details

#initialize(stop_words: DEFAULT_STOP_WORDS, min_word_length: 3, max_word_length: 20) ⇒ TermExtractor

Returns a new instance of TermExtractor.



# File 'lib/topical/extractors/term_extractor.rb', line 20

def initialize(stop_words: DEFAULT_STOP_WORDS, min_word_length: 3, max_word_length: 20)
  @stop_words = stop_words
  @min_word_length = min_word_length
  @max_word_length = max_word_length
end
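
The constructor only stores its three options; the tokenizer that consumes them is private and not shown on this page. As a rough sketch of how such options are commonly applied, the hypothetical helper below (sketch_tokenize is not part of the library, and its lowercasing and letter-only splitting are assumptions) drops stop words and words outside the length bounds:

```ruby
require 'set'

# Hypothetical stand-in for the private tokenizer, which this page does not show.
# Assumption: text is lowercased, split into runs of letters, then filtered.
def sketch_tokenize(text, stop_words:, min_word_length: 3, max_word_length: 20)
  text.downcase.scan(/[a-z]+/).select do |word|
    # Keep words within the length bounds that are not stop words.
    word.length.between?(min_word_length, max_word_length) &&
      !stop_words.include?(word)
  end
end

stop_words = Set.new(%w[the and for with])
tokens = sketch_tokenize("The GPU and the CPU share memory with it", stop_words: stop_words)
p tokens
```

Passing a Set for stop_words, as DEFAULT_STOP_WORDS does, keeps the membership check constant-time per word.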

Instance Method Details

#extract_distinctive_terms(topic_docs:, all_docs:, top_n: 20) ⇒ Array<String>

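Scores each term found in topic_docs as tf * log(N / df), where tf is the term's count within the topic, N is the number of documents in all_docs, and df is the number of those documents that contain the term. Returns the top_n highest-scoring terms.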

Parameters:

  • topic_docs (Array<String>)

    Documents in the topic

  • all_docs (Array<String>)

    All documents in the corpus

  • top_n (Integer) (defaults to: 20)

    Number of top terms to return

Returns:

  • (Array<String>)

    Up to top_n distinctive terms, ordered from highest to lowest score



# File 'lib/topical/extractors/term_extractor.rb', line 31

def extract_distinctive_terms(topic_docs:, all_docs:, top_n: 20)
  # Tokenize and count terms in topic
  topic_terms = count_terms(topic_docs)
  
  # Tokenize and count document frequency across all docs
  doc_frequencies = compute_document_frequencies(all_docs)
  
  # Compute c-TF-IDF scores
  scores = {}
  total_docs = all_docs.length.to_f
  
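  topic_terms.each do |term, tf|
    # c-TF-IDF formula: tf * log(N / df)
    # Fall back to df = 1 for terms missing from doc_frequencies,
    # which avoids dividing by nil and gives such terms the maximum IDF.
    df = doc_frequencies[term] || 1
    idf = Math.log(total_docs / df)
    scores[term] = tf * idf
  end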
  
  # Return top scoring terms
  scores.sort_by { |_, score| -score }
         .first(top_n)
         .map(&:first)
end
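
To make the scoring concrete, here is a minimal standalone sketch of the same tf * log(N / df) computation. The helpers count_terms and compute_document_frequencies are private and not shown on this page, so the sketch_ methods below are hypothetical stand-ins with an assumed tokenizer (lowercase, letters only, no stop-word or length filtering). It needs Ruby 2.7 or later for Enumerable#tally.

```ruby
# Hypothetical tokenizer; the library's real tokenization is not shown here.
def sketch_tokens(doc)
  doc.downcase.scan(/[a-z]+/)
end

# Total occurrences of each term across the topic's documents.
def sketch_count_terms(docs)
  docs.flat_map { |doc| sketch_tokens(doc) }.tally
end

# Number of documents in which each term appears at least once.
def sketch_document_frequencies(docs)
  docs.flat_map { |doc| sketch_tokens(doc).uniq }.tally
end

def sketch_distinctive_terms(topic_docs:, all_docs:, top_n: 20)
  tf = sketch_count_terms(topic_docs)
  df = sketch_document_frequencies(all_docs)
  total = all_docs.length.to_f
  # Same formula as above: tf * log(N / df), with df defaulting to 1.
  scores = tf.to_h { |term, count| [term, count * Math.log(total / (df[term] || 1))] }
  scores.sort_by { |_, score| -score }.first(top_n).map(&:first)
end

all_docs = ["ruby gems ruby", "ruby code", "python code", "rust code"]
topic_docs = all_docs.first(2)
top_terms = sketch_distinctive_terms(topic_docs: topic_docs, all_docs: all_docs, top_n: 2)
p top_terms
```

With N = 4, "ruby" scores 3 * log(4 / 2), "gems" scores 1 * log(4 / 1), and "code" scores 1 * log(4 / 3), so "ruby" and "gems" rank first. A term that appears in every document gets log(N / N) = 0 and therefore never outranks a term with a positive score.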