Class: Topical::Extractors::TermExtractor
- Inherits:
- Object
- Defined in:
- lib/topical/extractors/term_extractor.rb
Overview
Extracts distinctive terms from documents using c-TF-IDF (class-based TF-IDF).
Constant Summary
- DEFAULT_STOP_WORDS =
Default English stop words
Set.new(%w[ the be to of and a in that have i it for not on with he as you do at this but his by from they we say her she or an will my one all would there their what so up out if about who get which go me when make can like time no just him know take people into year your good some could them see other than then now look only come its over think also back after use two how our work first well way even new want because any these give day most us is was are been has had were said did get may ])
Instance Method Summary
- #extract_distinctive_terms(topic_docs:, all_docs:, top_n: 20) ⇒ Array<String>
  Extract distinctive terms using c-TF-IDF.
- #initialize(stop_words: DEFAULT_STOP_WORDS, min_word_length: 3, max_word_length: 20) ⇒ TermExtractor (constructor)
  A new instance of TermExtractor.
Constructor Details
#initialize(stop_words: DEFAULT_STOP_WORDS, min_word_length: 3, max_word_length: 20) ⇒ TermExtractor
Returns a new instance of TermExtractor.
    # File 'lib/topical/extractors/term_extractor.rb', line 20

    def initialize(stop_words: DEFAULT_STOP_WORDS, min_word_length: 3, max_word_length: 20)
      @stop_words = stop_words
      @min_word_length = min_word_length
      @max_word_length = max_word_length
    end
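The stop_words: argument accepts a replacement for DEFAULT_STOP_WORDS. The following is a minimal sketch of extending a stop-word set with domain-specific words before passing it in. The abbreviated base_stop_words set is a stand-in for the real constant so the snippet runs without the gem, and the domain words are hypothetical.

```ruby
require "set"

# Stand-in for Topical::Extractors::TermExtractor::DEFAULT_STOP_WORDS,
# abbreviated so this snippet runs without the gem installed.
base_stop_words = Set.new(%w[the be to of and a in that])

# Hypothetical domain-specific additions (for example, URL fragments).
domain_stop_words = %w[http https www]

# Set#| returns a new Set and leaves the original untouched.
custom_stop_words = base_stop_words | domain_stop_words

# With the gem loaded, the set would go to the documented constructor:
#   Topical::Extractors::TermExtractor.new(
#     stop_words: custom_stop_words,
#     min_word_length: 4
#   )
```

Building a new set rather than mutating the default keeps other extractor instances that rely on DEFAULT_STOP_WORDS unaffected.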
Instance Method Details
#extract_distinctive_terms(topic_docs:, all_docs:, top_n: 20) ⇒ Array<String>
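Extracts the top_n terms that best distinguish topic_docs from the full corpus all_docs. Each term is scored as tf * log(N / df), where tf is its count in the topic, df is its document frequency across all_docs, and N is the number of documents in all_docs. Terms are returned in descending score order.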
    # File 'lib/topical/extractors/term_extractor.rb', line 31

    def extract_distinctive_terms(topic_docs:, all_docs:, top_n: 20)
      # Tokenize and count terms in topic
      topic_terms = count_terms(topic_docs)

      # Tokenize and count document frequency across all docs
      doc_frequencies = compute_document_frequencies(all_docs)

      # Compute c-TF-IDF scores
      scores = {}
      total_docs = all_docs.length.to_f

      topic_terms.each do |term, tf|
        # c-TF-IDF formula: tf * log(N / df)
        df = doc_frequencies[term] || 1
        idf = Math.log(total_docs / df)
        scores[term] = tf * idf
      end

      # Return top scoring terms
      scores.sort_by { |_, score| -score }
            .first(top_n)
            .map(&:first)
    end
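To make the scoring concrete, here is a self-contained sketch of the same tf * log(N / df) computation on a tiny corpus. The count_terms and compute_document_frequencies helpers are not shown on this page, so the tokenize method below (lowercasing, splitting on letters, filtering by length and stop words) is an assumption for illustration and may differ from what the library does. The names tokenize and distinctive_terms are hypothetical.

```ruby
require "set"

# Hypothetical tokenizer; the library's real tokenization is not shown here.
def tokenize(text, stop_words: Set.new(%w[the and for with are]), min_len: 3, max_len: 20)
  text.downcase.scan(/[a-z]+/).select do |word|
    word.length.between?(min_len, max_len) && !stop_words.include?(word)
  end
end

def distinctive_terms(topic_docs:, all_docs:, top_n: 20)
  # Term frequency: how often each term occurs within the topic's documents.
  tf = Hash.new(0)
  topic_docs.each { |doc| tokenize(doc).each { |term| tf[term] += 1 } }

  # Document frequency: how many corpus documents contain each term.
  df = Hash.new(0)
  all_docs.each { |doc| tokenize(doc).uniq.each { |term| df[term] += 1 } }

  total_docs = all_docs.length.to_f

  # Score each topic term as tf * log(N / df), defaulting df to 1 when unseen.
  scores = tf.to_h do |term, count|
    [term, count * Math.log(total_docs / df.fetch(term, 1))]
  end

  scores.sort_by { |_, score| -score }.first(top_n).map(&:first)
end

all_docs = [
  "Ruby gems and bundler",
  "Ruby blocks and procs",
  "Ruby modules and mixins",
  "Sourdough bread needs a starter",
  "Bread flour makes chewy bread"
]
topic_docs = all_docs.last(2)

p distinctive_terms(topic_docs: topic_docs, all_docs: all_docs, top_n: 1)
# => ["bread"]
```

In this corpus "bread" occurs three times in the topic and in two of five documents, so it scores 3 * log(2.5), about 2.75, ahead of the single-occurrence terms at log(5), about 1.61. Note that with this formula a term appearing in every document of all_docs scores zero, since log(N / N) = 0.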