Class: Classifier::TFIDF

Inherits:
Object
Includes:
Streaming
Defined in:
lib/classifier/tfidf.rb

Overview

TF-IDF vectorizer: transforms text into weighted feature vectors, downweighting common words and upweighting discriminative terms.

Example:

tfidf = Classifier::TFIDF.new
tfidf.fit(["Dogs are great pets", "Cats are independent"])
tfidf.transform("Dogs are loyal")  # => {:dog=>0.7071..., :loyal=>0.7071...}

Constant Summary

Constants included from Streaming

Streaming::DEFAULT_BATCH_SIZE

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from Streaming

#delete_checkpoint, #list_checkpoints, #save_checkpoint

Constructor Details

#initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false) ⇒ TFIDF

Creates a new TF-IDF vectorizer.

  • min_df/max_df: filter terms by document frequency (Integer for count, Float for proportion)

  • ngram_range: [1,1] for unigrams, [1,2] for unigrams+bigrams

  • sublinear_tf: use 1 + log(tf) instead of raw term frequency
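The Integer-vs-Float distinction for min_df/max_df can be illustrated with a small standalone sketch (the helper name df_bound_to_count is hypothetical; the gem's actual bounds check is private):

```ruby
# Hypothetical sketch of resolving a df bound: an Integer is an absolute
# document count, a Float is a proportion of the corpus (mirroring the
# documented semantics above).
def df_bound_to_count(bound, num_documents)
  bound.is_a?(Float) ? bound * num_documents : bound
end

# With a 100-document corpus:
df_bound_to_count(2, 100)    # min_df: 2   -> term must appear in at least 2 docs
df_bound_to_count(0.5, 100)  # max_df: 0.5 -> term must appear in at most 50 docs
```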



# File 'lib/classifier/tfidf.rb', line 42

def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)
  validate_df!(min_df, 'min_df')
  validate_df!(max_df, 'max_df')
  validate_ngram_range!(ngram_range)

  @min_df = min_df
  @max_df = max_df
  @ngram_range = ngram_range
  @sublinear_tf = sublinear_tf
  @vocabulary = {}
  @idf = {}
  @num_documents = 0
  @fitted = false
  @dirty = false
  @storage = nil
end

Instance Attribute Details

#idf ⇒ Object (readonly)

Returns the value of attribute idf.



# File 'lib/classifier/tfidf.rb', line 32

def idf
  @idf
end

#num_documents ⇒ Object (readonly)

Returns the value of attribute num_documents.



# File 'lib/classifier/tfidf.rb', line 32

def num_documents
  @num_documents
end

#storage ⇒ Object

Returns the value of attribute storage.



# File 'lib/classifier/tfidf.rb', line 33

def storage
  @storage
end

#vocabulary ⇒ Object (readonly)

Returns the value of attribute vocabulary.



# File 'lib/classifier/tfidf.rb', line 32

def vocabulary
  @vocabulary
end

Class Method Details

.from_json(json) ⇒ Object

Loads a vectorizer from JSON.

Raises:

  • (ArgumentError)


# File 'lib/classifier/tfidf.rb', line 218

def self.from_json(json)
  data = json.is_a?(String) ? JSON.parse(json) : json
  raise ArgumentError, "Invalid vectorizer type: #{data['type']}" unless data['type'] == 'tfidf'

  instance = new(
    min_df: data['min_df'],
    max_df: data['max_df'],
    ngram_range: data['ngram_range'],
    sublinear_tf: data['sublinear_tf']
  )

  instance.instance_variable_set(:@vocabulary, symbolize_keys(data['vocabulary']))
  instance.instance_variable_set(:@idf, symbolize_keys(data['idf']))
  instance.instance_variable_set(:@num_documents, data['num_documents'])
  instance.instance_variable_set(:@fitted, data['fitted'])
  instance.instance_variable_set(:@dirty, false)
  instance.instance_variable_set(:@storage, nil)

  instance
end
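The symbolize_keys calls compensate for JSON having no symbol type: vocabulary and IDF keys are Symbols in memory but come back as Strings after a round-trip. A minimal standalone illustration, using Hash#transform_keys in place of the gem's private helper:

```ruby
require 'json'

# Vocabulary keys are Symbols in memory, but JSON.generate stringifies
# them, so JSON.parse yields String keys that must be re-symbolized.
vocab = { dog: 0, loyal: 1 }
parsed = JSON.parse(JSON.generate(vocab))   # {"dog"=>0, "loyal"=>1}
restored = parsed.transform_keys(&:to_sym)  # {:dog=>0, :loyal=>1}
```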

.load(storage:) ⇒ Object

Loads a vectorizer from the configured storage.

Raises:

  • (StorageError)

# File 'lib/classifier/tfidf.rb', line 153

def self.load(storage:)
  data = storage.read
  raise StorageError, 'No saved state found' unless data

  instance = from_json(data)
  instance.storage = storage
  instance
end

.load_checkpoint(storage:, checkpoint_id:) ⇒ Object

Loads a vectorizer from a checkpoint.

Raises:

  • (ArgumentError)

  • (StorageError)


# File 'lib/classifier/tfidf.rb', line 254

def self.load_checkpoint(storage:, checkpoint_id:)
  raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)

  dir = File.dirname(storage.path)
  base = File.basename(storage.path, '.*')
  ext = File.extname(storage.path)
  checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")

  checkpoint_storage = Storage::File.new(path: checkpoint_path)
  instance = load(storage: checkpoint_storage)
  instance.storage = storage
  instance
end

.load_from_file(path) ⇒ Object

Loads a vectorizer from a file.



# File 'lib/classifier/tfidf.rb', line 164

def self.load_from_file(path)
  from_json(File.read(path))
end

Instance Method Details

#as_json(_options = nil) ⇒ Object



# File 'lib/classifier/tfidf.rb', line 196

def as_json(_options = nil)
  {
    version: 1,
    type: 'tfidf',
    min_df: @min_df,
    max_df: @max_df,
    ngram_range: @ngram_range,
    sublinear_tf: @sublinear_tf,
    vocabulary: @vocabulary,
    idf: @idf,
    num_documents: @num_documents,
    fitted: @fitted
  }
end

#dirty? ⇒ Boolean

Returns true if there are unsaved changes.

Returns:

  • (Boolean)


# File 'lib/classifier/tfidf.rb', line 130

def dirty?
  @dirty
end

#feature_names ⇒ Object

Returns vocabulary terms in index order.



# File 'lib/classifier/tfidf.rb', line 119

def feature_names
  @vocabulary.keys.sort_by { |term| @vocabulary[term] }
end

#fit(documents) ⇒ Object

Learns vocabulary and IDF weights from the corpus.

Raises:

  • (ArgumentError)


# File 'lib/classifier/tfidf.rb', line 61

def fit(documents)
  raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)
  raise ArgumentError, 'documents cannot be empty' if documents.empty?

  @num_documents = documents.size
  document_frequencies = Hash.new(0)

  documents.each do |doc|
    terms = extract_terms(doc)
    terms.each_key { |term| document_frequencies[term] += 1 }
  end

  @vocabulary = {}
  @idf = {}
  vocab_index = 0

  document_frequencies.each do |term, df|
    next unless within_df_bounds?(df, @num_documents)

    @vocabulary[term] = vocab_index
    vocab_index += 1

    # IDF: log((N + 1) / (df + 1)) + 1
    @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
  end

  @fitted = true
  @dirty = true
  self
end
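The smoothed IDF computed in the loop above is easy to check by hand. A standalone sketch of the same formula (smoothed_idf is a hypothetical name mirroring the expression in #fit):

```ruby
# Smoothed IDF as used in #fit: log((N + 1) / (df + 1)) + 1.
# The +1 smoothing keeps weights finite and positive, even for a term
# that appears in every document.
def smoothed_idf(num_documents, df)
  Math.log((num_documents + 1).to_f / (df + 1)) + 1
end

smoothed_idf(2, 1)  # ~1.405 -- term in half the corpus
smoothed_idf(2, 2)  # 1.0    -- term in every document gets the floor weight
```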

#fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ Object

Fits the vectorizer from an IO stream. Collects all documents from the stream, then fits the model. Note: All documents must be collected in memory for IDF calculation.

Examples:

Fit from a file

tfidf.fit_from_stream(File.open('corpus.txt'))

With progress tracking

tfidf.fit_from_stream(io, batch_size: 500) do |progress|
  puts "#{progress.completed} documents loaded"
end


# File 'lib/classifier/tfidf.rb', line 281

def fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE)
  reader = Streaming::LineReader.new(io, batch_size: batch_size)
  total = reader.estimate_line_count
  progress = Streaming::Progress.new(total: total)

  documents = [] #: Array[String]

  reader.each_batch do |batch|
    documents.concat(batch)
    progress.completed += batch.size
    progress.current_batch += 1
    yield progress if block_given?
  end

  fit(documents) unless documents.empty?
  self
end

#fit_transform(documents) ⇒ Object

Fits and transforms in one step.



# File 'lib/classifier/tfidf.rb', line 112

def fit_transform(documents)
  fit(documents)
  documents.map { |doc| transform(doc) }
end

#fitted? ⇒ Boolean

Returns true if the vectorizer has been fitted.

Returns:

  • (Boolean)


# File 'lib/classifier/tfidf.rb', line 124

def fitted?
  @fitted
end

#marshal_dump ⇒ Object



# File 'lib/classifier/tfidf.rb', line 240

def marshal_dump
  [@min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted]
end

#marshal_load(data) ⇒ Object



# File 'lib/classifier/tfidf.rb', line 245

def marshal_load(data)
  @min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted = data
  @dirty = false
  @storage = nil
end

#reload ⇒ Object

Reloads the vectorizer from storage, raising if there are unsaved changes.

Raises:

  • (ArgumentError)

  • (UnsavedChangesError)

  • (StorageError)


# File 'lib/classifier/tfidf.rb', line 170

def reload
  raise ArgumentError, 'No storage configured' unless storage
  raise UnsavedChangesError, 'Unsaved changes would be lost. Call save first or use reload!' if @dirty

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end

#reload! ⇒ Object

Force reloads the vectorizer from storage, discarding any unsaved changes.

Raises:

  • (ArgumentError)

  • (StorageError)


# File 'lib/classifier/tfidf.rb', line 184

def reload!
  raise ArgumentError, 'No storage configured' unless storage

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end

#save ⇒ Object

Saves the vectorizer to the configured storage.

Raises:

  • (ArgumentError)


# File 'lib/classifier/tfidf.rb', line 136

def save
  raise ArgumentError, 'No storage configured' unless storage

  storage.write(to_json)
  @dirty = false
end

#save_to_file(path) ⇒ Object

Saves the vectorizer state to a file.



# File 'lib/classifier/tfidf.rb', line 145

def save_to_file(path)
  result = File.write(path, to_json)
  @dirty = false
  result
end

#to_json(_options = nil) ⇒ Object



# File 'lib/classifier/tfidf.rb', line 212

def to_json(_options = nil)
  JSON.generate(as_json)
end

#train_batch ⇒ Object

TFIDF doesn’t support train_batch (use fit instead). This method raises NotImplementedError with guidance.

Raises:

  • (NotImplementedError)


# File 'lib/classifier/tfidf.rb', line 311

def train_batch(*) # steep:ignore
  raise NotImplementedError, 'TFIDF uses fit instead of train_batch'
end

#train_from_stream ⇒ Object

TFIDF doesn’t support train_from_stream (use fit_from_stream instead). This method raises NotImplementedError with guidance.

Raises:

  • (NotImplementedError)


# File 'lib/classifier/tfidf.rb', line 303

def train_from_stream(*) # steep:ignore
  raise NotImplementedError, 'TFIDF uses fit_from_stream instead of train_from_stream'
end

#transform(document) ⇒ Object

Transforms a document into a normalized TF-IDF vector.

Raises:

  • (NotFittedError)

# File 'lib/classifier/tfidf.rb', line 94

def transform(document)
  raise NotFittedError, 'TFIDF has not been fitted. Call fit first.' unless @fitted

  terms = extract_terms(document)
  result = {} #: Hash[Symbol, Float]

  terms.each do |term, tf|
    next unless @vocabulary.key?(term)

    tf_value = @sublinear_tf && tf.positive? ? 1 + Math.log(tf) : tf.to_f
    result[term] = (tf_value * @idf[term]).to_f
  end

  normalize_vector(result)
end
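The private normalize_vector is not shown on this page. Assuming the conventional L2 normalization for TF-IDF (consistent with the 0.7071... weights in the overview example, which equal 1/√2), it might look like:

```ruby
# Hypothetical sketch of L2 normalization: scale the vector so its
# Euclidean norm is 1. (The gem's actual normalize_vector is private;
# L2 is assumed from the 0.7071... weights in the overview example.)
def l2_normalize(vector)
  norm = Math.sqrt(vector.values.sum { |v| v * v })
  return vector if norm.zero?  # avoid dividing an all-zero vector

  vector.transform_values { |v| v / norm }
end

l2_normalize({ dog: 1.4, loyal: 1.4 })  # => {:dog=>0.7071..., :loyal=>0.7071...}
```

Two equal weights always normalize to 1/√2 each, regardless of their raw TF-IDF magnitude, which is why the overview example shows identical 0.7071... values.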