Class: Classifier::LogisticRegression

Inherits:
Object
Includes:
Streaming, Mutex_m
Defined in:
lib/classifier/logistic_regression.rb

Overview

A Logistic Regression (maximum-entropy) classifier trained with stochastic gradient descent. It often yields higher accuracy and better-calibrated probabilities than Naive Bayes while remaining fast and interpretable.

Example:

classifier = Classifier::LogisticRegression.new(:spam, :ham)
classifier.train(spam: ["Buy now!", "Free money!!!"])
classifier.train(ham: ["Meeting tomorrow", "Project update"])
classifier.classify("Claim your prize!") # => "Spam"
classifier.probabilities("Claim your prize!") # => {"Spam" => 0.92, "Ham" => 0.08}

Constant Summary

DEFAULT_LEARNING_RATE = 0.1
DEFAULT_REGULARIZATION = 0.01
DEFAULT_MAX_ITERATIONS = 100
DEFAULT_TOLERANCE = 1e-4

Constants included from Streaming

Streaming::DEFAULT_BATCH_SIZE

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from Streaming

#delete_checkpoint, #list_checkpoints, #save_checkpoint

Constructor Details

#initialize(*categories, learning_rate: DEFAULT_LEARNING_RATE, regularization: DEFAULT_REGULARIZATION, max_iterations: DEFAULT_MAX_ITERATIONS, tolerance: DEFAULT_TOLERANCE) ⇒ LogisticRegression

Creates a new Logistic Regression classifier with the specified categories.

classifier = Classifier::LogisticRegression.new(:spam, :ham)
classifier = Classifier::LogisticRegression.new('Positive', 'Negative', 'Neutral')
classifier = Classifier::LogisticRegression.new(['Positive', 'Negative', 'Neutral'])
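
Hyperparameters can also be set at construction. A sketch (the values below are illustrative, not tuned recommendations):

classifier = Classifier::LogisticRegression.new(
  :spam, :ham,
  learning_rate: 0.05,
  regularization: 0.001,
  max_iterations: 200,
  tolerance: 1e-5
)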

Options:

  • learning_rate: Step size for gradient descent (default: 0.1)

  • regularization: L2 regularization strength (default: 0.01)

  • max_iterations: Maximum training iterations (default: 100)

  • tolerance: Convergence threshold (default: 1e-4)

Raises:

  • (ArgumentError)


# File 'lib/classifier/logistic_regression.rb', line 59

def initialize(*categories, learning_rate: DEFAULT_LEARNING_RATE,
               regularization: DEFAULT_REGULARIZATION,
               max_iterations: DEFAULT_MAX_ITERATIONS,
               tolerance: DEFAULT_TOLERANCE)
  super()
  categories = categories.flatten
  raise ArgumentError, 'At least two categories required' if categories.size < 2

  @categories = categories.map { |c| c.to_s.prepare_category_name }
  @weights = @categories.to_h { |c| [c, {}] }
  @bias = @categories.to_h { |c| [c, 0.0] }
  @vocabulary = {}
  @training_data = []
  @learning_rate = learning_rate
  @regularization = regularization
  @max_iterations = max_iterations
  @tolerance = tolerance
  @fitted = false
  @dirty = false
  @storage = nil
end

Dynamic Method Handling

This class handles dynamic methods through method_missing.

#method_missing(name, *args) ⇒ Object

Provides training methods for the categories.

classifier.train_spam "Buy now!"

Raises:

  • (StandardError)


# File 'lib/classifier/logistic_regression.rb', line 192

def method_missing(name, *args)
  category_match = name.to_s.match(/train_(\w+)/)
  return super unless category_match

  category = category_match[1].to_s.prepare_category_name
  raise StandardError, "No such category: #{category}" unless @categories.include?(category)

  args.each { |text| train(category, text) }
end

Instance Attribute Details

#storage ⇒ Object

Returns the value of attribute storage.



# File 'lib/classifier/logistic_regression.rb', line 38

def storage
  @storage
end

Class Method Details

.from_json(json) ⇒ Object

Loads a classifier from a JSON string or Hash.
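
For example, a classifier serialized with #to_json can be restored later:

json = classifier.to_json
restored = Classifier::LogisticRegression.from_json(json)
restored.classify("Claim your prize!") # => "Spam"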

Raises:

  • (ArgumentError)


# File 'lib/classifier/logistic_regression.rb', line 237

def self.from_json(json)
  data = json.is_a?(String) ? JSON.parse(json) : json
  raise ArgumentError, "Invalid classifier type: #{data['type']}" unless data['type'] == 'logistic_regression'

  categories = data['categories'].map(&:to_sym)
  instance = allocate
  instance.send(:restore_state, data, categories)
  instance
end

.load(storage:) ⇒ Object

Loads a classifier from the configured storage.
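
A sketch, assuming a file-backed storage object and that the storage class referenced by .load_checkpoint resolves to Classifier::Storage::File:

storage = Classifier::Storage::File.new(path: 'spam_model.json')
classifier = Classifier::LogisticRegression.load(storage: storage)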

Raises:

  • (StorageError)

# File 'lib/classifier/logistic_regression.rb', line 269

def self.load(storage:)
  data = storage.read
  raise StorageError, 'No saved state found' unless data

  instance = from_json(data)
  instance.storage = storage
  instance
end

.load_checkpoint(storage:, checkpoint_id:) ⇒ Object

Loads a classifier from a checkpoint.
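
A sketch, assuming the file storage class resolves to Classifier::Storage::File and that a checkpoint was previously written with #save_checkpoint (the checkpoint_id is illustrative):

storage = Classifier::Storage::File.new(path: 'spam_model.json')
classifier = Classifier::LogisticRegression.load_checkpoint(
  storage: storage,
  checkpoint_id: 'batch_10'
)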

Raises:

  • (ArgumentError)


# File 'lib/classifier/logistic_regression.rb', line 338

def self.load_checkpoint(storage:, checkpoint_id:)
  raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)

  dir = File.dirname(storage.path)
  base = File.basename(storage.path, '.*')
  ext = File.extname(storage.path)
  checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")

  checkpoint_storage = Storage::File.new(path: checkpoint_path)
  instance = load(storage: checkpoint_storage)
  instance.storage = storage
  instance
end

.load_from_file(path) ⇒ Object

Loads a classifier from a file.
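
Pairs with #save_to_file for simple file-based persistence:

classifier.save_to_file('spam_model.json')
restored = Classifier::LogisticRegression.load_from_file('spam_model.json')
restored.classify("Claim your prize!") # => "Spam"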



# File 'lib/classifier/logistic_regression.rb', line 281

def self.load_from_file(path)
  from_json(File.read(path))
end

Instance Method Details

#as_json(_options = nil) ⇒ Object

Returns a hash representation of the classifier state.
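
Top-level keys are symbols, while categories and weights are stringified, so the result is ready for JSON.generate:

hash = classifier.as_json
hash[:type]       # => "logistic_regression"
hash[:categories] # => ["Spam", "Ham"]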



# File 'lib/classifier/logistic_regression.rb', line 210

def as_json(_options = nil)
  fit unless @fitted

  {
    version: 1,
    type: 'logistic_regression',
    categories: @categories.map(&:to_s),
    weights: @weights.transform_keys(&:to_s).transform_values { |v| v.transform_keys(&:to_s) },
    bias: @bias.transform_keys(&:to_s),
    vocabulary: @vocabulary.keys.map(&:to_s),
    learning_rate: @learning_rate,
    regularization: @regularization,
    max_iterations: @max_iterations,
    tolerance: @tolerance
  }
end

#categories ⇒ Object

Returns the list of categories.



# File 'lib/classifier/logistic_regression.rb', line 172

def categories
  synchronize { @categories.map(&:to_s) }
end

#classifications(text) ⇒ Object

Returns log-odds scores for each category (before softmax).
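
Scores are unnormalized, so they need not sum to 1 (the values below are illustrative):

classifier.classifications("Claim your prize!")
# => {"Spam" => 1.7, "Ham" => -0.4}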



# File 'lib/classifier/logistic_regression.rb', line 142

def classifications(text)
  fit unless @fitted

  features = text.word_hash
  synchronize do
    compute_scores(features).transform_keys(&:to_s)
  end
end

#classify(text) ⇒ Object

Returns the best matching category for the provided text.

classifier.classify("Buy now!") # => "Spam"

Raises:

  • (StandardError)


# File 'lib/classifier/logistic_regression.rb', line 115

def classify(text)
  probs = probabilities(text)
  best = probs.max_by { |_, v| v }
  raise StandardError, 'No classifications available' unless best

  best.first
end

#dirty? ⇒ Boolean

Returns true if there are unsaved changes.

Returns:

  • (Boolean)


# File 'lib/classifier/logistic_regression.rb', line 186

def dirty?
  @dirty
end

#fit ⇒ Object

Fits the model to all accumulated training data. Called automatically during classify/probabilities if not already fitted.
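
Typical use after bulk loading, since #train_batch and #train_from_stream (below) do not fit automatically:

classifier.train_batch(spam: spam_docs, ham: ham_docs)
classifier.fit
classifier.fitted? # => true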



# File 'lib/classifier/logistic_regression.rb', line 99

def fit
  synchronize do
    return self if @training_data.empty?

    optimize_weights
    @fitted = true
    @dirty = false
  end
  self
end

#fitted? ⇒ Boolean

Returns true if the model has been fitted.

Returns:

  • (Boolean)


# File 'lib/classifier/logistic_regression.rb', line 179

def fitted?
  @fitted
end

#marshal_dump ⇒ Object

Custom marshal serialization to exclude mutex state.
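
This makes instances safe to round-trip with Marshal despite the included Mutex_m; a minimal sketch:

bytes = Marshal.dump(classifier)
restored = Marshal.load(bytes)
restored.classify("Claim your prize!") # => "Spam"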



# File 'lib/classifier/logistic_regression.rb', line 317

def marshal_dump
  fit unless @fitted
  [@categories, @weights, @bias, @vocabulary, @learning_rate, @regularization,
   @max_iterations, @tolerance, @fitted]
end

#marshal_load(data) ⇒ Object

Custom marshal deserialization to recreate mutex.



# File 'lib/classifier/logistic_regression.rb', line 326

def marshal_load(data)
  mu_initialize
  @categories, @weights, @bias, @vocabulary, @learning_rate, @regularization,
    @max_iterations, @tolerance, @fitted = data
  @training_data = []
  @dirty = false
  @storage = nil
end

#probabilities(text) ⇒ Object

Returns probability distribution across all categories. Probabilities are well-calibrated (unlike Naive Bayes).

classifier.probabilities("Buy now!")
# => {"Spam" => 0.92, "Ham" => 0.08}


# File 'lib/classifier/logistic_regression.rb', line 130

def probabilities(text)
  fit unless @fitted

  features = text.word_hash
  synchronize do
    softmax(compute_scores(features))
  end
end

#reload ⇒ Object

Reloads the classifier from storage, raising if there are unsaved changes.
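
A sketch of the intended workflow with a configured storage backend (assuming, as the streaming trainer below does, that training marks the classifier dirty):

classifier.save                              # persist current state
classifier.train(spam: "Another spam text")  # leaves unsaved changes
classifier.reload                            # => raises UnsavedChangesError
classifier.reload!                           # discards the unsaved changes and reloads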

Raises:

  • (ArgumentError)


# File 'lib/classifier/logistic_regression.rb', line 288

def reload
  raise ArgumentError, 'No storage configured' unless storage
  raise UnsavedChangesError, 'Unsaved changes would be lost. Call save first or use reload!' if @dirty

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end

#reload! ⇒ Object

Force reloads the classifier from storage, discarding any unsaved changes.

Raises:

  • (ArgumentError)


# File 'lib/classifier/logistic_regression.rb', line 303

def reload!
  raise ArgumentError, 'No storage configured' unless storage

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end

#respond_to_missing?(name, include_private = false) ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/classifier/logistic_regression.rb', line 203

def respond_to_missing?(name, include_private = false)
  !!(name.to_s =~ /train_(\w+)/) || super
end

#save ⇒ Object

Saves the classifier to the configured storage.
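
A sketch, assuming a file-backed storage object whose class resolves to Classifier::Storage::File (as referenced by .load_checkpoint):

classifier.storage = Classifier::Storage::File.new(path: 'spam_model.json')
classifier.save
classifier.dirty? # => false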

Raises:

  • (ArgumentError)


# File 'lib/classifier/logistic_regression.rb', line 250

def save
  raise ArgumentError, 'No storage configured' unless storage

  storage.write(to_json)
  @dirty = false
end

#save_to_file(path) ⇒ Object

Saves the classifier state to a file.
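
Useful for one-off snapshots when no storage backend is configured:

classifier.save_to_file('/tmp/spam_model.json')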



# File 'lib/classifier/logistic_regression.rb', line 260

def save_to_file(path)
  result = File.write(path, to_json)
  @dirty = false
  result
end

#to_json(_options = nil) ⇒ Object

Serializes the classifier state to a JSON string.



# File 'lib/classifier/logistic_regression.rb', line 230

def to_json(_options = nil)
  JSON.generate(as_json)
end

#train(category = nil, text = nil, **categories) ⇒ Object

Trains the classifier with text for a category.

classifier.train(spam: "Buy now!", ham: ["Hello", "Meeting tomorrow"])
classifier.train(:spam, "legacy positional API")


# File 'lib/classifier/logistic_regression.rb', line 87

def train(category = nil, text = nil, **categories)
  return train_single(category, text) if category && text

  categories.each do |cat, texts|
    (texts.is_a?(Array) ? texts : [texts]).each { |t| train_single(cat, t) }
  end
end

#train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block) ⇒ Object

Trains the classifier with an array of documents in batches. Note: The model is NOT automatically fitted after batch training. Call #fit to train the model after adding all data.

Examples:

Positional style

classifier.train_batch(:spam, documents, batch_size: 100)
classifier.fit

Keyword style

classifier.train_batch(spam: documents, ham: other_docs)
classifier.fit


# File 'lib/classifier/logistic_regression.rb', line 405

def train_batch(category = nil, documents = nil, batch_size: Streaming::DEFAULT_BATCH_SIZE, **categories, &block)
  if category && documents
    train_batch_for_category(category, documents, batch_size: batch_size, &block)
  else
    categories.each do |cat, docs|
      train_batch_for_category(cat, Array(docs), batch_size: batch_size, &block)
    end
  end
end

#train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ Object

Trains the classifier from an IO stream. Each line in the stream is treated as a separate document. Note: The model is NOT automatically fitted after streaming. Call #fit to train the model after adding all data.

Examples:

Train from a file

classifier.train_from_stream(:spam, File.open('spam_corpus.txt'))
classifier.fit  # Required to train the model

With progress tracking

classifier.train_from_stream(:spam, io, batch_size: 500) do |progress|
  puts "#{progress.completed} documents processed"
end
classifier.fit

Raises:

  • (StandardError)


# File 'lib/classifier/logistic_regression.rb', line 368

def train_from_stream(category, io, batch_size: Streaming::DEFAULT_BATCH_SIZE)
  category = category.to_s.prepare_category_name
  raise StandardError, "No such category: #{category}" unless @categories.include?(category)

  reader = Streaming::LineReader.new(io, batch_size: batch_size)
  total = reader.estimate_line_count
  progress = Streaming::Progress.new(total: total)

  reader.each_batch do |batch|
    synchronize do
      batch.each do |text|
        features = text.word_hash
        features.each_key { |word| @vocabulary[word] = true }
        @training_data << { category: category, features: features }
      end
      @fitted = false
      @dirty = true
    end
    progress.completed += batch.size
    progress.current_batch += 1
    yield progress if block_given?
  end
end

#weights(category, limit: nil) ⇒ Object

Returns feature weights for a category, sorted by importance. Positive weights indicate the feature supports the category.

classifier.weights(:spam)
# => {:free => 2.3, :buy => 1.8, :money => 1.5, ...}
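
The limit: option keeps only the strongest features:

classifier.weights(:spam, limit: 3)
# => {:free => 2.3, :buy => 1.8, :money => 1.5}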

Raises:

  • (StandardError)


# File 'lib/classifier/logistic_regression.rb', line 158

def weights(category, limit: nil)
  fit unless @fitted

  cat = category.to_s.prepare_category_name
  raise StandardError, "No such category: #{cat}" unless @weights.key?(cat)

  sorted = @weights[cat].sort_by { |_, v| -v.abs }
  sorted = sorted.first(limit) if limit
  sorted.to_h
end