Class: Classifier::TFIDF
- Includes:
- Streaming
- Defined in:
- lib/classifier/tfidf.rb
Overview
TF-IDF vectorizer: transforms text to weighted feature vectors. Downweights common words, upweights discriminative terms.
Example:
tfidf = Classifier::TFIDF.new
tfidf.fit(["Dogs are great pets", "Cats are independent"])
tfidf.transform("Dogs are loyal") # => {:dog=>0.7071..., :loyal=>0.7071...}
Constant Summary
Constants included from Streaming
Instance Attribute Summary
- #idf ⇒ Object (readonly)
  Returns the value of attribute idf.
- #num_documents ⇒ Object (readonly)
  Returns the value of attribute num_documents.
- #storage ⇒ Object
  Returns the value of attribute storage.
- #vocabulary ⇒ Object (readonly)
  Returns the value of attribute vocabulary.
Class Method Summary
- .from_json(json) ⇒ Object
  Loads a vectorizer from JSON.
- .load(storage:) ⇒ Object
  Loads a vectorizer from the configured storage.
- .load_checkpoint(storage:, checkpoint_id:) ⇒ Object
  Loads a vectorizer from a checkpoint.
- .load_from_file(path) ⇒ Object
  Loads a vectorizer from a file.
Instance Method Summary
- #as_json(_options = nil) ⇒ Object
- #dirty? ⇒ Boolean
  Returns true if there are unsaved changes.
- #feature_names ⇒ Object
  Returns vocabulary terms in index order.
- #fit(documents) ⇒ Object
  Learns vocabulary and IDF weights from the corpus.
- #fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ Object
  Fits the vectorizer from an IO stream.
- #fit_transform(documents) ⇒ Object
  Fits and transforms in one step.
- #fitted? ⇒ Boolean
- #initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false) ⇒ TFIDF (constructor)
  Creates a new TF-IDF vectorizer.
- #marshal_dump ⇒ Object
- #marshal_load(data) ⇒ Object
- #reload ⇒ Object
  Reloads the vectorizer from storage, raising if there are unsaved changes.
- #reload! ⇒ Object
  Force reloads the vectorizer from storage, discarding any unsaved changes.
- #save ⇒ Object
  Saves the vectorizer to the configured storage.
- #save_to_file(path) ⇒ Object
  Saves the vectorizer state to a file.
- #to_json(_options = nil) ⇒ Object
- #train_batch ⇒ Object
  TFIDF doesn’t support train_batch (use fit instead).
- #train_from_stream ⇒ Object
  TFIDF doesn’t support train_from_stream (use fit_from_stream instead).
- #transform(document) ⇒ Object
  Transforms a document into a normalized TF-IDF vector.
Methods included from Streaming
#delete_checkpoint, #list_checkpoints, #save_checkpoint
Constructor Details
#initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false) ⇒ TFIDF
Creates a new TF-IDF vectorizer.
- min_df/max_df: filter terms by document frequency (Integer for an absolute count, Float for a proportion of the corpus)
- ngram_range: [1, 1] for unigrams, [1, 2] for unigrams and bigrams
- sublinear_tf: use 1 + log(tf) instead of raw term frequency
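The df-bounds check itself is internal, but the documented contract above can be sketched in plain Ruby. Note: within_bounds? here is a hypothetical stand-in for the gem's private within_df_bounds? helper, and the ceil/floor rounding is an assumption.

```ruby
# Hedged sketch of the min_df/max_df contract: an Integer is an absolute
# document count, a Float a proportion of the corpus. The rounding choice
# (ceil for the lower bound, floor for the upper) is an assumption.
def within_bounds?(df, num_documents, min_df, max_df)
  min = min_df.is_a?(Float) ? (min_df * num_documents).ceil : min_df
  max = max_df.is_a?(Float) ? (max_df * num_documents).floor : max_df
  df >= min && df <= max
end

within_bounds?(3, 10, 2, 0.5) # term in 3 of 10 docs, bounds [2, 5]
within_bounds?(9, 10, 2, 0.5) # term in 9 of 10 docs exceeds max_df 0.5
```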
# File 'lib/classifier/tfidf.rb', line 42

def initialize(min_df: 1, max_df: 1.0, ngram_range: [1, 1], sublinear_tf: false)
  validate_df!(min_df, 'min_df')
  validate_df!(max_df, 'max_df')
  validate_ngram_range!(ngram_range)

  @min_df = min_df
  @max_df = max_df
  @ngram_range = ngram_range
  @sublinear_tf = sublinear_tf
  @vocabulary = {}
  @idf = {}
  @num_documents = 0
  @fitted = false
  @dirty = false
  @storage = nil
end
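To illustrate what ngram_range: [1, 2] implies for term extraction, here is a standalone sketch. The real tokenizer lives in the gem's internals (extract_terms), so the tokenization and the underscore-joined symbol form are assumptions for illustration only.

```ruby
# Hypothetical sketch of n-gram extraction for ngram_range: [1, 2]:
# every contiguous run of 1..2 tokens becomes a term.
tokens = %w[dogs are loyal]
ngram_range = [1, 2]

terms = (ngram_range[0]..ngram_range[1]).flat_map do |n|
  tokens.each_cons(n).map { |gram| gram.join('_').to_sym }
end
# => [:dogs, :are, :loyal, :dogs_are, :are_loyal]
```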
Instance Attribute Details
#idf ⇒ Object (readonly)
Returns the value of attribute idf.
# File 'lib/classifier/tfidf.rb', line 32

def idf
  @idf
end
#num_documents ⇒ Object (readonly)
Returns the value of attribute num_documents.
# File 'lib/classifier/tfidf.rb', line 32

def num_documents
  @num_documents
end
#storage ⇒ Object
Returns the value of attribute storage.
# File 'lib/classifier/tfidf.rb', line 33

def storage
  @storage
end
#vocabulary ⇒ Object (readonly)
Returns the value of attribute vocabulary.
# File 'lib/classifier/tfidf.rb', line 32

def vocabulary
  @vocabulary
end
Class Method Details
.from_json(json) ⇒ Object
Loads a vectorizer from JSON.
# File 'lib/classifier/tfidf.rb', line 218

def self.from_json(json)
  data = json.is_a?(String) ? JSON.parse(json) : json
  raise ArgumentError, "Invalid vectorizer type: #{data['type']}" unless data['type'] == 'tfidf'

  instance = new(
    min_df: data['min_df'],
    max_df: data['max_df'],
    ngram_range: data['ngram_range'],
    sublinear_tf: data['sublinear_tf']
  )
  instance.instance_variable_set(:@vocabulary, symbolize_keys(data['vocabulary']))
  instance.instance_variable_set(:@idf, symbolize_keys(data['idf']))
  instance.instance_variable_set(:@num_documents, data['num_documents'])
  instance.instance_variable_set(:@fitted, data['fitted'])
  instance.instance_variable_set(:@dirty, false)
  instance.instance_variable_set(:@storage, nil)
  instance
end
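The symbolize_keys calls exist because JSON round-trips turn symbol keys into strings. A standalone sketch of the re-symbolizing step (transform_keys stands in for the gem's private symbolize_keys helper):

```ruby
require 'json'

# Serializing symbol-keyed hashes to JSON stringifies the keys, so the
# loader must convert them back before the vectorizer can use them.
state = { vocabulary: { dog: 0, loyal: 1 }, idf: { dog: 1.69, loyal: 1.69 } }
parsed = JSON.parse(JSON.generate(state))

parsed['vocabulary'].keys # => ["dog", "loyal"] (strings after the round trip)
vocabulary = parsed['vocabulary'].transform_keys(&:to_sym)
vocabulary.keys # => [:dog, :loyal]
```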
.load(storage:) ⇒ Object
Loads a vectorizer from the configured storage.
# File 'lib/classifier/tfidf.rb', line 153

def self.load(storage:)
  data = storage.read
  raise StorageError, 'No saved state found' unless data

  instance = from_json(data)
  instance.storage = storage
  instance
end
.load_checkpoint(storage:, checkpoint_id:) ⇒ Object
Loads a vectorizer from a checkpoint.
# File 'lib/classifier/tfidf.rb', line 254

def self.load_checkpoint(storage:, checkpoint_id:)
  raise ArgumentError, 'Storage must be File storage for checkpoints' unless storage.is_a?(Storage::File)

  dir = File.dirname(storage.path)
  base = File.basename(storage.path, '.*')
  ext = File.extname(storage.path)
  checkpoint_path = File.join(dir, "#{base}_checkpoint_#{checkpoint_id}#{ext}")

  checkpoint_storage = Storage::File.new(path: checkpoint_path)
  instance = load(storage: checkpoint_storage)
  instance.storage = storage
  instance
end
.load_from_file(path) ⇒ Object
Loads a vectorizer from a file.
# File 'lib/classifier/tfidf.rb', line 164

def self.load_from_file(path)
  from_json(File.read(path))
end
Instance Method Details
#as_json(_options = nil) ⇒ Object
# File 'lib/classifier/tfidf.rb', line 196

def as_json(_options = nil)
  {
    version: 1,
    type: 'tfidf',
    min_df: @min_df,
    max_df: @max_df,
    ngram_range: @ngram_range,
    sublinear_tf: @sublinear_tf,
    vocabulary: @vocabulary,
    idf: @idf,
    num_documents: @num_documents,
    fitted: @fitted
  }
end
#dirty? ⇒ Boolean
Returns true if there are unsaved changes.
# File 'lib/classifier/tfidf.rb', line 130

def dirty?
  @dirty
end
#feature_names ⇒ Object
Returns vocabulary terms in index order.
# File 'lib/classifier/tfidf.rb', line 119

def feature_names
  @vocabulary.keys.sort_by { |term| @vocabulary[term] }
end
#fit(documents) ⇒ Object
Learns vocabulary and IDF weights from the corpus.
# File 'lib/classifier/tfidf.rb', line 61

def fit(documents)
  raise ArgumentError, 'documents must be an array' unless documents.is_a?(Array)
  raise ArgumentError, 'documents cannot be empty' if documents.empty?

  @num_documents = documents.size
  document_frequencies = Hash.new(0)

  documents.each do |doc|
    terms = extract_terms(doc)
    terms.each_key { |term| document_frequencies[term] += 1 }
  end

  @vocabulary = {}
  @idf = {}
  vocab_index = 0

  document_frequencies.each do |term, df|
    next unless within_df_bounds?(df, @num_documents)

    @vocabulary[term] = vocab_index
    vocab_index += 1
    # IDF: log((N + 1) / (df + 1)) + 1
    @idf[term] = Math.log((@num_documents + 1).to_f / (df + 1)) + 1
  end

  @fitted = true
  @dirty = true
  self
end
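The smoothed IDF formula in fit can be checked in isolation. A standalone sketch with pre-tokenized documents (the gem's own tokenization is skipped here):

```ruby
# Standalone check of the smoothed IDF used by #fit:
# idf(term) = log((N + 1) / (df + 1)) + 1, where N is the corpus size
# and df the number of documents containing the term. The +1 smoothing
# keeps every weight positive and finite.
docs = [%w[dog cat], %w[dog bird], %w[fish]]
n = docs.size

df = Hash.new(0)
docs.each { |terms| terms.uniq.each { |t| df[t] += 1 } }

idf = df.transform_values { |d| Math.log((n + 1).to_f / (d + 1)) + 1 }

idf['dog']  # in 2 of 3 documents: log(4/3) + 1
idf['fish'] # in 1 of 3 documents: log(4/2) + 1, a higher weight
```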
#fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE) ⇒ Object
Fits the vectorizer from an IO stream. Collects all documents from the stream, then fits the model. Note: All documents must be collected in memory for IDF calculation.
# File 'lib/classifier/tfidf.rb', line 281

def fit_from_stream(io, batch_size: Streaming::DEFAULT_BATCH_SIZE)
  reader = Streaming::LineReader.new(io, batch_size: batch_size)
  total = reader.estimate_line_count
  progress = Streaming::Progress.new(total: total)

  documents = [] #: Array[String]

  reader.each_batch do |batch|
    documents.concat(batch)
    progress.completed += batch.size
    progress.current_batch += 1
    yield progress if block_given?
  end

  fit(documents) unless documents.empty?
  self
end
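Because IDF needs global document frequencies, the stream is drained into memory before fitting. A plain-Ruby sketch of that collect-in-batches pattern (Streaming::LineReader is internal; stdlib each_slice stands in for it here):

```ruby
require 'stringio'

# Read an IO one line per document, in batches, accumulating everything
# before a single fit pass -- the same shape as #fit_from_stream.
io = StringIO.new("Dogs are great pets\nCats are independent\nBirds sing\n")
batch_size = 2

documents = []
io.each_line.each_slice(batch_size) do |batch|
  documents.concat(batch.map(&:chomp))
  # A progress callback would be yielded here per batch.
end

documents.size # => 3
```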
#fit_transform(documents) ⇒ Object
Fits and transforms in one step.
# File 'lib/classifier/tfidf.rb', line 112

def fit_transform(documents)
  fit(documents)
  documents.map { |doc| transform(doc) }
end
#fitted? ⇒ Boolean
# File 'lib/classifier/tfidf.rb', line 124

def fitted?
  @fitted
end
#marshal_dump ⇒ Object
# File 'lib/classifier/tfidf.rb', line 240

def marshal_dump
  [@min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted]
end
#marshal_load(data) ⇒ Object
# File 'lib/classifier/tfidf.rb', line 245

def marshal_load(data)
  @min_df, @max_df, @ngram_range, @sublinear_tf, @vocabulary, @idf, @num_documents, @fitted = data
  @dirty = false
  @storage = nil
end
#reload ⇒ Object
Reloads the vectorizer from storage, raising if there are unsaved changes.
# File 'lib/classifier/tfidf.rb', line 170

def reload
  raise ArgumentError, 'No storage configured' unless storage
  raise UnsavedChangesError, 'Unsaved changes would be lost. Call save first or use reload!' if @dirty

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end
#reload! ⇒ Object
Force reloads the vectorizer from storage, discarding any unsaved changes.
# File 'lib/classifier/tfidf.rb', line 184

def reload!
  raise ArgumentError, 'No storage configured' unless storage

  data = storage.read
  raise StorageError, 'No saved state found' unless data

  restore_from_json(data)
  @dirty = false
  self
end
#save ⇒ Object
Saves the vectorizer to the configured storage.
# File 'lib/classifier/tfidf.rb', line 136

def save
  raise ArgumentError, 'No storage configured' unless storage

  storage.write(to_json)
  @dirty = false
end
#save_to_file(path) ⇒ Object
Saves the vectorizer state to a file.
# File 'lib/classifier/tfidf.rb', line 145

def save_to_file(path)
  result = File.write(path, to_json)
  @dirty = false
  result
end
#to_json(_options = nil) ⇒ Object
# File 'lib/classifier/tfidf.rb', line 212

def to_json(_options = nil)
  JSON.generate(as_json)
end
#train_batch ⇒ Object
TFIDF doesn’t support train_batch (use fit instead). This method raises NotImplementedError with guidance.
# File 'lib/classifier/tfidf.rb', line 311

def train_batch(*) # steep:ignore
  raise NotImplementedError, 'TFIDF uses fit instead of train_batch'
end
#train_from_stream ⇒ Object
TFIDF doesn’t support train_from_stream (use fit_from_stream instead). This method raises NotImplementedError with guidance.
# File 'lib/classifier/tfidf.rb', line 303

def train_from_stream(*) # steep:ignore
  raise NotImplementedError, 'TFIDF uses fit_from_stream instead of train_from_stream'
end
#transform(document) ⇒ Object
Transforms a document into a normalized TF-IDF vector.
# File 'lib/classifier/tfidf.rb', line 94

def transform(document)
  raise NotFittedError, 'TFIDF has not been fitted. Call fit first.' unless @fitted

  terms = extract_terms(document)
  result = {} #: Hash[Symbol, Float]

  terms.each do |term, tf|
    next unless @vocabulary.key?(term)

    tf_value = @sublinear_tf && tf.positive? ? 1 + Math.log(tf) : tf.to_f
    result[term] = (tf_value * @idf[term]).to_f
  end

  normalize_vector(result)
end
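The weighting pipeline in transform (optional sublinear tf, multiply by IDF, then L2-normalize, as normalize_vector presumably does) can be traced standalone. The term frequencies and IDF weights below are made-up example values, not output from a fitted model:

```ruby
# Standalone trace of the #transform weighting with sublinear_tf: true.
tf  = { dog: 2, loyal: 1 }
idf = { dog: 1.1, loyal: 1.7 } # example weights, not from a real corpus
sublinear_tf = true

# Step 1: optional sublinear term frequency, 1 + log(tf).
raw = tf.transform_values { |t| sublinear_tf && t.positive? ? 1 + Math.log(t) : t.to_f }

# Step 2: scale each term frequency by its IDF weight.
weighted = raw.to_h { |term, w| [term, w * idf[term]] }

# Step 3: L2-normalize so the vector has unit length.
norm = Math.sqrt(weighted.values.sum { |v| v**2 })
vector = weighted.transform_values { |v| v / norm }

vector.values.sum { |v| v**2 } # ~= 1.0 (unit length)
```

Normalizing to unit length makes cosine similarity between transformed documents a plain dot product.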