Class: Linguist::Classifier
- Inherits: Object
  - Object
    - Linguist::Classifier
- Defined in: lib/linguist/classifier.rb
Overview
Language content classifier.
Constant Summary
- CLASSIFIER_CONSIDER_BYTES = 50 * 1024
  Maximum number of bytes to consider for classification. This limit applies only at evaluation time. During training, the full content of each sample is used.
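As a rough illustration of the cap, the sketch below applies the same slicing expression that .call uses on blob.data. CONSIDER_BYTES is a local stand-in for the constant and the input is invented.

```ruby
# Local stand-in for Linguist::Classifier::CLASSIFIER_CONSIDER_BYTES.
CONSIDER_BYTES = 50 * 1024

# Hypothetical input: 60 KiB of single-byte characters.
data = "a" * (60 * 1024)

# Same slicing expression that .call applies to blob.data.
# String#[] with a range counts characters, so the result is exactly
# CONSIDER_BYTES bytes only when every character is a single byte.
head = data[0...CONSIDER_BYTES]

puts head.bytesize  # 51200
```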
Class Method Summary
- .call(blob, possible_languages) ⇒ Object
  Public: Use the classifier to detect language of the blob.
- .classify(db, tokens, languages = nil) ⇒ Object
  Public: Guess language of data.
- .finalize_train!(db) ⇒ Object
  Public: Finalize training.
- .train!(db, language, data) ⇒ Object
  Public: Train classifier that data is a certain language.
Instance Method Summary
- #classify(tokens, languages) ⇒ Object
  Internal: Guess language of data.
- #initialize(db = {}) ⇒ Classifier (constructor)
  Internal: Initialize a Classifier.
Constructor Details
#initialize(db = {}) ⇒ Classifier
Internal: Initialize a Classifier.
# File 'lib/linguist/classifier.rb', line 110
def initialize(db = {})
  @vocabulary = db['vocabulary']
  @centroids = db['centroids']
  @icf = db['icf']
end
Class Method Details
.call(blob, possible_languages) ⇒ Object
Public: Use the classifier to detect language of the blob.
blob - An object that quacks like a blob.
possible_languages - Array of Language objects.
Examples
Classifier.call(FileBlob.new("path/to/file"), [
Language["Ruby"], Language["Python"]
])
Returns an Array of Language objects, most probable first.
# File 'lib/linguist/classifier.rb', line 24
def self.call(blob, possible_languages)
  language_names = possible_languages.map(&:name)
  classify(Samples.cache, blob.data[0...CLASSIFIER_CONSIDER_BYTES], language_names).map do |name, _|
    Language[name] # Return the actual Language objects
  end
end
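The name round trip in .call (Language objects to names, names to scored pairs, pairs back to Language objects) can be sketched without loading Linguist. Everything below is a hypothetical stand-in: Lang replaces Linguist::Language, LANGS replaces the Language[name] lookup, and stub_classify returns fixed scores instead of running the real classifier.

```ruby
# Hypothetical stand-in for Linguist::Language; only #name is needed here.
Lang = Struct.new(:name)

# Hypothetical registry playing the role of Language[name].
LANGS = {
  "Ruby"   => Lang.new("Ruby"),
  "Python" => Lang.new("Python")
}

# Hypothetical classifier stub: returns [name, score] pairs, best first.
# The scores are fixed and invented; the real ones come from .classify.
def stub_classify(_data, names)
  fixed = { "Ruby" => 0.9, "Python" => 0.2 }
  names.map { |name| [name, fixed.fetch(name, 0.0)] }
       .select { |_, score| score > 0.0 }
       .sort_by { |_, score| -score }
end

# Same shape as .call: names in, Language-like objects out.
def detect(data, possible_languages)
  names = possible_languages.map(&:name)
  stub_classify(data, names).map { |name, _| LANGS[name] }
end

result = detect("def hello; end", [LANGS["Python"], LANGS["Ruby"]])
puts result.map(&:name).inspect  # ["Ruby", "Python"]
```

The order of the returned objects follows the scores, not the order of possible_languages.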
.classify(db, tokens, languages = nil) ⇒ Object
Public: Guess language of data.
db - Hash of classifier tokens database.
tokens - Array of tokens or String data to analyze.
languages - Array of language name Strings to restrict to (default: all languages with a centroid in db).
Examples
Classifier.classify(db, "def hello; end")
# => [['Ruby', 0.90], ['Python', 0.2], ...]
Returns sorted Array of result pairs. Each pair contains the String language name and a Float score between 0.0 and 1.0.
# File 'lib/linguist/classifier.rb', line 104
def self.classify(db, tokens, languages = nil)
  languages ||= db['centroids'].keys
  new(db).classify(tokens, languages)
end
.finalize_train!(db) ⇒ Object
Public: Finalize training.
db - Hash classifier database object
Examples
Classifier.finalize_train!(db)
Returns nil.
This method must be called after the last .train! call.
# File 'lib/linguist/classifier.rb', line 75
def self.finalize_train!(db)
  db['vocabulary'] ||= {}

  # Unset hash autoincrement
  db['vocabulary'].default_proc = nil

  db['samples'] ||= []
  filter_vocab_by_freq! db, MIN_DOCUMENT_FREQUENCY
  sort_vocab! db
  db['icf'] = inverse_class_freqs db
  normalize_samples! db
  db['centroids'] = get_centroids db
  db.delete 'samples'
  nil
end
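The helpers called by finalize_train! are not shown on this page, so the following is only a sketch of what an inverse class frequency table and per-language centroids could look like. Both formulas are assumptions (a common log-ratio weighting and an element-wise mean), sketch_icf and sketch_centroid are hypothetical names, and the vocabulary filtering, sorting, and sample normalization steps are skipped.

```ruby
# Toy training data: language => list of sparse term-frequency vectors
# (vocabulary index => count). The shape is inferred from train!;
# the values are invented.
samples = {
  "Ruby"   => [{ 0 => 2, 1 => 1 }, { 0 => 1, 2 => 1 }],
  "Python" => [{ 0 => 1, 3 => 2 }]
}
vocab_size = 4

# ASSUMPTION: icf = log(classes / classes containing the term) + 1.
# Not confirmed by the source shown above.
def sketch_icf(samples, vocab_size)
  counts = Array.new(vocab_size, 0)
  samples.each_value do |vectors|
    vectors.flat_map(&:keys).uniq.each { |idx| counts[idx] += 1 }
  end
  counts.map { |c| c.zero? ? 0.0 : Math.log(samples.size.to_f / c) + 1.0 }
end

# ASSUMPTION: a centroid is the element-wise mean of a language's vectors.
def sketch_centroid(vectors)
  sum = Hash.new(0.0)
  vectors.each { |vec| vec.each { |idx, weight| sum[idx] += weight } }
  sum.transform_values { |weight| weight / vectors.size }
end

icf = sketch_icf(samples, vocab_size)
centroids = samples.transform_values { |vectors| sketch_centroid(vectors) }

# Term 0 appears in both languages, so it gets the lowest weight.
puts icf.inspect
puts centroids.inspect
```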
.train!(db, language, data) ⇒ Object
Public: Train classifier that data is a certain language.
db - Hash classifier database object.
language - String language of data.
data - String contents of file or Array of tokens.
Examples
Classifier.train!(db, 'Ruby', "def hello; end")
Returns nil.
Set LINGUIST_DEBUG to 1, 2, or 3 to print internal statistics.
# File 'lib/linguist/classifier.rb', line 44
def self.train!(db, language, data)
  tokens = data
  tokens = Tokenizer.tokenize(tokens) if tokens.is_a?(String)

  db['vocabulary'] ||= {}
  # Set hash to autoincremented index value
  if db['vocabulary'].default_proc.nil?
    db['vocabulary'].default_proc = proc do |hash, key|
      hash[key] = hash.length
    end
  end

  db['samples'] ||= {}
  db['samples'][language] ||= []

  termfreq = to_vocabulary_index_termfreq(db['vocabulary'], tokens)
  db['samples'][language] << termfreq

  nil
end
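The autoincrementing vocabulary is the least obvious part of train!, so here is a minimal, self-contained sketch of that one mechanism. The token list is invented, and the counting loop is only a stand-in for to_vocabulary_index_termfreq, whose body is not shown on this page.

```ruby
# Vocabulary hash that assigns the next free index to any unseen token,
# mirroring the default_proc that train! installs.
vocabulary = {}
vocabulary.default_proc = proc { |hash, key| hash[key] = hash.length }

# Hypothetical token list; the real tokens come from Tokenizer.tokenize.
tokens = %w[def hello end def]

# Count occurrences per vocabulary index (stand-in for
# to_vocabulary_index_termfreq).
termfreq = Hash.new(0)
tokens.each { |token| termfreq[vocabulary[token]] += 1 }

# vocabulary now maps "def", "hello", "end" to 0, 1, 2,
# and termfreq holds 2, 1, 1 for those indices.
puts vocabulary.inspect
puts termfreq.inspect

# finalize_train! clears the default_proc, so looking up an unknown
# token afterwards returns nil instead of growing the vocabulary.
vocabulary.default_proc = nil
unknown = vocabulary["class"]
puts unknown.inspect
```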
Instance Method Details
#classify(tokens, languages) ⇒ Object
Internal: Guess language of data.
tokens - Array of tokens or String data to analyze.
languages - Array of language name Strings to restrict to.
Returns sorted Array of result pairs. Each pair contains the String language name and a Float score between 0.0 and 1.0.
# File 'lib/linguist/classifier.rb', line 123
def classify(tokens, languages)
  return [] if tokens.nil? || languages.empty?
  tokens = Tokenizer.tokenize(tokens) if tokens.is_a?(String)

  debug_dump_tokens(tokens) if verbosity >= 3

  vec = Classifier.to_vocabulary_index_termfreq_gaps(@vocabulary, tokens)
  vec.each do |idx, freq|
    tf = 1.0 + Math.log(freq)
    vec[idx] = tf * @icf[idx]
  end
  return [] if vec.empty?
  Classifier.l2_normalize!(vec)

  scores = {}
  languages.each do |language|
    centroid = @centroids[language]
    score = Classifier.similarity(vec, centroid)
    if score > 0.0
      scores[language] = score
    end
  end
  scores = scores.sort_by { |x| -x[1] }
  debug_dump_all_tokens(tokens, scores) if verbosity >= 2
  debug_dump_scores(scores) if verbosity >= 1
  scores
end
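The scoring steps in #classify (log-scaled term frequency times icf, L2 normalization, similarity against each centroid, positive scores sorted descending) can be tried on a toy model. This is a sketch under assumptions: l2_normalize and similarity are stand-ins for Classifier.l2_normalize! and Classifier.similarity, whose bodies are not shown here; similarity is assumed to be a sparse dot product; and the icf and centroid values are invented.

```ruby
# Toy model; every value is invented for illustration.
icf = [1.0, 1.7, 1.7, 1.7]            # weight per vocabulary index
centroids = {
  "Ruby"   => { 0 => 0.6, 1 => 0.8 }, # assumed unit-length sparse vectors
  "Python" => { 0 => 0.6, 2 => 0.8 },
  "C"      => { 3 => 1.0 }
}

# Stand-in for Classifier.l2_normalize! (returns a new hash instead).
def l2_normalize(vec)
  norm = Math.sqrt(vec.values.sum { |w| w * w })
  vec.transform_values { |w| w / norm }
end

# ASSUMPTION: similarity is a sparse dot product, which equals cosine
# similarity when both vectors have unit length.
def similarity(a, b)
  a.sum { |idx, w| w * b.fetch(idx, 0.0) }
end

def score(termfreq, icf, centroids, languages)
  # tf = 1 + ln(freq), weighted by the inverse class frequency.
  weighted = termfreq.to_h { |idx, freq| [idx, (1.0 + Math.log(freq)) * icf[idx]] }
  return [] if weighted.empty?

  vec = l2_normalize(weighted)
  languages.map { |lang| [lang, similarity(vec, centroids[lang])] }
           .select { |_, s| s > 0.0 }   # drop languages with no overlap
           .sort_by { |_, s| -s }       # most probable first
end

# Term 0 seen twice, term 1 seen once.
result = score({ 0 => 2, 1 => 1 }, icf, centroids, %w[Ruby Python C])
puts result.inspect
```

With these numbers "Ruby" ranks above "Python", and "C" is dropped because it shares no terms with the input.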