Class: Linguist::Classifier

Inherits:
Object
Defined in:
lib/linguist/classifier.rb

Overview

Language content classifier.

Constant Summary

CLASSIFIER_CONSIDER_BYTES = 50 * 1024

Maximum number of bytes to consider for classification. This is only used at evaluation time. During training, full content of samples is used.
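As a rough illustration of what a 50 KiB limit means in practice, the sketch below caps a string at a byte budget before further processing. The names `CONSIDER_BYTES` and `head_for_classification` are hypothetical and not part of Linguist. It uses `String#byteslice`, which always counts bytes, whereas a range index such as `data[0...n]` counts characters when the string has a multibyte encoding.

```ruby
# Hypothetical stand-in for the constant documented above.
CONSIDER_BYTES = 50 * 1024

# Return at most `limit` bytes from the start of `data`.
# Note: byteslice may cut a multibyte character in half, which is
# acceptable for a sketch but leaves an invalid trailing byte sequence.
def head_for_classification(data, limit = CONSIDER_BYTES)
  data.byteslice(0, limit)
end

sample = "x" * (CONSIDER_BYTES + 100)
puts head_for_classification(sample).bytesize
```

Content shorter than the limit passes through unchanged, so only unusually large inputs are affected.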

Constructor Details

#initialize(db = {}) ⇒ Classifier

Internal: Initialize a Classifier.



# File 'lib/linguist/classifier.rb', line 110

def initialize(db = {})
  @vocabulary = db['vocabulary']
  @centroids  = db['centroids']
  @icf = db['icf']
end

Class Method Details

.call(blob, possible_languages) ⇒ Object

Public: Use the classifier to detect language of the blob.

blob - An object that quacks like a blob.
possible_languages - Array of Language objects

Examples

Classifier.call(FileBlob.new("path/to/file"), [
  Language["Ruby"], Language["Python"]
])

Returns an Array of Language objects, most probable first.



# File 'lib/linguist/classifier.rb', line 24

def self.call(blob, possible_languages)
  language_names = possible_languages.map(&:name)
  classify(Samples.cache, blob.data[0...CLASSIFIER_CONSIDER_BYTES], language_names).map do |name, _|
    Language[name] # Return the actual Language objects
  end
end

.classify(db, tokens, languages = nil) ⇒ Object

Public: Guess language of data.

db - Hash of classifier tokens database.
tokens - Array of tokens or String data to analyze.
languages - Array of language name Strings to restrict to.

Examples

Classifier.classify(db, "def hello; end")
# => [ ['Ruby', 0.90], ['Python', 0.2], ... ]

Returns an Array of result pairs sorted by descending score. Each pair contains the String language name and a Float score between 0.0 and 1.0.



# File 'lib/linguist/classifier.rb', line 104

def self.classify(db, tokens, languages = nil)
  languages ||= db['centroids'].keys
  new(db).classify(tokens, languages)
end

.finalize_train!(db) ⇒ Object

Public: Finalize training.

db - Hash classifier database object

Examples

Classifier.finalize_train!(db)

Returns nil.

This method must be called after the last .train! call.



# File 'lib/linguist/classifier.rb', line 75

def self.finalize_train!(db)
  db['vocabulary'] ||= {}

  # Unset hash autoincrement
  db['vocabulary'].default_proc = nil

  db['samples'] ||= []
  filter_vocab_by_freq! db, MIN_DOCUMENT_FREQUENCY
  sort_vocab! db
  db['icf'] = inverse_class_freqs db
  normalize_samples! db
  db['centroids'] = get_centroids db
  db.delete 'samples'
  nil
end
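The "Unset hash autoincrement" step can be illustrated in isolation. The sketch below is not Linguist code; `new_vocabulary` and `seal_vocabulary!` are hypothetical helpers. It shows that, once `default_proc` is set to nil, looking up an unknown token returns nil instead of silently adding a new entry.

```ruby
# Hypothetical helper: a Hash that assigns the next integer index
# to every key it has not seen before.
def new_vocabulary
  vocabulary = {}
  vocabulary.default_proc = proc { |hash, key| hash[key] = hash.length }
  vocabulary
end

# Hypothetical helper: stop the Hash from growing on lookup.
def seal_vocabulary!(vocabulary)
  vocabulary.default_proc = nil
  vocabulary
end

vocabulary = new_vocabulary
vocabulary["def"]   # assigned index 0
vocabulary["end"]   # assigned index 1
seal_vocabulary!(vocabulary)

p vocabulary["unknown"]  # nil, and no entry is created
p vocabulary.size
```

Sealing matters because lookups at classification time should not change the indices the trained data depends on.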

.train!(db, language, data) ⇒ Object

Public: Train the classifier with data known to be in a given language.

db - Hash classifier database object
language - String language of data
data - String contents of file or array of tokens.

Examples

Classifier.train!(db, 'Ruby', "def hello; end")

Returns nil.

Set LINGUIST_DEBUG=1, =2 or =3 to print internal statistics.



# File 'lib/linguist/classifier.rb', line 44

def self.train!(db, language, data)
  tokens = data
  tokens = Tokenizer.tokenize(tokens) if tokens.is_a?(String)

  db['vocabulary'] ||= {}
  # Set hash to autoincremented index value
  if db['vocabulary'].default_proc.nil?
    db['vocabulary'].default_proc = proc do |hash, key|
      hash[key] = hash.length
    end
  end

  db['samples'] ||= {}
  db['samples'][language] ||= []

  termfreq = to_vocabulary_index_termfreq(db['vocabulary'], tokens)
  db['samples'][language] << termfreq

  nil
end
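The body of `to_vocabulary_index_termfreq` is not shown on this page. The following is only a guess at the general idea: tokens are mapped to integer indices through the auto-incrementing vocabulary, and occurrences are counted per index. `term_frequencies` is a hypothetical name, and the real helper may differ.

```ruby
# Hypothetical helper: count how often each token occurs, keyed by the
# token's vocabulary index. New tokens receive the next free index
# through the vocabulary's default_proc.
def term_frequencies(vocabulary, tokens)
  tokens.each_with_object(Hash.new(0)) do |token, counts|
    counts[vocabulary[token]] += 1
  end
end

vocabulary = {}
vocabulary.default_proc = proc { |hash, key| hash[key] = hash.length }

p term_frequencies(vocabulary, %w[def hello end def])
p vocabulary
```

Storing integer indices instead of token strings keeps each sample compact and lets every sample share one vocabulary.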

Instance Method Details

#classify(tokens, languages) ⇒ Object

Internal: Guess language of data.

tokens - Array of tokens or String data to analyze.
languages - Array of language name Strings to restrict to.

Returns an Array of result pairs sorted by descending score. Each pair contains the String language name and a Float score between 0.0 and 1.0. Languages whose score is not greater than 0.0 are omitted.



# File 'lib/linguist/classifier.rb', line 123

def classify(tokens, languages)
  return [] if tokens.nil? || languages.empty?
  tokens = Tokenizer.tokenize(tokens) if tokens.is_a?(String)

  debug_dump_tokens(tokens) if verbosity >= 3

  vec = Classifier.to_vocabulary_index_termfreq_gaps(@vocabulary, tokens)
  vec.each do |idx, freq|
    tf = 1.0 + Math.log(freq)
    vec[idx] = tf * @icf[idx]
  end
  return [] if vec.empty?
  Classifier.l2_normalize!(vec)

  scores = {}
  languages.each do |language|
    centroid = @centroids[language]
    score = Classifier.similarity(vec, centroid)
    if score > 0.0
      scores[language] = score
    end
  end
  scores = scores.sort_by { |x| -x[1] }
  debug_dump_all_tokens(tokens, scores) if verbosity >= 2
  debug_dump_scores(scores) if verbosity >= 1
  scores
end
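The scoring steps in the method above can be sketched on toy data without Linguist. This is an illustration under assumptions: vectors are sparse Hashes of index to weight, `Classifier.similarity` is assumed to behave like a dot product of unit-length vectors, and all helper names and numbers below are invented.

```ruby
# Scale a sparse vector (Hash of index => weight) to unit length.
def l2_normalize(vec)
  norm = Math.sqrt(vec.values.sum { |v| v * v })
  return vec if norm.zero?
  vec.transform_values { |v| v / norm }
end

# Dot product of two sparse vectors; an assumed stand-in for
# Classifier.similarity.
def dot(a, b)
  a.sum { |idx, v| v * b.fetch(idx, 0.0) }
end

# Rank languages for a Hash of vocabulary index => raw term count.
def rank_languages(term_counts, icf, centroids)
  # Log-scaled term frequency times inverse class frequency.
  weighted = term_counts.to_h do |idx, freq|
    [idx, (1.0 + Math.log(freq)) * icf[idx]]
  end
  return [] if weighted.empty?

  query = l2_normalize(weighted)
  centroids.map { |lang, centroid| [lang, dot(query, centroid)] }
           .select { |_, score| score > 0.0 }
           .sort_by { |_, score| -score }
end

# Toy data, invented for illustration only.
icf = { 0 => 1.0, 1 => 2.0, 2 => 0.5, 3 => 1.0 }
centroids = {
  "Ruby"   => l2_normalize({ 0 => 1.0, 1 => 1.0 }),
  "Python" => l2_normalize({ 2 => 1.0 }),
  "C"      => l2_normalize({ 3 => 1.0 })
}

rank_languages({ 0 => 2, 2 => 1 }, icf, centroids).each do |lang, score|
  puts format("%-6s %.3f", lang, score)
end
```

With these inputs, "C" shares no index with the query and is therefore dropped, which mirrors the `score > 0.0` filter in the method.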