Class: Linguist::Classifier

Inherits:
Object
Defined in:
lib/linguist/classifier.rb

Overview

Bayesian language classifier.

Constant Summary

CLASSIFIER_CONSIDER_BYTES =
50 * 1024

Constructor Details

#initialize(db = {}) ⇒ Classifier

Internal: Initialize a Classifier.



# File 'lib/linguist/classifier.rb', line 88

def initialize(db = {})
  @tokens_total    = db['tokens_total']
  @languages_total = db['languages_total']
  @tokens          = db['tokens']
  @language_tokens = db['language_tokens']
  @languages       = db['languages']
  @unknown_logprob = Math.log(1 / db['tokens_total'].to_f)
end

Class Method Details

.call(blob, possible_languages) ⇒ Object

Public: Use the classifier to detect language of the blob.

blob               - An object that quacks like a blob.
possible_languages - Array of Language objects.

Examples

Classifier.call(FileBlob.new("path/to/file"), [
  Language["Ruby"], Language["Python"]
])

Returns an Array of Language objects, most probable first.



# File 'lib/linguist/classifier.rb', line 20

def self.call(blob, possible_languages)
  language_names = possible_languages.map(&:name)
  classify(Samples.cache, blob.data[0...CLASSIFIER_CONSIDER_BYTES], language_names).map do |name, _|
    Language[name] # Return the actual Language objects
  end
end
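The truncation step in .call can be illustrated in isolation. This is a minimal sketch, not Linguist code: `CONSIDER_BYTES` mirrors `CLASSIFIER_CONSIDER_BYTES`, and the sample strings are made up. Note that Ruby's `String#[]` with a range indexes characters, so for multibyte text the slice can be larger than 50 KiB in bytes; whether that matters depends on the encoding of `blob.data`, which this page does not state.

```ruby
# Mirrors CLASSIFIER_CONSIDER_BYTES (50 KiB); the name here is illustrative.
CONSIDER_BYTES = 50 * 1024

# A long input is cut to the first 51,200 characters.
long_data = "x" * 60_000
long_head = long_data[0...CONSIDER_BYTES]

# A short input is returned unchanged, because the range simply runs past the end.
short_data = "puts 1"
short_head = short_data[0...CONSIDER_BYTES]

puts long_head.length
puts short_head
```

For a strictly byte-based cap, `String#byteslice` would be the analogous call.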

.classify(db, tokens, languages = nil) ⇒ Object

Public: Guess language of data.

db        - Hash of classifier tokens database.
tokens    - Array of tokens or String data to analyze.
languages - Array of language name Strings to restrict to.

Examples

Classifier.classify(db, "def hello; end")
# => [ ['Ruby', 0.90], ['Python', 0.2], ... ]

Returns sorted Array of result pairs. Each pair contains the String language name and a Float score.



# File 'lib/linguist/classifier.rb', line 82

def self.classify(db, tokens, languages = nil)
  languages ||= db['languages'].keys
  new(db).classify(tokens, languages)
end

.train!(db, language, data) ⇒ Object

Public: Train classifier that data is a certain language.

db       - Hash classifier database object.
language - String language of data.
data     - String contents of file.

Examples

Classifier.train!(db, 'Ruby', "def hello; end")

Returns nothing.

Set LINGUIST_DEBUG=1 or =2 to see probabilities per-token or per-language. See also #dump_all_tokens, below.



# File 'lib/linguist/classifier.rb', line 41

def self.train!(db, language, data)
  tokens = data
  tokens = Tokenizer.tokenize(tokens) if tokens.is_a?(String)

  counts = Hash.new(0)
  tokens.each { |tok| counts[tok] += 1 }

  db['tokens_total'] ||= 0
  db['languages_total'] ||= 0
  db['tokens'] ||= {}
  db['language_tokens'] ||= {}
  db['languages'] ||= {}

  counts.each do |token, count|
    db['tokens'][language] ||= {}
    db['tokens'][language][token] ||= 0
    db['tokens'][language][token] += count
    db['language_tokens'][language] ||= 0
    db['language_tokens'][language] += count
    db['tokens_total'] += count
  end
  db['languages'][language] ||= 0
  db['languages'][language] += 1
  db['languages_total'] += 1

  nil
end
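To make the shape of the database concrete, the following sketch replays the same bookkeeping on three tiny samples. It is an illustration under stated assumptions: `train_sketch!` is a hypothetical stand-in for .train!, and it splits on whitespace instead of calling Linguist's Tokenizer, so real token sets will differ.

```ruby
# Hypothetical stand-in for Classifier.train!; whitespace splitting replaces
# Linguist's Tokenizer, so the tokens here are only illustrative.
def train_sketch!(db, language, data)
  tokens = data.is_a?(String) ? data.split : data

  counts = Hash.new(0)
  tokens.each { |tok| counts[tok] += 1 }

  db['tokens_total']    ||= 0
  db['languages_total'] ||= 0
  db['tokens']          ||= {}
  db['language_tokens'] ||= {}
  db['languages']       ||= {}

  counts.each do |token, count|
    db['tokens'][language] ||= {}
    db['tokens'][language][token] ||= 0
    db['tokens'][language][token] += count   # per-language token count
    db['language_tokens'][language] ||= 0
    db['language_tokens'][language] += count # all tokens seen for the language
    db['tokens_total'] += count              # all tokens seen overall
  end
  db['languages'][language] ||= 0
  db['languages'][language] += 1             # number of samples per language
  db['languages_total'] += 1                 # number of samples overall
  nil
end

db = {}
train_sketch!(db, 'Ruby',   'def hello ; end')
train_sketch!(db, 'Ruby',   'def bye ; end')
train_sketch!(db, 'Python', 'def hello ( ) : pass')

p db['languages']       # samples per language
p db['language_tokens'] # tokens per language
p db['tokens']['Ruby']  # token counts for Ruby
```

After these calls the database records two Ruby samples and one Python sample, 8 Ruby tokens and 6 Python tokens, and 14 tokens in total.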

Instance Method Details

#classify(tokens, languages) ⇒ Object

Internal: Guess language of data

tokens    - Array of tokens or String data to analyze.
languages - Array of language name Strings to restrict to.

Returns sorted Array of result pairs. Each pair contains the String language name and a Float score.



# File 'lib/linguist/classifier.rb', line 104

def classify(tokens, languages)
  return [] if tokens.nil? || languages.empty?
  tokens = Tokenizer.tokenize(tokens) if tokens.is_a?(String)
  scores = {}

  debug_dump_all_tokens(tokens, languages) if verbosity >= 2

  counts = Hash.new(0)
  tokens.each { |tok| counts[tok] += 1 }

  languages.each do |language|
    scores[language] = tokens_probability(counts, language) + language_probability(language)
    debug_dump_probabilities(counts, language, scores[language]) if verbosity >= 1
  end

  scores.sort { |a, b| b[1] <=> a[1] }.map { |score| [score[0], score[1]] }
end
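The final ranking step can be sketched on its own. The scores below are made-up values, not output from Linguist. Because each score is a sum of logarithms of probabilities, scores are typically negative, and "most probable first" means the least negative score comes first.

```ruby
# Hypothetical log-domain scores for three candidate languages.
scores = { 'Python' => -14.2, 'Ruby' => -9.7, 'Perl' => -11.3 }

# Same comparison as in #classify: sort descending by the score in slot 1.
# Hash#sort yields [name, score] pairs, so the result is already an Array of pairs.
ranked = scores.sort { |a, b| b[1] <=> a[1] }

best_name, best_score = ranked.first
puts "#{best_name}: #{best_score}"
```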

#language_probability(language) ⇒ Object

Internal: Log-probability of a language occurring - P(C)

language - Language to check.

Returns Float log-probability (0.0 or less).



# File 'lib/linguist/classifier.rb', line 157

def language_probability(language)
  Math.log(@languages[language].to_f / @languages_total.to_f)
end
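A small worked example shows what this prior looks like. The sample counts are hypothetical and the variable names are local to the sketch. The result is the natural logarithm of the share of training samples per language, so more common languages get values closer to zero.

```ruby
# Hypothetical sample counts per language (the 'languages' entry of the db).
languages = { 'Ruby' => 2, 'Python' => 1 }
languages_total = languages.values.sum

# log P(C) for each language: log(samples for C / all samples).
log_prior = languages.transform_values do |n|
  Math.log(n.to_f / languages_total)
end

log_prior.each { |name, lp| puts "#{name}: #{lp.round(4)}" }
```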

#token_probability(token, language) ⇒ Object

Internal: Log-probability of token in language occurring - P(F | C)

token    - String token.
language - Language to check.

Returns Float.



# File 'lib/linguist/classifier.rb', line 142

def token_probability(token, language)
  count = @tokens[language][token]
  if count.nil? || count == 0
    # This is usually the most common case, so we cache the result.
    @unknown_logprob
  else
    Math.log(count.to_f / @language_tokens[language].to_f)
  end
end
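The fallback for unseen tokens can be sketched with a toy database. All counts below are invented, and the lambda is a hypothetical stand-in for the instance method. A token never seen for the language gets the fixed value log(1 / tokens_total), computed once, rather than a per-language estimate.

```ruby
# Toy database values (hypothetical).
tokens          = { 'Ruby' => { 'def' => 2, 'end' => 2 } }
language_tokens = { 'Ruby' => 8 }
tokens_total    = 14

# Precomputed once, like @unknown_logprob in the constructor.
unknown_logprob = Math.log(1 / tokens_total.to_f)

token_logprob = lambda do |token, language|
  count = tokens[language][token]
  if count.nil? || count == 0
    unknown_logprob # unseen token: shared fallback value
  else
    Math.log(count.to_f / language_tokens[language].to_f)
  end
end

seen   = token_logprob.call('def', 'Ruby')    # log(2 / 8)
unseen = token_logprob.call('lambda', 'Ruby') # log(1 / 14)
puts seen, unseen
```

In this toy setup the unseen token scores lower than the seen one, so it pulls the language's total score down without driving it to negative infinity.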

#tokens_probability(counts, language) ⇒ Object

Internal: Log-probability of a set of tokens occurring in a language - P(D | C)

counts   - Hash of String token => Integer occurrence count.
language - Language to check.

Returns Float log-probability.



# File 'lib/linguist/classifier.rb', line 128

def tokens_probability(counts, language)
  sum = 0
  counts.each do |token, count|
    sum += count * token_probability(token, language)
  end
  sum
end
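The count-weighted sum is the log-domain form of multiplying per-token probabilities, with each probability raised to the token's count. The sketch below checks that equivalence numerically; the probabilities are hypothetical and hard-coded rather than looked up in a database. Summing logarithms avoids the floating-point underflow that a long product of small probabilities would cause.

```ruby
# Hypothetical token counts for one document and per-token probabilities
# for one language.
counts = { 'def' => 2, 'end' => 1 }
probs  = { 'def' => 0.25, 'end' => 0.125 }

# Log-domain: sum of count * log(p), as in #tokens_probability.
log_sum = counts.sum { |token, count| count * Math.log(probs[token]) }

# Probability-domain: product of p ** count.
product = counts.reduce(1.0) { |acc, (token, count)| acc * probs[token]**count }

puts log_sum
puts Math.log(product)
```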