Class: Linguist::Heuristics

Inherits:
Object
Defined in:
lib/linguist/heuristics.rb

Overview

A collection of simple heuristics that can be used to better analyze languages.

Constant Summary

HEURISTICS_CONSIDER_BYTES =
50 * 1024

Constructor Details

#initialize(exts, rules) ⇒ Heuristics

Internal



# File 'lib/linguist/heuristics.rb', line 101

def initialize(exts, rules)
  @exts = exts
  @rules = rules
end

Class Method Details

.all ⇒ Object

Public: Get all heuristic definitions

Returns an Array of heuristic objects.



# File 'lib/linguist/heuristics.rb', line 40

def self.all
  self.load()
  @heuristics
end

.call(blob, candidates) ⇒ Object

Public: Use heuristics to detect language of the blob.

blob       - An object that quacks like a blob.
candidates - Array of Language objects.

Examples

Heuristics.call(FileBlob.new("path/to/file"), [
  Language["Ruby"], Language["Python"]
])

Returns an Array of languages, or an empty Array if no heuristic matched or the matching heuristic was inconclusive.



# File 'lib/linguist/heuristics.rb', line 20

def self.call(blob, candidates)
  return [] if blob.symlink?
  self.load()

  data = blob.data[0...HEURISTICS_CONSIDER_BYTES]

  @heuristics.each do |heuristic|
    if heuristic.matches?(blob.name, candidates)
      return Array(heuristic.call(data))
    end
  end

  [] # No heuristics matched
rescue Regexp::TimeoutError
  [] # Return nothing if we have a bad regexp which leads to a timeout enforced by Regexp.timeout in Ruby 3.2 or later
end
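The control flow of `.call` can be tried in isolation with the sketch below. `FakeBlob`, `FakeHeuristic`, `dispatch`, and `CONSIDER_BYTES` are hypothetical stand-ins that are not part of Linguist; they only mimic the methods that `.call` uses (`symlink?`, `data`, `name`, `matches?`, `call`).

```ruby
CONSIDER_BYTES = 50 * 1024 # same value as HEURISTICS_CONSIDER_BYTES

# Hypothetical blob: only the three methods .call relies on.
FakeBlob = Struct.new(:name, :data, :symlink) do
  def symlink?
    symlink
  end
end

# Hypothetical heuristic: matches on one extension, returns a fixed result.
class FakeHeuristic
  def initialize(ext, result)
    @ext = ext
    @result = result
  end

  def matches?(filename, _candidates)
    filename.downcase.end_with?(@ext)
  end

  def call(_data)
    @result
  end
end

# Mirrors the shape of .call: skip symlinks, truncate the data, and let the
# first heuristic that matches the filename decide the result.
def dispatch(heuristics, blob, candidates)
  return [] if blob.symlink?

  data = blob.data[0...CONSIDER_BYTES]
  heuristics.each do |heuristic|
    # Array() turns nil into [] and a single result into a one-element Array.
    return Array(heuristic.call(data)) if heuristic.matches?(blob.name, candidates)
  end

  [] # no heuristic matched
end

heuristics = [FakeHeuristic.new('.h', nil), FakeHeuristic.new('.m', 'Objective-C')]
dispatch(heuristics, FakeBlob.new('Foo.M', '@interface Foo', false), []) # ["Objective-C"]
dispatch(heuristics, FakeBlob.new('foo.h', '', false), [])               # [] (matched, but inconclusive)
dispatch(heuristics, FakeBlob.new('foo.m', '', true), [])                # [] (symlink)
```

Only the first heuristic whose `matches?` returns true is consulted; later heuristics are never tried, even if the first one returns nothing.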

.load ⇒ Object

Internal: Load heuristics from ‘heuristics.yml’.



# File 'lib/linguist/heuristics.rb', line 46

def self.load()
  if @heuristics.any?
    return
  end

  data = self.load_config
  named_patterns = data['named_patterns'].map { |k,v| [k, self.to_regex(v)] }.to_h

  data['disambiguations'].each do |disambiguation|
    exts = disambiguation['extensions']
    rules = disambiguation['rules']
    rules.map! do |rule|
      rule['pattern'] = self.parse_rule(named_patterns, rule)
      rule
    end
    @heuristics << new(exts, rules)
  end
end
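The keys that `.load` reads suggest the overall shape of the configuration. The sketch below parses a small hand-written YAML document with that shape; the sample content, `SAMPLE_YAML`, and `load_sketch` are illustrative assumptions and are not taken from the real `heuristics.yml`.

```ruby
require 'yaml'

# Hypothetical configuration using only the keys that .load accesses.
SAMPLE_YAML = <<~'YAML'
  disambiguations:
    - extensions: ['.h']
      rules:
        - language: Objective-C
          named_pattern: objectivec
        - language: C
  named_patterns:
    objectivec: '^\s*@interface'
YAML

# Returns the compiled named patterns and one [extensions, rules] pair per
# disambiguation, which is roughly what .load passes on to new(exts, rules).
def load_sketch(yaml_text)
  data = YAML.load(yaml_text)
  named_patterns = data['named_patterns'].map { |k, v| [k, Regexp.new(v)] }.to_h
  pairs = data['disambiguations'].map { |d| [d['extensions'], d['rules']] }
  [named_patterns, pairs]
end

named_patterns, pairs = load_sketch(SAMPLE_YAML)
named_patterns['objectivec'].match?("@interface Foo") # true
pairs.first.first                                     # [".h"]
```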

.load_config ⇒ Object



# File 'lib/linguist/heuristics.rb', line 65

def self.load_config
  YAML.load_file(File.expand_path("../heuristics.yml", __FILE__))
end

.parse_rule(named_patterns, rule) ⇒ Object



# File 'lib/linguist/heuristics.rb', line 69

def self.parse_rule(named_patterns, rule)
  if !rule['and'].nil?
    rules = rule['and'].map { |block| self.parse_rule(named_patterns, block) }
    return And.new(rules)
  elsif !rule['pattern'].nil?
    return self.to_regex(rule['pattern'])
  elsif !rule['negative_pattern'].nil?
    pat = self.to_regex(rule['negative_pattern'])
    return NegativePattern.new(pat)
  elsif !rule['named_pattern'].nil?
    return named_patterns[rule['named_pattern']]
  else
    return AlwaysMatch.new()
  end
end
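To see how the branches of `.parse_rule` combine, here is a self-contained sketch. The `And`, `NegativePattern`, and `AlwaysMatch` classes are not shown on this page, so the versions below are minimal stand-ins inferred from their names and from the `match?` call in `#call`; the real classes may differ. The `Sketch` module and the sample patterns are hypothetical.

```ruby
module Sketch
  # Stand-in: matches only if every sub-pattern matches.
  class And
    def initialize(pats)
      @pats = pats
    end

    def match?(input)
      @pats.all? { |pat| pat.match?(input) }
    end
  end

  # Stand-in: matches only if the wrapped pattern does not.
  class NegativePattern
    def initialize(pat)
      @pat = pat
    end

    def match?(input)
      !@pat.match?(input)
    end
  end

  # Stand-in: matches any input; used for rules without a pattern key.
  class AlwaysMatch
    def match?(_input)
      true
    end
  end

  def self.to_regex(str)
    str.kind_of?(Array) ? Regexp.union(str.map { |s| Regexp.new(s) }) : Regexp.new(str)
  end

  # Same branch order as .parse_rule: and, pattern, negative_pattern,
  # named_pattern, then the always-matching fallback.
  def self.parse_rule(named_patterns, rule)
    if !rule['and'].nil?
      And.new(rule['and'].map { |block| parse_rule(named_patterns, block) })
    elsif !rule['pattern'].nil?
      to_regex(rule['pattern'])
    elsif !rule['negative_pattern'].nil?
      NegativePattern.new(to_regex(rule['negative_pattern']))
    elsif !rule['named_pattern'].nil?
      named_patterns[rule['named_pattern']]
    else
      AlwaysMatch.new
    end
  end
end

named = { 'objc' => Sketch.to_regex('^\s*@interface') }
rule = { 'and' => [{ 'named_pattern' => 'objc' },
                   { 'negative_pattern' => '^\s*#include <vector>' }] }
both = Sketch.parse_rule(named, rule)

both.match?("@interface Foo\n@end")               # true
both.match?("#include <vector>\n@interface Foo")  # false
Sketch.parse_rule(named, { 'language' => 'Text' }).match?("anything") # true
```

Because every branch returns an object that responds to `match?`, the caller can treat plain regular expressions and composite matchers the same way.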

.to_regex(str) ⇒ Object

Internal: Converts a string or array of strings to a Regexp.

str - A String or an Array of Strings. If it is an Array, the patterns are combined with Regexp.union.


# File 'lib/linguist/heuristics.rb', line 89

def self.to_regex(str)
  if str.kind_of?(Array)
    Regexp.union(str.map { |s| Regexp.new(s) })
  else
    Regexp.new(str)
  end
end
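A quick way to check both input forms is the stand-alone copy below. It reproduces the method body shown above as a top-level method; the sample patterns are made up for illustration.

```ruby
# Stand-alone copy of the conversion logic, for experimentation.
def to_regex(str)
  if str.kind_of?(Array)
    # Each string is compiled on its own, then the results are alternated.
    Regexp.union(str.map { |s| Regexp.new(s) })
  else
    Regexp.new(str)
  end
end

single = to_regex('^#include')
multi  = to_regex(['^\s*def\s', '^\s*class\s'])

single.match?("#include <stdio.h>") # true
multi.match?("class Foo\nend")      # true
multi.match?("int main() {}")       # false
```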

Instance Method Details

#call(data) ⇒ Object

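Internal: Perform the heuristic.

Returns the Language (or Array of Languages) of the first rule whose pattern matches data, or nil if no rule matches.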



# File 'lib/linguist/heuristics.rb', line 127

def call(data)
  matched = @rules.find do |rule|
    rule['pattern'].match?(data)
  end
  if !matched.nil?
    languages = matched['language']
    if languages.is_a?(Array)
      languages.map{ |l| Language[l] }
    else
      Language[languages]
    end
  end
end
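Rule order matters here because `find` stops at the first match. The sketch below shows that behaviour with plain language names in place of `Language` objects; `RULES` and `first_matching_language` are hypothetical and exist only for this illustration.

```ruby
# Rules in the shape produced by .load: 'pattern' already holds an object
# that responds to match?. Names stand in for Language objects.
RULES = [
  { 'language' => 'Objective-C', 'pattern' => /^\s*@interface/ },
  { 'language' => ['C', 'C++'],  'pattern' => /^\s*#include/ }
].freeze

# Returns the 'language' entry of the first matching rule, or nil.
def first_matching_language(rules, data)
  matched = rules.find { |rule| rule['pattern'].match?(data) }
  matched && matched['language']
end

first_matching_language(RULES, "#include <stdio.h>\n@interface Foo") # "Objective-C"
first_matching_language(RULES, "#include <stdio.h>")                 # ["C", "C++"]
first_matching_language(RULES, "plain text")                         # nil
```

In the first call both rules would match, but the earlier rule wins.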

#extensions ⇒ Object

Internal: Return the heuristic’s target extensions



# File 'lib/linguist/heuristics.rb', line 107

def extensions
  @exts
end

#languages ⇒ Object

Internal: Return the heuristic’s candidate languages



# File 'lib/linguist/heuristics.rb', line 112

def languages
  @rules.map do |rule|
    [rule['language']].flatten(2).map { |name| Language[name] }
  end.flatten.uniq
end
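Since a rule's `language` entry may be a single name or a list, the method wraps and flattens before deduplicating. A minimal sketch of that normalization, using plain names instead of `Language` lookups (`language_names` is a hypothetical helper):

```ruby
# Collects every language name mentioned by the rules, without duplicates.
def language_names(rules)
  rules.map { |rule| [rule['language']].flatten(2) }.flatten.uniq
end

rules = [
  { 'language' => 'C' },
  { 'language' => ['C', 'C++'] },
  { 'language' => 'Objective-C' }
]

language_names(rules) # ["C", "C++", "Objective-C"]
```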

#matches?(filename, candidates) ⇒ Boolean

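Internal: Check if this heuristic applies to the given filename. The filename is downcased and compared against the heuristic's extensions. As written, candidates is reduced to language names but does not affect the result.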

Returns:

  • (Boolean)


# File 'lib/linguist/heuristics.rb', line 120

def matches?(filename, candidates)
  filename = filename.downcase
  candidates = candidates.compact.map(&:name)
  @exts.any? { |ext| filename.end_with?(ext) }
end
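The extension check can be exercised on its own with the sketch below. `extension_match?` is a hypothetical helper that repeats the two relevant lines; it assumes the extensions are stored in lowercase, since only the filename is downcased.

```ruby
# True if the downcased filename ends with any of the given extensions.
def extension_match?(exts, filename)
  filename = filename.downcase
  exts.any? { |ext| filename.end_with?(ext) }
end

extension_match?(['.h', '.hh'], 'Widget.H') # true
extension_match?(['.h'], 'notes.md')        # false
extension_match?(['.H'], 'widget.h')        # false (extension is not lowercase)
```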