Class: Linguist::Heuristics
- Inherits:
- Object
- Defined in:
- lib/linguist/heuristics.rb
Overview
A collection of simple heuristics that can be used to better analyze languages.
Constant Summary
- HEURISTICS_CONSIDER_BYTES = 50 * 1024
Class Method Summary
- .all ⇒ Object
  Public: Get all heuristic definitions.
- .call(blob, candidates) ⇒ Object
  Public: Use heuristics to detect language of the blob.
- .load ⇒ Object
  Internal: Load heuristics from ‘heuristics.yml’.
- .load_config ⇒ Object
- .parse_rule(named_patterns, rule) ⇒ Object
- .to_regex(str) ⇒ Object
  Internal: Converts a string or array of strings to regexp.
Instance Method Summary
- #call(data) ⇒ Object
  Internal: Perform the heuristic.
- #extensions ⇒ Object
  Internal: Return the heuristic’s target extensions.
- #initialize(exts, rules) ⇒ Heuristics (constructor)
  Internal.
- #languages ⇒ Object
  Internal: Return the heuristic’s candidate languages.
- #matches?(filename, candidates) ⇒ Boolean
  Internal: Check if this heuristic matches the candidate filenames or languages.
Constructor Details
#initialize(exts, rules) ⇒ Heuristics
Internal
# File 'lib/linguist/heuristics.rb', line 101
def initialize(exts, rules)
  @exts = exts
  @rules = rules
end
Class Method Details
.all ⇒ Object
Public: Get all heuristic definitions.
Returns an Array of heuristic objects.
# File 'lib/linguist/heuristics.rb', line 40
def self.all
  self.load()
  @heuristics
end
.call(blob, candidates) ⇒ Object
Public: Use heuristics to detect language of the blob.
blob - An object that quacks like a blob.
candidates - Array of Language objects.
Examples
Heuristics.call(FileBlob.new("path/to/file"), [
Language["Ruby"], Language["Python"]
])
Returns an Array of languages, or empty if none matched or were inconclusive.
# File 'lib/linguist/heuristics.rb', line 20
def self.call(blob, candidates)
  return [] if blob.symlink?
  self.load()

  data = blob.data[0...HEURISTICS_CONSIDER_BYTES]
  @heuristics.each do |heuristic|
    if heuristic.matches?(blob.name, candidates)
      return Array(heuristic.call(data))
    end
  end

  [] # No heuristics matched
rescue Regexp::TimeoutError
  # Return nothing if we have a bad regexp which leads to a timeout
  # enforced by Regexp.timeout in Ruby 3.2 or later
  []
end
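The control flow of the listing above can be tried in isolation. The following is a minimal sketch, not Linguist's own code: `FakeBlob`, `ToyHeuristic`, `detect` and `CONSIDER_BYTES` are hypothetical stand-ins, and language names are plain strings instead of Language objects. It leaves out the `Regexp::TimeoutError` rescue so that it also runs on Ruby versions older than 3.2.

```ruby
# Hypothetical stand-ins; real blobs and heuristics come from Linguist.
CONSIDER_BYTES = 50 * 1024

FakeBlob = Struct.new(:name, :data, :symlink) do
  def symlink?
    symlink
  end
end

# A toy heuristic: applies to one extension and returns a fixed answer
# when its pattern matches the sampled data, otherwise nil.
ToyHeuristic = Struct.new(:ext, :pattern, :answer) do
  def matches?(filename, _candidates)
    filename.downcase.end_with?(ext)
  end

  def call(data)
    answer if pattern.match(data)
  end
end

def detect(blob, candidates, heuristics)
  # Symlinks are never inspected.
  return [] if blob.symlink?

  # Only the leading slice of the content is considered.
  data = blob.data[0...CONSIDER_BYTES]
  heuristics.each do |heuristic|
    # The first heuristic whose extensions apply decides the outcome,
    # even when none of its rules match (Array(nil) is []).
    return Array(heuristic.call(data)) if heuristic.matches?(blob.name, candidates)
  end
  []
end

heuristics = [ToyHeuristic.new(".h", /^\s*class\s/, "C++")]
header = FakeBlob.new("Widget.H", "class Widget {};\n", false)
link   = FakeBlob.new("widget.h", "class Widget {};\n", true)
plain  = FakeBlob.new("widget.h", "int x;\n", false)

detect(header, [], heuristics)
detect(link, [], heuristics)
detect(plain, [], heuristics)
```

In this sketch only the first call yields a language; the symlink and the file without a matching rule both produce an empty Array, which corresponds to the "none matched or were inconclusive" case described above.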
.load ⇒ Object
Internal: Load heuristics from ‘heuristics.yml’.
# File 'lib/linguist/heuristics.rb', line 46
def self.load()
  if @heuristics.any?
    return
  end

  data = self.load_config
  named_patterns = data['named_patterns'].map { |k,v| [k, self.to_regex(v)] }.to_h

  data['disambiguations'].each do |disambiguation|
    exts = disambiguation['extensions']
    rules = disambiguation['rules']
    rules.map! do |rule|
      rule['pattern'] = self.parse_rule(named_patterns, rule)
      rule
    end
    @heuristics << new(exts, rules)
  end
end
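To make the expected configuration shape concrete, here is a small sketch that parses an inline YAML document with the keys the listing reads (`named_patterns`, `disambiguations`, `extensions`, `rules`). The entries themselves are made up for illustration and are not copied from ‘heuristics.yml’; `SAMPLE` is a hypothetical name.

```ruby
require 'yaml'

# A made-up configuration in the shape that .load reads.
SAMPLE = <<~'YAML'
  disambiguations:
    - extensions: ['.h']
      rules:
        - language: Objective-C
          named_pattern: objectivec
        - language: C
  named_patterns:
    objectivec: '^\s*@interface\b'
YAML

data = YAML.safe_load(SAMPLE)

# Compile every named pattern once, keyed by its name.
named_patterns = data['named_patterns'].map { |k, v| [k, Regexp.new(v)] }.to_h

first = data['disambiguations'].first
first['extensions']            # the extensions this entry applies to
first['rules'].length          # number of rules, tried in order
named_patterns['objectivec']   # a compiled Regexp
```

The second rule in the sample carries no pattern key at all, which is the case the final branch of `.parse_rule` handles.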
.load_config ⇒ Object
# File 'lib/linguist/heuristics.rb', line 65
def self.load_config
  YAML.load_file(File.expand_path("../heuristics.yml", __FILE__))
end
.parse_rule(named_patterns, rule) ⇒ Object
# File 'lib/linguist/heuristics.rb', line 69
def self.parse_rule(named_patterns, rule)
  if !rule['and'].nil?
    rules = rule['and'].map { |block| self.parse_rule(named_patterns, block) }
    return And.new(rules)
  elsif !rule['pattern'].nil?
    return self.to_regex(rule['pattern'])
  elsif !rule['negative_pattern'].nil?
    pat = self.to_regex(rule['negative_pattern'])
    return NegativePattern.new(pat)
  elsif !rule['named_pattern'].nil?
    return named_patterns[rule['named_pattern']]
  else
    return AlwaysMatch.new()
  end
end
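The branch order above determines which kind of matcher a rule becomes. The sketch below reproduces that order outside the class. The `And`, `NegativePattern` and `AlwaysMatch` classes are not shown on this page, so the versions here are simplified assumptions that only share the `match` interface; the sample patterns are invented.

```ruby
# Simplified stand-ins for the matcher classes, assumed for illustration.
class And
  def initialize(patterns)
    @patterns = patterns
  end

  # Matches only when every nested matcher matches.
  def match(input)
    @patterns.all? { |pattern| pattern.match(input) }
  end
end

class NegativePattern
  def initialize(pattern)
    @pattern = pattern
  end

  # Matches when the wrapped pattern does NOT match.
  def match(input)
    !@pattern.match(input)
  end
end

class AlwaysMatch
  def match(_input)
    true
  end
end

# Same branch order as the listing: and, pattern, negative_pattern,
# named_pattern, then the catch-all.
def parse_rule(named_patterns, rule)
  if !rule['and'].nil?
    And.new(rule['and'].map { |block| parse_rule(named_patterns, block) })
  elsif !rule['pattern'].nil?
    Regexp.new(rule['pattern'])
  elsif !rule['negative_pattern'].nil?
    NegativePattern.new(Regexp.new(rule['negative_pattern']))
  elsif !rule['named_pattern'].nil?
    named_patterns[rule['named_pattern']]
  else
    AlwaysMatch.new
  end
end

named = { 'cpp' => /^\s*template\s*</ }
combined = parse_rule(named, {
  'and' => [
    { 'named_pattern' => 'cpp' },
    { 'negative_pattern' => '^\s*@interface' }
  ]
})
fallback = parse_rule(named, { 'language' => 'C' })
```

A rule with none of the four pattern keys, such as `fallback` here, matches any input, which lets the last rule of a disambiguation act as a default.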
.to_regex(str) ⇒ Object
Internal: Converts a string or array of strings to regexp.
str: string or array of strings. If it is an array of strings,
Regexp.union will be used.
# File 'lib/linguist/heuristics.rb', line 89
def self.to_regex(str)
  if str.kind_of?(Array)
    Regexp.union(str.map { |s| Regexp.new(s) })
  else
    Regexp.new(str)
  end
end
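As a quick illustration of the two input forms, the conversion can be copied into a free-standing method. This is a sketch for experimentation only; the sample patterns are invented.

```ruby
# Standalone copy of the conversion logic, for illustration.
def to_regex(str)
  if str.kind_of?(Array)
    # Each string is compiled first, then the alternatives are joined.
    Regexp.union(str.map { |s| Regexp.new(s) })
  else
    Regexp.new(str)
  end
end

single = to_regex('^#include')
multi  = to_regex(['^\s*def\s', '^\s*class\s'])

single.match?("#include <stdio.h>")
multi.match?("class Foo")
multi.match?("int main")
```

With an array, the result matches input that satisfies any one of the listed patterns, so a list in the configuration behaves like an alternation.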
Instance Method Details
#call(data) ⇒ Object
Internal: Perform the heuristic.
# File 'lib/linguist/heuristics.rb', line 127
def call(data)
  matched = @rules.find do |rule|
    rule['pattern'].match(data)
  end
  if !matched.nil?
    languages = matched['language']
    if languages.is_a?(Array)
      languages.map{ |l| Language[l] }
    else
      Language[languages]
    end
  end
end
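The method above picks the first rule whose pattern matches, so rule order matters more than where a match occurs in the data. The following sketch shows that selection with invented rules; it returns language names as strings rather than Language objects so that it runs without Linguist, and `RULES` and `first_match` are hypothetical names.

```ruby
# Rules in the shape produced by .load: a compiled 'pattern' plus a
# 'language' that is either one name or a list of names (made-up data).
RULES = [
  { 'pattern' => /^\s*@interface\b/, 'language' => 'Objective-C' },
  { 'pattern' => /^\s*template\s*</, 'language' => 'C++' },
  { 'pattern' => /^\s*#include\b/,   'language' => ['C', 'C++'] }
].freeze

# Returns the 'language' entry of the first matching rule, or nil.
def first_match(rules, data)
  matched = rules.find { |rule| rule['pattern'].match(data) }
  return nil if matched.nil?

  matched['language']
end

first_match(RULES, "@interface Foo\n")
first_match(RULES, "#include <stdio.h>\n")
first_match(RULES, "#include <x>\ntemplate <typename T>\n")
first_match(RULES, "plain text")
```

In the third call the `template` rule wins even though `#include` appears earlier in the data, because rules are tried in their listed order.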
#extensions ⇒ Object
Internal: Return the heuristic’s target extensions.
# File 'lib/linguist/heuristics.rb', line 107
def extensions
  @exts
end
#languages ⇒ Object
Internal: Return the heuristic’s candidate languages.
# File 'lib/linguist/heuristics.rb', line 112
def languages
  @rules.map do |rule|
    [rule['language']].flatten(2).map { |name| Language[name] }
  end.flatten.uniq
end
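Wrapping each `language` entry in an array and flattening lets single names and lists of names be handled the same way. A minimal sketch of that normalization, using strings in place of Language objects and a hypothetical `language_names` helper:

```ruby
# Collects every language name mentioned by the rules, without duplicates.
def language_names(rules)
  rules.map { |rule| [rule['language']].flatten }.flatten.uniq
end

rules = [
  { 'language' => 'C' },
  { 'language' => ['C', 'C++'] },
  { 'language' => 'Objective-C' }
]

language_names(rules)
```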
#matches?(filename, candidates) ⇒ Boolean
Internal: Check if this heuristic matches the candidate filenames or languages.
# File 'lib/linguist/heuristics.rb', line 120
def matches?(filename, candidates)
  filename = filename.downcase
  candidates = candidates.compact.map(&:name)
  @exts.any? { |ext| filename.end_with?(ext) }
end
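In the listing as shown, the return value depends only on the file name: `candidates` is normalized to names, but the final expression checks extensions alone. The sketch below isolates that check in a hypothetical `extension_match?` helper.

```ruby
# Case-insensitive suffix test against a list of extensions.
def extension_match?(exts, filename)
  filename = filename.downcase
  exts.any? { |ext| filename.end_with?(ext) }
end

extension_match?(['.h'], 'Widget.H')
extension_match?(['.h'], 'widget.hpp')
extension_match?(['.pl', '.pm'], 'lib/Foo.pm')
```

Because the file name is lowercased before the comparison, this assumes the configured extensions are themselves lowercase.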