Module: FormatParser
- Defined in:
- lib/hash_utils.rb,
lib/text.rb,
lib/audio.rb,
lib/image.rb,
lib/video.rb,
lib/archive.rb,
lib/document.rb,
lib/format_parser.rb,
lib/format_parser/version.rb,
lib/active_storage/blob_io.rb,
lib/active_storage/blob_analyzer.rb,
lib/parsers/iso_base_media_file_format/box.rb,
lib/parsers/iso_base_media_file_format/utils.rb,
lib/parsers/iso_base_media_file_format/decoder.rb
Overview
An analyzer class that can be hooked into ActiveStorage so that FormatParser performs the blob analysis instead of ActiveStorage's built-in analyzers. It is invoked only if it has been integrated in a Rails initializer.
Defined Under Namespace
Modules: ActiveStorage, AttributesJSON, EXIFParser, IOUtils, ISOBaseMediaFileFormat
Classes: AACParser, AIFFParser, ARWParser, AdtsHeaderInfo, Archive, Audio, BMPParser, CR2Parser, CR3Parser, DPXParser, Document, FDXParser, FLACParser, GIFParser, HEIFParser, HashUtils, IOConstraint, Image, JPEGParser, JSONParser, M3UParser, MOVParser, MP3Parser, MP4Parser, MPEGParser, NEFParser, OggParser, PDFParser, PNGParser, PSDParser, RW2Parser, ReadLimiter, ReadLimitsConfig, RemoteIO, TIFFParser, Text, UTF8Reader, Video, WAVParser, WebpParser, ZIPParser
Constant Summary
- PARSER_MUX =
Is used to manage access to the shared array of parser constructors, which might potentially be mutated from different threads. The mutex won’t be hit too often since it only locks when adding/removing parsers.
Mutex.new
- MAX_BYTES_READ_PER_PARSER =
1024 * 1024 * 2
- LEAST_PRIORITY =
The value will ensure the parser having it will be applied to the file last.
99
- VERSION =
'2.10.0'
Class Method Summary
-
.default_limits_config ⇒ ReadLimitsConfig
We need to apply various limits so that parsers do not over-read, do not cause too many HTTP requests to be dispatched and so on.
-
.deregister_parser(callable_parser) ⇒ Object
Deregister a parser object (makes FormatParser forget this parser existed).
- .execute_parser_and_capture_expected_exceptions(parser, limited_io) ⇒ Object
-
.parse(io, natures: @parsers_per_nature.keys, formats: @parsers_per_format.keys, results: :first, limits_config: default_limits_config, filename_hint: nil) ⇒ Array<Result>, ...
Parses the resource contained in the given IO-ish object, and returns either the first matched result (omitting all the other parsers), the first N results or all results.
-
.parse_file_at(path, **kwargs) ⇒ Object
Parses the file at the given `path` and returns the results as if it were any IO given to `.parse`.
-
.parse_http(url, headers: {}, **kwargs) ⇒ Object
Parses the resource at the given `url` and returns the results as if it were any IO given to `.parse`.
-
.parsers_for(desired_natures, desired_formats, filename_hint = nil) ⇒ Array<#call>
Returns objects that respond to `call` and can be called to perform parsing based on the intersection of the two given nature/format constraints.
-
.register_parser(callable_parser, formats:, natures:, priority: LEAST_PRIORITY) ⇒ Object
Register a parser object to be used to perform file format detection.
- .registered_formats ⇒ Object
- .registered_natures ⇒ Object
- .string_to_lossy_utf8(str) ⇒ Object
Class Method Details
.default_limits_config ⇒ ReadLimitsConfig
We need to apply various limits so that parsers do not over-read, do not cause too many HTTP requests to be dispatched and so on. These limits should be balanced with one another. For example, we cannot tell a parser that it is limited to reading 1024 bytes while at the same time limiting the size of the cache pages it may slurp in to less than that amount, since the two limits would then work against each other. The limits configurator computes these limits for us, in a fairly balanced way, based on one setting.
This method returns a ReadLimitsConfig object preset from the `MAX_BYTES_READ_PER_PARSER` default.
    # File 'lib/format_parser.rb', line 214

    def self.default_limits_config
      FormatParser::ReadLimitsConfig.new(MAX_BYTES_READ_PER_PARSER)
    end
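To illustrate the idea of deriving several balanced limits from one setting, here is a minimal sketch. `SketchLimits` is a hypothetical stand-in, not the real `ReadLimitsConfig`; the accessor names are borrowed from the calls made in `.parse`, and the formulas are assumptions chosen only so that the page size never exceeds the byte budget.

```ruby
# Hypothetical stand-in for ReadLimitsConfig. The accessor names mirror the
# ones `.parse` calls; the formulas are illustrative assumptions only.
class SketchLimits
  attr_reader :max_read_bytes_per_parser

  def initialize(total_bytes_available_per_parser)
    @max_read_bytes_per_parser = Integer(total_bytes_available_per_parser)
  end

  # Assumed: a cache page is a fixed fraction of the byte budget,
  # so a page can never be larger than what a parser may read.
  def cache_page_size
    max_read_bytes_per_parser / 4
  end

  # Assumed: allow enough page fetches to cover the whole byte budget.
  def max_pagefaults_per_parser
    (max_read_bytes_per_parser / cache_page_size.to_f).ceil
  end
end

limits = SketchLimits.new(1024 * 1024 * 2)
puts limits.cache_page_size
puts limits.max_pagefaults_per_parser
```

Because every limit is computed from the single constructor argument, changing that one number rescales all of them together.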
.deregister_parser(callable_parser) ⇒ Object
Deregister a parser object (makes FormatParser forget this parser existed). Is mostly used in tests, but can also be used to forcibly disable some formats completely.
    # File 'lib/format_parser.rb', line 93

    def self.deregister_parser(callable_parser)
      # Used only in tests
      PARSER_MUX.synchronize do
        (@parsers || []).delete(callable_parser)
        (@parsers_per_nature || {}).values.map { |e| e.delete(callable_parser) }
        (@parsers_per_format || {}).values.map { |e| e.delete(callable_parser) }
        (@parser_priorities || {}).delete(callable_parser)
      end
    end
.execute_parser_and_capture_expected_exceptions(parser, limited_io) ⇒ Object
    # File 'lib/format_parser.rb', line 218

    def self.execute_parser_and_capture_expected_exceptions(parser, limited_io)
      parser_name_for_instrumentation = parser.class.to_s.split('::').last.underscore
      Measurometer.instrument('format_parser.parser.%s' % parser_name_for_instrumentation) do
        parser.call(limited_io).tap do |result|
          if result
            Measurometer.increment_counter('format_parser.detected_natures', 1, nature: result.nature)
            Measurometer.increment_counter('format_parser.detected_formats', 1, format: result.format)
          end
        end
      end
    rescue IOUtils::InvalidRead
      # There was not enough data for this parser to work on,
      # and it triggered an error
      Measurometer.increment_counter('format_parser.invalid_read_errors', 1)
    rescue IOUtils::MalformedFile
      # Unexpected input was encountered during the parsing of
      # a file. This might indicate either a malicious or a
      # corrupted file.
      Measurometer.increment_counter('format_parser.malformed_errors', 1)
    rescue ReadLimiter::BudgetExceeded
      # The parser tried to read too much - most likely the file structure
      # caused the parser to go off-track. Strictly speaking we should log this
      # and examine the file more closely.
      # Or the parser caused too many cache pages to be fetched, which likely means we should not allow
      # it to continue
      Measurometer.increment_counter('format_parser.exceeded_budget_errors', 1)
    ensure
      limited_io.send_metrics(parser_name_for_instrumentation)
    end
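The pattern in the method above (run one parser, count the failures that are expected, let everything else propagate) can be sketched in isolation. The error classes, the `ERROR_COUNTS` hash and `run_parser_safely` below are hypothetical names for illustration; the sketch uses a plain hash instead of Measurometer and returns `nil` explicitly on an expected failure.

```ruby
require 'stringio'

# Hypothetical error classes standing in for IOUtils::InvalidRead and
# IOUtils::MalformedFile.
class SketchInvalidRead < StandardError; end
class SketchMalformedFile < StandardError; end

# Hypothetical counter store, used here in place of Measurometer.
ERROR_COUNTS = Hash.new(0)

# Runs one parser. Expected failures are counted and turned into nil,
# any other exception is allowed to propagate to the caller.
def run_parser_safely(parser, io)
  parser.call(io)
rescue SketchInvalidRead
  ERROR_COUNTS[:invalid_read] += 1
  nil
rescue SketchMalformedFile
  ERROR_COUNTS[:malformed] += 1
  nil
end

working = ->(io) { { format: :demo, first_byte: io.read(1) } }
starved = ->(_io) { raise SketchInvalidRead, 'not enough data' }

good_result = run_parser_safely(working, StringIO.new('abc'))
bad_result = run_parser_safely(starved, StringIO.new(''))
```

Returning `nil` for an expected failure lets the caller treat "parser did not match" and "parser gave up" the same way.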
.parse(io, natures: @parsers_per_nature.keys, formats: @parsers_per_format.keys, results: :first, limits_config: default_limits_config, filename_hint: nil) ⇒ Array<Result>, ...
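Parses the resource contained in the given IO-ish object, and returns either the first matched result (omitting all the other parsers), the first N results or all results. In the source shown here, `results:` accepts `:first` (a single result, or `nil` when nothing matches) and `:all` (an array of results); any other value raises an `ArgumentError`.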
    # File 'lib/format_parser.rb', line 151

    def self.parse(io, natures: @parsers_per_nature.keys, formats: @parsers_per_format.keys, results: :first, limits_config: default_limits_config, filename_hint: nil)
      # Limit the number of cached _pages_ we may fetch. This allows us to limit the number
      # of page faults (page cache misses) a parser may incur
      read_limiter_under_cache = FormatParser::ReadLimiter.new(io, max_reads: limits_config.max_pagefaults_per_parser)

      # Then configure a layer of caching on top of that
      cached_io = Care::IOWrapper.new(read_limiter_under_cache, page_size: limits_config.cache_page_size)

      # How many results has the user asked for? Used to determine whether an array
      # is returned or not.
      amount = case results
               when :all
                 @parsers.count
               when :first
                 1
               else
                 raise ArgumentError, ':results does not match any supported mode (:all, :first)'
               end

      # Always instantiate parsers fresh for each input, since they might
      # contain instance variables which otherwise would have to be reset
      # between invocations, and would complicate threading situations
      parsers = parsers_for(natures, formats, filename_hint)

      # Limit how many operations the parser can perform
      limited_io = ReadLimiter.new(
        cached_io,
        max_bytes: limits_config.max_read_bytes_per_parser,
        max_reads: limits_config.max_reads_per_parser,
        max_seeks: limits_config.max_seeks_per_parser
      )

      results = parsers.lazy.map do |parser|
        # Reset all the read limits, per parser
        limited_io.reset_limits!
        read_limiter_under_cache.reset_limits!

        # We need to rewind for each parser, anew
        limited_io.seek(0)
        execute_parser_and_capture_expected_exceptions(parser, limited_io)
      end.reject(&:nil?).take(amount)

      # Convert the results from a lazy enumerator to an Array.
      results = results.to_a

      Measurometer.increment_counter('format_parser.unknown_files', 1) if results.empty?

      amount == 1 ? results.first : results
    ensure
      cached_io.clear if cached_io
    end
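The `lazy.map ... reject ... take(amount)` chain is what lets `.parse` stop as soon as enough parsers have matched. A small sketch with stand-in lambdas (hypothetical, not real FormatParser parsers) shows which parsers actually run in each mode:

```ruby
# Stand-in parsers: each records that it ran and returns a result hash or nil.
calls = []
parsers = [
  ->(_io) { calls << :png; nil },
  ->(_io) { calls << :jpeg; { format: :jpg } },
  ->(_io) { calls << :zip; { format: :zip } }
]

# Same shape as the pipeline in `.parse`: map lazily, drop nils, take `amount`.
run = ->(amount) { parsers.lazy.map { |parser| parser.call(nil) }.reject(&:nil?).take(amount).to_a }

# Equivalent of `results: :first`: evaluation stops at the first match.
first_only = run.call(1)
calls_for_first = calls.dup

# Equivalent of `results: :all`: every parser gets a turn.
calls.clear
all_results = run.call(parsers.count)
calls_for_all = calls.dup

puts calls_for_first.inspect
puts calls_for_all.inspect
```

In the first run the third lambda is never invoked, which is the behavior the summary describes as "omitting all the other parsers".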
.parse_file_at(path, **kwargs) ⇒ Object
Parses the file at the given `path` and returns the results as if it were any IO given to `.parse`. The accepted keyword arguments are the same as the ones for `parse`. The file path will be used to provide the `filename_hint` to `.parse()`.
    # File 'lib/format_parser.rb', line 124

    def self.parse_file_at(path, **kwargs)
      File.open(path, 'rb') do |io|
        parse(io, filename_hint: File.basename(path), **kwargs)
      end
    end
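As a rough illustration of what the wrapper contributes, the sketch below repeats its shape with a hypothetical `sketch_parse` in place of `.parse`. The stand-in simply reports what it received, so the binary-mode handle and the derived filename hint are visible.

```ruby
require 'tempfile'

# Hypothetical stand-in for `.parse`: reports what it was given.
def sketch_parse(io, filename_hint: nil)
  { magic: io.read(4), filename_hint: filename_hint, binmode: io.binmode? }
end

# Same shape as `parse_file_at`: open in binary mode, pass the basename as hint.
def sketch_parse_file_at(path, **kwargs)
  File.open(path, 'rb') do |io|
    sketch_parse(io, filename_hint: File.basename(path), **kwargs)
  end
end

result = nil
Tempfile.create(['sketch', '.png']) do |file|
  file.binmode
  file.write("\x89PNG\r\n".b)
  file.flush
  result = sketch_parse_file_at(file.path)
end

puts result[:filename_hint]
```

Using the block form of `File.open` means the handle is closed even if parsing raises.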
.parse_http(url, headers: {}, **kwargs) ⇒ Object
Parses the resource at the given `url` and returns the results as if it were any IO given to `.parse`. The accepted keyword arguments are the same as the ones for `parse`.
    # File 'lib/format_parser.rb', line 110

    def self.parse_http(url, headers: {}, **kwargs)
      # Do not extract the filename, since the URL
      # can really be "anything". But if the caller
      # provides filename_hint it will be carried over
      parse(RemoteIO.new(url, headers: headers), **kwargs)
    end
.parsers_for(desired_natures, desired_formats, filename_hint = nil) ⇒ Array<#call>
Returns objects that respond to `call` and can be called to perform parsing based on the intersection of the two given nature/format constraints. For example, a constraint of "only image and only ZIP files" can be given, but would raise an error since no parsers provide both ZIP file parsing and images as their information.
    # File 'lib/format_parser.rb', line 260

    def self.parsers_for(desired_natures, desired_formats, filename_hint = nil)
      assemble_parser_set = ->(hash_of_sets, keys_of_interest) {
        hash_of_sets.values_at(*keys_of_interest).compact.inject(&:+) || Set.new
      }

      fitting_by_natures = assemble_parser_set[@parsers_per_nature, desired_natures]
      fitting_by_formats = assemble_parser_set[@parsers_per_format, desired_formats]
      parsers = fitting_by_natures & fitting_by_formats

      raise ArgumentError, "No parsers provide both natures #{desired_natures.inspect} and formats #{desired_formats.inspect}" if parsers.empty?

      # Order the parsers according to their priority value. The ones having a lower
      # value will sort higher and will be applied sooner
      parsers_in_order_of_priority = parsers.to_a.sort do |parser_a, parser_b|
        if @parser_priorities[parser_a] != @parser_priorities[parser_b]
          @parser_priorities[parser_a] <=> @parser_priorities[parser_b]
        else
          # Some parsers have the same priority and we want them to be always sorted
          # in the same way, to not change the result of FormatParser.parse(results: :first).
          # When this changes, it can generate flaky tests or even different
          # results in different environments, which can be hard to understand why.
          # There is also no guarantee in the order that the elements are added in
          # @@parser_priorities
          # So, to have always the same order, we sort by the order that the parsers
          # were registered if the priorities are the same.
          @parsers.index(parser_a) <=> @parsers.index(parser_b)
        end
      end

      # If there is one parser that is more likely to match, place it first
      if first_match = parsers_in_order_of_priority.find { |f| f.respond_to?(:likely_match?) && f.likely_match?(filename_hint) }
        parsers_in_order_of_priority.delete(first_match)
        parsers_in_order_of_priority.unshift(first_match)
      end

      parsers_in_order_of_priority
    end
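The selection logic above has two steps: intersect the parsers that fit the natures with the parsers that fit the formats, then order by priority with registration order as the tie-break. The sketch below replays both steps on made-up data, with symbols standing in for parser objects. It uses `sort_by` with a composite key rather than the explicit comparator block; the resulting order is the same for this data.

```ruby
require 'set'

# Made-up registry contents; symbols stand in for parser objects.
per_nature = { image: Set[:png, :jpeg, :tiff], archive: Set[:zip] }
per_format = { png: Set[:png], jpg: Set[:jpeg], tif: Set[:tiff], zip: Set[:zip] }
priorities = { png: 1, jpeg: 0, tiff: 1, zip: 2 }
registration_order = [:tiff, :png, :jpeg, :zip]

# Union of the sets stored under the requested keys (empty set if none match).
assemble = ->(hash_of_sets, keys) { hash_of_sets.values_at(*keys).compact.inject(&:+) || Set.new }

# Step 1: intersection of "fits the natures" and "fits the formats".
candidates = assemble[per_nature, [:image]] & assemble[per_format, [:png, :jpg, :tif]]

# Step 2: lower priority value first, registration order breaks ties.
ordered = candidates.to_a.sort_by { |parser| [priorities[parser], registration_order.index(parser)] }

# "Only image and only ZIP" leaves nothing; `.parsers_for` raises in this case.
impossible = assemble[per_nature, [:image]] & assemble[per_format, [:zip]]

puts ordered.inspect
puts impossible.empty?
```

Here `:tiff` sorts before `:png` although both have priority 1, because it was registered earlier.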
.register_parser(callable_parser, formats:, natures:, priority: LEAST_PRIORITY) ⇒ Object
Register a parser object to be used to perform file format detection. Each parser FormatParser provides out of the box registers itself using this method.
    # File 'lib/format_parser.rb', line 54

    def self.register_parser(callable_parser, formats:, natures:, priority: LEAST_PRIORITY)
      parser_provided_formats = Array(formats)
      parser_provided_natures = Array(natures)
      PARSER_MUX.synchronize do
        # It can't be a Set because the method `parsers_for` depends on the order
        # that the parsers were added.
        @parsers ||= []
        @parsers << callable_parser unless @parsers.include?(callable_parser)
        @parsers_per_nature ||= {}
        parser_provided_natures.each do |provided_nature|
          @parsers_per_nature[provided_nature] ||= Set.new
          @parsers_per_nature[provided_nature] << callable_parser
        end
        @parsers_per_format ||= {}
        parser_provided_formats.each do |provided_format|
          @parsers_per_format[provided_format] ||= Set.new
          @parsers_per_format[provided_format] << callable_parser
        end
        @parser_priorities ||= {}
        @parser_priorities[callable_parser] = priority
        @registered_natures |= parser_provided_natures
        @registered_formats |= parser_provided_formats
      end
    end
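For a sense of what a registrable object might look like, here is a sketch of a callable parser. `SketchGIFDetector` and `SketchResult` are hypothetical and are not part of the gem. The only contract taken from this page is that the object responds to `call`, may optionally respond to `likely_match?`, and returns something with `nature` and `format` (or `nil` when it does not match).

```ruby
require 'stringio'

# Hypothetical result type; real parsers return FormatParser's own result objects.
SketchResult = Struct.new(:nature, :format, keyword_init: true)

# Hypothetical parser: any object responding to `call(io)` qualifies.
class SketchGIFDetector
  MAGIC = %w[GIF87a GIF89a].freeze

  # Optional hook that `.parsers_for` consults; the hint may be nil.
  def likely_match?(filename)
    filename.to_s.downcase.end_with?('.gif')
  end

  # Returns a result on a match, or nil so that the next parser gets a turn.
  def call(io)
    header = io.read(6)
    return unless MAGIC.include?(header)

    SketchResult.new(nature: :image, format: :gif)
  end
end

detector = SketchGIFDetector.new
match = detector.call(StringIO.new('GIF89a' + "\x01\x00"))
miss = detector.call(StringIO.new('PNG'))

# With the gem loaded, registration would follow the signature above:
#   FormatParser.register_parser(detector, formats: :gif, natures: :image)
```

Leaving `priority:` out would give the parser `LEAST_PRIORITY`, so it would be tried after parsers registered with a lower value.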
.registered_formats ⇒ Object
    # File 'lib/format_parser.rb', line 84

    def self.registered_formats
      @registered_formats
    end
.registered_natures ⇒ Object
    # File 'lib/format_parser.rb', line 80

    def self.registered_natures
      @registered_natures
    end
.string_to_lossy_utf8(str) ⇒ Object
    # File 'lib/format_parser.rb', line 298

    def self.string_to_lossy_utf8(str)
      replacement_char = [0xFFFD].pack('U')
      str.encode(Encoding::UTF_8, undef: :replace, replace: replacement_char)
    end
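This method has no description, so the following sketch shows what the same `encode` call does to a binary-tagged string. `sketch_lossy_utf8` is a copy of the body above under a hypothetical name, so it runs without the gem. The input is an assumed example: Latin-1 bytes read as binary data.

```ruby
# Same conversion as the method above, under a hypothetical name.
def sketch_lossy_utf8(str)
  replacement_char = [0xFFFD].pack('U')
  str.encode(Encoding::UTF_8, undef: :replace, replace: replacement_char)
end

# ASCII-8BIT input; the byte 0xE9 has no defined mapping to UTF-8.
latin1_bytes = "caf\xE9".b
converted = sketch_lossy_utf8(latin1_bytes)

puts converted.encoding
puts converted.valid_encoding?
```

Bytes without a UTF-8 mapping are replaced with U+FFFD rather than raising `Encoding::UndefinedConversionError`, which is why the conversion is called lossy.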