Class: Preprocessor
- Inherits: Object
- Defined in: lib/logstash/filters/preprocessor.rb
Overview
The Preprocessor class is designed for processing and masking log events. This class provides functionality to parse, anonymize, and sanitize log data, ensuring sensitive information is masked before further processing or storage.
Key Features:
- Initialization with dictionaries and regex patterns for custom preprocessing.
- Support for custom log formats using a flexible regex generator method.
Usage: The class is initialized with a gram dictionary for accumulating log event tokens, a log format for parsing log events, a content specifier naming the capture group that holds the log message, and a set of regexes for custom masking tailored to specific log files. Once initialized, it can generate regex patterns based on the provided log format and mask sensitive information in log events, replacing it with a generic mask string.
Methods:
- initialize(gram_dict, logformat, content_specifier, regexes): Sets up the preprocessing environment with the necessary dictionaries and formats.
- regex_generator(logformat): Generates a regular expression based on a specified log format, useful for parsing logs with known structures.
- token_splitter(log_line): Splits a log line into tokens.
- preprocess_known_dynamic_tokens(log_line, regexes): Masks known dynamic tokens before tokenization.
- process_log_event(log_event, dynamic_token_threshold, parse): Processes an entire log event by calling Parser.parse() and uploads its tokens to the single-gram, bigram and trigram dictionaries via the gram dictionary's upload_grams method.
Example:
preprocessor = Preprocessor.new(gram_dict, logformat, content_specifier, regexes)
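A slightly fuller, hypothetical sketch of typical usage. The GramDict construction, the format string, and the regexes below are illustrative assumptions, not values shipped with the plugin:

# Assumed setup: GramDict is taken to be the companion dictionary class and is
# assumed here to be constructible without arguments.
gram_dict = GramDict.new
logformat = '<date> <time> <content>'   # example format (assumption)
regexes   = [/user=\w+/]                # log-file-specific masking patterns (assumption)

preprocessor = Preprocessor.new(gram_dict, logformat, 'content', regexes)
template, dynamic_tokens = preprocessor.process_log_event(
  '2024-01-01 12:00:00 user=alice logged in from 10.0.0.1', 0.5, true
)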
This class is essential for log management systems where data privacy and security are paramount.
Instance Method Summary
- #initialize(gram_dict, logformat, content_specifier, regexes) ⇒ Preprocessor (constructor)
  A new instance of Preprocessor.
- #preprocess_known_dynamic_tokens(log_line, regexes) ⇒ Object
  Processes a log line to replace known dynamic tokens using the passed-in regexes and the general regexes.
- #process_log_event(log_event, dynamic_token_threshold, parse) ⇒ Object
  Processes a given log event by tokenizing it, parsing it, and updating the gram dictionary.
- #regex_generator(logformat) ⇒ Object
  Generates a regular expression based on a specified log format.
- #token_splitter(log_line) ⇒ Object
  Splits a log line into tokens based on a given format and regular expression.
Constructor Details
#initialize(gram_dict, logformat, content_specifier, regexes) ⇒ Preprocessor
Returns a new instance of Preprocessor.
# File 'lib/logstash/filters/preprocessor.rb', line 33

def initialize(gram_dict, logformat, content_specifier, regexes)
  # gram_dict for uploading log event tokens
  @gram_dict = gram_dict

  # Regex for specific log event format
  @format = regex_generator(logformat)

  # This is the content specifier in the @format regex
  @content_specifier = content_specifier

  @general_regex = [
    /([\w-]+\.)+[\w-]+(:\d+)/,                # url
    %r{/?([0-9]+\.){3}[0-9]+(:[0-9]+)?(:|)},  # IP
    /(?<=\W)(-?\+?\d+)(?=\W)|[0-9]+$/         # Numbers
  ]

  @regexes = regexes
end
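For orientation, the three general regexes above mask host:port tokens, IPv4 addresses (optionally with a port), and bare numbers respectively. A quick, standalone illustration of the first two (the input strings are made up):

url_regex = /([\w-]+\.)+[\w-]+(:\d+)/
ip_regex  = %r{/?([0-9]+\.){3}[0-9]+(:[0-9]+)?(:|)}

'connect to example.com:443'.gsub(url_regex, '<*>')  # => "connect to <*>"
' from 10.0.0.1 failed'.gsub(ip_regex, '<*>')        # => " from <*> failed"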
Instance Method Details
#preprocess_known_dynamic_tokens(log_line, regexes) ⇒ Object
Processes a log line to replace known dynamic tokens using the passed-in regexes and the general regexes.
Parameters:
- log_line [String] the log line to be processed
- regexes [Array<Regexp>] the log-file-specific regexes used for masking
Returns:
- Array
  a copy of the log line in which the known dynamic tokens have been replaced with '<*>', together with a hash mapping each replacement key to the original token
# File 'lib/logstash/filters/preprocessor.rb', line 93

def preprocess_known_dynamic_tokens(log_line, regexes)
  preprocessed_dynamic_token = {}
  log_line = " #{log_line}"

  regexes.each do |regex|
    log_line.gsub!(regex).each_with_index do |match, index|
      key = "manual_processed_dynamic_token_#{index + 1}"
      preprocessed_dynamic_token[key] = match
      '<*>'
    end
  end

  @general_regex.each do |regex|
    log_line.gsub!(regex).each_with_index do |match, index|
      key = "global_processed_dynamic_token_#{index + 1}"
      preprocessed_dynamic_token[key] = match
      '<*>'
    end
  end

  [log_line, preprocessed_dynamic_token]
end
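A small, assumed example of the general number regex at work on an instance configured as in the overview sketch (no log-file-specific regexes passed here):

line, replaced = preprocessor.preprocess_known_dynamic_tokens('retry attempt 3', [])
line      # => " retry attempt <*>"  (note the leading space the method prepends)
replaced  # => { "global_processed_dynamic_token_1" => "3" }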
#process_log_event(log_event, dynamic_token_threshold, parse) ⇒ Object
Processes a given log event by tokenizing it, parsing it, and updating the gram dictionary.
This method first calls the `token_splitter` method to split the log event into tokens based on the pre-configured format. The tokens are then passed to the gram dictionary's `upload_grams` method, which iteratively uploads single grams, digrams, and trigrams to the `@gram_dict`.
The process involves two primary steps: tokenization and dictionary updating. Tokenization is done based on the log format and involves masking sensitive information before splitting. Each token, digram, and trigram found in the log event is then uploaded to the gram dictionary, enhancing the dictionary’s ability to process future log events.
Parameters:
- log_event [String] the log event to be processed
- dynamic_token_threshold [Float] the threshold for a token to be considered a dynamic token or not
- parse [Boolean] controls whether the log_event should be parsed; set to false for seed log events
Returns:
- Array
  template_string [String] and the collected dynamic tokens [Hash], which are useful for log analysis and pattern recognition. The method also updates the gram dictionary based on this information.
# File 'lib/logstash/filters/preprocessor.rb', line 158

def process_log_event(log_event, dynamic_token_threshold, parse)
  template_string = nil
  dynamic_tokens = nil
  all_dynamic_tokens = {}

  # Split log event into tokens
  tokens, preprocessed_dynamic_token = token_splitter(log_event)
  # merge! so the preprocessed dynamic tokens are actually retained
  all_dynamic_tokens.merge!(preprocessed_dynamic_token) if preprocessed_dynamic_token

  # If no tokens were returned, do not parse the logs and return
  return if tokens.nil?

  if parse
    # Parse the log based on the pre-existing gramdict data
    parser = Parser.new(@gram_dict, dynamic_token_threshold)
    template_string, dynamic_tokens = parser.parse(tokens)

    # There should be no conflicts here as long as all preprocess_known_dynamic_tokens have
    # the format "[global/manual]_preprocessed_dynamic_token_{i}" and all the dynamic tokens have the
    # format "dynamic_token_{i}"
    all_dynamic_tokens.merge!(dynamic_tokens) if dynamic_tokens
  end

  # Update gram_dict
  @gram_dict.upload_grams(tokens)

  [template_string, all_dynamic_tokens]
end
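A hedged sketch of a call sequence, reusing the hypothetical preprocessor from the overview example: a seed pass that only populates the gram dictionary, followed by a parsed pass.

# Seed pass: upload grams only, no parsing
preprocessor.process_log_event('2024-01-01 12:00:00 Connected to 10.0.0.1', 0.5, false)

# Parsed pass against the accumulated grams
template, dynamic_tokens = preprocessor.process_log_event(
  '2024-01-01 12:00:01 Connected to 10.0.0.2', 0.5, true
)
# template is expected to resemble "Connected to <*>", with the masked IP
# recorded in dynamic_tokens; the exact output depends on Parser and the
# accumulated gram statistics.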
#regex_generator(logformat) ⇒ Object
Generates a regular expression based on a specified log format. It is designed to parse log files where the format of the logs is known and can be described using placeholders.
Parameters:
- logformat [String] a string representing the log format
Returns:
- Regexp
  a regular expression object that can be used to match and extract data from log lines that follow the specified format
# File 'lib/logstash/filters/preprocessor.rb', line 61

def regex_generator(logformat)
  # Split the logformat string into an array of strings and placeholders.
  # Placeholders are identified as text within angle brackets (< >).
  splitters = logformat.split(/(<[^<>]+>)/)

  format = ''

  # Iterate through the array of strings and placeholders.
  splitters.each_with_index do |splitter, k|
    if k.even?
      # For the actual string parts (even-indexed elements),
      # substitute spaces with the regex pattern for whitespace (\s+).
      format += splitter.gsub(/\s+/, '\s+')
    else
      # For placeholders (odd-indexed elements),
      # remove angle brackets and create named capture groups.
      # This transforms each placeholder into a regex pattern that matches any characters.
      header = splitter.gsub(/[<>]/, '')
      format += "(?<#{header}>.*?)"
    end
  end

  # Compile the complete regex pattern, anchored at the start and end.
  Regexp.new("^#{format}$")
end
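As an illustration of the transformation above, a made-up three-placeholder format produces a pattern equivalent to the one shown below (the format string is an assumption):

# regex_generator('<date> <time> <content>') yields a pattern equivalent to:
pattern = /^(?<date>.*?)\s+(?<time>.*?)\s+(?<content>.*?)$/

match = '2024-01-01 12:00:00 Connection accepted'.match(pattern)
match[:date]     # => "2024-01-01"
match[:time]     # => "12:00:00"
match[:content]  # => "Connection accepted"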
#token_splitter(log_line) ⇒ Object
Splits a log line into tokens based on a given format and regular expression.
Parameters:
- log_line [String] the log line to be processed
Returns:
- Array, nil
  an array of tokens together with the hash of preprocessed dynamic tokens if a match is found, otherwise [nil, nil]
# File 'lib/logstash/filters/preprocessor.rb', line 121

def token_splitter(log_line)
  # Finds matches in the stripped line for the regex format
  stripped_log_line = log_line.strip
  match = stripped_log_line.match(@format)

  # If no match is found, return nil
  if match.nil?
    [nil, nil]
  else
    # Gets the content and returns it
    content = match[@content_specifier]
    line, preprocessed_dynamic_token = preprocess_known_dynamic_tokens(content, @regexes)

    [line.strip.split, preprocessed_dynamic_token]
  end
end
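An assumed call, again using the '<date> <time> <content>' format and regexes from the earlier sketch:

tokens, replaced = preprocessor.token_splitter('2024-01-01 12:00:00 Connected to 10.0.0.1')
tokens    # => ["Connected", "to", "<*>"]
replaced  # => { "global_processed_dynamic_token_1" => "10.0.0.1" }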