Class: Parser
- Inherits:
- Object
- Defined in:
- lib/logstash/filters/parser.rb
Overview
The Parser class is responsible for analyzing log tokens and generating templates and events. It identifies dynamic tokens within logs and creates standardized templates by replacing these dynamic tokens. The class is initialized with two parameters:
-
gramdict: An instance of the GramDict class used for n-gram frequency analysis.
-
threshold: A numeric value used to determine if a token is dynamic based on its frequency.
If a token's frequency is less than or equal to this threshold, the token is dynamic.
Methods:
-
dynamic_token?: Determines if a token is dynamic by comparing its frequency to the set threshold.
-
calculate_token_frequency: Calculates frequency of a token considering its index position.
-
calculate_bigram_frequency: Determines frequency based on adjacent tokens (bigrams).
-
calculate_trigram_frequency: Calculates frequency based on trigram context.
-
find_dynamic_indices: Identifies all dynamic tokens in a log entry.
-
template_generator: Generates a log template by replacing dynamic tokens.
-
parse: Processes a token list and returns its template string together with the extracted dynamic tokens.
Instance Method Summary
-
#calculate_bigram_frequency(tokens, index) ⇒ Object
Method: calculate_bigram_frequency This method calculates the frequency of a token within the context of a bigram (pair of adjacent tokens).
-
#calculate_token_frequency(tokens, dynamic_indices, index) ⇒ Object
Method: calculate_token_frequency This method determines the frequency of a token within a log entry, considering the context provided by adjacent tokens.
-
#calculate_trigram_frequency(tokens, index) ⇒ Object
Method: calculate_trigram_frequency This method calculates the frequency of a token within the context of a trigram (sequence of three adjacent tokens).
-
#dynamic_token?(tokens, dynamic_indices, index) ⇒ Boolean
Method: dynamic_token? This method evaluates if a given token in a log is dynamic by assessing its frequency relative to a set threshold.
-
#find_dynamic_indices(tokens) ⇒ Object
Method: find_dynamic_indices This method identifies dynamic tokens in a given log entry.
-
#initialize(gramdict, threshold) ⇒ Parser
constructor
A new instance of Parser.
-
#parse(log_tokens) ⇒ Object
Parameters: log_tokens: An array of tokens from the log entry.
-
#template_generator(tokens, dynamic_indices) ⇒ Object
Method: template_generator Generates a standardized log template from a list of tokens.
Constructor Details
#initialize(gramdict, threshold) ⇒ Parser
Returns a new instance of Parser.
# File 'lib/logstash/filters/parser.rb', line 22
def initialize(gramdict, threshold)
  @gramdict = gramdict
  @threshold = threshold
end
Instance Method Details
#calculate_bigram_frequency(tokens, index) ⇒ Object
Method: calculate_bigram_frequency This method calculates the frequency of a token within the context of a bigram (pair of adjacent tokens). It forms a bigram with the token and its preceding token, then checks their frequency in the GramDict instance. The frequency is determined as the ratio of the bigram frequency to the frequency of the preceding single token.
Parameters:
-
tokens: An array of tokens representing the log entry.
-
index: The current index of the token for which the bigram frequency is being calculated.
Returns: The frequency of the bigram as a float. If the bigram or singlegram is not found in the dictionaries, it returns 0, indicating a lack of previous occurrences.
# File 'lib/logstash/filters/parser.rb', line 83
def calculate_bigram_frequency(tokens, index)
  singlegram = tokens[index - 1]
  doublegram = "#{singlegram}^#{tokens[index]}"

  if @gramdict.double_gram_dict.key?(doublegram) && @gramdict.single_gram_dict.key?(singlegram)
    @gramdict.double_gram_dict[doublegram].to_f / @gramdict.single_gram_dict[singlegram]
  else
    0
  end
end
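As a worked illustration of the ratio described above, the following sketch uses plain hashes in place of the GramDict dictionaries. The helper name bigram_frequency and all counts are invented for the example; only the "^" key separator and the ratio itself come from the listing.

```ruby
# Illustrative helper (not part of Parser): count(prev^cur) / count(prev),
# or 0 when either key is missing. `single` and `double` are plain hashes
# standing in for GramDict#single_gram_dict and GramDict#double_gram_dict.
def bigram_frequency(single, double, tokens, index)
  prev = tokens[index - 1]
  key = "#{prev}^#{tokens[index]}"
  return 0 unless double.key?(key) && single.key?(prev)

  double[key].to_f / single[prev]
end

# Made-up counts: "Connection" was seen 10 times, 6 of them followed by "from".
single = { 'Connection' => 10, 'from' => 8 }
double = { 'Connection^from' => 6 }
tokens = %w[Connection from 10.0.0.7]

puts bigram_frequency(single, double, tokens, 1) # 0.6
puts bigram_frequency(single, double, tokens, 2) # 0, "from^10.0.0.7" was never counted
```

An unseen bigram yields 0, which is less than or equal to any non-negative threshold, so such a token would always be classified as dynamic.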
#calculate_token_frequency(tokens, dynamic_indices, index) ⇒ Object
Method: calculate_token_frequency This method determines the frequency of a token within a log entry, considering the context provided by adjacent tokens. It switches between bigram and trigram frequency calculations based on the token’s position and the dynamic status of preceding tokens.
The method returns 1 for the first token (index 0), giving it the maximum frequency because it has no preceding context. For the second token (index 1), it calculates the bigram frequency. When the token two positions earlier is dynamic, it also falls back to the bigram frequency, since a trigram anchored on a dynamic token is not a meaningful context. In all other cases, it calculates the trigram frequency.
Parameters:
-
tokens: An array of tokens from the log entry.
-
dynamic_indices: An array of indices for previously identified dynamic tokens.
-
index: The index of the current token for which the frequency is calculated.
Returns: The calculated frequency of the token as a float, based on bigram or trigram analysis.
# File 'lib/logstash/filters/parser.rb', line 61
def calculate_token_frequency(tokens, dynamic_indices, index)
  if index.zero?
    1
  elsif index == 1 || dynamic_indices.include?(index - 2)
    calculate_bigram_frequency(tokens, index)
  else
    calculate_trigram_frequency(tokens, index)
  end
end
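The branch selection can be easier to follow when separated from the frequency lookups. The sketch below is a hypothetical helper, context_for, that only names the context the method would use at each index; it is not part of Parser and assumes the same conditions as the listing.

```ruby
# Illustrative helper (not part of Parser): names the context that
# calculate_token_frequency would use for the token at `index`.
def context_for(index, dynamic_indices)
  if index.zero?
    :none    # first token: frequency is fixed at 1
  elsif index == 1 || dynamic_indices.include?(index - 2)
    :bigram  # only one preceding token, or the token two back is dynamic
  else
    :trigram
  end
end

# Assume the token at index 1 has already been flagged as dynamic.
dynamic_indices = [1]
(0..4).each do |i|
  puts "index #{i}: #{context_for(i, dynamic_indices)}"
end
```

With index 1 flagged, index 3 falls back to a bigram, while index 2 still uses a trigram because only the token two positions back is checked.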
#calculate_trigram_frequency(tokens, index) ⇒ Object
Method: calculate_trigram_frequency This method calculates the frequency of a token within the context of a trigram (sequence of three adjacent tokens). It forms a trigram with the token and its two preceding tokens and also considers the intermediate bigram. The frequency is determined as the ratio of the trigram frequency to the frequency of the preceding bigram.
Parameters:
-
tokens: An array of tokens representing the log entry.
-
index: The current index of the token for which the trigram frequency is being calculated.
Returns: The frequency of the trigram as a float. If the trigram or the intermediate bigram is not found in the dictionaries, it returns 0, suggesting a unique or rare occurrence in the logs.
# File 'lib/logstash/filters/parser.rb', line 106
def calculate_trigram_frequency(tokens, index)
  doublegram = "#{tokens[index - 2]}^#{tokens[index - 1]}"
  trigram = "#{doublegram}^#{tokens[index]}"

  if @gramdict.tri_gram_dict.key?(trigram) && @gramdict.double_gram_dict.key?(doublegram)
    @gramdict.tri_gram_dict[trigram].to_f / @gramdict.double_gram_dict[doublegram]
  else
    0
  end
end
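A small worked example may help make the trigram ratio concrete. The helper trigram_frequency and the counts below are assumptions made for illustration, with plain hashes standing in for the GramDict dictionaries; the key format follows the listing.

```ruby
# Illustrative helper (not part of Parser): count(a^b^c) / count(a^b),
# or 0 when either key is missing. `double` and `triple` are plain hashes
# standing in for GramDict#double_gram_dict and GramDict#tri_gram_dict.
def trigram_frequency(double, triple, tokens, index)
  doublegram = "#{tokens[index - 2]}^#{tokens[index - 1]}"
  trigram = "#{doublegram}^#{tokens[index]}"
  return 0 unless triple.key?(trigram) && double.key?(doublegram)

  triple[trigram].to_f / double[doublegram]
end

# Made-up counts: "Accepted password" occurred 4 times, 3 of them followed by "for".
double = { 'Accepted^password' => 4 }
triple = { 'Accepted^password^for' => 3 }
tokens = %w[Accepted password for alice]

puts trigram_frequency(double, triple, tokens, 2) # 0.75
puts trigram_frequency(double, triple, tokens, 3) # 0, "password^for" has no count
```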
#dynamic_token?(tokens, dynamic_indices, index) ⇒ Boolean
Method: dynamic_token? This method evaluates if a given token in a log is dynamic by assessing its frequency relative to a set threshold. A token is deemed dynamic if its frequency is equal to or lower than the threshold value.
Parameters:
-
tokens: An array of tokens from a log entry.
-
dynamic_indices: An array containing indices of previously identified dynamic tokens.
-
index: The index of the current token being evaluated.
Returns: A boolean indicating whether the token is dynamic (true) or static (false).
# File 'lib/logstash/filters/parser.rb', line 38
def dynamic_token?(tokens, dynamic_indices, index)
  frequency = calculate_token_frequency(tokens, dynamic_indices, index)
  frequency <= @threshold
end
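The comparison is inclusive, which matters at the boundary. The sketch below applies the same test to a few precomputed frequencies; the helper dynamic? and the threshold value 0.5 are assumptions chosen only for this example.

```ruby
# Illustrative helper (not part of Parser): the same comparison as
# dynamic_token?, applied to an already computed frequency.
def dynamic?(frequency, threshold)
  frequency <= threshold
end

threshold = 0.5 # assumed value, chosen only for this example
[0, 0.25, 0.5, 0.75, 1].each do |frequency|
  label = dynamic?(frequency, threshold) ? 'dynamic' : 'static'
  puts "#{frequency} -> #{label}"
end
```

A frequency exactly equal to the threshold counts as dynamic, and the fixed frequency of 1 assigned to the first token is static for any threshold below 1.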
#find_dynamic_indices(tokens) ⇒ Object
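Method: find_dynamic_indices This method identifies dynamic tokens in a given log entry. It iterates through the tokens from index 1 onward (the first token is never evaluated) and uses the dynamic_token? method to check if each token is dynamic. The indices found so far are passed to each check, so earlier results influence how later tokens are scored. Dynamic tokens are those whose frequency is less than or equal to the threshold, suggesting variability in log entries.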
Parameters: tokens: An array of tokens representing the log entry.
Returns: An array of indices corresponding to dynamic tokens within the log entry.
# File 'lib/logstash/filters/parser.rb', line 127
def find_dynamic_indices(tokens)
  dynamic_indices = []
  if tokens.length >= 2
    index = 1
    while index < tokens.length
      dynamic_indices << index if dynamic_token?(tokens, dynamic_indices, index) # Directly calling dynamic_token?
      index += 1
    end
  end
  dynamic_indices
end
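The scan itself can be sketched independently of the frequency logic. In the example below, scan_dynamic_indices is a hypothetical helper that takes the dynamic check as a block, and the block used here (purely numeric tokens are dynamic) is a stand-in assumption rather than the n-gram test that Parser applies.

```ruby
# Illustrative helper (not part of Parser): visits indices 1..n-1 and hands
# the indices found so far to the block, mirroring the loop in the listing.
def scan_dynamic_indices(tokens)
  dynamic_indices = []
  (1...tokens.length).each do |index|
    dynamic_indices << index if yield(tokens, dynamic_indices, index)
  end
  dynamic_indices
end

# Stand-in predicate (an assumption for this example): a token made only
# of digits is treated as dynamic.
tokens = %w[42 retries after 30 seconds]
result = scan_dynamic_indices(tokens) { |toks, _found, i| toks[i].match?(/\A\d+\z/) }
p result # [3]
```

The token at index 0 is numeric but is not reported, because the scan starts at index 1.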
#parse(log_tokens) ⇒ Object
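Processes a tokenized log entry: finds its dynamic tokens, builds the template, and strips commas and quote characters from the template.
Parameters: log_tokens: An array of tokens from the log entry.
Returns: An array containing the template_string and the dynamic_tokens map, which are useful for log analysis and pattern recognition.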
# File 'lib/logstash/filters/parser.rb', line 178
def parse(log_tokens)
  dynamic_indices = find_dynamic_indices(log_tokens)
  template_string, dynamic_tokens = template_generator(log_tokens, dynamic_indices)

  # TODO: The Python iteration of the parser does a few regex checks here on the templates
  # It's unclear based on preliminary data if we need this, but once the full plugin has been fleshed out we can
  # revisit
  template_string.gsub!(/[,'"]/, '')

  [template_string, dynamic_tokens]
end
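The cleanup step in parse removes commas, single quotes, and double quotes from the template. The sketch below isolates that substitution; the helper name strip_template_punctuation and the sample template are invented for illustration.

```ruby
# Illustrative helper (not part of Parser): applies the same character
# class as parse to a copy of the template and returns the cleaned copy.
def strip_template_punctuation(template)
  cleaned = template.dup
  cleaned.gsub!(/[,'"]/, '') # returns nil when nothing matched, so use `cleaned`
  cleaned
end

# A made-up template of the kind template_generator might produce.
template = %q(user 'alice', action "login" <*> )
p strip_template_punctuation(template) # "user alice action login <*> "
```

Because String#gsub! returns nil when no substitution takes place, parse relies on the in-place mutation and ignores the return value; the sketch does the same.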
#template_generator(tokens, dynamic_indices) ⇒ Object
Method: template_generator Generates a standardized log template from a list of tokens. This method replaces dynamic tokens (identified by their indices in dynamic_indices) with a placeholder symbol ‘<*>’ and stores the tokens for output. The result is a template that represents the static structure of the log entry, with dynamic parts parsed out.
Parameters:
-
tokens: An array of tokens from the log entry.
-
dynamic_indices: An array of indices indicating which tokens are dynamic.
Returns:
-
template: A string representing the log template, with dynamic tokens replaced by '<*>'. Every token, including the last, is followed by a single space.
-
dynamic_tokens: A hash of the dynamic tokens keyed by position, structured as { "dynamic_token_<index>" => <token> }.
# File 'lib/logstash/filters/parser.rb', line 152
def template_generator(tokens, dynamic_indices)
  template = String.new('')
  dynamic_tokens = {}

  tokens.each_with_index do |token, index|
    if dynamic_indices.include?(index)
      template << '<*> '
      dynamic_tokens["dynamic_token_#{index}"] = token
    else
      template << "#{token} "
    end
  end

  [template, dynamic_tokens]
end
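To show the shape of the two return values, the sketch below restates the listing as a standalone method and runs it on a made-up token list. The name build_template and the sample input are assumptions for the example.

```ruby
# Illustrative restatement of the listing as a standalone method
# (not part of Parser).
def build_template(tokens, dynamic_indices)
  template = String.new
  dynamic_tokens = {}
  tokens.each_with_index do |token, index|
    if dynamic_indices.include?(index)
      template << '<*> '
      dynamic_tokens["dynamic_token_#{index}"] = token
    else
      template << "#{token} "
    end
  end
  [template, dynamic_tokens]
end

# Made-up input: the token at index 2 is assumed to be dynamic.
template, dynamic_tokens = build_template(%w[Connection from 10.0.0.7 closed], [2])
p template       # "Connection from <*> closed " (note the trailing space)
p dynamic_tokens # one entry: key "dynamic_token_2", value "10.0.0.7"
```

Callers that compare templates may want to account for the trailing space, since the method does not strip it.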