Class: Parser

Inherits:
Object
Defined in:
lib/logstash/filters/parser.rb

Overview

The Parser class analyzes log tokens to generate templates and events. It identifies the dynamic tokens in a log entry and builds a standardized template by replacing them with a placeholder. The class is initialized with two parameters:

  • gramdict: An instance of the GramDict class used for n-gram frequency analysis.

  • threshold: A numeric value used to decide whether a token is dynamic. A token whose frequency is less than or equal to this threshold is treated as dynamic.

Methods:

  • dynamic_token?: Determines if a token is dynamic by comparing its frequency to the set threshold.

  • calculate_token_frequency: Calculates frequency of a token considering its index position.

  • calculate_bigram_frequency: Determines frequency based on adjacent tokens (bigrams).

  • calculate_trigram_frequency: Calculates frequency based on trigram context.

  • find_dynamic_indices: Identifies all dynamic tokens in a log entry.

  • template_generator: Generates a log template by replacing dynamic tokens.

  • parse: Processes a token list and returns its template string together with the extracted dynamic tokens.


Constructor Details

#initialize(gramdict, threshold) ⇒ Parser

Returns a new instance of Parser.



# File 'lib/logstash/filters/parser.rb', line 22

def initialize(gramdict, threshold)
  @gramdict = gramdict
  @threshold = threshold
end
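The methods on this page read three hashes from gramdict (single_gram_dict, double_gram_dict and tri_gram_dict), with multi-token keys joined by '^'. How GramDict fills those hashes is not shown here, so the following is only a sketch of one way to produce counts in that shape. The GramCounts struct and the build_counts helper are hypothetical names introduced for illustration.

```ruby
# Hypothetical stand-in exposing the three hashes that Parser reads.
GramCounts = Struct.new(:single_gram_dict, :double_gram_dict, :tri_gram_dict)

# Assumed counting scheme: one count per occurrence of each 1-, 2- and 3-gram.
def build_counts(token_lists)
  counts = GramCounts.new(Hash.new(0), Hash.new(0), Hash.new(0))
  token_lists.each do |tokens|
    tokens.each { |token| counts.single_gram_dict[token] += 1 }
    tokens.each_cons(2) { |pair| counts.double_gram_dict[pair.join('^')] += 1 }
    tokens.each_cons(3) { |triple| counts.tri_gram_dict[triple.join('^')] += 1 }
  end
  counts
end

counts = build_counts([%w[job started ok], %w[job started late]])
p counts.single_gram_dict['job']
p counts.double_gram_dict['job^started']
p counts.tri_gram_dict['job^started^ok']
```

An object with these three readers can be passed as the gramdict argument, alongside a numeric threshold.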

Instance Method Details

#calculate_bigram_frequency(tokens, index) ⇒ Object

Calculates the frequency of a token in the context of a bigram (a pair of adjacent tokens). The method forms a bigram from the token and its preceding token, joined with '^', and looks it up in the GramDict instance. The frequency is the ratio of the bigram count to the count of the preceding single token.

Parameters:

  • tokens: An array of tokens representing the log entry.

  • index: The index of the token for which the bigram frequency is calculated.

Returns: The bigram frequency as a float. If the bigram or the preceding single token is not found in the dictionaries, the method returns the integer 0, indicating no previous occurrences.



# File 'lib/logstash/filters/parser.rb', line 83

def calculate_bigram_frequency(tokens, index)
  singlegram = tokens[index - 1]
  doublegram = "#{singlegram}^#{tokens[index]}"

  if @gramdict.double_gram_dict.key?(doublegram) && @gramdict.single_gram_dict.key?(singlegram)
    @gramdict.double_gram_dict[doublegram].to_f / @gramdict.single_gram_dict[singlegram]
  else
    0
  end
end
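As a worked illustration of the bigram ratio, the sketch below applies the same lookup to plain hashes. The counts and the standalone bigram_frequency helper are made up for the example; in the plugin the hashes come from the GramDict instance.

```ruby
# Hypothetical counts; real values would come from a GramDict instance.
single_gram_dict = { 'Connection' => 10, 'from' => 8 }
double_gram_dict = { 'Connection^from' => 8, 'from^10.0.0.1' => 1 }

# count(previous^current) / count(previous), or 0 when either key is missing.
def bigram_frequency(single_gram_dict, double_gram_dict, tokens, index)
  singlegram = tokens[index - 1]
  doublegram = "#{singlegram}^#{tokens[index]}"
  return 0 unless double_gram_dict.key?(doublegram) && single_gram_dict.key?(singlegram)

  double_gram_dict[doublegram].to_f / single_gram_dict[singlegram]
end

tokens = %w[Connection from 10.0.0.1]
common = bigram_frequency(single_gram_dict, double_gram_dict, tokens, 1) # 8 / 10
rare   = bigram_frequency(single_gram_dict, double_gram_dict, tokens, 2) # 1 / 8
unseen = bigram_frequency(single_gram_dict, double_gram_dict, %w[Connection to], 1)
puts common, rare, unseen
```

A token that almost always follows its predecessor scores close to 1, while an IP address that appeared once after "from" scores 0.125 and an unseen pair scores 0.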

#calculate_token_frequency(tokens, dynamic_indices, index) ⇒ Object

Determines the frequency of a token within a log entry, using the context provided by the preceding tokens. It chooses between the bigram and trigram calculations based on the token's position and on whether an earlier token was already marked dynamic.

The method returns 1 for the first token (index 0), giving it the maximum frequency because there is no previous context. For the second token (index 1), it calculates the bigram frequency. The bigram frequency is also used when the token two positions earlier is dynamic, because a trigram that starts with a dynamic token is not a meaningful context. In all other cases, it calculates the trigram frequency.

Parameters:

  • tokens: An array of tokens from the log entry.

  • dynamic_indices: An array of indices for previously identified dynamic tokens.

  • index: The index of the current token for which the frequency is calculated.

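Returns: The calculated frequency of the token, based on bigram or trigram analysis. The value is a float when an n-gram lookup succeeds, the integer 1 for the first token, and the integer 0 when a lookup fails.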



# File 'lib/logstash/filters/parser.rb', line 61

def calculate_token_frequency(tokens, dynamic_indices, index)
  if index.zero?
    1
  elsif index == 1 || dynamic_indices.include?(index - 2)
    calculate_bigram_frequency(tokens, index)
  else
    calculate_trigram_frequency(tokens, index)
  end
end
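The selection rule can be easier to follow when separated from the frequency lookups. The sketch below returns only the name of the calculation that would be chosen for each index; the frequency_strategy helper and its symbols are hypothetical and exist only for this illustration.

```ruby
# Returns which calculation the rule above selects for a token position.
def frequency_strategy(dynamic_indices, index)
  if index.zero?
    :fixed_one
  elsif index == 1 || dynamic_indices.include?(index - 2)
    :bigram
  else
    :trigram
  end
end

dynamic_indices = [2] # assume the token at index 2 was already flagged
strategies = (0..4).map { |index| frequency_strategy(dynamic_indices, index) }
p strategies
```

With index 2 flagged, index 4 falls back to the bigram calculation, while index 3 still uses a trigram because only the token two positions back is checked.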

#calculate_trigram_frequency(tokens, index) ⇒ Object

Calculates the frequency of a token in the context of a trigram (a sequence of three adjacent tokens). The method forms a trigram from the token and its two preceding tokens, joined with '^', and also builds the bigram of the two preceding tokens. The frequency is the ratio of the trigram count to the count of that preceding bigram.

Parameters:

  • tokens: An array of tokens representing the log entry.

  • index: The index of the token for which the trigram frequency is calculated.

Returns: The trigram frequency as a float. If the trigram or the preceding bigram is not found in the dictionaries, the method returns the integer 0, suggesting a unique or rare occurrence in the logs.



# File 'lib/logstash/filters/parser.rb', line 106

def calculate_trigram_frequency(tokens, index)
  doublegram = "#{tokens[index - 2]}^#{tokens[index - 1]}"
  trigram = "#{doublegram}^#{tokens[index]}"

  if @gramdict.tri_gram_dict.key?(trigram) && @gramdict.double_gram_dict.key?(doublegram)
    @gramdict.tri_gram_dict[trigram].to_f / @gramdict.double_gram_dict[doublegram]
  else
    0
  end
end
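A small worked example of the trigram ratio, using plain hashes in place of the GramDict instance. The counts and the standalone trigram_frequency helper are assumptions made for illustration.

```ruby
# Hypothetical counts; real values would come from a GramDict instance.
double_gram_dict = { 'session^opened' => 4 }
tri_gram_dict = { 'session^opened^for' => 3 }

# count(a^b^c) / count(a^b), or 0 when either key is missing.
def trigram_frequency(double_gram_dict, tri_gram_dict, tokens, index)
  doublegram = "#{tokens[index - 2]}^#{tokens[index - 1]}"
  trigram = "#{doublegram}^#{tokens[index]}"
  return 0 unless tri_gram_dict.key?(trigram) && double_gram_dict.key?(doublegram)

  tri_gram_dict[trigram].to_f / double_gram_dict[doublegram]
end

seen = trigram_frequency(double_gram_dict, tri_gram_dict, %w[session opened for], 2)
unseen = trigram_frequency(double_gram_dict, tri_gram_dict, %w[session opened by], 2)
puts seen, unseen
```

Here "for" followed "session opened" in 3 of 4 occurrences, giving 0.75, while "by" was never seen in that context and scores 0.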

#dynamic_token?(tokens, dynamic_indices, index) ⇒ Boolean

Evaluates whether a given token in a log entry is dynamic by comparing its frequency to the configured threshold. A token is deemed dynamic if its frequency is less than or equal to the threshold.

Parameters:

  • tokens: An array of tokens from a log entry.

  • dynamic_indices: An array containing indices of previously identified dynamic tokens.

  • index: The index of the current token being evaluated.

Returns:

  • (Boolean): true if the token is dynamic, false if it is static.


# File 'lib/logstash/filters/parser.rb', line 38

def dynamic_token?(tokens, dynamic_indices, index)
  frequency = calculate_token_frequency(tokens, dynamic_indices, index)
  frequency <= @threshold
end
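Because the comparison is inclusive, a token whose frequency equals the threshold counts as dynamic. The sketch below shows that boundary with made-up frequencies; the threshold of 0.5 is an assumed value, not a documented default.

```ruby
threshold = 0.5 # hypothetical value chosen for the example

frequencies = { 'rare' => 0.125, 'borderline' => 0.5, 'common' => 0.8 }

# Same comparison as dynamic_token?: frequency <= threshold means dynamic.
dynamic = frequencies.select { |_label, frequency| frequency <= threshold }.keys
p dynamic
```

Raising the threshold makes the parser more aggressive about replacing tokens, and lowering it keeps more tokens in the template.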

#find_dynamic_indices(tokens) ⇒ Object

Identifies the dynamic tokens in a given log entry. It iterates through the tokens, starting at index 1, and uses the dynamic_token? method to check each one. Dynamic tokens are those whose frequency is less than or equal to the threshold, which suggests that they vary between log entries. The first token (index 0) is never checked, so it is always treated as static, and entries with fewer than two tokens yield an empty result.

Parameters:

  • tokens: An array of tokens representing the log entry.

Returns: An array of indices corresponding to dynamic tokens within the log entry.



# File 'lib/logstash/filters/parser.rb', line 127

def find_dynamic_indices(tokens)
  dynamic_indices = []
  if tokens.length >= 2
    index = 1
    while index < tokens.length
      dynamic_indices << index if dynamic_token?(tokens, dynamic_indices, index) # Directly calling dynamic_token?
      index += 1
    end
  end
  dynamic_indices
end
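The scan can be tried in isolation by swapping the frequency check for a block. In the sketch below the predicate flags purely numeric tokens, which is an assumption made only so the example runs without n-gram counts; the real check is frequency-based.

```ruby
# Same traversal as above (start at index 1, accumulate flagged indices),
# with the dynamic check supplied by the caller as a block.
def find_dynamic_indices(tokens)
  dynamic_indices = []
  (1...tokens.length).each do |index|
    dynamic_indices << index if yield(tokens, dynamic_indices, index)
  end
  dynamic_indices
end

# Stand-in predicate: treat all-digit tokens as dynamic.
numeric = ->(tokens, _dynamic_indices, index) { tokens[index].match?(/\A\d+\z/) }

result = find_dynamic_indices(%w[42 retries after 7 seconds], &numeric)
single = find_dynamic_indices(%w[42], &numeric)
p result, single
```

The leading "42" is not flagged even though it matches the predicate, because index 0 is never visited.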

#parse(log_tokens) ⇒ Object

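Finds the dynamic tokens in a log entry, generates its template, and strips commas and quote characters from the template string.

Parameters:

  • log_tokens: An array of tokens from the log entry.

Returns: An array containing the template_string and the dynamic_tokens hash, which are useful for log analysis and pattern recognition.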



# File 'lib/logstash/filters/parser.rb', line 178

def parse(log_tokens)
  dynamic_indices = find_dynamic_indices(log_tokens)
  template_string, dynamic_tokens = template_generator(log_tokens, dynamic_indices)

  # TODO: The Python version of the parser runs a few regex checks on the templates here.
  # It's unclear from preliminary data whether we need them; revisit once the full plugin
  # has been fleshed out.
  template_string.gsub!(/[,'"]/, '')

  [template_string, dynamic_tokens]
end
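To see the pieces working together, here is a condensed end-to-end sketch. It is a re-implementation for illustration under stated assumptions, not the plugin's own code: StubGramDict and SketchParser are hypothetical names, the counts are made up, and the threshold of 0.5 is arbitrary.

```ruby
# Hypothetical stand-in for GramDict: it only exposes the three hashes Parser reads.
StubGramDict = Struct.new(:single_gram_dict, :double_gram_dict, :tri_gram_dict)

# Condensed re-implementation for illustration; not the plugin's own code.
class SketchParser
  def initialize(gramdict, threshold)
    @gramdict = gramdict
    @threshold = threshold
  end

  def parse(tokens)
    # Scan from index 1; flagged indices feed back into later frequency checks.
    dynamic_indices = (1...tokens.length).each_with_object([]) do |index, found|
      found << index if frequency(tokens, found, index) <= @threshold
    end
    template = tokens.each_with_index.map do |token, index|
      dynamic_indices.include?(index) ? '<*> ' : "#{token} "
    end.join
    dynamic_tokens = dynamic_indices.map { |i| ["dynamic_token_#{i}", tokens[i]] }.to_h
    [template.gsub(/[,'"]/, ''), dynamic_tokens]
  end

  private

  # Bigram for index 1 or when the token two back is dynamic, else trigram.
  def frequency(tokens, dynamic_indices, index)
    if index == 1 || dynamic_indices.include?(index - 2)
      ratio(@gramdict.double_gram_dict, @gramdict.single_gram_dict, tokens[index - 1], tokens[index])
    else
      context = "#{tokens[index - 2]}^#{tokens[index - 1]}"
      ratio(@gramdict.tri_gram_dict, @gramdict.double_gram_dict, context, tokens[index])
    end
  end

  # count(context^token) / count(context), or 0 when either key is missing.
  def ratio(joint_counts, context_counts, context, token)
    key = "#{context}^#{token}"
    return 0 unless joint_counts.key?(key) && context_counts.key?(context)

    joint_counts[key].to_f / context_counts[context]
  end
end

gramdict = StubGramDict.new(
  { 'Accepted' => 4, 'password' => 4, 'for' => 4 },
  { 'Accepted^password' => 4, 'password^for' => 4 },
  { 'Accepted^password^for' => 4, 'password^for^alice' => 1 }
)
template, dynamic_tokens = SketchParser.new(gramdict, 0.5).parse(%w[Accepted password for alice])
p template # "Accepted password for <*> "
p dynamic_tokens
```

Only "alice" falls at or below the threshold (1 / 4 = 0.25), so it is the single token replaced and it is returned under the key "dynamic_token_3".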

#template_generator(tokens, dynamic_indices) ⇒ Object

Generates a standardized log template from a list of tokens. The method replaces each dynamic token (identified by its index in dynamic_indices) with the placeholder '<*>' and stores the original token for output. The result is a template that represents the static structure of the log entry, with the dynamic parts extracted. Every token or placeholder is followed by a single space, so the template ends with a trailing space.

Parameters:

  • tokens: An array of tokens from the log entry.

  • dynamic_indices: An array of indices indicating which tokens are dynamic.

Returns:

  • template: A string representing the log template, with dynamic tokens replaced by '<*>'.

  • dynamic_tokens: A hash of the dynamic tokens, keyed by position in the form { "dynamic_token_<index>" => <dynamic_token> }.



# File 'lib/logstash/filters/parser.rb', line 152

def template_generator(tokens, dynamic_indices)
  template = String.new('')
  dynamic_tokens = {}

  tokens.each_with_index do |token, index|
    if dynamic_indices.include?(index)
      template << '<*> '
      dynamic_tokens["dynamic_token_#{index}"] = token
    else
      template << "#{token} "
    end
  end

  [template, dynamic_tokens]
end
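The shape of the two return values is easiest to see on a concrete input. The sketch below produces the same output with map and join rather than appending to a string; the token list and indices are made up, and it assumes the indices are in range and in ascending order.

```ruby
tokens = %w[Accepted password for alice port 22]
dynamic_indices = [3, 5] # assumed to have been found by an earlier scan

# Each token or placeholder carries its own trailing space, as in the method above.
template = tokens.each_with_index.map do |token, index|
  dynamic_indices.include?(index) ? '<*> ' : "#{token} "
end.join

# Keys follow the "dynamic_token_<index>" pattern.
dynamic_tokens = dynamic_indices.map { |index| ["dynamic_token_#{index}", tokens[index]] }.to_h

p template # "Accepted password for <*> port <*> "
p dynamic_tokens
```

Callers that compare or store templates may want to call strip on the result, since the trailing space is part of the returned string.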