Class: Bosh::Monitor::Plugins::ResurrectorHelper::AlertTracker

Inherits:
Object
Defined in:
lib/bosh/monitor/plugins/resurrector_helper.rb

Overview

Service which tracks alerts and decides whether or not the cluster is melting down. When the cluster is melting down, the resurrector backs off on fixing instances.

Constructor Details

#initialize(args = {}) ⇒ AlertTracker

Returns a new instance of AlertTracker.



# File 'lib/bosh/monitor/plugins/resurrector_helper.rb', line 44

def initialize(args={})
  @agent_manager       = Bhm.agent_manager
  @alert_times         = {} # maps JobInstanceKey to time of last Alert
  @minimum_down_jobs   = args.fetch('minimum_down_jobs', 5)
  @percent_threshold   = args.fetch('percent_threshold', 0.2)
  @time_threshold      = args.fetch('time_threshold', 600)
end
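
The constructor looks its options up under string keys, so a hash keyed by symbols would fall through to the defaults. The following is a minimal sketch of that lookup only. `read_options` is a hypothetical stand-alone helper, not part of bosh-monitor, used here because the real class also depends on `Bhm.agent_manager`.

```ruby
# Hypothetical helper mirroring the option lookup in AlertTracker#initialize.
# It only illustrates Hash#fetch with defaults; it is not the library's API.
def read_options(args = {})
  {
    minimum_down_jobs: args.fetch('minimum_down_jobs', 5),
    percent_threshold: args.fetch('percent_threshold', 0.2),
    time_threshold:    args.fetch('time_threshold', 600)
  }
end

defaults  = read_options
overrides = read_options({ 'minimum_down_jobs' => 10, 'time_threshold' => 300 })
symbols   = read_options({ minimum_down_jobs: 10 }) # symbol key is not matched

puts defaults.inspect
puts overrides.inspect
puts symbols.inspect
```

In this sketch the symbol-keyed call keeps the default of 5, since `fetch('minimum_down_jobs', 5)` never sees the `:minimum_down_jobs` entry.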

Instance Attribute Details

#minimum_down_jobs ⇒ Object

Minimum number of down agents required before a meltdown is considered; below this count, melting_down? returns false.



# File 'lib/bosh/monitor/plugins/resurrector_helper.rb', line 34

def minimum_down_jobs
  @minimum_down_jobs
end

#percent_threshold ⇒ Object

Fraction of the cluster that must be down for scanning to stop, given as a Float between 0 and 1 (for example, 0.2 means 20%).



# File 'lib/bosh/monitor/plugins/resurrector_helper.rb', line 42

def percent_threshold
  @percent_threshold
end

#time_threshold ⇒ Object

Window, as an Integer number of seconds, within which an alert is considered “current”; alerts older than this are ignored.



# File 'lib/bosh/monitor/plugins/resurrector_helper.rb', line 38

def time_threshold
  @time_threshold
end

Instance Method Details

#melting_down?(deployment) ⇒ Boolean

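“Melting down” means a large part of the cluster is offline and manual intervention may be required to fix it. Returns true when at least minimum_down_jobs agents have alerted within the last time_threshold seconds and those agents make up at least percent_threshold of the agents considered for the deployment.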

Returns:

  • (Boolean)


# File 'lib/bosh/monitor/plugins/resurrector_helper.rb', line 54

def melting_down?(deployment)
  agent_alerts = alerts_for_deployment(deployment)
  total_number_of_agents = agent_alerts.size
  number_of_down_agents = agent_alerts.select { |_, alert_time|
    alert_time > (Time.now - time_threshold)
  }.size

  return false if number_of_down_agents < minimum_down_jobs

  (number_of_down_agents.to_f / total_number_of_agents) >= percent_threshold
end
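
The two-step check above (an absolute floor, then a ratio) can be tried in isolation. The following is a sketch under stated assumptions: `melting_down_sketch?` is a hypothetical stand-alone function, and `alert_times` is assumed to be a Hash with one entry per agent in the deployment, mapping a key to the time of its last alert. What the private `alerts_for_deployment` actually returns is not shown in this page.

```ruby
# Stand-alone sketch of the meltdown decision; not the library's implementation.
# alert_times: hypothetical Hash of agent key => Time of last alert.
def melting_down_sketch?(alert_times, now:, minimum_down_jobs: 5,
                         percent_threshold: 0.2, time_threshold: 600)
  total = alert_times.size
  # An agent counts as down if its last alert falls inside the time window.
  down = alert_times.count { |_, alert_time| alert_time > (now - time_threshold) }

  return false if down < minimum_down_jobs

  (down.to_f / total) >= percent_threshold
end

now   = Time.at(10_000)
stale = Time.at(0)   # far outside the 600-second window
fresh = now - 60     # alerted one minute ago

# 6 of 20 agents recently alerted: 6 >= 5 and 6/20 = 0.3 >= 0.2
many_down = (1..20).to_h { |i| ["agent-#{i}", i <= 6 ? fresh : stale] }
# 4 of 20 agents recently alerted: below the minimum_down_jobs floor
few_down = (1..20).to_h { |i| ["agent-#{i}", i <= 4 ? fresh : stale] }
# 6 of 100 agents recently alerted: 0.06 < 0.2
large = (1..100).to_h { |i| ["agent-#{i}", i <= 6 ? fresh : stale] }

puts melting_down_sketch?(many_down, now: now)
puts melting_down_sketch?(few_down, now: now)
puts melting_down_sketch?(large, now: now)
```

Passing `now:` explicitly, instead of calling `Time.now` inside the function, is a choice made for this sketch so the result is reproducible.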

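#record(agent_key, alert_time) ⇒ Object

Stores alert_time as the time of the latest alert for agent_key, replacing any earlier entry for that key.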



# File 'lib/bosh/monitor/plugins/resurrector_helper.rb', line 66

def record(agent_key, alert_time)
  @alert_times[agent_key] = alert_time
end
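
Because the alert times live in a Hash, recording a second alert for the same agent overwrites the first. The following is a minimal sketch of that behavior. `AgentKey` is a hypothetical stand-in for the key type (a Struct, which compares and hashes by value), and the lambda mirrors the body of `record` without the surrounding class.

```ruby
# Hypothetical stand-in for the real key type; assumed to compare by value.
AgentKey = Struct.new(:deployment, :job, :index)

alert_times = {}
record = ->(agent_key, alert_time) { alert_times[agent_key] = alert_time }

record.call(AgentKey.new('cf', 'router', 0), Time.at(100))
record.call(AgentKey.new('cf', 'router', 0), Time.at(250)) # same agent, newer alert
record.call(AgentKey.new('cf', 'router', 1), Time.at(120))

puts alert_times.size
puts alert_times[AgentKey.new('cf', 'router', 0)].to_i
```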