Class: Bosh::Monitor::Plugins::ResurrectorHelper::AlertTracker
- Inherits: Object
- Defined in: lib/bosh/monitor/plugins/resurrector_helper.rb
Overview
Service which tracks alerts and decides whether or not the cluster is melting down. When the cluster is melting down, the resurrector backs off on fixing instances.
Instance Attribute Summary
-
#minimum_down_jobs ⇒ Object
Below this number of down agents we don’t consider a meltdown occurring.
-
#percent_threshold ⇒ Object
Fraction of the cluster (a float between 0 and 1) that must be down for scanning to stop.
-
#time_threshold ⇒ Object
Number of seconds for which an alert is considered “current”; alerts older than this are ignored.
Instance Method Summary
-
#initialize(args = {}) ⇒ AlertTracker
constructor
A new instance of AlertTracker.
-
#melting_down?(deployment) ⇒ Boolean
“Melting down” means a large part of the cluster is offline and manual intervention may be required to fix it.
-
#record(agent_key, alert_time) ⇒ Object
Constructor Details
#initialize(args = {}) ⇒ AlertTracker
Returns a new instance of AlertTracker.
# File 'lib/bosh/monitor/plugins/resurrector_helper.rb', line 44

    def initialize(args={})
      @agent_manager = Bhm.agent_manager
      @alert_times = {} # maps JobInstanceKey to time of last Alert
      @minimum_down_jobs = args.fetch('minimum_down_jobs', 5)
      @percent_threshold = args.fetch('percent_threshold', 0.2)
      @time_threshold = args.fetch('time_threshold', 600)
    end
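The constructor reads its options with string keys, so a hash with symbol keys falls through to the defaults without raising. A minimal sketch of that behaviour follows; `resolve_thresholds` is a hypothetical helper written for this example that only mirrors the three `fetch` calls above, and it is not part of the library.

```ruby
# Hypothetical helper (not part of Bosh::Monitor): mirrors the three
# args.fetch calls in AlertTracker#initialize so the defaults can be
# inspected without a running agent manager.
def resolve_thresholds(args = {})
  {
    minimum_down_jobs: args.fetch('minimum_down_jobs', 5),
    percent_threshold: args.fetch('percent_threshold', 0.2),
    time_threshold: args.fetch('time_threshold', 600),
  }
end

# No options: every value comes from the defaults.
defaults = resolve_thresholds

# String keys are picked up; omitted keys keep their defaults.
string_keys = resolve_thresholds('minimum_down_jobs' => 10, 'percent_threshold' => 0.5)

# Symbol keys do not match the string lookups, so the default is used.
symbol_keys = resolve_thresholds(minimum_down_jobs: 10)
```

If the options come from a source that may produce symbol keys, converting them to strings before constructing the tracker avoids silently running with the defaults.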
Instance Attribute Details
#minimum_down_jobs ⇒ Object
Below this number of down agents we don’t consider a meltdown occurring.
# File 'lib/bosh/monitor/plugins/resurrector_helper.rb', line 34

    def minimum_down_jobs
      @minimum_down_jobs
    end
#percent_threshold ⇒ Object
Fraction of the cluster that must be down for scanning to stop, as a float between 0 and 1 (default 0.2).
# File 'lib/bosh/monitor/plugins/resurrector_helper.rb', line 42

    def percent_threshold
      @percent_threshold
    end
#time_threshold ⇒ Object
Number of seconds for which an alert is considered “current”; alerts older than this are ignored. Integer (default 600).
# File 'lib/bosh/monitor/plugins/resurrector_helper.rb', line 38

    def time_threshold
      @time_threshold
    end
Instance Method Details
#melting_down?(deployment) ⇒ Boolean
“Melting down” means a large part of the cluster is offline and manual intervention may be required to fix it.
# File 'lib/bosh/monitor/plugins/resurrector_helper.rb', line 54

    def melting_down?(deployment)
      agent_alerts = alerts_for_deployment(deployment)
      total_number_of_agents = agent_alerts.size
      number_of_down_agents = agent_alerts.select { |_, alert_time|
        alert_time > (Time.now - time_threshold)
      }.size

      return false if number_of_down_agents < minimum_down_jobs

      (number_of_down_agents.to_f / total_number_of_agents) >= percent_threshold
    end
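The check combines two conditions: an absolute floor (`minimum_down_jobs`) and a ratio (`percent_threshold`), both counted only over alerts newer than `time_threshold`. The sketch below works through that arithmetic in a stand-alone function. It makes several assumptions that are not confirmed by this page: `alerts_for_deployment` is not shown here, so the example assumes it yields one entry per agent in the deployment, with agents that never alerted mapped to a very old time. The `now:` parameter and the string agent names are illustrative only.

```ruby
# Stand-alone sketch of the decision in AlertTracker#melting_down?.
# Assumption: agent_alerts has one entry per agent in the deployment,
# mapping an agent identifier to the time of its last alert.
# `now` is passed in (instead of calling Time.now) to keep the example
# deterministic.
def melting_down?(agent_alerts, now:, minimum_down_jobs: 5,
                  percent_threshold: 0.2, time_threshold: 600)
  total = agent_alerts.size
  down = agent_alerts.count { |_, alert_time| alert_time > (now - time_threshold) }
  return false if down < minimum_down_jobs

  (down.to_f / total) >= percent_threshold
end

now = Time.at(1_000_000)
never = Time.at(0) # assumed placeholder for "no alert recorded"

# 5 of 20 agents alerted a minute ago: 5 >= 5 and 5/20 = 0.25 >= 0.2.
cluster = (1..20).to_h { |i| ["agent-#{i}", i <= 5 ? now - 60 : never] }
meltdown = melting_down?(cluster, now: now)

# 4 of 10 agents alerted: 40% of the cluster, but below minimum_down_jobs.
small = (1..10).to_h { |i| ["agent-#{i}", i <= 4 ? now - 60 : never] }
below_minimum = melting_down?(small, now: now)

# 5 of 30 agents alerted: floor is met, but 5/30 is under the 0.2 ratio.
large = (1..30).to_h { |i| ["agent-#{i}", i <= 5 ? now - 60 : never] }
below_ratio = melting_down?(large, now: now)

# 5 of 20 agents alerted 15 minutes ago: older than time_threshold, ignored.
stale = (1..20).to_h { |i| ["agent-#{i}", i <= 5 ? now - 900 : never] }
stale_alerts = melting_down?(stale, now: now)
```

Only the first case counts as a meltdown; the other three fail the floor, the ratio, and the freshness window respectively.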
#record(agent_key, alert_time) ⇒ Object
Records the time of the latest alert for the given agent, overwriting any earlier entry for the same key.
# File 'lib/bosh/monitor/plugins/resurrector_helper.rb', line 66

    def record(agent_key, alert_time)
      @alert_times[agent_key] = alert_time
    end
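Because `record` is a plain hash assignment, only the most recent alert per key is kept, and two keys count as the same agent only if they are equal as hash keys. A small sketch of that behaviour, using a hypothetical `AgentKey` struct as a stand-in for the real key type (the constructor comment names it `JobInstanceKey`, but its definition is not shown on this page):

```ruby
# Hypothetical stand-in for the key type. Struct instances with equal
# members are equal as hash keys, which is the property relied on here.
AgentKey = Struct.new(:deployment, :job, :id)

alert_times = {}

# Same shape as AlertTracker#record: last write wins for a given key.
record = ->(agent_key, alert_time) { alert_times[agent_key] = alert_time }

key = AgentKey.new('prod', 'web', '0')
record.call(key, Time.at(100))

# A second, equal key with a newer alert replaces the earlier entry.
record.call(AgentKey.new('prod', 'web', '0'), Time.at(200))

# A different agent gets its own entry.
record.call(AgentKey.new('prod', 'web', '1'), Time.at(150))
```

After these calls the hash holds two entries, and the first agent maps to the later of its two alert times.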