Class: Gouda::Workload

Inherits:
ActiveRecord::Base
Defined in:
lib/gouda/workload.rb

Overview

This model is called “workload” for a reason. An ActiveJob can be enqueued multiple times under the same job ID, which gets generated by Rails. These multiple enqueues of the same job are not exact copies of one another: when you use job-iteration, for example, your job will be retried with a different cursor position value, and when you use ActiveJob `rescue_from` the job will be retried keeping the same ActiveJob ID, but it then gets returned to the queue “in some way”. What we want is for the records in our table to represent a unit of work that the worker has to execute “at some point”. If the same job gets enqueued multiple times due to retries or pause/resume, we want those enqueues to be separate workloads, which can fail or succeed independently. This also keeps the queue records “append-only”, which allows them to be pruned on a regular basis.

This is why they are called “workloads” and not “jobs”. “Executions” is a good term used by good_job, but it leaves unclear what carries the “identity”. With the Workload, the ID of the workload is the “provider ID” from ActiveJob's point of view. It is therefore possible (and likely) that multiple Workloads will exist sharing the same ActiveJob ID.
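
For illustration, a minimal sketch of that relationship (assumptions: the workloads table stores the ActiveJob ID in an active_job_id column, and SomeJob is a hypothetical job class):

job = SomeJob.perform_later # the first enqueue creates one workload
# ...the job later gets retried via retry_on, or resumed by job-iteration...

# Several workloads can now share the same ActiveJob ID, each with its own
# independent lifecycle (enqueued / executing / finished):
Gouda::Workload.where(active_job_id: job.job_id).pluck(:id, :state)
# => [["workload-uuid-1", "finished"], ["workload-uuid-2", "enqueued"]]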

Constant Summary

ZOMBIE_MAX_THRESHOLD =
"5 minutes"

Class Method Summary

Instance Method Summary

Class Method Details

.checkout_and_lock_one(executing_on:, queue_constraint: Gouda::AnyQueue) ⇒ Object

Lock the next workload and mark it as executing



# File 'lib/gouda/workload.rb', line 91

def self.checkout_and_lock_one(executing_on:, queue_constraint: Gouda::AnyQueue)
  where_query = <<~SQL
    #{queue_constraint.to_sql}
    AND workloads.state = 'enqueued'
    AND NOT EXISTS (
      SELECT NULL
      FROM #{quoted_table_name} AS concurrent
      WHERE concurrent.state = 'executing'
        AND concurrent.execution_concurrency_key = workloads.execution_concurrency_key
    )
    AND workloads.scheduled_at <= clock_timestamp()
  SQL
  # Enter a txn just to mark this job as being executed "by us". This allows us to avoid any
  # locks during execution itself, including advisory locks
  workloads = Gouda::Workload
    .select("workloads.*")
    .from("#{quoted_table_name} AS workloads")
    .where(where_query)
    .order("workloads.priority ASC NULLS LAST")
    .lock("FOR UPDATE SKIP LOCKED")
    .limit(1)

  _first_available_workload = ActiveSupport::Notifications.instrument(:checkout_and_lock_one, {queue_constraint: queue_constraint.to_sql}) do |payload|
    payload[:condition_sql] = workloads.to_sql
    payload[:retried_checkouts_due_to_concurrent_exec] = 0
    uncached do # Necessary because we SELECT with a clock_timestamp() which otherwise gets cached by ActiveRecord query cache
      transaction do
        workload = Gouda.suppressing_sql_logs { workloads.first } # Silence SQL output as this gets called very frequently
        return nil unless workload

        if workload.scheduler_key && !Gouda::Scheduler.known_scheduler_keys.include?(workload.scheduler_key)
          # Check whether this workload was enqueued with a scheduler key, but no longer is in the cron table.
          # If that is the case (we are trying to execute a workload which has a scheduler key, but the scheduler
          # does not know about that key) it means that the workload has been removed from the cron table and must not run.
          # Moreover: running it can be dangerous because it was likely removed from the table for a reason.
          # Should that be the case, mark the job "finished" and return `nil` to get to the next poll. If the deployed worker still has
          # the workload in its scheduler table, but a new deploy removed it - this is a race condition, but we are willing to accept it.
          # Note that we are already "just not enqueueing" that job when the cron table gets loaded - this already happens.
          #
          # Removing jobs from the queue forcibly when we load the cron table is nice, but not enough, because our system can be in a state
          # of partial deployment:
          #
          #   [  release 1 does have some_job_hourly crontab entry ]
          #                  [  release 2 no longer does                           ]
          #                  ^ --- race conditions possible here --^
          #
          # So even if we remove the crontabled workloads during app boot, it does not give us a guarantee that release 1 won't reinsert them.
          # This is why this safeguard is needed.
          error = {class_name: "WorkloadSkippedError", message: "Skipped as scheduler_key was no longer in the cron table"}
          workload.update!(state: "finished", error:)
          # And return nil. This will cause a brief "sleep" in the polling routine since the caller may think there are no more workloads
          # in the queue, but only for a brief moment.
          nil
        else
          # Once we have verified this job is OK to execute
          workload.update!(state: "executing", executing_on: executing_on, last_execution_heartbeat_at: Time.now.utc, execution_started_at: Time.now.utc)
          workload
        end
      rescue ActiveRecord::RecordNotUnique
        # It can happen that due to a race the NOT EXISTS check on `execution_concurrency_key` does not capture
        # a job which _just_ entered the "executing" state, apparently after we do our SELECT. This will happen regardless
        # of whether we are using a CTE or a sub-SELECT
        payload[:retried_checkouts_due_to_concurrent_exec] += 1
        nil
      end
    end
  end
end
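
A usage sketch; the executing_on label format below is illustrative, as this method treats it as an opaque identifier of the executing process:

require "socket"

label = "#{Socket.gethostname}-#{Process.pid}"
workload = Gouda::Workload.checkout_and_lock_one(executing_on: label)

if workload
  workload.state        # => "executing"
  workload.executing_on # => label
else
  # Nothing eligible right now: the queue may be empty, a concurrency key may be
  # blocking checkout, or a stale scheduled workload was skipped. Poll again later.
end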

.checkout_and_perform_one(executing_on:, queue_constraint: Gouda::AnyQueue, in_progress: Set.new) ⇒ Object

Get a new workload and call perform

Parameters:

  • in_progress (#add, #delete) (defaults to: Set.new)

    Used for tracking work in progress for heartbeats



# File 'lib/gouda/workload.rb', line 162

def self.checkout_and_perform_one(executing_on:, queue_constraint: Gouda::AnyQueue, in_progress: Set.new)
  # Select a job and mark it as "executing", which will make it unavailable to any other worker
  workload = checkout_and_lock_one(executing_on: executing_on, queue_constraint: queue_constraint)
  if workload
    in_progress.add(workload.id)
    workload.perform_and_update_state!
  end
ensure
  in_progress.delete(workload.id) if workload
end
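
A sketch of a worker loop built on this method, using a shared set of in-progress workload IDs which a separate heartbeat thread (not shown) could use to refresh last_execution_heartbeat_at; synchronising access to that set is left to the caller:

require "set"

in_progress = Set.new # IDs of workloads currently executing on this worker

loop do
  Gouda::Workload.checkout_and_perform_one(executing_on: "worker-1", in_progress: in_progress)
  sleep 0.2 # brief pause between polls; a real worker would use smarter backoff
end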

.prune ⇒ Object



# File 'lib/gouda/workload.rb', line 46

def self.prune
  if Gouda.config.preserve_job_records
    where(state: "finished").where("execution_finished_at < ?", Gouda.config.cleanup_preserved_jobs_before.ago).delete_all
  else
    where(state: "finished").delete_all
  end
end
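
A configuration sketch, assuming the settings referenced above (preserve_job_records, cleanup_preserved_jobs_before) are writable on Gouda.config:

# Keep finished workload records for a week, then prune them on a schedule
Gouda.config.preserve_job_records = true
Gouda.config.cleanup_preserved_jobs_before = 7.days

# Call periodically, e.g. from a recurring job or a cron-style scheduler entry
Gouda::Workload.prune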

.queue_names ⇒ Object



# File 'lib/gouda/workload.rb', line 42

def self.queue_names
  connection.select_values("SELECT DISTINCT(queue_name) FROM #{quoted_table_name} ORDER BY queue_name ASC")
end

.reap_zombie_workloads ⇒ Object

Re-enqueue zombie workloads which have been left to rot due to machine kills, worker OOM kills and the like. The scope is taken with a lock so no single zombie workload gets re-enqueued more than once, and each workload is wrapped in its own transaction so that a single workload can be rolled back without rolling back the entire batch.



# File 'lib/gouda/workload.rb', line 57

def self.reap_zombie_workloads
  uncached do # again needed due to the use of clock_timestamp() in the SQL
    transaction do
      zombie_workloads_scope = Gouda::Workload.lock("FOR UPDATE SKIP LOCKED").where("state = 'executing' AND last_execution_heartbeat_at < (clock_timestamp() - interval '#{ZOMBIE_MAX_THRESHOLD}')")
      zombie_workloads_scope.find_each(batch_size: 1000) do |workload|
        # with_lock will start its own transaction
        workload.with_lock("FOR UPDATE SKIP LOCKED") do
          Gouda.logger.info { "Reviving (re-enqueueing) Gouda workload #{workload.id} after interruption" }

          Gouda.instrument(:workloads_revived_counter, {size: 1, job_class: workload.active_job_class_name})

          interrupted_at = workload.last_execution_heartbeat_at
          workload.update!(state: "finished", interrupted_at: interrupted_at, last_execution_heartbeat_at: Time.now.utc, execution_finished_at: Time.now.utc)
          revived_job = ActiveJob::Base.deserialize(workload.active_job_data)
          # Save the interrupted_at timestamp so that upon execution the new job will raise a Gouda::Interrupted exception.
          # The exception can then be handled like any other ActiveJob exception (using rescue_from or similar).
          revived_job.interrupted_at = interrupted_at
          revived_job.enqueue
        end
      rescue ActiveRecord::RecordNotFound
        # This will happen if we have selected the zombie workload in the outer block, but
        # by the point we reload it and take a FOR UPDATE SKIP LOCKED lock another worker is
        # already reaping it - a call to `reload` will cause a RecordNotFound, since Postgres
        # will hide the row from us. This is what we want in fact - we want to progress to
        # the next row. So we allow the code to proceed, as we expect that the other worker
        # (which stole the workload from us) will have set it to "state=finished" by the time we reattempt
        # our SELECT with conditions
        Gouda.logger.debug { "Gouda workload #{workload.id} cannot be reaped as it was hijacked by another worker" }
      end
    end
  end
end
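
A sketch of handling the interruption inside a job class; the exception class name (Gouda::Interrupted) is taken from the comment above, and LongImportJob is a hypothetical example:

class LongImportJob < ActiveJob::Base
  # A revived execution raises Gouda::Interrupted, which can be handled like any
  # other ActiveJob exception - here by retrying a few times with a delay.
  retry_on Gouda::Interrupted, wait: 30.seconds, attempts: 5

  def perform(import_id)
    # ...idempotent work that is safe to restart after an interruption...
  end
end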

Instance Method Details

#active_job_data ⇒ Object



# File 'lib/gouda/workload.rb', line 239

def active_job_data
  serialized_params.deep_dup.merge("provider_job_id" => id, "interrupted_at" => interrupted_at, "scheduler_key" => scheduler_key) # TODO: is this memory-economical?
end

#enqueued_at ⇒ Object



# File 'lib/gouda/workload.rb', line 173

def enqueued_at
  Time.parse(serialized_params["enqueued_at"]) if serialized_params["enqueued_at"]
end

#error_hash(error) ⇒ Object



# File 'lib/gouda/workload.rb', line 235

def error_hash(error)
  {class_name: error.class.to_s, backtrace: error.backtrace.to_a, message: error.message}
end

#mark_finished! ⇒ Object



# File 'lib/gouda/workload.rb', line 219

def mark_finished!
  with_lock do
    now = Time.now.utc
    execution_started_at ||= now

    return if state == "finished"

    update!(
      state: "finished", last_execution_heartbeat_at: now,
      execution_finished_at: now, execution_started_at: execution_started_at,
      error: {class_name: "RemovedError", message: "Manually removed at #{now}"}
    )
    Gouda::Scheduler.enqueue_next_scheduled_workload_for(self)
  end
end

#perform_and_update_state! ⇒ Object



# File 'lib/gouda/workload.rb', line 177

def perform_and_update_state!
  Gouda.instrument(:perform_job, {workload: self}) do |instrument_payload|
    extras = {}
    if Gouda::JobFuse.exists?(active_job_class_name: active_job_class_name)
      extras[:error] = {class_name: "WorkloadSkippedError", message: "Skipped because of a fuse at #{Time.now.utc}"}
    else
      job_result = ActiveJob::Base.execute(active_job_data)

      if job_result.is_a?(Exception)
        # When an exception is handled (say we have a retry_on <exception> in our job) we end up here
        # and it won't be rescued by the rescue clause below
        handled_error = job_result
        update!(error: error_hash(handled_error))
      end

      instrument_payload[:value] = job_result
      instrument_payload[:handled_error] = handled_error

      job_result
    end
  rescue => exception_not_retried_by_active_job
    # When a job fails and is not retryable it will end up here.
    update!(error: error_hash(exception_not_retried_by_active_job))
    instrument_payload[:unhandled_error] = exception_not_retried_by_active_job
    Gouda.logger.error { exception_not_retried_by_active_job }
    exception_not_retried_by_active_job # Return the exception instead of re-raising it
  ensure
    update!(state: "finished", last_execution_heartbeat_at: Time.now.utc, execution_finished_at: Time.now.utc, **extras)
    # If the workload that just finished was a scheduled workload (via timer/cron) enqueue the next execution.
    # Otherwise the next job will only get enqueued once the config is reloaded
    Gouda::Scheduler.enqueue_next_scheduled_workload_for(self)
  end
end
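
A sketch (with hypothetical job classes) of how the two error paths above differ:

class FlakyJob < ActiveJob::Base
  class Flaked < StandardError; end

  # Handled by ActiveJob: execute returns the exception, which is recorded as
  # handled_error, and the retry arrives later as a new workload.
  retry_on Flaked, attempts: 3

  def perform
    raise Flaked if rand < 0.5
  end
end

class BrokenJob < ActiveJob::Base
  def perform
    raise "boom" # not retried: recorded as unhandled_error, workload still ends up "finished"
  end
end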

#schedule_now! ⇒ Object



# File 'lib/gouda/workload.rb', line 211

def schedule_now!
  with_lock do
    return if state != "enqueued"

    update!(scheduled_at: Time.now.utc)
  end
end
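
A usage sketch: pull a deferred but already-enqueued workload forward so the next poll can pick it up immediately (the selection criteria here are only an example):

workload = Gouda::Workload.where(state: "enqueued").order(scheduled_at: :desc).first
workload&.schedule_now!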