Class: Gitlab::Database::LoadBalancing::Host

Inherits:
Object
Defined in:
lib/gitlab/database/load_balancing/host.rb

Overview

A single database host used for load balancing.

Constant Summary

CONNECTION_ERRORS =
[
  ActionView::Template::Error,
  ActiveRecord::StatementInvalid,
  ActiveRecord::ConnectionNotEstablished,
  ActiveRecord::StatementTimeout,
  PG::Error
].freeze
CAN_TRACK_LOGICAL_LSN_QUERY =

This query checks that the current user has the required permissions before we try to query logical replication status. We also only allow PG14 or later, because before PG14 these views are accessible only to superusers, even if has_table_privilege says otherwise.

<<~SQL.squish.freeze
  SELECT
    has_table_privilege('pg_replication_origin_status', 'select')
    AND
    has_function_privilege('pg_show_replication_origin_status()', 'execute')
    AND current_setting('server_version_num', true)::int >= 140000
    AS allowed
SQL
LATEST_LSN_WITH_LOGICAL_QUERY =

The following is necessary to handle a mix of logical and physical replicas. We assume that a host with rows in pg_replication_origin_status is a logical replica. On a logical replica we need to use remote_lsn rather than pg_last_wal_replay_lsn so that our LSN is comparable to the source cluster. This logic breaks if there are two logical subscriptions, or if there is a logical subscription in the source primary cluster. Read more at gitlab.com/gitlab-org/gitlab/-/merge_requests/121621

<<~SQL.squish.freeze
  CASE
  WHEN (SELECT TRUE FROM pg_replication_origin_status) THEN
    (SELECT remote_lsn FROM pg_replication_origin_status)
  WHEN pg_is_in_recovery() THEN
    pg_last_wal_replay_lsn()
  ELSE
    pg_current_wal_insert_lsn()
  END
SQL
LATEST_LSN_WITHOUT_LOGICAL_QUERY =
<<~SQL.squish.freeze
  CASE
  WHEN pg_is_in_recovery() THEN
    pg_last_wal_replay_lsn()
  ELSE
    pg_current_wal_insert_lsn()
  END
SQL
REPLICATION_LAG_QUERY =
<<~SQL.squish.freeze
  SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::float as lag
SQL

Constructor Details

#initialize(host, load_balancer, port: nil) ⇒ Host

host - The address of the database.
load_balancer - The LoadBalancer that manages this Host.



# File 'lib/gitlab/database/load_balancing/host.rb', line 63

def initialize(host, load_balancer, port: nil)
  @host = host
  @port = port
  @load_balancer = load_balancer
  @pool = load_balancer.create_replica_connection_pool(
    load_balancer.configuration.pool_size,
    host,
    port
  )
  @online = true
  @last_checked_at = Time.zone.now
  @lag_time = nil
  @lag_size = nil

  # Randomly somewhere in between interval and 2*interval we'll refresh the status of the host
  interval = load_balancer.configuration.replica_check_interval
  @intervals = (interval..(interval * 2)).step(0.5).to_a
end
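
The jitter list built at the end of the constructor can be reproduced in plain Ruby. This is a minimal sketch; the value 60 stands in for `replica_check_interval` and is an assumption chosen for illustration, not a documented default.

```ruby
# Hypothetical check interval in seconds. In the class above this value
# comes from load_balancer.configuration.replica_check_interval.
interval = 60

# Every half-second step from interval up to and including 2 * interval.
intervals = (interval..(interval * 2)).step(0.5).to_a

puts intervals.first(3).inspect
puts intervals.last
puts intervals.size
```

Sampling from this list later spreads status checks out over time, so hosts created at the same moment are less likely to be refreshed in lockstep.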

Instance Attribute Details

#host ⇒ Object (readonly)

Returns the value of attribute host.



# File 'lib/gitlab/database/load_balancing/host.rb', line 8

def host
  @host
end

#intervals ⇒ Object (readonly)

Returns the value of attribute intervals.



# File 'lib/gitlab/database/load_balancing/host.rb', line 8

def intervals
  @intervals
end

#last_checked_at ⇒ Object (readonly)

Returns the value of attribute last_checked_at.



# File 'lib/gitlab/database/load_balancing/host.rb', line 8

def last_checked_at
  @last_checked_at
end

#load_balancer ⇒ Object (readonly)

Returns the value of attribute load_balancer.



# File 'lib/gitlab/database/load_balancing/host.rb', line 8

def load_balancer
  @load_balancer
end

#pool ⇒ Object (readonly)

Returns the value of attribute pool.



# File 'lib/gitlab/database/load_balancing/host.rb', line 8

def pool
  @pool
end

#port ⇒ Object (readonly)

Returns the value of attribute port.



# File 'lib/gitlab/database/load_balancing/host.rb', line 8

def port
  @port
end

Instance Method Details

#caught_up?(location) ⇒ Boolean

Returns true if this host has caught up to the given transaction write location.

location - The transaction write location as reported by a primary.

Returns:

  • (Boolean)


# File 'lib/gitlab/database/load_balancing/host.rb', line 275

def caught_up?(location)
  lag = replication_lag_size(location)
  lag.present? && lag.to_i <= 0
end

#check_replica_status? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/gitlab/database/load_balancing/host.rb', line 178

def check_replica_status?
  (Time.zone.now - last_checked_at) >= intervals.sample
end
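
The check above compares the elapsed time against a randomly sampled interval. The following sketch shows the same comparison in plain Ruby. The method name `check_due?`, the injected `now` argument, and the 60-second base interval are assumptions made so the example is deterministic at its boundaries; the real method uses `Time.zone.now` from Rails.

```ruby
# Returns true when at least one randomly sampled interval has elapsed
# since the last check. `now` is injected to make the sketch testable.
def check_due?(last_checked_at, intervals, now: Time.now)
  (now - last_checked_at) >= intervals.sample
end

# Hypothetical jitter list for a 60-second base interval.
intervals = (60..120).step(0.5).to_a
last_checked_at = Time.at(0)

# Before the base interval has passed, no sampled value can be reached.
too_early = check_due?(last_checked_at, intervals, now: Time.at(59))

# At twice the base interval, every sampled value has been reached.
overdue = check_due?(last_checked_at, intervals, now: Time.at(120))

puts too_early
puts overdue
```

Between those two boundaries the result depends on which interval is sampled, which is what spreads the checks out.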

#connection ⇒ Object



# File 'lib/gitlab/database/load_balancing/host.rb', line 82

def connection
  pool.lease_connection
end

#data_is_recent_enough? ⇒ Boolean

Returns true if the replica has replicated enough data to be useful.

Returns:

  • (Boolean)


# File 'lib/gitlab/database/load_balancing/host.rb', line 215

def data_is_recent_enough?
  # It's possible for a replica to not replay WAL data for a while,
  # despite being up to date. This can happen when a primary does not
  # receive any writes for a while.
  #
  # To prevent this from happening we check if the lag size (in bytes)
  # of the replica is small enough for the replica to be useful. We
  # only do this if we haven't replicated in a while so we only need
  # to connect to the primary when truly necessary.
  if (@lag_size = replication_lag_size)
    @lag_size <= load_balancer.configuration.max_replication_difference
  else
    false
  end
end

#database_replica_location ⇒ Object



# File 'lib/gitlab/database/load_balancing/host.rb', line 261

def database_replica_location
  row = query_and_release(<<-SQL.squish)
    SELECT pg_last_wal_replay_lsn()::text AS location
  SQL

  row['location'] if row.any?
rescue *CONNECTION_ERRORS
  nil
end
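
The `rescue *CONNECTION_ERRORS` clause above splats a frozen array of exception classes into a single rescue. A minimal sketch of that Ruby idiom follows; `RETRIABLE_ERRORS` and `location_or_nil` are hypothetical names, and the two standard-library error classes merely stand in for the Rails and PG classes listed in the constant.

```ruby
# Hypothetical stand-in list; the real constant holds Rails and PG classes.
RETRIABLE_ERRORS = [ArgumentError, ZeroDivisionError].freeze

# Yields to the block and returns nil if any listed error is raised.
# Errors outside the list still propagate to the caller.
def location_or_nil
  yield
rescue *RETRIABLE_ERRORS
  nil
end

ok = location_or_nil { '0/16B3748' }
swallowed = location_or_nil { 1 / 0 }

unlisted_propagates =
  begin
    location_or_nil { raise IOError, 'not in the list' }
    false
  rescue IOError
    true
  end

puts ok.inspect
puts swallowed.inspect
puts unlisted_propagates
```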

#disconnect!(timeout: 120) ⇒ Object

Disconnects the pool once no connections are in use.

timeout - The time after which the pool should be forcefully disconnected.


# File 'lib/gitlab/database/load_balancing/host.rb', line 90

def disconnect!(timeout: 120)
  start_time = ::Gitlab::Metrics::System.monotonic_time

  while (::Gitlab::Metrics::System.monotonic_time - start_time) <= timeout
    return if try_disconnect

    sleep(2)
  end

  force_disconnect!
end
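
The poll-then-force pattern used by `disconnect!` can be sketched without Rails. Everything named here is an assumption for illustration: `FakePool` is a stand-in rather than the ActiveRecord pool, `busy?` replaces the `in_use?` scan done by `try_disconnect`, and the returned symbols exist only to make the outcome visible (the real method does not return them).

```ruby
# Stand-in for a connection pool; not the ActiveRecord class.
class FakePool
  attr_reader :disconnected

  # busy_polls - how many polls report a connection still in use.
  def initialize(busy_polls)
    @busy_polls = busy_polls
    @disconnected = false
  end

  def busy?
    return false if @busy_polls <= 0

    @busy_polls -= 1
    true
  end

  def disconnect!
    @disconnected = true
  end
end

def monotonic_time
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Polls until the pool is idle, then disconnects. Falls back to a forced
# disconnect once `timeout` seconds have passed.
def disconnect_with_timeout(pool, timeout:, poll_interval:)
  start_time = monotonic_time

  while (monotonic_time - start_time) <= timeout
    unless pool.busy?
      pool.disconnect!
      return :graceful
    end

    sleep(poll_interval)
  end

  pool.disconnect!
  :forced
end

idle_soon = FakePool.new(2)
never_idle = FakePool.new(Float::INFINITY)

graceful = disconnect_with_timeout(idle_soon, timeout: 1, poll_interval: 0.01)
forced = disconnect_with_timeout(never_idle, timeout: 0.05, poll_interval: 0.01)

puts graceful
puts forced
```

A monotonic clock is used for the deadline so that wall-clock adjustments cannot shorten or extend the wait.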

#force_disconnect! ⇒ Object



# File 'lib/gitlab/database/load_balancing/host.rb', line 113

def force_disconnect!
  pool_disconnect!
end

#offline! ⇒ Object



# File 'lib/gitlab/database/load_balancing/host.rb', line 121

def offline!
  ::Gitlab::Database::LoadBalancing::Logger.warn(
    event: :host_offline,
    message: 'Marking host as offline',
    db_host: @host,
    db_port: @port
  )

  @online = false
  pool_disconnect!
end

#online? ⇒ Boolean

Returns true if the host is online.

Returns:

  • (Boolean)


# File 'lib/gitlab/database/load_balancing/host.rb', line 134

def online?
  # Avoid using a discarded connection pool because attempting
  # to use it will fail. After the main process forks, all of
  # its connection pools are discarded from Rails' ForkTracker.
  return false if discarded?
  return @online unless check_replica_status?

  was_online = @online
  refresh_status

  # Log that the host came back online if it was previously offline
  if @online && !was_online
    ::Gitlab::Database::LoadBalancing::Logger.info(
      event: :host_online,
      message: 'Host is online after replica status check',
      db_host: @host,
      db_port: @port,
      lag_time: @lag_time,
      lag_size: @lag_size
    )
  # Always log if the host goes offline
  elsif !@online
    ::Gitlab::Database::LoadBalancing::Logger.warn(
      event: :host_offline,
      message: 'Host is offline after replica status check',
      db_host: @host,
      db_port: @port,
      lag_time: @lag_time,
      lag_size: @lag_size
    )
  end

  @online
rescue *CONNECTION_ERRORS
  offline!
  false
end

#pool_disconnect! ⇒ Object



# File 'lib/gitlab/database/load_balancing/host.rb', line 117

def pool_disconnect!
  pool.disconnect!
end

#primary_write_location ⇒ Object



# File 'lib/gitlab/database/load_balancing/host.rb', line 257

def primary_write_location
  load_balancer.primary_write_location
end

#query_and_release(...) ⇒ Object



# File 'lib/gitlab/database/load_balancing/host.rb', line 280

def query_and_release(...)
  pool.disable_query_cache do
    if low_timeout_for_host_queries?
      query_and_release_fast_timeout(...)
    else
      query_and_release_old(...)
    end
  end
end

#query_and_release_fast_timeout(sql) ⇒ Object



# File 'lib/gitlab/database/load_balancing/host.rb', line 298

def query_and_release_fast_timeout(sql)
  # If we "set local" the timeout in a transaction that was already open we would taint the outer
  # transaction with that timeout.
  # However, we don't ever run transactions on replicas, and we only do these health checks on replicas.
  # Double-check that we're not in a transaction, but this path should never happen.
  if connection.transaction_open?
    Gitlab::Database::LoadBalancing::Logger.warn(
      event: :health_check_in_transaction,
      message: "Attempt to run a health check query inside of a transaction"
    )
    return query_and_release_old(sql)
  end

  begin
    connection.transaction do
      connection.exec_query("SET LOCAL statement_timeout TO '100ms';")
      connection.select_all(sql).first || {}
    end
  rescue StandardError
    {}
  ensure
    release_connection
  end
end

#query_and_release_old(sql) ⇒ Object



# File 'lib/gitlab/database/load_balancing/host.rb', line 290

def query_and_release_old(sql)
  connection.select_all(sql).first || {}
rescue StandardError
  {}
ensure
  release_connection
end

#refresh_status ⇒ Object



# File 'lib/gitlab/database/load_balancing/host.rb', line 172

def refresh_status
  @latest_lsn_query = nil # Periodically clear the cached @latest_lsn_query value in case permissions change
  @online = replica_is_up_to_date?
  @last_checked_at = Time.zone.now
end

#replica_is_up_to_date? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/gitlab/database/load_balancing/host.rb', line 182

def replica_is_up_to_date?
  replication_lag_below_threshold? || data_is_recent_enough?
end
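
Because the two checks are joined with `||`, the size-based check only runs when the time-based check fails. The sketch below demonstrates that short-circuit with lambdas standing in for the two predicate methods; `up_to_date?`, `size_probe`, and the counter are hypothetical names used only for illustration.

```ruby
# time_ok and size_ok are callables standing in for
# replication_lag_below_threshold? and data_is_recent_enough?.
def up_to_date?(time_ok, size_ok)
  time_ok.call || size_ok.call
end

size_checks = 0
size_probe = lambda do
  size_checks += 1 # count how often the more expensive check runs
  true
end

first = up_to_date?(-> { true }, size_probe)
calls_after_first = size_checks

second = up_to_date?(-> { false }, size_probe)
calls_after_second = size_checks

puts calls_after_first
puts calls_after_second
```

This ordering matters because the size check needs the primary's write location, so it is worth skipping when the time lag is already acceptable.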

#replication_lag_below_threshold? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/gitlab/database/load_balancing/host.rb', line 186

def replication_lag_below_threshold?
  @lag_time = replication_lag_time
  return false unless @lag_time
  return true if @lag_time <= load_balancer.configuration.max_replication_lag_time

  if ignore_replication_lag_time?
    ::Gitlab::Database::LoadBalancing::Logger.info(
      event: :replication_lag_ignored,
      lag_time: @lag_time,
      message: 'Replication lag is treated as low because of load_balancer_ignore_replication_lag_time feature flag'
    )

    return true
  end

  if double_replication_lag_time? && @lag_time <= (load_balancer.configuration.max_replication_lag_time * 2)
    ::Gitlab::Database::LoadBalancing::Logger.info(
      event: :replication_lag_below_double,
      lag_time: @lag_time,
      message: 'Replication lag is treated as low because of load_balancer_double_replication_lag_time feature flag'
    )

    return true
  end

  false
end
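
The order of the decisions in the method above can be restated as a pure function. This is a sketch under stated assumptions: `lag_acceptable?` is a hypothetical name, the keyword arguments replace the configuration and feature-flag lookups, and logging is omitted.

```ruby
# lag_time   - measured lag in seconds, or nil if it could not be measured.
# max_lag    - stand-in for configuration.max_replication_lag_time.
# ignore_lag - stand-in for the "ignore replication lag time" flag.
# double_lag - stand-in for the "double replication lag time" flag.
def lag_acceptable?(lag_time, max_lag:, ignore_lag: false, double_lag: false)
  return false unless lag_time        # no measurement: treat as lagging
  return true if lag_time <= max_lag  # within the configured limit
  return true if ignore_lag           # flag: accept any measured lag
  return true if double_lag && lag_time <= max_lag * 2

  false
end

puts lag_acceptable?(30, max_lag: 60)
puts lag_acceptable?(90, max_lag: 60)
puts lag_acceptable?(90, max_lag: 60, double_lag: true)
puts lag_acceptable?(150, max_lag: 60, ignore_lag: true)
```

Note that a missing measurement returns false before either flag is consulted, so the flags only relax the limit for hosts that actually reported a lag.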

#replication_lag_size(location = primary_write_location) ⇒ Object

Returns the number of bytes this secondary is lagging behind the primary.

This method will return nil if no lag size could be calculated.



# File 'lib/gitlab/database/load_balancing/host.rb', line 245

def replication_lag_size(location = primary_write_location)
  location = connection.quote(location)

  row = query_and_release(<<-SQL.squish)
    SELECT pg_wal_lsn_diff(#{location}, (#{latest_lsn_query}))::float AS diff
  SQL

  row['diff'].to_i if row.any?
rescue *CONNECTION_ERRORS
  nil
end
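
To make the byte difference concrete, here is a plain-Ruby sketch of what `pg_wal_lsn_diff` computes on the database side. PostgreSQL prints an LSN as two hexadecimal numbers separated by a slash, which together form a 64-bit position. The helper names `lsn_to_i`, `lsn_diff`, and `lsn_caught_up?` are hypothetical; the class above delegates this arithmetic to PostgreSQL rather than doing it in Ruby.

```ruby
# Converts a textual LSN such as "16/B374D848" into an integer: the part
# before the slash is the high 32 bits, the part after is the low 32 bits.
def lsn_to_i(lsn)
  high, low = lsn.split('/').map { |part| part.to_i(16) }
  (high << 32) | low
end

# Bytes by which replica_lsn trails primary_lsn. A positive result means
# the replica is behind.
def lsn_diff(primary_lsn, replica_lsn)
  lsn_to_i(primary_lsn) - lsn_to_i(replica_lsn)
end

# Mirrors the `lag <= 0` comparison used by caught_up?.
def lsn_caught_up?(primary_lsn, replica_lsn)
  lsn_diff(primary_lsn, replica_lsn) <= 0
end

puts lsn_diff('0/10', '0/0')
puts lsn_diff('1/0', '0/FFFFFFFF')
puts lsn_caught_up?('0/20', '0/10')
```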

#replication_lag_time ⇒ Object

Returns the replication lag time of this secondary in seconds as a float.

This method will return nil if no lag time could be calculated.



# File 'lib/gitlab/database/load_balancing/host.rb', line 235

def replication_lag_time
  row = query_and_release(REPLICATION_LAG_QUERY)

  row['lag'].to_f if row.any?
end

#try_disconnect ⇒ Object

Attempts to disconnect the pool if no connections are in use. Returns true if the pool was disconnected, false otherwise.



# File 'lib/gitlab/database/load_balancing/host.rb', line 104

def try_disconnect
  if pool.connections.none?(&:in_use?)
    pool_disconnect!
    return true
  end

  false
end