Class: Gitlab::Database::PostgresHll::BatchDistinctCounter

Inherits:
Object
  • Object
show all
Defined in:
lib/gitlab/database/postgres_hll/batch_distinct_counter.rb

Overview

Note:

HyperLogLog is an PROBABILISTIC algorithm that ESTIMATES distinct count of given attribute value for supplied relation Like all probabilistic algorithm is has ERROR RATE margin, that can affect values, for given implementation no higher value was reported (gitlab.com/gitlab-org/gitlab/-/merge_requests/45673#accuracy-estimation) than 5.3% for the most of a cases this value is lower. However, if the exact value is necessary other tools has to be used.

For large tables, PostgreSQL can take a long time to count rows due to MVCC. Implements a distinct batch counter based on HyperLogLog algorithm Needs indexes on the column below to calculate max, min and range queries For larger tables just set higher batch_size with index optimization

In order to not use a possible complex time consuming query when calculating min and max values, the start and finish can be sent specifically, start and finish should contain max and min values for PRIMARY KEY of relation (most cases ‘id` column) rather than counted attribute eg: estimate_distinct_count(start: ::Project.aimed_for_deletion.minimum(:id), finish: ::Project.aimed_for_deletion.maximum(:id))

Grouped relations are NOT supported yet.

Examples:

Usage

::Gitlab::Database::PostgresHllBatchDistinctCount.new(::Project, :creator_id).execute
::Gitlab::Database::PostgresHllBatchDistinctCount.new(::Project.aimed_for_deletion.service_desk_enabled.where(time_period))
  .execute(
    batch_size: 1_000,
    start: ::Project.aimed_for_deletion.service_desk_enabled.where(time_period).minimum(:id),
    finish: ::Project.aimed_for_deletion.service_desk_enabled.where(time_period).maximum(:id)
  )

Constant Summary collapse

ERROR_RATE =

max encountered empirical error rate, used in tests

4.9
MIN_REQUIRED_BATCH_SIZE =
750
SLEEP_TIME_IN_SECONDS =

10 msec sleep

0.01
MAX_DATA_VOLUME =
4_000_000_000
DEFAULT_BATCH_SIZE =
10_000
ZERO_OFFSET =
1
BUCKET_ID_MASK =
(Buckets::TOTAL_BUCKETS - ZERO_OFFSET).to_s(2)
BIT_31_MASK =
"B'0#{'1' * 31}'"
BIT_32_NORMALIZED_BUCKET_ID_MASK =
"B'#{'0' * (32 - BUCKET_ID_MASK.size)}#{BUCKET_ID_MASK}'"
WRONG_CONFIGURATION_ERROR =
Class.new(ActiveRecord::StatementInvalid)

Instance Method Summary collapse

Constructor Details

#initialize(relation, column = nil) ⇒ BatchDistinctCounter

Returns a new instance of BatchDistinctCounter.



47
48
49
50
# File 'lib/gitlab/database/postgres_hll/batch_distinct_counter.rb', line 47

def initialize(relation, column = nil)
  @relation = relation
  @column = column || relation.primary_key
end

Instance Method Details

#execute(batch_size: nil, start: nil, finish: nil) ⇒ Gitlab::Database::PostgresHll::Buckets

Executes counter that iterates over database source and return Gitlab::Database::PostgresHll::Buckets that can be used to estimation of number of uniq elements in analysed set

Parameters:

  • batch_size (defaults to: nil)

    maximal number of rows that will be analysed by single database query

  • start (defaults to: nil)

    initial pkey range

  • finish (defaults to: nil)

    final pkey range

Returns:

Raises:



59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
# File 'lib/gitlab/database/postgres_hll/batch_distinct_counter.rb', line 59

def execute(batch_size: nil, start: nil, finish: nil)
  raise 'BatchCount can not be run inside a transaction' if transaction_open?

  batch_size ||= DEFAULT_BATCH_SIZE
  start = actual_start(start)
  finish = actual_finish(finish)

  raise WRONG_CONFIGURATION_ERROR if unwanted_configuration?(start, finish, batch_size)

  batch_start = start
  hll_buckets = Buckets.new

  while batch_start <= finish
    hll_buckets.merge_hash!(hll_buckets_for_batch(batch_start, batch_start + batch_size))
    batch_start += batch_size
    sleep(SLEEP_TIME_IN_SECONDS)
  end

  hll_buckets
end