Module: Spark::Helper::Statistic

Included in:
Command::Histogram, RDD
Defined in:
lib/spark/helper/statistic.rb

Instance Method Summary

Instance Method Details

#bisect_right(data, value, low = 0, high = data.size) ⇒ Object

Returns the index at which value would be inserted into the sorted array data to keep it sorted, placing value after (to the right of) any existing entries equal to it.

Examples:

data = [1,5,6,8,96,120,133]

bisect_right(data, 0)   # => 0
bisect_right(data, 1)   # => 1
bisect_right(data, 5)   # => 2
bisect_right(data, 9)   # => 4
bisect_right(data, 150) # => 7


# File 'lib/spark/helper/statistic.rb', line 58

def bisect_right(data, value, low=0, high=data.size)
  if low < 0
    raise ArgumentError, 'Low must be >= 0.'
  end

  # Standard binary search for the rightmost insertion point:
  # narrow [low, high) until low is the index just past the last
  # element <= value.
  while low < high
    mid = (low + high) / 2
    if value < data[mid]
      high = mid
    else
      low = mid + 1
    end
  end

  low
end
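
A typical use is assigning keys to range partitions. A minimal sketch; the pairing with #determine_bounds is illustrative, not taken from the library's docs:

bounds = [3, 7]           # e.g. from determine_bounds(data, 3)
bisect_right(bounds, 2)   # => 0  (key falls in the first partition)
bisect_right(bounds, 5)   # => 1
bisect_right(bounds, 9)   # => 2  (keys past the last bound go to the last partition)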

#compute_fraction(lower_bound, total, with_replacement) ⇒ Object

Returns a sampling rate that guarantees a sample of size >= lower_bound 99.99% of the time.

How the sampling rate is determined:

Let p = num / total, where num is the sample size and total is the total number of data points in the RDD. We’re trying to compute q > p such that

  • when sampling with replacement, we’re drawing each data point with prob_i ~ Pois(q), where we want to guarantee Pr[s < num] < 0.0001 for s = sum(prob_i for i from 0 to total), i.e. the probability of not drawing a sufficiently large sample is below 0.0001. Setting q = p + 5 * sqrt(p/total) is sufficient to guarantee a 0.9999 success rate for num > 12, but a slightly larger q is needed in practice (9 standard deviations, determined empirically).

  • when sampling without replacement, we’re drawing each data point with prob_i ~ Binomial(total, fraction) and our choice of q guarantees 1 - delta, or a 0.9999 success rate, where the success rate is defined the same way as in sampling with replacement.



# File 'lib/spark/helper/statistic.rb', line 19

def compute_fraction(lower_bound, total, with_replacement)
  lower_bound = lower_bound.to_f

  if with_replacement
    # Poisson case: expected number of draws divided by the dataset size.
    upper_poisson_bound(lower_bound) / total
  else
    # Binomial case: inflate the exact fraction by a safety margin
    # (delta = 0.00001) to hit the target success rate.
    fraction = lower_bound / total
    upper_binomial_bound(0.00001, total, fraction)
  end
end
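
For illustration, a worked example with assumed inputs (100 and 10_000 are arbitrary; values rounded, computed from the formulas above rather than taken from the library's docs):

# Sampling at least 100 of 10,000 elements with high confidence:
compute_fraction(100, 10_000, true)   # => 0.016   (160 expected draws)
compute_fraction(100, 10_000, false)  # => ~0.0161 (fraction 0.01 plus a safety margin)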

#determine_bounds(data, num_partitions) ⇒ Object

Determine the bounds for range partitioning: picks num_partitions - 1 evenly spaced elements from data to serve as partition boundaries.

Example:

data = [0,1,2,3,4,5,6,7,8,9,10]
determine_bounds(data, 3)
# => [3, 7]


# File 'lib/spark/helper/statistic.rb', line 82

def determine_bounds(data, num_partitions)
  # Not enough elements to pick bounds from: return the data itself.
  if num_partitions > data.size
    return data
  end

  # Pick num_partitions - 1 evenly spaced elements as boundaries.
  bounds = []
  count = data.size
  (0...(num_partitions-1)).each do |index|
    bounds << data[count * (index+1) / num_partitions]
  end
  bounds
end
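
The returned array has num_partitions - 1 elements, except in the degenerate case where data has fewer elements than num_partitions. A short sketch with assumed sampled keys:

sample = [1, 5, 6, 8, 96, 120, 133]
determine_bounds(sample, 2)  # => [8]    (data[7 * 1 / 2] = data[3])
determine_bounds([1, 2], 5)  # => [1, 2] (fewer elements than partitions)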

#upper_binomial_bound(delta, total, fraction) ⇒ Object



# File 'lib/spark/helper/statistic.rb', line 42

def upper_binomial_bound(delta, total, fraction)
  # Tail-bound margin; shrinks as the dataset grows.
  gamma = -Math.log(delta) / total
  # A sampling fraction can never exceed 1.
  [1, fraction + gamma + Math.sqrt(gamma*gamma + 2*gamma*fraction)].min
end
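
This is the without-replacement branch of #compute_fraction: it inflates fraction so that a Binomial(total, q) sample falls short of fraction * total only with probability delta. A worked example with assumed inputs (values rounded):

-Math.log(0.00001) / 10_000                  # gamma => ~0.00115
upper_binomial_bound(0.00001, 10_000, 0.01)  # => ~0.0161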

#upper_poisson_bound(bound) ⇒ Object



# File 'lib/spark/helper/statistic.rb', line 30

def upper_poisson_bound(bound)
  # Number of standard deviations to add above the mean; smaller
  # bounds need a larger margin (values empirically determined,
  # see #compute_fraction).
  num_std = if bound < 6
    12
  elsif bound < 16
    9
  else
    6
  end.to_f

  [bound + num_std * Math.sqrt(bound), 1e-10].max
end
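
This is the with-replacement branch of #compute_fraction: given the desired minimum sample size, it returns the expected number of draws needed so that a Poisson sample rarely falls short. Illustrative values (rounded), derived from the code above:

upper_poisson_bound(5)    # => ~31.83 (5 + 12 * sqrt(5))
upper_poisson_bound(10)   # => ~38.46 (10 + 9 * sqrt(10))
upper_poisson_bound(100)  # => 160.0  (100 + 6 * sqrt(100))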