Module: Spark::Helper::Statistic
Included in: Command::Histogram, RDD
Defined in: lib/spark/helper/statistic.rb
Instance Method Summary
- #bisect_right(data, value, low = 0, high = data.size) ⇒ Object
Returns the index at which value would be inserted into sorted data, to the right of any entries equal to value.
- #compute_fraction(lower_bound, total, with_replacement) ⇒ Object
Returns a sampling rate that guarantees a sample of size >= lower_bound 99.99% of the time.
- #determine_bounds(data, num_partitions) ⇒ Object
Determines the bounds used for range partitioning.
- #upper_binomial_bound(delta, total, fraction) ⇒ Object
Upper bound on the sampling rate for sampling without replacement.
- #upper_poisson_bound(bound) ⇒ Object
Upper bound on the expected sample size for Poisson (with-replacement) sampling.
Instance Method Details
#bisect_right(data, value, low = 0, high = data.size) ⇒ Object
Returns the insertion point for value in sorted data, to the right of any entries equal to value.
Examples:
data = [1,5,6,8,96,120,133]
bisect_right(data, 0) # => 0
bisect_right(data, 1) # => 1
bisect_right(data, 5) # => 2
bisect_right(data, 9) # => 4
bisect_right(data, 150) # => 7
# File 'lib/spark/helper/statistic.rb', line 58

def bisect_right(data, value, low=0, high=data.size)
  if low < 0
    raise ArgumentError, 'Low must be >= 0.'
  end

  while low < high
    mid = (low + high) / 2
    if value < data[mid]
      high = mid
    else
      low = mid + 1
    end
  end

  low
end
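As a quick sanity check (a sketch, not part of the library), bisect_right agrees with Ruby's built-in Array#bsearch_index in find-minimum mode:

data = [1, 5, 6, 8, 96, 120, 133]

# First index whose element is greater than v, or data.size when none is
ruby_bisect = ->(v) { data.bsearch_index { |x| x > v } || data.size }

[0, 1, 5, 9, 150].map { |v| ruby_bisect.call(v) }
# => [0, 1, 2, 4, 7] -- matches the examples above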
#compute_fraction(lower_bound, total, with_replacement) ⇒ Object
Returns a sampling rate that guarantees a sample of size >= lower_bound 99.99% of the time.
How the sampling rate is determined:
Let p = num / total, where num is the sample size and total is the total number of data points in the RDD. We’re trying to compute q > p such that:
- When sampling with replacement, we’re drawing each data point with prob_i ~ Pois(q), where we want to guarantee Pr[s < num] < 0.0001 for s = sum(prob_i for i from 0 to total), i.e. the failure rate of not getting a sufficiently large sample is < 0.0001. Setting q = p + 5 * sqrt(p/total) is sufficient to guarantee a 0.9999 success rate for num > 12, but a slightly larger multiplier is needed (9, determined empirically).
- When sampling without replacement, we’re drawing each data point with prob_i ~ Binomial(total, fraction), and our choice of q guarantees 1 - delta, or a 0.9999 success rate, where success rate is defined the same as in sampling with replacement.
# File 'lib/spark/helper/statistic.rb', line 19

def compute_fraction(lower_bound, total, with_replacement)
  lower_bound = lower_bound.to_f

  if with_replacement
    upper_poisson_bound(lower_bound) / total
  else
    fraction = lower_bound / total
    upper_binomial_bound(0.00001, total, fraction)
  end
end
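As a worked example (illustrative inputs, assuming the helper is mixed in; the results follow from the definitions above):

# Target: a sample of at least 1_000 items out of 1_000_000, i.e. p = 0.001
compute_fraction(1_000, 1_000_000, true)
# => ~0.00119  ((1000 + 6 * Math.sqrt(1000)) / 1_000_000)

compute_fraction(1_000, 1_000_000, false)
# => ~0.00116  (binomial bound with delta = 0.00001)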
#determine_bounds(data, num_partitions) ⇒ Object
Determines the bounds used for range partitioning.
Examples:
data = [0,1,2,3,4,5,6,7,8,9,10]
determine_bounds(data, 3)
# => [3, 7]
# File 'lib/spark/helper/statistic.rb', line 82

def determine_bounds(data, num_partitions)
  # With more partitions than data points, every element serves as a bound
  if num_partitions > data.size
    return data
  end

  bounds = []
  count = data.size
  (0...(num_partitions-1)).each do |index|
    bounds << data[count * (index+1) / num_partitions]
  end
  bounds
end
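The bounds can be paired with #bisect_right to route a key to its partition; a minimal sketch (partition_for is a hypothetical helper, not part of the module):

bounds = determine_bounds([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3)  # => [3, 7]

# Keys below 3 go to partition 0, 3..6 to partition 1, 7 and up to partition 2
partition_for = ->(key) { bisect_right(bounds, key) }

partition_for.call(2)   # => 0
partition_for.call(3)   # => 1
partition_for.call(10)  # => 2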
#upper_binomial_bound(delta, total, fraction) ⇒ Object
# File 'lib/spark/helper/statistic.rb', line 42

def upper_binomial_bound(delta, total, fraction)
  # Chernoff-style cushion; shrinks as the number of data points grows
  gamma = -Math.log(delta) / total
  # A sampling rate can never exceed 1
  [1, fraction + gamma + Math.sqrt(gamma*gamma + 2*gamma*fraction)].min
end
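For intuition, the returned rate sits slightly above the naive fraction, with a cushion that shrinks as total grows (illustrative calls):

upper_binomial_bound(0.00001, 1_000_000, 0.001)  # => ~0.00116
upper_binomial_bound(0.00001, 100_000, 0.001)    # => ~0.00161 (fewer points, bigger cushion)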
#upper_poisson_bound(bound) ⇒ Object
# File 'lib/spark/helper/statistic.rb', line 30

def upper_poisson_bound(bound)
  # Smaller expected counts get a wider safety margin (in standard deviations)
  num_std = if bound < 6
              12
            elsif bound < 16
              9
            else
              6
            end.to_f

  # Keep the result strictly positive
  [bound + num_std * Math.sqrt(bound), 1e-10].max
end
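The three branches trade a wider safety margin for small expected counts (illustrative calls):

upper_poisson_bound(4)     # => 28.0      (4 + 12 * Math.sqrt(4))
upper_poisson_bound(10)    # => ~38.46    (10 + 9 * Math.sqrt(10))
upper_poisson_bound(1000)  # => ~1189.74  (1000 + 6 * Math.sqrt(1000))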