Class: SamplingHash::Sampler

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/sampling-hash/sampler.rb


Constructor Details

#initialize(size, sample_size = 1024, header_samples = 1000, minimum_samples = 5000, remaining_factor = 0.001) ⇒ Sampler

Calculates sample offsets.

Parameters:

  • size: Total size of the data to sample (in bytes).

  • sample_size: Size of a single sample (in bytes).

  • header_samples: Number of contiguous samples at the front of the data that are always included.

  • minimum_samples: Minimum number of samples to include (header samples count toward it).

  • remaining_factor: If size is greater than minimum_samples * sample_size, the fraction of the

    data not already covered by the minimum samples that is sampled in addition.


# File 'lib/sampling-hash/sampler.rb', line 15

def initialize(size, sample_size = 1024, header_samples = 1000, minimum_samples = 5000, remaining_factor = 0.001)
  @samples = []

  minimum_sampling_size = minimum_samples * sample_size
  if (size > minimum_sampling_size)
    # Continuous header samples first.
    header_samples.times { |i| @samples << [i * sample_size, sample_size] }

    # Spread the rest.
    start_offset                       = header_samples * sample_size
    remaining_size                     = size - start_offset

    remaining_minimum_samples          = [0, minimum_samples - header_samples].max
    remaining_minimum_sampling_size    = remaining_minimum_samples * sample_size

    remaining_additional_size          = remaining_size - remaining_minimum_sampling_size
    remaining_additional_sampling_size = remaining_additional_size * remaining_factor
    remaining_additional_samples       = (remaining_additional_sampling_size / sample_size).truncate

    remaining_total_samples            = remaining_minimum_samples + remaining_additional_samples
    remaining_total_sampling_size      = remaining_minimum_sampling_size + remaining_additional_sampling_size

    remaining_unsampled_size           = remaining_size - remaining_total_sampling_size
    remaining_sampling_gap             = (remaining_unsampled_size / remaining_total_samples).truncate

    # NOTE: Samples cannot run past the end of the data because both the additional
    # sample count and the sampling gap are truncated (rounded down).
    remaining_total_samples.times do |i|
      @samples << [start_offset + i * (sample_size + remaining_sampling_gap), sample_size]
    end
  else
    total_full_samples = size / sample_size
    last_sample_size   = size - ((size / sample_size) * sample_size)

    # Simply take them all.
    total_full_samples.times { |i| @samples << [i * sample_size, sample_size] }
    @samples << [total_full_samples * sample_size, last_sample_size] if last_sample_size != 0
  end

  @size = @samples.inject(0) { |i, v| i + v[1] }
end
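
To make the large-input branch concrete, the arithmetic can be traced for a 100 MiB input with the default arguments. The following is a standalone sketch that mirrors the calculation in the source above rather than calling the library; the method name `plan_samples` and the returned hash keys are hypothetical.

```ruby
# Standalone sketch of the large-input branch of Sampler#initialize.
# `plan_samples` is a hypothetical helper, not part of the library.
def plan_samples(size, sample_size = 1024, header_samples = 1000,
                 minimum_samples = 5000, remaining_factor = 0.001)
  start_offset   = header_samples * sample_size
  remaining_size = size - start_offset

  # Samples still owed after the header to reach the minimum.
  minimum_rest      = [0, minimum_samples - header_samples].max
  minimum_rest_size = minimum_rest * sample_size

  # A fraction of whatever lies beyond the minimum is sampled as well.
  additional_size    = (remaining_size - minimum_rest_size) * remaining_factor
  additional_samples = (additional_size / sample_size).truncate

  total_rest = minimum_rest + additional_samples
  unsampled  = remaining_size - (minimum_rest_size + additional_size)
  gap        = (unsampled / total_rest).truncate

  { header: header_samples, spread: total_rest, gap: gap }
end

plan = plan_samples(100 * 1024 * 1024)
puts plan.inspect
```

With these inputs the sketch yields 1000 header samples followed by 4097 spread samples, each separated by a gap of 24319 unsampled bytes.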

Instance Attribute Details

#samples ⇒ Object (readonly)

Returns the computed samples as an array of [offset, length] pairs (both in bytes).



# File 'lib/sampling-hash/sampler.rb', line 5

def samples
  @samples
end
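
One way such [offset, length] pairs might be consumed is to seek to each offset and feed only those bytes into a digest. This is a minimal sketch, not the library's own hashing code: the helper `digest_samples`, the choice of SHA-256, and the hand-written sample list are all assumptions made for illustration.

```ruby
require 'digest'
require 'stringio'

# Hypothetical helper: hash only the byte ranges named by [offset, length] pairs.
def digest_samples(io, samples)
  digest = Digest::SHA256.new
  samples.each do |offset, length|
    io.seek(offset)
    chunk = io.read(length)
    digest.update(chunk) if chunk # read returns nil at end of data
  end
  digest.hexdigest
end

data    = StringIO.new('abcdefghijklmnopqrstuvwxyz')
samples = [[0, 4], [10, 4], [22, 4]] # hand-written, same shape as Sampler#samples
sampled = digest_samples(data, samples)
puts sampled
```

Here only the bytes `abcd`, `klmn`, and `wxyz` contribute to the digest; everything between the samples is skipped.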

#size ⇒ Object (readonly)

Returns the total number of bytes covered by all samples (the sum of the sample lengths), not the size of the input data.



# File 'lib/sampling-hash/sampler.rb', line 5

def size
  @size
end
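
Because the value is the sum of the sample lengths, it equals the input size only when the input is small enough for every byte to be sampled. The sketch below reproduces that small-input branch on its own; `take_all` is a hypothetical name, not a library method.

```ruby
# Standalone sketch of the small-input branch; `take_all` is hypothetical.
def take_all(size, sample_size = 1024)
  full, last = size.divmod(sample_size)
  samples = Array.new(full) { |i| [i * sample_size, sample_size] }
  # A shorter trailing sample covers whatever is left over.
  samples << [full * sample_size, last] unless last.zero?
  samples
end

samples = take_all(2500)
covered = samples.sum { |_offset, length| length }
puts samples.inspect
puts covered
```

For a 2500-byte input this gives two full 1024-byte samples plus a 452-byte tail, covering all 2500 bytes.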

Instance Method Details

#each(&block) ⇒ Object

Yields each sample, an [offset, length] pair, in ascending offset order.



# File 'lib/sampling-hash/sampler.rb', line 56

def each(&block)
  @samples.each(&block)
end
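
Since the class includes Enumerable and delegates #each to the sample array, the usual Enumerable methods become available on it. The sketch below illustrates that pattern with a hypothetical stand-in class, `OffsetList`, so that it runs without the library installed.

```ruby
# Hypothetical stand-in with the same shape as Sampler: it includes
# Enumerable and delegates #each to an array of [offset, length] pairs.
class OffsetList
  include Enumerable

  def initialize(samples)
    @samples = samples
  end

  def each(&block)
    @samples.each(&block)
  end
end

list = OffsetList.new([[0, 1024], [1024, 1024], [2048, 452]])

offsets = list.map { |offset, _length| offset }        # offsets only
short   = list.select { |_offset, length| length < 1024 } # partial samples
total   = list.count
```

Defining only #each is enough for `map`, `select`, `count`, `first`, and the rest of Enumerable to work.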