Class: Rumale::ModelSelection::ShuffleSplit

Inherits:
Object
Includes:
Base::Splitter
Defined in:
lib/rumale/model_selection/shuffle_split.rb

Overview

ShuffleSplit is a class that generates the set of data indices for random permutation cross-validation.

Examples:

ss = Rumale::ModelSelection::ShuffleSplit.new(n_splits: 3, test_size: 0.2, random_seed: 1)
ss.split(samples, labels).each do |train_ids, test_ids|
  train_samples = samples[train_ids, true]
  test_samples = samples[test_ids, true]
  ...
end

Constructor Details

#initialize(n_splits: 3, test_size: 0.1, train_size: nil, random_seed: nil) ⇒ ShuffleSplit

Create a new data splitter for random permutation cross-validation.

Parameters:

  • n_splits (Integer) (defaults to: 3)

    The number of folds, that is, the number of independent random splits to generate.

  • test_size (Float) (defaults to: 0.1)

    The fraction of the samples to use as test data.

  • train_size (Float) (defaults to: nil)

    The fraction of the samples to use as training data. If nil, every sample outside the test split is used for training.

  • random_seed (Integer) (defaults to: nil)

    The seed value used to initialize the random generator. If nil, a seed is obtained from Kernel#srand.
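
The way test_size and train_size turn into sample counts can be sketched in plain Ruby. This is a minimal illustration that mirrors the arithmetic in #split; split_sizes is a hypothetical helper, not part of Rumale.

```ruby
# Hypothetical helper mirroring the size arithmetic in #split (not a Rumale method).
def split_sizes(n_samples, test_size:, train_size: nil)
  n_test = (test_size * n_samples).to_i # to_i truncates toward zero
  n_train = train_size.nil? ? n_samples - n_test : (train_size * n_samples).to_i
  [n_train, n_test]
end

# 0.25 * 10 = 2.5 is truncated to 2 test samples; the other 8 are used for training.
p split_sizes(10, test_size: 0.25)
# With an explicit train_size, 2 + 5 = 7 samples are used and 3 are left out.
p split_sizes(10, test_size: 0.2, train_size: 0.5)
# 0.1 * 5 = 0.5 is truncated to 0 test samples, which #split rejects with a RangeError.
p split_sizes(5, test_size: 0.1)
```

Because the product is truncated rather than rounded, small datasets combined with small ratios can yield an empty test split.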



# File 'lib/rumale/model_selection/shuffle_split.rb', line 34

def initialize(n_splits: 3, test_size: 0.1, train_size: nil, random_seed: nil)
  check_params_integer(n_splits: n_splits)
  check_params_float(test_size: test_size)
  check_params_type_or_nil(Float, train_size: train_size)
  check_params_type_or_nil(Integer, random_seed: random_seed)
  check_params_positive(n_splits: n_splits)
  check_params_positive(test_size: test_size)
  check_params_positive(train_size: train_size) unless train_size.nil?
  @n_splits = n_splits
  @test_size = test_size
  @train_size = train_size
  @random_seed = random_seed
  @random_seed ||= srand
  @rng = Random.new(@random_seed)
end
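
The effect of random_seed can be illustrated with the standard library alone. The sketch below assumes nothing about Rumale beyond what the constructor shows: a Random instance is built from the seed and later passed to Array#sample.

```ruby
# Two generators built from the same seed produce the same stream,
# so sampling with them yields identical index sets.
ids = [*0...10]
rng_a = Random.new(1)
rng_b = Random.new(1)

sample_a = ids.sample(3, random: rng_a)
sample_b = ids.sample(3, random: rng_b)
same = sample_a == sample_b
puts same
```

Passing a fixed random_seed is therefore what makes the generated splits reproducible across runs.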

Instance Attribute Details

#n_splits ⇒ Integer (readonly)

Return the number of folds.

Returns:

  • (Integer)


# File 'lib/rumale/model_selection/shuffle_split.rb', line 22

def n_splits
  @n_splits
end

#rng ⇒ Random (readonly)

Return the random generator for shuffling the dataset.

Returns:

  • (Random)


# File 'lib/rumale/model_selection/shuffle_split.rb', line 26

def rng
  @rng
end

Instance Method Details

#split(x, _y = nil) ⇒ Array

Generate data indices for random permutation cross-validation.

Parameters:

  • x (Numo::DFloat)

    (shape: [n_samples, n_features]) The dataset to be used to generate data indices for random permutation cross-validation.

  • _y (defaults to: nil)

    Unused by this splitter.

Returns:

  • (Array)

    An array with one [train_ids, test_ids] pair per fold, where each element is an array of sample indices for constructing the training and testing datasets.



# File 'lib/rumale/model_selection/shuffle_split.rb', line 55

def split(x, _y = nil)
  check_sample_array(x)
  # Initialize and check some variables.
  n_samples = x.shape[0]
  n_test_samples = (@test_size * n_samples).to_i
  n_train_samples = @train_size.nil? ? n_samples - n_test_samples : (@train_size * n_samples).to_i
  unless @n_splits.between?(1, n_samples)
    raise ArgumentError,
          'The value of n_splits must be not less than 1 and not more than the number of samples.'
  end
  unless n_test_samples.between?(1, n_samples)
    raise RangeError,
          'The number of sample in test split must be not less than 1 and not more than the number of samples.'
  end
  unless n_train_samples.between?(1, n_samples)
    raise RangeError,
          'The number of sample in train split must be not less than 1 and not more than the number of samples.'
  end
  if (n_test_samples + n_train_samples) > n_samples
    raise RangeError,
          'The total number of samples in test split and train split must be not more than the number of samples.'
  end
  sub_rng = @rng.dup
  # Returns array consisting of the training and testing ids for each fold.
  dataset_ids = [*0...n_samples]
  Array.new(@n_splits) do
    test_ids = dataset_ids.sample(n_test_samples, random: sub_rng)
    train_ids = if @train_size.nil?
                  dataset_ids - test_ids
                else
                  (dataset_ids - test_ids).sample(n_train_samples, random: sub_rng)
                end
    [train_ids, test_ids]
  end
end
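
The sampling loop above can be tried without Rumale or Numo. The following is a simplified sketch for the case where train_size is nil; shuffle_split_ids is a hypothetical stand-in, not a Rumale method, and it skips the range checks.

```ruby
# Hypothetical stand-in for the sampling loop in #split (train_size: nil case).
def shuffle_split_ids(n_samples, n_splits:, n_test:, rng:)
  sub_rng = rng.dup # work on a copy so the caller's generator state is untouched
  ids = [*0...n_samples]
  Array.new(n_splits) do
    test_ids = ids.sample(n_test, random: sub_rng)
    [ids - test_ids, test_ids] # training ids are everything not drawn for testing
  end
end

rng = Random.new(1)
first = shuffle_split_ids(10, n_splits: 3, n_test: 2, rng: rng)
second = shuffle_split_ids(10, n_splits: 3, n_test: 2, rng: rng)

first.each do |train_ids, test_ids|
  puts "train=#{train_ids.size} test=#{test_ids.size} overlap=#{(train_ids & test_ids).size}"
end
# Since the generator is duplicated on every call, both calls replay the same splits.
puts first == second
```

Two properties follow from this structure: within a fold the training and test indices never overlap, and test sets of different folds are drawn independently, so the same sample may appear in the test split of more than one fold.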