Module: HybridForest::Utils

Extended by:
Utils
Included in:
Utils
Defined in:
lib/hybridforest/utilities/utils.rb

Defined Under Namespace

Modules: DataFrameExtensions

Class Method Details

.accuracy(predicted, actual) ⇒ Object

Given an array of predicted labels and an array of actual labels, returns the accuracy of the predictions: the number of matching pairs divided by the number of predictions, as a Float.



# File 'lib/hybridforest/utilities/utils.rb', line 106

def self.accuracy(predicted, actual)
  accurate = predicted.zip(actual).count { |p, a| equal_labels?(p, a) }
  accurate.to_f / predicted.count
end
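The calculation above can be tried outside the library with a minimal sketch. equal_labels? is a helper that is not shown on this page, so the stand-in labels_match? below, which compares with plain ==, is an assumption, as is the name accuracy_sketch.

```ruby
# Stand-in for the library's equal_labels? helper (not shown on this page).
# Comparing with == is an assumption made for this sketch.
def labels_match?(predicted_label, actual_label)
  predicted_label == actual_label
end

# Fraction of positions where the predicted label matches the actual label.
def accuracy_sketch(predicted, actual)
  matches = predicted.zip(actual).count { |p, a| labels_match?(p, a) }
  matches.to_f / predicted.count
end

predicted = ["spam", "ham", "spam", "ham"]
actual    = ["spam", "ham", "ham",  "ham"]

# Three of the four pairs match.
puts accuracy_sketch(predicted, actual) # prints 0.75
```

Note that an empty predicted array makes the division 0.0 / 0, which is NaN in Ruby rather than an error.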

.prediction_report(actual, predicted) ⇒ Object

Returns a report of common prediction metrics, as produced by Rumale::EvaluationMeasure.classification_report. actual and predicted are expected to be equal-sized arrays of class labels.



# File 'lib/hybridforest/utilities/utils.rb', line 97

def self.prediction_report(actual, predicted)
  Rumale::EvaluationMeasure.classification_report(
    actual,
    predicted
  )
end
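As a rough illustration of the kind of metric such a report summarizes, the sketch below computes precision and recall for a single class in plain Ruby. It does not reproduce Rumale's output or layout; the method name precision_recall and the sample labels are hypothetical.

```ruby
# Precision and recall for one class label, computed from two equal-sized
# arrays of labels. Illustration only; not Rumale's implementation.
def precision_recall(actual, predicted, label)
  true_positives = actual.zip(predicted).count { |a, p| a == label && p == label }
  predicted_positives = predicted.count(label)
  actual_positives = actual.count(label)

  # Guard against division by zero when the label never occurs.
  precision = predicted_positives.zero? ? 0.0 : true_positives.to_f / predicted_positives
  recall = actual_positives.zero? ? 0.0 : true_positives.to_f / actual_positives

  { precision: precision, recall: recall }
end

actual    = ["cat", "cat", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog"]

p precision_recall(actual, predicted, "cat")
# "cat" was predicted once and correctly (precision 1.0),
# but only one of the two actual cats was found (recall 0.5).
```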

.random_sample(data:, size:, with_replacement: true) ⇒ Object

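Draws a random sample of size rows from data, with replacement by default. Raises ArgumentError if size is less than 1, or if with_replacement is false and size exceeds the number of rows in data.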



# File 'lib/hybridforest/utilities/utils.rb', line 80

def self.random_sample(data:, size:, with_replacement: true)
  data = to_dataframe(data)

  if size < 1 || (!with_replacement && size > data.count)
    raise ArgumentError, "Invalid sample size"
  end

  rows = if with_replacement
    rand_nums(size, 0...data.count)
  else
    rand_uniq_nums(size, 0...data.count)
  end
  data[rows]
end
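The row selection can be sketched on plain integer indices. rand_nums and rand_uniq_nums are helpers not shown on this page; the sketch assumes they draw indices with and without repetition respectively. The name sample_indices and the random: parameter are hypothetical.

```ruby
# Picks `size` row indices out of 0...row_count.
# With replacement an index may repeat; without replacement all are distinct.
def sample_indices(size, row_count, with_replacement: true, random: Random.new)
  if size < 1 || (!with_replacement && size > row_count)
    raise ArgumentError, "Invalid sample size"
  end

  if with_replacement
    Array.new(size) { random.rand(row_count) }
  else
    (0...row_count).to_a.sample(size, random: random)
  end
end

rng = Random.new(1)
p sample_indices(5, 3, random: rng)                          # 5 indices, repeats allowed
p sample_indices(3, 3, with_replacement: false, random: rng) # a permutation of 0, 1, 2
```

Sampling with replacement can request more rows than the data holds, which is why the size check against the row count applies only when with_replacement is false.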

.to_dataframe(instances, types: nil) ⇒ Object

Turns instances into a dataframe prepared for fitting and applying models. instances can be an array of hashes:

[{a: 1, b: "one"}, {a: 2, b: "two"}, {a: 3, b: "three"}]

or a hash of arrays:

{a: [1, 2, 3], b: ["one", "two", "three"]}

or the path to a CSV file:

"dataset.csv"

Accepts an optional hash for specifying feature data types: to_dataframe("dataset.csv", types: {"a" => :int, "b" => :float})

Raises ArgumentError if given an invalid dataset.

Raises:

  • (ArgumentError)


# File 'lib/hybridforest/utilities/utils.rb', line 71

def self.to_dataframe(instances, types: nil)
  return instances if instances.is_a? Rover::DataFrame
  return instances if success? { instances = Rover::DataFrame.new(instances, types: types) }
  return instances if success? { instances = Rover.read_csv(instances, types: types) }
  raise ArgumentError, @error
end
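The method tries each way of building a dataframe in turn and raises only when all of them fail. The sketch below shows that fallback pattern in plain Ruby. success? is a helper not shown on this page; the sketch assumes it reports whether its block ran without raising and remembers the error message. first_successful and the Integer/Float builders are hypothetical stand-ins for Rover::DataFrame.new and Rover.read_csv.

```ruby
# Tries each builder in order and returns the first result that does not
# raise. If every builder fails, raises ArgumentError with the last message.
def first_successful(input, builders)
  last_error = "No builders given"
  builders.each do |builder|
    begin
      return builder.call(input)
    rescue StandardError => e
      last_error = e.message # remember why this attempt failed
    end
  end
  raise ArgumentError, last_error
end

# Stand-in builders for illustration: parse as Integer first, then as Float.
builders = [
  ->(text) { Integer(text) },
  ->(text) { Float(text) }
]

p first_successful("42", builders)  # the first builder succeeds
p first_successful("4.2", builders) # falls through to the second builder
```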

.train_test_bootstrap_split(dataset) ⇒ Object

Partitions dataset into training and testing datasets by drawing rows with replacement for the training set and using the rows that were never drawn as the testing set. Then splits the testing set into a dataframe of independent features and an array of labels. If every row happens to be drawn, so that no rows are left for testing, falls back to train_test_split. Returns [training_set, testing_set, testing_set_labels]



# File 'lib/hybridforest/utilities/utils.rb', line 35

def self.train_test_bootstrap_split(dataset)
  dataset = to_dataframe(dataset)
  all_rows = (0...dataset.count).to_a

  train_set_rows = rand_nums(dataset.count, 0...dataset.count)
  train_set = dataset[train_set_rows]

  return train_test_split(dataset) if train_set_rows.sort == all_rows

  test_set = dataset[all_rows - train_set_rows]
  test_set, test_set_labels = test_set.disconnect_labels

  [train_set, test_set, test_set_labels]
end
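The index bookkeeping behind a bootstrap split can be sketched without a dataframe. The name bootstrap_split_indices and the random: parameter are hypothetical, and the sketch assumes rand_nums (not shown on this page) draws indices with repetition. Unlike the method above, the sketch does not fall back to another split when no rows are left over; it simply returns an empty test list.

```ruby
# Draws row_count indices with replacement for training and uses every
# index that was never drawn for testing.
def bootstrap_split_indices(row_count, random: Random.new)
  all_rows = (0...row_count).to_a
  train_rows = Array.new(row_count) { random.rand(row_count) }
  test_rows = all_rows - train_rows # Array#- removes every drawn index
  [train_rows, test_rows]
end

train_rows, test_rows = bootstrap_split_indices(10, random: Random.new(7))
p train_rows # 10 indices, usually with repeats
p test_rows  # the indices that were never drawn
```

For large datasets, about 1/e (roughly 37%) of the rows are expected to end up in the testing set, since each row is missed by a single draw with probability 1 - 1/n.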

.train_test_split(dataset, test_set_size = 0.20) ⇒ Object

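Partitions dataset into training and testing datasets, and splits the testing dataset into a dataframe of independent features and an array of labels. test_set_size is the fraction of rows held out for testing (0.20 by default); the resulting row count is rounded down. Returns [training_set, testing_set, testing_set_labels]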



# File 'lib/hybridforest/utilities/utils.rb', line 14

def self.train_test_split(dataset, test_set_size = 0.20)
  # TODO: Offer stratify param
  dataset = to_dataframe(dataset)
  all_rows = (0...dataset.count).to_a

  test_set_count = (dataset.count * test_set_size).floor
  test_set_rows = rand_uniq_nums(test_set_count, 0...dataset.count)
  test_set = dataset[test_set_rows]
  test_set, test_set_labels = test_set.disconnect_labels

  train_set = dataset[all_rows - test_set_rows]

  [train_set, test_set, test_set_labels]
end
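The same partitioning can be sketched on plain indices to show how the test set size is derived. The name holdout_split_indices and the random: parameter are hypothetical, and the sketch assumes rand_uniq_nums (not shown on this page) draws distinct indices.

```ruby
# Holds out floor(row_count * test_fraction) distinct indices for testing
# and keeps the remaining indices for training.
def holdout_split_indices(row_count, test_fraction = 0.20, random: Random.new)
  all_rows = (0...row_count).to_a
  test_count = (row_count * test_fraction).floor
  test_rows = all_rows.sample(test_count, random: random)
  [all_rows - test_rows, test_rows]
end

train_rows, test_rows = holdout_split_indices(10)
puts train_rows.size # prints 8
puts test_rows.size  # prints 2
```

Because the count is rounded down, small datasets can yield an empty testing set: with 4 rows and the default fraction, 4 * 0.20 floors to 0.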