Module: HybridForest::Utils

Extended by: Utils

Included in: Utils

Defined in: lib/hybridforest/utilities/utils.rb

Defined Under Namespace

Modules: DataFrameExtensions

Class Method Summary

.accuracy(predicted, actual) ⇒ Object

Given an array of predicted labels and an array of actual labels, returns the accuracy of the predictions.
.prediction_report(actual, predicted) ⇒ Object

Outputs a report of common prediction metrics.
.random_sample(data:, size:, with_replacement: true) ⇒ Object

Draws a random sample of `size` instances from `data`, with or without replacement.
.to_dataframe(instances, types: nil) ⇒ Object

Turns instances into a dataframe prepared for fitting and applying models.
.train_test_bootstrap_split(dataset) ⇒ Object

Partitions dataset into training and testing datasets by drawing instances with replacement into the training set and using the instances that were never drawn as the testing set.
.train_test_split(dataset, test_set_size = 0.20) ⇒ Object

Partitions dataset into training and testing datasets, and splits the testing dataset into a dataframe of independent features and an array of labels.

Class Method Details

.accuracy(predicted, actual) ⇒ `Object`

Given an array of predicted labels and an array of actual labels, returns the accuracy of the predictions.

# File 'lib/hybridforest/utilities/utils.rb', line 106

def self.accuracy(predicted, actual)
  accurate = predicted.zip(actual).count { |p, a| equal_labels?(p, a) }
  accurate.to_f / predicted.count
end
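
As a rough illustration of the computation above, the following plain-Ruby sketch counts matching label pairs and divides by the number of predictions. The name `accuracy_sketch` is hypothetical, and `==` is only an assumed stand-in for the library's `equal_labels?` helper, whose body is not shown on this page. Unlike the listing above, the sketch also rejects empty or mismatched inputs.

```ruby
# Plain-Ruby sketch of the accuracy computation.
# Assumption: `==` stands in for equal_labels?, which is not shown in this page.
def accuracy_sketch(predicted, actual)
  if predicted.empty? || predicted.size != actual.size
    raise ArgumentError, "expected two non-empty arrays of equal size"
  end

  # Pair each prediction with its true label and count the matches.
  correct = predicted.zip(actual).count { |p, a| p == a }
  correct.to_f / predicted.size
end

predicted = ["yes", "no", "yes", "yes"]
actual    = ["yes", "no", "no",  "yes"]
puts accuracy_sketch(predicted, actual) # three of the four labels match
```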

.prediction_report(actual, predicted) ⇒ `Object`

Outputs a report of common prediction metrics. `actual` and `predicted` are expected to be equal-sized arrays of class labels.

# File 'lib/hybridforest/utilities/utils.rb', line 97

def self.prediction_report(actual, predicted)
  Rumale::EvaluationMeasure.classification_report(
    actual,
    predicted
  )
end

.random_sample(data:, size:, with_replacement: true) ⇒ `Object`

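Draws a random sample of `size` instances from `data`, with or without replacement. Raises ArgumentError if `size` is less than 1, or if sampling without replacement and `size` exceeds the number of instances.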

# File 'lib/hybridforest/utilities/utils.rb', line 80

def self.random_sample(data:, size:, with_replacement: true)
  data = to_dataframe(data)

  if size < 1 || (!with_replacement && size > data.count)
    raise ArgumentError, "Invalid sample size"
  end

  rows = if with_replacement
    rand_nums(size, 0...data.count)
  else
    rand_uniq_nums(size, 0...data.count)
  end
  data[rows]
end
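
The two sampling modes can be sketched on a plain array of rows. This is only an approximation under stated assumptions: `random_sample_sketch` is a hypothetical name, and the index generation stands in for `rand_nums` and `rand_uniq_nums`, whose bodies are not shown on this page.

```ruby
# Sketch of sampling with and without replacement on a plain array.
# Assumption: the index generation below approximates rand_nums / rand_uniq_nums.
def random_sample_sketch(rows, size:, with_replacement: true)
  if size < 1 || (!with_replacement && size > rows.size)
    raise ArgumentError, "Invalid sample size"
  end

  indices = if with_replacement
    Array.new(size) { rand(rows.size) }  # the same index may be drawn repeatedly
  else
    (0...rows.size).to_a.sample(size)    # every index is drawn at most once
  end
  rows.values_at(*indices)
end

rows = %w[a b c d e]
p random_sample_sketch(rows, size: 8)                          # may contain duplicates
p random_sample_sketch(rows, size: 3, with_replacement: false) # three distinct rows
```

With replacement, the sample may be larger than the data; without replacement, it cannot be.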

.to_dataframe(instances, types: nil) ⇒ `Object`

Turns instances into a dataframe prepared for fitting and applying models. Instances can be an array of hashes:

[{a: 1, b: "one"}, {a: 2, b: "two"}, {a: 3, b: "three"}]

or a hash of arrays:

{a: [1, 2, 3], b: ["one", "two", "three"]}

or the path to a CSV file:

"dataset.csv"

Accepts an optional hash for specifying feature data types: to_dataframe("dataset.csv", types: {"a" => :int, "b" => :float})

Raises ArgumentError if given an invalid dataset.

Raises:

(ArgumentError)

# File 'lib/hybridforest/utilities/utils.rb', line 71

def self.to_dataframe(instances, types: nil)
  return instances if instances.is_a? Rover::DataFrame
  return instances if success? { instances = Rover::DataFrame.new(instances, types: types) }
  return instances if success? { instances = Rover.read_csv(instances, types: types) }
  raise ArgumentError, @error
end

.train_test_bootstrap_split(dataset) ⇒ `Object`

Partitions dataset into training and testing datasets by drawing instances with replacement into the training set and using the instances that were never drawn as the testing set. Then splits the testing dataset into a dataframe of independent features and an array of labels. If every instance happens to be drawn, so that no instances remain for testing, falls back to train_test_split. Returns [training_set, testing_set, testing_set_labels].

# File 'lib/hybridforest/utilities/utils.rb', line 35

def self.train_test_bootstrap_split(dataset)
  dataset = to_dataframe(dataset)
  all_rows = (0...dataset.count).to_a

  train_set_rows = rand_nums(dataset.count, 0...dataset.count)
  train_set = dataset[train_set_rows]

  return train_test_split(dataset) if train_set_rows.sort == all_rows

  test_set = dataset[all_rows - train_set_rows]
  test_set, test_set_labels = test_set.disconnect_labels

  [train_set, test_set, test_set_labels]
end
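
The bootstrap partition can be illustrated on a plain array. This sketch makes several assumptions: `bootstrap_split_sketch` is a hypothetical name, `rand` stands in for `rand_nums`, and the label separation done by `disconnect_labels` is left out because that method is not shown on this page. The fallback to `train_test_split` is also omitted, so the testing set here can be empty.

```ruby
# Sketch of a bootstrap (out-of-bag) split on a plain array of rows.
# Assumption: rand stands in for rand_nums; label handling is omitted.
def bootstrap_split_sketch(rows)
  all_indices = (0...rows.size).to_a

  # Draw as many indices as there are rows, with replacement.
  drawn = Array.new(rows.size) { rand(rows.size) }

  # Rows whose index was never drawn form the testing set.
  out_of_bag = all_indices - drawn

  [rows.values_at(*drawn), rows.values_at(*out_of_bag)]
end

rows = (1..10).to_a
train, test = bootstrap_split_sketch(rows)
puts "train: #{train.inspect}"
puts "test:  #{test.inspect}"
```

The training set always has as many rows as the input, usually with repeats, while the size of the testing set varies from run to run.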

.train_test_split(dataset, test_set_size = 0.20) ⇒ `Object`

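Partitions dataset into training and testing datasets, and splits the testing dataset into a dataframe of independent features and an array of labels. `test_set_size` is the fraction of instances assigned to the testing set (0.20 by default); the resulting count is rounded down. Returns [training_set, testing_set, testing_set_labels].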

# File 'lib/hybridforest/utilities/utils.rb', line 14

def self.train_test_split(dataset, test_set_size = 0.20)
  # TODO: Offer stratify param
  dataset = to_dataframe(dataset)
  all_rows = (0...dataset.count).to_a

  test_set_count = (dataset.count * test_set_size).floor
  test_set_rows = rand_uniq_nums(test_set_count, 0...dataset.count)
  test_set = dataset[test_set_rows]
  test_set, test_set_labels = test_set.disconnect_labels

  train_set = dataset[all_rows - test_set_rows]

  [train_set, test_set, test_set_labels]
end
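
A minimal sketch of the same partitioning on a plain array follows. It assumes that `Array#sample` is an acceptable stand-in for `rand_uniq_nums`, uses the hypothetical name `train_test_split_sketch`, and leaves out the `disconnect_labels` step, which is not shown on this page.

```ruby
# Sketch of a random train/test partition on a plain array of rows.
# Assumption: Array#sample stands in for rand_uniq_nums; label handling is omitted.
def train_test_split_sketch(rows, test_set_size = 0.20)
  all_indices = (0...rows.size).to_a

  # Round the testing set count down, as in the listing above.
  test_count = (rows.size * test_set_size).floor
  test_indices = all_indices.sample(test_count)

  # Every index not chosen for testing goes to the training set.
  train_indices = all_indices - test_indices

  [rows.values_at(*train_indices), rows.values_at(*test_indices)]
end

rows = (1..10).to_a
train, test = train_test_split_sketch(rows)
puts "train: #{train.inspect}" # 8 rows
puts "test:  #{test.inspect}"  # 2 rows
```

Because the count is rounded down, a small dataset combined with a small `test_set_size` can yield an empty testing set.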
