Module: HybridForest::Utils
Defined Under Namespace
Modules: DataFrameExtensions
Class Method Summary
- .accuracy(predicted, actual) ⇒ Object
  Given an array of predicted labels and an array of actual labels, returns the accuracy of the predictions.
- .prediction_report(actual, predicted) ⇒ Object
  Outputs a report of common prediction metrics.
- .random_sample(data:, size:, with_replacement: true) ⇒ Object
  Draws a random sample of size from data.
- .to_dataframe(instances, types: nil) ⇒ Object
  Turns instances into a dataframe prepared for fitting and applying models.
- .train_test_bootstrap_split(dataset) ⇒ Object
  Partitions dataset into training and testing datasets, drawing with replacement to the training set, and using the not drawn instances as the testing dataset.
- .train_test_split(dataset, test_set_size = 0.20) ⇒ Object
  Partitions dataset into training and testing datasets, and splits the testing dataset into a dataframe of independent features and an array of labels.
Class Method Details
.accuracy(predicted, actual) ⇒ Object
Given an array of predicted labels and an array of actual labels, returns the accuracy of the predictions.
    # File 'lib/hybridforest/utilities/utils.rb', line 106
    def self.accuracy(predicted, actual)
      accurate = predicted.zip(actual).count { |p, a| equal_labels?(p, a) }
      accurate.to_f / predicted.count
    end
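The calculation in the listing above can be tried on plain arrays without loading the library. The sketch below is a standalone approximation: `accuracy_sketch` is a hypothetical name, and it assumes labels can be compared with `==`, whereas the library delegates to its own `equal_labels?` helper, which is not shown in this section.

```ruby
# Standalone sketch of the accuracy calculation on plain arrays.
# Assumption: labels are compared with ==; the library uses its own
# equal_labels? helper, which may treat labels differently.
def accuracy_sketch(predicted, actual)
  correct = predicted.zip(actual).count { |p, a| p == a }
  correct.to_f / predicted.count
end

predicted = ["yes", "no", "yes", "yes"]
actual    = ["yes", "yes", "yes", "no"]
puts accuracy_sketch(predicted, actual) # 2 of 4 labels match, prints 0.5
```

Note that an empty predicted array would divide 0.0 by 0 and yield NaN in this sketch, so callers may want to guard against empty input.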
.prediction_report(actual, predicted) ⇒ Object
Outputs a report of common prediction metrics. Both actual and predicted are expected to be arrays of class labels of equal size.
    # File 'lib/hybridforest/utilities/utils.rb', line 97
    def self.prediction_report(actual, predicted)
      Rumale::EvaluationMeasure.classification_report(
        actual,
        predicted
      )
    end
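To give a feel for the kind of per-class metrics such a report summarizes, the following sketch computes precision and recall on plain arrays. It is illustrative only: `precision_recall_sketch` is a hypothetical helper, it is not Rumale's implementation, and it does not reproduce the format or the full contents of the report.

```ruby
# Per-class precision and recall on plain arrays (illustrative only;
# this is neither Rumale's implementation nor its report format).
def precision_recall_sketch(actual, predicted)
  pairs = actual.zip(predicted)
  actual.uniq.sort.to_h do |label|
    true_pos      = pairs.count { |a, p| a == label && p == label }
    predicted_pos = predicted.count(label) # times the label was predicted
    actual_pos    = actual.count(label)    # times the label truly occurs
    precision = predicted_pos.zero? ? 0.0 : true_pos.to_f / predicted_pos
    recall    = true_pos.to_f / actual_pos
    [label, { precision: precision, recall: recall }]
  end
end

actual    = %w[a a b b]
predicted = %w[a b b b]
precision_recall_sketch(actual, predicted).each do |label, metrics|
  puts "#{label}: #{metrics}"
end
```

The sketch takes its arguments in the same order as prediction_report (actual first, then predicted), which is the reverse of the order used by accuracy.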
.random_sample(data:, size:, with_replacement: true) ⇒ Object
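Draws a random sample of size rows from data. Raises ArgumentError if size is less than 1, or if with_replacement is false and size exceeds the number of rows in data.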
    # File 'lib/hybridforest/utilities/utils.rb', line 80
    def self.random_sample(data:, size:, with_replacement: true)
      data = to_dataframe(data)

      if size < 1 || (!with_replacement && size > data.count)
        raise ArgumentError, "Invalid sample size"
      end

      rows = if with_replacement
        rand_nums(size, 0...data.count)
      else
        rand_uniq_nums(size, 0...data.count)
      end
      data[rows]
    end
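The two sampling branches can be approximated on a plain array as follows. This is a sketch under stated assumptions: `random_sample_sketch` is a hypothetical name, and the private helpers `rand_nums` and `rand_uniq_nums` are not shown in this section, so the `Kernel#rand` and `Array#sample` calls below only stand in for what they presumably do.

```ruby
# Index sampling on a plain array, mirroring the two branches above.
# Assumption: rand_nums and rand_uniq_nums (not shown in this section)
# behave like the Kernel#rand and Array#sample calls used here.
def random_sample_sketch(rows, size:, with_replacement: true)
  if size < 1 || (!with_replacement && size > rows.count)
    raise ArgumentError, "Invalid sample size"
  end

  indices =
    if with_replacement
      Array.new(size) { rand(rows.count) } # duplicates allowed
    else
      (0...rows.count).to_a.sample(size)   # each index at most once
    end
  rows.values_at(*indices)
end

rows = %w[r0 r1 r2 r3]
p random_sample_sketch(rows, size: 6)                          # may repeat rows
p random_sample_sketch(rows, size: 3, with_replacement: false) # no repeats
```

With replacement, size may exceed the number of rows; without replacement, the same request raises ArgumentError.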
.to_dataframe(instances, types: nil) ⇒ Object
Turns instances into a dataframe prepared for fitting and applying models. An argument that is already a Rover::DataFrame is returned unchanged. Otherwise, instances can be an array of hashes:
    [{a: 1, b: "one"}, {a: 2, b: "two"}, {a: 3, b: "three"}]
or a hash of arrays:
    {a: [1, 2, 3], b: ["one", "two", "three"]}
or the path to a CSV file:
    "dataset.csv"
Accepts an optional hash for specifying feature data types:
    to_dataframe("dataset.csv", types: {"a" => :int, "b" => :float})
Raises ArgumentError if given an invalid dataset.
    # File 'lib/hybridforest/utilities/utils.rb', line 71
    def self.to_dataframe(instances, types: nil)
      return instances if instances.is_a? Rover::DataFrame
      return instances if success? { instances = Rover::DataFrame.new(instances, types: types) }
      return instances if success? { instances = Rover.read_csv(instances, types: types) }
      raise ArgumentError, @error
    end
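The two in-memory input shapes describe the same table, once row by row and once column by column. The sketch below makes that equivalence concrete by normalizing both to a hash of columns. It is a hypothetical helper (`to_columns_sketch`), written without Rover, and it deliberately ignores CSV paths and the types option.

```ruby
# Hypothetical helper: normalizes the two in-memory input shapes to a
# hash of columns. It does not use Rover and ignores CSV paths and types.
def to_columns_sketch(instances)
  case instances
  when Hash
    unless instances.values.all?(Array)
      raise ArgumentError, "expected a hash of arrays"
    end
    instances
  when Array
    unless instances.all?(Hash)
      raise ArgumentError, "expected an array of hashes"
    end
    keys = instances.flat_map(&:keys).uniq
    keys.to_h { |key| [key, instances.map { |row| row[key] }] }
  else
    raise ArgumentError, "unsupported dataset: #{instances.class}"
  end
end

by_row    = [{ a: 1, b: "one" }, { a: 2, b: "two" }, { a: 3, b: "three" }]
by_column = { a: [1, 2, 3], b: ["one", "two", "three"] }
puts to_columns_sketch(by_row) == to_columns_sketch(by_column) # prints true
```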
.train_test_bootstrap_split(dataset) ⇒ Object
Partitions dataset into training and testing datasets by drawing rows with replacement for the training set and using the rows that were never drawn as the testing set. It then splits the testing set into a dataframe of independent features and an array of labels. If every row happens to be drawn, leaving no rows for testing, it falls back to train_test_split. Returns [training_set, testing_set, testing_set_labels].
    # File 'lib/hybridforest/utilities/utils.rb', line 35
    def self.train_test_bootstrap_split(dataset)
      dataset = to_dataframe(dataset)
      all_rows = (0...dataset.count).to_a

      train_set_rows = rand_nums(dataset.count, 0...dataset.count)
      train_set = dataset[train_set_rows]

      return train_test_split(dataset) if train_set_rows.sort == all_rows

      test_set = dataset[all_rows - train_set_rows]
      test_set, test_set_labels = test_set.disconnect_labels

      [train_set, test_set, test_set_labels]
    end
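The index bookkeeping behind the bootstrap split can be sketched on a plain array. The names `bootstrap_split_sketch` and `rng` are hypothetical, the private `rand_nums` helper is assumed to draw indices uniformly with replacement, and the label separation done by `disconnect_labels` is left out.

```ruby
# Bootstrap split over the row indices of a plain array.
# Assumption: rand_nums (not shown) draws indices uniformly with replacement.
# Label separation (disconnect_labels) is omitted from this sketch.
def bootstrap_split_sketch(rows, rng: Random.new)
  all_indices   = (0...rows.count).to_a
  train_indices = Array.new(rows.count) { rng.rand(rows.count) }
  oob_indices   = all_indices - train_indices # rows that were never drawn

  [rows.values_at(*train_indices), rows.values_at(*oob_indices)]
end

rows = (1..10).to_a
train, test = bootstrap_split_sketch(rows, rng: Random.new(42))
puts "train: #{train.size} rows, test: #{test.size} rows"
```

The training set always has as many rows as the input, usually with repeats, while the testing set holds only the rows that were left out. Unlike the library method, this sketch returns an empty testing set rather than falling back to another split when every row is drawn.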
.train_test_split(dataset, test_set_size = 0.20) ⇒ Object
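Partitions dataset into training and testing datasets, and splits the testing dataset into a dataframe of independent features and an array of labels. test_set_size is the fraction of rows assigned to the testing set (0.20 by default); the resulting row count is rounded down. Returns [training_set, testing_set, testing_set_labels].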
    # File 'lib/hybridforest/utilities/utils.rb', line 14
    def self.train_test_split(dataset, test_set_size = 0.20)
      # TODO: Offer stratify param
      dataset = to_dataframe(dataset)
      all_rows = (0...dataset.count).to_a

      test_set_count = (dataset.count * test_set_size).floor
      test_set_rows = rand_uniq_nums(test_set_count, 0...dataset.count)
      test_set = dataset[test_set_rows]
      test_set, test_set_labels = test_set.disconnect_labels

      train_set = dataset[all_rows - test_set_rows]

      [train_set, test_set, test_set_labels]
    end
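The partitioning arithmetic can be sketched on a plain array. Here `train_test_split_sketch` is a hypothetical name, `Array#sample` stands in for the private `rand_uniq_nums` helper (not shown in this section), and the label separation done by `disconnect_labels` is omitted.

```ruby
# Holdout split over the row indices of a plain array.
# Assumption: rand_uniq_nums (not shown) picks distinct indices, as
# Array#sample does here. Label separation is omitted from this sketch.
def train_test_split_sketch(rows, test_set_size = 0.20)
  all_indices   = (0...rows.count).to_a
  test_count    = (rows.count * test_set_size).floor # rounded down
  test_indices  = all_indices.sample(test_count)
  train_indices = all_indices - test_indices

  [rows.values_at(*train_indices), rows.values_at(*test_indices)]
end

rows = (1..10).to_a
train, test = train_test_split_sketch(rows)
puts "train: #{train.size} rows, test: #{test.size} rows" # 8 and 2 rows
```

Because the count is rounded down, a very small dataset can end up with an empty testing set: four rows at the default 0.20 give 0.8, which floors to 0.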