Class: Nimbus::RegressionTree
 Defined in:
 lib/nimbus/regression_tree.rb
Overview
Tree object representing a random regression tree.
A tree is generated following this steps:

1: Calculate loss function for the individuals in the node (first node contains all the individuals).

2: Take a random sample of the SNPs (size m << total count of SNPs)

3: Compute the loss function (quadratic loss) for the split of the sample based on value of every SNP.

4: If the SNP with minimum loss function also minimizes the general loss of the node, split the individuals sample in two nodes, based on average value for that SNP [0,1], or [0]

5: Repeat from 1 for every node until:

a) The individuals count in that node is < minimum size OR

b) None of the SNP splits has a loss function smaller than the node loss function


6) When a node stops, label the node with the average fenotype value of the individuals in the node.
Constant Summary
Constants inherited from Tree
Tree::NODE_SPLIT_01_2, Tree::NODE_SPLIT_0_12
Instance Attribute Summary
Attributes inherited from Tree
#generalization_error, #id_to_fenotype, #importances, #individuals, #node_min_size, #predictions, #snp_sample_size, #snp_total_count, #structure, #used_snps
Instance Method Summary collapse

#build_node(individuals_ids, y_hat) ⇒ Object
Creates a node by taking a random sample of the SNPs and computing the loss function for every split by SNP of that sample.

#estimate_importances(oob_ids) ⇒ Object
Estimation of importance for every SNP.

#generalization_error_from_oob(oob_ids) ⇒ Object
Compute generalization error for the tree.

#seed(all_individuals, individuals_sample, ids_fenotypes) ⇒ Object
Creates the structure of the tree, as a hash of SNP splits and values.
Methods inherited from Tree
Constructor Details
This class inherits a constructor from Nimbus::Tree
Instance Method Details
#build_node(individuals_ids, y_hat) ⇒ Object
Creates a node by taking a random sample of the SNPs and computing the loss function for every split by SNP of that sample.

If SNP_min is the SNP with smaller loss function and it is < the loss function of the node, it splits the individuals sample in two:
(the average of the 0,1,2 values for the SNP_min in the individuals is computed, and they are splitted in [<=avg], [>avg]) then it builds these 2 new nodes.

Otherwise every individual in the node gets labeled with the average of the fenotype values of all of them.
33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 
# File 'lib/nimbus/regression_tree.rb', line 33 def build_node(individuals_ids, y_hat) # General loss function value for the node individuals_count = individuals_ids.size return label_node(y_hat, individuals_ids) if individuals_count < @node_min_size node_loss_function = Nimbus::LossFunctions.quadratic_loss individuals_ids, @id_to_fenotype, y_hat # Finding the SNP that minimizes loss function snps = snps_random_sample min_loss, min_SNP, split, split_type, means = node_loss_function, nil, nil, nil, nil snps.each do snp individuals_split_by_snp_value, node_split_type = split_by_snp_avegare_value individuals_ids, snp mean_0 = Nimbus::LossFunctions.average individuals_split_by_snp_value[0], @id_to_fenotype mean_1 = Nimbus::LossFunctions.average individuals_split_by_snp_value[1], @id_to_fenotype loss_0 = Nimbus::LossFunctions.mean_squared_error individuals_split_by_snp_value[0], @id_to_fenotype, mean_0 loss_1 = Nimbus::LossFunctions.mean_squared_error individuals_split_by_snp_value[1], @id_to_fenotype, mean_1 loss_snp = (loss_0 + loss_1) / individuals_count min_loss, min_SNP, split, split_type, means = loss_snp, snp, individuals_split_by_snp_value, node_split_type, [mean_0, mean_1] if loss_snp < min_loss end return build_branch(min_SNP, split, split_type, means, y_hat) if min_loss < node_loss_function return label_node(y_hat, individuals_ids) end 
#estimate_importances(oob_ids) ⇒ Object
Estimation of importance for every SNP.
The importance of any SNP in the tree is calculated using the OOB sample. For every SNP, every individual in the sample is pushed down the tree but with the value of that SNP permuted with other individual in the sample.
That way the difference between the regular prediction and the prediction with the SNP value modified can be estimated for any given SNP.
This method computes importance estimations for every SNPs used in the tree (for any other SNP it would be 0).
83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 
# File 'lib/nimbus/regression_tree.rb', line 83 def estimate_importances(oob_ids) return nil if (@generalization_error.nil? && generalization_error_from_oob(oob_ids).nil?) oob_individuals_count = oob_ids.size @importances = {} @used_snps.uniq.each do current_snp shuffled_ids = oob_ids.shuffle permutated_snp_error = 0.0 oob_ids.each_with_index {oobi, index permutated_prediction = traverse_with_permutation @structure, individuals[oobi].snp_list, current_snp, individuals[shuffled_ids[index]].snp_list permutated_snp_error += Nimbus::LossFunctions.squared_difference @id_to_fenotype[oobi], permutated_prediction } @importances[current_snp] = ((permutated_snp_error / oob_individuals_count)  @generalization_error).round(5) end @importances end 
#generalization_error_from_oob(oob_ids) ⇒ Object
Compute generalization error for the tree.
Traversing the ‘out of bag’ (OOB) sample (those individuals of the training set not used in the building of this tree) through the tree, and comparing the prediction with the real fenotype of the individual (and then averaging) is possible to calculate the unbiased generalization error for the tree.
64 65 66 67 68 69 70 71 72 
# File 'lib/nimbus/regression_tree.rb', line 64 def generalization_error_from_oob(oob_ids) return nil if (@structure.nil?  @individuals.nil?  @id_to_fenotype.nil?) oob_errors = {} oob_ids.each do oobi oob_prediction = Tree.traverse @structure, individuals[oobi].snp_list oob_errors[oobi] = Nimbus::LossFunctions.squared_difference oob_prediction, @id_to_fenotype[oobi] end @generalization_error = Nimbus::LossFunctions.average oob_ids, oob_errors end 
#seed(all_individuals, individuals_sample, ids_fenotypes) ⇒ Object
Creates the structure of the tree, as a hash of SNP splits and values.
It just initializes the needed variables and then defines the first node of the tree. The rest of the structure of the tree is computed recursively building every node calling build_node
.
23 24 25 26 
# File 'lib/nimbus/regression_tree.rb', line 23 def seed(all_individuals, individuals_sample, ids_fenotypes) super @structure = build_node individuals_sample, Nimbus::LossFunctions.average(individuals_sample, @id_to_fenotype) end 