Class: GeneValidator::LengthClusterValidation
- Inherits: ValidationTest (Object → ValidationTest → GeneValidator::LengthClusterValidation)
- Defined in: lib/genevalidator/validation_length_cluster.rb
Overview
This class contains the methods needed to validate the length of a prediction by clustering the lengths of its BLAST hits.
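To make the idea of clustering hit lengths concrete, here is a minimal sketch of one way to group 1D values: sort them and start a new cluster whenever the gap to the previous value exceeds a threshold, which is equivalent to cutting a single-linkage hierarchy at that distance. This is an illustration only. `cluster_lengths` and `max_gap` are hypothetical names, and the `HierarchicalClusterization` class used by this validation may rely on a different linkage or stopping rule.

```ruby
# Illustrative only: groups sorted lengths into clusters, starting a new
# cluster whenever the gap to the previous length exceeds max_gap.
# `cluster_lengths` and `max_gap` are hypothetical, not GeneValidator API.
def cluster_lengths(lengths, max_gap)
  lengths.sort.slice_when { |a, b| b - a > max_gap }.to_a
end

hit_lengths = [310, 100, 305, 102, 300, 900]
p cluster_lengths(hit_lengths, 20)
# prints [[100, 102], [300, 305, 310], [900]]
```

In this toy input the middle group holds the most hits in the narrowest range, which is the kind of cluster the validation treats as the main one.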
Instance Attribute Summary
- #clusters ⇒ Object (readonly)
  Returns the value of attribute clusters.
- #max_density_cluster ⇒ Object (readonly)
  Returns the value of attribute max_density_cluster.
Attributes inherited from ValidationTest
#cli_name, #description, #header, #hits, #prediction, #run_time, #short_header, #type, #validation_report
Instance Method Summary
- #clusterization_by_length(_debug = false, lst = @hits, predicted_seq = @prediction) ⇒ Object
  Clusters a list of sequences by length. Params: _debug (optional): true to display debug information, false by default; lst: array of Sequence objects; predicted_seq: Sequence object. Output: an array of Cluster objects and the index of the most dense cluster.
- #initialize(prediction, hits) ⇒ LengthClusterValidation (constructor)
  Initializes the object. Params: type: type of the predicted sequence (:nucleotide or :protein); prediction: a Sequence object representing the BLAST query; hits: a vector of Sequence objects (representing BLAST hits); filename: String with the name of the FASTA file.
- #plot_histo_clusters(output = "#{@plot_path}_len_clusters.json", clusters = @clusters, max_density_cluster = @max_density_cluster, prediction = @prediction) ⇒ Object
  Generates a JSON file containing the data used for plotting the histogram of the length distribution, given a list of Cluster objects. Params: output: plot_path where to save the graph; clusters: array of Cluster objects; max_density_cluster: index of the most dense cluster; prediction: Sequence object. Output: Plot object.
- #run ⇒ Object
  Validates the length of the predicted gene by comparing the length of the prediction to the most dense cluster. The most dense cluster is obtained by hierarchical clusterization. Plots are generated if required (see the plot variable). Output: LengthClusterValidationOutput object.
Constructor Details
#initialize(prediction, hits) ⇒ LengthClusterValidation
Initializes the object.
Params:
type: type of the predicted sequence (:nucleotide or :protein)
prediction: a Sequence object representing the BLAST query
hits: a vector of Sequence objects (representing BLAST hits)
filename: String with the name of the FASTA file
# File 'lib/genevalidator/validation_length_cluster.rb', line 80

def initialize(prediction, hits)
  super
  @short_header = 'LengthCluster'
  @header       = 'Length Cluster'
  @description  = 'Check whether the prediction length fits most of the' \
                  ' BLAST hit lengths, by 1D hierarchical clusterization.' \
                  ' Meaning of the output displayed: Query_length' \
                  ' [Main Cluster Length Interval]'
  @cli_name     = 'lenc'
end
Instance Attribute Details
#clusters ⇒ Object (readonly)
Returns the value of attribute clusters.
# File 'lib/genevalidator/validation_length_cluster.rb', line 70

def clusters
  @clusters
end
#max_density_cluster ⇒ Object (readonly)
Returns the value of attribute max_density_cluster.
# File 'lib/genevalidator/validation_length_cluster.rb', line 71

def max_density_cluster
  @max_density_cluster
end
Instance Method Details
#clusterization_by_length(_debug = false, lst = @hits, predicted_seq = @prediction) ⇒ Object
Clusters a list of sequences by length.
Params:
_debug (optional) - true to display debug information, false by default
lst - array of Sequence objects
predicted_seq - Sequence object
Output:
- output 1: array of Cluster objects
- output 2: the index of the most dense cluster
# File 'lib/genevalidator/validation_length_cluster.rb', line 144

def clusterization_by_length(_debug = false, lst = @hits,
                             predicted_seq = @prediction)
  fail TypeError unless lst[0].is_a?(Sequence) &&
                        predicted_seq.is_a?(Sequence)

  contents = lst.map { |x| x.length_protein.to_i }.sort { |a, b| a <=> b }

  hc = HierarchicalClusterization.new(contents)
  clusters = hc.hierarchical_clusterization

  max_density             = 0
  max_density_cluster_idx = 0
  clusters.each_with_index do |item, i|
    next unless item.density > max_density
    max_density             = item.density
    max_density_cluster_idx = i
  end

  [clusters, max_density_cluster_idx]
rescue TypeError => error
  error_location = error.backtrace[0].scan(%r{([^/]+:\d+):.*})[0][0]
  $stderr.puts "Type error at #{error_location}."
  $stderr.puts ' Possible cause: one of the arguments of the' \
               ' "clusterization_by_length" method has not the proper type.'
  exit 1
end
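The selection of the most dense cluster can be tried in isolation. The sketch below mirrors the `each_with_index` loop from the listing, using a hypothetical `DemoCluster` struct in place of the real Cluster class; only the `density` reader is modelled, and the density values are made up for illustration.

```ruby
# Hypothetical stand-in for GeneValidator's Cluster class; only the
# `density` reader used by the selection loop is modelled here.
DemoCluster = Struct.new(:lengths, :density)

# Returns the index of the first cluster whose density is strictly higher
# than every density seen so far, starting from a baseline of 0.
def densest_cluster_index(clusters)
  best_density = 0
  best_idx = 0
  clusters.each_with_index do |cluster, i|
    next unless cluster.density > best_density
    best_density = cluster.density
    best_idx = i
  end
  best_idx
end

clusters = [
  DemoCluster.new({ 100 => 1, 105 => 1 }, 0.4),
  DemoCluster.new({ 300 => 3, 302 => 2 }, 2.5),
  DemoCluster.new({ 900 => 1 }, 1.0)
]
puts densest_cluster_index(clusters) # prints 1
```

Because the comparison is strict, ties keep the earlier cluster, and a list in which no density exceeds 0 yields index 0.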
#plot_histo_clusters(output = "#{@plot_path}_len_clusters.json", clusters = @clusters, max_density_cluster = @max_density_cluster, prediction = @prediction) ⇒ Object
Generates a JSON file containing the data used for plotting the histogram of the length distribution, given a list of Cluster objects.
Params:
output: plot_path where to save the graph
clusters: array of Cluster objects
max_density_cluster: index of the most dense cluster
prediction: Sequence object
Output: Plot object
# File 'lib/genevalidator/validation_length_cluster.rb', line 182

def plot_histo_clusters(output = "#{@plot_path}_len_clusters.json",
                        clusters = @clusters,
                        max_density_cluster = @max_density_cluster,
                        prediction = @prediction)
  data = clusters.each_with_index.map { |cluster, i|
    cluster.lengths.collect { |k, v|
      { 'key' => k, 'value' => v, 'main' => (i == max_density_cluster) }
    }
  }

  Plot.new(data,
           :bars,
           'Length Cluster Validation: Distribution of BLAST hit lengths',
           'Query Sequence, black;Most Dense Cluster,red;Other Hits, blue',
           'Sequence Length',
           'Number of Sequences',
           prediction.length_protein)
end
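The shape of the `data` structure built above is easier to see with a small input. The following sketch reproduces only the nested mapping, with a hypothetical `HistoCluster` struct standing in for Cluster; it assumes `lengths` maps a sequence length to the number of hits of that length, which the `|k, v|` block parameters suggest but the listing does not confirm.

```ruby
require 'json'

# Hypothetical stand-in for a Cluster; `lengths` is assumed to map a
# sequence length to the number of hits with that length.
HistoCluster = Struct.new(:lengths)

# Builds one array of bars per cluster and flags the bars that belong to
# the main (most dense) cluster.
def histogram_data(clusters, main_idx)
  clusters.each_with_index.map do |cluster, i|
    cluster.lengths.map do |length, count|
      { 'key' => length, 'value' => count, 'main' => (i == main_idx) }
    end
  end
end

clusters = [HistoCluster.new({ 100 => 1, 102 => 2 }),
            HistoCluster.new({ 300 => 4 })]
puts JSON.pretty_generate(histogram_data(clusters, 1))
```

Each inner array corresponds to one cluster, so a plotting front end can colour the bars flagged with `'main' => true` differently from the other hits.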
#run ⇒ Object
Validates the length of the predicted gene by comparing the length of the prediction to the most dense cluster. The most dense cluster is obtained by hierarchical clusterization. Plots are generated if required (see the plot variable).
Output: LengthClusterValidationOutput object
# File 'lib/genevalidator/validation_length_cluster.rb', line 98

def run
  fail NotEnoughHitsError unless hits.length >= 5
  fail Exception unless prediction.is_a?(Sequence) &&
                        hits[0].is_a?(Sequence)

  start = Time.now
  # get [clusters, max_density_cluster_idx]
  clusterization = clusterization_by_length

  @clusters = clusterization[0]
  @max_density_cluster = clusterization[1]
  limits = @clusters[@max_density_cluster].get_limits
  query_length = @prediction.length_protein

  @validation_report = LengthClusterValidationOutput.new(@short_header,
                                                         @header,
                                                         @description,
                                                         query_length,
                                                         limits)
  plot1 = plot_histo_clusters
  @validation_report.plot_files.push(plot1)

  @validation_report.run_time = Time.now - start

  @validation_report

rescue NotEnoughHitsError
  @validation_report = ValidationReport.new('Not enough evidence', :warning,
                                            @short_header, @header,
                                            @description)
rescue Exception
  @validation_report = ValidationReport.new('Unexpected error', :error,
                                            @short_header, @header,
                                            @description)
  @validation_report.errors.push 'Unexpected Error'
end
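The listing passes `query_length` and `limits` to LengthClusterValidationOutput but does not show how the two are compared. A plausible reading, given the "Query_length [Main Cluster Length Interval]" description, is an inclusive interval check. The sketch below illustrates that reading only: `length_within_limits?` is a hypothetical helper, and it assumes `get_limits` returns a two-element `[min_length, max_length]` array.

```ruby
# Hypothetical helper, not part of GeneValidator. Assumes `limits` is a
# two-element array [min_length, max_length] and that both bounds are
# inclusive.
def length_within_limits?(query_length, limits)
  min_length, max_length = limits
  query_length.between?(min_length, max_length)
end

puts length_within_limits?(304, [300, 310]) # prints true
puts length_within_limits?(120, [300, 310]) # prints false
```

Note that `run` itself only reaches this comparison when at least five hits are available; with fewer hits it returns a "Not enough evidence" warning report instead.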