Module: Statsample::Codification
- Defined in:
- lib/statsample/codification.rb
Overview
This module aids in coding open-ended questions:
- Select one or more vectors of a dataset to create a YAML file, in which each vector is represented as a hash whose keys and values are the vector's factors. If a value contains Statsample::SPLIT_TOKEN, it is split into two or more hash keys.
- Edit the YAML file and replace the hash values with your codes. If you need to assign two or more codes to one answer, join them with the separator (default Statsample::SPLIT_TOKEN).
- Recode the vectors, loading the YAML file:
- recode_dataset_simple!(): each new vector has the same name as the original plus "_recoded"
- recode_dataset_split!(): creates as many vectors as there are values. See Vector.add_vectors_by_split() for arguments
-
Usage:
recode_file="recodification.yaml"
phase=:first # flag
if phase==:first
File.open(recode_file,"w") {|fp|
Statsample::Codification.create_yaml(ds, %w{vector1 vector2}, fp, ",")
}
# Edit the file recodification.yaml and verify changes
elsif phase==:second
File.open(recode_file,"r") {|fp|
Statsample::Codification.verify(YAML.load(fp), [:vector1])
}
# Add new vectors to the dataset
elsif phase==:third
File.open(recode_file,"r") {|fp|
Statsample::Codification.recode_dataset_split!(ds, YAML.load(fp), "*")
}
end
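After the first phase, recodification.yaml holds one hash per vector whose keys equal its values; during editing you replace each value with your code. A hypothetical edited file (the codes and the "*" separator are invented for illustration) might look like:

```yaml
---
:vector1:
  dog: animal
  rose: plant*flower
  stone: mineral
```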
Class Method Summary
- ._recode_dataset(dataset, h, sep = Statsample::SPLIT_TOKEN, split = false) ⇒ Object
-
.create_excel(dataset, vectors, filename, sep = Statsample::SPLIT_TOKEN) ⇒ Object
Create an Excel file for building a dictionary, based on vectors.
-
.create_hash(dataset, vectors, sep = Statsample::SPLIT_TOKEN) ⇒ Object
Create a hash, based on vectors, to create the dictionary.
-
.create_yaml(dataset, vectors, io = nil, sep = Statsample::SPLIT_TOKEN) ⇒ Object
Create a YAML dictionary based on vectors. The keys are the vector names in the dataset and the values are hashes with keys = values, ready for recodification.
- .dictionary(h, sep = Statsample::SPLIT_TOKEN) ⇒ Object
-
.excel_to_recoded_hash(filename) ⇒ Object
From an Excel file, generates a dictionary hash to use with recode_dataset_simple!() or recode_dataset_split!().
- .inverse_hash(h, sep = Statsample::SPLIT_TOKEN) ⇒ Object
- .recode_dataset_simple!(dataset, dictionary_hash, sep = Statsample::SPLIT_TOKEN) ⇒ Object
- .recode_dataset_split!(dataset, dictionary_hash, sep = Statsample::SPLIT_TOKEN) ⇒ Object
- .recode_vector(v, h, sep = Statsample::SPLIT_TOKEN) ⇒ Object
- .verify(h, v_names = nil, sep = Statsample::SPLIT_TOKEN, io = $>) ⇒ Object
Class Method Details
._recode_dataset(dataset, h, sep = Statsample::SPLIT_TOKEN, split = false) ⇒ Object
# File 'lib/statsample/codification.rb', line 145

def _recode_dataset(dataset, h, sep=Statsample::SPLIT_TOKEN, split=false)
  v_names = h.keys
  v_names.each do |v_name|
    raise Exception, "Vector #{v_name} doesn't exist on Dataset" if !dataset.vectors.include? v_name
    recoded = Daru::Vector.new(
      recode_vector(dataset[v_name], h[v_name], sep).collect do |c|
        if c.nil?
          nil
        else
          c.join(sep)
        end
      end
    )
    if split
      recoded.split_by_separator(sep).each { |k, v|
        dataset[(v_name.to_s + "_" + k).to_sym] = v
      }
    else
      dataset[(v_name.to_s + "_recoded").to_sym] = recoded
    end
  end
end
.create_excel(dataset, vectors, filename, sep = Statsample::SPLIT_TOKEN) ⇒ Object
Create an Excel file for building a dictionary, based on vectors. Raises an error if filename already exists. Each row holds three columns:
- field: name of the vector
- original: original value
- recoded: new code
# File 'lib/statsample/codification.rb', line 76

def create_excel(dataset, vectors, filename, sep=Statsample::SPLIT_TOKEN)
  require 'spreadsheet'
  if File.exist?(filename)
    raise "A file named #{filename} already exists. Delete it before overwriting."
  end
  book  = Spreadsheet::Workbook.new
  sheet = book.create_worksheet
  sheet.row(0).concat(%w(field original recoded))
  i = 1
  create_hash(dataset, vectors, sep).sort.each do |field, inner_hash|
    inner_hash.sort.each do |k, v|
      sheet.row(i).concat([field.to_s, k.to_s, v.to_s])
      i += 1
    end
  end
  book.write(filename)
end
.create_hash(dataset, vectors, sep = Statsample::SPLIT_TOKEN) ⇒ Object
Create a hash, based on vectors, to build the dictionary. The keys are the vector names in the dataset and the values are hashes with keys = values, ready for recodification.
# File 'lib/statsample/codification.rb', line 35

def create_hash(dataset, vectors, sep=Statsample::SPLIT_TOKEN)
  raise ArgumentError, "Array shouldn't be empty" if vectors.size == 0
  pro_hash = vectors.inject({}) do |h, v_name|
    v_name = v_name.is_a?(Numeric) ? v_name : v_name.to_sym
    raise Exception, "Vector #{v_name} doesn't exist on Dataset" if !dataset.vectors.include?(v_name)
    v = dataset[v_name]
    split_data = v.splitted(sep)
                  .flatten
                  .collect { |c| c.to_s }
                  .find_all { |c| !c.nil? }
    factors = split_data.uniq
                        .compact
                        .sort
                        .inject({}) { |ac, val| ac[val] = val; ac }
    h[v_name] = factors
    h
  end
  pro_hash
end
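The core of the factor extraction can be sketched in plain Ruby, without Daru (the data and the "," separator are invented for illustration):

```ruby
# Each entry is one answer; an answer may hold several factors joined by a separator.
data = ["a,b", "b,c", "d"]
sep  = ","

# Split every answer, flatten, and build an identity hash of the unique factors,
# mirroring what create_hash stores for each vector.
factors = data.flat_map { |c| c.split(sep) }
              .uniq.sort
              .inject({}) { |ac, val| ac[val] = val; ac }
# factors => {"a"=>"a", "b"=>"b", "c"=>"c", "d"=>"d"}
```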
.create_yaml(dataset, vectors, io = nil, sep = Statsample::SPLIT_TOKEN) ⇒ Object
Create a YAML dictionary based on vectors. The keys are the vector names in the dataset and the values are hashes with keys = values, ready for recodification.
v1 = Daru::Vector.new(%w{a,b b,c d})
ds = Daru::DataFrame.new({:v1 => v1})
Statsample::Codification.create_yaml(ds,[:v1])
=> "--- \nv1: \n a: a\n b: b\n c: c\n d: d\n"
# File 'lib/statsample/codification.rb', line 65

def create_yaml(dataset, vectors, io=nil, sep=Statsample::SPLIT_TOKEN)
  pro_hash = create_hash(dataset, vectors, sep)
  YAML.dump(pro_hash, io)
end
.dictionary(h, sep = Statsample::SPLIT_TOKEN) ⇒ Object
# File 'lib/statsample/codification.rb', line 123

def dictionary(h, sep=Statsample::SPLIT_TOKEN)
  h.inject({}) { |a, v| a[v[0]] = v[1].split(sep); a }
end
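The transformation dictionary performs can be shown with a small standalone example (hash contents and the "*" separator are invented for illustration):

```ruby
sep = "*"
# An edited codification hash: each original answer maps to its recoded value(s).
h = { "dog" => "animal", "rose" => "plant*flower" }

# dictionary splits each recoded value on the separator, so one answer
# can map to several codes.
dict = h.inject({}) { |a, v| a[v[0]] = v[1].split(sep); a }
# dict => {"dog"=>["animal"], "rose"=>["plant", "flower"]}
```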
.excel_to_recoded_hash(filename) ⇒ Object
From an Excel file, generates a dictionary hash to use with recode_dataset_simple!() or recode_dataset_split!().
# File 'lib/statsample/codification.rb', line 97

def excel_to_recoded_hash(filename)
  require 'spreadsheet'
  h = {}
  book  = Spreadsheet.open filename
  sheet = book.worksheet 0
  row_i = 0
  sheet.each do |row|
    row_i += 1
    next if row_i == 1 or row[0].nil? or row[1].nil? or row[2].nil?
    key = row[0].to_sym
    h[key] ||= {}
    h[key][row[1]] = row[2]
  end
  h
end
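The row-to-hash logic can be exercised without the spreadsheet gem by feeding it plain arrays (the rows here are invented for illustration; the first row is the header and is skipped):

```ruby
rows = [
  ["field", "original", "recoded"],  # header row, skipped
  ["v1", "dog",  "animal"],
  ["v1", "rose", "flower"],
  ["v2", "oak",  "tree"]
]

# Group the (original => recoded) pairs under each field name, as symbols.
h = {}
rows.each_with_index do |row, i|
  next if i == 0 || row[0].nil? || row[1].nil? || row[2].nil?
  key = row[0].to_sym
  h[key] ||= {}
  h[key][row[1]] = row[2]
end
# h => {:v1=>{"dog"=>"animal", "rose"=>"flower"}, :v2=>{"oak"=>"tree"}}
```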
.inverse_hash(h, sep = Statsample::SPLIT_TOKEN) ⇒ Object
# File 'lib/statsample/codification.rb', line 113

def inverse_hash(h, sep=Statsample::SPLIT_TOKEN)
  h.inject({}) do |a, v|
    v[1].split(sep).each do |val|
      a[val] ||= []
      a[val].push(v[0])
    end
    a
  end
end
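inverse_hash maps each code back to the answers that produce it, which is what verify uses to report frequencies. A standalone sketch (data and separator invented for illustration):

```ruby
sep = "*"
h = { "dog" => "animal", "cat" => "animal*pet" }

# Invert the mapping: code => list of original answers assigned that code.
inverse = h.inject({}) do |a, v|
  v[1].split(sep).each do |val|
    a[val] ||= []
    a[val].push(v[0])
  end
  a
end
# inverse => {"animal"=>["dog", "cat"], "pet"=>["cat"]}
```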
.recode_dataset_simple!(dataset, dictionary_hash, sep = Statsample::SPLIT_TOKEN) ⇒ Object
# File 'lib/statsample/codification.rb', line 138

def recode_dataset_simple!(dataset, dictionary_hash, sep=Statsample::SPLIT_TOKEN)
  _recode_dataset(dataset, dictionary_hash, sep, false)
end
.recode_dataset_split!(dataset, dictionary_hash, sep = Statsample::SPLIT_TOKEN) ⇒ Object
# File 'lib/statsample/codification.rb', line 141

def recode_dataset_split!(dataset, dictionary_hash, sep=Statsample::SPLIT_TOKEN)
  _recode_dataset(dataset, dictionary_hash, sep, true)
end
.recode_vector(v, h, sep = Statsample::SPLIT_TOKEN) ⇒ Object
# File 'lib/statsample/codification.rb', line 127

def recode_vector(v, h, sep=Statsample::SPLIT_TOKEN)
  dict = dictionary(h, sep)
  new_data = v.splitted(sep)
  new_data.collect do |c|
    if c.nil?
      nil
    else
      c.collect { |value| dict[value] }.flatten.uniq
    end
  end
end
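The recoding of a single vector can be mimicked on a plain array (the dictionary and data are invented for illustration; nil marks a missing answer):

```ruby
sep  = "*"
# Already-split dictionary, as Codification.dictionary would return it.
dict = { "dog" => ["animal"], "rose" => ["plant", "flower"] }

# Answers already split on the separator, as Vector#splitted would return them.
splitted = [["dog", "rose"], nil, ["rose"]]

# Map each answer's parts through the dictionary, flatten, and deduplicate.
recoded = splitted.collect do |c|
  if c.nil?
    nil
  else
    c.collect { |value| dict[value] }.flatten.uniq
  end
end
# recoded => [["animal", "plant", "flower"], nil, ["plant", "flower"]]
```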
.verify(h, v_names = nil, sep = Statsample::SPLIT_TOKEN, io = $>) ⇒ Object
# File 'lib/statsample/codification.rb', line 169

def verify(h, v_names=nil, sep=Statsample::SPLIT_TOKEN, io=$>)
  require 'pp'
  v_names ||= h.keys
  v_names.each { |v_name|
    inverse = inverse_hash(h[v_name], sep)
    io.puts "- Field: #{v_name}"
    inverse.sort { |a, b| -(a[1].count <=> b[1].count) }.each { |k, v|
      io.puts "  - \"#{k}\" (#{v.count}) :\n    -'" + v.join("\n    -'") + "'"
    }
  }
end