Module: Veritable::Util
- Defined in:
- lib/veritable/util.rb
Overview
Encapsulates utilities for working with data
Methods
- read_csv – reads a .csv from disk into an Array of row Hashes
- write_csv – writes an Array of row Hashes to disk as .csv
- split_rows – splits an Array of row Hashes into two sets
- make_schema – makes a new analysis schema from a schema rule
- validate_data – validates an Array of row Hashes against a schema
- clean_data – cleans an Array of row Hashes to conform to a schema
- validate_predictions – validates a single predictions request Hash against a schema
- clean_predictions – cleans a predictions request Hash to conform to a schema
- validate_schema – validates a schema
- check_id – checks that a unique ID is valid
- check_row – checks that a row Hash is well-formed
- check_datatype – checks that a datatype is valid
- query_params – helper function for HTTP form encoding
- make_table_id – autogenerates a new valid ID for a table
- make_analysis_id – autogenerates a new valid ID for an analysis
See also: dev.priorknowledge.com/docs/client/ruby
Class Method Summary
- .check_datatype(datatype, msg = nil) ⇒ Object
  Checks that a given datatype is valid.
- .check_id(id) ⇒ Object
  Checks that a unique ID is valid.
- .check_row(row) ⇒ Object
  Checks that a given row is well-formed.
- .clean_data(rows, schema, opts = {}) ⇒ Object
  Cleans up an Array of row Hashes in accordance with an analysis schema.
- .clean_predictions(predictions, schema, opts = {}) ⇒ Object
  Cleans up a predictions request in accordance with an analysis schema. Automatically renames '_id' to '_request_id'; otherwise assigns a new '_request_id' to each row.
- .make_analysis_id ⇒ Object
  Autogenerates a new analysis ID.
- .make_schema(schema_rule, opts = {}) ⇒ Object
  Makes a new analysis schema from a schema rule.
- .make_table_id ⇒ Object
  Autogenerates a new table ID.
- .query_params(params, parent = nil) ⇒ Object
  Helper function for HTTP form encoding.
- .read_csv(filename, id_col = nil, na_vals = ['']) ⇒ Object
  Reads a .csv with headers in as an Array of row Hashes.
- .split_rows(rows, frac) ⇒ Object
  Splits an Array of row Hashes into two sets.
- .validate_data(rows, schema) ⇒ Object
  Validates an Array of row Hashes against an analysis schema.
- .validate_predictions(predictions, schema) ⇒ Object
  Validates a predictions request against an analysis schema.
- .validate_schema(schema) ⇒ Object
  Validates a schema.
- .write_csv(rows, filename) ⇒ Object
  Writes an Array of row Hashes out to .csv.
Class Method Details
.check_datatype(datatype, msg = nil) ⇒ Object
Checks that a given datatype is valid.
Raises a VeritableError if the datatype is invalid.
# File 'lib/veritable/util.rb', line 96

def check_datatype(datatype, msg=nil)
  if not DATATYPES.include? datatype
    begin
      datatype.to_s
    rescue
      raise VeritableError.new("#{msg}Invalid data type.")
    else
      raise VeritableError.new("#{msg}Invalid data type '#{datatype}'.")
    end
  end
end
.check_id(id) ⇒ Object
Checks that a unique ID is valid
Raises a VeritableError if the ID is invalid.
# File 'lib/veritable/util.rb', line 54

def check_id(id)
  if not id.is_a? String
    begin
      id.to_s
    rescue
      raise VeritableError.new("Invalid id -- strings only.")
    else
      raise VeritableError.new("Invalid id '#{id}' -- strings only.")
    end
  elsif not id =~ Regexp.new('\A[a-zA-Z0-9][-_a-zA-Z0-9]*\z')
    raise VeritableError.new("Invalid id '#{id}' -- must contain only alphanumerics, underscores, and dashes.")
  elsif id[0] == '_' or id[0] == '-'
    raise VeritableError.new("Invalid id '#{id}' -- may not begin with a dash or underscore.")
  end
end
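The ID rule above can be exercised standalone. The following is a sketch using the same regex, not the library's API (valid_id? is a hypothetical helper for illustration):

```ruby
# Sketch of the ID rule enforced by check_id: strings of alphanumerics,
# underscores, and dashes that begin with an alphanumeric character.
ID_PATTERN = Regexp.new('\A[a-zA-Z0-9][-_a-zA-Z0-9]*\z')

def valid_id?(id)
  id.is_a?(String) && !(id =~ ID_PATTERN).nil?
end

puts valid_id?('table-1_a')  # well-formed
puts valid_id?('_hidden')    # begins with an underscore -- rejected
puts valid_id?('has space')  # contains a space -- rejected
```

Note that check_id itself raises a VeritableError rather than returning a boolean.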
.check_row(row) ⇒ Object
Checks that a given row is well-formed
Raises a VeritableError if the row Hash is not well-formed
# File 'lib/veritable/util.rb', line 73

def check_row(row)
  if not row.is_a? Hash
    begin
      row.to_s
    rescue
      raise VeritableError.new("Invalid row -- Must provide a hash of column name-value pairs.")
    else
      raise VeritableError.new("Invalid row #{row} -- Must provide a hash of column name-value pairs.")
    end
  elsif not row.has_key? '_id'
    raise VeritableError.new("Invalid row #{row} -- rows must contain unique row ids in the '_id' field.")
  else
    begin
      check_id row['_id']
    rescue VeritableError => e
      raise VeritableError.new("Invalid row #{row} -- #{e}")
    end
  end
end
.clean_data(rows, schema, opts = {}) ⇒ Object
Cleans up an Array of row Hashes in accordance with an analysis schema.

This method mutates its rows argument. If clean_data raises an exception, values in some rows may be converted while others are left in their original state.
Arguments

- rows – the Array of Hashes to clean up
- schema – a Schema specifying the types of the columns appearing in the rows being cleaned
- opts – a Hash optionally containing the keys:
  - convert_types – controls whether clean_data will attempt to convert cells in a column to be of the correct type (default: true)
  - remove_nones – controls whether clean_data will automatically remove cells containing the value nil (default: true)
  - remove_invalids – controls whether clean_data will automatically remove cells that are invalid for a given column (default: true)
  - reduce_categories – controls whether clean_data will automatically reduce the number of categories in categorical columns with too many categories (default: true). If true, the largest categories in a column will be preserved, up to the allowable limit, and the other categories will be binned as "Other".
  - assign_ids – controls whether clean_data will automatically assign new ids to the rows (default: false). If true, rows will be numbered sequentially. If the rows have an existing '_id' column, remove_extra_fields must also be set to true to avoid raising a Veritable::VeritableError.
  - remove_extra_fields – controls whether clean_data will automatically remove columns that are not contained in the schema (default: false). If assign_ids is true (default), will also remove the '_id' column.
  - rename_columns – an array of two-valued arrays [[old_col_1, new_col_1], [old_col_2, new_col_2], ...] of column names to rename. If rename_columns is false (default), no columns are renamed.

Raises

A Veritable::VeritableError containing further details if the data does not validate against the schema.

Returns

nil on success (mutates the rows argument)
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 250

def clean_data(rows, schema, opts={})
  validate(rows, schema, {
    'convert_types' => opts.has_key?('convert_types') ? opts['convert_types'] : true,
    'allow_nones' => false,
    'remove_nones' => opts.has_key?('remove_nones') ? opts['remove_nones'] : true,
    'remove_invalids' => opts.has_key?('remove_invalids') ? opts['remove_invalids'] : true,
    'reduce_categories' => opts.has_key?('reduce_categories') ? opts['reduce_categories'] : true,
    'has_ids' => '_id',
    'assign_ids' => opts.has_key?('assign_ids') ? opts['assign_ids'] : false,
    'allow_extra_fields' => true,
    'remove_extra_fields' => opts.has_key?('remove_extra_fields') ? opts['remove_extra_fields'] : false,
    'allow_empty_columns' => false,
    'rename_columns' => opts.has_key?('rename_columns') ? opts['rename_columns'] : false})
end
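The opts handling above follows a common Ruby defaulting pattern: has_key? is checked explicitly so that a caller-supplied false survives, where `opts[key] || default` would silently override it. A minimal standalone sketch of that pattern (fetch_opt is a hypothetical helper, not part of the library):

```ruby
# Explicit-default pattern used by clean_data's opts handling:
# a caller-supplied false must win over the default.
def fetch_opt(opts, key, default)
  opts.has_key?(key) ? opts[key] : default
end

opts = {'assign_ids' => false}
puts fetch_opt(opts, 'assign_ids', true)     # false -- caller's value wins
puts fetch_opt(opts, 'convert_types', true)  # true -- default applies
```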
.clean_predictions(predictions, schema, opts = {}) ⇒ Object
Cleans up a predictions request in accordance with an analysis schema. Automatically renames '_id' to '_request_id'; otherwise assigns a new '_request_id' to each row.

This method mutates its predictions argument. If clean_predictions raises an exception, values in some columns may be converted while others are left in their original state.
Arguments

- predictions – the predictions request to clean up
- schema – a Schema specifying the types of the columns appearing in the predictions request
- opts – a Hash optionally containing the keys:
  - convert_types – controls whether clean_predictions will attempt to convert cells in a column to be of the correct type (default: true)
  - remove_invalids – controls whether clean_predictions will automatically remove cells that are invalid for a given column (default: true)
  - remove_extra_fields – controls whether clean_predictions will automatically remove columns that are not contained in the schema (default: true)
  - rename_columns – an array of two-valued arrays [[old_col_1, new_col_1], [old_col_2, new_col_2], ...] of column names to rename. If rename_columns is false, no columns are renamed. (default: [['_id', '_request_id']])

Raises

A Veritable::VeritableError containing further details if the predictions request does not validate against the schema.

Returns

nil on success (mutates the predictions argument)
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 314

def clean_predictions(predictions, schema, opts={})
  validate(predictions, schema, {
    'convert_types' => opts.has_key?('convert_types') ? opts['convert_types'] : true,
    'allow_nones' => true,
    'remove_nones' => false,
    'remove_invalids' => opts.has_key?('remove_invalids') ? opts['remove_invalids'] : true,
    'reduce_categories' => false,
    'has_ids' => '_request_id',
    'assign_ids' => true,
    'allow_extra_fields' => false,
    'remove_extra_fields' => opts.has_key?('remove_extra_fields') ? opts['remove_extra_fields'] : true,
    'allow_empty_columns' => true,
    'rename_columns' => [['_id','_request_id']]})
end
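The default rename_columns setting amounts to moving each row's '_id' key to '_request_id'. A standalone sketch of what that rename does to a single predictions Hash (this is an illustration, not the library's internal implementation):

```ruby
# Sketch of the default rename step: '_id' becomes '_request_id',
# other keys (including nil cells to be predicted) are untouched.
prediction = {'_id' => 'row7', 'age' => nil, 'color' => 'blue'}

[['_id', '_request_id']].each do |old_col, new_col|
  if prediction.has_key?(old_col)
    prediction[new_col] = prediction.delete(old_col)
  end
end

p prediction
```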
.make_analysis_id ⇒ Object
Autogenerates a new analysis ID.

Users should not call this method directly.
# File 'lib/veritable/util.rb', line 40

def make_analysis_id; UUID.new.generate :compact ; end
.make_schema(schema_rule, opts = {}) ⇒ Object
Makes a new analysis schema from a schema rule
Arguments

- schema_rule – a Hash or Array of two-valued Arrays, whose keys or first values should be regexes to match against column names, and whose values should be the appropriate datatype to assign to matching columns, for instance: [['a_regex_to_match', {'type' => 'continuous'}], ['another_regex', {'type' => 'count'}], ...]
- opts – a Hash which must contain either:
  - the key 'headers', whose value should be an Array of column names from which to construct the schema
  - or the key 'rows', whose value should be an Array of row Hashes from whose columns the schema is to be constructed

Returns

A new Veritable::Schema
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 146

def make_schema(schema_rule, opts={})
  if ((not opts.has_key?('headers')) and (not opts.has_key?('rows')))
    raise VeritableError.new("Either 'headers' or 'rows' must be provided!")
  end
  headers = opts.has_key?('headers') ? opts['headers'] : nil
  if headers.nil?
    headers = Set.new
    opts['rows'].each {|row| headers.merge(row.keys)}
    headers = headers.to_a.sort
  end
  schema = {}
  headers.each do |c|
    schema_rule.each do |r, t|
      if r === c
        schema[c] = t
        break
      end
    end
  end
  return Veritable::Schema.new(schema)
end
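The matching loop above relies on Regexp#=== (case equality) to test each rule against each column name, and the first matching rule wins. That core loop can be run standalone, without constructing a Veritable::Schema (the column names and rules here are invented for illustration):

```ruby
# First-match-wins rule application, as in make_schema:
# each header gets the type of the first rule whose regex matches it.
schema_rule = [
  [/_count\z/, {'type' => 'count'}],
  [/.*/,       {'type' => 'continuous'}]  # catch-all rule, listed last
]
headers = ['age', 'visit_count']

schema = {}
headers.each do |c|
  schema_rule.each do |r, t|
    if r === c          # Regexp#=== returns true on a match
      schema[c] = t
      break             # stop at the first matching rule
    end
  end
end

p schema
```

Because rules are tried in order, more specific regexes should precede catch-alls.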
.make_table_id ⇒ Object
Autogenerates a new table ID.

Users should not call this method directly.
# File 'lib/veritable/util.rb', line 35

def make_table_id; UUID.new.generate :compact ; end
.query_params(params, parent = nil) ⇒ Object
Helper function for HTTP form encoding.

Users should not call this method directly.
# File 'lib/veritable/util.rb', line 45

def query_params(params, parent=nil)
  flatten_params(params).collect {|x|
    "#{x[0]}=#{x[1]}"
  }.join("&")
end
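For flat parameter Hashes, the output shape is the familiar k=v&k=v query string. Ruby's standard library produces the same shape via URI.encode_www_form, shown here only as an analogue; the private flatten_params helper (not documented here) additionally handles nested structures:

```ruby
require 'uri'

# Stdlib analogue of query_params for a flat Hash of string params.
# URI.encode_www_form also percent-escapes reserved characters.
puts URI.encode_www_form('limit' => '10', 'start' => 'a')  # limit=10&start=a
```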
.read_csv(filename, id_col = nil, na_vals = ['']) ⇒ Object
Reads a .csv with headers in as an Array of row Hashes
All values are kept as strings, except empty strings, which are omitted. To clean data and convert types in accordance with a given schema, use the clean_data and validate_data functions.
Arguments

- filename – a path to the .csv file to read in from
- id_col – optionally specifies the column to rename to '_id'. If nil (default) and a column named '_id' is present, that column is used. If nil and no '_id' column is present, then '_id' values will be automatically generated.
- na_vals – a list of string values to omit (default: [''])

Returns

An Array of row Hashes
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 205

def read_csv(filename, id_col=nil, na_vals=[''])
  rows = CSV.read(filename)
  header = rows.shift
  header = header.collect {|h| (h == id_col ? '_id' : h).strip}
  if header.include?('_id')
    id_col = '_id'
  end
  rid = 0
  rows = rows.collect do |raw_row|
    rid = rid + 1
    row = {}
    (0...raw_row.length).each do |i|
      row[header[i]] = ( na_vals.include?(raw_row[i]) ? nil : raw_row[i] )
    end
    if id_col.nil?
      row['_id'] = rid.to_s
    end
    row
  end
  return rows
end
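The row-Hash shape this produces can be demonstrated with only the standard CSV library. The sketch below mirrors the logic above on an invented two-column file: header names become Hash keys, values in na_vals become nil, and a sequential string '_id' is assigned when no id column exists:

```ruby
require 'csv'
require 'tempfile'

# Write a small .csv with no '_id' column, then read it back the way
# read_csv does.
file = Tempfile.new(['example', '.csv'])
file.write("age,color\n34,blue\n,green\n")
file.close

na_vals = ['']
rows = CSV.read(file.path)
header = rows.shift.collect {|h| h.strip}
rid = 0
rows = rows.collect do |raw_row|
  rid += 1
  row = {}
  (0...raw_row.length).each do |i|
    row[header[i]] = na_vals.include?(raw_row[i]) ? nil : raw_row[i]
  end
  row['_id'] = rid.to_s  # no id column present, so ids are generated
  row
end

p rows
```

Note that all surviving values are strings; type conversion is left to clean_data.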
.split_rows(rows, frac) ⇒ Object
Splits an array of row Hashes into two sets
Arguments
-
rows
– an Array of valid row Hashes -
frac
– the fraction of the rows to include in the first set
Returns
An array [train_dataset, test_dataset]
, each of whose members is an Array of row Hashes.
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 118

def split_rows(rows, frac)
  rows = rows.to_a
  n = rows.size
  inds = (0...n).to_a.shuffle
  border_ind = (n * frac).floor.to_i
  train_dataset = (0...border_ind).collect {|i| rows[inds[i]] }
  test_dataset = (border_ind...n).collect {|i| rows[inds[i]] }
  return [train_dataset, test_dataset]
end
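The split is a simple shuffle-and-cut: the first set receives floor(n * frac) rows and the second receives the remainder, with no row duplicated or dropped. The same logic can be run standalone (row contents invented for illustration):

```ruby
# Shuffle-and-cut split, as in split_rows: with frac = 0.8 and 10 rows,
# the first set gets floor(10 * 0.8) = 8 rows and the second gets 2.
def split_rows(rows, frac)
  rows = rows.to_a
  n = rows.size
  inds = (0...n).to_a.shuffle
  border = (n * frac).floor
  [(0...border).collect {|i| rows[inds[i]]},
   (border...n).collect {|i| rows[inds[i]]}]
end

rows = (1..10).collect {|i| {'_id' => i.to_s}}
train, test = split_rows(rows, 0.8)
puts train.size  # 8
puts test.size   # 2
```

Because the shuffle is random, membership of each set differs between runs; only the sizes are deterministic.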
.validate_data(rows, schema) ⇒ Object
Validates an Array of row Hashes against an analysis schema
Arguments

- rows – the Array of row Hashes to validate
- schema – a Schema specifying the types of the columns appearing in the rows being validated

Raises

A Veritable::VeritableError containing further details if the data does not validate against the schema.

Returns

nil on success
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 278

def validate_data(rows, schema)
  validate(rows, schema, {
    'convert_types' => false,
    'allow_nones' => false,
    'remove_nones' => false,
    'remove_invalids' => false,
    'reduce_categories' => false,
    'has_ids' => '_id',
    'assign_ids' => false,
    'allow_extra_fields' => true,
    'remove_extra_fields' => false,
    'allow_empty_columns' => false,
    'rename_columns' => false})
end
.validate_predictions(predictions, schema) ⇒ Object
Validates a predictions request against an analysis schema
Arguments

- predictions – the predictions request to validate
- schema – a Schema specifying the types of the columns appearing in the predictions request

Raises

A Veritable::VeritableError containing further details if the predictions request does not validate against the schema.

Returns

nil on success
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 342

def validate_predictions(predictions, schema)
  validate(predictions, schema, {
    'convert_types' => false,
    'allow_nones' => true,
    'remove_nones' => false,
    'remove_invalids' => false,
    'reduce_categories' => false,
    'has_ids' => '_request_id',
    'assign_ids' => false,
    'allow_extra_fields' => false,
    'remove_extra_fields' => false,
    'allow_empty_columns' => true,
    'rename_columns' => false})
end
.validate_schema(schema) ⇒ Object
Validates a schema
Checks that a Veritable::Schema or Hash of the appropriate form is well-formed.
# File 'lib/veritable/util.rb', line 131

def validate_schema(schema); schema.is_a? Veritable::Schema ? schema.validate : Veritable::Schema.new(schema).validate; end
.write_csv(rows, filename) ⇒ Object
Writes an Array of row Hashes out to .csv
Arguments
-
rows
– an Array of valid row Hashes -
filename
– a path to the .csv file to write out
Returns
nil
on success.
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 178

def write_csv(rows, filename)
  headers = Set.new
  rows.each {|row| headers.merge(row.keys)}
  headers = headers.to_a.sort
  CSV.open(filename, "w") do |csv|
    csv << headers
    rows.each do |row|
      out_row = headers.collect {|h| row.keys.include?(h) ? row[h] : ''}
      csv << out_row
    end
  end
  nil
end