Module: Veritable::Util

Defined in:
lib/veritable/util.rb

Overview

Encapsulates utilities for working with data

Methods

  • read_csv – reads a .csv from disk into an Array of row Hashes

  • write_csv – writes an Array of row Hashes to disk as .csv

  • split_rows – splits an Array of row Hashes into two sets

  • make_schema – makes a new analysis schema from a schema rule

  • validate_data – validates an Array of row Hashes against a schema

  • clean_data – cleans an Array of row Hashes to conform to a schema

  • validate_predictions – validates a single predictions request Hash against a schema

  • clean_predictions – cleans a predictions request Hash to conform to a schema

  • validate_schema – validates a schema

  • check_id – checks that a unique ID is valid

  • check_row – checks that a row Hash is well-formed

  • check_datatype – checks that a datatype is valid

  • query_params – helper function for HTTP form encoding

  • make_table_id – autogenerates a new valid ID for a table

  • make_analysis_id – autogenerates a new valid ID for an analysis

See also: dev.priorknowledge.com/docs/client/ruby

Class Method Details

.check_datatype(datatype, msg = nil) ⇒ Object

Checks that a given datatype is valid

Raises a VeritableError if the datatype is invalid.



# File 'lib/veritable/util.rb', line 96

def check_datatype(datatype, msg=nil)
  if not DATATYPES.include? datatype
    begin
      datatype.to_s
    rescue
      raise VeritableError.new("#{msg}Invalid data type.")
    else
      raise VeritableError.new("#{msg}Invalid data type '#{datatype}'.")
    end
  end
end

.check_id(id) ⇒ Object

Checks that a unique ID is valid

Raises a VeritableError if the ID is invalid.



# File 'lib/veritable/util.rb', line 54

def check_id(id)
  if not id.is_a? String
    begin
      id.to_s
    rescue
      raise VeritableError.new("Invalid id -- strings only.")
    else
      raise VeritableError.new("Invalid id '#{id}' -- strings only.")
    end
  elsif not id =~ Regexp.new('\A[a-zA-Z0-9][-_a-zA-Z0-9]*\z')
    raise VeritableError.new("Invalid id '#{id}' -- must contain only alphanumerics, underscores, and dashes.")
  elsif id[0] == '_' or id[0] == '-'
    raise VeritableError.new("Invalid id '#{id}' -- may not begin with a dash or underscore.")
  end
end
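
The rule enforced by check_id can be tried out without the gem. The following is a minimal sketch that assumes only the regular expression shown in the listing above; valid_id? is a hypothetical helper, not part of Veritable::Util, and it returns a boolean instead of raising a VeritableError.

```ruby
# Pattern copied from the check_id listing: the first character must be
# alphanumeric; later characters may also be dashes or underscores.
ID_PATTERN = /\A[a-zA-Z0-9][-_a-zA-Z0-9]*\z/

# Hypothetical helper (not part of Veritable::Util): true when check_id
# would accept the value, false when it would raise.
def valid_id?(id)
  id.is_a?(String) && !(id =~ ID_PATTERN).nil?
end

valid_id?('row-1')   # => true
valid_id?('7_days')  # => true
valid_id?('_row')    # => false (leading underscore)
valid_id?('row 1')   # => false (spaces are not allowed)
valid_id?(42)        # => false (not a String)
```

Because the pattern already requires an alphanumeric first character, an ID with a leading dash or underscore is rejected by the pattern check itself.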

.check_row(row) ⇒ Object

Checks that a given row is well-formed

Raises a VeritableError if the row Hash is not well-formed



# File 'lib/veritable/util.rb', line 73

def check_row(row)
  if not row.is_a? Hash
    begin
      row.to_s
    rescue
      raise VeritableError.new("Invalid row -- Must provide a hash of column name-value pairs.")
    else
      raise VeritableError.new("Invalid row #{row} -- Must provide a hash of column name-value pairs.")
    end
  elsif not row.has_key? '_id'
    raise VeritableError.new("Invalid row #{row} -- rows must contain unique row ids in the '_id' field.")
  else
    begin
      check_id row['_id']
    rescue VeritableError => e
      raise VeritableError.new("Invalid row #{row} -- #{e}")
    end
  end
end

.clean_data(rows, schema, opts = {}) ⇒ Object

Cleans up an Array of row Hashes in accordance with an analysis schema

This method mutates its rows argument. If clean_data raises an exception, values in some rows may be converted while others are left in their original state.

Arguments

  • rows – the Array of Hashes to clean up

  • schema – a Schema specifying the types of the columns appearing in the rows being cleaned

  • opts – a Hash optionally containing the keys:

    • convert_types – controls whether clean_data will attempt to convert cells in a column to be of the correct type (default: true)

    • remove_nones – controls whether clean_data will automatically remove cells containing the value nil (default: true)

    • remove_invalids – controls whether clean_data will automatically remove cells that are invalid for a given column (default: true)

    • reduce_categories – controls whether clean_data will automatically reduce the number of categories in categorical columns with too many categories (default: true). If true, the largest categories in a column will be preserved, up to the allowable limit, and the other categories will be binned as "Other".

    • assign_ids – controls whether clean_data will automatically assign new ids to the rows (default: false). If true, rows will be numbered sequentially. If the rows have an existing '_id' column, remove_extra_fields must also be set to true to avoid raising a Veritable::VeritableError.

    • remove_extra_fields – controls whether clean_data will automatically remove columns that are not contained in the schema (default: false). If assign_ids is true, the existing '_id' column will also be removed.

    • rename_columns – an array of two-valued arrays [[old_col_1, new_col_1],[old_col_2, new_col_2],...] of column names to rename. If rename_columns is false (default), no columns are renamed.

Raises

A Veritable::VeritableError containing further details if the data does not validate against the schema.

Returns

nil on success (mutates rows argument)

See also: dev.priorknowledge.com/docs/client/ruby



# File 'lib/veritable/util.rb', line 250

def clean_data(rows, schema, opts={})
  validate(rows, schema, {
    'convert_types' => opts.has_key?('convert_types') ? opts['convert_types'] : true,
    'allow_nones' => false,
    'remove_nones' => opts.has_key?('remove_nones') ? opts['remove_nones'] : true,
    'remove_invalids' => opts.has_key?('remove_invalids') ? opts['remove_invalids'] : true,
    'reduce_categories' => opts.has_key?('reduce_categories') ? opts['reduce_categories'] : true,
    'has_ids' => '_id',
    'assign_ids' => opts.has_key?('assign_ids') ? opts['assign_ids'] : false,
    'allow_extra_fields' => true,
    'remove_extra_fields' => opts.has_key?('remove_extra_fields') ? opts['remove_extra_fields'] : false,
    'allow_empty_columns' => false,
    'rename_columns' => opts.has_key?('rename_columns') ? opts['rename_columns'] : false})
end
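
The listing resolves each option with opts.has_key? on a String key, which has two practical consequences: Symbol keys are silently ignored, and an explicitly passed false or nil is kept rather than replaced by the default. The sketch below illustrates this in isolation; CLEAN_DATA_DEFAULTS and resolve_clean_data_opts are hypothetical names introduced for the example and are not part of the library.

```ruby
# Defaults for the caller-settable keys, as documented above.
CLEAN_DATA_DEFAULTS = {
  'convert_types' => true,
  'remove_nones' => true,
  'remove_invalids' => true,
  'reduce_categories' => true,
  'assign_ids' => false,
  'remove_extra_fields' => false,
  'rename_columns' => false
}.freeze

# Hypothetical helper mirroring the has_key? lookups in the listing.
def resolve_clean_data_opts(opts = {})
  CLEAN_DATA_DEFAULTS.each_with_object({}) do |(key, default), resolved|
    resolved[key] = opts.has_key?(key) ? opts[key] : default
  end
end

resolve_clean_data_opts('assign_ids' => true)['assign_ids']        # => true
resolve_clean_data_opts(:assign_ids => true)['assign_ids']         # => false (Symbol key ignored)
resolve_clean_data_opts('convert_types' => false)['convert_types'] # => false (explicit value kept)
```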

.clean_predictions(predictions, schema, opts = {}) ⇒ Object

Cleans up a predictions request in accordance with an analysis schema. Automatically renames '_id' to '_request_id'; otherwise assigns a new '_request_id' to each row.

This method mutates its predictions argument. If clean_predictions raises an exception, values in some columns may be converted while others are left in their original state.

Arguments

  • predictions – the predictions request to clean up

  • schema – a Schema specifying the types of the columns appearing in the predictions request

  • opts – a Hash optionally containing the keys:

    • convert_types – controls whether clean_predictions will attempt to convert cells in a column to be of the correct type (default: true)

    • remove_invalids – controls whether clean_predictions will automatically remove cells that are invalid for a given column (default: true)

    • remove_extra_fields – controls whether clean_predictions will automatically remove columns that are not contained in the schema (default: true)

    • rename_columns – an array of two-valued arrays [[old_col_1, new_col_1],[old_col_2, new_col_2],...] of column names to rename (default: [['_id','_request_id']]). If rename_columns is false, no columns are renamed. Note that the source listed below always passes [['_id','_request_id']] and does not read this key from opts.

Raises

A Veritable::VeritableError containing further details if the predictions request does not validate against the schema

Returns

nil on success (mutates predictions argument)

See also: dev.priorknowledge.com/docs/client/ruby



# File 'lib/veritable/util.rb', line 314

def clean_predictions(predictions, schema, opts={})
  validate(predictions, schema, {
    'convert_types' => opts.has_key?('convert_types') ? opts['convert_types'] : true,
    'allow_nones' => true,
    'remove_nones' => false,
    'remove_invalids' => opts.has_key?('remove_invalids') ? opts['remove_invalids'] : true,
    'reduce_categories' => false,
    'has_ids' => '_request_id',
    'assign_ids' => true,
    'allow_extra_fields' => false,
    'remove_extra_fields' => opts.has_key?('remove_extra_fields') ? opts['remove_extra_fields'] : true,
    'allow_empty_columns' => true,
    'rename_columns' => [['_id','_request_id']]})
end

.make_analysis_id ⇒ Object

Autogenerate a new analysis ID

Users should not call directly



# File 'lib/veritable/util.rb', line 40

def make_analysis_id; UUID.new.generate :compact ; end

.make_schema(schema_rule, opts = {}) ⇒ Object

Makes a new analysis schema from a schema rule

Arguments

  • schema_rule – a Hash or Array of two-valued Arrays, whose keys or first values should be regexes to match against column names, and whose values should be the appropriate datatype to assign to matching columns, for instance:

    [['a_regex_to_match', {'type' => 'continuous'}], ['another_regex', {'type' => 'count'}], ...]
    
  • opts – a Hash which must contain either:

    • the key 'headers', whose value should be an Array of column names from which to construct the schema

    • or the key 'rows', whose value should be an Array of row Hashes from whose columns the schema is to be constructed

Returns

A new Veritable::Schema

See also: dev.priorknowledge.com/docs/client/ruby



# File 'lib/veritable/util.rb', line 146

def make_schema(schema_rule, opts={})
  if ((not opts.has_key?('headers')) and (not opts.has_key?('rows')))
    raise VeritableError.new("Either 'headers' or 'rows' must be provided!")
  end
  headers = opts.has_key?('headers') ? opts['headers'] : nil
  if headers.nil?
    headers = Set.new
    opts['rows'].each {|row| headers.merge(row.keys)}
    headers = headers.to_a.sort
  end
  schema = {}
  headers.each do |c|
    schema_rule.each do |r, t|
      if r === c
        schema[c] = t
        break
      end
    end
  end
  return Veritable::Schema.new(schema)
end
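
The matching loop in the listing assigns each column the datatype of the first rule whose pattern matches it (via ===) and leaves unmatched columns out of the schema. The sketch below reproduces just that loop so the behaviour can be checked without the gem; match_columns is a hypothetical stand-in that returns a plain Hash rather than a Veritable::Schema, and the column names are invented for illustration.

```ruby
# Hypothetical stand-in for the matching loop in make_schema.
# Returns a plain Hash of column name => datatype spec.
def match_columns(schema_rule, headers)
  headers.each_with_object({}) do |column, schema|
    # The first rule whose pattern matches the column name wins.
    match = schema_rule.find { |pattern, _datatype| pattern === column }
    schema[column] = match[1] if match
  end
end

rule = [
  [/_count\z/, { 'type' => 'count' }],
  [/\A(height|weight)/, { 'type' => 'continuous' }]
]
headers = ['_id', 'height_cm', 'visit_count', 'weight_count']

schema = match_columns(rule, headers)
schema['height_cm']     # => {"type"=>"continuous"}
schema['weight_count']  # => {"type"=>"count"} (first matching rule wins)
schema.key?('_id')      # => false (no rule matched)
```

Rule order matters: 'weight_count' matches both patterns and takes the datatype of the earlier one. A String used as a pattern only matches a column with exactly that name, since String#=== is plain equality.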

.make_table_id ⇒ Object

Autogenerate a new table ID

Users should not call directly



# File 'lib/veritable/util.rb', line 35

def make_table_id; UUID.new.generate :compact ; end

.query_params(params, parent = nil) ⇒ Object

Helper function for HTTP form encoding

Users should not call directly



# File 'lib/veritable/util.rb', line 45

def query_params(params, parent=nil)
  flatten_params(params).collect {|x|
    "#{x[0]}=#{x[1]}"
  }.join("&")
end

.read_csv(filename, id_col = nil, na_vals = ['']) ⇒ Object

Reads a .csv with headers in as an Array of row Hashes

All values are kept as strings, except values listed in na_vals (by default, empty strings), which are set to nil. To clean data and convert types in accordance with a given schema, use the clean_data and validate_data functions.

Arguments

  • filename – a path to the .csv file to read in from

  • id_col – optionally specify the column to rename to '_id'. If nil (default) and a column named '_id' is present, that column is used. If nil and no '_id' column is present, then '_id' will be automatically generated as sequential row numbers ('1', '2', ...).

  • na_vals – a list of string values to treat as missing; defaults to [''].

Returns

An Array of row Hashes

See also: dev.priorknowledge.com/docs/client/ruby



# File 'lib/veritable/util.rb', line 205

def read_csv(filename, id_col=nil, na_vals=[''])
  rows = CSV.read(filename)
  header = rows.shift
  header = header.collect {|h| (h == id_col ? '_id' : h).strip}
  if header.include?('_id')
    id_col = '_id'
  end
  rid = 0
  rows = rows.collect do |raw_row|
    rid = rid + 1
    row = {}
    (0...raw_row.length).each do |i|
      row[header[i]] = ( na_vals.include?(raw_row[i]) ? nil : raw_row[i] )
    end
    if id_col.nil? 
      row['_id'] = rid.to_s
    end
    row
  end
  return rows
end
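
To see what the resulting row Hashes look like, the sketch below follows the same steps as the listing using only Ruby's standard CSV library. read_rows is a hypothetical stand-in, not the library method, and the file contents are invented for illustration.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical stand-in that follows the read_csv listing above.
def read_rows(filename, id_col = nil, na_vals = [''])
  rows = CSV.read(filename)
  header = rows.shift.collect { |h| (h == id_col ? '_id' : h).strip }
  id_col = '_id' if header.include?('_id')
  rows.each_with_index.collect do |raw_row, i|
    row = {}
    raw_row.each_with_index do |value, j|
      # Values listed in na_vals are stored as nil; the rest stay Strings.
      row[header[j]] = na_vals.include?(value) ? nil : value
    end
    # With no id column, ids are 1-based row numbers stored as Strings.
    row['_id'] = (i + 1).to_s if id_col.nil?
    row
  end
end

file = Tempfile.new(['people', '.csv'])
file.write("name,age\nalice,30\nbob,NA\n")
file.close

people = read_rows(file.path, nil, ['NA'])
# people[0] is {"name"=>"alice", "age"=>"30", "_id"=>"1"}
# people[1] is {"name"=>"bob", "age"=>nil, "_id"=>"2"}
file.unlink
```

Note that numeric-looking cells such as "30" remain Strings; converting them to the types in a schema is the job of clean_data.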

.split_rows(rows, frac) ⇒ Object

Splits an array of row Hashes into two sets

Arguments

  • rows – an Array of valid row Hashes

  • frac – the fraction of the rows to include in the first set

Returns

An array [train_dataset, test_dataset], each of whose members is an Array of row Hashes.

See also: dev.priorknowledge.com/docs/client/ruby



# File 'lib/veritable/util.rb', line 118

def split_rows(rows, frac)
  rows = rows.to_a
  n = rows.size
  inds = (0...n).to_a.shuffle
  border_ind = (n * frac).floor.to_i
  train_dataset = (0...border_ind).collect {|i| rows[inds[i]] }
  test_dataset = (border_ind...n).collect {|i| rows[inds[i]] }
  return [train_dataset, test_dataset]
end
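
The split is a shuffle followed by a cut at floor(n * frac), so the first set receives that many rows and the second set receives the remainder. The sketch below shows the same arithmetic with a seeded random number generator so the result is repeatable; split_rows_seeded and its seed parameter are hypothetical additions for illustration, since the library method takes no seed.

```ruby
# Hypothetical variant of split_rows with a seed for repeatable splits.
def split_rows_seeded(rows, frac, seed)
  rows = rows.to_a
  n = rows.size
  inds = (0...n).to_a.shuffle(random: Random.new(seed))
  border = (n * frac).floor
  train = inds[0...border].collect { |i| rows[i] }
  test = inds[border...n].collect { |i| rows[i] }
  [train, test]
end

rows = (1..8).collect { |i| { '_id' => i.to_s } }
train, test = split_rows_seeded(rows, 0.75, 42)
train.size  # => 6
test.size   # => 2
```

Every input row lands in exactly one of the two sets, and calling the sketch again with the same seed returns the same split.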

.validate_data(rows, schema) ⇒ Object

Validates an Array of row Hashes against an analysis schema

Arguments

  • rows – the Array of Hashes to validate

  • schema – a Schema specifying the types of the columns appearing in the rows being validated

Raises

A Veritable::VeritableError containing further details if the data does not validate against the schema.

Returns

nil on success

See also: dev.priorknowledge.com/docs/client/ruby



# File 'lib/veritable/util.rb', line 278

def validate_data(rows, schema)
  validate(rows, schema, {
    'convert_types' => false,
    'allow_nones' => false,
    'remove_nones' => false,
    'remove_invalids' => false,
    'reduce_categories' => false,
    'has_ids' => '_id',
    'assign_ids' => false,
    'allow_extra_fields' => true,
    'remove_extra_fields' => false,
    'allow_empty_columns' => false,
    'rename_columns' => false})
end

.validate_predictions(predictions, schema) ⇒ Object

Validates a predictions request against an analysis schema

Arguments

  • predictions – the predictions request to validate

  • schema – a Schema specifying the types of the columns appearing in the predictions request

Raises

A Veritable::VeritableError containing further details if the predictions request does not validate against the schema.

Returns

nil on success

See also: dev.priorknowledge.com/docs/client/ruby



# File 'lib/veritable/util.rb', line 342

def validate_predictions(predictions, schema)
  validate(predictions, schema, {
    'convert_types' => false,
    'allow_nones' => true,
    'remove_nones' => false,
    'remove_invalids' => false,
    'reduce_categories' => false,
    'has_ids' => '_request_id',
    'assign_ids' => false,
    'allow_extra_fields' => false,
    'remove_extra_fields' => false,
    'allow_empty_columns' => true,
    'rename_columns' => false})
end

.validate_schema(schema) ⇒ Object

Validates a schema

Checks that a Veritable::Schema or Hash of the appropriate form is well-formed.



# File 'lib/veritable/util.rb', line 131

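# The parentheses around the is_a? argument are required; without them Ruby
# parses the whole ternary expression as the argument to is_a?.
def validate_schema(schema); schema.is_a?(Veritable::Schema) ? schema.validate : Veritable::Schema.new(schema).validate; end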

.write_csv(rows, filename) ⇒ Object

Writes an Array of row Hashes out to .csv

Arguments

  • rows – an Array of valid row Hashes

  • filename – a path to the .csv file to write out

Returns

nil on success.

See also: dev.priorknowledge.com/docs/client/ruby



# File 'lib/veritable/util.rb', line 178

def write_csv(rows, filename)
  headers = Set.new
  rows.each {|row| headers.merge(row.keys)}
  headers = headers.to_a.sort
  CSV.open(filename, "w") do |csv|
    csv << headers
    rows.each do |row|
      out_row = headers.collect {|h| row.keys.include?(h) ? row[h] : ''}
      csv << out_row
    end
  end
  nil
end
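
Rows do not need to share the same keys: the header row is the sorted union of all keys, and a row that lacks a column gets an empty cell there. The sketch below follows the listing using only the standard library; write_rows is a hypothetical stand-in, not the library method, and the sample rows are invented for illustration.

```ruby
require 'csv'
require 'set'
require 'tempfile'

# Hypothetical stand-in that follows the write_csv listing above.
def write_rows(rows, filename)
  headers = Set.new
  rows.each { |row| headers.merge(row.keys) }
  headers = headers.to_a.sort
  CSV.open(filename, 'w') do |csv|
    csv << headers
    rows.each do |row|
      # Columns missing from this row are written as empty strings.
      csv << headers.collect { |h| row.key?(h) ? row[h] : '' }
    end
  end
  nil
end

rows = [
  { '_id' => '1', 'height' => '170' },
  { '_id' => '2', 'weight' => '65' }
]
out = Tempfile.new(['rows', '.csv'])
out.close
write_rows(rows, out.path)
puts File.read(out.path)  # header line is _id,height,weight
out.unlink
```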