Module: Veritable::Util
- Defined in:
- lib/veritable/util.rb
Overview
Encapsulates utilities for working with data
Methods
- read_csv – reads a .csv from disk into an Array of row Hashes
- write_csv – writes an Array of row Hashes to disk as .csv
- split_rows – splits an Array of row Hashes into two sets
- make_schema – makes a new analysis schema from a schema rule
- validate_data – validates an Array of row Hashes against a schema
- clean_data – cleans an Array of row Hashes to conform to a schema
- validate_predictions – validates a single predictions request Hash against a schema
- clean_predictions – cleans a predictions request Hash to conform to a schema
- validate_schema – validates a schema
- check_id – checks that a unique ID is valid
- check_row – checks that a row Hash is well-formed
- check_datatype – checks that a datatype is valid
- query_params – helper function for HTTP form encoding
- make_table_id – autogenerates a new valid ID for a table
- make_analysis_id – autogenerates a new valid ID for an analysis
See also: dev.priorknowledge.com/docs/client/ruby
Class Method Summary
- .check_datatype(datatype, msg = nil) ⇒ Object
  Checks that a given datatype is valid.
- .check_id(id) ⇒ Object
  Checks that a unique ID is valid.
- .check_row(row) ⇒ Object
  Checks that a given row is well-formed.
- .clean_data(rows, schema, opts = {}) ⇒ Object
  Cleans up an Array of row Hashes in accordance with an analysis schema.
- .clean_predictions(predictions, schema, opts = {}) ⇒ Object
  Cleans up a predictions request in accordance with an analysis schema. Automatically renames '_id' to '_request_id'; otherwise assigns a new '_request_id' to each row.
- .make_analysis_id ⇒ Object
  Autogenerates a new analysis ID.
- .make_schema(schema_rule, opts = {}) ⇒ Object
  Makes a new analysis schema from a schema rule.
- .make_table_id ⇒ Object
  Autogenerates a new table ID.
- .query_params(params, parent = nil) ⇒ Object
  Helper function for HTTP form encoding.
- .read_csv(filename, id_col = nil, na_vals = ['']) ⇒ Object
  Reads a .csv with headers in as an Array of row Hashes.
- .split_rows(rows, frac) ⇒ Object
  Splits an Array of row Hashes into two sets.
- .validate_data(rows, schema) ⇒ Object
  Validates an Array of row Hashes against an analysis schema.
- .validate_predictions(predictions, schema) ⇒ Object
  Validates a predictions request against an analysis schema.
- .validate_schema(schema) ⇒ Object
  Validates a schema.
- .write_csv(rows, filename) ⇒ Object
  Writes an Array of row Hashes out to .csv.
Class Method Details
.check_datatype(datatype, msg = nil) ⇒ Object
Checks that a given datatype is valid.
Raises a VeritableError if the datatype is invalid.
# File 'lib/veritable/util.rb', line 96

def check_datatype(datatype, msg=nil)
  if not DATATYPES.include? datatype
    begin
      datatype.to_s
    rescue
      raise VeritableError.new("#{msg}Invalid data type.")
    else
      raise VeritableError.new("#{msg}Invalid data type '#{datatype}'.")
    end
  end
end
.check_id(id) ⇒ Object
Checks that a unique ID is valid
Raises a VeritableError if the ID is invalid.
# File 'lib/veritable/util.rb', line 54

def check_id(id)
  if not id.is_a? String
    begin
      id.to_s
    rescue
      raise VeritableError.new("Invalid id -- strings only.")
    else
      raise VeritableError.new("Invalid id '#{id}' -- strings only.")
    end
  elsif not id =~ Regexp.new('\A[a-zA-Z0-9][-_a-zA-Z0-9]*\z')
    raise VeritableError.new("Invalid id '#{id}' -- must contain only alphanumerics, underscores, and dashes.")
  elsif id[0] == '_' or id[0] == '-'
    raise VeritableError.new("Invalid id '#{id}' -- may not begin with a dash or underscore.")
  end
end
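The ID rule above can be exercised standalone. The following is a sketch using the same regex, not the library's API (valid_id? is a hypothetical helper for illustration):

```ruby
# Sketch of the ID rule enforced by check_id: strings of alphanumerics,
# underscores, and dashes that begin with an alphanumeric character.
ID_PATTERN = Regexp.new('\A[a-zA-Z0-9][-_a-zA-Z0-9]*\z')

def valid_id?(id)
  id.is_a?(String) && !(id =~ ID_PATTERN).nil?
end

puts valid_id?('table-1_a')  # well-formed
puts valid_id?('_hidden')    # begins with an underscore -- rejected
puts valid_id?('has space')  # contains a space -- rejected
```

Note that check_id itself raises a VeritableError rather than returning a boolean.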
.check_row(row) ⇒ Object
Checks that a given row is well-formed
Raises a VeritableError if the row Hash is not well-formed
# File 'lib/veritable/util.rb', line 73

def check_row(row)
  if not row.is_a? Hash
    begin
      row.to_s
    rescue
      raise VeritableError.new("Invalid row -- Must provide a hash of column name-value pairs.")
    else
      raise VeritableError.new("Invalid row #{row} -- Must provide a hash of column name-value pairs.")
    end
  elsif not row.has_key? '_id'
    raise VeritableError.new("Invalid row #{row} -- rows must contain unique row ids in the '_id' field.")
  else
    begin
      check_id row['_id']
    rescue VeritableError => e
      raise VeritableError.new("Invalid row #{row} -- #{e}")
    end
  end
end
.clean_data(rows, schema, opts = {}) ⇒ Object
Cleans up an Array of row Hashes in accordance with an analysis schema.

This method mutates its rows argument. If clean_data raises an exception, values in some rows may be converted while others are left in their original state.
Arguments

- rows – the Array of Hashes to clean up
- schema – a Schema specifying the types of the columns appearing in the rows being cleaned
- opts – a Hash optionally containing the keys:
  - convert_types – controls whether clean_data will attempt to convert cells in a column to be of the correct type (default: true)
  - remove_nones – controls whether clean_data will automatically remove cells containing the value nil (default: true)
  - remove_invalids – controls whether clean_data will automatically remove cells that are invalid for a given column (default: true)
  - reduce_categories – controls whether clean_data will automatically reduce the number of categories in categorical columns with too many categories (default: true). If true, the largest categories in a column will be preserved, up to the allowable limit, and the other categories will be binned as "Other".
  - assign_ids – controls whether clean_data will automatically assign new ids to the rows (default: false). If true, rows will be numbered sequentially. If the rows have an existing '_id' column, remove_extra_fields must also be set to true to avoid raising a Veritable::VeritableError.
  - remove_extra_fields – controls whether clean_data will automatically remove columns that are not contained in the schema (default: false). If assign_ids is true (default), will also remove the '_id' column.
  - rename_columns – an array of two-valued arrays [[old_col_1, new_col_1], [old_col_2, new_col_2], ...] of column names to rename. If rename_columns is false (default), no columns are renamed.

Raises

A Veritable::VeritableError containing further details if the data does not validate against the schema.

Returns

nil on success (mutates the rows argument)
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 250

def clean_data(rows, schema, opts={})
  validate(rows, schema, {
    'convert_types' => opts.has_key?('convert_types') ? opts['convert_types'] : true,
    'allow_nones' => false,
    'remove_nones' => opts.has_key?('remove_nones') ? opts['remove_nones'] : true,
    'remove_invalids' => opts.has_key?('remove_invalids') ? opts['remove_invalids'] : true,
    'reduce_categories' => opts.has_key?('reduce_categories') ? opts['reduce_categories'] : true,
    'has_ids' => '_id',
    'assign_ids' => opts.has_key?('assign_ids') ? opts['assign_ids'] : false,
    'allow_extra_fields' => true,
    'remove_extra_fields' => opts.has_key?('remove_extra_fields') ? opts['remove_extra_fields'] : false,
    'allow_empty_columns' => false,
    'rename_columns' => opts.has_key?('rename_columns') ? opts['rename_columns'] : false})
end
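The opts handling above follows a common Ruby defaulting pattern: has_key? is checked explicitly so that a caller-supplied false survives, where `opts[key] || default` would silently override it. A minimal standalone sketch of that pattern (fetch_opt is a hypothetical helper, not part of the library):

```ruby
# Explicit-default pattern used by clean_data's opts handling:
# a caller-supplied false must win over the default.
def fetch_opt(opts, key, default)
  opts.has_key?(key) ? opts[key] : default
end

opts = {'assign_ids' => false}
puts fetch_opt(opts, 'assign_ids', true)     # false -- caller's value wins
puts fetch_opt(opts, 'convert_types', true)  # true -- default applies
```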
.clean_predictions(predictions, schema, opts = {}) ⇒ Object
Cleans up a predictions request in accordance with an analysis schema. Automatically renames '_id' to '_request_id'; otherwise assigns a new '_request_id' to each row.

This method mutates its predictions argument. If clean_predictions raises an exception, values in some columns may be converted while others are left in their original state.
Arguments

- predictions – the predictions request to clean up
- schema – a Schema specifying the types of the columns appearing in the predictions request
- opts – a Hash optionally containing the keys:
  - convert_types – controls whether clean_predictions will attempt to convert cells in a column to be of the correct type (default: true)
  - remove_invalids – controls whether clean_predictions will automatically remove cells that are invalid for a given column (default: true)
  - remove_extra_fields – controls whether clean_predictions will automatically remove columns that are not contained in the schema (default: true)
  - rename_columns – an array of two-valued arrays [[old_col_1, new_col_1], [old_col_2, new_col_2], ...] of column names to rename. If rename_columns is false, no columns are renamed. (default: [['_id', '_request_id']])

Raises

A Veritable::VeritableError containing further details if the predictions request does not validate against the schema.

Returns

nil on success (mutates the predictions argument)
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 314

def clean_predictions(predictions, schema, opts={})
  validate(predictions, schema, {
    'convert_types' => opts.has_key?('convert_types') ? opts['convert_types'] : true,
    'allow_nones' => true,
    'remove_nones' => false,
    'remove_invalids' => opts.has_key?('remove_invalids') ? opts['remove_invalids'] : true,
    'reduce_categories' => false,
    'has_ids' => '_request_id',
    'assign_ids' => true,
    'allow_extra_fields' => false,
    'remove_extra_fields' => opts.has_key?('remove_extra_fields') ? opts['remove_extra_fields'] : true,
    'allow_empty_columns' => true,
    'rename_columns' => [['_id','_request_id']]})
end
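The default rename_columns setting amounts to moving each row's '_id' key to '_request_id'. A standalone sketch of what that rename does to a single predictions Hash (this is an illustration, not the library's internal implementation):

```ruby
# Sketch of the default rename step: '_id' becomes '_request_id',
# other keys (including nil cells to be predicted) are untouched.
prediction = {'_id' => 'row7', 'age' => nil, 'color' => 'blue'}

[['_id', '_request_id']].each do |old_col, new_col|
  if prediction.has_key?(old_col)
    prediction[new_col] = prediction.delete(old_col)
  end
end

p prediction
```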
.make_analysis_id ⇒ Object
Autogenerates a new analysis ID.

Users should not call this method directly.
# File 'lib/veritable/util.rb', line 40

def make_analysis_id; UUID.new.generate :compact ; end
.make_schema(schema_rule, opts = {}) ⇒ Object
Makes a new analysis schema from a schema rule
Arguments

- schema_rule – a Hash or Array of two-valued Arrays, whose keys or first values should be regexes to match against column names, and whose values should be the appropriate datatype to assign to matching columns, for instance: [['a_regex_to_match', {'type' => 'continuous'}], ['another_regex', {'type' => 'count'}], ...]
- opts – a Hash which must contain either:
  - the key 'headers', whose value should be an Array of column names from which to construct the schema
  - or the key 'rows', whose value should be an Array of row Hashes from whose columns the schema is to be constructed

Returns

A new Veritable::Schema
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 146

def make_schema(schema_rule, opts={})
  if ((not opts.has_key?('headers')) and (not opts.has_key?('rows')))
    raise VeritableError.new("Either 'headers' or 'rows' must be provided!")
  end
  headers = opts.has_key?('headers') ? opts['headers'] : nil
  if headers.nil?
    headers = Set.new
    opts['rows'].each {|row| headers.merge(row.keys)}
    headers = headers.to_a.sort
  end
  schema = {}
  headers.each do |c|
    schema_rule.each do |r, t|
      if r === c
        schema[c] = t
        break
      end
    end
  end
  return Veritable::Schema.new(schema)
end
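The matching loop above relies on Regexp#=== (case equality) to test each rule against each column name, and the first matching rule wins. That core loop can be run standalone, without constructing a Veritable::Schema (the column names and rules here are invented for illustration):

```ruby
# First-match-wins rule application, as in make_schema:
# each header gets the type of the first rule whose regex matches it.
schema_rule = [
  [/_count\z/, {'type' => 'count'}],
  [/.*/,       {'type' => 'continuous'}]  # catch-all rule, listed last
]
headers = ['age', 'visit_count']

schema = {}
headers.each do |c|
  schema_rule.each do |r, t|
    if r === c          # Regexp#=== returns true on a match
      schema[c] = t
      break             # stop at the first matching rule
    end
  end
end

p schema
```

Because rules are tried in order, more specific regexes should precede catch-alls.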
.make_table_id ⇒ Object
Autogenerates a new table ID.

Users should not call this method directly.
# File 'lib/veritable/util.rb', line 35

def make_table_id; UUID.new.generate :compact ; end
.query_params(params, parent = nil) ⇒ Object
Helper function for HTTP form encoding.

Users should not call this method directly.
# File 'lib/veritable/util.rb', line 45

def query_params(params, parent=nil)
  flatten_params(params).collect {|x|
    "#{x[0]}=#{x[1]}"
  }.join("&")
end
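For flat parameter Hashes, the output shape is the familiar k=v&k=v query string. Ruby's standard library produces the same shape via URI.encode_www_form, shown here only as an analogue; the private flatten_params helper (not documented here) additionally handles nested structures:

```ruby
require 'uri'

# Stdlib analogue of query_params for a flat Hash of string params.
# URI.encode_www_form also percent-escapes reserved characters.
puts URI.encode_www_form('limit' => '10', 'start' => 'a')  # limit=10&start=a
```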
.read_csv(filename, id_col = nil, na_vals = ['']) ⇒ Object
Reads a .csv with headers in as an Array of row Hashes
All values are kept as strings, except empty strings, which are omitted. To clean data and convert types in accordance with a given schema, use the clean_data and validate_data functions.
Arguments

- filename – a path to the .csv file to read in from
- id_col – optionally specifies the column to rename to '_id'. If nil (default) and a column named '_id' is present, that column is used. If nil and no '_id' column is present, then '_id' values will be automatically generated.
- na_vals – a list of string values to omit (default: [''])

Returns

An Array of row Hashes
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 205

def read_csv(filename, id_col=nil, na_vals=[''])
  rows = CSV.read(filename)
  header = rows.shift
  header = header.collect {|h| (h == id_col ? '_id' : h).strip}
  if header.include?('_id')
    id_col = '_id'
  end
  rid = 0
  rows = rows.collect do |raw_row|
    rid = rid + 1
    row = {}
    (0...raw_row.length).each do |i|
      row[header[i]] = ( na_vals.include?(raw_row[i]) ? nil : raw_row[i] )
    end
    if id_col.nil?
      row['_id'] = rid.to_s
    end
    row
  end
  return rows
end
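The row-Hash shape this produces can be demonstrated with only the standard CSV library. The sketch below mirrors the logic above on an invented two-column file: header names become Hash keys, values in na_vals become nil, and a sequential string '_id' is assigned when no id column exists:

```ruby
require 'csv'
require 'tempfile'

# Write a small .csv with no '_id' column, then read it back the way
# read_csv does.
file = Tempfile.new(['example', '.csv'])
file.write("age,color\n34,blue\n,green\n")
file.close

na_vals = ['']
rows = CSV.read(file.path)
header = rows.shift.collect {|h| h.strip}
rid = 0
rows = rows.collect do |raw_row|
  rid += 1
  row = {}
  (0...raw_row.length).each do |i|
    row[header[i]] = na_vals.include?(raw_row[i]) ? nil : raw_row[i]
  end
  row['_id'] = rid.to_s  # no id column present, so ids are generated
  row
end

p rows
```

Note that all surviving values are strings; type conversion is left to clean_data.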
.split_rows(rows, frac) ⇒ Object
Splits an array of row Hashes into two sets
Arguments
-
rows
– an Array of valid row Hashes -
frac
– the fraction of the rows to include in the first set
Returns
An array [train_dataset, test_dataset]
, each of whose members is an Array of row Hashes.
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 118

def split_rows(rows, frac)
  rows = rows.to_a
  n = rows.size
  inds = (0...n).to_a.shuffle
  border_ind = (n * frac).floor.to_i
  train_dataset = (0...border_ind).collect {|i| rows[inds[i]] }
  test_dataset = (border_ind...n).collect {|i| rows[inds[i]] }
  return [train_dataset, test_dataset]
end
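The split is a simple shuffle-and-cut: the first set receives floor(n * frac) rows and the second receives the remainder, with no row duplicated or dropped. The same logic can be run standalone (row contents invented for illustration):

```ruby
# Shuffle-and-cut split, as in split_rows: with frac = 0.8 and 10 rows,
# the first set gets floor(10 * 0.8) = 8 rows and the second gets 2.
def split_rows(rows, frac)
  rows = rows.to_a
  n = rows.size
  inds = (0...n).to_a.shuffle
  border = (n * frac).floor
  [(0...border).collect {|i| rows[inds[i]]},
   (border...n).collect {|i| rows[inds[i]]}]
end

rows = (1..10).collect {|i| {'_id' => i.to_s}}
train, test = split_rows(rows, 0.8)
puts train.size  # 8
puts test.size   # 2
```

Because the shuffle is random, membership of each set differs between runs; only the sizes are deterministic.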
.validate_data(rows, schema) ⇒ Object
Validates an Array of row Hashes against an analysis schema
Arguments

- rows – the Array of row Hashes to validate
- schema – a Schema specifying the types of the columns appearing in the rows being validated

Raises

A Veritable::VeritableError containing further details if the data does not validate against the schema.

Returns

nil on success
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 278

def validate_data(rows, schema)
  validate(rows, schema, {
    'convert_types' => false,
    'allow_nones' => false,
    'remove_nones' => false,
    'remove_invalids' => false,
    'reduce_categories' => false,
    'has_ids' => '_id',
    'assign_ids' => false,
    'allow_extra_fields' => true,
    'remove_extra_fields' => false,
    'allow_empty_columns' => false,
    'rename_columns' => false})
end
.validate_predictions(predictions, schema) ⇒ Object
Validates a predictions request against an analysis schema
Arguments

- predictions – the predictions request to validate
- schema – a Schema specifying the types of the columns appearing in the predictions request

Raises

A Veritable::VeritableError containing further details if the predictions request does not validate against the schema.

Returns

nil on success
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 342

def validate_predictions(predictions, schema)
  validate(predictions, schema, {
    'convert_types' => false,
    'allow_nones' => true,
    'remove_nones' => false,
    'remove_invalids' => false,
    'reduce_categories' => false,
    'has_ids' => '_request_id',
    'assign_ids' => false,
    'allow_extra_fields' => false,
    'remove_extra_fields' => false,
    'allow_empty_columns' => true,
    'rename_columns' => false})
end
.validate_schema(schema) ⇒ Object
Validates a schema
Checks that a Veritable::Schema or Hash of the appropriate form is well-formed.
# File 'lib/veritable/util.rb', line 131

def validate_schema(schema); schema.is_a? Veritable::Schema ? schema.validate : Veritable::Schema.new(schema).validate; end
.write_csv(rows, filename) ⇒ Object
Writes an Array of row Hashes out to .csv
Arguments
-
rows
– an Array of valid row Hashes -
filename
– a path to the .csv file to write out
Returns
nil
on success.
See also: dev.priorknowledge.com/docs/client/ruby
# File 'lib/veritable/util.rb', line 178

def write_csv(rows, filename)
  headers = Set.new
  rows.each {|row| headers.merge(row.keys)}
  headers = headers.to_a.sort
  CSV.open(filename, "w") do |csv|
    csv << headers
    rows.each do |row|
      out_row = headers.collect {|h| row.keys.include?(h) ? row[h] : ''}
      csv << out_row
    end
  end
  nil
end