Module: Rsmart::ETL

Defined in:
lib/rsmart_toolbox/etl.rb

Overview

rSmart extract, transform and load methods.

Defined Under Namespace

Modules: GRM Classes: TextParseError

Class Method Summary collapse

Class Method Details

.encode(str, opt = { encoding: "UTF-8" }) ⇒ String

Encodes the input String and replaces invalid or undefined characters.

Parameters:

  • str (String)

    the String to be encoded and invalid characters replaced with valid characters.

  • opt (Hash) (defaults to: { encoding: "UTF-8" })

    a customizable set of options

Options Hash (opt):

  • :encoding (String)

    the character encoding to use.

Returns:

  • (String)

    the result of encoding the String and replacing invalid characters with valid characters.

See Also:

  • String#encode


108
109
110
111
112
# File 'lib/rsmart_toolbox/etl.rb', line 108

def self.encode(str, opt={ encoding: "UTF-8" } )
  opt[:encoding] = "UTF-8" if opt[:encoding].nil?
  str.encode( opt[:encoding], :invalid => :replace,
              :undef => :replace, :replace => "" )
end

.error(e) ⇒ Exception

Prepares an Exception for consistent error handling.

Parameters:

  • e (String, Exception)

    the error to handle

Returns:

  • (Exception)

    an Exception with a message formatted with $INPUT_LINE_NUMBER.

Raises:

  • (ArgumentError)

    if an invalid argument is passed.



30
31
32
33
34
35
36
37
38
39
# File 'lib/rsmart_toolbox/etl.rb', line 30

def self.error(e)
  if e.kind_of? String
    # default to TextParseError
    return TextParseError.new "ERROR: Line #{$INPUT_LINE_NUMBER}: #{e}"
  end
  if e.kind_of? Exception
    return e.exception "ERROR: Line #{$INPUT_LINE_NUMBER}: #{e}"
  end
  raise ArgumentError, "Unsupported error type: #{e.class}"
end

.escape_single_quotes(str) ⇒ String?

Prepares a String for a SQL statement where single quotes need to be escaped.

Parameters:

  • str (String)

    the String to be escaped.

Returns:

  • (String, nil)

    the resulting String with single quotes escaped with a backslash. If a nil is passed, nil is returned.



148
149
150
151
152
153
# File 'lib/rsmart_toolbox/etl.rb', line 148

def self.escape_single_quotes(str)
  if str.nil?
    return nil
  end
  return str.to_s.gsub("'", "\\\\'")
end

.mutate_sql_stmt!(insert_str, column_name, values_str, value) ⇒ void

This method returns an undefined value.

Mutates two sides of a SQL insert statement: insert_str and values_str with column_name and value respectively. Proper SQL value quoting will be performed based on object type.

Parameters:

  • insert_str (String)

    the left side of the insert statement (i.e. columns)

  • column_name (String)

    the column name to append to insert_str.

  • values_str (String)

    the right side of the insert statement (i.e. values)

  • value (Object)

    the value to append to values_str. Must respond to #to_s.



133
134
135
136
137
138
139
140
141
142
# File 'lib/rsmart_toolbox/etl.rb', line 133

def self.mutate_sql_stmt!(insert_str, column_name, values_str, value)
  insert_str.concat "#{column_name.upcase},"
  # TODO what are all of the valid types that should not be quoted?
  if value.kind_of? Integer
    values_str.concat "#{value},"
  else
    values_str.concat "'#{value}',"
  end
  return nil
end

.parse_boolean(str, opt = {}) ⇒ Boolean

Matches the input against a set of well known boolean patterns.

Parameters:

  • str (String)

    String to be matched against well known boolean patterns.

  • opt (Hash) (defaults to: {})

    a customizable set of options

Options Hash (opt):

  • :default (Boolean)

    the default return value if str is empty.

Returns:

  • (Boolean)

    the result of matching the str input against well known boolean patterns.

Raises:

  • (TextParseError)

    if none of the known boolean patterns could be matched.



88
89
90
91
92
93
94
95
96
97
98
99
100
101
# File 'lib/rsmart_toolbox/etl.rb', line 88

def self.parse_boolean(str, opt={})
  return true  if str == true
  return false if str == false
  b = parse_string str, opt
  return true  if b =~ /^(active|a|true|t|yes|y|1)$/i
  return false if b =~ /^(inactive|i|false|f|no|n|0)$/i
  if b.empty? && !opt[:default].nil?
    return opt[:default]
  end
  if b.empty?
    return nil
  end
  raise Rsmart::ETL::error TextParseError.new "invalid value for Boolean: '#{str}'"
end

.parse_csv_command_line_options(executable, args, opt = { csv_options: { headers: :first_row, header_converters: :symbol, skip_blanks: true, col_sep: ",", quote_char: '"' } }) ⇒ Hash

Parse common command line options for CSV –> SQL transformations.

Examples:

The most common usage:

opt = Rsmart::ETL.parse_csv_command_line_options (File.basename $0), ARGF.argv

Parameters:

  • executable (String)

    the name of the script from which we are executing. See example.

  • args (Array<String>)

    the command line args.

  • opt (Hash) (defaults to: { csv_options: { headers: :first_row, header_converters: :symbol, skip_blanks: true, col_sep: ",", quote_char: '"' } })

    a customizable set of options

Options Hash (opt):

  • :csv_filename (String)

    the input file from which the CSV will be read. Defaults to the first element of args Array.

  • :sql_filename (String)

    the output file to which the SQL will be written.

  • :csv_options (Hash)

    the options that will be used by the CSV parser.

Returns:

  • (Hash)

    a Hash containing the parsed command line results.



367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
# File 'lib/rsmart_toolbox/etl.rb', line 367

def self.parse_csv_command_line_options(
    executable, args, opt={ csv_options: { headers: :first_row,
                                           header_converters: :symbol,
                                           skip_blanks: true,
                                           col_sep: ",",
                                           quote_char: '"'
                                           }
                            } )
  optparse = OptionParser.new do |opts|
    opts.banner = "Usage: #{executable} [options] csv_file"
    opts.on( '-o' ,'--output SQL_FILE_OUTPUT', 'The file the SQL data will be writen to... (defaults to <csv_file>.sql)') do |f|
      opt[:sql_filename] = f
    end
    opts.on( '-s' ,'--separator SEPARATOR_CHARACTER', 'The character that separates each column of the CSV file.') do |s|
      opt[:csv_options][:col_sep] = s
    end
    opts.on( '-q' ,'--quote QUOTE_CHARACTER', 'The character used to quote fields.') do |q|
      opt[:csv_options][:quote_char] = q
    end
    opts.on( '-h', '--help', 'Display this screen' ) do
      puts opts
      exit 1
    end

    opt[:csv_filename] = args[0] unless opt[:csv_filename]
    if opt[:csv_filename].nil? || opt[:csv_filename].empty?
      puts opts
      exit 1
    end
  end
  optparse.parse!

  # construct a sensible default ouptput filename
  unless opt[:sql_filename]
    file_extension = File.extname opt[:csv_filename]
    dir_name = File.dirname opt[:csv_filename]
    base_name = File.basename opt[:csv_filename], file_extension
    opt[:sql_filename] = "#{dir_name}/#{base_name}.sql"
  end

  return opt
end

.parse_date(str, opt = { valid_values: /^$|(\d{4}\-\d{2}\-\d{2}){1}/ }) ⇒ String

Parse a SQL date from a String.

Parameters:

  • str (String)

    the String to be parsed.

  • opt (Hash) (defaults to: { valid_values: /^$|(\d{4}\-\d{2}\-\d{2}){1}/ })

    options Hash will be passed through to #parse_string.

Returns:

  • (String)

    the parsed date. nil or empty inputs will return ” by default.

See Also:



270
271
272
273
# File 'lib/rsmart_toolbox/etl.rb', line 270

def self.parse_date(str, opt={ valid_values: /^$|(\d{4}\-\d{2}\-\d{2}){1}/ })
  opt[:valid_values] = /^$|(\d{4}\-\d{2}\-\d{2}){1}/ if opt[:valid_values].nil?
  return parse_string str, opt
end

.parse_date!(row, insert_str, values_str, opt = {}) ⇒ void

This method returns an undefined value.

Helper method for #parse_date which finds the value by column :name and mutates the SQL statement accordingly.

Parameters:

  • row (CSV::Row)

    the CSV Row being parsed

  • insert_str (String)

    the left side of the insert statement (i.e. columns)

  • values_str (String)

    the right side of the insert statement (i.e. values)

  • opt (Hash) (defaults to: {})

    options Hash will be passed through to #parse_date.

Options Hash (opt):

  • :name (String)

    the name of the field being parsed. Required.

Raises:

  • (ArgumentError)

    :name is required.

See Also:



285
286
287
288
289
# File 'lib/rsmart_toolbox/etl.rb', line 285

def self.parse_date!(row, insert_str, values_str, opt={})
  raise ArgumentError, "opt[:name] is required!" unless opt[:name]
  date = parse_date( row[ to_symbol( opt[:name] ) ], opt )
  mutate_sql_stmt! insert_str, opt[:name], values_str, date
end

.parse_datetime(str, opt = { valid_values: /^$|(\d{4}\-\d{2}\-\d{2}){1}\s(\d{2}:\d{2}:\d{2})?/ }) ⇒ String

Parse a SQL datetime from a String.

Parameters:

  • str (String)

    the String to be parsed.

  • opt (Hash) (defaults to: { valid_values: /^$|(\d{4}\-\d{2}\-\d{2}){1}\s(\d{2}:\d{2}:\d{2})?/ })

    options Hash will be passed through to #parse_string.

Returns:

  • (String)

    the parsed datetime. nil or empty inputs will return ” by default.

See Also:



296
297
298
299
# File 'lib/rsmart_toolbox/etl.rb', line 296

def self.parse_datetime(str, opt={ valid_values: /^$|(\d{4}\-\d{2}\-\d{2}){1}\s(\d{2}:\d{2}:\d{2})?/ })
  opt[:valid_values] = /^$|(\d{4}\-\d{2}\-\d{2}){1}\s(\d{2}:\d{2}:\d{2})?/ if opt[:valid_values].nil?
  return parse_string str, opt
end

.parse_datetime!(row, insert_str, values_str, opt = {}) ⇒ void

This method returns an undefined value.

Helper method for #parse_datetime which finds the value by column :name and mutates the SQL statement accordingly.

Parameters:

  • row (CSV::Row)

    the CSV Row being parsed

  • insert_str (String)

    the left side of the insert statement (i.e. columns)

  • values_str (String)

    the right side of the insert statement (i.e. values)

  • opt (Hash) (defaults to: {})

    options Hash will be passed through to #parse_datetime.

Options Hash (opt):

  • :name (String)

    the name of the field being parsed. Required.

Raises:

  • (ArgumentError)

    :name is required.

See Also:



311
312
313
314
315
# File 'lib/rsmart_toolbox/etl.rb', line 311

def self.parse_datetime!(row, insert_str, values_str, opt={})
  raise ArgumentError, "opt[:name] is required!" unless opt[:name]
  datetime = parse_datetime( row[ to_symbol( opt[:name] ) ], opt )
  mutate_sql_stmt! insert_str, opt[:name], values_str, datetime
end

.parse_flag(str, opt = { length: 1, upcase: true }) ⇒ String

Useful for parsing “flag” like values; i.e. usually single characters.

Parameters:

  • str (String)

    the String to be parsed.

  • opt (Hash) (defaults to: { length: 1, upcase: true })

    options Hash will be passed through to #parse_string.

Options Hash (opt):

  • :length (Integer)

    the maximum supported length of the field.

  • :upcase (Boolean)

    if true upcase the results.

Returns:

  • (String)

    the parsed “flag”.

See Also:



349
350
351
352
353
354
355
# File 'lib/rsmart_toolbox/etl.rb', line 349

def self.parse_flag(str, opt={ length: 1, upcase: true })
  opt[:length] = 1 if opt[:length].nil?
  opt[:upcase] = true if opt[:upcase].nil?
  retval = parse_string str, opt
  retval = retval.upcase if opt[:upcase] == true
  return retval
end

.parse_float(str, opt = {}) ⇒ Float?

Note:

Note the behavioral difference versus #to_f.

Parse a Float from a String.

Examples:

Unlike #to_f, nil or empty inputs will return nil by default

nil == parse_float(nil) && nil == parse_float('') && 0.0 != parse_float(nil)

Parameters:

  • str (String)

    the String to be parsed.

  • opt (Hash) (defaults to: {})

    options Hash will be passed through to #parse_string.

Returns:

  • (Float, nil)

    the parsed Float. nil or empty inputs will return nil by default.

See Also:



256
257
258
259
260
261
262
263
# File 'lib/rsmart_toolbox/etl.rb', line 256

def self.parse_float(str, opt={})
  s = parse_string str, opt
  if s.empty?
    return nil;
  else
    return s.to_f
  end
end

.parse_integer(str, opt = {}) ⇒ Integer?

Note:

Note the behavioral difference versus #to_i.

Parse an Integer from a String.

Examples:

Unlike #to_i, nil or empty inputs will return nil by default

nil == parse_integer(nil) && nil == parse_integer('') && 0 != parse_integer(nil)

Parameters:

  • str (String)

    the String to be parsed.

  • opt (Hash) (defaults to: {})

    options Hash will be passed through to #parse_string.

Returns:

  • (Integer, nil)

    the parsed Integer. nil or empty inputs will return nil by default.

See Also:



223
224
225
226
227
228
229
230
# File 'lib/rsmart_toolbox/etl.rb', line 223

def self.parse_integer(str, opt={})
  s = parse_string str, opt
  if s.empty?
    return nil;
  else
    return s.to_i
  end
end

.parse_integer!(row, insert_str, values_str, opt = {}) ⇒ void

This method returns an undefined value.

Helper method which finds the value by column :name and mutates the SQL statement accordingly.

Parameters:

  • row (CSV::Row)

    the CSV Row being parsed

  • insert_str (String)

    the left side of the insert statement (i.e. columns)

  • values_str (String)

    the right side of the insert statement (i.e. values)

  • opt (Hash) (defaults to: {})

    options Hash will be passed through to #parse_integer.

Options Hash (opt):

  • :name (String)

    the name of the field being parsed. Required.

Raises:

  • (ArgumentError)

    :name is required.

See Also:



242
243
244
245
246
# File 'lib/rsmart_toolbox/etl.rb', line 242

def self.parse_integer!(row, insert_str, values_str, opt={})
  raise ArgumentError, "opt[:name] is required!" unless opt[:name]
  i = parse_integer( row[ to_symbol( opt[:name] ) ], opt )
  mutate_sql_stmt! insert_str, opt[:name], values_str, i
end

.parse_string(str, opt = { strict: true, required: false, escape_single_quotes: true }) ⇒ String

Parses a string using common parsing behavior with options. This method forms the foundation of all the other parsing methods.

Examples:

nil or empty inputs will return the empty String by default

'' == parse_string(nil) && '' == parse_string('')

Parameters:

  • str (String)

    the String to be parsed.

  • opt (Hash) (defaults to: { strict: true, required: false, escape_single_quotes: true })

    a customizable set of options

Options Hash (opt):

  • :default (String, #to_s)

    the default return value if str is empty. Must respond to #to_s

  • :escape_single_quotes (Boolean)

    escape single quote characters.

  • :length (Integer)

    raise a TextParseError if str.length > :length.

  • :name (String)

    the name of the field being parsed. Used only for error handling.

  • :required (Boolean)

    raise a TextParseError if str is empty.

  • :strict (Boolean)

    strict length checking will produce errors instead of warnings.

  • :valid_values (Array<Object>, Regexp)

    all of the possible valid values.

Returns:

  • (String)

    the parsed results. nil or empty inputs will return the empty String by default(i.e. ”).

Raises:

See Also:



173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
# File 'lib/rsmart_toolbox/etl.rb', line 173

def self.parse_string(str, opt={ strict: true, required: false, escape_single_quotes: true })
  opt[:strict] = true if opt[:strict].nil?
  opt[:escape_single_quotes] = true if opt[:escape_single_quotes].nil?
  retval = encode str.to_s.strip
  if opt[:required] && retval.empty?
    raise Rsmart::ETL::error TextParseError.new "Required data element '#{opt[:name]}' not found: '#{str}'"
  end
  if opt[:default] && retval.empty?
    retval = opt[:default].to_s
  end
  if opt[:length] && retval.length > opt[:length].to_i
    detail = "#{opt[:name]}.length > #{opt[:length]}: '#{str}'-->'#{str[0..(opt[:length] - 1)]}'"
    if opt[:strict]
      raise Rsmart::ETL::error TextParseError.new "Data exceeds maximum field length: #{detail}"
    end
    Rsmart::ETL::warning "Data will be truncated: #{detail}"
  end
  if opt[:valid_values] && ! valid_value(retval, opt[:valid_values], opt)
    raise Rsmart::ETL::error TextParseError.new "Illegal #{opt[:name]}: value '#{str}' not found in: #{opt[:valid_values]}"
  end
  if opt[:escape_single_quotes]
    retval = escape_single_quotes retval
  end
  return retval
end

.parse_string!(row, insert_str, values_str, opt = {}) ⇒ void

This method returns an undefined value.

Helper method which finds the value by column :name and mutates the SQL statement accordingly.

Parameters:

  • row (CSV::Row)

    the CSV Row being parsed

  • insert_str (String)

    the left side of the insert statement (i.e. columns)

  • values_str (String)

    the right side of the insert statement (i.e. values)

  • opt (Hash) (defaults to: {})

    options Hash will be passed through to #parse_string.

Options Hash (opt):

  • :name (String)

    the name of the field being parsed. Required.

Raises:

  • (ArgumentError)

    :name is required.

See Also:



209
210
211
212
213
# File 'lib/rsmart_toolbox/etl.rb', line 209

def self.parse_string!(row, insert_str, values_str, opt={})
  raise ArgumentError, "opt[:name] is required!" unless opt[:name]
  str = parse_string( row[ to_symbol( opt[:name] ) ], opt )
  mutate_sql_stmt! insert_str, opt[:name], values_str, str
end

.parse_timestamp(str, opt = { valid_values: /^$|(\d{4}\-\d{2}\-\d{2}){1}\s(\d{2}:\d{2}:\d{2})?/ }) ⇒ String

Parse a SQL timestamp from a String.

Parameters:

  • str (String)

    the String to be parsed.

  • opt (Hash) (defaults to: { valid_values: /^$|(\d{4}\-\d{2}\-\d{2}){1}\s(\d{2}:\d{2}:\d{2})?/ })

    options Hash will be passed through to #parse_string.

Returns:

  • (String)

    the parsed timestamp. nil or empty inputs will return ” by default.

See Also:



322
323
324
# File 'lib/rsmart_toolbox/etl.rb', line 322

def self.parse_timestamp(str, opt={ valid_values: /^$|(\d{4}\-\d{2}\-\d{2}){1}\s(\d{2}:\d{2}:\d{2})?/ })
  return parse_datetime str, opt
end

.parse_timestamp!(row, insert_str, values_str, opt = {}) ⇒ void

This method returns an undefined value.

Helper method for #parse_timestamp which finds the value by column :name and mutates the SQL statement accordingly.

Parameters:

  • row (CSV::Row)

    the CSV Row being parsed

  • insert_str (String)

    the left side of the insert statement (i.e. columns)

  • values_str (String)

    the right side of the insert statement (i.e. values)

  • opt (Hash) (defaults to: {})

    options Hash will be passed through to #parse_timestamp.

Options Hash (opt):

  • :name (String)

    the name of the field being parsed. Required.

Raises:

  • (ArgumentError)

    :name is required.

See Also:



336
337
338
339
340
# File 'lib/rsmart_toolbox/etl.rb', line 336

def self.parse_timestamp!(row, insert_str, values_str, opt={})
  raise ArgumentError, "opt[:name] is required!" unless opt[:name]
  timestamp = parse_datetime( row[ to_symbol( opt[:name] ) ], opt )
  mutate_sql_stmt! insert_str, opt[:name], values_str, timestamp
end

.to_symbol(str) ⇒ Symbol

Matches the MRI CSV specification: The header String is downcased, spaces are replaced with underscores, non-word characters are dropped, and finally to_sym() is called.

Parameters:

  • str (String)

    the String to be symbolized.

Returns:

  • (Symbol)

    String is downcased, spaces are replaced with underscores, non-word characters are dropped

Raises:

  • (ArgumentError)

    if str is nil or empty.



121
122
123
124
# File 'lib/rsmart_toolbox/etl.rb', line 121

def self.to_symbol(str)
  raise ArgumentError, "Illegal symbol name: '#{str}'" if str.nil? || str.empty?
  encode( str.downcase.gsub(/\s+/, "_").gsub(/\W+/, "") ).to_sym
end

.valid_value(subject, valid_values, opt = {}) ⇒ Boolean

Tests whether the subject matches one of the valid values.

Parameters:

  • subject (String, #match)

    used for validity checking.

  • valid_values (Array<Object>, Regexp)

    all of the possible valid values.

  • opt (Hash) (defaults to: {})

    a customizable set of options

Options Hash (opt):

  • :case_sensitive (Boolean)

    performs case sensitive matching

Returns:

  • (Boolean)

    true if the subject matches valid_values. FYI valid_values must respond to #casecmp.

Raises:

  • (ArgumentError)

    if valid_values is nil or empty.

  • (ArgumentError)

    case sensitive matching only works for objects that respond to #casecmp; primarily String objects.



65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
# File 'lib/rsmart_toolbox/etl.rb', line 65

def self.valid_value(subject, valid_values, opt={})
  raise ArgumentError, "valid_values must not be nil!" if valid_values.nil?
  if valid_values.kind_of? Regexp
    return true if subject =~ valid_values
  end
  if valid_values.kind_of? Array
    raise ArgumentError, "valid_values must have at least one element!" unless valid_values.length > 0
    if opt[:case_sensitive] == false # case insensitive comparison requested
      raise ArgumentError, "Object must respond to #casecmp" unless subject.respond_to? 'casecmp'
      valid_values.each do |valid_value|
        return true if valid_value.casecmp(subject) == 0
      end
    end
    return true if valid_values.include? subject # default to == equality
  end
  return false
end

.warning(e) ⇒ Exception

Prepares an Exception for consistent warning handling.

Parameters:

  • e (String, Exception)

    the warning to handle

Returns:

  • (Exception)

    an Exception with a message formatted with $INPUT_LINE_NUMBER.

Raises:

  • (ArgumentError)

    if an invalid argument is passed.



45
46
47
48
49
50
51
52
53
54
# File 'lib/rsmart_toolbox/etl.rb', line 45

def self.warning(e)
  if e.kind_of? String
    # default to TextParseError
    return TextParseError.new "WARN:  Line #{$INPUT_LINE_NUMBER}: #{e}"
  end
  if e.kind_of? Exception
    return e.exception "WARN:  Line #{$INPUT_LINE_NUMBER}: #{e}"
  end
  raise ArgumentError, "Unsupported error type: #{e.class}"
end