Class: ExtCsv

Inherits:
OpenStruct
  • Object
Includes:
Comparable, Enumerable
Defined in:
lib/extcsv.rb

Overview

CSV-like Data processing made easy

(see project page: rubyforge.org/projects/extcsv)

The extcsv package lets you navigate and operate on CSV-like data as easily and comfortably as possible. The main restriction is that the columns must be named, i.e. the first line of a data file has to contain a header with string-like entries.

Data can be read from files, strings, hashes or arrays.
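
The column-oriented view of header-first data can be pictured with plain Ruby. The following is a minimal sketch, not the library's parser: parse_columns is a hypothetical helper, and a semicolon is assumed as the cell separator.

```ruby
# Hypothetical helper (not part of extcsv): turn header-first, CSV-like text
# into a hash that maps each column name to the array of its values.
# All values stay strings.
def parse_columns(text, cellsep = ";")
  lines  = text.split("\n").reject { |line| line.strip.empty? }
  header = lines.first.split(cellsep).map(&:strip)
  rows   = lines.drop(1).map { |line| line.split(cellsep).map(&:strip) }

  columns = {}
  header.each_with_index do |name, i|
    columns[name.to_sym] = rows.map { |row| row[i] }
  end
  columns
end

table = parse_columns("col1;col2\n1;a\n2;b\n")
```

Here table maps :col1 and :col2 to one array of values each, which is the shape the "hash" input mode works with.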

Have a look at my other projects for correlation and spectral filtering.

Author: Ralf Mueller

License: BSD - see license file

Constant Summary

VERSION = '0.12.2'

TYPES = %w{csv ssv tsv psv bsv txt plain}

Allowed data types.

MODES = %w{file hash array string}

Allowed input modes; db and url are not supported yet.

DOUBLE_COLUMNS = {}

Column names from different file types that have the same meaning.

METADATA = %w{mode datatype datacolumns cellsep rowsep filename filemtime}

Non-data fields.

ShunkSize = 65536

Chunk size for handling large objects with MRI.

Constructor Details

#initialize(mode, datatype, params) ⇒ ExtCsv

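mode must be one of the allowed MODES; datatype must be one of the TYPES. params carries the data source: the file name for "file", the raw text for "string", a hash of columns for "hash", or an array of [name, value, …] arrays for "array".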

Example

ExtCsv.new("file","txt","Data.txt")
ExtCsv.new("file","csv","Ergebniss.csv")


# File 'lib/extcsv.rb', line 62

def initialize(mode, datatype, params)
  obj_hash               = {}
  obj_hash[:mode]        = mode
  obj_hash[:datatype]    = datatype
  obj_hash[:datacolumns] = []

  if not MODES.include?(mode) or not TYPES.include?(datatype)
    puts "use '#{MODES.join("','")}' for first " +
         "and '#{TYPES.join(",")}' for second parameter " +
         "datatype was '#{datatype}', mode was '#{mode}'"
    raise 
  end

  # Grep data from the given source, e.g. database or file
  case obj_hash[:mode]
  when "string"
    set_separators(obj_hash)
    parse_content(params,obj_hash)
  when "file"
    if File.exist?(params)
      obj_hash[:filename] = params
    else
      $stdout << "The input file '#{params}' cannot be found!\n"
      $stdout << "Please check path and filename." << "\n"
      return
    end
    obj_hash[:filemtime] = File.mtime(obj_hash[:filename]).strftime("%Y-%m-%d %H:%M:%S")
    set_separators(obj_hash)
    parse_content(IO.read(obj_hash[:filename]),obj_hash)
  when "hash"
    obj_hash = params
    # update the metacolumns
    #test $stdout << obj_hash.keys.join("\t")
    obj_hash[:datacolumns] = (obj_hash.keys.collect {|dc| dc.to_s} - METADATA)
  when "array"
    params.each {|v|
      key = v[0]
      obj_hash[:datacolumns] << key
      obj_hash[key] = v[1..-1]
    }
  end
  super(obj_hash)
end
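
The input shape expected by the "array" mode can be read off the last branch of the constructor. The sketch below repeats that branch on a plain Hash so that it runs without the library; columns_from_array is a hypothetical name used only for illustration.

```ruby
# Mirrors the "array" branch of the constructor on a plain Hash: the first
# element of each inner array is the column name, the remaining elements are
# that column's values.
def columns_from_array(params)
  table = { datacolumns: [] }
  params.each do |column|
    key = column[0]
    table[:datacolumns] << key
    table[key] = column[1..-1]
  end
  table
end

table = columns_from_array([["col1", "1", "2"], ["col2", "a", "b"]])
```

With the library itself, the same nested array would be passed as the third argument of ExtCsv.new("array", "plain", …).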

Class Method Details

.combine(obj, obj_ = nil) ⇒ Object



# File 'lib/extcsv.rb', line 700

def ExtCsv.combine(obj, obj_=nil)
  obj.combine(obj_)
end

.concat(*ary_of_objs) ⇒ Object

The objects in ary_of_objs are glued together into a new ExtCsv object. They should all have the same datatype.

TODO: if at least two objects have different columns, the composite object should have empty values in the corresponding datasets. So be careful with this version of concat!



# File 'lib/extcsv.rb', line 685

def ExtCsv.concat(*ary_of_objs)
  return unless ary_of_objs.collect{|obj| obj.datatype}.uniq.size == 1
  ary_of_objs.flatten! if ary_of_objs[0].kind_of?(Array)
  new_obj_hash = {}
  ary_of_objs.each {|obj|
    obj.to_hash.each {|k,v|
      new_obj_hash[k] = v.class.new unless new_obj_hash[k].kind_of?(v.class)
      new_obj_hash[k] += v 
    }
  }
  new_obj_hash[:filename] = ary_of_objs.collect{|td| td.filename}
  new_obj_hash[:filemtime] = ary_of_objs.collect{|td| td.filemtime}
  ExtCsv.new("hash","plain",new_obj_hash)
end
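
The merging loop of concat, and the caveat from the TODO, can be tried on plain hashes. This is a sketch under the assumption that each object is represented by a hash of column arrays; concat_tables is a hypothetical stand-in, not the library method.

```ruby
# Sketch of the merging loop in ExtCsv.concat on plain hashes: values of the
# same key are appended with +, so column arrays grow row-wise.
def concat_tables(*tables)
  result = {}
  tables.each do |table|
    table.each do |key, value|
      result[key] = value.class.new unless result[key].kind_of?(value.class)
      result[key] += value
    end
  end
  result
end

a = { col1: ["1", "2"], col2: ["a", "b"] }
b = { col1: ["3"],      col2: ["c"] }
merged = concat_tables(a, b)

# Caveat from the TODO: a table without col2 leaves that column shorter.
ragged = concat_tables(a, { col1: ["4"] })
```

In ragged, col1 has three entries while col2 has only two, so the rows no longer line up.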

Instance Method Details

#& ⇒ Object



# File 'lib/extcsv.rb', line 678

def combine(other)
  return self unless other.kind_of?(self.class)
  1.times do 
    warn "Both object should have the same number of datasets to be combined"
    warn "Size of first Object (#{filename}): #{rsize}"
    warn "Size of second Object (#{other.filename}): #{other.rsize}"
    return nil
  end unless rsize == other.rsize
  objects, datatypes =  [self, other],[datatype,other.datatype]
  udatatypes = datatypes.uniq
  # 
  case udatatypes.size
  when 1
    hash = marshal_dump.merge(other.marshal_dump)
  else
    if datatypes.include?("ssv") or datatypes.include?("csv")
      csv_index  = datatypes.index("ssv") || datatypes.index("csv")
      qpol_index = csv_index - 1
      objects[csv_index].modyfy_time_column
      hash = objects[csv_index].marshal_dump.merge(objects[qpol_index].marshal_dump)
      hash[:filename] = []
      hash[:filename] << objects[csv_index].filename << objects[qpol_index].filename
    else
      hash = marshal_dump.merge(other.marshal_dump)
      hash[:filename] = []
      hash[:filename] << other.filename << filename
    end
  end
  # preserving the filenames 
  hash[:filemtime] = [self.filemtime.to_s, other.filemtime.to_s].min
  ExtCsv.new("hash","plain",hash)
end

#+ ⇒ Object



# File 'lib/extcsv.rb', line 643

def concat(other)
  ExtCsv.concat(self,other)
end

#<< ⇒ Object



# File 'lib/extcsv.rb', line 644

def concat(other)
  ExtCsv.concat(self,other)
end

#<=>(other) ⇒ Object



# File 'lib/extcsv.rb', line 614

def <=>(other)
  compare = (self.size <=> other.size)
  #$stdout << compare.to_s << "\n"
  compare = (datacolumns.size <=> other.datacolumns.size) if compare.zero?
  #$stdout << compare.to_s << "\n"# if compare.zero?
  #compare = (self.datasets(* self.datacolumns.sort) <=> other.datasets(* other.dataacolumns.sort)) if compare.zero?
  #$stdout << compare.to_s << "\n"# if compare.zero?
  compare = (to_s.size <=> other.to_s.size) if compare.zero?
  #
  #$stdout << compare.to_s << "\n" if compare.zero?
  compare = (to_s <=> other.to_s) if compare.zero?
  #$stdout << compare.to_s << "\n" if compare.zero?
  #$stdout << "##################################\n"
  compare
end

#[](*argv) ⇒ Object Also known as: slice



# File 'lib/extcsv.rb', line 633

def [](*argv)
  copy = @table.dup
  copy.each {|k,v| copy[k] = (argv.size == 1 and argv[0].kind_of?(Fixnum)) ? [v[*argv]] : v[*argv] if v.kind_of?(Array) }
  ExtCsv.new("hash","plain",copy)
end

#add(name, value) ⇒ Object



# File 'lib/extcsv.rb', line 541

def add(name, value)
  new_ostruct_member(name)
  self.send(name.to_s+"=", value)
  self.datacolumns << name.to_s unless self.datacolumns.include?(name.to_s)
 return
end

#clear ⇒ Object



# File 'lib/extcsv.rb', line 398

def clear
  @table.each {|k,v| @table[k] = [] if v.kind_of?(Array)}
end

#closest_to(key, value) ⇒ Object

Find the dataset whose value for key is closest to the value parameter.



# File 'lib/extcsv.rb', line 340

def closest_to(key, value)
  # try to select directly
  _ret = selectBy(key => value)
  return _ret unless _ret.empty?

  # grabbing for numerics
  # the operation '<=' and '>=' can be left out, because, they would have
  # been matcher before
  _smaller = selectBy(key => " < #{value}")[-1]
  _greater = selectBy(key => " > #{value}")[0]

  _smaller_diff = (_smaller.send(key)[0].to_f - value).abs
  _greater_diff = (_greater.send(key)[0].to_f - value).abs
  return (_smaller_diff < _greater_diff) ? _smaller : _greater
end
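
The underlying idea, picking the entry with the smallest absolute distance to a target, can be sketched without the library. closest_index is a hypothetical helper; unlike closest_to it scans all values instead of selecting the nearest smaller and greater neighbours.

```ruby
# Hypothetical stand-alone version of the idea: return the index of the entry
# whose numeric value is nearest to the target.
def closest_index(values, target)
  values.each_index.min_by { |i| (values[i].to_f - target).abs }
end

depths  = ["0.5", "1.0", "2.5", "4.0"]
nearest = closest_index(depths, 2.2)
```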

#columns(*columns) ⇒ Object



# File 'lib/extcsv.rb', line 393

def columns(*columns)
  h = {}
  columns.each{|col| h[col] = self.send(col)}
  return self.class.new("hash","plain",h)
end

#combine(other) ⇒ Object



# File 'lib/extcsv.rb', line 646

def combine(other)
  return self unless other.kind_of?(self.class)
  1.times do 
    warn "Both object should have the same number of datasets to be combined"
    warn "Size of first Object (#{filename}): #{rsize}"
    warn "Size of second Object (#{other.filename}): #{other.rsize}"
    return nil
  end unless rsize == other.rsize
  objects, datatypes =  [self, other],[datatype,other.datatype]
  udatatypes = datatypes.uniq
  # 
  case udatatypes.size
  when 1
    hash = marshal_dump.merge(other.marshal_dump)
  else
    if datatypes.include?("ssv") or datatypes.include?("csv")
      csv_index  = datatypes.index("ssv") || datatypes.index("csv")
      qpol_index = csv_index - 1
      objects[csv_index].modyfy_time_column
      hash = objects[csv_index].marshal_dump.merge(objects[qpol_index].marshal_dump)
      hash[:filename] = []
      hash[:filename] << objects[csv_index].filename << objects[qpol_index].filename
    else
      hash = marshal_dump.merge(other.marshal_dump)
      hash[:filename] = []
      hash[:filename] << other.filename << filename
    end
  end
  # preserving the filenames 
  hash[:filemtime] = [self.filemtime.to_s, other.filemtime.to_s].min
  ExtCsv.new("hash","plain",hash)
end
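
For two objects of the same datatype, combine boils down to a row-count check followed by Hash#merge. The sketch below shows that branch on plain hashes; combine_tables is a hypothetical name, and the row count is taken from the first column as an assumption of this example.

```ruby
# Sketch of the same-datatype branch of combine on plain hashes: both tables
# must have the same number of rows; on a key clash Hash#merge keeps the value
# of its argument (here: right).
def combine_tables(left, right)
  left_rows  = left.values.first.size
  right_rows = right.values.first.size
  return nil unless left_rows == right_rows

  left.merge(right)
end

a = { time: ["0", "1"], voltage: ["3.1", "3.3"] }
b = { time: ["0", "1"], current: ["0.2", "0.4"] }
wide = combine_tables(a, b)
```

The result has the columns of both inputs side by side, whereas concat appends rows.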

#concat(other) ⇒ Object



# File 'lib/extcsv.rb', line 640

def concat(other)
  ExtCsv.concat(self,other)
end

#datasets(*columns) ⇒ Object

Returns an array of datasets, each consisting of the values of the given columns in the order of these columns, e.g.

[col0_val0,col1_val0,…],…,[col0_valN, col1_valN,…]


# File 'lib/extcsv.rb', line 384

def datasets(*columns)
  retval = [] 

  # preset the selected columns to select
  columns = datacolumns if columns.empty?

  columns.each {|col| retval << @table[col.to_sym]}
  retval.transpose
end
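
The column-to-row conversion rests on Array#transpose. A minimal sketch on a plain hash of columns follows; rows_for is a hypothetical helper used only for illustration.

```ruby
columns = {
  col0: ["1", "2", "3"],
  col1: ["a", "b", "c"]
}

# Pick the columns in the requested order, then transpose columns into rows.
def rows_for(table, *names)
  names.map { |name| table[name.to_sym] }.transpose
end

rows = rows_for(columns, :col1, :col0)
```

Each element of rows is one dataset, with the values ordered like the requested column names.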

#deep_split(columns, retval) ⇒ Object

really perform the splitting necessary for split



# File 'lib/extcsv.rb', line 525

def deep_split(columns, retval)
  case
  when (columns.nil? or columns.empty? or size == 1)
    retval << self
  when (columns.size == 1 and send(columns[0]).uniq.size == 1)
    retval << self
  else
    each_obj(columns[0]) {|obj| obj.deep_split(columns[1..-1], retval)}
  end
end

#diff(other) ⇒ Object



# File 'lib/extcsv.rb', line 601

def diff(other)
  diffdatatype = [self.datatype, other.datatype]
  return diffdatatype unless diffdatatype.uniq.size == 1

  diffdatacolums = self.datacolumns.complement(other.datacolumns)
  return [self.diffdatacolums,other.datacolumns] unless diffdatacolums.empty?
  
  datacolumns.each {|c| 
    diffcolumn = send(c).complement(other.send(c))
    return diffcolumn unless diffcolumn.empty?
  }
end

#each(&block) ⇒ Object

Iteration over datasets containing values of all columns



# File 'lib/extcsv.rb', line 469

def each(&block)
  objects = []
  (0...size).each {|i| objects << selectBy_index([i])}
  objects.each(&block)
end

#each_by(key, sort_uniq = true, &block) ⇒ Object

iterator over different values of key



# File 'lib/extcsv.rb', line 477

def each_by(key,sort_uniq=true, &block)
  if sort_uniq
    send(key).uniq.sort.each(&block)
  else
    send(key).each(&block)
  end
end

#each_obj(key, &block) ⇒ Object

each_obj iterates over the sub-objects of the receiver, one for each distinct value of key.



# File 'lib/extcsv.rb', line 488

def each_obj(key, &block)
  key = key.to_sym
  retval = []
  send(key).sort.uniq.each {|value|
    retval << selectBy(key => value)
  }
  if block_given?
    retval.each(&block)
  else
    retval
  end
end

#empty? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/extcsv.rb', line 401

def empty?
  return true if @table.empty?
  @table.each {|k,v| 
    if ( v.kind_of?(Array) and v == [])
      return true
    end
  }
  false
end

#eql?(other) ⇒ Boolean

Two objects are equal if their data columns have the same values, i.e. compared as floats for numeric data and as strings otherwise.

Returns:

  • (Boolean)


# File 'lib/extcsv.rb', line 591

def eql?(other)
  return false unless ( self.datatype == other.datatype or self.datatype  == other.datatype)

  return false unless self.datacolumns.sort == other.datacolumns.sort

  datacolumns.each {|c| return false unless send(c) == other.send(c) }
  
  return true
end

#globalsize ⇒ Object



# File 'lib/extcsv.rb', line 427

def globalsize
  numberOfRows*numberOfColumns
end

#hash ⇒ Object

has to be defined for using eql? in uniq



# File 'lib/extcsv.rb', line 631

def hash;0;end
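
Why a hash method is needed for uniq can be seen in a small stand-alone sketch. Reading is a hypothetical class, not part of extcsv; it only illustrates how eql? and hash interact.

```ruby
# Hypothetical value class: uniq only calls eql? on objects whose hash values
# collide, so both methods have to be defined consistently.
class Reading
  attr_reader :value

  def initialize(value)
    @value = value
  end

  def eql?(other)
    other.is_a?(Reading) && value == other.value
  end

  # A constant hash (as in ExtCsv#hash) forces every comparison through eql?.
  def hash
    0
  end
end

unique = [Reading.new(1), Reading.new(1), Reading.new(2)].uniq
```

A constant hash value is correct but turns every lookup into a linear series of eql? calls, which can become slow for large collections.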

#index ⇒ Object

Create an auto index



# File 'lib/extcsv.rb', line 193

def index
  (0...rsize).to_a
end

#is_regexp?(pattern, key) ⇒ Boolean

Selection can be made by regular expressions. This method decides which method is used.

Returns:

  • (Boolean)


# File 'lib/extcsv.rb', line 213

def is_regexp?(pattern, key)
  return false unless /(<|<=|>=|>)\s*/.match(pattern).nil?
  case key 
  when "zeit"
    pattern = pattern.gsub(/(-|\.\d)/,'')
  else
    pattern = pattern.gsub(/\.\d/,'')
  end
  pattern != Regexp.escape(pattern)
end
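
The final test in is_regexp? compares the pattern with its escaped form. A short sketch of that check on its own shows why decimal points are stripped first; looks_like_regexp? is a hypothetical name for the last line of the method.

```ruby
# A pattern is treated as a regular expression if escaping changes it, i.e. if
# it contains characters with a special meaning in a Regexp.
def looks_like_regexp?(pattern)
  pattern != Regexp.escape(pattern)
end

plain     = looks_like_regexp?("abc")    # nothing to escape
grouped   = looks_like_regexp?("(4|5)")  # parentheses and | get escaped
decimal   = looks_like_regexp?("4.5")    # the dot gets escaped as well
# Stripping ".<digit>" first keeps plain decimal numbers from counting.
stripped  = looks_like_regexp?("4.5".gsub(/\.\d/, ""))
```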

#numberOfColumns ⇒ Object Also known as: csize



# File 'lib/extcsv.rb', line 422

def numberOfColumns
  datacolumns.size
end

#numberOfRows ⇒ Object Also known as: rsize



# File 'lib/extcsv.rb', line 417

def numberOfRows
  @table[datacolumns[-1].to_sym].size
end

#operate_on(column, operation) ⇒ Object

Perform a change on a copy of the object. column can be any attribute of the object, and operation has to be a string that can be evaluated by the interpreter, e.g. "+ 0.883" or "*Math.sin(#myvar)".



# File 'lib/extcsv.rb', line 452

def operate_on(column, operation)
  self.class.new("hash","plain",deep_copy).operate_on!(column,operation)
end

#operate_on!(column, operation) ⇒ Object

Perform a persistent change on the receiver. Usage is the same as for operate_on.



# File 'lib/extcsv.rb', line 439

def operate_on!(column, operation)
  values = send(column)
  send(column).each_index {|i|
    newval          = eval("#{values[i]} #{operation}")
    send(column)[i] = newval.to_s unless newval.nil?
  }
  self
end
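
What the eval call does to a single column can be reproduced with a plain array. This is a sketch that assumes the operation string comes from trusted code, since eval executes arbitrary Ruby.

```ruby
values    = ["1.0", "2.5"]
operation = "* 2"

# Each value is interpolated in front of the operation string, the resulting
# expression is evaluated as Ruby code, and the result is stored as a string.
# Assumption for this sketch: operation is trusted input.
changed = values.map { |value| eval("#{value} #{operation}").to_s }
```

Because the values are interpolated as text, the column has to contain something that is valid Ruby on the left-hand side of the operation, typically numbers.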

#plot(*args) ⇒ Object



# File 'lib/extcsv.rb', line 704

def plot(*args)
  ExtCsvDiagram.plot(self,*args)
end

#selectBy(selection) ⇒ Object

This function takes a hash parameter whose keys must be names of instance variables, e.g. params =

  • {… => "4", :col2 => "100", :col3 => "80"}

  • {… => /(4|5)/, :col2 => "<500", :col3 => ">=80"}

Searching can be done directly, which uses '==' to match, via regular expressions, or by simple mathematical operations:

  • <

  • <=

  • >

  • >=



# File 'lib/extcsv.rb', line 234

def selectBy(selection)
  operations = %w{<= >= == < > !=}
  type = nil

  # transform selection keys into symbols. This make the further usage
  # a lot easyer and allows to take strings or symbols for columns
  # names
  # ATTENTION: DO NOT MIX THE USAGE OF STRING AND SYMBOLS!
  #   This can lead to a data loss, because e.g. {:k => 4, "k" => 3} will be
  #   transformed into {:k=>3}
  selection.keys {|k| 
    if k.kind_of?(String)
      v                   = selection.delete(k)
      selection[k.to_sym] = v
    end
  }
  vars = selection.keys
  # test for unknown selection variables
  vars.each {|attribute|
    unless @table.has_key?(attribute)
      $stdout << "Object does NOT hav the attribute '#{attribute}'!"
      raise 
    end
  }
  # default is the lookup in the whole array of values for each var
  lookup = (0...@table[vars[0]].size).to_a

  vars.each { |var|
    operation = nil
    value     = nil
    # needle can be a real value, a math. comparision or a regular expression
    needle = selection[var]

    if needle.kind_of?(Numeric)
      operation = "=="
      value     = needle
      type      = :numeric
        #test stdout << needle << " #### #{needle.class} ####\n"
        #test stdout << type.to_s << "\n"
    elsif needle.kind_of?(Regexp)
      operation = Regexp.new(needle)
      type      = :regexp
        #test stdout << needle << " #### #{needle.class} ####\n"
        #test stdout << type.to_s << "\n"
    elsif needle.kind_of?(String)
      if (md = /(#{operations.join("|")})([^=].*)/.match(needle); not md.nil?)
        # separate the operation
        operation = md[1]
        value     = md[2].strip
      else
        operation = '=='
        value     = needle
      end
      if (value == "")
        # value is missing
        $stdout << "value for variable '#{var}' is missing\n"
        raise
      elsif ( (value != "0" and (value.to_f.to_s == value or value.to_i.to_s == value)) or (value == "0") )
        # A: numerical compare
        value = value.to_f
        type      = :numeric
        #test stdout << value << " #### #{value.class} ####\n"
        #test stdout << type.to_s << "\n"
      else
        # B: String-like compare
        # quoted if not allready quoted
        value = "'" + value + "'" unless ( /'(.*[^']?.*)'/.match(value) or /"(.*[^"]?.*)"/.match(value) )
        type      = :string
        #test $stdout << value << " #### #{value.class} ####\n"
        #test $stdout << type.to_s << "\n"
      end
    else
      $stdout << "The Parameter '#{needle}' has the wrong Type. " + 
                 "Please use numeric values, stings or regular expressions (e.g. /(^50$|200)/)\n"
      raise
    end
    #test stdout << "\n NEW VALUE :::::::::::::::\n"
    obj_values  = @table[var]
    size        = @table[var].size
    checkValues = [(0...size).to_a, obj_values].transpose
    if ShunkSize < size
      container = []
      (0..size/ShunkSize).collect {|i|
        checkValues.values_at(*(lookup[i*ShunkSize,ShunkSize]))
      }.each {|v| v.each {|vv| container << vv} }
      checkValues = container
    else
      checkValues = checkValues.values_at(*lookup)
    end

    if operation.kind_of?(Regexp)
      lookup = lookup & checkValues.find_all {|i,v| operation.match(v.to_s)}.transpose[0].to_a
    else
      lookup = lookup & checkValues.find_all {|i,v|
        next if v.nil?
        next if v.empty? if v.respond_to?(:empty?)
        v = "'" + v + "'" if type == :string
        #test $stdout <<[v,operation,value].join(" ") << "\n"
        eval([v,operation,value].join(" "))
      }.transpose[0].to_a
    end
  }
  selectBy_index(lookup) 
end
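
The handling of string needles such as "<500" can be sketched in isolation. The helpers parse_needle and matching_indexes are hypothetical and cover only numeric comparisons; unlike selectBy they use public_send instead of eval and a slightly stricter pattern.

```ruby
# Hypothetical helper: split a needle such as "<500" into operator and value,
# defaulting to == when no operator is given.
def parse_needle(needle)
  operations = %w{<= >= == < > !=}
  md = /\A\s*(#{operations.join("|")})\s*(.+)\z/.match(needle)
  md ? [md[1], md[2].strip] : ["==", needle]
end

# Return the row indexes whose numeric value satisfies the needle.
def matching_indexes(values, needle)
  operation, value = parse_needle(needle)
  values.each_index.select do |i|
    values[i].to_f.public_send(operation, value.to_f)
  end
end

col2     = ["100", "480", "500", "720"]
below    = matching_indexes(col2, "<500")
at_least = matching_indexes(col2, ">=500")
exact    = matching_indexes(col2, "480")
```

The two-character operators are listed before the one-character ones so that "<=500" is not read as "<" followed by "=500".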

#selectBy_index(indexes) ⇒ Object

Do a selection by the index of the dataset inside the receiver. This does not change the receiver.



# File 'lib/extcsv.rb', line 199

def selectBy_index(indexes)
  new_table = {}
  @table.each {|key, value|
    if METADATA.include?(key.to_s) or not value.kind_of?(Array)
      new_table[key] = value
    else
      new_table[key] = value.values_at(*indexes) 
    end
  }
  self.class.new("hash","plain",new_table)
end
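
The row selection itself is a call to Array#values_at per data column. The following sketch shows this on a plain hash; rows_at is a hypothetical helper, and for simplicity it treats every non-array value as metadata.

```ruby
# Sketch on a plain hash: array-valued data columns are reduced to the given
# row indexes, everything else (metadata such as the datatype) is copied
# unchanged. The receiver hash is not modified.
def rows_at(table, indexes)
  table.each_with_object({}) do |(key, value), result|
    result[key] = value.kind_of?(Array) ? value.values_at(*indexes) : value
  end
end

table  = { datatype: "csv", col1: ["a", "b", "c"], col2: ["1", "2", "3"] }
subset = rows_at(table, [0, 2])
```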

#set_column(column, expression) ⇒ Object



# File 'lib/extcsv.rb', line 463

def set_column(column, expression)
  self.class.new("hash","plain",deep_copy).set_column!(column,expression)
end

#set_column!(column, expression) ⇒ Object



# File 'lib/extcsv.rb', line 456

def set_column!(column, expression)
  values = send(column)
  send(column).each_index {|i|
    send(column)[i] = eval(expression).to_s
  }
  self
end

#size ⇒ Object

Different size definitions



# File 'lib/extcsv.rb', line 413

def size
  @table[datacolumns[0].to_sym].size
end

#split(*columns, &block) ⇒ Object

call-seq:
  split(:col0,…,:colN) {|obj| …}
  split(:col0,…,:colN) -> [obj0,…,objM]

split is a multi-key version of each_obj. The receiver is split into sub-objects that have constant values in all given columns,

e.g. obj.split(:kv, :focus) {|little_obj| little_obj.kv == little_kv.uniq}

or

obj.split(:kv, :focus) # => [obj_0,...,obj_N]



# File 'lib/extcsv.rb', line 514

def split(*columns, &block)
  retval = []
  deep_split(columns, retval)
  if block_given?
    retval.each(&block)
  else
    retval
  end
end
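
The recursive splitting by several keys can be illustrated on a plain hash of columns. Both helpers below are hypothetical and only approximate what each_obj and deep_split do with ExtCsv objects.

```ruby
# Sub-tables for each distinct value of one key column (plain-hash sketch).
def group_by_column(table, key)
  table[key].uniq.sort.map do |value|
    indexes = table[key].each_index.select { |i| table[key][i] == value }
    table.transform_values { |column| column.values_at(*indexes) }
  end
end

# Recursively split by the first key, then split every part by the rest.
def split_table(table, keys)
  return [table] if keys.empty?

  group_by_column(table, keys.first).flat_map do |part|
    split_table(part, keys.drop(1))
  end
end

table = {
  kv:    ["10", "10", "20", "20"],
  focus: ["a",  "b",  "a",  "a"],
  dose:  ["1",  "2",  "3",  "4"]
}
parts = split_table(table, [:kv, :focus])
```

Every element of parts has a single value in :kv and a single value in :focus, while the remaining columns keep all matching rows.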

#to_ary ⇒ Object

Array representation of the data



# File 'lib/extcsv.rb', line 549

def to_ary
  @table.to_a
end

#to_file(filename, filetype = "txt") ⇒ Object



# File 'lib/extcsv.rb', line 583

def to_file(filename, filetype="txt")
  File.open(filename,"w") do |f|
    f << to_string(filetype)
  end
end

#to_hash ⇒ Object

hash representation of the data



# File 'lib/extcsv.rb', line 537

def to_hash
  @table
end

#to_string(stype, sort = true) ⇒ Object

String output. See ExtCsvExporter.to_string



# File 'lib/extcsv.rb', line 576

def to_string(stype,sort=true)
  header = sort ? datacolumns.sort : datacolumns
    ExtCsvExporter.new("extcsv",
                          ([header] + 
                             datasets(*header)).transpose
                         ).to_string(stype)
end

#to_texTable(cols, col_align = "c", math = false) ⇒ Object

TeX code for a table with vertical and horizontal lines that contains the values of the given columns



# File 'lib/extcsv.rb', line 555

def to_texTable(cols,col_align="c",math=false)
  hline = '\\hline'
#      tex << '$' + cols.each {|col| col.sub(/(.+)_(.+)/,"\\1_\{\\2\}")}.join("$&$") + '$' + "\\\\\n"
  tex = ''
  tab_align = ''
  cols.size.times { tab_align << '|' + col_align }
  tab_align << '|'
  tex << '\begin{tabular}{' + tab_align + '}' + hline + "\n"
  if math
    tex << '$' + cols.join("$&$").gsub(/(\w+)_(\w+)/,"\\1_\{\\2\}") + '$' + '\\\\' + hline + "\n"
  else 
    tex << cols.join(" & ") + '\\\\' + hline +"\n"
  end
  datasets(cols).each {|dataset|
    tex << dataset.join(" & ") + '\\\\' + hline + "\n"
  }
  tex << '\end{tabular}' + "\n"
  tex
end
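
The shape of the generated TeX can be seen in a stand-alone sketch that takes the header and the rows directly. tex_table is a hypothetical helper, not the library method, and it leaves out the math option.

```ruby
# Hypothetical stand-alone variant: header and rows are passed in directly.
# Every row ends with "\\" followed by "\hline", as in to_texTable.
def tex_table(header, rows, col_align = "c")
  hline     = '\hline'
  tab_align = "|" + ([col_align] * header.size).join("|") + "|"

  lines = []
  lines << '\begin{tabular}{' + tab_align + '}' + hline
  lines << header.join(" & ") + '\\\\' + hline
  rows.each { |row| lines << row.join(" & ") + '\\\\' + hline }
  lines << '\end{tabular}'
  lines.join("\n") + "\n"
end

tex = tex_table(%w{x y}, [%w{1 2}, %w{3 4}])
```

The string tex holds one line for the tabular header, one per row, and the closing line of the environment.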