Class: Jinx::Csv::Joiner

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/jinx/csv/joiner.rb

Overview

Merges two CSV files on common fields.

Constructor Details

#initialize(source, target = nil, output = nil) ⇒ Joiner

Returns a new instance of Joiner.

Parameters:

  • source (String, IO)

    the join source

  • target (String, IO) (defaults to: nil)

    the join target (default stdin)

  • output (String, IO, nil) (defaults to: nil)

    the output file name or device (default stdout)



# File 'lib/jinx/csv/joiner.rb', line 12

def initialize(source, target=nil, output=nil)
  @source = source
  @target = target || STDIN
  @output = output || STDOUT
end

Instance Method Details

#compare(source, target) ⇒ -1, ... (private)

Compares the given source and target buffers. The result is:

  • If source and target are both nil, then nil

  • If source is nil and target is not nil, then 1

  • If target is nil and source is not nil, then -1

  • Otherwise, the pair-wise comparison of the source and target keys, where a nil key value precedes a non-nil value

Parameters:

  • source (#key, nil)

    the source key holder

  • target (#key, nil)

    the target key holder

Returns:

  • (-1, 0, 1, nil)

    the comparison result



# File 'lib/jinx/csv/joiner.rb', line 181

def compare(source, target)
  return target.nil? ? nil : 1 if source.nil?
  return -1 if target.nil?
  source.key.each_with_index do |v1, i|
    v2 = target.key[i]
    next if v1.nil? and v2.nil?
    return -1 if v1.nil?
    return 1 if v2.nil?
    cmp = v1 <=> v2
    return cmp unless cmp == 0
  end
  0
end
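
The nil handling above can be checked in isolation. The following is a minimal sketch that restates the listing as a standalone method; KeyHolder and compare_keys are hypothetical names introduced here for illustration and are not part of Jinx.

```ruby
# Hypothetical stand-in for the joiner's buffer; only #key is needed here.
KeyHolder = Struct.new(:key)

# Same logic as the compare listing, restated as a standalone method.
def compare_keys(source, target)
  # An exhausted side sorts after the other side; two exhausted sides end the merge.
  return target.nil? ? nil : 1 if source.nil?
  return -1 if target.nil?
  source.key.each_with_index do |v1, i|
    v2 = target.key[i]
    next if v1.nil? && v2.nil?
    # A nil key value precedes a non-nil value.
    return -1 if v1.nil?
    return 1 if v2.nil?
    cmp = v1 <=> v2
    return cmp unless cmp == 0
  end
  0
end

a = KeyHolder.new(['smith', 1])
b = KeyHolder.new(['smith', 2])

compare_keys(nil, nil) # nil: both inputs are exhausted
compare_keys(nil, a)   # 1:   only the target has a record left
compare_keys(a, nil)   # -1:  only the source has a record left
compare_keys(a, b)     # -1:  first key values tie, second decides
compare_keys(a, a)     # 0:   keys match
```

A nil result is what stops the while loop in #merge, so it is returned only when both files are exhausted.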

#join(*fields) {|rec| ... } ⇒ Object

Joins the source to the target and writes the output. The source fields used are given by the fields argument, if given. By default, all source fields are used.

The output fields consist of the qualified source fields and all target fields. The output fields are in the following order:

  1. The common fields, in order of occurrence in the source file.

  2. The qualified source-specific fields, in order of occurrence in the source file.

  3. The target-specific fields, in order of occurrence in the target file.

The match is on the common qualified source and target fields. Both files must be sorted in order of the common fields, sequenced by their occurrence in the source header.

If an output argument is given, then the joined record is written to the output. If a block is given, then the block is called on each record prior to writing the record to the output. If the block returns nil, then the record is not written.
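
The output field order described above follows from Ruby's order-preserving array intersection and union. The sketch below reproduces that computation on plain arrays; output_headers and the field names are hypothetical and chosen only for illustration.

```ruby
# Computes the output header order from plain field-name arrays.
# selected plays the role of the optional fields argument to #join.
def output_headers(source_fields, target_fields, selected = [])
  # The qualified source fields: the selection, or all source fields by default.
  qualified = selected.empty? ? source_fields : selected
  # Array#& keeps the order of the receiver, so the common fields appear
  # in source order.
  common = qualified & target_fields
  # Array#| keeps first occurrences, which yields: common fields, then
  # source-specific fields, then target-specific fields.
  common | qualified | target_fields
end

source_fields = %w[id name visit_date]
target_fields = %w[id visit_date weight]

output_headers(source_fields, target_fields)
# => ["id", "visit_date", "name", "weight"]

output_headers(source_fields, target_fields, %w[id name])
# => ["id", "name", "visit_date", "weight"]
```

In the second call visit_date is no longer a qualified source field, so it is not part of the match key and is treated as a target-specific field.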

Parameters:

  • fields (<String>)

    the optional source fields to merge (default is all source fields)

Yields:

  • (rec)

    process the output record and return the record to write

Yield Parameters:

  • rec (FasterCSV::Record)

    the output record



# File 'lib/jinx/csv/joiner.rb', line 40

def join(*fields, &block)
  CsvIO.open(@target) do |tgt|
    CsvIO.open(@source) do |src|
      # all source fields (unordered)
      usflds = src.field_names.to_set
      fields.each do |fld|
        unless usflds.include?(fld) then
          raise ArgumentError.new("CSV join field #{fld} not found in the source file #{@source}.")
        end
      end
      # the qualified source fields (ordered)
      qsflds = fields.empty? ? src.field_names : fields
      tflds = tgt.field_names
      @common = qsflds & tflds
      # The headers consist of the common fields followed by the qualified
      # source-specific fields followed by the target-specific fields.
      hdrs = @common | qsflds | tflds
      CsvIO.open(@output, :mode => 'w', :headers => hdrs) do |out|
        merge(src, tgt, out, &block)
      end
    end
  end
    
end

alias :each :join

#look_ahead(csvio, buf = nil) ⇒ Buffer? (private)

Returns the modified look-ahead, or nil if end of file.

Parameters:

  • csvio (CsvIO)

    the CSV file stream

  • buf (Buffer, nil)

    the look-ahead buffer

Returns:

  • (Buffer, nil)

    the modified look-ahead, or nil if end of file



# File 'lib/jinx/csv/joiner.rb', line 165

def look_ahead(csvio, buf=nil)
  rec = csvio.next || return
  buf ||= Buffer.new
  buf.record = rec
  buf.key = @common.map { |k| rec[k] }
  buf
end

#merge(source, target, output) {|rec| ... } ⇒ Object (private)

Merges the given source into the target as the output. The output headers must be in the order specified by #join.

Parameters:

  • source (CsvIO)

    the source CSV IO

  • target (CsvIO)

    the target CSV IO

  • output (CsvIO)

    the merged output CSV IO

Yields:

  • (rec)

    process the output record and return the record to write

Yield Parameters:

  • rec (FasterCSV::Record)

    the output record



# File 'lib/jinx/csv/joiner.rb', line 79

def merge(source, target, output)
  # the qualified source field accessors
  sflds = source.accessors & output.accessors
  # the target field accessors
  tflds = target.accessors
  # the common fields
  @common = sflds & tflds
  # The target-specific accessors
  trest = tflds - @common
  # The source-specific accessors
  srest = output.accessors - trest - @common
  # The output record
  obuf = Array.new(output.accessors.size)
  # The source/target current/next (key, record) buffers
  # Read the first and second records into the buffers
  sbuf = shift(source)
  tbuf = shift(target)
  # Compare the source and target.
  while cmp = compare(sbuf, tbuf) do
    # Fill the output record in three sections: the common, source and target fields.
    obuf.fill do |i|
      if i < @common.size then
        cmp <= 0 ? sbuf.key[i] : tbuf.key[i]
      elsif i < sflds.size then
        # Only fill the output record with source values if there is a current source
        # record and the target does not precede the source.
        sbuf.record[srest[i - @common.size]] if sbuf and cmp <= 0
      elsif tbuf and cmp >= 0
        # Only fill the output record with target values if there is a current target
        # record and the source does not precede the target.
        tbuf.record[trest[i - sflds.size]]
      end
    end
    orec = block_given? ? yield(obuf) : obuf 
    # Emit the output record.
    output << orec if orec
    # Shift the buffers as necessary.
    ss, ts = shift?(sbuf, tbuf, cmp), shift?(tbuf, sbuf, -cmp)
    sbuf = shift(source, sbuf) if ss
    tbuf = shift(target, tbuf) if ts
  end
end
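
The loop above is a sort-merge full outer join. As a rough illustration of the technique only, the sketch below merges two arrays of hashes that are already sorted by the key fields. It is a deliberate simplification, not the Jinx implementation: merge_sorted is a hypothetical name, each key is assumed to occur at most once per side with no nil key values, and a field missing from one side is simply absent from the row rather than written as an empty cell.

```ruby
# Full outer join of two row lists that are already sorted by key_fields.
# Simplifying assumption: each key occurs at most once per side.
def merge_sorted(source_rows, target_rows, key_fields)
  out = []
  si = ti = 0
  while si < source_rows.size || ti < target_rows.size
    s = source_rows[si]
    t = target_rows[ti]
    # An exhausted side compares as following the other side.
    cmp =
      if s.nil?
        1
      elsif t.nil?
        -1
      else
        s.values_at(*key_fields) <=> t.values_at(*key_fields)
      end
    row = {}
    # Take target values unless the source precedes the target.
    row.merge!(t) if cmp >= 0
    # Take source values unless the target precedes the source.
    row.merge!(s) if cmp <= 0
    out << row
    # Advance whichever side was emitted.
    si += 1 if cmp <= 0
    ti += 1 if cmp >= 0
  end
  out
end

source = [{ 'id' => 1, 'name' => 'a' }, { 'id' => 3, 'name' => 'c' }]
target = [{ 'id' => 1, 'weight' => 10 }, { 'id' => 2, 'weight' => 20 }]

merge_sorted(source, target, ['id'])
# => three rows: id 1 with name and weight, id 2 with weight only,
#    id 3 with name only
```

Unlike this sketch, the listing above decides how to advance with #shift?, which consults the look-ahead record so that repeated keys on one side can be paired with the same record on the other side.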

#shift(csvio, buf = nil) ⇒ Buffer? (private)

Reads the next record into the given buffer, priming the look-ahead on the first call.

Parameters:

  • csvio (CsvIO)

    the open CSV stream to read

  • buf (Buffer, nil)

    the current record buffer

Returns:

  • (Buffer, nil)

    the next current buffer, or nil if end of file



# File 'lib/jinx/csv/joiner.rb', line 148

def shift(csvio, buf=nil)
  if buf then
    return if buf.lookahead.nil?
  else
    # prime the look-ahead
    buf = Buffer.new(nil, nil, look_ahead(csvio))
    return shift(csvio, buf)
  end
  buf.record = buf.lookahead.record
  buf.key = buf.lookahead.key
  buf.lookahead = look_ahead(csvio, buf.lookahead)
  buf
end
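
The interplay of #shift and #look_ahead is easier to follow on an in-memory input. The sketch below mirrors the two listings but reads from an Array used as a queue instead of a CsvIO. PeekBuffer, peek and advance are hypothetical names; the three-member Struct is an assumption based on the call Buffer.new(nil, nil, look_ahead(csvio)), and the order of its first two members is a guess.

```ruby
# Assumed shape of the joiner's Buffer (member order of key/record is a guess).
PeekBuffer = Struct.new(:key, :record, :lookahead)

# Counterpart of look_ahead: reads the next row into buf, or returns nil
# when the input is exhausted. Array#shift returns nil on an empty array.
def peek(rows, key_fields, buf = nil)
  rec = rows.shift || return
  buf ||= PeekBuffer.new
  buf.record = rec
  buf.key = key_fields.map { |k| rec[k] }
  buf
end

# Counterpart of shift: promotes the look-ahead to the current record and
# reads a new look-ahead. Returns nil at end of input.
def advance(rows, key_fields, buf = nil)
  if buf
    return if buf.lookahead.nil?
  else
    # First call: prime the look-ahead, then promote it.
    buf = PeekBuffer.new(nil, nil, peek(rows, key_fields))
    return advance(rows, key_fields, buf)
  end
  buf.record = buf.lookahead.record
  buf.key = buf.lookahead.key
  # The look-ahead buffer object is reused for the next row.
  buf.lookahead = peek(rows, key_fields, buf.lookahead)
  buf
end

rows = [{ 'id' => '1' }, { 'id' => '2' }]
buf = advance(rows, ['id'])      # current key ["1"], look-ahead key ["2"]
buf = advance(rows, ['id'], buf) # current key ["2"], look-ahead nil
buf = advance(rows, ['id'], buf) # nil: end of input
```

Keeping one record of look-ahead is what lets the caller see whether the next record repeats the current key before deciding which side to advance.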

#shift?(buf, other, order) ⇒ Boolean (private)

Returns whether to shift the given buffer as follows:

  • If the buffer precedes the other buffer, then true.

  • If the buffer follows the other buffer, then false.

  • Otherwise, if the lookahead record has the same key as the buffer record, then true.

  • Otherwise, if the other lookahead record has a different key than the other record, or the other buffer has no lookahead record, then true.

  • Otherwise, false.

Parameters:

  • buf (Buffer)

    the record buffer to check

  • other (Buffer)

    the other record buffer

  • order (-1, 0, 1)

    the buffer comparison

Returns:

  • (Boolean)

    whether to shift the buffer



# File 'lib/jinx/csv/joiner.rb', line 132

def shift?(buf, other, order)
  case order
  when -1 then
    true
  when 1 then
    false
  when 0 then
    compare(buf, buf.lookahead) == 0 or compare(other, other.lookahead) != 0
  end
end
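
The equal-key case is the subtle one. The sketch below restates the decision on a reduced buffer that holds only a key and a look-ahead. Slot, same_key? and advance? are hypothetical names, and same_key? uses plain array equality as an approximation of compare(buf, buf.lookahead) == 0.

```ruby
# Hypothetical reduced buffer: the current key plus the look-ahead slot (or nil).
Slot = Struct.new(:key, :lookahead)

# True if the next record on this side repeats the current key.
def same_key?(slot)
  !slot.lookahead.nil? && slot.key == slot.lookahead.key
end

# Mirrors the shift? listing: order is the comparison of slot against other.
def advance?(slot, other, order)
  case order
  when -1 then true
  when 1 then false
  when 0 then same_key?(slot) || !same_key?(other)
  end
end

# The source holds key A twice; the target holds A once, followed by B.
src = Slot.new(['A'], Slot.new(['A']))
tgt = Slot.new(['A'], Slot.new(['B']))

advance?(src, tgt, 0) # true:  the next source record repeats the key
advance?(tgt, src, 0) # false: hold the target for the repeated source key
```

With these inputs the target record stays in place while the source advances, so the second source record with key A is joined to the same target record.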