Class: Nodepile::TabularRecordSource

Inherits:
Object
Includes:
Enumerable
Defined in:
lib/nodepile/rec_source.rb

Overview

Generates “factories” for harvesting tabular data from a source stream or file. Includes facilities for parsing common file formats (CSV/TSV) and for handling common problems encountered when parsing manually created tabular data files, such as: relevant tabular data is not aligned “top-left”; tabular data includes blank or repeated columns; tabular data ends before the end of the file; summary rows appear in the tabular data and need to be ignored.
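As an illustration of the “not aligned top-left” problem, here is a minimal sketch of locating a header row inside a grid of cells. The helper `find_header_row` is hypothetical and is not part of Nodepile; its parameter names are borrowed from the loading guidelines only for readability, and the library's actual search logic may differ.

```ruby
# Hypothetical helper (not part of Nodepile): find the index of the first row
# that contains every mandatory header, tolerating a limited number of junk
# rows above the table.
def find_header_row(rows, mandatory_headers:, allow_leading_skip_rows: 10)
  rows.each_with_index do |row, idx|
    break if idx > allow_leading_skip_rows # too many rows skipped, give up
    cells = row.map { |c| c.to_s.strip }
    return idx if (mandatory_headers - cells).empty?
  end
  nil
end

grid = [
  ["Quarterly report", nil, nil], # title text above the table
  [nil, nil, nil],                # blank row
  [nil, "id", "name"],            # header row, offset one column to the right
  [nil, "1", "alpha"],
]

p find_header_row(grid, mandatory_headers: %w[id name]) # prints 2
```

Note that in this sketch an empty `mandatory_headers` list matches the very first row, which suggests why naming at least one expected header helps when the table is not at the top of the file.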

Constant Summary

DEFAULT_LOADING_GUIDELINES =
{
 mandatory_headers: [], # this can be extremely important to correctly finding tables
 format: :csv||:tsv||:guess, # assume CSV unless told otherwise
 allow_leading_skip_rows: 10, # arbitrary content that may appear before table
 allow_gap_rows: 2||nil,  # entirely blank rows appearing mid-table; nil allows an unlimited number
 allow_gap_columns: 1, # columns which have a blank header within the table
 allow_left_offset:  5, # blank columns allowed left of table
 duplicate_header_rule: :first||:last||:ignore||:rename||:fail, # keep the first
 ignored_header_char: '#', # header names starting with this are just plain ignored
 emit_blank_records: false, # unless true, entirely blank records are not returned
 trim_headers: true, #strip leading and trailing spaces
}.freeze
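The `a||b||c` chains in the hash above can be confusing at first glance. Because symbols and integers are truthy in Ruby, each chain evaluates to its first operand; the remaining operands appear to serve only as inline documentation of the accepted alternatives. The variable names below are illustrative only.

```ruby
# Each chain short-circuits on its first (truthy) operand.
format_default    = :csv || :tsv || :guess
gap_rows_default  = 2 || nil
duplicate_default = :first || :last || :ignore || :rename || :fail

p format_default    # prints :csv
p gap_rows_default  # prints 2
p duplicate_default # prints :first
```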

Instance Method Summary

Constructor Details

#initialize(source, **loading_guidelines) ⇒ TabularRecordSource

Create a new RecordSource intended to read from the specified input and using the parsing strategy specified by the loading guidelines.
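The constructor rejects unknown guideline names and then merges the caller's overrides onto the defaults. A minimal sketch of that validate-then-merge pattern follows. The `DEFAULTS` hash and the `resolve_guidelines` helper are stand-ins invented for illustration, and the sketch raises `ArgumentError` where the library source raises a plain string (a `RuntimeError`).

```ruby
# Stand-in defaults for illustration; the real hash is DEFAULT_LOADING_GUIDELINES.
DEFAULTS = { format: :csv, allow_gap_rows: 2, trim_headers: true }.freeze

# Hypothetical helper mirroring the constructor's validate-then-merge step.
def resolve_guidelines(**overrides)
  unknown = overrides.keys - DEFAULTS.keys
  unless unknown.empty?
    raise ArgumentError, "Unrecognized named parameters #{unknown.inspect}"
  end
  DEFAULTS.merge(overrides).freeze # caller's values win over the defaults
end

p resolve_guidelines(format: :tsv)

begin
  resolve_guidelines(fromat: :tsv) # misspelled key is rejected
rescue ArgumentError => e
  puts e.message
end
```

Failing fast on unrecognized keys means a misspelled guideline is reported immediately instead of being silently ignored.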



# File 'lib/nodepile/rec_source.rb', line 30

def initialize(source,**loading_guidelines)
    (loading_guidelines.keys - DEFAULT_LOADING_GUIDELINES.keys).tap{|x| raise <<~ERRMSG unless x.empty?}
           Unrecognized named parameters used for RecordSource creation #{x.inspect}
          ERRMSG
    @loading_guidelines = DEFAULT_LOADING_GUIDELINES.merge(loading_guidelines).freeze
    raise "The source must be non-nil" if source.nil?
    @source = source # will lazy load
    @is_mid_read = false  # only relevant for non-parallel sources
    @replayable_flag = if @source.is_a?(String) 
                          :parallel # simultaneous each() is okay
                       elsif @source.respond_to?(:rewind) 
                           :single # can't guarantee simultaneous each() safe
                       else
                           nil
                       end
end
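The final branch of the constructor classifies the source by how safely it can be re-read. The same decision can be sketched as a standalone function. The name `replay_mode` is hypothetical; the three outcomes mirror the values assigned to `@replayable_flag` above.

```ruby
require 'stringio'

# Hypothetical helper mirroring how the constructor sets @replayable_flag.
def replay_mode(source)
  if source.is_a?(String)
    :parallel # simultaneous each() calls are okay
  elsif source.respond_to?(:rewind)
    :single   # re-readable, but not safely in parallel
  end         # anything else falls through to nil: read once only
end

p replay_mode("data.csv")          # prints :parallel
p replay_mode(StringIO.new("a,b")) # prints :single
p replay_mode(Object.new)          # prints nil
```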

Instance Method Details

#each {|Array| ... } ⇒ Integer, Enumerator

Yields the “records” of the first “table” encountered in the bound data source, according to the loading guidelines given at construction. The first row yielded is always the header; an error is raised if no header is found. Be aware that, depending on the type of data source used at creation, it may not be possible to rewind the source or to retrieve data in parallel. That said, a filename or String source allows parallel retrieval.

Also note that blank strings are passed through until the specified allow_gap_rows limit is exceeded. This can result in trailing blanks in long files.

Yield Parameters:

  • Array (Array)

    Includes at least two elements. The first is an Array of “fields”; the second is the record number within the source (zero-indexed). Note that if any field contains embedded newlines, the record number is not the same as the line number.

Returns:

  • (Integer, Enumerator)

    Returns enumerator if no block is given. Otherwise returns the count of records yielded excluding the header line.
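The calling convention described above (header first, `[fields, record_number]` pairs, a count returned when a block is given, an Enumerator otherwise) can be sketched with a stand-in class. `FakeRecordSource` is invented for illustration and does none of the real parsing; it only mimics the documented contract of `#each`.

```ruby
# Stand-in for illustration only; it is NOT Nodepile::TabularRecordSource.
class FakeRecordSource
  include Enumerable

  def initialize(rows)
    @rows = rows # first row is treated as the header
  end

  # Yields [fields, record_number]; the header comes first.
  def each
    return enum_for(:each) unless block_given?
    @rows.each_with_index { |fields, recno| yield [fields, recno] }
    @rows.size - 1 # count excludes the header
  end
end

src = FakeRecordSource.new([%w[id name], %w[1 alpha], %w[2 beta]])

# With a block: records are yielded and the count is returned.
count = src.each { |fields, recno| puts "#{recno}: #{fields.inspect}" }
puts count # prints 2

# Without a block: an Enumerator, so the header can be split off.
header, *records = src.each.to_a
p header        # prints [["id", "name"], 0]
p records.size  # prints 2
```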



# File 'lib/nodepile/rec_source.rb', line 66

def each(&block)
    return enum_for(:each) unless block_given?
    raise "This data source type may only be read once." if @source.nil?
    raise <<~ERRMSG if @is_mid_read && @replayable_flag != :parallel
        For this type of data source, you may not read simultaneously.
       ERRMSG
    @is_mid_read = true 
    scanner = self.class._make_record_stream(@source,format: @loading_guidelines[:format])
    scanner = self.class._reposition_to_header_rec(scanner,@loading_guidelines)
    raw_header,header_pos = scanner.next
    header_range = self.class._calc_header_range(raw_header,@loading_guidelines[:allow_gap_columns])
    # process the header line to create a "mask"
    yield [raw_header[header_range],header_pos]  # return the trimmed header
    rec_count = self.class._emit_rows(scanner,header_range,
                                      @loading_guidelines[:emit_blank_records],
                                      trim_headers: @loading_guidelines[:trim_headers],
                                      tolerate_blanks: @loading_guidelines[:allow_gap_rows],
                                      &block
                                     )
    @is_mid_read = false
    @source = nil if @replayable_flag.nil? # release resources
    return rec_count 
end