Class: Nodepile::TabularRecordSource
- Inherits:
- Object
- Nodepile::TabularRecordSource
- Includes:
- Enumerable
- Defined in:
- lib/nodepile/rec_source.rb
Overview
Generates “Factories” for harvesting tabular data from a source stream or file. Includes facilities for parsing common file formats (CSV/TSV) and for handling common problems encountered when parsing manually created tabular data files, such as: relevant tabular data is not aligned “top-left”; tabular data includes blank or repeated columns; tabular data ends before the end of the file; summary rows appear in the tabular data and need to be ignored.
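As a rough illustration of the “not aligned top-left” problem, the sketch below scans the leading rows of already-parsed data for a row that contains every mandatory header. The method name `locate_header` and its parameters are hypothetical and are not part of Nodepile's API; the library's own search logic may well differ.

```ruby
# Hypothetical helper, not part of Nodepile: find the first row (within a
# limited number of leading rows) that contains all mandatory headers.
# Returns [row_index, left_offset], or nil when no such row is found.
def locate_header(rows, mandatory_headers:, allow_leading_skip_rows: 10)
  rows.first(allow_leading_skip_rows + 1).each_with_index do |row, idx|
    cells = row.map { |c| c.to_s.strip }
    next unless (mandatory_headers - cells).empty?
    # The left offset is the number of blank cells before the first header.
    offset = cells.index { |c| !c.empty? }
    return [idx, offset]
  end
  nil
end

rows = [
  ["Quarterly report", nil, nil, nil], # arbitrary content above the table
  [nil, nil, nil, nil],
  [nil, "id", "name", "score"],        # header row, shifted one column right
  [nil, "1", "Ada", "93"],
]
p locate_header(rows, mandatory_headers: ["id", "name"]) # => [2, 1]
```

Supplying mandatory headers is what lets a search like this skip the title row; with an empty list, the very first row would be accepted.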
Constant Summary
- DEFAULT_LOADING_GUIDELINES =
{
  mandatory_headers: [], # this can be extremely important to correctly finding tables
  format: :csv||:tsv||:guess, # assume CSV unless told otherwise
  allow_leading_skip_rows: 10, # arbitrary content that may appear before table
  allow_gap_rows: 2||nil, # entirely blank rows appearing mid-table; nil indicates allow infinite
  allow_gap_columns: 1, # columns which have a blank header within the table
  allow_left_offset: 5, # blank columns allowed left of table
  duplicate_header_rule: :first||:last||:ignore||:rename||:fail, # keep the first
  ignored_header_char: '#', # header names starting with this are just plain ignored
  emit_blank_records: false, # unless true, entirely blank records are not returned
  trim_headers: true, # strip leading and trailing spaces
}.freeze
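Several of these defaults are written as `a||b||c`. In Ruby, `||` returns its first truthy operand, so the hash stores only the first alternative; the remaining alternatives appear to serve as in-line documentation of other accepted values. A small self-contained check, using an abbreviated copy of the hash made for illustration only:

```ruby
# Abbreviated copy of three entries, for illustration only.
defaults = {
  format: :csv||:tsv||:guess,
  allow_gap_rows: 2||nil,
  duplicate_header_rule: :first||:last||:ignore||:rename||:fail,
}.freeze

p defaults[:format]                 # => :csv
p defaults[:allow_gap_rows]         # => 2
p defaults[:duplicate_header_rule]  # => :first
```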
Instance Method Summary
- #each {|Array| ... } ⇒ Integer, Enumerator
  Yields the “records” of the first “table” encountered in the bound data source according to the parameters it was given.
- #initialize(source, **loading_guidelines) ⇒ TabularRecordSource (constructor)
  Create a new RecordSource intended to read from the specified input and using the parsing strategy specified by the loading guidelines.
Constructor Details
#initialize(source, **loading_guidelines) ⇒ TabularRecordSource
Create a new RecordSource intended to read from the specified input and using the parsing strategy specified by the loading guidelines.
# File 'lib/nodepile/rec_source.rb', line 30
def initialize(source,**loading_guidelines)
  (loading_guidelines.keys - DEFAULT_LOADING_GUIDELINES.keys).tap{|x| raise <<~ERRMSG unless x.empty?}
    Unrecognized named parameters used for RecordSource creation #{x.inspect}
  ERRMSG
  @loading_guidelines = DEFAULT_LOADING_GUIDELINES.merge(loading_guidelines).freeze
  raise "The source must be non-nil" if source.nil?
  @source = source # will lazy load
  @is_mid_read = false # only relevant for non-parallel sources
  @replayable_flag = if @source.is_a?(String)
      :parallel # simultaneous each() is okay
    elsif @source.respond_to?(:rewind)
      :single # can't guarantee simultaneous each() safe
    else
      nil
    end
end
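The constructor does two things that are easy to try in isolation: it rejects unknown guideline keys before merging overrides onto the defaults, and it classifies the source by how safely it can be re-read. The sketch below mirrors both steps outside the class. The names `resolve_guidelines`, `replayable_flag`, and `sketch_defaults` are hypothetical and exist only for this example.

```ruby
require "stringio"

# Hypothetical, trimmed-down defaults used only for this sketch.
sketch_defaults = { format: :csv, allow_gap_rows: 2, trim_headers: true }.freeze

# Reject unknown keys, then layer caller overrides on top of the defaults.
def resolve_guidelines(defaults, **overrides)
  unknown = overrides.keys - defaults.keys
  raise "Unrecognized named parameters: #{unknown.inspect}" unless unknown.empty?
  defaults.merge(overrides).freeze
end

# Mirror of the constructor's three-way classification of sources.
def replayable_flag(source)
  if source.is_a?(String)
    :parallel   # simultaneous each() is okay
  elsif source.respond_to?(:rewind)
    :single     # re-readable, but not simultaneously
  end           # otherwise nil: the source can be read only once
end

p resolve_guidelines(sketch_defaults, format: :tsv)
p replayable_flag("data.csv")                    # => :parallel
p replayable_flag(StringIO.new("a,b\n1,2\n"))    # => :single
p replayable_flag(Object.new)                    # => nil
```

Freezing the merged hash, as the original does, keeps the parsing strategy fixed for the lifetime of the object.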
Instance Method Details
#each {|Array| ... } ⇒ Integer, Enumerator
Yields the “records” of the first “table” encountered in the bound data source according to the parameters it was given. The first row yielded is always the header, and an error is raised if no header is found. Beware: depending on the type of data source supplied at creation, it may not be possible to rewind or to retrieve data in parallel. A filename or other String source allows parallel retrieval; a source that responds to rewind can be re-read but not simultaneously; any other source can be read only once.
Also note that blank strings will be passed through until the specified allow_gap_rows are exceeded. This can mean trailing blanks in long files.
# File 'lib/nodepile/rec_source.rb', line 66
def each(&block)
  return enum_for(:each) unless block_given?
  raise "This data source type may only be read once." if @source.nil?
  raise <<~ERRMSG if @is_mid_read && @replayable_flag != :parallel
    For this type of data source, you may not read simultaneously.
  ERRMSG
  @is_mid_read = true
  scanner = self.class._make_record_stream(@source,format: @loading_guidelines[:format])
  scanner = self.class._reposition_to_header_rec(scanner,@loading_guidelines)
  raw_header,header_pos = scanner.next
  header_range = self.class._calc_header_range(raw_header,@loading_guidelines[:allow_gap_columns])
  # process the header line to create a "mask"
  yield [raw_header[header_range],header_pos] # return the trimmed header
  rec_count = self.class._emit_rows(scanner,header_range,
                  @loading_guidelines[:emit_blank_records],
                  trim_headers: @loading_guidelines[:trim_headers],
                  tolerate_blanks: @loading_guidelines[:allow_gap_rows],
                  &block
              )
  @is_mid_read = false
  @source = nil if @replayable_flag.nil? # release resources
  return rec_count
end
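The `Integer, Enumerator` return type follows from a common Ruby idiom: with a block, `each` iterates and returns a count; without one, it returns an Enumerator via `enum_for`. The miniature class below is a hypothetical stand-in that shows this contract on in-memory rows. It does not reproduce Nodepile's header search or gap handling, it yields bare rows rather than the record/position pairs seen in the source above, and its choice to exclude the header from the count is an assumption.

```ruby
# Hypothetical miniature of the each/Enumerable contract; the class name
# and its in-memory rows are assumptions made for illustration.
class TinyRecordSource
  include Enumerable

  def initialize(header, rows)
    @header = header
    @rows = rows
  end

  # Yields the header first, then each row; returns the number of data rows.
  def each
    return enum_for(:each) unless block_given?
    yield @header
    @rows.each { |row| yield row }
    @rows.size
  end
end

src = TinyRecordSource.new(%w[id name], [%w[1 Ada], %w[2 Lin]])
p src.each.class      # => Enumerator
p src.first           # => ["id", "name"]
p src.each { |r| r }  # => 2
```

Because the class includes Enumerable, methods such as `first`, `map`, and `to_a` come for free once `each` is defined.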