Class: DataHut::DataWarehouse
- Inherits: Object
  - Object
  - DataHut::DataWarehouse
- Defined in: lib/data_hut/data_warehouse.rb
Overview
DataHut::DataWarehouse handles the heavy lifting of creating a data system for your analytics. During the extract and transform phases you don’t have to worry about the schema or the data types you’ll be using. Just start scraping and experimenting with the data extraction; DataHut introspects your final data records and creates or alters the DataHut schema for you automatically.
Class Method Summary
- .connect(name) ⇒ DataHut::DataWarehouse
  creates or opens an existing connection to a DataHut data store.
Instance Method Summary
- #dataset ⇒ Sequel::Model
  access the DataHut dataset.
- #extract(data) {|record, element| ... } ⇒ Object
  used to extract data from whatever source you wish.
- #fetch_meta(key) ⇒ Object
  retrieves previously stored metadata by key.
- #logger=(logger) ⇒ Object
  attach a Logger to the underlying Sequel database so that you can debug or monitor database actions.
- #store_meta(key, value) ⇒ Object
  stores metadata.
- #transform(forced = false) {|record| ... } ⇒ Object
  used to transform data already extracted into a DataHut.
- #transform_complete ⇒ Object
  marks all the records in the DataHut as ‘processed’.
Class Method Details
.connect(name) ⇒ DataHut::DataWarehouse
creates or opens an existing connection to a DataHut data store.
    # File 'lib/data_hut/data_warehouse.rb', line 54
    def self.connect(name)
      new(name)
    end
Instance Method Details
#dataset ⇒ Sequel::Model
access the DataHut dataset. See Sequel::Dataset for available operations on the dataset.
    # File 'lib/data_hut/data_warehouse.rb', line 62
    def dataset
      Class.new(Sequel::Model(@db[:data_warehouse]))
    end
#extract(data) {|record, element| ... } ⇒ Object
used to extract data from whatever source you wish. As long as the data forms an enumerable collection, you can pass it to extract along with a block that specifies how each DataHut record is mapped from the corresponding source element of the collection.
    # File 'lib/data_hut/data_warehouse.rb', line 85
    def extract(data)
      raise(ArgumentError, "a block is required for extract.", caller) unless block_given?

      data.each do |d|
        r = OpenStruct.new
        yield r, d
        store(r)
      end
    end
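The extract flow above can be sketched without a database. This is a minimal illustration, assuming a hypothetical `SketchWarehouse` class that appends records to an array where the real class calls its `store` method. The sample data and field names are also made up.

```ruby
require 'ostruct'

# Hypothetical in-memory stand-in for the warehouse; the real class persists
# each record through Sequel instead of appending to an array.
class SketchWarehouse
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Mirrors the documented flow: one OpenStruct per element, filled in by the block.
  def extract(data)
    raise ArgumentError, "a block is required for extract." unless block_given?

    data.each do |element|
      record = OpenStruct.new
      yield record, element
      @rows << record.to_h
    end
  end
end

# Any enumerable works as the source; here it is an array of hashes.
source = [
  { "name" => "alice", "age" => "31" },
  { "name" => "bob",   "age" => "27" }
]

dh = SketchWarehouse.new
dh.extract(source) do |r, e|
  r.name = e["name"]
  r.age  = e["age"].to_i   # convert while extracting so the stored field is an Integer
end
```

The block receives the empty record first and the source element second, so every field assigned on `r` becomes a field of the stored record.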
#fetch_meta(key) ⇒ Object
retrieves previously stored metadata by key. Returns nil if nothing is stored under the key.
    # File 'lib/data_hut/data_warehouse.rb', line 188
    def fetch_meta(key)
      key = key.to_s if key.instance_of?(Symbol)
      begin
        r = @db[:data_warehouse_meta].where(key: key).first
        value = r[:value] unless r.nil?
        value = Marshal.load(value) unless value.nil?
      rescue Exception => e
        raise(ArgumentError, "DataHut: unable to fetch metadata key #{key}.", caller)
      end
      value
    end
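The lookup rules in the source above (symbol keys are converted to strings, missing keys yield nil, stored values are passed through Marshal.load) can be tried in isolation. This is a rough sketch, assuming a plain Hash as a stand-in for the metadata table; `fetch_meta_sketch` and the sample key are hypothetical names.

```ruby
# Hypothetical stand-in for the metadata table: string keys mapped to
# Marshal-serialized values.
meta_table = {
  "last_run" => Marshal.dump({ pages: 12, ok: true })
}

# Mirrors the documented lookup: symbols become strings, missing keys give nil.
def fetch_meta_sketch(table, key)
  key = key.to_s if key.instance_of?(Symbol)
  raw = table[key]
  raw.nil? ? nil : Marshal.load(raw)
end

by_symbol = fetch_meta_sketch(meta_table, :last_run)      # symbol key
by_string = fetch_meta_sketch(meta_table, "last_run")     # equivalent string key
missing   = fetch_meta_sketch(meta_table, :never_stored)  # no such key
```

Because a missing key gives nil, a caller can supply a default with an expression such as `fetch_meta_sketch(meta_table, :cursor) || 0`.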
#logger=(logger) ⇒ Object
attach a Logger to the underlying Sequel database so that you can debug or monitor database actions. See Sequel::Database#logger=.
    # File 'lib/data_hut/data_warehouse.rb', line 159
    def logger=(logger)
      raise(ArgumentError, "logger must be a type of Logger.") unless logger.kind_of?(Logger)
      @db.logger = logger
    end
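The type check in the setter accepts any Logger instance and rejects everything else. The sketch below illustrates this with a hypothetical `LoggerHolder` class that applies the same check but simply keeps the logger instead of handing it to a Sequel database.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in that applies the same type check as the documented setter.
class LoggerHolder
  attr_reader :logger

  def logger=(logger)
    raise(ArgumentError, "logger must be a type of Logger.") unless logger.kind_of?(Logger)
    @logger = logger
  end
end

holder = LoggerHolder.new
sink = StringIO.new                 # capture log output in memory for the example
holder.logger = Logger.new(sink)    # accepted: a Logger instance
holder.logger.info("connected")

rejected = begin
  holder.logger = "not a logger"    # rejected: a String is not a Logger
  false
rescue ArgumentError
  true
end
```

In everyday use the argument would more likely be something like `Logger.new($stdout)`, so that database activity appears on the console.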
#store_meta(key, value) ⇒ Object
stores metadata under the given key, replacing any value already stored there.
    # File 'lib/data_hut/data_warehouse.rb', line 170
    def store_meta(key, value)
      key = key.to_s if key.instance_of?(Symbol)
      begin
        value = Sequel::SQL::Blob.new(Marshal.dump(value))
        if (@db[:data_warehouse_meta].where(key: key).count > 0)
          @db[:data_warehouse_meta].where(key: key).update(value: value)
        else
          @db[:data_warehouse_meta].insert(key: key, value: value)
        end
      rescue Exception => e
        raise(ArgumentError, "DataHut: unable to store metadata value #{value.inspect}.", caller)
      end
    end
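The update-or-insert behaviour above means that storing twice under the same key keeps only the latest value. Here is a small sketch, assuming a Hash in place of the metadata table; `store_meta_sketch` and the `cursor` key are hypothetical.

```ruby
# Mirrors the documented update-or-insert flow against a Hash that stands in
# for the metadata table (an assumption for illustration).
def store_meta_sketch(table, key, value)
  key = key.to_s if key.instance_of?(Symbol)
  table[key] = Marshal.dump(value)   # Hash assignment covers both insert and update
end

table = {}
store_meta_sketch(table, :cursor, { page: 1 })
store_meta_sketch(table, "cursor", { page: 2 })   # same key as :cursor, so it overwrites

restored = Marshal.load(table["cursor"])
```

Since values go through Marshal.dump, only objects that Marshal can serialize are usable as metadata. Procs and IO objects, for example, cannot be dumped.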
#transform(forced = false) {|record| ... } ⇒ Object
used to transform data already extracted into a DataHut. You can also use transform to create new synthetic data fields from existing fields. You may create as many transform blocks (i.e. ‘passes’) as you like.
    # File 'lib/data_hut/data_warehouse.rb', line 113
    def transform(forced=false)
      raise(ArgumentError, "a block is required for transform.", caller) unless block_given?

      # now process all the records with the updated schema...
      @db[:data_warehouse].each do |h|
        # check for processed if not forced
        unless forced
          next if h[:dw_processed] == true
        end
        # then get rid of the internal id and processed flags
        dw_id = h.delete(:dw_id)
        h.delete(:dw_processed)
        # copy record fields to an openstruct
        r = OpenStruct.new(h)
        # and let the transformer modify it...
        yield r
        # now add any new transformation fields to the schema...
        adapt_schema(r)
        # get the update hash from the openstruct
        h = r.marshal_dump
        # and use it to update the record
        @db[:data_warehouse].where(dw_id: dw_id).update(h)
      end
    end
#transform_complete ⇒ Object
marks all the records in the DataHut as ‘processed’. Useful as the last command in a sequence of extract and transform passes.
    # File 'lib/data_hut/data_warehouse.rb', line 147
    def transform_complete
      @db[:data_warehouse].update(:dw_processed => true)
    end
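The way transform, its forced argument, and transform_complete interact through the dw_processed flag can be sketched with in-memory rows. This is only an illustration: `transform_sketch`, `transform_complete_sketch`, and the sample rows are hypothetical, and the schema adaptation step of the real method is left out.

```ruby
require 'ostruct'

# Hypothetical in-memory rows standing in for the warehouse table.
rows = [
  { dw_id: 1, dw_processed: true,  name: "alice", age: 31 },
  { dw_id: 2, dw_processed: false, name: "bob",   age: 27 }
]

# Mirrors the documented pass: skip processed rows unless forced, expose the
# remaining fields as an OpenStruct, then merge the block's changes back.
def transform_sketch(rows, forced = false)
  rows.each do |row|
    next if !forced && row[:dw_processed] == true

    fields = row.reject { |k, _| k == :dw_id || k == :dw_processed }
    record = OpenStruct.new(fields)
    yield record
    row.merge!(record.to_h)
  end
end

# Stand-in for transform_complete: flag every row as processed.
def transform_complete_sketch(rows)
  rows.each { |row| row[:dw_processed] = true }
end

transform_sketch(rows) { |r| r.adult = r.age >= 18 }         # only the unprocessed row
transform_complete_sketch(rows)
transform_sketch(rows) { |r| r.name = r.name.upcase }        # every row is skipped
transform_sketch(rows, true) { |r| r.decade = r.age / 10 }   # forced: every row again
```

The second pass changes nothing because all rows are flagged as processed, while the forced pass adds the synthetic `decade` field to both rows.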