Class: DataHut::DataWarehouse
- Inherits: Object
- Defined in: lib/data_hut/data_warehouse.rb
Overview
The DataHut::DataWarehouse manages the heavy lifting of creating a data system for your analytics. During the extract and transform phases you don’t have to worry about the schema or the data types you’ll be using: just start scraping and playing with the data extraction, and DataHut will introspect your final data records and create or alter the DataHut schema for you, auto-magically.
Class Method Summary
-
.connect(name) ⇒ DataHut::DataWarehouse
creates or opens an existing connection to a DataHut data store.
Instance Method Summary
-
#dataset ⇒ Sequel::Model
access the DataHut dataset.
-
#extract(data) {|record, element| ... } ⇒ void
used to extract data from whatever source you wish.
-
#fetch_meta(key) ⇒ Object
retrieves any Ruby object stored as metadata.
-
#logger=(logger) ⇒ void
attach a Logger to the underlying Sequel database so that you can debug or monitor database actions.
-
#not_unique(hash) ⇒ Boolean
used to determine if the specified fields and values are unique in the datahut.
-
#store_meta(key, value) ⇒ void
stores any Ruby object as metadata in the datahut.
-
#transform(forced = false) {|record| ... } ⇒ void
used to transform data already extracted into a DataHut.
-
#transform_complete ⇒ void
marks all the records in the DataHut as ‘processed’.
Class Method Details
.connect(name) ⇒ DataHut::DataWarehouse
creates or opens an existing connection to a DataHut data store.
# File 'lib/data_hut/data_warehouse.rb', line 54

def self.connect(name)
  new(name)
end
Instance Method Details
#dataset ⇒ Sequel::Model
access the DataHut dataset. See Sequel::Dataset for available operations on the dataset.
# File 'lib/data_hut/data_warehouse.rb', line 62

def dataset
  Class.new(Sequel::Model(@db[:data_warehouse]))
end
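Because #dataset returns an anonymous Sequel::Model subclass, the usual Sequel query interface applies. A hypothetical query (gem assumed installed; the store name and field names are illustrative):

```ruby
require 'data_hut'

dw = DataHut::DataWarehouse.connect("players_example")
dw.extract([{name: "ichi", score: 870}, {name: "ni", score: 2301}]) do |r, d|
  r.name  = d[:name]
  r.score = d[:score]
end

# query via the Sequel::Model returned by #dataset
top = dw.dataset.order(Sequel.desc(:score)).first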
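```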
#extract(data) {|record, element| ... } ⇒ void
Duplicate records (all fields and values must match) are automatically skipped at the end of an extract iteration. You may also skip duplicate extracts early in the iteration by using #not_unique.
Fields with nil values in records are skipped because the underlying database defaults these to nil already. However you must have at least one non-nil value in order for the field to be automatically created, otherwise subsequent transform layers may report errors on trying to access the field.
This method returns an undefined value.
used to extract data from whatever source you wish. As long as the data forms an enumerable collection, you can pass it to extract along with a block that specifies how you wish the DataHut record to be mapped from the source element of the collection.
# File 'lib/data_hut/data_warehouse.rb', line 91

def extract(data)
  raise(ArgumentError, "a block is required for extract.", caller) unless block_given?
  data.each do |d|
    r = OpenStruct.new
    yield r, d
    store(r)
  end
end
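The heart of extract is "fresh OpenStruct per source element, yield both, persist". The mapping your block performs can be sketched in plain Ruby, without the gem (the source data and field names here are invented for illustration):

```ruby
require 'ostruct'

source  = [["arthur", 42], ["ford", 7]]
records = []

source.each do |element|
  r = OpenStruct.new
  # this is the role of your extract block: map element -> record fields
  r.name  = element[0]
  r.score = element[1]
  records << r
end
```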
#fetch_meta(key) ⇒ Object
Because the datastore can support any Ruby object (including custom ones) it is up to the caller to make sure that custom classes are in context before storage and fetch. i.e. if you store a custom object and then fetch it in a context that doesn’t have that class loaded, you’ll get an error.
For this reason it is safest to use standard Ruby types (e.g. Array, Hash, etc.) that will always be present.
retrieves any Ruby object stored as metadata.
# File 'lib/data_hut/data_warehouse.rb', line 205

def fetch_meta(key)
  key = key.to_s if key.instance_of?(Symbol)
  begin
    r = @db[:data_warehouse_meta].where(key: key).first
    value = r[:value] unless r.nil?
    value = Marshal.load(value) unless value.nil?
  rescue Exception => e
    raise(RuntimeError, "DataHut: unable to fetch metadata key #{key}: #{e.message}", caller)
  end
  value
end
#logger=(logger) ⇒ void
This method returns an undefined value.
attach a Logger to the underlying Sequel database so that you can debug or monitor database actions. See Sequel::Database#logger=.
# File 'lib/data_hut/data_warehouse.rb', line 169

def logger=(logger)
  raise(ArgumentError, "logger must be a type of Logger.") unless logger.kind_of?(Logger)
  @db.logger = logger
end
#not_unique(hash) ⇒ Boolean
exactly duplicate records are automatically skipped at the end of an extract iteration (see #extract). This method is useful if an extract iteration takes a long time and you want to skip duplicates early in the iteration.
used to determine if the specified fields and values are unique in the datahut.
# File 'lib/data_hut/data_warehouse.rb', line 231

def not_unique(hash)
  @db[:data_warehouse].where(hash).count > 0 rescue false
end
#store_meta(key, value) ⇒ void
Because the datastore can support any Ruby object (including custom ones) it is up to the caller to make sure that custom classes are in context before storage and fetch. i.e. if you store a custom object and then fetch it in a context that doesn’t have that class loaded, you’ll get an error.
For this reason it is safest to use standard Ruby types (e.g. Array, Hash, etc.) that will always be present.
This method returns an undefined value.
stores any Ruby object as metadata in the datahut.
# File 'lib/data_hut/data_warehouse.rb', line 183

def store_meta(key, value)
  key = key.to_s if key.instance_of?(Symbol)
  begin
    value = Sequel::SQL::Blob.new(Marshal.dump(value))
    if (@db[:data_warehouse_meta].where(key: key).count > 0)
      @db[:data_warehouse_meta].where(key: key).update(value: value)
    else
      @db[:data_warehouse_meta].insert(key: key, value: value)
    end
  rescue Exception => e
    raise(ArgumentError, "DataHut: unable to store metadata value #{value.inspect}: #{e.message}", caller)
  end
end
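As the source shows, the metadata round-trip is plain Marshal under the hood, which is why the class of a stored object must be loaded before fetching it. A gem-free sketch of the same round-trip with standard types (the sample hash is invented):

```ruby
# store_meta serializes the value with Marshal.dump into a blob...
blob = Marshal.dump({last_run: "2013-01-01", pages: [1, 2, 3]})

# ...and fetch_meta restores it with Marshal.load. Standard types
# (Hash, Array, String) always restore cleanly in any context.
meta = Marshal.load(blob)
```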
#transform(forced = false) {|record| ... } ⇒ void
This method returns an undefined value.
used to transform data already extracted into a DataHut. You can also use transform to create new synthetic data fields from existing fields. You may create as many transform blocks (i.e. ‘passes’) as you like.
# File 'lib/data_hut/data_warehouse.rb', line 120

def transform(forced=false)
  raise(ArgumentError, "a block is required for transform.", caller) unless block_given?
  # now process all the records with the updated schema...
  @db[:data_warehouse].each do |h|
    # check for processed if not forced
    unless forced
      next if h[:dw_processed] == true
    end
    # then get rid of the internal id and processed flags
    dw_id = h.delete(:dw_id)
    h.delete(:dw_processed)
    # copy record fields to an openstruct
    r = OpenStruct.new(h)
    # and let the transformer modify it...
    yield r
    # get the update hash from the openstruct
    h = ostruct_to_hash(r)
    # now add any new transformation fields to the schema...
    adapt_schema(h)
    # and use it to update the record
    @db[:data_warehouse].where(dw_id: dw_id).update(h)
  end
end
#transform_complete ⇒ void
This method returns an undefined value.
marks all the records in the DataHut as ‘processed’. Useful as the last command in a sequence of extract and transform passes.
# File 'lib/data_hut/data_warehouse.rb', line 156

def transform_complete
  @db[:data_warehouse].update(:dw_processed => true)
end