Class: DataHut::DataWarehouse
- Inherits: Object
- Defined in: lib/data_hut/data_warehouse.rb
Overview
DataHut::DataWarehouse manages the heavy lifting of creating a data store for your analytics. During the extract and transform phases you don't have to worry about the schema or the data types you'll be using: just start scraping and playing with the data extraction, and DataHut will introspect your final data records and create or alter the DataHut schema for you, auto-magically.
Class Method Summary
-
.connect(name) ⇒ DataHut::DataWarehouse
creates or opens an existing connection to a DataHut data store.
Instance Method Summary
-
#dataset ⇒ Sequel::Model
access the DataHut dataset.
-
#extract(data) {|record, element| ... } ⇒ void
used to extract data from whatever source you wish.
-
#fetch_meta(key) ⇒ Object
retrieves any Ruby object stored as metadata.
-
#logger=(logger) ⇒ void
attach a Logger to the underlying Sequel database so that you can debug or monitor database actions.
-
#not_unique(hash) ⇒ Boolean
used to determine if the specified fields and values are unique in the datahut.
-
#store_meta(key, value) ⇒ void
stores any Ruby object as metadata in the datahut.
-
#transform(forced = false) {|record| ... } ⇒ void
used to transform data already extracted into a DataHut.
-
#transform_complete ⇒ void
marks all the records in the DataHut as 'processed'.
Class Method Details
.connect(name) ⇒ DataHut::DataWarehouse
creates or opens an existing connection to a DataHut data store.
# File 'lib/data_hut/data_warehouse.rb', line 56

def self.connect(name)
  new(name)
end
Instance Method Details
#dataset ⇒ Sequel::Model
Note: the resulting Sequel::Model additionally supports a #to_json method for JSON export of the dataset results.
access the DataHut dataset. See Sequel::Dataset for available operations on the dataset.
# File 'lib/data_hut/data_warehouse.rb', line 65

def dataset
  klass = Class.new(Sequel::Model(@db[:data_warehouse]))
  klass.class_eval do
    def to_json(*a)
      values.to_json(*a)
    end
  end
  klass
end
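The #to_json trick above can be illustrated without Sequel at all. The sketch below uses a hypothetical FakeRow class as a stand-in for a Sequel::Model row (whose #values also returns the record's column/value Hash), and applies the same anonymous-subclass pattern:

```ruby
require 'json'

# A plain-Ruby stand-in for a Sequel::Model row: #values returns the
# record's column/value Hash, as Sequel::Model#values does.
class FakeRow
  attr_reader :values
  def initialize(values)
    @values = values
  end
end

# Reproduce the #dataset trick: build an anonymous subclass and give it
# a #to_json that serializes the row's values hash.
klass = Class.new(FakeRow)
klass.class_eval do
  def to_json(*a)
    values.to_json(*a)
  end
end

record = klass.new("name" => "Yoda", "power" => 9.5)
puts record.to_json  # {"name":"Yoda","power":9.5}
```

Because the subclass is anonymous, the JSON behavior is added without monkey-patching Sequel::Model itself.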
#extract(data) {|record, element| ... } ⇒ void
Note: duplicate records (all fields and values must match) are automatically skipped at the end of an extract iteration. You may also skip duplicate extracts early in the iteration by using #not_unique.
Note: fields with nil values in records are skipped, because the underlying database already defaults those columns to nil. However, at least one record must have a non-nil value in a field for that field to be created automatically; otherwise subsequent transform passes may raise errors when trying to access it.
This method returns an undefined value.
used to extract data from whatever source you wish. As long as the data forms an enumerable collection, you can pass it to extract along with a block that specifies how you wish the DataHut record to be mapped from the source element of the collection.
# File 'lib/data_hut/data_warehouse.rb', line 100

def extract(data)
  raise(ArgumentError, "a block is required for extract.", caller) unless block_given?
  data.each do |d|
    r = OpenStruct.new
    yield r, d
    store(r)
  end
end
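The mapping contract can be sketched in plain Ruby. The hypothetical extract_sketch below is not part of DataHut; it mirrors the yield shape (a fresh OpenStruct record plus the source element), collecting records instead of storing them:

```ruby
require 'ostruct'

# Sketch of the extract mapping contract: each source element d is
# yielded together with a fresh OpenStruct record r, and the block
# copies whatever fields it wants onto r. (DataHut then stores r;
# here we just collect the records to show the shape of the contract.)
def extract_sketch(data)
  raise(ArgumentError, "a block is required for extract.") unless block_given?
  records = []
  data.each do |d|
    r = OpenStruct.new
    yield r, d
    records << r
  end
  records
end

source = [{ "Name" => "Luke", "Level" => "5" }, { "Name" => "Leia", "Level" => "6" }]
records = extract_sketch(source) do |record, element|
  record.name  = element["Name"]
  record.level = element["Level"].to_i
end
puts records.first.name   # Luke
```

With a real warehouse the same block is simply passed to dh.extract(source), and DataHut infers column types (here a string and an integer) from the assigned values.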
#fetch_meta(key) ⇒ Object
Note: because the datastore can support any Ruby object (including custom ones), it is up to the caller to make sure that custom classes are in context before storage and fetch; i.e. if you store a custom object and then fetch it in a context that doesn't have that class loaded, you'll get an error. For this reason it is safest to use standard Ruby types (e.g. Array, Hash, etc.) that will always be present.
retrieves any Ruby object stored as metadata.
# File 'lib/data_hut/data_warehouse.rb', line 214

def fetch_meta(key)
  key = key.to_s if key.instance_of?(Symbol)
  begin
    r = @db[:data_warehouse_meta].where(key: key).first
    value = r[:value] unless r.nil?
    value = Marshal.load(value) unless value.nil?
  rescue Exception => e
    raise(RuntimeError, "DataHut: unable to fetch metadata key #{key}: #{e.message}", caller)
  end
  value
end
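The reason the note above warns about custom classes is visible in the Marshal round-trip itself: fetch_meta revives values with Marshal.load, so any class referenced by a stored value must be defined at fetch time. A minimal round-trip with stdlib types (the real code additionally wraps the dump in a Sequel::SQL::Blob before writing it to the meta table):

```ruby
# What store_meta persists and fetch_meta revives, reduced to the
# Marshal round-trip at its core:
meta = { "updated_at" => Time.at(0).utc, "tags" => ["jedi", "sith"] }
blob    = Marshal.dump(meta)   # stored as a blob in data_warehouse_meta
revived = Marshal.load(blob)   # returned by fetch_meta
puts revived["tags"].inspect   # ["jedi", "sith"]
```

Hashes, Arrays, Strings, and Times round-trip safely because their classes are always loaded; a marshaled instance of your own class would raise on load in a process that never defined that class.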
#logger=(logger) ⇒ void
This method returns an undefined value.
attach a Logger to the underlying Sequel database so that you can debug or monitor database actions. See Sequel::Database#logger=.
# File 'lib/data_hut/data_warehouse.rb', line 178

def logger=(logger)
  raise(ArgumentError, "logger must be a type of Logger.") unless logger.kind_of?(Logger)
  @db.logger = logger
end
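A short sketch of the guard and its use, with a standard-library Logger. The assign_logger method here is a hypothetical stand-in for the setter, so the example runs without a warehouse:

```ruby
require 'logger'

# Stand-in for the kind_of?(Logger) guard in #logger= above: only a
# Logger (or subclass) is accepted.
def assign_logger(logger)
  raise(ArgumentError, "logger must be a type of Logger.") unless logger.kind_of?(Logger)
  logger
end

log = Logger.new($stdout)
log.level = Logger::WARN     # quiet unless something goes wrong
assign_logger(log)           # accepted
# assign_logger("debug")     # => ArgumentError
```

With a connected warehouse this would simply read dh.logger = log, after which Sequel logs each SQL statement it executes.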
#not_unique(hash) ⇒ Boolean
Note: exact duplicate records are automatically skipped at the end of an extract iteration (see #extract). This method is useful when an extract iteration takes a long time and you want to skip duplicates early in the iteration.
used to determine if the specified fields and values are unique in the datahut.
# File 'lib/data_hut/data_warehouse.rb', line 240

def not_unique(hash)
  @db[:data_warehouse].where(hash).count > 0 rescue false
end
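The check amounts to "does any stored record already match these field/value pairs?". The hypothetical not_unique_sketch below mimics the where(hash).count > 0 query with an Array of Hashes standing in for the data_warehouse table:

```ruby
# Stand-in for #not_unique: the real method runs
# @db[:data_warehouse].where(hash).count > 0 against the table; here an
# Array of Hashes plays the role of the table.
def not_unique_sketch(rows, hash)
  rows.any? { |row| hash.all? { |k, v| row[k] == v } }
end

rows = [{ name: "Luke", level: 5 }, { name: "Leia", level: 6 }]
puts not_unique_sketch(rows, name: "Luke")   # true
puts not_unique_sketch(rows, name: "Han")    # false
```

Inside an extract block the real call lets you bail out before any expensive per-record work, e.g. next if dh.not_unique(name: element["Name"]).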
#store_meta(key, value) ⇒ void
Note: because the datastore can support any Ruby object (including custom ones), it is up to the caller to make sure that custom classes are in context before storage and fetch; i.e. if you store a custom object and then fetch it in a context that doesn't have that class loaded, you'll get an error. For this reason it is safest to use standard Ruby types (e.g. Array, Hash, etc.) that will always be present.
This method returns an undefined value.
stores any Ruby object as metadata in the datahut.
# File 'lib/data_hut/data_warehouse.rb', line 192

def store_meta(key, value)
  key = key.to_s if key.instance_of?(Symbol)
  begin
    value = Sequel::SQL::Blob.new(Marshal.dump(value))
    if (@db[:data_warehouse_meta].where(key: key).count > 0)
      @db[:data_warehouse_meta].where(key: key).update(value: value)
    else
      @db[:data_warehouse_meta].insert(key: key, value: value)
    end
  rescue Exception => e
    raise(ArgumentError, "DataHut: unable to store metadata value #{value.inspect}: #{e.message}", caller)
  end
end
#transform(forced = false) {|record| ... } ⇒ void
This method returns an undefined value.
used to transform data already extracted into a DataHut. You can also use transform to create new synthetic data fields from existing fields. You may create as many transform blocks (i.e. 'passes') as you like.
# File 'lib/data_hut/data_warehouse.rb', line 129

def transform(forced=false)
  raise(ArgumentError, "a block is required for transform.", caller) unless block_given?
  # now process all the records with the updated schema...
  @db[:data_warehouse].each do |h|
    # check for processed if not forced
    unless forced
      next if h[:dw_processed] == TRUE_VALUE
    end
    # then get rid of the internal id and processed flags
    dw_id = h.delete(:dw_id)
    h.delete(:dw_processed)
    # copy record fields to an openstruct
    r = OpenStruct.new(h)
    # and let the transformer modify it...
    yield r
    # get the update hash from the openstruct
    h = ostruct_to_hash(r)
    # now add any new transformation fields to the schema...
    adapt_schema(h)
    # and use it to update the record
    @db[:data_warehouse].where(dw_id: dw_id).update(h)
  end
end
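One transform pass can be shown in miniature without a database: a stored row (Hash) is copied into an OpenStruct, the block derives a synthetic field, and the struct is converted back to a Hash for the update. ostruct_to_hash is internal to DataHut; the stdlib OpenStruct#to_h serves the same purpose in this sketch:

```ruby
require 'ostruct'

# A stored row, as #transform would read it (dw_id and dw_processed
# already stripped):
row = { name: "Luke", level: 5 }

r = OpenStruct.new(row)
r.elite = r.level > 5        # the transform block's work: a synthetic field
updated = r.to_h             # the hash used for the row update

puts updated[:elite].inspect # false
```

With a real warehouse the same block would be dh.transform { |r| r.elite = r.level > 5 }, and the new :elite field would be added to the schema automatically on the first pass.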
#transform_complete ⇒ void
This method returns an undefined value.
marks all the records in the DataHut as 'processed'. Useful as the last command in a sequence of extract and transform passes.
# File 'lib/data_hut/data_warehouse.rb', line 165

def transform_complete
  @db[:data_warehouse].update(:dw_processed => true)
end