Class: DataHut::DataWarehouse

Inherits:
Object
Defined in:
lib/data_hut/data_warehouse.rb

Overview

DataHut::DataWarehouse does the heavy lifting of creating a data system for your analytics. During the extract and transform phases you don’t have to worry about the schema or the data types you’ll be using: just start scraping and experimenting with the data extraction. DataHut introspects your final data records and creates or alters the DataHut schema for you automatically.

Examples:

require 'data_hut'
require 'pry'   # not necessary, but very useful

dh = DataHut.connect("scratch")
data = [{name: "barney", age: 27, login: DateTime.parse('2008-05-03') },
        {name: "phil", age: 31},
        {name: "fred", age: 44, login: DateTime.parse('2013-02-07')}]

# extract your data by iterating over your data format (from whatever source) and map it to a record model...
dh.extract(data) do |r, d|
  r.name = d[:name]
  r.age = d[:age]
  # you can do anything you need to within the extract block to ensure data quality if you want:
  d[:login] = DateTime.new unless d.has_key?(:login)
  r.last_active = d[:login]
  print 'v'
end

# transform your data by adding fields to it
dh.transform do |r|
  r.eligible = r.age < 30
  print '*'
end

# mark all the records as processed to avoid re-transforming them.
dh.transform_complete
ds = dh.dataset
binding.pry   # play with ds.
[1] pry(main)> ds.avg(:age)
=> 34.0
[2] pry(main)> ineligible = ds.where(eligible: false)
[3] pry(main)> ineligible.avg(:age)
=> 37.5 

Class Method Details

.connect(name) ⇒ DataHut::DataWarehouse

creates a new DataHut data store, or opens a connection to an existing one.

Parameters:

  • name (String)

    name of the DataHut. This will also be the name of the sqlite3 file written to the current working directory (e.g. ‘./<name>.db’)

Returns:

  • (DataHut::DataWarehouse)


# File 'lib/data_hut/data_warehouse.rb', line 54

def self.connect(name)
  new(name)
end

Instance Method Details

#dataset ⇒ Sequel::Model

access the DataHut dataset. See Sequel::Dataset for available operations on the dataset.

Returns:

  • (Sequel::Model)

    an anonymous Sequel::Model class bound to the data warehouse table. Use this handle to query and analyze the DataHut.



# File 'lib/data_hut/data_warehouse.rb', line 62

def dataset
  Class.new(Sequel::Model(@db[:data_warehouse]))
end

#extract(data) {|record, element| ... } ⇒ Object

used to extract data from whatever source you wish. As long as the data forms an enumerable collection, you can pass it to extract along with a block that specifies how you wish the DataHut record to be mapped from each source element of the collection.

Examples:

Extracting fields from a hash and assigning them to fields on a record

data = [{name: "barney", age: 27, login: DateTime.parse('2008-05-03') }]
dh.extract(data) do |r, d|
  r.name = d[:name]
  r.age  = d[:age]
end

Parameters:

  • data (Enumerable)

Yields:

  • (record, element)

    lets you control the mapping of data elements to record fields

Yield Parameters:

  • record

    an OpenStruct that allows you to create fields dynamically on the record as needed. These fields will automatically be added to the schema behind the DataHut using the ruby data type you assigned to the record. See Sequel Schema Modification Methods for more information about supported ruby data types you can use.

  • element

    an element from your data.

Raises:

  • (ArgumentError)

    if you don’t provide a block
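
How the record is built can be sketched without a database. The following is a minimal stand-in, not the library's implementation: `extract_sketch` is a hypothetical name, and it collects the records in an array where DataHut would store them. It shows that fields assigned in the block appear on the record, and that each field carries the Ruby class of the value assigned to it.

```ruby
require 'ostruct'

# Hypothetical stand-in for DataWarehouse#extract: it collects the records
# in an array instead of storing them in a database.
def extract_sketch(data)
  raise ArgumentError, 'a block is required for extract.' unless block_given?

  data.map do |element|
    record = OpenStruct.new
    yield record, element
    record.to_h
  end
end

rows = extract_sketch([{ name: 'barney', age: 27 }]) do |r, d|
  r.name = d[:name]
  r.age  = d[:age]
end

# Fields created in the block, with the Ruby class of each assigned value.
field_types = rows.first.transform_values(&:class)
```

Here `field_types` maps `:name` to `String` and `:age` to `Integer`, which is the kind of information a schema can be derived from.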



# File 'lib/data_hut/data_warehouse.rb', line 85

def extract(data)
  raise(ArgumentError, "a block is required for extract.", caller) unless block_given?

  data.each do |d|
    r = OpenStruct.new
    yield r, d
    store(r)
  end
end

#fetch_meta(key) ⇒ Object

retrieves previously stored metadata by key

Parameters:

  • key (Symbol)

    to lookup the metadata by

Returns:

  • (Object)

    ruby object that was fetched



# File 'lib/data_hut/data_warehouse.rb', line 188

def fetch_meta(key)
  key = key.to_s if key.instance_of?(Symbol)
  begin
    r = @db[:data_warehouse_meta].where(key: key).first
    value = r[:value] unless r.nil?
    value = Marshal.load(value) unless value.nil?
  rescue Exception => e
    raise(ArgumentError, "DataHut: unable to fetch metadata key #{key}.", caller)
  end
  value
end

#logger=(logger) ⇒ Object

attach a Logger to the underlying Sequel database so that you can debug or monitor database actions. See Sequel::Database#logger=.

Examples:

dh.logger = Logger.new(STDOUT)

Parameters:

  • logger (Logger)

    a logger for the underlying Sequel actions.

Raises:

  • (ArgumentError)

    if passed a logger that is not a kind of Logger.
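
The type check described above relies on `kind_of?`, so subclasses of Logger are accepted while other objects are rejected. A small sketch of that check in isolation, where `QuietLogger` and `acceptable_logger?` are hypothetical names used only for illustration:

```ruby
require 'logger'
require 'stringio'

# Hypothetical subclass, used only to show that kind_of? accepts subclasses.
class QuietLogger < Logger; end

# Mirrors the condition used by logger=, without touching a database.
def acceptable_logger?(candidate)
  candidate.kind_of?(Logger)
end

io = StringIO.new
plain    = acceptable_logger?(Logger.new(io))       # a Logger instance
subclass = acceptable_logger?(QuietLogger.new(io))  # a subclass instance
other    = acceptable_logger?(io)                   # writable, but not a Logger
```

`plain` and `subclass` are true; `other` is false, which is the case where `logger=` raises ArgumentError.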



# File 'lib/data_hut/data_warehouse.rb', line 159

def logger=(logger)
  raise(ArgumentError, "logger must be a type of Logger.") unless logger.kind_of?(Logger)
  @db.logger = logger
end

#store_meta(key, value) ⇒ Object

stores metadata

Parameters:

  • key (Symbol)

    to lookup the metadata by

  • value (Object)

    ruby object to store
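
The store and fetch pair can be sketched with a Hash in place of the metadata table. This is an illustration under assumptions: `META`, `store_meta_sketch` and `fetch_meta_sketch` are hypothetical names, and the sketch converts every key with `to_s`. It shows the two ideas involved: keys are looked up as strings, and values are serialized with Marshal on the way in and restored on the way out.

```ruby
# Hypothetical in-memory stand-in for the metadata table: a Hash keyed by String.
META = {}

def store_meta_sketch(key, value)
  META[key.to_s] = Marshal.dump(value)
end

def fetch_meta_sketch(key)
  blob = META[key.to_s]
  blob.nil? ? nil : Marshal.load(blob)
end

store_meta_sketch(:last_run, { count: 3, tags: %w[a b] })
last_run = fetch_meta_sketch(:last_run)   # equal to the stored Hash
missing  = fetch_meta_sketch(:unknown)    # nil, the key was never stored
```

Because values pass through Marshal, only objects that Marshal can serialize are suitable as metadata, and stored blobs should come from a trusted source.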



# File 'lib/data_hut/data_warehouse.rb', line 170

def store_meta(key, value)
  key = key.to_s if key.instance_of?(Symbol)
  begin 
    value = Sequel::SQL::Blob.new(Marshal.dump(value))
    if (@db[:data_warehouse_meta].where(key: key).count > 0)
      @db[:data_warehouse_meta].where(key: key).update(value: value)
    else
      @db[:data_warehouse_meta].insert(key: key, value: value)
    end
  rescue Exception => e
    raise(ArgumentError, "DataHut: unable to store metadata value #{value.inspect}.", caller)
  end
end

#transform(forced = false) {|record| ... } ⇒ Object

used to transform data already extracted into a DataHut. You can also use transform to create new synthetic data fields from existing fields. You may create as many transform blocks (i.e. ‘passes’) as you like.

Examples:

Defining ‘eligibility’ based on arbitrary age criteria.

dh.transform do |r|
  r.eligible = r.age < 30      # using an extracted field to create a synthetic boolean field
end

Parameters:

  • forced (defaults to: false)

    if set to ‘true’, this transform will iterate over records already marked processed. This can be useful for layers of transforms that deal with analytics where the analytical model may need to rapidly change as you explore the data. See the second transform in README.md (section ‘A More Ambitious Example’).

Yields:

  • (record)

    lets you modify the DataHut record

Yield Parameters:

  • record

    an OpenStruct that fronts the DataHut record. You may access existing fields on this record or create new fields to store synthetic data from a transform pass. These fields will automatically be added to the schema behind the DataHut using the ruby data type you assigned to the record. See Sequel Schema Modification Methods for more information about supported ruby data types you can use.

Raises:

  • (ArgumentError)

    if you don’t provide a block
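
The effect of the forced parameter can be sketched with an in-memory table. This is not the library's code: `ROWS` and `transform_sketch` are hypothetical, and the rows are plain hashes where DataHut uses a SQLite table. The sketch shows that a default pass skips records already marked processed, while a forced pass visits every record.

```ruby
require 'ostruct'

# Hypothetical in-memory table; the real DataHut keeps these rows in SQLite.
ROWS = [
  { dw_id: 1, dw_processed: true,  age: 27 },
  { dw_id: 2, dw_processed: false, age: 44 }
]

def transform_sketch(forced = false)
  raise ArgumentError, 'a block is required for transform.' unless block_given?

  ROWS.each do |row|
    next if row[:dw_processed] && !forced

    # Hide the internal bookkeeping fields from the block.
    fields = row.reject { |k, _| %i[dw_id dw_processed].include?(k) }
    record = OpenStruct.new(fields)
    yield record
    row.merge!(record.to_h)
  end
end

transform_sketch { |r| r.eligible = r.age < 30 }
after_default = ROWS.map { |row| row[:eligible] }   # record 1 was skipped

transform_sketch(true) { |r| r.eligible = r.age < 30 }
after_forced = ROWS.map { |row| row[:eligible] }    # every record was visited
```

After the default pass only the unprocessed record has an `eligible` field; after the forced pass both records do.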



# File 'lib/data_hut/data_warehouse.rb', line 113

def transform(forced=false)
  raise(ArgumentError, "a block is required for transform.", caller) unless block_given?

  # now process all the records with the updated schema...
  @db[:data_warehouse].each do |h|
    # check for processed if not forced
    unless forced
      next if h[:dw_processed] == true
    end
    # then get rid of the internal id and processed flags
    dw_id = h.delete(:dw_id)
    h.delete(:dw_processed)
    # copy record fields to an openstruct
    r = OpenStruct.new(h)
    # and let the transformer modify it...
    yield r
    # now add any new transformation fields to the schema...
    adapt_schema(r)
    # get the update hash from the openstruct
    h = r.marshal_dump
    # and use it to update the record
    @db[:data_warehouse].where(dw_id: dw_id).update(h)
  end
end

#transform_complete ⇒ Object

marks all the records in the DataHut as ‘processed’. Useful as the last command in a sequence of extract and transform passes.

Examples:

a simple log analysis system (pseudocode)

rake update
   extract apache logs  (only adds new logs since last update)
   transform logs into types of response: error, ok, met_SLA (service level agreement), etc.  (only transforms unprocessed (new) logs)
   transform_complete (marks the update complete)
   dh.dataset is used to visualize graphs with d3.js
end
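
The update cycle in the pseudocode above can be sketched with plain hashes. Everything here is an assumption made for illustration: `update_pass`, the `status` and `response` fields, and the rule that a status of 500 or above counts as an error. The point is that marking records processed at the end of a pass keeps the next pass from redoing earlier work.

```ruby
# Hypothetical in-memory rows standing in for the DataHut table.
rows = [
  { status: 500, dw_processed: false },
  { status: 200, dw_processed: false }
]

# One update pass: transform only unprocessed rows, then mark all rows processed.
# Returns the number of rows that were transformed in this pass.
def update_pass(rows)
  pending = rows.reject { |row| row[:dw_processed] }
  pending.each { |row| row[:response] = row[:status] >= 500 ? :error : :ok }
  rows.each { |row| row[:dw_processed] = true }
  pending.size
end

first = update_pass(rows)                       # both rows are new
rows << { status: 503, dw_processed: false }    # a later log entry arrives
second = update_pass(rows)                      # only the new row is transformed
```

The first pass transforms two rows and the second transforms one.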


# File 'lib/data_hut/data_warehouse.rb', line 147

def transform_complete
  @db[:data_warehouse].update(:dw_processed => true)
end