Datacraft
Play with data like a Pro, and have fun like Minecraft.
what is it
Datacraft is an ETL tool heavily inspired by the open source project kiba. We needed some advanced features that kiba does not currently provide, so we decided to roll our own version. The major ones already implemented are:
- build action after loading data (explained later)
- parallel processing
- lazy initialization
And we need more. Here are some ideas that might become new features:
- data flow, pipelining
- dry run with analytic results
- snapshot the data at change points and generate a report for intuitive debugging
Installation
Add this line to your application's Gemfile:
gem 'datacraft'
And then execute:
$ bundle
Or install it yourself as:
$ gem install datacraft
Usage
Here is an example of an instruction script.
require 'csv'

# Data source class.
class CsvFile
  def initialize(csv_file)
    @csv = CSV.open(csv_file, headers: true, header_converters: :symbol)
  end

  # Mandatory method for a data source class.
  # It should always yield data rows.
  def each
    @csv.each do |row|
      yield(row.to_hash)
    end
    @csv.close
  end
end
# Data consumer class. Here we just output statistics
# about the data set.
class ReportBuilder
  def initialize
    @total_employee = 0
    @total_age = 0
  end

  # Mandatory method for consuming data.
  def <<(row)
    @total_employee += 1
    @total_age += row[:age].to_i
  end

  # Optional method for building the final product.
  def build
    File.open('report.txt', 'w') do |f|
      f.puts "Total Employee: #{@total_employee}"
      f.puts "Total Age: #{@total_age}"
      f.puts "Average Age: #{@total_age / @total_employee}" unless @total_employee == 0
    end
  end
end
retirement_age = 60

# Define the data source.
from CsvFile, 'employees.csv'

total_rows = 0

# Tweak each row.
tweak do |row|
  # Drop the retired employees. Returning nil
  # discards the row.
  total_rows += 1
  row[:age].to_i < retirement_age ? row : nil
end

# Define the data consumer.
to ReportBuilder
Run the script:
$ dcraft build inst.rb
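To make the contract in the script above concrete, here is a minimal sketch of how a runner could wire the pieces together: a source that responds to each, tweak blocks that return a row or nil, and a consumer that responds to << and optionally build. This is an illustration under assumptions, not Datacraft's actual implementation. TinyPipeline, ArraySource and Collector are hypothetical names invented for the example. Note that from and to only record the class and its arguments; the objects are created when run is called, which is one way lazy initialization could work.

```ruby
# Hypothetical stand-in for the runner; not Datacraft's actual code.
class TinyPipeline
  def initialize
    @tweaks = []
  end

  # Record the class and its arguments; nothing is instantiated yet.
  def from(klass, *args)
    @source = [klass, args]
  end

  def tweak(&block)
    @tweaks << block
  end

  def to(klass, *args)
    @consumer = [klass, args]
  end

  def run
    # Source and consumer are created only now (lazy initialization).
    source = @source[0].new(*@source[1])
    consumer = @consumer[0].new(*@consumer[1])

    source.each do |row|
      # Apply tweaks in order; once a tweak returns nil the row is dropped.
      row = @tweaks.reduce(row) { |r, t| r.nil? ? nil : t.call(r) }
      consumer << row unless row.nil?
    end

    # build is optional on the consumer.
    consumer.build if consumer.respond_to?(:build)
    consumer
  end
end

# In-memory source and consumer used only for this demonstration.
class ArraySource
  def initialize(rows)
    @rows = rows
  end

  def each
    @rows.each { |row| yield(row) }
  end
end

class Collector
  attr_reader :rows, :built

  def initialize
    @rows = []
    @built = false
  end

  def <<(row)
    @rows << row
  end

  def build
    @built = true
  end
end

pipeline = TinyPipeline.new
pipeline.from ArraySource, [{ name: 'Ann', age: '35' }, { name: 'Bob', age: '64' }]
pipeline.tweak { |row| row[:age].to_i < 60 ? row : nil }
pipeline.to Collector
result = pipeline.run
puts result.rows.inspect
```

Only Ann's row reaches the consumer, because the tweak returns nil for anyone aged 60 or over.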
parallel processing
To improve the performance of the script, you can enable multithreading by setting the following options.
set :parallel, true # default is false
set :n_threads, 4 # default 8
Please note that, because of the GIL (global interpreter lock) in MRI Ruby, parallel processing cannot take advantage of multicore systems for CPU-bound work. But when the threads are blocked by heavy I/O, which is very common in ETL scripts, it can give a considerable performance boost.
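The effect of threads on I/O-bound work can be seen with plain Ruby threads. The sketch below is independent of Datacraft: process_in_threads is a hypothetical helper, and sleep stands in for a slow network or disk call. With 8 rows that each wait 0.1 seconds and 4 worker threads, the elapsed time should land well under the 0.8 seconds a sequential loop would need, since a sleeping thread lets the others run.

```ruby
# Hypothetical helper: run the block over rows using n_threads worker threads.
def process_in_threads(rows, n_threads, &block)
  queue = Queue.new
  rows.each { |row| queue << row }
  queue.close # once drained, pop returns nil instead of blocking

  results = Queue.new
  workers = Array.new(n_threads) do
    Thread.new do
      while (row = queue.pop)
        results << block.call(row)
      end
    end
  end
  workers.each(&:join)

  # Results arrive in completion order, not input order.
  Array.new(results.size) { results.pop }
end

# Simulated I/O: each row waits 0.1s, as a slow remote call might.
rows = (1..8).map { |i| { id: i } }
started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
ids = process_in_threads(rows, 4) do |row|
  sleep 0.1
  row[:id]
end
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts "processed #{ids.size} rows in #{elapsed.round(2)}s"
```

If the block did pure computation instead of waiting, the threads would take turns on a single core and the speedup would largely disappear.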
benchmark
Sometimes you just want to know: why is my script so slow? To spot the bottleneck of your script efficiently, you can enable benchmark mode.
set :benchmark, true
Then run your script, and you will see the detailed benchmark information.
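The format of Datacraft's benchmark output is not shown here. For ad hoc timing of a single step inside your own source or consumer class, Ruby's standard Benchmark module is an option. The sketch below times two hypothetical stages, one standing in for loading rows and one for a tweak; the stage names and the generated data are assumptions made for the example.

```ruby
require 'benchmark'

timings = {}

# Stand-in for reading rows from a source.
rows = nil
timings[:load] = Benchmark.realtime do
  rows = (1..10_000).map { |i| { age: (i % 70).to_s } }
end

# Stand-in for a tweak that drops rows at or above a retirement age of 60.
kept = nil
timings[:tweak] = Benchmark.realtime do
  kept = rows.select { |row| row[:age].to_i < 60 }
end

# Print one line per stage with the elapsed wall-clock time in seconds.
timings.each { |stage, seconds| puts format('%-6s %.4fs', stage, seconds) }
```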
hooks
To run code before and/or after the process, define hook blocks.
pre_build do
# do something
end
post_build do
# do something
end
It is pretty self-explanatory.
Development
After checking out the repo, run bin/setup to install dependencies. Then run rake test to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.
Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/xiaoxinghu/datacraft. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.
License
The gem is available as open source under the terms of the MIT License.