Ruby Extract-Transform-Load (ETL) tool.
Requirements
-
Ruby 1.8.5 or higher
-
Rubygems
Online Documentation
Available at activewarehouse.rubyforge.org/docs/activewarehouse-etl.html
Features
Current supported features:
-
ETL Domain Specific Language (DSL) - Control files are specified in a Ruby-based DSL
-
Multiple source types. Current supported types:
-
Fixed-width and delimited text files
-
XML files through SAX
-
Apache combined log format
-
-
Multiple destination types - file and database destinations
-
Support for extracting from multiple sources in a single job
-
Support for writing to multiple destinations in a single job
-
A variety of built-in transformations are included:
-
Date-to-string, string-to-date, string-to-datetime, string-to-timestamp
-
Type transformation supporting strings, integers, floats and big decimals
-
Trim
-
SHA-1
-
Decode from an external decode file
-
Default replacement for empty values
-
Ordinalize
-
Hierarchy lookup
-
Foreign key lookup
-
Ruby blocks
-
Any custom transformation class
-
-
A variety of build-in row-level processors
-
Check exists processor to determine if the record already exists in the destination database
-
Check unique processor to determine whether a matching record was processed during this job execution
-
Copy field
-
Rename field
-
Hierarchy exploder which takes a tree structure defined through a parent id and explodes it into a hierarchy bridge table
-
Surrogate key generator including support for looking up the last surrogate key from the target table using a custom query
-
Sequence generator including support for context-sensitive sequences where the context can be defined as a combination of fields from the source data
-
New row-level processors can easily be defined and applied
-
-
Pre-processing
-
Truncate processor
-
-
Post-processing
-
Bulk import using native RDBMS bulk loader tools
-
-
Virtual fields - Add a field to the destination data which doesn’t exist in the source data
-
Built in job and record meta data
-
Support for type 1 and type 2 slowly changing dimensions
-
Automated effective date and end date time stamping for type 2
-
CRC checking
-
Dependencies
ActiveWarehouse ETL depends on the following gems:
-
ActiveSupport Gem
-
ActiveRecord Gem
-
FasterCSV Gem
-
AdapterExtensions Gem
Usage
Once the ActiveWarehouse ETL gem is installed jobs can be invoked using the included ‘etl` script. The etl script includes several command line options and can process multiple control files at a time.
Command line options:
-
--help, -h
: Display the usage message. -
--config, -c
: Specify a database.yml configuration file to use. -
--limit, -l
: Specify a limit to the number of rows to process. This option is currently only applicable to database sources. -
--offset, -o
: Specify the start offset for reading from the source. This option is currently only applicable to database sources. -
--newlog, -n
: Instruct the engine to create a new ETL log rather than append to the last ETL log. -
--skip-bulk-import, -s
: Skip any bulk imports. -
--read-locally
: Read from the local cache (skip source extraction)
Control File Examples
Control file examples can be found in the examples directory.
Running Tests
Current state:
-
11 failures on MySQL
-
1 failure on Postgres
The tests require:
-
gem install shoulda
-
gem install flexmock
-
gem install pg (if you want to run the tests on pg)
-
gem install spreadsheet
The tests subfolder contains examples database.yml for mysql and postgres.
To run the tests:
-
rake test DB=postgresql (for postgres)
-
otherwise just rake test
Feedback
This is a work in progress. Comments should be made on the activewarehouse-discuss mailing list at the moment. Contributions are always welcome.