SIP: SQL to Hive Importer
SIP is a ETL tool for extracting SQL databases and importing them into Hive. It was created because the ability to transform columns and partition data was an absolute requirement, and no other tool provided that functionality.
Unique features include:
-
The ability to transform columns (using Ruby code)
-
A single, simple, human readable configuration file
-
The ability to parition tables in Hive based on the value of any (possibly transformed) columns
Bug reports and pull requests welcome on github.
Requirements
-
A working Hadoop installation
-
Hive on the namenode
-
Ruby >= 1.8.6 on all datanodes
-
SQL Ruby libs
Installation, Configuration, and Use
gem install sip
note: if no primary key (default: id), must set incremental_index to blank
How it Works
Per table to be imported, SIP determines the queries necessary to perform an export and the creates scripts (one per datanode) that are then run individually in parallel. Each script:
-
Copies a transformation / partition (transpart) script to the datanode
-
Performs the SQL extraction, piping output through the transpart script
-
Uploads all partitions to HDFS
Then, all of the partitions are imported from HDFS into Hive. Easy squeezy.