SIP: SQL to Hive Importer

SIP is a ETL tool for extracting SQL databases and importing them into Hive. It was created because the ability to transform columns and partition data was an absolute requirement, and no other tool provided that functionality.

Unique features include:

  • The ability to transform columns (using Ruby code)

  • A single, simple, human readable configuration file

  • The ability to parition tables in Hive based on the value of any (possibly transformed) columns

Bug reports and pull requests welcome on github.

Requirements

  • A working Hadoop installation

  • Hive on the namenode

  • Ruby >= 1.8.6 on all datanodes

  • SQL Ruby libs

Installation, Configuration, and Use

gem install sip

note: if no primary key (default: id), must set incremental_index to blank

How it Works

Per table to be imported, SIP determines the queries necessary to perform an export and the creates scripts (one per datanode) that are then run individually in parallel. Each script:

  1. Copies a transformation / partition (transpart) script to the datanode

  2. Performs the SQL extraction, piping output through the transpart script

  3. Uploads all partitions to HDFS

Then, all of the partitions are imported from HDFS into Hive. Easy squeezy.