hivemeta

A ruby API for access to a Hive metastore running under MySQL.

Useful for querying columns in Hadoop map/reduce applications. Normally, a developer needs to handle both the splitting of incoming data and the assignment of numerically indexed fields to friendly variables like so:

fields = line.chomp.split /t/ item_id = fields inv_cnt = fields.to_i puts “#item_idt#inv_cnt”

This is not overly traumatic, however it’s susceptible to errors creeping in from file format changes. Ongoing maintenance can easily become a burden if there are many map/reduce programs reading the same changed data files. Code size increases as the column count increases.

With hivemeta, the process is streamlined. That same task is now:

row = inv_table.process_row line puts “#rowrow.item_idt#rowrow.inv_cntrow.inv_cnt.to_i”

The row object automagically knows its column names and they can be referenced in one of the following ways (in order of best to worst performance and coolness):

row.col_name row row

Also included is a demo application, hivemeta_query.rb, to spit out table information from the command-line via table name search or by the table’s location in HDFS.


Installation

gem install hivemeta

or (for jruby)

jgem install hivemeta


API Usage

streaming map/reduce code snippet (abstracted processing loop):

require ‘hivemeta’

h = HiveMeta::Connection.new # see below for detail

h.table(‘sample_inventory’).process do |row|

item_id = row.item_id # can access by method or [:sym] or ['str']
count   = row.inv_cnt.to_i
puts "#{item_id}\t#{count}" if count >= 1000

end

streaming map/reduce code snippet (normal STDIN processing loop):

require ‘hivemeta’

h = HiveMeta::Connection.new # see below for detail inv_table = h.table ‘sample_inventory’

STDIN.each_line do |line|

begin
  row = inv_table.process_row line
rescue HiveMeta::FieldCountError
  STDERR.puts "reporter:counter:bad_data:row_size,1"
  next
end
item_id = row.item_id # can access by method or [:sym] or ['str']
count   = row.inv_cnt.to_i
puts "#{item_id}\t#{count}" if count >= 1000

end

establishing a connection (in ruby code):

db_user = ‘hive’ db_pass = ‘hivepasshere’ db_host = ‘localhost’ db_name = ‘hivemeta’

h = HiveMeta::Connection.new(db_name, db_host, db_user, db_pass)

establishing a connection (environment variables):

# when no arguments are passed, the following env variables will be used: # # hivemeta_db_host # hivemeta_db_name # hivemeta_db_user # hivemeta_db_pass # # to set these in a streaming map/reduce job, use -D arguments like so: # # -D hivemeta.db_host=mydbhost \ # -D hivemeta.db_name=hivemeta \ # -D hivemeta.db_user=hive \ # -D hivemeta.db_pass=mydbpass \

# the connection will made with those env variables without any other code h = HiveMeta::Connection.new


hivemeta_query.rb Usage

# query by table names $ hivemeta_query.rb join_test_name join_test_name hdfs://namenode/tmp/join_test_name delimiter: “t” (ASCII 9) 0 userid # userid 1 name # username

# query by table name wildcards $ hivemeta_query.rb join_test% join_test_address hdfs://namenode/tmp/join_test_address delimiter: “,” (ASCII 44) 0 userid # uid 1 address 2 city 3 state

join_test_name hdfs://namenode/tmp/join_test_name delimiter: “t” (ASCII 9) 0 userid # userid 1 name # username

# list the tables using /tmp in HDFS $ hivemeta_query -l /tmp join_test_address join_test_work my_test_table

# view usage information $ hivemeta_query.rb -h usage: ./hivemeta_query.rb [options] table_name|hdfs_path

-h, --help
-c, --comments       # display comments along with field detail (default)
-C, --no-comments    # do not display comments with the field detail
-l, --list-tables    # list matching tables but no detail
-f, --list-file-path # list the table HDFS file locations
-w, --fit-width      # fit the text to the width of the screen (default)
-W, --no-fit-width   # do not fit the text to the width of the screen
-u, --db-user=arg    # hive metastore db user (requires read access)
-p, --db-pass=arg    # hive metastore db password
-H, --db-host=arg    # host running the hive meta db (default: localhost)
-d, --db-name=arg    # hive meta db name (default: hivemeta)