Databricks Gem

Overview

This gem is designed to allow access to the DBX APIs (Jobs and SQL) from ruby applications.

Installation

Add the following to your Gemfile to install

gem 'dbx-api', '~>0.2.0'

Usage

Set up your .env file (optional)

# .env
DBX_HOST=DBX_CONNECTION_URL
DBX_TOKEN=YOUR_TOKEN_HERE
DBX_WAREHOUSE_ID=WAREHOUSE_ID_HERE

running sql from a ruby script

require 'dbx'

# If using a .env file
sql_runner = DatabricksGateway.new
# If not using .env file...
sql_runner = DatabricksGateway.new(host: 'DBX_CONNECTION_STRING', token: 'DBX_ACCESS_TOKEN', warehouse: 'DBX_SQL_WAREHOUSE_ID')

# Basic sql
response = sql_runner.run_sql("SELECT 1")
response.results
# => [{"1"=>"1"}]

# Dummy data in public DBX table
response = sql_runner.run_sql("SELECT * FROM samples.nyctaxi.trips LIMIT 1")
response.results
# => [{"tpep_pickup_datetime"=>"2016-02-14T16:52:13.000Z",
#   "tpep_dropoff_datetime"=>"2016-02-14T17:16:04.000Z",
#   "trip_distance"=>"4.94",
#   "fare_amount"=>"19.0",
#   "pickup_zip"=>"10282",
#   "dropoff_zip"=>"10171"}]

run_sql returns an object of type DatabricksSQLResponse.

The response object has a few useful methods. For a complete list, see the class definition: lib/dbx/databricks/sql_response.rb

response = sql_runner.run_sql("SELECT 1")

# checking the status of a response
response.status # => SUCCEEDED | FAILED | PENDING | RUNNING
response.failed? # => Boolean
response.success? # => Boolean

# getting the results of a response
response.results # => Array of Hashes

# looking at the raw response
response.raw_response # => HTTP object
# or just the parsed body of the HTTP response
response.body

# checking error messages for failed responses
response.error_message # => String

This gem does not make an inference to how error handling should occur. run_sql always returns an array, even if the query fails (it will return [] if status.failed?). Users may wish to check the status of the response before attempting to access the results. For example:

require 'dbx'

sql_runner = DatabricksGateway.new
res = sql_runner.run_sql("SELECT 1")

# do something with the results if the query succeeded
return res.results if res.success?

# do something else if the query failed
puts "query failed: #{res.error_message}"

The result of DatabrucksSQLResponse.results is always a an array of hashes with string keys. You can map over this array using whatever ruby code you like. For example:

sql_response = sql_runnner.run_sql(my_query)

# count the number of results
sql_response.results.length

If you would prefer a symbolized hash, you can use the symbolize_keys option when calling results:

sql_response.results(symbolize_keys: true)

Since run_sql returns an instance of DatabricksSQLResponse, you can also chain methods together:

sql_runner.run_sql("SELECT 1").results

The run_sql method hanldes slow queries by pinging databricks at set time intervals (default 5 seconds) until the query either fails or succeeds. The default sleep_timer can be set to any desired time interval at initialization:

# changing the default sleep timer to 1 second
sql_runner = DatabricksGateway.new(sleep_timer: 1)

Development

  • After checking out the repo, run bin/setup to install dependencies.
  • Set up your .env file as described above.
  • Run rake spec to run the rspec tests.

Build

  • Run gem build dbx.gemspec to build the gem.
  • Run gem push dbx-api-0.2.0.gem to push the gem to rubygems.org
    • Requires logging in to rubygems.org first via gem login