Databricks Gem
Overview
This gem is designed to allow access to the DBX APIs (Jobs and SQL) from ruby applications.
Installation
Add the following to your Gemfile to install
gem 'dbx-api', '~>0.2.0'
Usage
Set up your .env file (optional)
# .env
DBX_HOST=DBX_CONNECTION_URL
DBX_TOKEN=YOUR_TOKEN_HERE
DBX_WAREHOUSE_ID=WAREHOUSE_ID_HERE
running sql from a ruby script
require 'dbx'
# If using a .env file
sql_runner = DatabricksGateway.new
# If not using .env file...
sql_runner = DatabricksGateway.new(host: 'DBX_CONNECTION_STRING', token: 'DBX_ACCESS_TOKEN', warehouse: 'DBX_SQL_WAREHOUSE_ID')
# Basic sql
response = sql_runner.run_sql("SELECT 1")
response.results
# => [{"1"=>"1"}]
# Dummy data in public DBX table
response = sql_runner.run_sql("SELECT * FROM samples.nyctaxi.trips LIMIT 1")
response.results
# => [{"tpep_pickup_datetime"=>"2016-02-14T16:52:13.000Z",
# "tpep_dropoff_datetime"=>"2016-02-14T17:16:04.000Z",
# "trip_distance"=>"4.94",
# "fare_amount"=>"19.0",
# "pickup_zip"=>"10282",
# "dropoff_zip"=>"10171"}]
run_sql
returns an object of type DatabricksSQLResponse.
The response object has a few useful methods. For a complete list, see the class definition: lib/dbx/databricks/sql_response.rb
response = sql_runner.run_sql("SELECT 1")
# checking the status of a response
response.status # => SUCCEEDED | FAILED | PENDING | RUNNING
response.failed? # => Boolean
response.success? # => Boolean
# getting the results of a response
response.results # => Array of Hashes
# looking at the raw response
response.raw_response # => HTTP object
# or just the parsed body of the HTTP response
response.body
# checking error messages for failed responses
response. # => String
This gem does not make an inference to how error handling should occur. run_sql
always returns an array, even if the query fails (it will return []
if status.failed?). Users may wish to check the status of the response before attempting to access the results. For example:
require 'dbx'
sql_runner = DatabricksGateway.new
res = sql_runner.run_sql("SELECT 1")
# do something with the results if the query succeeded
return res.results if res.success?
# do something else if the query failed
puts "query failed: #{res.}"
The result of DatabrucksSQLResponse.results
is always a an array of hashes with string keys. You can map over this array using whatever ruby code you like. For example:
sql_response = sql_runnner.run_sql(my_query)
# count the number of results
sql_response.results.length
If you would prefer a symbolized hash, you can use the symbolize_keys
option when calling results
:
sql_response.results(symbolize_keys: true)
Since run_sql
returns an instance of DatabricksSQLResponse
, you can also chain methods together:
sql_runner.run_sql("SELECT 1").results
The run_sql
method hanldes slow queries by pinging databricks at set time intervals (default 5 seconds) until the query either fails or succeeds. The default sleep_timer can be set to any desired time interval at initialization:
# changing the default sleep timer to 1 second
sql_runner = DatabricksGateway.new(sleep_timer: 1)
Development
- After checking out the repo, run
bin/setup
to install dependencies. - Set up your
.env
file as described above. - Run
rake spec
to run the rspec tests.
Build
- Run
gem build dbx.gemspec
to build the gem. - Run
gem push dbx-api-0.2.0.gem
to push the gem to rubygems.org- Requires logging in to rubygems.org first via
gem login
- Requires logging in to rubygems.org first via