Databricks is a data analytics platform that lets you easily integrate with open source libraries. It offers a simple collaborative environment to run interactive and scheduled data analysis workloads.

RudderStack supports Databricks as a source from which you can ingest data and route it to your desired downstream destinations.

Granting permissions

To read data from Databricks, RudderStack needs specific user permissions.

Complete the steps in the following sections, in the order listed, to grant these permissions:

Step 1: Add a user

Step 2: Creating the RudderStack schema and granting permissions

  1. Create a dedicated schema _rudderstack.
CREATE SCHEMA `_rudderstack`;
RudderStack uses the _rudderstack schema to store the state of each data sync. Do not change this name.
  2. Grant full access to the schema _rudderstack for the user created in step 1.
GRANT ALL PRIVILEGES ON SCHEMA `_rudderstack` TO `user@example.com`;

Replace user@example.com with the user created in step 1.
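
If you need to prepare these statements for more than one user, a small helper can generate them. The following Python sketch is illustrative only: the function names quote_identifier and build_setup_statements are hypothetical, and it assumes that doubling an embedded backtick is an acceptable way to escape it inside a backtick-quoted identifier. It only builds the SQL text; running the statements is left to whichever Databricks SQL client you use.

```python
from typing import List

# The schema name is fixed because RudderStack expects exactly this name.
RUDDERSTACK_SCHEMA = "_rudderstack"


def quote_identifier(name: str) -> str:
    """Wrap a name in backticks, doubling any embedded backtick (assumed escaping rule)."""
    return "`" + name.replace("`", "``") + "`"


def build_setup_statements(user: str) -> List[str]:
    """Return the CREATE SCHEMA and GRANT statements from Step 2 for the given user."""
    if not user.strip():
        raise ValueError("user must not be empty")
    schema = quote_identifier(RUDDERSTACK_SCHEMA)
    return [
        f"CREATE SCHEMA {schema};",
        f"GRANT ALL PRIVILEGES ON SCHEMA {schema} TO {quote_identifier(user)};",
    ]


if __name__ == "__main__":
    # user@example.com is the placeholder used in this guide.
    for statement in build_setup_statements("user@example.com"):
        print(statement)
```

The helper keeps the schema name as a constant rather than a parameter, because this guide states that the name should not be changed.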

Setting up the Databricks source in RudderStack

To set up Databricks as a source in RudderStack, follow these steps:

Naming the source

  1. Log into your RudderStack dashboard.
  2. From the left panel, go to Source > New Source > Reverse ETL. Then, select Databricks.
  3. Assign a name to your source.

Configuring the connection credentials

  1. Enter the relevant settings from Databricks in the Connection Credentials section as shown below:
    • Host - Enter the server hostname.
    • Port - Enter the port number.
    • Path - Enter the HTTP path.
    • Token - Enter the personal access token.
For more information on getting these settings in Databricks, refer to the FAQ section below.
If you've already configured Databricks as a source before, your existing credentials will automatically appear under Use Existing Credentials.
  2. Click Continue to proceed.
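
Before you paste the four settings into the dashboard, it can help to sanity-check them locally. The sketch below is a hypothetical pre-check, not part of RudderStack or Databricks: the DatabricksCredentials class, its problems method, and the specific rules (a bare hostname without a scheme, a port in the valid TCP range, non-empty path and token) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DatabricksCredentials:
    """The four Connection Credentials fields described in this guide."""
    host: str   # server hostname
    port: int   # port number
    path: str   # HTTP path
    token: str  # personal access token

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means nothing obvious is wrong."""
        issues = []
        if not self.host.strip():
            issues.append("Host is empty")
        elif "://" in self.host:
            # Assumption: the field expects a hostname, not a full URL.
            issues.append("Host should not include a scheme such as https://")
        if not 1 <= self.port <= 65535:
            issues.append("Port must be between 1 and 65535")
        if not self.path.strip():
            issues.append("Path is empty")
        if not self.token.strip():
            issues.append("Token is empty")
        return issues


if __name__ == "__main__":
    # All values below are placeholders; substitute your own settings.
    creds = DatabricksCredentials(
        host="<server-hostname>",
        port=443,
        path="<http-path>",
        token="<personal-access-token>",
    )
    print(creds.problems() or "No obvious problems found")
```

A check like this only catches formatting slips. Whether the credentials actually work is decided by the validations RudderStack runs after you click Continue.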

Schedule settings

  1. Specify the Schedule Settings to schedule the data syncs from your Databricks source.
RudderStack lets you schedule data syncs for your Reverse ETL sources and specify how and when the syncs will run. For more information on the Basic, CRON, and Manual schedule types, refer to the Sync Schedule Settings guide.
  2. After specifying the schedule type and run settings, click Continue to finish the setup.

Databricks is now configured as a source in your RudderStack dashboard. To connect this source to your preferred destination, click the Add Destination button.

If you have already configured a destination in RudderStack, choose the Use Existing Destination option, which takes you to the Schema tab in the source settings. To add a new destination from scratch, select the Create New Destination option, which takes you to the destination configuration page.

Specifying the data to import

While connecting a destination to your Databricks source, you can use the default JSON mapping feature.

Note that the Visual Data Mapping feature is currently not supported for Databricks.
For more information on the data import settings, refer to the Importing Data using Tables guide.

FAQ

Where can I obtain the connection credentials for Databricks?

To obtain the Host, Path, and Port number, go to your Databricks account and follow these steps:

  1. Go to the Compute tab and select your Databricks cluster.
  2. Under Advanced options, open the JDBC/ODBC tab to find the required settings.

To obtain the Token, go to Settings > User Settings in your Databricks account and generate a new personal access token.
Refer to the Databricks documentation for more details on generating a personal access token.

What do the three validations under Verifying Credentials imply?

When setting up a Reverse ETL source, once you proceed after entering the connection credentials, you will see the following three validations under the Verifying Credentials option:

These options are explained below:

  • Verifying Connection: This option indicates that RudderStack is trying to connect to the warehouse with the information specified in the connection credentials.
If this option gives an error, it means that one or more fields specified in the connection credentials are incorrect. Verify your credentials in this case.
  • Able to List Schema: This option checks if RudderStack is able to fetch all the schema details using the provided credentials.
  • Able to Access RudderStack Schema: This option checks if RudderStack can access the _rudderstack schema that you created by running the commands in the Creating the RudderStack schema and granting permissions section.
If this option gives an error, verify if you have successfully created the _rudderstack schema and given RudderStack the required permissions to access it. For more information, refer to the Creating the RudderStack schema and granting permissions section.
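
The three validations run in sequence, so the first one that fails tells you where to look. The following Python sketch models that ordering only; it is not RudderStack's implementation. The function first_failed_validation and its three callables (connect, list_schemas, can_access_schema) are hypothetical stand-ins for whatever client you use to talk to Databricks.

```python
from typing import Callable, List, Optional

RUDDERSTACK_SCHEMA = "_rudderstack"


def first_failed_validation(
    connect: Callable[[], object],
    list_schemas: Callable[[object], List[str]],
    can_access_schema: Callable[[object, str], bool],
) -> Optional[str]:
    """Return the name of the first validation that fails, or None if all pass."""
    # 1. Verifying Connection: any error here points to incorrect credentials.
    try:
        connection = connect()
    except Exception:
        return "Verifying Connection"

    # 2. Able to List Schema: the credentials must allow fetching schema details.
    try:
        list_schemas(connection)
    except Exception:
        return "Able to List Schema"

    # 3. Able to Access RudderStack Schema: the _rudderstack schema must exist
    #    and be accessible to the configured user.
    try:
        accessible = can_access_schema(connection, RUDDERSTACK_SCHEMA)
    except Exception:
        accessible = False
    if not accessible:
        return "Able to Access RudderStack Schema"

    return None


if __name__ == "__main__":
    # Stub callables simulate a user who never ran the Step 2 commands.
    failed = first_failed_validation(
        connect=lambda: object(),
        list_schemas=lambda conn: ["default"],
        can_access_schema=lambda conn, schema: False,
    )
    print(failed)
```

With real callables in place of the stubs, a failure at the third stage would suggest revisiting the schema creation and grant statements rather than the connection credentials.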

Contact us

For more information on the topics covered on this page, email us or start a conversation in our Slack community.