Overview
Databricks Asset Bundles are a way to develop, package, version, and deploy Databricks workspace artifacts (such as notebooks, workflows, and libraries) using YAML-based configuration files. This allows for CI/CD integration and reproducible deployments across environments (dev/test/prod).
What are Databricks Asset Bundles
Databricks Asset Bundles are an infrastructure-as-code (IaC) approach to managing your Databricks projects. They make it possible to apply software engineering best practices to a project, including source control, code review, testing, and continuous integration and delivery (CI/CD).
A bundle includes the following parts:
-Required cloud infrastructure and workspace configurations
-Source files, such as notebooks and Python files, that include the business logic
-Definitions and settings for Databricks resources, such as Databricks jobs, Delta Live Tables pipelines, Model Serving endpoints, MLflow Experiments, and MLflow registered models
-Unit tests and integration tests
Why do we use Databricks Asset Bundles
-Consistency and Reproducibility: Bundles package notebooks, libraries, and configurations together, ensuring consistent deployments across various environments.
-Simplified Deployment: By consolidating all necessary assets into a single deployable unit, errors during deployment are minimized.
-Version Control: Enables tracking of changes and easy rollback to previous versions.
-Environment Isolation: Facilitates controlled deployment across different environments, enhancing stability and testing capabilities.
How do Databricks Asset Bundles work
-Bundle metadata is defined using YAML files that specify the artifacts, resources, and configuration of a Databricks project; a minimal example follows this list. You can create a YAML file manually or generate one using a bundle template.
-The Databricks CLI can then be used to validate, deploy, and run bundles using these YAML files. You can run bundle projects from IDEs, terminals, or directly within Databricks.
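As a minimal sketch of such a YAML file (the bundle name my_project and the host URL are placeholders, not values from this article):

    bundle:
      name: my_project

    targets:
      dev:
        mode: development
        default: true
        workspace:
          host: https://adb-1234567890123456.7.azuredatabricks.net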
Deployment Process
Pre-requisites to Deploy a Bundle:
1. An Azure Databricks workspace account
2. Visual Studio Code installed
3. Python 3.8 or higher
4. The Databricks CLI installed and available in the VS Code terminal (the newer CLI, not the legacy databricks-cli pip package, provides the bundle commands); an example install command follows this list
5. A Git or Azure DevOps account
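As a sketch, the CLI can be installed with Homebrew on macOS/Linux or with the install script published in the databricks/setup-cli repository (commands shown for illustration; check the official install instructions for your platform):

    # Homebrew (macOS/Linux)
    brew tap databricks/tap
    brew install databricks

    # or the install script
    curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh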
Log in to Databricks
-databricks auth login --host <workspace_url>
-Once authentication completes, the browser opens a confirmation page and the CLI saves an authentication profile locally

Set the Configuration:
-databricks configure
-Provide the workspace URL and token
-A personal access token (PAT) must be generated and supplied as the token. You can generate the token in the workspace by following the steps below:
-Go to Settings -> User -> Developer -> Access tokens -> Manage -> Generate new token -> provide a comment and the token lifetime in days (Note: leave the lifetime empty for a token that never expires)

To check the profile names, use the command below:
-cat ~/.databrickscfg
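When PAT authentication is used, the file typically contains one profile per workspace, along the lines of this illustrative sketch (hosts and token values are placeholders):

    [DEFAULT]
    host  = https://adb-1234567890123456.7.azuredatabricks.net
    token = dapiXXXXXXXXXXXXXXXXXXXXXXXX

    [UAT]
    host  = https://adb-6543210987654321.1.azuredatabricks.net
    token = dapiYYYYYYYYYYYYYYYYYYYYYYYY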

Databricks bundle workflow
Create a bundle using the below command:
-databricks bundle init
-Provide the project name
-Choose Python as the language by selecting the default Python template
-Choose the options to include the sample notebook under src, the Delta Live Tables (DLT) pipeline, and the Python package stub
-Once the required options are provided, a bundle folder will be created.

After creating a bundle, the folder structure we get is as shown below:
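The default Python template produces a layout roughly like the following (shown as an illustrative sketch for a project named demo_v2; exact file names vary with the CLI version and the options chosen above):

    demo_v2/
      databricks.yml          # main bundle configuration and target definitions
      resources/
        demo_v2.yml           # job and pipeline definitions
      src/
        notebook.ipynb        # main project code
      .databricks/            # hidden folder created later, during validation/deployment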

Understanding the Structure and Configuration Files in a Databricks Asset Bundle
1. databricks.yml:
This is the main configuration file used by DAB. It contains environment-specific configs (targets), which are used when the same bundle is deployed to different environments.
Modifying Environment-Specific Configs:
If you look at databricks.yml, it contains entries for the different environments, but the workspace host is initially the same for all of them, because by default it takes the host URL of the Databricks workspace configured during DAB creation. The host in the ACC and PROD sections of databricks.yml must be changed to point to the UAT and production Databricks workspaces; an illustrative snippet follows this list.
2. src/notebook.ipynb: Contains the main project code.
3. resources/project_name.yml: This file contains the information needed to run Databricks jobs, such as tasks, job cluster configs, etc.
4. .databricks folder: A hidden ".databricks" folder is created during DAB deployment. It mainly contains the environment-specific bundle files used to deploy to the respective Databricks workspace, as well as the Terraform files that DAB uses internally to deploy to the workspace.
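The two YAML files above can be sketched as follows; the host URLs, the acc and prod target names, and the cluster settings are illustrative placeholders rather than values generated by the template. First, environment-specific targets in databricks.yml:

    bundle:
      name: demo_v2

    targets:
      dev:
        mode: development
        default: true
        workspace:
          host: https://adb-1111111111111111.1.azuredatabricks.net   # dev workspace
      acc:
        workspace:
          host: https://adb-2222222222222222.2.azuredatabricks.net   # UAT workspace
      prod:
        mode: production
        workspace:
          host: https://adb-3333333333333333.3.azuredatabricks.net   # production workspace

And a job definition in resources/demo_v2.yml; the job key demo_v2 matches the resource key used by the run command later in this article:

    resources:
      jobs:
        demo_v2:
          name: demo_v2
          tasks:
            - task_key: main
              notebook_task:
                notebook_path: ../src/notebook.ipynb
              job_cluster_key: job_cluster
          job_clusters:
            - job_cluster_key: job_cluster
              new_cluster:
                spark_version: 13.3.x-scala2.12
                node_type_id: Standard_DS3_v2
                num_workers: 1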
Executing and Cleaning Up Databricks Asset Bundles
Validate the bundle:
To validate the bundle, use the commands below:
-cd demo_v2
-databricks bundle validate
-Validation checks the bundle configuration against the selected target and creates the hidden .databricks folder locally
-If no target is specified, the default dev target declared in databricks.yml is used
-No resources are created in the workspace at this stage; files are pushed to the workspace only when the bundle is deployed

Deploy the bundle:
To deploy the bundle, use the command below:
-databricks bundle deploy
Once the deployment is completed, the complete structure and its content are visible in the workspace.
A job is created under Workflows, with its name prefixed by the target and user name, for example [dev user_name] demo_v2
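To deploy to a target other than the default, pass the target explicitly, for example using the illustrative acc target from the snippet above:

    databricks bundle deploy -t acc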
Run a job or pipeline
To run a specific job or pipeline, use the bundle run command. You must specify the resource key of the job or pipeline declared within the bundle configuration files. By default, the target declared within the bundle configuration files is used. For example, to run the job with the resource key demo_v2 against the default target, run the following command:
-databricks bundle run demo_v2
To run the job with the key demo_v2 within the context of a specific target, such as dev or a higher environment, specify the target with the -t option:
-databricks bundle run -t dev demo_v2
Destroy a bundle
To delete jobs, pipelines, and artifacts that were previously deployed, run the bundle destroy command. The following command deletes all previously deployed jobs, pipelines, and artifacts that are defined in the bundle configuration files:
Note: Destroying a bundle permanently deletes a bundle’s previously deployed jobs, pipelines, and artifacts. This action cannot be undone.
-databricks bundle destroy
By default, you are prompted to confirm permanent deletion of the previously deployed jobs, pipelines, and artifacts. To skip these prompts and perform automatic permanent deletion, add the --auto-approve option to the bundle destroy command.
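For example, to tear down the dev deployment of the demo_v2 bundle without an interactive prompt (a sketch combining the flags described above):

    databricks bundle destroy -t dev --auto-approve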
Best Practices
-Keep bundle files under version control (Git).
-Use meaningful naming conventions for jobs and workflows.
-Use variable substitution instead of hardcoding secrets and environment-specific values (a sketch follows this list).
-Structure notebooks and scripts cleanly within subfolders.
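As a sketch of that practice, a bundle can declare a variable and reference it instead of a hardcoded value; the variable name catalog_name is hypothetical, and its value can be overridden per environment through a BUNDLE_VAR_ environment variable or the --var flag:

    variables:
      catalog_name:
        description: Catalog to write to in this environment

    targets:
      dev:
        variables:
          catalog_name: dev_catalog

    # Reference the value elsewhere in the bundle configuration as ${var.catalog_name}.
    # At deploy time it can also be supplied from the environment, for example:
    #   export BUNDLE_VAR_catalog_name=uat_catalog
    #   databricks bundle deploy -t acc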
Use Cases
-Automating job deployments between environments.
-Consistent CI/CD pipelines using GitHub Actions, Azure DevOps, and similar tools (see the sketch below).
-Collaborating on Databricks development projects with version control.
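For instance, a minimal GitHub Actions workflow for the CI/CD use case might look like the sketch below; it assumes the databricks/setup-cli action and repository secrets named DATABRICKS_HOST and DATABRICKS_TOKEN, none of which are defined in this article:

    name: deploy-bundle
    on:
      push:
        branches: [main]

    jobs:
      deploy:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: databricks/setup-cli@main   # installs the Databricks CLI
          - name: Validate and deploy the bundle
            working-directory: ./demo_v2
            env:
              DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
              DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
            run: |
              databricks bundle validate
              databricks bundle deploy -t acc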