Overview
In enterprise environments, accessing on-premises Oracle databases securely from cloud platforms like Databricks is a common requirement. This blog explains how to establish that connection using the Oracle JDBC driver, securely manage credentials with Azure Key Vault, and use a custom PySpark function to simplify the connection process.
What is the OJDBC Driver?
The OJDBC (Oracle JDBC) driver is a Java-based driver that allows Java applications (and Spark, via JVM) to interact with Oracle databases using JDBC (Java Database Connectivity).
- ojdbc6-11.2.0.4.jar specifically supports:
  - Java 6 (compatible with higher versions too).
  - Oracle DB 11g and later.
- Commonly chosen for compatibility with legacy systems and stable enterprise environments.
Oracle Database 11g is a widely used relational database management system (RDBMS). The “g” in 11g stands for Grid Computing, which was Oracle’s focus at the time: offering more flexible and scalable infrastructure.
Step 1: Uploading the OJDBC Driver to Databricks
To connect to Oracle from Databricks, you need to install the JDBC driver:
- Go to your Databricks workspace → Compute → select your cluster → Libraries.
- Click “Install new”, select “Upload”, and upload ojdbc6-11.2.0.4.jar to install it on the desired cluster.
- Alternatively, place the JAR in Volumes and load it using the cluster configuration.
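Once the JAR is installed, a quick way to confirm the driver is visible to Spark’s JVM is to try loading the driver class from a notebook. This is just a sanity check, not part of the connection flow, and it assumes the JAR was installed at the cluster level as described above:

# Sanity check: if this call raises no exception, the Oracle JDBC driver
# is on the cluster's classpath and available to Spark.
spark._jvm.Class.forName("oracle.jdbc.OracleDriver")
print("Oracle JDBC driver is available on this cluster.")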


Step 2: Secure Credentials Using Azure Key Vault
You should never hardcode Oracle DB credentials. Instead:
- Create a secret (a new secret version) in Azure Key Vault.
- Store your Oracle connection string as the secret’s value.
For reference, the stored value looks like this:
host=192.168.*.***;port=15**;sid=orcl;username=******;password=******
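With the secret stored, you can confirm it is reachable from Databricks through a Key Vault-backed secret scope. A minimal sketch; oracle_scope and oracle_connection are the scope and key names used later in this post:

# Retrieve the connection string; Databricks redacts secret values in
# notebook output, so print only the length as confirmation.
conn = dbutils.secrets.get(scope="oracle_scope", key="oracle_connection")
print(f"Retrieved connection string ({len(conn)} characters)")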


Step 3: Modular Function for Oracle Connection
The following PySpark function modularizes the connection logic using the secret:
def connectionEstablishmentOracleConnection(scope, key):
    # Retrieve the connection string from the Databricks secret scope.
    connection_string = dbutils.secrets.get(scope=scope, key=key)
    # Parse the "key=value;key=value" string into a dictionary.
    connection_details = {}
    for pair in connection_string.split(";"):
        if "=" in pair:
            field, value = pair.split("=", 1)
            connection_details[field.strip()] = value.strip()
    database_host = connection_details.get("host")
    database_port = connection_details.get("port")
    sid = connection_details.get("sid")
    username = connection_details.get("username")
    password = connection_details.get("password")
    # Fail fast if any essential connection detail is missing.
    if not all([database_host, database_port, sid]):
        raise ValueError("One or more connection details (host, port, sid) are missing.")
    jdbc_url = f"jdbc:oracle:thin:@{database_host}:{database_port}:{sid}"
    return jdbc_url, username, password
In the PySpark code above, jdbc_url is the URL used to connect to an Oracle database, and its individual components are as follows:
jdbc:
- This is the protocol prefix that indicates to the system that you’re using Java Database Connectivity (JDBC).
oracle:
- Specifies the database vendor, in this case, Oracle.
thin:
- Refers to the “thin” JDBC driver, which is a lightweight, platform-independent Oracle JDBC driver.
- It doesn’t require any Oracle client installation; just the .jar file (like ojdbc6-11.2.0.4.jar) is enough.
@{database_host}:{database_port}:{sid}
- This part contains the Oracle database connection details:
- database_host: IP address or hostname of the on-prem Oracle server (e.g., 192.168.1.100)
- database_port: Usually 1521 (default port for Oracle DB)
- sid: Stands for System Identifier, a unique name for the Oracle database instance (e.g., ORCL)
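Note that Oracle’s thin driver also accepts a service-name URL form; if your database exposes a service name rather than a SID, the URL changes slightly. A brief illustration using the example host and port above, where orcl_svc is a hypothetical service name:

sid_url = "jdbc:oracle:thin:@192.168.1.100:1521:orcl"            # SID form (used in this post)
service_url = "jdbc:oracle:thin:@//192.168.1.100:1521/orcl_svc"  # service-name form: note // and /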
This function prepares a connection to an on-premises Oracle database by retrieving a secure connection string from Databricks secrets. It parses the string into key-value pairs to extract details like host, port, SID, username, and password.
After validating that essential fields are present, it builds a JDBC connection URL using these values. Finally, it returns the JDBC URL along with the username and password, which can be used to establish a database connection in Spark.
This approach ensures credentials are securely managed and not hardcoded in the code.
Step 4: Reading Data from Oracle Using PySpark
Once the function is set up, use it to fetch credentials and read a table:
scope = "oracle_scope"
key = "oracle_connection"
jdbc_url, username, password = connectionEstablishmentOracleConnection(scope, key)

df = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "table_name") \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .load()

df.display()
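For large tables, pulling the entire table across the network is often unnecessary. A minimal sketch of pushing a filter down to Oracle instead; the orders table and its columns are hypothetical, so substitute your own query:

# Wrap a SQL query in a subquery alias and pass it as "dbtable" so the
# filtering runs inside Oracle; only matching rows cross the network.
query = "(SELECT order_id, amount FROM orders WHERE order_date >= DATE '2024-01-01') t"
df_filtered = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", query) \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .load()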
Advantages
- Secure Credential Management: Using Azure Key Vault ensures sensitive information (like DB usernames/passwords) is encrypted and not hardcoded in notebooks.
- Reusable Modular Code: The connectionEstablishmentOracleConnection function promotes clean, reusable, and maintainable connection logic.
- Seamless Integration with Oracle DB: The ojdbc6-11.2.0.4.jar driver enables JDBC communication with on-prem Oracle 11g databases, a common enterprise setup.
- Cloud-to-On-Prem Flexibility: Supports hybrid architectures where cloud compute (Databricks) accesses on-prem legacy systems without full data migration.
- Cluster-Wide Availability: Installing the driver at the cluster level ensures it’s available across all notebooks and jobs without additional setup.
Disadvantages
- Driver Management Overhead: You need to manually upload and manage the ojdbc driver version, which can be error-prone or inconsistent across clusters.
- Secret Access Permissions: Misconfiguring Azure Key Vault access policies or Databricks secret scopes could block connections or expose credentials.
- Network Restrictions: Connecting to an on-prem database may require VPN, firewall exceptions, or private endpoint setup, adding complexity.
- Legacy Dependency: Using ojdbc6 ties you to older versions of Java and Oracle, potentially missing out on optimizations in newer versions (like ojdbc8 or Oracle 19c/23c).
- Performance Constraints: Latency between Databricks (cloud) and on-prem Oracle (local) can affect read/write speeds, especially with large datasets. Partitioned reads can mitigate this, as shown in the sketch below.
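The last point can often be mitigated by parallelizing the read across partitions. A minimal sketch, assuming a hypothetical numeric order_id column with a roughly known value range; the column name, bounds, and partition count are illustrative:

# Split the JDBC read into parallel partitions to speed up large reads.
# order_id and its bounds are hypothetical; choose a numeric, evenly
# distributed column from your own table.
df_parallel = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("dbtable", "table_name") \
    .option("user", username) \
    .option("password", password) \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .option("partitionColumn", "order_id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "8") \
    .load()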
Conclusion
Connecting to an on-prem Oracle database from Databricks involves integrating multiple secure and scalable components:
- Oracle JDBC driver for database connectivity.
- Azure Key Vault for secret management.
- Custom modular PySpark function to handle parsing and connection logic.
- Databricks Cluster Library setup to ensure the necessary drivers are available.
This approach ensures your connection is secure, reusable, and scalable, especially in environments with multiple databases or changing credentials. It separates credentials from code, supports role-based access control, and aligns with cloud security best practices.