Category: Databricks
-
Verify, Trust, Comply: The Future of Responsible AI on Databricks
Regulators expect timely, accurate disclosures; investors demand transparent ESG performance; customers reward brands that do the right thing and prove it. Yet inside most enterprises, compliance is chaotic, with internal data scattered across finance, supply chain, HR, and operations. Databricks helps break down these silos, unifying enterprise data on a single platform so organizations can…
-
Talk Data to Me: Conversational AI Meets the Lakehouse with Databricks
In today’s data-driven world, businesses sit on mountains of data, but turning raw data into actionable insights remains a major challenge. Multiple siloed systems, fragmented datasets, and the sheer complexity of analysis often leave organizations paralyzed, unable to extract meaningful insights promptly. Decision-making slows, opportunities are missed, and teams are bogged down in manual data…
-
Seamless Ingestion from Google Sheets to Databricks: A Step-by-Step Guide
In today’s data-driven world, enterprises handle massive amounts of continuously arriving data from various sources. Google Sheets often serves as a quick and easy way for teams to manage and share data, especially for smaller datasets or collaborative efforts. However, when it comes to advanced analytics, larger datasets, or integration with other complex data sources,…
-
Deep Copy vs Shallow Copy in Databricks Delta Lake
When working with large-scale data in Databricks Delta Lake, it’s common to create copies of tables for testing, development, or archival purposes. However, not all copies are created equal. In Delta Lake, shallow copy and deep copy serve different purposes and have very different behaviors — both in terms of performance and data isolation. In…
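The shallow-versus-deep distinction can be illustrated with plain Python (an analogy only, not Delta Lake code): a shallow copy shares the underlying data, much as a Delta shallow clone references the source table's data files, while a deep copy duplicates everything, like a deep clone.

```python
import copy

# Hypothetical analogy (plain Python, not Delta Lake): model a "table" as a
# dict of partition -> rows. copy.copy shares the underlying row lists, the
# way a shallow clone references the source's data files; copy.deepcopy
# duplicates them, the way a deep clone owns an independent copy.
source = {"2024": [1, 2], "2025": [3]}

shallow = copy.copy(source)      # new dict, but the same row lists
deep = copy.deepcopy(source)     # fully independent copy

source["2024"].append(99)        # mutate the source's data in place

print(shallow["2024"])  # [1, 2, 99] -- the shallow copy sees the change
print(deep["2024"])     # [1, 2]     -- the deep copy is isolated
```

The same trade-off drives the choice in Delta Lake: shallow copies are cheap but coupled to the source; deep copies cost storage but give real isolation.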
-
The Hidden Wall Between Fabric OneLake and Databricks Unity Catalog
These days, many teams use Microsoft Fabric OneLake for unified storage and Databricks Unity Catalog (UC) for data governance and analytics. But here’s the catch: when you try to connect them directly, you hit a wall. You can’t simply register a Fabric Lakehouse as an external location in Databricks Unity Catalog like you would with…
-
Databricks Clean Room — where shared insights meet uncompromised privacy
A data clean room is a secure space that enables businesses to work together on sensitive data without exposing or compromising it. By using robust protocols and advanced technologies, it allows multiple parties to combine and analyse information while ensuring strict adherence to privacy regulations and compliance requirements. Let’s consider a scenario where two organizations…
-
Handling CDC in Databricks: Custom MERGE vs. DLT APPLY CHANGES
Change data capture (CDC) is crucial for keeping data lakes synchronized with source systems. Databricks supports CDC through two main approaches: a custom MERGE operation (Spark SQL or PySpark), and Delta Live Tables (DLT) APPLY CHANGES, a declarative CDC API. This blog explores both methods and their trade-offs, and demonstrates best practices for production-grade pipelines in Databricks. Custom…
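The upsert/delete behavior both approaches implement can be sketched in plain Python (a hypothetical illustration, not Spark or DLT code): apply a stream of change events to an in-memory target keyed by primary key.

```python
# Hypothetical sketch (plain Python): the core semantics that a Delta MERGE
# or DLT APPLY CHANGES performs, applied to an in-memory "target table"
# keyed by primary key. Each change event is (op, key, row).
def apply_cdc(target, changes):
    """Apply CDC events in order to the target dict and return it."""
    for op, key, row in changes:
        if op == "delete":
            target.pop(key, None)   # matched + delete flag -> remove the row
        else:                       # "insert" or "update"
            target[key] = row       # matched -> update; not matched -> insert
    return target

target = {1: {"name": "Ana"}, 2: {"name": "Bo"}}
changes = [
    ("update", 1, {"name": "Anna"}),
    ("insert", 3, {"name": "Cy"}),
    ("delete", 2, None),
]
print(apply_cdc(target, changes))
# {1: {'name': 'Anna'}, 3: {'name': 'Cy'}}
```

In a real pipeline, event ordering (e.g. by a sequence column) matters just as it does in the loop above; out-of-order events would leave the target stale.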
-
End-to-End Ingestion of 400+ MySQL Tables with Databricks Delta Live Tables
Ingesting and managing data from more than 400 MySQL tables on recurring schedules is a complex challenge. Traditional approaches often lead to pipelines that are difficult to scale, hard to maintain, and prone to failure when handling schema changes or scheduling dependencies. To address these challenges, we designed and implemented a configuration-driven ingestion framework using…
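The configuration-driven idea can be sketched in plain Python (a hypothetical simplification, with made-up table names and no DLT APIs): one generic builder expands a config list into ingestion tasks, instead of 400 hand-written pipelines.

```python
# Hypothetical sketch (plain Python, no DLT): a config-driven framework's
# core move -- declare each table once as data, then generate the pipeline
# from that declaration. Table names here are illustrative only.
TABLES = [
    {"source": "mysql.orders", "target": "bronze.orders", "keys": ["id"]},
    {"source": "mysql.users",  "target": "bronze.users",  "keys": ["id"]},
]

def build_ingestion_plan(configs):
    """Expand each table config into a human-readable ingestion task."""
    return [
        f"ingest {c['source']} -> {c['target']} (keys={','.join(c['keys'])})"
        for c in configs
    ]

for task in build_ingestion_plan(TABLES):
    print(task)
# ingest mysql.orders -> bronze.orders (keys=id)
# ingest mysql.users -> bronze.users (keys=id)
```

Scaling to 400+ tables then means growing the config list, not the code; schema or scheduling changes live in one place.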
-
Streaming Made Simple with Databricks and Debezium
In today’s fast-paced, data-driven world, real-time data processing and change data capture (CDC) are crucial for businesses to make timely and informed decisions. Databricks, a powerful cloud-based analytics platform, combined with Debezium, an open-source CDC tool, enables seamless real-time data replication and transformation. This blog will explore Databricks and Debezium, detailing their integration and…
-
Liquid Clustering in Databricks: The Future of Delta Table Optimization
In the ever-evolving world of big data, performance tuning is no longer optional – it’s essential. As datasets grow exponentially, so does the complexity of keeping them optimized for querying. Databricks’ Liquid Clustering is a groundbreaking approach to data organization within Delta tables. Unlike traditional static partitioning,…
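Why clustering speeds up queries can be shown with a plain-Python sketch (an illustration of the data-skipping mechanism, not Databricks code): when rows with similar key values land in the same file, per-file min/max statistics let a query skip most files entirely.

```python
# Hypothetical sketch (plain Python): file-level data skipping, the
# mechanism that clustering -- liquid or otherwise -- makes effective.
# Each data file is represented only by its (min_key, max_key) statistics.
def files_to_scan(file_stats, key):
    """Return indices of files whose key range could contain `key`."""
    return [i for i, (lo, hi) in enumerate(file_stats) if lo <= key <= hi]

clustered = [(0, 9), (10, 19), (20, 29)]     # tight, non-overlapping ranges
unclustered = [(0, 29), (0, 29), (0, 29)]    # every file spans all keys

print(files_to_scan(clustered, 15))    # [1] -- only one file scanned
print(files_to_scan(unclustered, 15))  # [0, 1, 2] -- no skipping possible
```

Liquid Clustering keeps those per-file ranges tight as data arrives, without committing to fixed partition boundaries up front.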