Handling CDC in Databricks: Custom MERGE vs. DLT APPLY CHANGES

Change data capture (CDC) is crucial for keeping data lakes synchronized with source systems. Databricks supports CDC through two main approaches:

  • Custom MERGE operation (Spark SQL or PySpark)
  • Delta Live Tables (DLT) APPLY CHANGES, a declarative CDC API

This blog explores both methods and their trade-offs, and demonstrates best practices for production-grade CDC pipelines in Databricks.

Custom MERGE INTO: Flexibility with Complexity

Traditional CDC via MERGE INTO requires manual orchestration:

  • Deduplicate and order records
    You must identify the most recent change per key using window functions (e.g., ordering by a timestamp) and filter accordingly.
  • Custom merge logic
    You must specify WHEN MATCHED and WHEN NOT MATCHED clauses to implement INSERT, UPDATE, and DELETE behavior.
  • Manage out-of-order events
    Without strict ordering logic, late-arriving records can lead to inconsistent updates.
  • Operational overhead
    You must optimize, partition, handle errors, and tune performance, all manually.
Example:
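Below is a minimal PySpark sketch of this pattern. It assumes a hypothetical CDC feed table bronze.customer_cdc with a customer_id key, an event_ts ordering column, and an op column that flags deletes, merging into a hypothetical target silver.customers:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

# Read the raw CDC feed (table and column names are hypothetical).
updates = spark.read.table("bronze.customer_cdc")

# Deduplicate: keep only the most recent change per key, ordered by event_ts.
w = Window.partitionBy("customer_id").orderBy(F.col("event_ts").desc())
latest = (
    updates
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)

# Apply inserts, updates, and deletes against the target Delta table.
target = DeltaTable.forName(spark, "silver.customers")
(
    target.alias("t")
    .merge(latest.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.op = 'DELETE'")
    .whenMatchedUpdateAll(condition="s.op <> 'DELETE'")
    .whenNotMatchedInsertAll(condition="s.op <> 'DELETE'")
    .execute()
)
```

Every piece of this (the dedup window, delete handling, ordering) is hand-written, which is exactly the overhead APPLY CHANGES removes.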

DLT APPLY CHANGES: Declarative CDC Simplified

Delta Live Tables (DLT) offers declarative CDC via APPLY CHANGES, designed for both streaming and batch CDC workloads.

Key Features:
  • Automatic sequencing and deduplication
    Handles out-of-order events using a defined sequence column; no manual window logic needed.
  • Built-in change semantics
    Supports INSERT, UPDATE, and DELETE with SCD Type 1/2 via a few key parameters; no need for multiple MERGE clauses.
  • Streaming-native
    Designed for Structured Streaming, offering better performance, checkpointing, and transactional consistency.
  • Simpler code & governance
    Less boilerplate, fewer maintenance burdens, and cleaner pipeline definitions.
Example:
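A minimal DLT Python sketch of the same pipeline, under the same hypothetical schema (customer_id key, event_ts sequence column, op delete marker):

```python
import dlt
from pyspark.sql.functions import col, expr

# Target streaming table maintained by APPLY CHANGES.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customer_cdc_feed",              # hypothetical upstream CDC feed
    keys=["customer_id"],                    # unique key per record
    sequence_by=col("event_ts"),             # resolves out-of-order events
    apply_as_deletes=expr("op = 'DELETE'"),  # rows matching this are deleted
    except_column_list=["op"],               # exclude the CDC op column from the target
    stored_as_scd_type=1,                    # Type 1 (overwrite); use 2 to keep history
)
```

Deduplication, ordering, and delete semantics are all expressed as parameters rather than hand-written logic.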

MERGE INTO vs. APPLY CHANGES: Side-by-Side Comparison

| Feature               | MERGE INTO                          | DLT APPLY CHANGES                           |
| --------------------- | ----------------------------------- | ------------------------------------------- |
| CDC semantics         | Manual implementation required      | Declarative I/U/D with SCD support          |
| Deduplication & order | Window functions + custom logic     | Built-in via the sequence_by column         |
| Out-of-order handling | Manual pre-processing               | Automatically handled                       |
| Code complexity       | High: multiple SQL/logic blocks     | Low: simple declarative API                 |
| Streaming native      | Yes, but with manual checkpoints    | Yes, optimized for streaming                |
| SCD Type support      | Custom scripts required             | Natively supports Type 1 & Type 2           |
| Performance           | Depends on manual tuning            | Streamlined for efficiency                  |
| Flexibility           | Full control (joins, custom logic)  | Best for standard CDC; limited flexibility  |

When to Use Which?

Choose MERGE INTO when:
  • You require multi-table joins or complex business logic
  • You are managing non-standard SCD Type 2/3 transformations
  • You are running ad-hoc batch jobs or historical backfills

Choose APPLY CHANGES when:
  • You are building structured streaming CDC pipelines with Delta Live Tables
  • The CDC feed has a clear primary key and sequence column
  • You want simplified, maintainable, and streaming-native CDC pipelines

Best Practices for DLT APPLY CHANGES
  1. Enable the CDC source: Ensure your input Delta table has Change Data Feed enabled (see the sketch after this list)
  2. Choose keys and sequence: Clearly define unique keys and an ordering column
  3. Set delete semantics: Use apply_as_deletes to mark records for removal
  4. Choose the SCD mode: The default is Type 1; use stored_as_scd_type="2" for Type 2
  5. Monitor metrics: Track num_upserted_rows and num_deleted_rows to gauge pipeline health and performance
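
A short sketch of points 1 and 4, again with hypothetical table and column names:

```python
# 1. Enable Change Data Feed on the source Delta table so that row-level
#    changes become available to downstream CDC consumers.
spark.sql("""
    ALTER TABLE bronze.customer_cdc
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# 4. SCD Type 2 variant of the earlier pipeline: DLT keeps history rows and
#    adds __START_AT / __END_AT columns to track each record's validity window.
import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table("customers_history")
dlt.apply_changes(
    target="customers_history",
    source="customer_cdc_feed",   # hypothetical upstream CDC feed
    keys=["customer_id"],
    sequence_by=col("event_ts"),
    stored_as_scd_type="2",       # default is "1" (overwrite in place)
)
```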

Conclusion

  • Spark MERGE INTO delivers complete control and flexibility but entails significantly more engineering for ordering, deduplication, and performance tuning.
  • DLT APPLY CHANGES offers a lean, declarative, and streaming-optimized alternative for CDC pipelines in Databricks, minimizing boilerplate and improving maintainability.

For most CDC workflows, especially streaming scenarios, APPLY CHANGES is the preferred choice. Save MERGE INTO for specialized cases that require custom logic or ad-hoc transformations.

-David M
Data Engineer