Handling CDC in Databricks: Custom MERGE vs. DLT APPLY CHANGES

Change data capture (CDC) is crucial for keeping data lakes synchronized with source systems. Databricks supports CDC through two main approaches:

  • Custom MERGE operation (Spark SQL or PySpark)
  • Delta Live Tables (DLT) APPLY CHANGES, a declarative CDC API

This blog explores both methods and their trade-offs, and demonstrates best practices for production-grade CDC pipelines in Databricks.

Custom MERGE INTO: Flexibility with Complexity

Traditional CDC via MERGE INTO requires manual orchestration:

  • Deduplicate and order records
    You must identify the most recent change per key using window functions (e.g., ordering by a timestamp) and filter accordingly.
  • Custom merge logic
    You must specify WHEN MATCHED and WHEN NOT MATCHED clauses to implement INSERT, UPDATE, and DELETE behavior.
  • Manage out-of-order events
    Without strict ordering logic, late-arriving records can lead to inconsistent updates.
  • Operational overhead
    You must optimize, partition, handle errors, and tune performance, all manually.
Example:
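Below is a minimal PySpark sketch of this pattern. It assumes a hypothetical CDC feed table bronze.customer_cdc with a customer_id key, an event_ts ordering column, and an op column that flags deletes, merging into a hypothetical target silver.customers:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

# Read the raw CDC feed (table and column names are hypothetical).
updates = spark.read.table("bronze.customer_cdc")

# Deduplicate: keep only the most recent change per key, ordered by event_ts.
w = Window.partitionBy("customer_id").orderBy(F.col("event_ts").desc())
latest = (
    updates
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)

# Apply inserts, updates, and deletes against the target Delta table.
target = DeltaTable.forName(spark, "silver.customers")
(
    target.alias("t")
    .merge(latest.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedDelete(condition="s.op = 'DELETE'")
    .whenMatchedUpdateAll(condition="s.op <> 'DELETE'")
    .whenNotMatchedInsertAll(condition="s.op <> 'DELETE'")
    .execute()
)
```

Every piece of this (the dedup window, delete handling, ordering) is hand-written, which is exactly the overhead APPLY CHANGES removes.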

DLT APPLY CHANGES: Declarative CDC Simplified

Delta Live Tables (DLT) offers declarative CDC via APPLY CHANGES, designed for both streaming and batch CDC workloads.

Key Features:
  • Automatic sequencing and deduplication
    Handles out-of-order events using a defined sequence column; no manual window logic needed.
  • Built-in change semantics
    Supports INSERT, UPDATE, and DELETE with SCD Type 1/2 via a few key parameters; no need for multiple MERGE clauses.
  • Streaming-native
    Designed for Structured Streaming, offering better performance, checkpointing, and transactional consistency.
  • Simpler code & governance
    Less boilerplate, fewer maintenance burdens, and cleaner pipeline definitions.
Example:
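A minimal DLT Python sketch of the same pipeline, under the same hypothetical schema (customer_id key, event_ts sequence column, op delete marker):

```python
import dlt
from pyspark.sql.functions import col, expr

# Target streaming table maintained by APPLY CHANGES.
dlt.create_streaming_table("customers")

dlt.apply_changes(
    target="customers",
    source="customer_cdc_feed",              # hypothetical upstream CDC feed
    keys=["customer_id"],                    # unique key per record
    sequence_by=col("event_ts"),             # resolves out-of-order events
    apply_as_deletes=expr("op = 'DELETE'"),  # rows matching this are deleted
    except_column_list=["op"],               # exclude the CDC op column from the target
    stored_as_scd_type=1,                    # Type 1 (overwrite); use 2 to keep history
)
```

Deduplication, ordering, and delete semantics are all expressed as parameters rather than hand-written logic.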

MERGE INTO vs. APPLY CHANGES: Side-by-Side Comparison

| Feature               | MERGE INTO                          | DLT APPLY CHANGES                           |
| --------------------- | ----------------------------------- | ------------------------------------------- |
| CDC semantics         | Manual implementation required      | Declarative I/U/D with SCD support          |
| Deduplication & order | Window functions + custom logic     | Built-in via the sequence_by column         |
| Out-of-order handling | Manual pre-processing               | Automatically handled                       |
| Code complexity       | High: multiple SQL/logic blocks     | Low: simple declarative API                 |
| Streaming native      | Yes, but with manual checkpoints    | Yes, optimized for streaming                |
| SCD Type support      | Custom scripts required             | Natively supports Type 1 & Type 2           |
| Performance           | Depends on manual tuning            | Streamlined for efficiency                  |
| Flexibility           | Full control (joins, custom logic)  | Best for standard CDC; limited flexibility  |

When to Use Which?

Choose MERGE INTO when:
  • You require multi-table joins or complex business logic
  • You are managing non-standard SCD Type 2/3 transformations
  • You are running ad-hoc batch jobs or historical backfills

Choose APPLY CHANGES when:
  • You are building structured streaming CDC pipelines with Delta Live Tables
  • The CDC feed has a clear primary key and sequence column
  • You want simplified, maintainable, and streaming-native CDC pipelines

Best Practices for DLT APPLY CHANGES
  1. Enable the CDC source: Ensure your input Delta table has Change Data Feed enabled (see the sketch after this list)
  2. Choose keys and sequence: Clearly define unique keys and an ordering column
  3. Set delete semantics: Use apply_as_deletes to mark records for removal
  4. Choose the SCD mode: The default is Type 1; use stored_as_scd_type="2" for Type 2
  5. Monitor metrics: Track num_upserted_rows and num_deleted_rows to gauge pipeline health and performance
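
A short sketch of points 1 and 4, again with hypothetical table and column names:

```python
# 1. Enable Change Data Feed on the source Delta table so that row-level
#    changes become available to downstream CDC consumers.
spark.sql("""
    ALTER TABLE bronze.customer_cdc
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# 4. SCD Type 2 variant of the earlier pipeline: DLT keeps history rows and
#    adds __START_AT / __END_AT columns to track each record's validity window.
import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table("customers_history")
dlt.apply_changes(
    target="customers_history",
    source="customer_cdc_feed",   # hypothetical upstream CDC feed
    keys=["customer_id"],
    sequence_by=col("event_ts"),
    stored_as_scd_type="2",       # default is "1" (overwrite in place)
)
```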

Conclusion

  • Spark MERGE INTO delivers complete control and flexibility but entails significantly more engineering for ordering, deduplication, and performance tuning.
  • DLT APPLY CHANGES offers a lean, declarative, and streaming-optimized alternative for CDC pipelines in Databricks, minimizing boilerplate and improving maintainability.

For most CDC workflows, especially streaming scenarios, APPLY CHANGES is the preferred choice. Save MERGE INTO for specialized cases that require custom logic or ad-hoc transformations.

-David M
Data Engineer