Delta Lake Speed-Up: Z-Order on Single vs. Multiple Columns

Introduction

As organizations ingest massive volumes of data into Delta Lake, query performance becomes critical, especially for dashboards, ad-hoc analysis, and downstream ETL jobs.

One powerful technique to reduce query latency and improve data skipping is Z-Order Optimization.

In this article, let’s cover:

What Z-Ordering is

How to apply it to single vs. multiple columns
A real-world implementation using PySpark in Databricks

What is Z-Order Optimization?

Z-Ordering is a powerful technique in Delta Lake that co-locates related data in storage to reduce file scans and boost query performance. It uses the Z-order curve to cluster similar values together, improving data skipping efficiency during query execution.

This is especially effective when filtering on high-selectivity columns like id, timestamp, or region.

Use Case: Optimizing Delta Tables with Z-Order

We’re applying Z-Ordering on two Delta tables:

account_statuses – Z-Ordered by a single column: id
buyer logs – Z-Ordered by multiple columns: buyer_mill_id, buyer_group_id, buyer_d, id

Implementation in PySpark (Databricks)

Z-Order on a Single Column

Use Case: Tables with high cardinality columns like id, email, or user_id.

2. Z-Order on Multiple Columns

Use Case: Analytical queries filtering on multiple business identifiers or hierarchical keys.

Why Use Multiple Columns?

Z-Ordering on multiple columns is useful when:

Queries often filter on combinations of columns.

The data has a natural hierarchy (e.g., mill → group → buyer → id).

You want to reduce I/O for multi-filter joins or aggregations.

However, more is not always better. The Z-Order algorithm uses all listed columns to compute ordering, so adding unnecessary columns can degrade performance.

Best Practices

Choose high-cardinality columns.

Run OPTIMIZE after bulk inserts or daily loads.

Use 1–3 columns for best balance.

Avoid low-cardinality columns like gender or country_code unless partitioned.

Automate Z-Order as part of ETL or DLT pipelines.

Checking Z-Order History

You can verify that Z-ORDER was applied using:

DESCRIBE HISTORY dev.app_os_db_silver.buyler_logs;

Look for operations where ZORDER BY appears in the userMetadata column

Automate with Multiple Tables

You can loop over many tables with varied Z-Order columns:

Conclusion

Z-Order optimization is a powerful technique for tuning read performance in Delta Lake. Whether optimizing on a single column or multiple, applying Z-ORDER strategically can lead to significant savings in time, cost, and compute resources.

By understanding your query patterns and organizing your data accordingly, Z-Ordering can become a vital part of your performance engineering toolkit, helping you build faster and more efficient data pipelines.

-Lakshmi Devi Gaddam
Data Engineer

#data, #databricks, #deltalake, #pyspark

Delta Lake Speed-Up: Z-Order on Single vs. Multiple Columns