Delta Lake Speed-Up: Z-Order on Single vs. Multiple Columns

Delta Lake Speed-Up: Z-Order on Single vs. Multiple Columns

Delta Lake Speed-Up: Z-Order on Single vs. Multiple Columns

Introduction 

  • As organizations ingest massive volumes of data into Delta Lake, query performance becomes critical, especially for dashboards, ad-hoc analysis, and downstream ETL jobs. 
  • One powerful technique to reduce query latency and improve data skipping is Z-Order Optimization

In this article, let’s cover: 

  • What Z-Ordering is 
  • How to apply it to single vs. multiple columns 
  • A real-world implementation using PySpark in Databricks 

What is Z-Order Optimization? 

Z-Ordering is a powerful technique in Delta Lake that co-locates related data in storage to reduce file scans and boost query performance. It uses the Z-order curve to cluster similar values together, improving data skipping efficiency during query execution.

This is especially effective when filtering on high-selectivity columns like id, timestamp, or region.

Use Case: Optimizing Delta Tables with Z-Order

We’re applying Z-Ordering on two Delta tables:

  1. account_statuses – Z-Ordered by a single column: id
  2. buyer logs – Z-Ordered by multiple columns: buyer_mill_id, buyer_group_id, buyer_d, id

Implementation in PySpark (Databricks) 

  1. Z-Order on a Single Column
Z ORDER1 - delta lake

 Use Case: Tables with high cardinality columns like id, email, or user_id. 

 2. Z-Order on Multiple Columns 
Z ORDER2 - delta lake

Use Case: Analytical queries filtering on multiple business identifiers or hierarchical keys. 

Why Use Multiple Columns? 

Z-Ordering on multiple columns is useful when: 

  • Queries often filter on combinations of columns. 
  • The data has a natural hierarchy (e.g., mill → group → buyer → id). 
  • You want to reduce I/O for multi-filter joins or aggregations

However, more is not always better. The Z-Order algorithm uses all listed columns to compute ordering, so adding unnecessary columns can degrade performance. 

 Best Practices 

  • Choose high-cardinality columns. 
  • Run OPTIMIZE after bulk inserts or daily loads. 
  • Use 1–3 columns for best balance. 
  • Avoid low-cardinality columns like gender or country_code unless partitioned. 
  • Automate Z-Order as part of ETL or DLT pipelines. 

Checking Z-Order History 

You can verify that Z-ORDER was applied using: 

DESCRIBE HISTORY dev.app_os_db_silver.buyler_logs; 

Z ORDER3 - delta lake

Look for operations where ZORDER BY appears in the userMetadata column 

Automate with Multiple Tables 

You can loop over many tables with varied Z-Order columns: 

Z ORDER4 - delta lake

Conclusion 

Z-Order optimization is a powerful technique for tuning read performance in Delta Lake. Whether optimizing on a single column or multiple, applying Z-ORDER strategically can lead to significant savings in time, cost, and compute resources.

By understanding your query patterns and organizing your data accordingly, Z-Ordering can become a vital part of your performance engineering toolkit, helping you build faster and more efficient data pipelines.

-Lakshmi Devi Gaddam
Data Engineer