Polars vs DuckDB for Delta Lake ops

Stefano Lori
3 min read · Sep 13, 2023


Introduction

Delta Lake is a unified lakehouse storage format that combines the reliability of a data warehouse with the flexibility of a data lake. It is a popular choice for storing large amounts of data that need to be analyzed frequently, and it has recently gained a lot of very interesting features.

Python is a popular programming language for data engineering and machine learning. It is also a good choice for developing REST APIs.

I found myself building a Python microservice that needed to consume (read) a Delta Lake table using SQL inputs. It was the perfect chance to test the Delta Lake integration APIs of two of my best friends:
- Pola-rs (https://www.pola.rs/)
- DuckDB (https://duckdb.org/)

Requirements and installation

I used Conda:

# create env and activate it
$ conda create -n delta-ops python=3.9

# install deps
$ conda install -c conda-forge polars deltalake notebook python-duckdb

# launch notebook
$ jupyter notebook

Dataset

You can create a test Delta table using Pola-rs as follows:

import polars as pl
import datetime

# Create a DataFrame with the desired columns
df = pl.DataFrame({
    "id": [1, 5, 8],
    "timestamp": [datetime.datetime.now(), datetime.datetime.now(), datetime.datetime.now()],
    "xyz_value": [1.2, 3.4, 5.6]
})

# Write the DataFrame to a Delta Lake table
df.write_delta("delta-table-test")
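
If you want to double-check the write, you can simply read the table straight back; a minimal check using the path from above:

import polars as pl

# Read the Delta table back and print it
print(pl.read_delta("delta-table-test"))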

Creating the processors

Let’s start by defining some useful functions to process the Delta Lake data.

Let’s use a decorator to benchmark the functions:

from functools import wraps
import time

def timeit(func):
    @wraps(func)
    def timeit_wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        total_time = end_time - start_time
        # print(f'Function {func.__name__}{args} {kwargs} Took {total_time:.4f} secs')
        return result, total_time
    return timeit_wrapper
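
Every decorated call now returns a (result, elapsed_seconds) tuple, which is what the averages at the end are computed from. A quick illustration with a made-up function (slow_add is just an example name):

import time

@timeit
def slow_add(a, b):
    time.sleep(0.01)  # simulate some work
    return a + b

value, elapsed = slow_add(1, 2)
print(value, f"{elapsed:.4f} secs")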

Process Delta Lake data using DuckDB API:

from deltalake import DeltaTable
import duckdb
import pyarrow

@timeit
def proc_duck_test(query, path) -> duckdb.DuckDBPyRelation:
    # Open the Delta table and expose it as a PyArrow dataset
    dt = DeltaTable(path)
    pyarrow_dataset = dt.to_pyarrow_dataset()

    # Bind the dataset to the local name "sample" so the SQL query can reference it
    # (DuckDB's replacement scan looks up Python variables by name)
    sample = duckdb.arrow(pyarrow_dataset)

    # Run the SQL query against the "sample" view
    res = duckdb.query(query)

    return res

NB: DuckDB quacks Arrow! :D
DuckDB uses PyArrow to store an intermediate dataset that can then be queried with SQL. For more info see: https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/
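
Note that the relation returned by duckdb.query is evaluated lazily: the rows are only materialized when you fetch them. If you need the actual result, you can pull it into an Arrow table or a pandas DataFrame; a minimal sketch using the test table created earlier:

# proc_duck_test returns a (relation, elapsed_seconds) tuple because of @timeit
rel, _ = proc_duck_test("SELECT id FROM sample WHERE id = 8", "delta-table-test")

arrow_table = rel.arrow()  # materialize as a pyarrow.Table
pandas_df = rel.df()       # or as a pandas DataFrame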

Process Delta Lake data using Pola-rs API:

import polars as pl

@timeit
def proc_polars_test(query, path) -> pl.DataFrame:
    # Read the Delta table eagerly into a DataFrame
    dt = pl.read_delta(path)

    # Register the DataFrame under the name "sample" and run the SQL query
    res = pl.SQLContext(sample=dt).execute(
        query,
        eager=True
    )

    return res

NB: SQLContext is used to run SQL queries against the registered dataset.
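
If you want to stay lazy end-to-end, Pola-rs can also scan the Delta table without loading it up front and defer execution until collect() is called. A minimal sketch of a lazy variant (proc_polars_lazy_test is just an illustrative name and is not part of the benchmark below):

@timeit
def proc_polars_lazy_test(query, path) -> pl.DataFrame:
    # Scan the Delta table lazily (no data is read yet)
    lf = pl.scan_delta(path)

    # execute() returns a LazyFrame by default; collect() materializes the result
    return pl.SQLContext(sample=lf).execute(query).collect()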

Tessssssts time!

query = """SELECT id, timestamp FROM sample WHERE id = 8"""
path = "PATH-TO/delta-table-test"

res_polars = [proc_polars_test(query, path) for i in range(10000)]
res_duck = [proc_duck_test(query, path) for i in range(10000)]
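
Each call returns a (result, elapsed_seconds) tuple, so the averages reported below can be computed roughly like this (a minimal sketch):

# Average the elapsed time (second element of each (result, time) tuple)
avg_polars = sum(t for _, t in res_polars) / len(res_polars)
avg_duck = sum(t for _, t in res_duck) / len(res_duck)

print(f"[POLARS] Average time: {avg_polars:.4f} seconds")
print(f"[DUCKDB] Average time: {avg_duck:.4f} seconds")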

Results

This is what I got:

[POLARS] Average time: 0.0045 seconds

[DUCKDB] Average time: 0.0030 seconds

Conclusions

DuckDB proved to be slightly faster than Pola-rs at reading the table locally.
That said, both libraries show excellent performance overall.

In terms of usability, both accept SQL queries and return dataframes. In this case I must say I found the Pola-rs interface more usable, since the view name is bound directly in the SQLContext call instead of going through the PyArrow intermediary step.

Of course, to fully evaluate scalability and limits, the libraries need to be tested at much greater scale and with features such as laziness, but this is already a very good start. I hope you enjoyed it! ;)
