Polars vs DuckDB for Delta Lake ops
Introduction
Delta Lake is a unified lakehouse storage format that combines the reliability of a data warehouse with the flexibility of a data lake. It is a popular choice for storing large volumes of data that need to be analyzed frequently, and lately it has introduced a lot of very interesting features.
Python is a popular programming language for data engineering and machine learning. It is also a good choice for developing REST APIs.
I found myself building a Python microservice that needed to consume (read) a Delta Lake table driven by SQL inputs. It was the perfect chance to test the Delta Lake integration APIs of two of my favorite libraries:
- Pola-rs (https://www.pola.rs/)
- DuckDB (https://duckdb.org/)
Requirements and installation
I used Conda:
# create env and activate it
$ conda create -n delta-ops python=3.9
# install deps
$ conda install -c conda-forge polars deltalake notebook python-duckdb
# launch notebook
$ jupyter notebook
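To make sure the environment is wired up correctly, a quick version check (just a sanity-check snippet, assuming the packages installed above) can be run in the notebook:
import polars as pl
import duckdb
import deltalake
# print the versions pulled in from conda-forge
print("polars:", pl.__version__)
print("duckdb:", duckdb.__version__)
print("deltalake:", deltalake.__version__)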
Dataset
You can create a test Delta table using Pola-rs as follows:
import polars as pl
import datetime
# Create a DataFrame with the desired columns
df = pl.DataFrame({
    "id": [1, 5, 8],
    "timestamp": [datetime.datetime.now(), datetime.datetime.now(), datetime.datetime.now()],
    "xyz_value": [1.2, 3.4, 5.6]
})
# Write the DataFrame to a Delta Lake table
df.write_delta("delta-table-test")
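As a quick sanity check (an optional snippet, not part of the benchmark), you can read the table straight back and inspect it with the deltalake package:
import polars as pl
from deltalake import DeltaTable
# read the freshly written table back and check its Delta version
print(pl.read_delta("delta-table-test"))
print(DeltaTable("delta-table-test").version())  # 0 for a brand new table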
Creating the processors
Let’s start by defining some useful functions to process the Delta Lake data.
Let’s use a decorator to benchmark the functions:
from functools import wraps
import time
def timeit(func):
    @wraps(func)
    def timeit_wrapper(*args, **kwargs):
        start_time = time.perf_counter()
        result = func(*args, **kwargs)
        end_time = time.perf_counter()
        total_time = end_time - start_time
        # print(f'Function {func.__name__}{args} {kwargs} Took {total_time:.4f} secs')
        return result, total_time
    return timeit_wrapper
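Note that, instead of printing, the decorator returns a (result, elapsed_seconds) tuple, which is what we will average later. A tiny illustrative usage (slow_add is just a dummy function):
@timeit
def slow_add(a, b):
    # pretend to do some work
    time.sleep(0.1)
    return a + b

value, elapsed = slow_add(1, 2)  # value == 3, elapsed is roughly 0.1 seconds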
Process Delta Lake data using the DuckDB API:
from deltalake import DeltaTable
import duckdb
@timeit
def proc_duck_test(query, path) -> duckdb.DuckDBPyRelation:
    dt = DeltaTable(path)
    pyarrow_dataset = dt.to_pyarrow_dataset()
    # DuckDB finds the "sample" table referenced in the SQL via a replacement scan on this local variable
    sample = duckdb.arrow(pyarrow_dataset)
    res = duckdb.query(query)
    return res
NB: DuckDB quacks Arrow! :D
DuckDB uses PyArrow as an intermediate representation of the dataset, which can then be queried with SQL. For more info see: https://arrow.apache.org/blog/2021/12/03/arrow-duckdb/
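To make the Arrow hand-off a bit more concrete, here is a minimal sketch (assuming the relational API described in that post) that filters and projects the PyArrow dataset before materializing the result back to Arrow:
from deltalake import DeltaTable
import duckdb
# query the PyArrow dataset directly; the filter is pushed down into the Arrow scan
ds = DeltaTable("delta-table-test").to_pyarrow_dataset()
rel = duckdb.arrow(ds)
print(rel.filter("id = 8").project("id, xyz_value").arrow())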
Process Delta Lake data using the Pola-rs API:
import polars as pl
@timeit
def proc_polars_test(query, path) -> pl.DataFrame:
    dt = pl.read_delta(path)
    res = pl.SQLContext(sample=dt).execute(
        query,
        eager=True
    )
    return res
NB: SQLContext is used to run SQL queries against the DataFrame, which is registered here under the name sample.
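For reference, SQLContext also works on LazyFrames, so a lazy variant could look like the sketch below (proc_polars_lazy is just an illustrative name, not part of the benchmark):
@timeit
def proc_polars_lazy(query, path) -> pl.DataFrame:
    # scan_delta returns a LazyFrame, so nothing is read until collect()
    lf = pl.scan_delta(path)
    return pl.SQLContext(sample=lf).execute(query, eager=False).collect()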
Tessssssts time!
query = """SELECT id, timestamp FROM sample WHERE id = 8"""
path = "PATH-TO/delta-table-test"
res = [proc_polars_test(query, path) for i in range(10000)]
res = [proc_duck_test(query, path) for i in range(10000)]
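Each call returns a (dataframe, seconds) tuple, so the averages reported below were obtained with something along these lines:
polars_times = [t for _, t in polars_runs]
duckdb_times = [t for _, t in duckdb_runs]
print(f"[POLARS] Average time: {sum(polars_times) / len(polars_times):.4f} seconds")
print(f"[DUCKDB] Average time: {sum(duckdb_times) / len(duckdb_times):.4f} seconds")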
Results
This is what I got:
[POLARS] Average time: 0.0045 seconds
[DUCKDB] Average time: 0.0030 seconds
Conclusions
DuckDB proved to be a little faster than Pola-rs at reading the table locally. That said, both libraries show excellent performance overall.
In terms of usability, both accept SQL queries and respond with dataframes. Here I found the Pola-rs interface more usable, since you can bind the table name directly in the call instead of going through an explicit PyArrow intermediate step.
Of course, to fully evaluate scalability and limits, both libraries need to be tested at a much greater scale and with their full feature set (lazy evaluation in particular), but this is already a very good start. I hope you enjoyed it! ;)