Polars is all you need: SQL chapter
Intro
Let’s do some (quick) data analysis in Python. No doubt: Pandas!
Wait, I’m a Spark (Scala/Python) developer… the syntax is so different… what are these indexes coming up all the time… is there an alternative tool that adds an arrow to my quiver?
Yeah! Polars [pola-rs] is the brand-new DataFrame library there for you.
It’s written in Rust… ok, so what? Do I have to learn a new language?
No, don’t worry: you just use the great Python API, with all its advantages. Let’s have an overview.
A brief history of Rust
The Rust programming language project started in 2006 (Mozilla Research, Graydon Hoare) and was announced in 2010. Since then, many core aspects of the language have changed.
Version 1.0 was announced on May 15, 2015.
In August 2020, during the COVID-19 pandemic, Mozilla disbanded the team behind Servo, an experimental Rust-based browser engine. Since some members of the team were active Rust contributors, this raised concerns about the future of the language.
In February 2021, the formation of the Rust Foundation was announced by its five founding companies (AWS, Huawei, Google, Microsoft, and Mozilla).
In December 2022, Linux 6.1 officially added support for Rust in the kernel. The future looks bright!
Why is Rust so cool (for data analysis too)?
- fast and memory-efficient: Rust makes scalability and concurrency easy
- memory-safe and thread-safe: enhanced memory management without a garbage collector (automatic memory management)
Polars: here we go!
Polars is a DataFrame interface on top of an OLAP query engine implemented in Rust, using the Apache Arrow columnar format as its memory model, with interfaces in Python, Node.js, R and SQL!
- Fast: Polars is written from the ground up, designed close to the machine and without external dependencies.
- I/O: First-class support for all common data storage layers: local, cloud storage & databases.
- Easy to use: Write your queries the way they were intended. Polars, internally, will determine the most efficient way to execute using its query optimizer.
- Out of Core: Polars supports out-of-core data transformation with its streaming API, allowing you to process your results without requiring all your data to be in memory at the same time (see the sketch after this list).
- Parallel: Polars fully utilises the power of your machine by dividing the workload among the available CPU cores without any additional configuration.
- Vectorized Query Engine: Polars uses Apache Arrow, a columnar data format, to process your queries in a vectorized manner. It uses SIMD to optimize CPU usage.
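As a small taste of the lazy/streaming side, here is a minimal sketch; data.csv and its column names are hypothetical, and the exact flags may vary across Polars versions:

import polars as pl

# Build the query lazily: nothing runs yet, so the optimizer sees the whole plan
lazy = (
    pl.scan_csv("data.csv")            # hypothetical input file
    .filter(pl.col("value") > 0)       # hypothetical column names
    .groupby("category")
    .agg(pl.col("value").sum())
)

# Execute in streaming mode: batches are processed incrementally,
# so the whole file never needs to fit in memory at once
result = lazy.collect(streaming=True)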
Now… why is Polars so cool (for data analysis too)?
Polars is written in Rust, but don’t worry you don’t have to learn a new language. It exposes a complete Python API, including the full set of features to manipulate DataFrames.
If you are a Spark developer, on either the Scala or the Python API, the violins will start playing.
Installation
It has never been simpler than that:
$ pip install polars
Optionally, you may want to install additional dependencies.
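For example, to pull in the interoperability extras (extra names as documented in the Polars installation guide; check the docs for the current list):

$ pip install 'polars[numpy,pandas,pyarrow]'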
Getting Started with DataFrames
Let’s create a DataFrame.
The syntax recalls Pandas, so you will not find much difference.
You have utility methods to output information, such as print and head.
import polars as pl
from datetime import datetime
df = pl.DataFrame(
    {
        "index": [1, 2, 3, 4, 5],
        "date": [
            datetime(2023, 1, 1),
            datetime(2023, 1, 2),
            datetime(2023, 1, 3),
            datetime(2023, 1, 4),
            datetime(2023, 1, 5),
        ],
        "value": [4.0, 5.0, 6.0, 7.0, 8.0],
    }
)
print(df)
---------
shape: (5, 3)
┌───────┬─────────────────────┬───────┐
│ index ┆ date ┆ value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 │
╞═══════╪═════════════════════╪═══════╡
│ 1 ┆ 2023-01-01 00:00:00 ┆ 4.0 │
│ 2 ┆ 2023-01-02 00:00:00 ┆ 5.0 │
│ 3 ┆ 2023-01-03 00:00:00 ┆ 6.0 │
│ 4 ┆ 2023-01-04 00:00:00 ┆ 7.0 │
│ 5 ┆ 2023-01-05 00:00:00 ┆ 8.0 │
└───────┴─────────────────────┴───────┘
df.head()
---------
# similar output to the above (head returns the first 5 rows by default)
You can visualize your DataFrame together with its column types and shape!
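If you want those pieces individually, the shape and schema are exposed as plain attributes (a quick look at the df above):

print(df.shape)   # (5, 3)
print(df.schema)  # e.g. {'index': Int64, 'date': Datetime, 'value': Float64}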
Some quick data analysis
Let’s do some descriptive analysis in one command:
print(df.describe())
--------------------
shape: (7, 4)
┌────────────┬──────────┬────────────────────────────┬──────────┐
│ describe ┆ index ┆ date ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ str ┆ f64 │
╞════════════╪══════════╪════════════════════════════╪══════════╡
│ count ┆ 5.0 ┆ 5 ┆ 5.0 │
│ null_count ┆ 0.0 ┆ 0 ┆ 0.0 │
│ mean ┆ 3.0 ┆ null ┆ 6.0 │
│ std ┆ 1.581139 ┆ null ┆ 1.581139 │
│ min ┆ 1.0 ┆ 2023-01-01 00:00:00.000000 ┆ 4.0 │
│ max ┆ 5.0 ┆ 2023-01-05 00:00:00.000000 ┆ 8.0 │
│ median ┆ 3.0 ┆ null ┆ 6.0 │
└────────────┴──────────┴────────────────────────────┴──────────┘
That gives you summary statistics for the DataFrame. Cool!
To index or not to index? That is the question… or not.
If you are new to Pandas, you will find this “get an index” thing kind of bizarre. I asked myself many times why you should get an index column… I still don’t know.
The more I worked with Pandas, the more I found myself using reset_index() or set_index() at every operational step in the pipeline.
IMO, this index-labelling feature in Pandas is not very useful.
If you have the time to get through the Polars doc you can find the following explanation: “In Polars a DataFrame will always be a 2D table with heterogeneous data-types. The data-types may have nesting, but the table itself will not. Operations like resampling will be done by specialized functions or methods that act like ‘verbs’ on a table explicitly stating columns that ‘verb’ operates on. As such, it is our conviction that not having indices make things simpler, more explicit, more readable and less error-prone.”
Very convincing to me.
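And if you genuinely need a row number, you can materialize one as a regular column (a minimal sketch; with_row_count is the method name at the time of writing, so check your Polars version):

# Adds an explicit row-number column; it is just an ordinary u32 column
df_with_nr = df.with_row_count(name="row_nr")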
Apache Arrow (Polars) vs NumPy (Pandas)
Apache Arrow is an emerging standard for in-memory columnar analytics that can accelerate data load times, reduce memory usage and speed up calculations.
Pandas was initially built on NumPy; however, NumPy was never designed as a backend for DataFrame libraries and has some limitations.
Pandas 2.x has adopted Arrow as an optional backend; Polars was born with it, representing data in memory with Arrow arrays instead of NumPy arrays. To keep porting easy, Polars offers a to_numpy method.
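A minimal sketch of hopping between the two worlds (conversion of numeric, null-free columns can often avoid copies):

arr = df.to_numpy()           # whole frame as a 2D NumPy array (mixed dtypes become object)
col = df["value"].to_numpy()  # single column as a 1D float64 array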
Selection operation
This is where your (Spark) SQL senses will wake up!
There is no index, so no more
df['value']
# and/or
df.loc[:, 'value']
as in Pandas; you need to use a real (SQL-like) selection expression instead:
df.select(['value'])
# or
df.select(pl.col('value'))
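Expressions compose, so you can select and transform in one pass (a small sketch against the df above; value_x2 is just an illustrative alias):

df.select([
    pl.col('date'),
    (pl.col('value') * 2).alias('value_x2'),  # arithmetic on a column, renamed on the fly
])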
Filtering operation
If you want to select rows based on their values, in Polars you use the .filter method:
df.filter(pl.col('value') < 10)
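Predicates combine with the usual boolean operators (a small sketch; note the parentheses around each condition, and that datetime was already imported at the top):

df.filter(
    (pl.col('value') >= 5.0) & (pl.col('date') >= datetime(2023, 1, 3))
)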
Aggregation operation
Another useful expression, very close to Spark syntax, is aggregation, which we can chain with a cast from datetime to date:
df_count = (
    df
    .with_columns(
        pl.col('date').cast(pl.Date).alias('date')
    )
    .groupby(['date', 'index'])
    .agg(pl.count().alias("items"))
    .sort(by="date", descending=False)
)
print(df_count)
---------------
shape: (5, 3)
┌────────────┬───────┬───────┐
│ date ┆ index ┆ items │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ u32 │
╞════════════╪═══════╪═══════╡
│ 2023-01-01 ┆ 1 ┆ 1 │
│ 2023-01-02 ┆ 2 ┆ 1 │
│ 2023-01-03 ┆ 3 ┆ 1 │
│ 2023-01-04 ┆ 4 ┆ 1 │
│ 2023-01-05 ┆ 5 ┆ 1 │
└────────────┴───────┴───────┘
SQL query integration
Similarly to Spark, Polars uses the SQLContext to manage SQL queries.
While it’s not the recommended approach, this allows plain-SQL portability in your applications, which is kind of useful. SQL is then translated into expressions to be evaluated by the execution engine… but this is material for another article! ;)
import polars as pl
# prep for sql execution
sql = pl.SQLContext()
sql.register('df', df)
result_df = sql.execute(
    """
    SELECT *
    FROM df
    """
).collect()
print(result_df)
----------------
shape: (5, 3)
┌───────┬─────────────────────┬───────┐
│ index ┆ date ┆ value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ f64 │
╞═══════╪═════════════════════╪═══════╡
│ 1 ┆ 2023-01-01 00:00:00 ┆ 4.0 │
│ 2 ┆ 2023-01-02 00:00:00 ┆ 5.0 │
│ 3 ┆ 2023-01-03 00:00:00 ┆ 6.0 │
│ 4 ┆ 2023-01-04 00:00:00 ┆ 7.0 │
│ 5 ┆ 2023-01-05 00:00:00 ┆ 8.0 │
└───────┴─────────────────────┴───────┘
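Anything the SQL layer supports goes through the same machinery. For instance, a plain WHERE filter (a quick sketch reusing the SQLContext registered above):

filtered_df = sql.execute(
    """
    SELECT *
    FROM df
    WHERE value > 5.0
    """
).collect()

print(filtered_df)  # keeps only the rows with value 6.0, 7.0 and 8.0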
Conclusion
I found in Polars a powerful and promising data-processing tool.
The syntax is easy, elegant, flexible, and winks at Spark.
If you, like me, come from a background of scaling applications on Spark and have struggled with Pandas’ philosophy, Polars is all you need to quickly implement and deploy safer, more performant applications.