Reliable and serverless data ingestion using Delta Lake on AWS Glue
The Big Data scenario
Imagine a new requirement suddenly comes out: you need to re-process your last 2 years of data to produce some specific reports for the BI team. Wait, your production database takes 15 hours to compute the reports and the clients need the results tomorrow? And actually every week?
One possible solution is to create a data warehouse on S3, ingesting data from your database and scaling the data processing with one of the most powerful parallel and distributed engines out there, such as Apache Spark.
Delta Lake
In April 2019 Databricks open-sourced a very interesting project called Delta Lake. As per the creators’ definition, “Delta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines”.
You can follow the open source project on GitHub: https://github.com/delta-io/delta .
Core concepts
Data consistency: one of Delta Lake’s core concepts is the transaction log, an ordered record of every transaction that has ever been performed on a Delta Lake table since its creation.
Technically, Delta Lake extends the Apache Spark SQL module, adding a single source of truth that tracks all the changes made to a Delta table, so that multiple readers and writers of the same table work with a consistent and correct view of the data.
For every new access to a Delta table, or new query on an open table that has been modified, Spark checks the transaction log for changes and ensures data consistency for the users. Only transactions that fully execute are recorded, which is the basis of the atomicity guarantee. In this way, batch and streaming data ingestion can be unified consistently.
Upserts and deletes: by bringing ACID transaction guarantees between reads and writes, Delta Lake supports merge, delete and update operations to solve complex scenarios.
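For instance, an upsert from a staging DataFrame into a Delta table can be expressed with the merge API. The following is a minimal sketch, assuming a SparkSession named spark is available (e.g. in a Spark shell); the paths and the id column are hypothetical:

import io.delta.tables.DeltaTable

// hypothetical staging data to upsert into the Delta table
val updates = spark.read.parquet("s3://my-bucket/staging/events/")

DeltaTable.forPath(spark, "s3://my-bucket/delta/events/").as("t")
  .merge(updates.as("u"), "t.id = u.id")   // match on the (hypothetical) key column
  .whenMatched.updateAll()                 // update rows that already exist
  .whenNotMatched.insertAll()              // insert the new ones
  .execute()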
Schema enforcement and evolution: somebody added 3 columns in 2 different tables without notifying you, and you didn’t have time to prepare the ingestion workflow to process them properly? No worries: even if such a thing should never happen, Delta Lake provides useful options such as mergeSchema and overwriteSchema to handle schema variations and prevent errors during ingestion. These features can literally save the day.
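As a sketch, assuming a DataFrame newData that carries the extra columns (the table path is hypothetical), schema evolution can be enabled per write with these options:

// append and let Delta add the new columns to the table schema
newData.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("s3://my-bucket/delta/events/")

// or replace the table contents and its schema entirely
newData.write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .save("s3://my-bucket/delta/events/")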
Going serverless using AWS Glue
Behind the scenes AWS Glue, the fully managed ETL (extract, transform, and load) service, uses a Spark cluster on YARN, but it can be seen as an auto-scaling “serverless Spark” solution.
The main differences from its self-managed counterpart, Amazon EMR, are of course the serverless model, which can be configured to auto-scale with higher flexibility but at a higher price, and the possibility to take advantage of the Glue Data Catalog to store metadata about data sources, transforms, and targets.
At the same time, it is possible to use the plain Spark session object inside a Glue job, ignoring the AWS Glue SDK and the Glue Data Catalog and replacing the auto-generated script with regular, customized Spark code. You can find more information in the developer guide: https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html .
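A rough sketch of such a job is shown below, loosely based on the auto-generated Glue Scala template; class and method names such as getSparkSession come from the Glue 1.0 Scala library and should be double-checked against your version:

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.SparkContext
import scala.collection.JavaConverters._

object DeltaIngestionJob {
  def main(sysArgs: Array[String]): Unit = {
    val glueContext = new GlueContext(new SparkContext())
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // plain Spark session: from here on it is regular, customized Spark code
    val spark = glueContext.getSparkSession

    // ... transformations and Delta writes go here ...

    Job.commit()
  }
}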
Regarding performance, AWS Glue has just entered a new phase to fill the gap with the faster EMR. For more info see https://pages.awscloud.com/glue-reduced-spark-times-preview-2020.html .
Some AWS Scala SDK hints
The AWS Glue Scala libraries are not published on Maven, so you need to “complete some prerequisite steps and then issue a Maven command to run your Scala ETL script locally”, as mentioned in the AWS doc (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html).
Alternatively, you can create an uber jar with all dependencies (not the best solution) or download the necessary jars from the artifact repository, https://aws-glue-etl-artifacts.s3.amazonaws.com/ .
In particular, if you want to locally use Glue 1.0, which is built on top of Spark 2.4.3 and Hadoop 2.8.x, you have to handle a couple of jars:
- AWSGlueETL-1.0.0.jar
- AWSGlueReaders-1.0.0.jar
and add them to your classpath, for instance using sbt’s unmanaged dependencies:
unmanagedBase := baseDirectory.value / "aws-glue-lib"
Creating the uber jar
Once development is finished, you can assemble a jar to load your dependencies into Glue.
There are many ways to do it. If you are using sbt, you can create it by adding the Delta Lake dependency to your build.sbt
// https://mvnrepository.com/artifact/io.delta/delta-core
libraryDependencies += "io.delta" %% "delta-core" % "0.6.1"
and then run
$ sbt assembly
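For sbt assembly to work you also need the sbt-assembly plugin and, typically, a merge strategy plus “provided” scoping for the jars that Glue already ships. The snippet below is only a sketch with example versions, not the one true setup:

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt: Spark is already on Glue's classpath, so keep it out of the uber jar
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3" % "provided"

// build.sbt: resolve duplicate META-INF entries while assembling
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}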
Loading the Delta Lake dependency
AWS Glue provides many libraries to execute ETL jobs, but sometimes, as presented in this article, we may want to add other functionality to our applications, such as the Delta Lake features introduced above.
In such a case, AWS Glue provides a nice way to create powerful and serverless data ingestion applications.
In fact, dependencies can be packaged in a custom jar and pushed to an S3 location, from which the Glue job configuration loads it by passing the jar path at execution time.
By assembling a customized jar which includes the Delta Lake dependency, it is possible to run AWS Glue jobs that take advantage of the library’s features.
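Once the jar is attached to the job (for instance through the --extra-jars / “Dependent jars path” parameter), the job body can write Delta data with plain Spark code. A minimal sketch, using the Spark session obtained from the GlueContext as shown earlier and hypothetical source and target locations:

// read raw data and persist it as a partitioned Delta table on S3
val rawEvents = spark.read.json("s3://my-bucket/raw/events/")

rawEvents.write
  .format("delta")
  .mode("append")
  .partitionBy("ingest_date")            // hypothetical partition column present in the data
  .save("s3://my-bucket/delta/events/")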
Consuming the ingested Delta Lake data
Once the data has been ingested on S3 in the Delta format, it can be consumed by other Spark applications packaged with the Delta Lake library, or it can be registered and queried using serverless SQL services such as Amazon Athena (after performing a certain number of manual operations).
- From EC2 using a Spark shell
bin/spark-shell --packages io.delta:delta-core_2.11:0.6.1
...
// load the Delta table (the import is required)
import io.delta.tables.DeltaTable
val deltaTable = DeltaTable.forPath(pathToDeltaTable)
- From Athena
In this direction, starting from Delta Lake version 0.5 the library introduced symlink_format_manifest generation, adding the capability of being consumed by the Amazon Athena service. For more info see https://docs.delta.io/0.5.0/presto-integration.html .
The disadvantage is that, by skipping the AWS Glue Data Catalog, the table has to be registered in the Athena metastore manually, using external table creation scripts and partition repair commands for every new data ingestion.
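In practice, the manifest can be (re)generated from the same Spark or Glue application after each ingestion, as in this sketch with a hypothetical table path:

import io.delta.tables.DeltaTable

// produce the _symlink_format_manifest files that Athena/Presto read
DeltaTable.forPath(spark, "s3://my-bucket/delta/events/")
  .generate("symlink_format_manifest")

The Athena external table then points at the generated _symlink_format_manifest/ directory, and for partitioned tables the partitions have to be repaired after new data arrives, as described in the linked documentation.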
Conclusion
This article presented a relatively simple procedure to add Delta Lake features to the AWS Glue framework in order to create a reliable and serverless data ingestion layer for your data warehouse or data lake.
This solution is flexible and allows you to create a custom jar which will bring very nice features to your serverless ingestion process and will make your Ops team very happy! :)