Reliable and serverless data ingestion using Delta Lake on AWS Glue

Stefano Lori
Jul 21, 2020 · 5 min read


The Big Data scenario

Imagine a new requirement suddenly comes in: you need to re-process your last 2 years of data to produce some specific reports for the BI team. Wait, your production database takes 15 hours to compute the reports and the clients need the results tomorrow? And, actually, every week?

One possible solution is to build a data warehouse on S3, ingesting data from your database, and to scale the data processing with one of the most powerful parallel and distributed engines available, Apache Spark.

Delta Lake

In April 2019 Databricks open sourced a very interesting project called Delta Lake. As per its creators’ definition, “Delta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines”.

The Delta Lake overview

You can follow the open source project on GitHub: https://github.com/delta-io/delta .

Core concepts

Data consistency: one of Delta Lake’s core concepts is the transaction log, an ordered record of every transaction that has ever been performed on a Delta Lake table since its creation.
Technically, Delta Lake extends the Apache Spark SQL module with a single source of truth that tracks all the changes made to a Delta table, so that multiple readers and writers of the given table work with a consistent and correct view of the data.
On every new access to a Delta table, or new query on a modified open table, Spark uses the transaction log protocol to check for changes and guarantee data consistency for the users. Only transactions that execute fully are recorded in the log, which is the basis of the atomicity guarantee. In this way, batch and streaming data ingestion can be unified consistently.
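
As a quick illustration, and assuming an already ingested Delta table (the S3 path below is purely hypothetical), the transaction log can be inspected directly from the DeltaTable API:

import io.delta.tables.DeltaTable

// hypothetical example of an already ingested Delta table on S3
val deltaTable = DeltaTable.forPath("s3://my-bucket/delta/events")

// each row is one committed transaction recorded in the _delta_log directory:
// version, timestamp, operation (WRITE, MERGE, DELETE, ...) and its metrics
deltaTable.history().show(truncate = false)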

Upserts and deletes: by bringing ACID transaction guarantees between reads and writes, Delta Lake now supports merge, delete and update operations to solve complex scenarios.
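
A minimal sketch of these operations, assuming an active SparkSession named spark and purely hypothetical paths and column names, could look like this:

import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions._

// hypothetical target table and daily updates
val target = DeltaTable.forPath("s3://my-bucket/delta/customers")
val updates = spark.read.parquet("s3://my-bucket/staging/customers_daily")

// upsert: update the rows that match, insert the new ones
target.as("t")
  .merge(updates.as("u"), "t.customer_id = u.customer_id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()

// targeted update and delete, both backed by ACID transactions
target.update(col("country") === "UK", Map("country" -> lit("GB")))
target.delete(col("is_test_account") === true)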

Schema enforcement and evolution: somebody added 3 columns in 2 different tables without notifying you, and you didn’t have time to adapt the ingestion workflow to process them properly? No worries: even if such a thing should never happen, Delta Lake provides useful options such as mergeSchema and overwriteSchema to handle schema variations and prevent errors during ingestion. These features can literally save the day.
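
For instance, when appending data whose schema gained new columns, or rewriting a table whose schema changed more drastically, the options can be passed directly to the writer (the DataFrames and paths below are hypothetical):

// append data whose schema gained extra columns: mergeSchema lets the table
// schema evolve instead of failing the write
newData.write
  .format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .save("s3://my-bucket/delta/events")

// replace the table and its schema entirely (e.g. a column type has changed)
reprocessedData.write
  .format("delta")
  .option("overwriteSchema", "true")
  .mode("overwrite")
  .save("s3://my-bucket/delta/events")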

Going serverless using AWS Glue

Behind the scenes AWS Glue, the fully managed ETL (extract, transform, and load) service, runs a Spark YARN cluster, but it can be seen as an auto-scaling “serverless Spark” solution.
The main differences with its self-managed counterpart, Amazon EMR, are of course the serverless model itself, which can be configured to auto-scale with higher flexibility but at a higher price, and the possibility to take advantage of the Glue Data Catalog to store metadata about data sources, transforms, and targets.
At the same time, it is possible to use the Spark session object inside the Glue session, ignoring the AWS Glue SDK and the Glue Catalog, and replacing the auto-generated script with regular, customized Spark code. You can find more information in the developer guide: https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html .
Regarding performance, AWS Glue has just entered a new phase to close the gap with the faster EMR. For more info: https://pages.awscloud.com/glue-reduced-spark-times-preview-2020.html .
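
As a rough sketch (the object name, paths and job layout are illustrative, not taken from a real job), the Scala entry point of a Glue job can grab the plain SparkSession from the GlueContext and then run regular Spark code:

import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    // Glue wires the SparkContext for the job; from the GlueContext we can
    // obtain a regular SparkSession and ignore the DynamicFrame API entirely
    val glueContext = new GlueContext(new SparkContext())
    val spark: SparkSession = glueContext.getSparkSession

    // from here on, plain customized Spark code (hypothetical paths)
    val df = spark.read.json("s3://my-bucket/raw/orders/")
    df.write.format("delta").mode("append").save("s3://my-bucket/delta/orders/")
  }
}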

Some AWS Scala SDK hints

The AWS Glue Scala SDK is not published on Maven, so you need to “complete some prerequisite steps and then issue a Maven command to run your Scala ETL script locally”, as mentioned in the AWS documentation (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html).

Alternatively, you can create an uber jar with all dependencies (not the best solution) or download the necessary jars from the artifact repository: https://aws-glue-etl-artifacts.s3.amazonaws.com/ .

In particular, if you want to use Glue 1.0 locally, which is built on top of Spark 2.4.3 and Hadoop 2.8.x, you have to handle a couple of jars:

  • AWSGlueETL-1.0.0.jar
  • AWSGlueReaders-1.0.0.jar

and add them to your classpath using an sbt integration, for instance

unmanagedBase := baseDirectory.value / "aws-glue-lib"

Creating the uber jar

After your development is finished, you can assemble a jar to load your dependencies into Glue.

There are many ways to do it. If you are using sbt, you can create it by adding the Delta Lake dependency to the build.sbt

// https://mvnrepository.com/artifact/io.delta/delta-core
libraryDependencies += "io.delta" %% "delta-core" % "0.6.1"

and then run

$ sbt assembly
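
A minimal sketch of the surrounding sbt setup, assuming the sbt-assembly plugin (the file each snippet belongs to is indicated in the comments; versions are only an example):

// project/plugins.sbt — the assembly command comes from the sbt-assembly plugin
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

// build.sbt — Spark is already provided on the Glue cluster, so only ship
// what Glue does not have (here, delta-core)
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.3" % "provided"
libraryDependencies += "io.delta" %% "delta-core" % "0.6.1"

// a common merge strategy for duplicate META-INF entries during assembly
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}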

Loading Delta Lake dependency

AWS Glue provides many libraries to execute ETL jobs, but sometimes, as presented in this article, we may want to add other functionalities to our applications, such as the Delta Lake features introduced above.
For such cases, AWS Glue offers a nice way to create powerful, serverless data ingestion applications.
In fact, dependencies can be packaged in a custom jar and pushed to an S3 location; Glue’s configuration can then load it by passing the path at execution time.

By assembling your customized jar, which includes the Delta Lake dependency, it is possible to run AWS Glue jobs that take advantage of the library’s features.
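
As a sketch of what the ingestion step itself might look like inside the job (continuing the SparkSession obtained in the sketch above; the JDBC source, credentials and S3 target are purely hypothetical):

import org.apache.spark.sql.functions.current_date

// pull the source data from the production database over JDBC
val source = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://prod-db:5432/sales")  // hypothetical source
  .option("dbtable", "public.orders")
  .option("user", "glue_reader")                          // placeholder credentials
  .option("password", sys.env("DB_PASSWORD"))
  .load()

// land it on S3 in Delta format, partitioned by ingestion date
source
  .withColumn("ingestion_date", current_date())
  .write
  .format("delta")
  .mode("append")
  .partitionBy("ingestion_date")
  .save("s3://my-bucket/delta/orders/")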

Add your uber jar dependencies into AWS Glue configuration panel

Consuming the ingested Delta Lake data

Once the data has been ingested on S3 in the Delta format, it can be consumed by other Spark applications packaged with the Delta Lake library, or it can be registered and queried with serverless SQL services such as Amazon Athena (after performing a certain number of manual operations).

  1. From EC2, using a Spark shell

bin/spark-shell --packages io.delta:delta-core_2.11:0.6.1
...
// load the Delta table
import io.delta.tables.DeltaTable
val deltaTable = DeltaTable.forPath(pathToDeltaTable)

  2. From Athena

In that direction, starting from Delta Lake version 0.5, the library introduced the generation of symlink_format_manifest files, adding the capability of being consumed by the Amazon Athena service. For more info see https://docs.delta.io/0.5.0/presto-integration.html .

Generate symlinks directly from the APIs
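
Assuming the Delta table ingested above (the path is hypothetical), generating the manifest is a one-liner on the DeltaTable API:

import io.delta.tables.DeltaTable

// generate the symlink_format_manifest files that Athena/Presto need to read
// the Delta table (available since Delta Lake 0.5.0)
val deltaTable = DeltaTable.forPath("s3://my-bucket/delta/orders/")
deltaTable.generate("symlink_format_manifest")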

The disadvantage is that, by skipping the AWS Glue Catalog, the table has to be registered in the Athena metastore manually, using external table creation scripts (CREATE EXTERNAL TABLE) and partition repair commands (MSCK REPAIR TABLE) for every new data ingestion.

Conclusion

This article presented a relatively simple procedure to add Delta Lake features to the AWS Glue framework, in order to create a reliable and serverless data ingestion layer for your Data Warehouse or Data Lake.

This solution is flexible and allows you to create a custom jar which brings very nice features to your serverless ingestion process and will make your Ops team very happy! :)


Written by Stefano Lori

Lead Big Data and AI, Senior Data Scientist in Fintech, ESG and Spark NLP contributor.
