{"id":2938,"date":"2021-04-02T11:31:44","date_gmt":"2021-04-02T09:31:44","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=2938"},"modified":"2021-04-02T14:39:40","modified_gmt":"2021-04-02T12:39:40","slug":"orchestrating-etl-pipelines-on-aws-with-glue-stepfunctions-and-cloudformation","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/orchestrating-etl-pipelines-on-aws-with-glue-stepfunctions-and-cloudformation\/","title":{"rendered":"Orchestrating ETL pipelines on AWS with Glue, StepFunctions, and Cloudformation"},"content":{"rendered":"\n
Big Data analytics is becoming increasingly important for drafting major business decisions in corporations of all sizes. However, collecting, aggregating, joining, and analyzing (wrangling) huge amounts of data stored in different locations with a heterogeneous structure (e.g. databases, CRMs, unstructured text, etc.) is often a daunting and very time-consuming task. <\/p>\n\n\n\n
Cloud computing often comes to the rescue by providing cheap and scalable storage, computing, and data lake solutions. In particular, AWS leads the pack with the very versatile Glue and S3 services, which allow users to ingest, transform, normalize, and store datasets of all sizes. Furthermore, Glue Catalog and Athena allow users to easily run Presto-based SQL queries on the normalized data in S3 data lakes, and the results can easily be stored and analyzed in business intelligence tools such as QuickSight.<\/p>\n\n\n\n
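As a sketch, such Athena queries on Glue Catalog tables can also be submitted from code; the snippet below assumes boto3 is available and uses hypothetical database, query, and bucket names.

```python
def build_query_params(database: str, query: str, output_bucket: str) -> dict:
    """Assemble the parameters for Athena's start_query_execution call."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            # Athena writes result files under this S3 prefix
            "OutputLocation": f"s3://{output_bucket}/athena-results/"
        },
    }

def start_athena_query(database: str, query: str, output_bucket: str) -> str:
    """Submit the query and return its execution id (requires AWS credentials)."""
    import boto3  # imported lazily so the module loads without boto3 installed

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_query_params(database, query, output_bucket)
    )
    return response["QueryExecutionId"]

# Example call (hypothetical names):
# start_athena_query(
#     database="sales_datalake",
#     query="SELECT country, SUM(amount) AS total FROM orders GROUP BY country",
#     output_bucket="my-athena-results-bucket",
# )
```

The returned execution id can then be polled with `get_query_execution` until the query completes and the results land in the configured S3 location.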
Despite the great advantages offered by Glue and S3, the creation and maintenance of complex multi-stage Glue ETL flows is often very time-consuming: Glue jobs are by their nature decoupled, and their code is stored in S3. This makes it very difficult to integrate different jobs and develop them as a well-structured software project. <\/p>\n\n\n\n
A little help can come from Glue workflows: with these integrated Glue pipelines, it is possible to run several different Glue jobs and\/or crawlers automatically in a given order. However, this tool lacks several features that are common in flow-control tools, such as conditional branching, loops, dynamic maps, and custom steps.<\/p>\n\n\n\n
A better alternative is AWS StepFunctions, a very powerful and versatile AWS orchestration tool capable of handling most AWS services, either directly or through Lambda integrations.<\/p>\n\n\n\n
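To give an idea of what such a direct integration looks like, here is a minimal state machine sketched in the Amazon States Language (built as a Python dictionary for readability); the job name "my-etl-job" is a hypothetical placeholder.

```python
import json

# Minimal Step Functions (ASL) definition that runs a single Glue job and,
# thanks to the ".sync" integration pattern, waits for it to finish before
# the state machine completes.
definition = {
    "Comment": "Run a single Glue job and wait for it to finish",
    "StartAt": "RunEtlJob",
    "States": {
        "RunEtlJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "my-etl-job"},  # hypothetical job name
            "End": True,
        }
    },
}

# This JSON string is what you would pass as the state machine definition,
# e.g. in a CloudFormation AWS::StepFunctions::StateMachine resource.
asl_json = json.dumps(definition, indent=2)
print(asl_json)
```

The same definition can be embedded verbatim in a CloudFormation template, which is how the infrastructure and the orchestration logic end up versioned together.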
In the following sections, we will explain how StepFunctions works and how to integrate and develop both the infrastructure and the code for Glue jobs.<\/p>\n\n\n\n
Let\u2019s draft a very simple, yet realistic, ETL job for data ingestion and transformation to explain why an orchestration service in general, and AWS StepFunctions in particular, is an essential component in the data engineer\u2019s toolbox. Here are the logical components of our toy ETL workflow:<\/p>\n\n\n\n
These four steps describe a relatively basic but very common use case. Now let\u2019s try to draft a list of steps we need to execute in AWS Glue in order to complete the described workflow:<\/p>\n\n\n\n
All these steps need to be executed in the given order, and in case of problems, we would like to be notified and have a simple way to understand what went wrong.<\/p>\n\n\n\n
Without AWS StepFunctions, manually managing these steps would be hellish, and we would probably need an external orchestration tool or a custom orchestration script running on an EC2 instance or a Fargate container.<\/p>\n\n\n\n
But why bother? AWS StepFunctions does all this for us and, by interacting directly with many AWS services, makes many integrations a breeze: for example, with a few lines of the StepFunctions language, we can catch all the errors in a pipeline and forward them to an SNS topic in order to receive an email in case of error (or a Slack notification, an SMS, or whatever you prefer).<\/p>\n\n\n\n
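A sketch of such an error-forwarding setup, again expressed as ASL built from Python dictionaries; the Glue job name and the SNS topic ARN are hypothetical placeholders.

```python
import json

# Two states: the Glue job task catches any error ("States.ALL") and routes
# it to a notification state that publishes the error details to SNS.
states = {
    "RunEtlJob": {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": "my-etl-job"},  # hypothetical job name
        "Catch": [
            {
                "ErrorEquals": ["States.ALL"],  # match every error type
                "ResultPath": "$.error",        # keep error details in the state data
                "Next": "NotifyFailure",
            }
        ],
        "End": True,
    },
    "NotifyFailure": {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {
            # hypothetical topic ARN; subscribers (email, Slack webhook via
            # Lambda, SMS, ...) receive the serialized error payload
            "TopicArn": "arn:aws:sns:eu-west-1:123456789012:etl-alerts",
            "Message.$": "States.JsonToString($.error)",
        },
        "End": True,
    },
}

print(json.dumps(states, indent=2))
```

Any unhandled failure in the Glue job now results in a message on the topic instead of a silently failed execution.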
Managing complex flows thus becomes safe and relatively easy. Here is an example of a quite contrived flow:<\/p>\n\n\n\n