{"id":1980,"date":"2020-11-26T16:51:14","date_gmt":"2020-11-26T15:51:14","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=1980"},"modified":"2021-03-18T16:25:38","modified_gmt":"2021-03-18T15:25:38","slug":"etl-orchestration-on-aws-with-aws-step-functions","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/etl-orchestration-on-aws-with-aws-step-functions\/","title":{"rendered":"ETL Orchestration on AWS with AWS Step Functions"},"content":{"rendered":"\n
In recent years, the engineering, governance, and analysis of data have become a very common talking point.
The need for data-driven decision-making has, in fact, increased the need to collect and analyze data in many different ways, and AWS has shown particular interest in this field by developing multiple tools to achieve these business goals.
Before the data analyst can explore and visualize the data, a crucial step is needed. This procedure is commonly identified as ETL (extract, transform, and load) and, usually, it is far from simple.

Whoever carries out this process is responsible for tasks such as:

- extracting raw data, which may arrive at very different paces and volumes, from the sources into a staging storage (commonly Amazon S3 buckets);
- validating, cleaning, and transforming the collected data with services such as AWS Lambda and AWS Glue;
- loading the results into the targets used for business analysis and visualization.
## ETL on AWS

As briefly seen, a couple of AWS services have already been cited as important components of an infrastructure capable of hosting an ETL process.

AWS, however, has developed other services as well, and they have already become the state of the art in the construction of data ingestion pipelines.
### ETL Extraction on AWS

The data from which the business analytics of an organization can profit may arrive at very disparate paces and sizes: from the hundreds of orders per second submitted to an e-commerce store during Black Friday, to the ingestion of a monthly business report. The ETL infrastructure must always be ready to welcome the new information into the staging storage.

AWS services can help accommodate such dissimilar business needs by conveying the data into the same repository, commonly identified with S3 buckets.

Depending on the volume of the expected data, it is possible to defer the validation of incoming files to different AWS services. To achieve the best cost/performance ratio, it is necessary to choose between AWS Lambda, for an event-driven pattern when small files are expected, and AWS Glue, with scheduled batch job runs, when data may reach volumes that exceed AWS Lambda's computational limits.
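To make the event-driven option concrete, here is a minimal sketch of a validation Lambda, assuming the staging bucket is configured to send object-created notifications to the function; the size threshold, the JSON-validity rule, and the quarantine prefix are all hypothetical choices that would depend on the actual data format:

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical limit: files above this size are left to the scheduled Glue batch job.
MAX_SIZE_BYTES = 50 * 1024 * 1024

def handler(event, context):
    """Triggered by s3:ObjectCreated events on the staging bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"]["size"]

        if size > MAX_SIZE_BYTES:
            # Too big to validate inside Lambda: skip it here.
            continue

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        try:
            json.loads(body)  # example rule: the file must be well-formed JSON
        except ValueError:
            # Move invalid files to a quarantine prefix for later inspection.
            s3.copy_object(
                Bucket=bucket,
                Key=f"quarantine/{key}",
                CopySource={"Bucket": bucket, "Key": key},
            )
            s3.delete_object(Bucket=bucket, Key=key)
```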
### ETL Transformation on AWS

The transformation of the incoming data is commonly a heavy-duty job to be executed in batches. For this reason, the best candidates for this task are Glue resources: AWS Glue is based on serverless clusters that can seamlessly scale to terabytes of RAM and thousands of core workers.

It is possible to run Python scripts or PySpark/Spark code for optimal scalability. Python shell Glue jobs are mostly indicated for low-to-medium loads, since they cannot scale beyond a single worker (4 vCPU and 16 GB of RAM).

However, although Spark Glue Jobs and Glue Studio make it possible to create very meticulous transformation jobs, it is likely that the new AWS Glue DataBrew service can fulfill this need with its very clear and complete web interface.

It is important to note that, to allow Glue Jobs to retrieve the needed data from a single source, AWS Glue incorporates the Data Catalog in its interface. As the name suggests, it maintains an archive of the data present in our data stores, which is then used for ingestion. To maintain and update the catalog, an AWS Glue component called Crawler is used. The crawler, in fact, gives the jobs trying to fetch data from the sources visibility of new files and partitions.
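As a sketch of what such a job can look like, the following Glue Spark script reads a table that a crawler has registered in the Data Catalog, drops malformed records, and writes the result to the curated area of the data lake; the database, table, field names, and output path are hypothetical placeholders:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw data through the Data Catalog (database/table created by the crawler).
orders = glue_context.create_dynamic_frame.from_catalog(
    database="staging_db",    # hypothetical catalog database
    table_name="raw_orders",  # hypothetical crawled table
)

# Example transformation: drop malformed records and keep only the fields we need.
clean = orders.filter(lambda r: r["order_id"] is not None).select_fields(
    ["order_id", "customer_id", "amount", "order_date"]
)

# Write the transformed data to the curated area, partitioned by date.
glue_context.write_dynamic_frame.from_options(
    frame=clean,
    connection_type="s3",
    connection_options={
        "path": "s3://my-curated-bucket/orders/",  # hypothetical target
        "partitionKeys": ["order_date"],
    },
    format="parquet",
)

job.commit()
```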
### ETL Load on AWS

After the transformation process, a specific Glue Job, or the same component employed in the previous step, can finally store the valid, clean, and transformed data in the targets used for business analysis and visualization via, for example, Amazon QuickSight dashboards.

To preserve the privacy of the sensitive data that may travel through the pipeline, it is important to set up the needed security measures, such as KMS encryption for the data at rest in the buckets and databases, and SSL-protected transfers for the data in transit. Moreover, it is good practice to obfuscate any PII stored in the domain.
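A minimal boto3 sketch of these two measures, assuming a hypothetical bucket name and KMS key alias: default SSE-KMS encryption for data at rest, plus a bucket policy that rejects non-SSL transfers.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake-bucket"  # hypothetical bucket name

# Encrypt data at rest: make SSE-KMS the default for every new object.
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/data-lake-key",  # hypothetical KMS key alias
                }
            }
        ]
    },
)

# Protect data in transit: deny any request that does not use SSL.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```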
## ETL orchestration on AWS

Managing the bits and bytes flowing through the whole ETL data pipeline is commonly not an easy task.

To apply appropriate governance to the data produced by the process, ad-hoc quality checks are usually performed. It is important, in fact, to catch any breach of the business requirements, such as missing data in the data lake caused by an error in the code of the validation job.

AWS Glue has the tools to create workflows and triggers with which some sort of data pipeline can be built. However, the possible solutions are very limited by the lack of directives for loops, retries, proper error handling, and the invocation of AWS services outside of AWS Glue.

AWS, however, provides a specific tool that allows the scrupulous orchestration of serverless services: AWS Step Functions. This tool allows the management of retry logic and error handling, so that our distributed applications react better to unexpected behaviors.

In the following sections, we are going to discover and employ Step Functions for the orchestration of a realistic ETL use case.
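As a first taste of these capabilities, the sketch below registers a minimal state machine that runs a Glue job synchronously, retries it with exponential backoff, and publishes a notification if every attempt fails; the job name, role ARN, and SNS topic ARN are hypothetical:

```python
import json
import boto3

# Amazon States Language definition: run a Glue job, retry on failure,
# and publish to an SNS topic if all attempts fail.
definition = {
    "StartAt": "TransformData",
    "States": {
        "TransformData": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "transform-orders"},  # hypothetical Glue job
            "Retry": [
                {
                    "ErrorEquals": ["States.ALL"],
                    "IntervalSeconds": 30,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:eu-west-1:123456789012:etl-alerts",  # hypothetical
                "Message": "The transformation job failed after all retries.",
            },
            "End": True,
        },
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="etl-orchestrator",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/etl-step-functions-role",  # hypothetical
)
```

The `.sync` suffix on the Glue integration makes Step Functions wait for the job run to finish before evaluating retries, which is what makes the retry and catch logic meaningful here.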
### AWS Step Functions

This AWS service allows the construction of highly scalable finite-state machines which, in the Express configuration, can handle up to one hundred thousand state changes per second.

It is important to note that a workflow built with this service is mainly composed of: