{"id":2938,"date":"2021-04-02T11:31:44","date_gmt":"2021-04-02T09:31:44","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=2938"},"modified":"2021-04-02T14:39:40","modified_gmt":"2021-04-02T12:39:40","slug":"orchestrating-etl-pipelines-on-aws-with-glue-stepfunctions-and-cloudformation","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/orchestrating-etl-pipelines-on-aws-with-glue-stepfunctions-and-cloudformation\/","title":{"rendered":"Orchestrating ETL pipelines on AWS with Glue, StepFunctions, and Cloudformation"},"content":{"rendered":"\n
Big Data analytics is becoming increasingly important for drafting major business decisions in corporations of all sizes. However, collecting, aggregating, joining, and analyzing (wrangling) huge amounts of data stored in different locations with a heterogeneous structure (e.g. databases, CRMs, unstructured text, etc.) is often a daunting and very time-consuming task. <\/p>\n\n\n\n
Cloud computing often comes to the rescue by providing cheap and scalable storage, computing, and data lake solutions. In particular, AWS leads the pack with the very versatile Glue and S3 services, which allow users to ingest, transform, normalize, and store datasets of all sizes. Furthermore, Glue Catalog and Athena allow users to easily run Presto-based SQL queries on the normalized data in S3 data lakes, and the results can easily be stored and analyzed in business intelligence tools such as QuickSight.<\/p>\n\n\n\n
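As a sketch, such Athena queries on Glue Catalog tables can also be submitted from code; the snippet below assumes boto3 is available and uses hypothetical database, query, and bucket names.

```python
def build_query_params(database: str, query: str, output_bucket: str) -> dict:
    """Assemble the parameters for Athena's start_query_execution call."""
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {
            # Athena writes result files under this S3 prefix
            "OutputLocation": f"s3://{output_bucket}/athena-results/"
        },
    }

def start_athena_query(database: str, query: str, output_bucket: str) -> str:
    """Submit the query and return its execution id (requires AWS credentials)."""
    import boto3  # imported lazily so the module loads without boto3 installed

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        **build_query_params(database, query, output_bucket)
    )
    return response["QueryExecutionId"]

# Example call (hypothetical names):
# start_athena_query(
#     database="sales_datalake",
#     query="SELECT country, SUM(amount) AS total FROM orders GROUP BY country",
#     output_bucket="my-athena-results-bucket",
# )
```

The returned execution id can then be polled with `get_query_execution` until the query completes and the results land in the configured S3 location.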
Despite the great advantages offered by Glue and S3, the creation and maintenance of complex multi-stage Glue ETL flows is often very time-consuming: Glue jobs are by their nature decoupled, and their code is stored in S3. This makes it very difficult to integrate different jobs and develop them as a well-structured software project. <\/p>\n\n\n\n
A little help can come from Glue workflows: with these integrated Glue pipelines, it is possible to run several different Glue jobs and\/or crawlers automatically in a given order. However, this tool lacks several features that are common in flow-control tools, such as conditional branching, loops, dynamic maps, and custom steps.<\/p>\n\n\n\n
A better alternative is AWS StepFunctions, a very powerful and versatile AWS orchestration tool capable of handling most AWS services, either directly or through Lambda integrations.<\/p>\n\n\n\n
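To give an idea of what such a direct integration looks like, here is a minimal state machine sketched in the Amazon States Language (built as a Python dictionary for readability); the job name "my-etl-job" is a hypothetical placeholder.

```python
import json

# Minimal Step Functions (ASL) definition that runs a single Glue job and,
# thanks to the ".sync" integration pattern, waits for it to finish before
# the state machine completes.
definition = {
    "Comment": "Run a single Glue job and wait for it to finish",
    "StartAt": "RunEtlJob",
    "States": {
        "RunEtlJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "my-etl-job"},  # hypothetical job name
            "End": True,
        }
    },
}

# This JSON string is what you would pass as the state machine definition,
# e.g. in a CloudFormation AWS::StepFunctions::StateMachine resource.
asl_json = json.dumps(definition, indent=2)
print(asl_json)
```

The same definition can be embedded verbatim in a CloudFormation template, which is how the infrastructure and the orchestration logic end up versioned together.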
In the following sections, we will explain how StepFunctions works and how to integrate and develop both the infrastructure and the code for Glue jobs.<\/p>\n\n\n\n
Let\u2019s draft a very simple, yet realistic, ETL job for data ingestion and transformation to explain why an orchestration service in general, and AWS StepFunctions in particular, is an essential component in the data engineer\u2019s toolbox. Here are the logical components of our toy ETL workflow:<\/p>\n\n\n\n
These four steps describe a relatively basic but very common use case. Now let\u2019s try to draft a list of steps we need to execute in AWS Glue in order to complete the described workflow:<\/p>\n\n\n\n
All these steps need to be executed in the given order, and in case of problems, we would like to be notified and have a simple way to understand what went wrong.<\/p>\n\n\n\n
Without AWS StepFunctions, manually managing these steps would be hellish, and we would probably need an external orchestration tool or a custom orchestration script running on an EC2 instance or a Fargate container.<\/p>\n\n\n\n
But why bother? AWS StepFunctions does all this for us and, by interacting directly with many AWS services, makes many integrations a breeze: for example, with a few lines of the StepFunctions language, we can catch all the errors in a pipeline and forward them to an SNS topic in order to receive an email in case of error (or a Slack notification, an SMS, or whatever you prefer).<\/p>\n\n\n\n
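A sketch of such an error-forwarding setup, again expressed as ASL built from Python dictionaries; the Glue job name and the SNS topic ARN are hypothetical placeholders.

```python
import json

# Two states: the Glue job task catches any error ("States.ALL") and routes
# it to a notification state that publishes the error details to SNS.
states = {
    "RunEtlJob": {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": "my-etl-job"},  # hypothetical job name
        "Catch": [
            {
                "ErrorEquals": ["States.ALL"],  # match every error type
                "ResultPath": "$.error",        # keep error details in the state data
                "Next": "NotifyFailure",
            }
        ],
        "End": True,
    },
    "NotifyFailure": {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {
            # hypothetical topic ARN; subscribers (email, Slack webhook via
            # Lambda, SMS, ...) receive the serialized error payload
            "TopicArn": "arn:aws:sns:eu-west-1:123456789012:etl-alerts",
            "Message.$": "States.JsonToString($.error)",
        },
        "End": True,
    },
}

print(json.dumps(states, indent=2))
```

Any unhandled failure in the Glue job now results in a message on the topic instead of a silently failed execution.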
Managing complex flows thus becomes safe and relatively easy. Here is an example of a quite contrived flow:<\/p>\n\n\n\n