{"id":2627,"date":"2021-02-19T11:05:48","date_gmt":"2021-02-19T10:05:48","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=2627"},"modified":"2023-02-22T17:05:12","modified_gmt":"2023-02-22T16:05:12","slug":"orchestrating-data-analytics-and-business-intelligence-pipelines-via-step-function","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/orchestrating-data-analytics-and-business-intelligence-pipelines-via-step-function\/","title":{"rendered":"Orchestrating Data Analytics and Business Intelligence pipelines via AWS Step Functions"},"content":{"rendered":"\n
ETL pipelines on AWS usually behave linearly: they start from one service and end at another. This time, though, we would like to present a more flexible setup, in which some ETL jobs can be skipped depending on the data. Furthermore, some of the transformed data in our data lake needs to be queried by AWS Athena to generate BI dashboards in QuickSight, while other data partitions are used to train an ad-hoc anomaly detection model with SageMaker.
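As a quick preview of the BI side, here is a minimal sketch of how the transformed data in the data lake could be queried with Athena through boto3. The database, table, and output bucket names are invented for illustration and are not taken from the actual project.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical aggregation over a curated sensor table; all names are placeholders.
QUERY = """
SELECT sensor_id,
       avg(temperature_c) AS avg_temperature
FROM sensor_readings
GROUP BY sensor_id
"""

response = athena.start_query_execution(
    QueryString=QUERY,
    QueryExecutionContext={"Database": "sensors_data_lake"},                 # placeholder database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},  # placeholder bucket
)
print("Query execution id:", response["QueryExecutionId"])
```

A query like this can then back a QuickSight dataset, while other partitions of the same data lake feed the SageMaker training job.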
A powerful tool for orchestrating this type of ETL pipeline is AWS Step Functions.
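To give an idea of what the data-dependent branching looks like in practice, here is a minimal sketch (not the original pipeline definition) of a state machine created with boto3: a Choice state inspects the execution input and either runs a Glue cleaning job or skips it entirely. The state machine name, Glue job name, and IAM role ARN are all placeholders.

```python
import json

import boto3

# Hypothetical state machine: run a Glue ETL job only when the execution input
# asks for it, otherwise skip straight to the end of the workflow.
definition = {
    "StartAt": "NeedsCleaning?",
    "States": {
        "NeedsCleaning?": {
            "Type": "Choice",
            "Choices": [
                # If the caller sets skip_cleaning=true, the Glue job is bypassed.
                {"Variable": "$.skip_cleaning", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": "RunCleaningJob",
        },
        "RunCleaningJob": {
            "Type": "Task",
            # The .sync integration makes Step Functions wait for the Glue job to finish.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "sensor-data-cleaning-job"},  # placeholder job name
            "Next": "Done",
        },
        "Done": {"Type": "Succeed"},
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="etl-orchestration-example",                              # placeholder name
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::ACCOUNT_ID:role/StepFunctionsEtlRole",   # placeholder role
)
```

The same pattern extends naturally to the rest of the workflow: each crawler or job becomes a Task state, and Choice states decide which branches actually run for a given execution.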
In this article, we want to show you some of the steps involved in the creation of the pipeline, as well as how several AWS data analytics services can be used in near-real-time scenarios to manage a high volume of data in a scalable way.
In particular, we'll investigate AWS Glue connectors and crawlers, AWS Athena, QuickSight, and Kinesis Data Firehose, and we'll close with a brief explanation of how to use SageMaker to create forecasts from the collected data. To learn more about SageMaker, you can also take a look at our other articles.

Let's start!

In this example, we'll set up several temperature sensors that send temperature and diagnostic data to our pipeline, we'll perform different BI analyses to verify efficiency, and we'll use a SageMaker model to check for anomalies.

To keep things interesting, we also want to grab historical data from two different locations: an S3 bucket and a database residing on an EC2 instance in a different VPC from the one used by our ETL pipelines.

We will use different ETL jobs to extract cleaned data from the raw data, and AWS Step Functions to orchestrate all the crawlers and jobs.

Kinesis Data Firehose will continuously ingest the sensors' data, and with AWS Athena we will query both aggregated and per-sensor data to show graphical stats in Amazon QuickSight.

Here is a simple schema illustrating the services involved and the complete flow.

Our setup
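Before going through the setup in detail, here is a rough sketch of what the sensor side could look like: a small Python producer pushing JSON readings into a Kinesis Data Firehose delivery stream with boto3. The delivery stream name, sensor identifiers, and record format are assumptions for illustration, not values from the original project.

```python
import json
import random
import time
from datetime import datetime, timezone

import boto3

firehose = boto3.client("firehose")

DELIVERY_STREAM = "sensor-temperature-stream"  # placeholder delivery stream name


def send_reading(sensor_id: str) -> None:
    """Send a single simulated temperature reading to Kinesis Data Firehose."""
    reading = {
        "sensor_id": sensor_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "temperature_c": round(random.uniform(18.0, 28.0), 2),
        "status": "OK",
    }
    firehose.put_record(
        DeliveryStreamName=DELIVERY_STREAM,
        # Newline-delimited JSON keeps the objects easy to query with Athena later on.
        Record={"Data": (json.dumps(reading) + "\n").encode("utf-8")},
    )


if __name__ == "__main__":
    # Simulate a few sensors emitting a reading every 10 seconds.
    while True:
        for sensor in ("sensor-01", "sensor-02", "sensor-03"):
            send_reading(sensor)
        time.sleep(10)
```

On the AWS side, the delivery stream simply buffers these records and writes them to the data lake bucket, where the Glue crawlers and ETL jobs described below pick them up.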