{"id":1980,"date":"2020-11-26T16:51:14","date_gmt":"2020-11-26T15:51:14","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=1980"},"modified":"2021-03-18T16:25:38","modified_gmt":"2021-03-18T15:25:38","slug":"etl-orchestration-on-aws-with-aws-step-functions","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/etl-orchestration-on-aws-with-aws-step-functions\/","title":{"rendered":"ETL Orchestration on AWS with AWS Step Functions"},"content":{"rendered":"\n
In recent years, the engineering, governance, and analysis of data have become a very common talking point.
The need for data-driven decision-making has, in fact, increased the demand for collecting and analyzing data in many ways, and AWS has shown particular interest in this field, developing multiple tools to achieve these business goals.
Before data analysts can explore and visualize the data, a crucial step is needed. This procedure is commonly known as ETL (extract, transform, and load) and is usually far from simple.
Whoever carries out this process is responsible for the following tasks:
- **Extraction**: data usually comes from numerous heterogeneous sources, such as databases, web spidering, data streams, and semi-structured data. Due to the potential diversity of these sources, validation of the incoming data is mandatory, so as not to introduce information with an unexpected format or pattern (a minimal validation sketch follows this list).
- **Transformation**: after the valid data is loaded into *staging storage*, a set of common transformations is applied to it. This stage is also known as data preparation, and it typically involves the removal of incomplete or inaccurate data (data cleansing), aggregation with other data, record deduplication, and all the steps of normalization and encoding (see the preparation sketch after this list).
- **Load**: finally, the data that has been validated and transformed is stored in persistent data stores. These data stores may vary according to business needs and can be classified by different attributes. The most common destinations for ETL today are data warehouses and data lakes. The former is generally used to store data with a strict schema definition in relational databases such as Amazon Redshift. The latter is commonly made up of semi-structured data and is mostly employed for machine learning, exploratory analysis, big data analysis, and visualization. The coupling of Amazon S3 (for low-cost storage) and Amazon Athena (for fast, serverless queries on files) makes AWS an excellent platform for building data lakes (see the load sketch after this list).
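To make the extraction step concrete, here is a minimal validation sketch in Python; the schema, field names, and types (`EXPECTED_FIELDS`, `event_id`, and so on) are hypothetical and not taken from any specific pipeline.

```python
from datetime import datetime

# Hypothetical expected schema for incoming records; field names and
# types are assumptions for illustration only.
EXPECTED_FIELDS = {
    "event_id": str,
    "amount": float,
    "created_at": str,  # expected as an ISO 8601 timestamp
}

def validate_record(record: dict) -> bool:
    """Reject records with missing fields, wrong types, or bad timestamps."""
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record or not isinstance(record[field], expected_type):
            return False
    try:
        datetime.fromisoformat(record["created_at"])
    except ValueError:
        return False
    return True

incoming = [
    {"event_id": "a1", "amount": 9.99, "created_at": "2020-11-26T16:51:14"},
    {"event_id": "a2", "amount": "bad", "created_at": "2020-11-26T16:51:14"},
]
valid = [r for r in incoming if validate_record(r)]  # keeps only the first record
```

Rejected records would typically be routed to a quarantine location for inspection rather than silently dropped.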
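For the transformation step, here is a minimal data-preparation sketch using pandas (one possible choice; no specific library is prescribed above), covering cleansing, deduplication, normalization, and encoding on hypothetical columns:

```python
import pandas as pd

# Hypothetical staging data; column names are assumptions for illustration.
staged = pd.DataFrame({
    "event_id": ["a1", "a1", "a3", "a4"],
    "amount": [9.99, 9.99, None, 25.0],
    "country": ["IT", "IT", None, "FR"],
})

prepared = (
    staged
    .drop_duplicates(subset="event_id")  # record deduplication
    .dropna()                            # remove incomplete records (data cleansing)
    .assign(                             # min-max normalization of a numeric column
        amount_norm=lambda df: (df["amount"] - df["amount"].min())
                               / (df["amount"].max() - df["amount"].min())
    )
)

# One-hot encoding of a categorical column, as an example of the encoding step.
encoded = pd.get_dummies(prepared, columns=["country"])
```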
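For the load step, here is a minimal sketch using boto3 to persist a curated file to Amazon S3 and query it with Amazon Athena. The bucket, key, and database names are hypothetical, and the `events` table is assumed to already exist (for example, defined through an AWS Glue crawler):

```python
import boto3

# Hypothetical bucket, key, and database names; AWS credentials and region
# are assumed to be configured in the environment.
s3 = boto3.client("s3")
athena = boto3.client("athena")

# Load: persist the validated, transformed data to the data lake on Amazon S3.
s3.put_object(
    Bucket="example-data-lake",
    Key="curated/events/part-0000.csv",
    Body=b"event_id,amount,country\na1,9.99,IT\na4,25.0,FR\n",
)

# Query the curated files serverlessly with Amazon Athena.
response = athena.start_query_execution(
    QueryString="SELECT country, AVG(amount) AS avg_amount FROM events GROUP BY country",
    QueryExecutionContext={"Database": "example_analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])
```

Athena queries run asynchronously, so a real pipeline would poll `get_query_execution` for completion before reading the results from the output location.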