{"id":3006,"date":"2021-04-16T13:59:00","date_gmt":"2021-04-16T11:59:00","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=3006"},"modified":"2023-03-29T15:33:26","modified_gmt":"2023-03-29T13:33:26","slug":"aws-glue-elastic-views-an-almost-no-code-etl-and-aggregation-framework","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/aws-glue-elastic-views-an-almost-no-code-etl-and-aggregation-framework\/","title":{"rendered":"AWS Glue Elastic Views! An almost no code ETL and Aggregation Framework"},"content":{"rendered":"\n
Introduction<\/h2>\n\n\n\n
ETL<\/strong> is a fundamental step of a Machine Learning process: it is the foundation on which the entire dataset used to define the model rests. Because of that, data scientists and MLOps experts carefully plan jobs and pipelines to extract data from databases<\/strong>, often of different kinds, clean<\/strong> and normalize data<\/strong>, and finally generate a data lake<\/strong> where the data can be further refined during the investigation process.<\/p>\n\n\n\n
Usually, this process involves running several steps and coordinating their execution, accessing different databases with different technologies, preparing many scripts, knowing different languages to query the relevant data, and so on.<\/p>\n\n\n\n
Taking care of all these steps is a daunting task: it requires a lot of expertise and is time-consuming, which undercuts the efficiency of the entire project at hand.<\/p>\n\n\n\n
In the last couple of years, AWS has been aggressively developing tools and services to help with Machine Learning and ETL tasks, and at the last re:Invent<\/strong> it introduced another important component for ETL-ML preparation: AWS Glue Elastic Views<\/strong>.<\/p>\n\n\n\n
AWS Glue Elastic Views lets a user request data from different data sources while staying completely agnostic about their nature, query that data in a SQL-compatible language, and send the results to a target, typically S3 or another data store, in order to aggregate the heterogeneous data in a data lake.<\/strong><\/p>\n\n\n
\n<\/figure><\/div>\n\n\n
Some of the main advantages are: <\/p>\n\n\n\n
\n
Being able to query databases or data streams of different kinds with the PartiQL language, making the service a de facto aggregator without the need to write complex custom ETL jobs.<\/li>\n\n\n\n
Using powerful commands like JOIN to add aggregation capabilities to data sources that usually don\u2019t have them.<\/li>\n<\/ul>\n\n\n\n
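To make the JOIN advantage concrete, here is a minimal conceptual sketch in plain Python. It is not the Glue Elastic Views API and not PartiQL: the two in-memory sources, the field names, and the helper join_orders_with_customers are all hypothetical. It only illustrates the kind of result a join over a key-value style source and a relational style source would produce.

```python
# Conceptual sketch only: plain Python, not the Glue Elastic Views API.
# Both sources and all field names are hypothetical.

# Key-value style source (think of items in a NoSQL table).
orders = [
    {'order_id': 1, 'customer_id': 'c1', 'total': 40},
    {'order_id': 2, 'customer_id': 'c2', 'total': 15},
    {'order_id': 3, 'customer_id': 'c1', 'total': 25},
]

# Relational style source (think of rows in a SQL table).
customers = [
    ('c1', 'Alice'),
    ('c2', 'Bob'),
]


def join_orders_with_customers(orders, customers):
    '''Inner join on customer_id across the two sources.'''
    names = dict(customers)  # customer_id -> customer name
    return [
        {
            'order_id': order['order_id'],
            'customer': names[order['customer_id']],
            'total': order['total'],
        }
        for order in orders
        if order['customer_id'] in names
    ]


joined = join_orders_with_customers(orders, customers)
```

In the real service this logic would be expressed declaratively in the view definition rather than written by hand, which is the point of the advantage listed above.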
The purpose of this article is to guide the reader through some of the key factors that make this service worth considering when planning Machine Learning projects.<\/p>\n\n\n\n
We\u2019ll dive deep into what AWS Glue Elastic Views is capable of. Keep in mind that it is still in private preview, so you\u2019ll have to request access from AWS.<\/p>\n\n\n\n
Let\u2019s go! <\/p>\n\n\n\n
How it works<\/h2>\n\n\n\n
Let\u2019s start our journey by understanding what AWS Glue Elastic Views<\/strong> is and how it works. First, let\u2019s take a look at this diagram by AWS:<\/p>\n\n\n
\nCourtesy of AWS – AWS Glue Elastic Views inputs and outputs<\/em><\/figcaption><\/figure><\/div>\n\n\n
As shown in the image, the focal point of this service is the Materialized View<\/strong>, which abstracts a dataset regardless of the data source it comes from, e.g. Amazon Aurora, RDS, or DynamoDB. This keeps things in sync without using a Glue Crawler, which is what we would expect based on our other articles about ETL workloads: for example this one<\/a>, or this one<\/a>.<\/p>\n\n\n\n
Keeps target\u2019s data always up-to-date automatically<\/h4>\n\n\n\n
Keeping data in sync usually requires Crawlers and jobs to be created and maintained. AWS Glue Elastic Views, instead, continuously monitors the source data stores for changes and, when a change occurs, automatically updates the targets. This ensures that applications that access data using Elastic Views always have the most up-to-date data.<\/p>\n\n\n\n
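The idea of a target that follows its source change by change can be sketched in a few lines of plain Python. This is a conceptual illustration under stated assumptions, not the service API: the MaterializedView class, its apply_change method, and the status field are hypothetical names invented for this example.

```python
# Conceptual sketch only: not the Glue Elastic Views API.
# Class, method, and field names are hypothetical.


class MaterializedView:
    '''Keeps a filtered copy of a source in a target, one change at a time.'''

    def __init__(self, predicate):
        self.predicate = predicate  # decides which rows belong in the view
        self.target = {}            # stands in for the target data store

    def apply_change(self, key, row):
        '''Apply one source change; row=None means the row was deleted.'''
        if row is not None and self.predicate(row):
            # Insert or update the row in the target.
            self.target[key] = dict(row)
        else:
            # Row was deleted or no longer matches the view definition.
            self.target.pop(key, None)


view = MaterializedView(lambda row: row['status'] == 'active')
view.apply_change('u1', {'name': 'Alice', 'status': 'active'})
view.apply_change('u2', {'name': 'Bob', 'status': 'inactive'})
view.apply_change('u1', {'name': 'Alice', 'status': 'inactive'})  # u1 leaves the view
view.apply_change('u3', {'name': 'Carol', 'status': 'active'})
```

Because each change is applied incrementally, the target never needs a full re-crawl of the source, which is the contrast with the Crawler-and-job approach described above.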
Alerts you if there is a change to the data model in a source data store<\/h4>\n\n\n\n
AWS Glue Elastic Views proactively alerts developers when there is a change to the data model in one of the source data stores so that they can update their views to adapt to this change.<\/p>\n\n\n\n
Serverless<\/h4>\n\n\n\n
AWS Glue Elastic Views is serverless and scales capacity up or down automatically to accommodate workload lifecycles. There is no hardware or software to manage and, as always, you pay only for the resources you use.<\/p>\n\n\n\n