{"id":2803,"date":"2021-03-16T12:35:44","date_gmt":"2021-03-16T11:35:44","guid":{"rendered":"https:\/\/blog.besharp.it\/costruire-un-data-lake-su-aws-con-aws-lake-formation\/"},"modified":"2021-04-08T15:29:42","modified_gmt":"2021-04-08T13:29:42","slug":"costruire-un-data-lake-su-aws-con-aws-lake-formation","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/costruire-un-data-lake-su-aws-con-aws-lake-formation\/","title":{"rendered":"Building a Data Lake from scratch on AWS using AWS Lake Formation"},"content":{"rendered":"\n

Introduction

Leveraging available data (Big Data) has become a significant focus for most companies over the last decades. In the last few years, the advent of Cloud Computing has democratized access to more powerful IT resources, eliminating the costs and hassles of managing the infrastructure required by an on-premises data center.

Cloud Computing also helps companies use their data efficiently, lowering engineering costs thanks to the power of its managed services.

It also promotes the use of on-demand infrastructures, making it easier to re-think, re-engineer, and re-architect a data lake to explore new use cases.

Since data is a focal point for business decisions, managing it efficiently becomes a priority.

Among the many ways to do so, the data lake concept, a scalable, low-cost, centralized repository for storing raw data from various sources, has grown in popularity. It enables users to store data as-is, without structuring it first, to run different types of analytics, gain insights, and guide more accurate strategic business decisions.

Building a data lake is not an easy task: it involves numerous manual steps, making the process complex and, more importantly, very time-consuming. Data usually comes from diverse sources and should be carefully monitored.

Moreover, managing this amount of data requires several procedures to avoid leaks and security breaches, which means you need to set up access management policies, enable encryption of sensitive data, and manage the corresponding encryption keys.

Without the right choices about technology, architecture, data quality, and data governance, a data lake can quickly become an isolated mess of difficult-to-use, hard-to-understand, often inaccessible data.

Fortunately, AWS Cloud comes to the rescue with many services designed to manage a data lake, such as AWS Glue and S3.

For this article, we will assume the reader already has some knowledge about AWS services and understands the concepts behind AWS Glue and S3. If this is not the case, we encourage you to read our latest stories about ingesting data for Machine Learning workloads and managing complex Machine Learning projects via Step Functions.

We will explore how to quickly build a very simple data lake using Lake Formation. Then, we will focus on the security and governance advantages that this service offers over plain AWS Glue.

Let us dig into it!

Quick Setup

Before focusing on the advantages of managing a data lake through AWS Lake Formation, we first need to create a simple one.

Let us go to the AWS console and choose AWS Lake Formation in the service list or via the search bar. We will find this dashboard:

\"AWS
Welcome screen of Lake Formation<\/figcaption><\/figure>\n\n\n\n

After clicking on "Get started", we will be asked to set up an administrator for the data lake; it is possible to add AWS users and roles available in the account you are logged into. Select a suitable one, preferably a role that can be assumed with temporary credentials by both humans and services, and continue.

\"create
Select a user or a role<\/em><\/figcaption><\/figure>\n\n\n\n
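If you prefer to script this step, here is a minimal boto3 sketch of the same administrator setup (the DataLakeAdminRole ARN is a hypothetical placeholder):

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Hypothetical IAM role that will administer the data lake
admin_role_arn = "arn:aws:iam::123456789012:role/DataLakeAdminRole"

# Fetch the current settings so we append to, rather than overwrite, existing admins
settings = lakeformation.get_data_lake_settings()["DataLakeSettings"]
admins = settings.get("DataLakeAdmins", [])
admins.append({"DataLakePrincipalIdentifier": admin_role_arn})
settings["DataLakeAdmins"] = admins

# PutDataLakeSettings replaces the whole settings object with the one provided
lakeformation.put_data_lake_settings(DataLakeSettings=settings)
```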

Once we gain access to the Lake Formation dashboard, it's time to add a lake location, which is a valid S3 path to retrieve data from. Data can be ingested in various ways, for example with AWS Glue jobs, through the combination of Amazon Kinesis Data Streams and Kinesis Data Firehose, or by simply uploading data directly to S3.

Let's quickly review all the possibilities to populate our Glue Catalog (which defines our data lake under the hood).

First, we'll add the data lake location by clicking on the "Register location" button in the Register and Ingest section of the service's dashboard, like in the figure.

\"Choose
Add a new location for the data lake<\/figcaption><\/figure>\n\n\n\n

We'll be asked to select an S3 bucket; let's do so, add a suitable role (or let AWS create one for us), and finally click on "Register location".
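The same registration can also be performed through the API; a minimal boto3 sketch (with a hypothetical bucket name) looks like this:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Register a hypothetical bucket as a data lake location, letting AWS use the
# service-linked role to access it on our behalf
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake-bucket",
    UseServiceLinkedRole=True,
)

# Alternative: pass an explicit role instead of the service-linked one
# lakeformation.register_resource(
#     ResourceArn="arn:aws:s3:::my-data-lake-bucket",
#     RoleArn="arn:aws:iam::123456789012:role/LakeFormationRegistrationRole",
# )
```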

Now we can:

1. simply upload data to S3 before starting the crawling process (a minimal sketch of this option follows the list);
2. use a suitable combination of AWS services to ingest data, like Kinesis Data Streams and Kinesis Data Firehose (see our story for more information);
3. use a blueprint from Lake Formation to easily obtain data from various databases or log sources.
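For the first option, a minimal boto3 sketch could look like the following; the bucket, object key, and crawler name are hypothetical, and the crawler is assumed to already exist and point at the raw/orders/ prefix:

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Upload a raw CSV file into the location registered with Lake Formation
s3.upload_file(
    Filename="orders.csv",
    Bucket="my-data-lake-bucket",
    Key="raw/orders/orders.csv",
)

# Run the crawler that populates the Glue Catalog tables for that prefix
glue.start_crawler(Name="raw-orders-crawler")
```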

Let's quickly review the third option, which is limited but still interesting and not yet covered in our previous posts.

    \"\"
    Use a blueprint workflow<\/figcaption><\/figure>\n\n\n\n

Clicking on "Use blueprint", we'll be redirected to a form where we can select whether we want to grab data from a database or a log source.

Just follow the instructions to set up a workflow, which is basically a Glue ETL job where all the options for the Extract, Transform, and Load steps are in one place.

For example, for a MySQL, MSSQL, or Oracle database, add (or create) an AWS Glue connection, specifying also the source database and table in the <database>/<table> format. Then add (or create) the target Glue Catalog database and table, and browse with the provided tool for a suitable S3 path to host the catalog data.
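If you would rather prepare the Glue connection outside the console, a minimal boto3 sketch could look like this; the JDBC endpoint, credentials, and network settings are hypothetical placeholders:

```python
import boto3

glue = boto3.client("glue")

# Create the JDBC connection the blueprint workflow will use to reach the source DB
glue.create_connection(
    ConnectionInput={
        "Name": "mysql-source-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://my-db-host:3306/mydb",
            "USERNAME": "glue_user",
            "PASSWORD": "change-me",  # better kept in AWS Secrets Manager
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "eu-west-1a",
        },
    }
)
```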

Select a workflow name, choose the job frequency (e.g. "Run on demand") and a table prefix; the other options can be left at their defaults.

A couple of notes: always opt for the Parquet format in the target S3 section, as it gives a solid performance boost for dataset operations later on. Also, if you plan to use Athena to query your catalog, please use "_" instead of "-" in database and table names, as the latter character can sometimes lead to unwanted compatibility issues.
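Once the workflow has populated the catalog, a quick way to check the result is to query it with Athena. Here is a minimal boto3 sketch, where the database, table, and results bucket names are hypothetical (note the underscores):

```python
import boto3

athena = boto3.client("athena")

# Start an asynchronous query against the Glue Catalog database created by the workflow
response = athena.start_query_execution(
    QueryString="SELECT * FROM sales_orders LIMIT 10",
    QueryExecutionContext={"Database": "my_datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)

# The query runs asynchronously; results can be fetched later with get_query_results
print(response["QueryExecutionId"])
```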

Enhanced Security

Once Lake Formation is up and running, we can focus on the details that make it stand out: first of all, a fine-grained permission model that augments the coarse-grained one provided by IAM.

This centrally defined model enables fine-grained access to data stored in data lakes with a simple grant/revoke mechanism, as shown in the figure below:

    \"AWS
    How a request passes through 2 stages of permission before having access to resources<\/figcaption><\/figure>\n\n\n\n

Lake Formation permissions are also enforced at the table and column level and work across the full stack of AWS services for analytics and machine learning, including, but not limited to, Amazon Athena, Amazon Redshift, and Amazon SageMaker; under the hood, they are also mapped to S3 objects.
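As an example of how granular these grants can be, here is a minimal boto3 sketch of a column-level SELECT grant; the principal, database, table, and column names are hypothetical:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Allow a hypothetical analyst role to SELECT only three columns of one table
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/DataAnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "my_datalake_db",
            "Name": "sales_orders",
            "ColumnNames": ["order_id", "order_date", "total_amount"],
        }
    },
    Permissions=["SELECT"],
)

# revoke_permissions takes the same arguments to remove the grant
```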

Access control in AWS Lake Formation is divided into two distinct areas: