{"id":2803,"date":"2021-03-16T12:35:44","date_gmt":"2021-03-16T11:35:44","guid":{"rendered":"https:\/\/blog.besharp.it\/costruire-un-data-lake-su-aws-con-aws-lake-formation\/"},"modified":"2021-04-08T15:29:42","modified_gmt":"2021-04-08T13:29:42","slug":"costruire-un-data-lake-su-aws-con-aws-lake-formation","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/costruire-un-data-lake-su-aws-con-aws-lake-formation\/","title":{"rendered":"Building a Data Lake from scratch on AWS using AWS Lake Formation"},"content":{"rendered":"\n

Introduction

Leveraging available data (Big Data) has become a significant focus for most companies over the last decades. In the last few years, the advent of Cloud Computing has democratized access to more powerful IT resources, eliminating the costs and hassles of managing the infrastructure required by an on-premises data center.

Cloud Computing also helps companies use their data efficiently, lowering engineering costs thanks to the power of its managed services.

It also promotes the use of on-demand infrastructures, making it easier to re-think, re-engineer, and re-architect a data lake to explore new use cases.

Since data is a focal point for business decisions, managing it efficiently becomes a priority.

Among the many ways to do so, the data lake concept, a scalable, low-cost, centralized repository for storing raw data from various sources, has grown in popularity. It enables users to store data as-is, without structuring it first, to run different types of analytics, gain insights, and guide more accurate strategic business decisions.

Building a data lake is not an easy task: it involves numerous manual steps, making the process complex and, more importantly, very time-consuming. Data usually comes from diverse sources and should be carefully monitored.

Moreover, managing this amount of data requires several procedures to avoid leaks and security breaches, which means you need to set up access management policies, enable encryption of sensitive data, and manage the corresponding encryption keys.

Without the right choices about technology, architecture, data quality, and data governance, a data lake can quickly become an isolated mess of difficult-to-use, hard-to-understand, often inaccessible data.

Fortunately, AWS Cloud comes to the rescue with many services designed to manage a data lake, such as AWS Glue and S3.

For this article, we will assume the reader already has some knowledge about AWS services and understands the concepts behind AWS Glue and S3. If this is not the case, we encourage you to read our latest stories about ingesting data for Machine Learning workloads and managing complex Machine Learning projects via Step Functions.

We will explore how to quickly build a very simple data lake using Lake Formation. Then, we will focus on the security and governance advantages that this service offers over plain AWS Glue.

Let us dig into it!

Quick Setup

Before focusing on the advantages of managing a data lake through AWS Lake Formation, we first need to create a simple one.

Let us go to the AWS console and choose AWS Lake Formation in the service list or via the search bar. We will find this dashboard:

\"AWS
Welcome screen of Lake Formation<\/figcaption><\/figure>\n\n\n\n

After clicking on "Get started", we will be asked to set up an administrator for the data lake; it is possible to add AWS users and roles available in the account you are logged into. Select a suitable one, preferably a role that can be assumed with temporary credentials by both humans and services, and continue.

\"create
Select a user or a role<\/em><\/figcaption><\/figure>\n\n\n\n
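If you prefer to script this step, here is a minimal boto3 sketch of the same administrator setup (the DataLakeAdminRole ARN is a hypothetical placeholder):

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Hypothetical IAM role that will administer the data lake
admin_role_arn = "arn:aws:iam::123456789012:role/DataLakeAdminRole"

# Fetch the current settings so we append to, rather than overwrite, existing admins
settings = lakeformation.get_data_lake_settings()["DataLakeSettings"]
admins = settings.get("DataLakeAdmins", [])
admins.append({"DataLakePrincipalIdentifier": admin_role_arn})
settings["DataLakeAdmins"] = admins

# PutDataLakeSettings replaces the whole settings object with the one provided
lakeformation.put_data_lake_settings(DataLakeSettings=settings)
```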

Once we gain access to the Lake Formation dashboard, it's time to add a lake location, which is a valid S3 path to retrieve data from. Data can be ingested in various ways, for example with AWS Glue jobs, through the combination of Amazon Kinesis Data Streams and Kinesis Data Firehose, or by simply uploading data directly to S3.

Let's quickly review all the possibilities to populate our Glue Catalog (which defines our data lake under the hood).

First, we'll add the data lake location by clicking on the "Register location" button in the Register and Ingest section of the service's dashboard, like in the figure.

\"Choose
Add a new location for the data lake<\/figcaption><\/figure>\n\n\n\n

We'll be asked to select an S3 bucket; let's do so, add a suitable role (or let AWS create one for us), and finally click on "Register location".
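The same registration can also be performed through the API; a minimal boto3 sketch (with a hypothetical bucket name) looks like this:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Register a hypothetical bucket as a data lake location, letting AWS use the
# service-linked role to access it on our behalf
lakeformation.register_resource(
    ResourceArn="arn:aws:s3:::my-data-lake-bucket",
    UseServiceLinkedRole=True,
)

# Alternative: pass an explicit role instead of the service-linked one
# lakeformation.register_resource(
#     ResourceArn="arn:aws:s3:::my-data-lake-bucket",
#     RoleArn="arn:aws:iam::123456789012:role/LakeFormationRegistrationRole",
# )
```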

Now we can:

1. simply upload data to S3 before starting the crawling process (a minimal sketch of this option follows the list);
2. use a suitable combination of AWS services to ingest data, like Kinesis Data Streams and Kinesis Data Firehose (see our story for more information);
3. use a blueprint from Lake Formation to easily obtain data from various databases or log sources.
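For the first option, a minimal boto3 sketch could look like the following; the bucket, object key, and crawler name are hypothetical, and the crawler is assumed to already exist and point at the raw/orders/ prefix:

```python
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# Upload a raw CSV file into the location registered with Lake Formation
s3.upload_file(
    Filename="orders.csv",
    Bucket="my-data-lake-bucket",
    Key="raw/orders/orders.csv",
)

# Run the crawler that populates the Glue Catalog tables for that prefix
glue.start_crawler(Name="raw-orders-crawler")
```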

Let's quickly review the third option, which is limited but still interesting and not yet covered in our previous posts.

    \"\"
    Use a blueprint workflow<\/figcaption><\/figure>\n\n\n\n

Clicking on "Use blueprint", we'll be redirected to a form where we can select whether we want to grab data from a database or a log source.

Just follow the instructions to set up a workflow, which is basically a Glue ETL job where all the options for the Extract, Transform, and Load steps are in one place.

For example, for a MySQL, MSSQL, or Oracle database, add (or create) an AWS Glue connection, specifying also the source database and table in the <database>/<table> format. Then add (or create) the target Glue Catalog database and table, and browse with the provided tool for a suitable S3 path to host the catalog data.
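If you would rather prepare the Glue connection outside the console, a minimal boto3 sketch could look like this; the JDBC endpoint, credentials, and network settings are hypothetical placeholders:

```python
import boto3

glue = boto3.client("glue")

# Create the JDBC connection the blueprint workflow will use to reach the source DB
glue.create_connection(
    ConnectionInput={
        "Name": "mysql-source-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:mysql://my-db-host:3306/mydb",
            "USERNAME": "glue_user",
            "PASSWORD": "change-me",  # better kept in AWS Secrets Manager
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "eu-west-1a",
        },
    }
)
```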

Select a workflow name, choose the job frequency (e.g. "Run on demand") and a table prefix; the other options can be left at their defaults.

A couple of notes: always opt for the Parquet format in the target S3 section, as it gives a solid performance boost for dataset operations later on. Also, if you plan to use Athena to query your catalog, please use "_" instead of "-" in database and table names, as the latter character can sometimes lead to unwanted compatibility issues.
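Once the workflow has populated the catalog, a quick way to check the result is to query it with Athena. Here is a minimal boto3 sketch, where the database, table, and results bucket names are hypothetical (note the underscores):

```python
import boto3

athena = boto3.client("athena")

# Start an asynchronous query against the Glue Catalog database created by the workflow
response = athena.start_query_execution(
    QueryString="SELECT * FROM sales_orders LIMIT 10",
    QueryExecutionContext={"Database": "my_datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)

# The query runs asynchronously; results can be fetched later with get_query_results
print(response["QueryExecutionId"])
```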

Enhanced Security

Once Lake Formation is up and running, we can focus on the details that make it stand out: first of all, a fine-grained permission model that augments the coarse-grained one provided by IAM.

This centrally defined model enables fine-grained access to data stored in data lakes with a simple grant/revoke mechanism, as shown in the figure below:

    \"AWS
    How a request passes through 2 stages of permission before having access to resources<\/figcaption><\/figure>\n\n\n\n

Lake Formation permissions are also enforced at the table and column level and work across the full stack of AWS services for analytics and machine learning, including, but not limited to, Amazon Athena, Amazon Redshift, and Amazon SageMaker; under the hood, they are also mapped to S3 objects.
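As an example of how granular these grants can be, here is a minimal boto3 sketch of a column-level SELECT grant; the principal, database, table, and column names are hypothetical:

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Allow a hypothetical analyst role to SELECT only three columns of one table
lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/DataAnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "my_datalake_db",
            "Name": "sales_orders",
            "ColumnNames": ["order_id", "order_date", "total_amount"],
        }
    },
    Permissions=["SELECT"],
)

# revoke_permissions takes the same arguments to remove the grant
```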

Access control in AWS Lake Formation is divided into two distinct areas: