{"id":3775,"date":"2021-11-12T14:00:00","date_gmt":"2021-11-12T13:00:00","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=3775"},"modified":"2021-11-12T12:00:05","modified_gmt":"2021-11-12T11:00:05","slug":"lake-formation-data-security-and-data-governance-with-lf-tbac","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/lake-formation-data-security-and-data-governance-with-lf-tbac\/","title":{"rendered":"Lake Formation: Data Security and Data Governance with LF-TBAC"},"content":{"rendered":"\n
Big Data describes information gathered from heterogeneous sources that has become incredibly complex to manage in terms of Variety<\/strong>, Veracity<\/strong>, Value<\/strong>, Volume<\/strong>, and Velocity<\/strong>. Still, it can be considered the \u201cNew Gold\u201d because of its potential to generate business value.<\/p>\n\n\n\n Without adequate governance or quality controls, data lakes can quickly turn into unmanageable data swamps. Data engineers know the data they need lives in these swamps, but they won’t be able to find, trust, or use it without a clear data governance strategy.<\/p>\n\n\n\n A very common challenge is maintaining<\/strong> governance and access control<\/strong> over the users who operate on the Data Lake while protecting sensitive information.<\/p>\n\n\n\n Companies need centralized governance and access control, plus a strategy backed by managed services, to control user access to data at a fine-grained level.<\/p>\n\n\n\n Dealing with these situations typically means choosing between two approaches: manual<\/em>, which is more flexible<\/strong> but complex<\/strong>, and managed<\/em>, which requires your solution to fit into specific standards<\/strong> but in return takes away all management complexities<\/strong> for the developers.<\/p>\n\n\n\n This article will guide you through setting up your Data Lake with Lake Formation, showing the challenges that must be addressed during the process, with a particular eye on Security and Governance through the LF-TBAC approach.<\/p>\n\n\n\n Tag-Based Access Control, TBAC<\/strong> for short, is an increasingly popular way to solve these challenges by applying constraints based on tags associated with specific resources.<\/p>\n\n\n\n So, without further ado, let\u2019s dig in!<\/p>\n\n\n\n Tag-based access control allows administrators of IAM-enabled resources to create access policies based on existing tags associated with eligible resources. 
<\/p>\n\n\n\n Cloud providers manage the permissions of both users and applications with policies: documents containing rules that reference resources. By applying tags to those resources, it is possible to define simple and effective allow\/deny conditions.<\/p>\n\n\n\n Using access management tags may reduce the number of access policies needed within a cloud account while also providing a simplified way to grant access to a heterogeneous group of resources.<\/p>\n\n\n\n S3, like most AWS services, leverages IAM principals for access management<\/strong>, meaning that it is possible to define which parts of a bucket (files and folders\/prefixes) a single IAM principal can read or write; however, it is not possible to further restrict IAM access to specific parts of an object, or to certain data segments stored inside objects.<\/p>\n\n\n\n For example, let\u2019s assume that our application data is stored as a collection of Parquet files divided by country into different folders.<\/p>\n\n\n\n It is possible to restrict a user to the data of users belonging to a given country<\/strong>. Still, there is no way to prevent them from reading the personal information (e.g., username and address) stored as columns in the Parquet files. 
<\/p>\n\n\n\n The only way to prevent users from accessing sensitive information would be to encrypt the columns before writing the files to S3,<\/strong> which can be slow<\/strong> and cumbersome,<\/strong> and opens a whole new \u2018can of worms\u2019 regarding key storage<\/strong>, sharing,<\/strong> and eventual key decommissioning<\/strong>.<\/p>\n\n\n\n Furthermore, giving access to external entities using IAM principals is often a non-trivial problem on its own<\/strong>.<\/p>\n\n\n\n Luckily, AWS offers a batteries-included solution to the S3 Data Lake permission problem<\/strong>: enter AWS Lake Formation!<\/p>\n\n\n\n AWS Lake Formation is a fully managed service that simplifies building, securing, and managing data lakes, automating many of the complex manual steps required to create them.<\/p>\n\n\n\n Lake Formation also provides its own permissions model, the one we want to explore in detail, which augments the classical AWS IAM permissions model<\/strong>.<\/p>\n\n\n\n This centrally defined permissions model enables fine-grained access to data stored in data lakes through a simple grant\/revoke mechanism.<\/p>\n\n\n\n So, by leveraging Lake Formation, we would like to demonstrate with a simple solution how to address the aforementioned S3 challenges. Let\u2019s continue!<\/p>\n\n\n\n To help the reader understand why AWS Lake Formation can be a good choice for dealing with the complexities of managing a Data Lake, we have prepared a simple tutorial on how to migrate heterogeneous data.<\/p>\n\n\n\n The data moves from legacy on-prem databases into S3, and a Lake Formation catalog is created along the way to deal with data cleansing, permissions, and further operations.<\/p>\n\n\n\n <\/p>\n\n\n\nWhat is TBAC access<\/h2>\n\n\n\n
Why S3 alone is not enough<\/h2>\n\n\n\n
Leveraging TBAC approach in Lake Formation<\/h2>\n\n\n\n