{"id":7690,"date":"2025-03-12T09:00:00","date_gmt":"2025-03-12T08:00:00","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=7690"},"modified":"2025-03-12T10:25:02","modified_gmt":"2025-03-12T09:25:02","slug":"democratize-data-access-through-a-self-service-data-platform-using-aws-lakeformation-part-2","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/democratize-data-access-through-a-self-service-data-platform-using-aws-lakeformation-part-2\/","title":{"rendered":"Democratize data access through a self-service Data Platform using AWS LakeFormation – Part 2"},"content":{"rendered":"\n

In this series of articles, we will describe how to properly create and structure a self-service Data Platform for data democratization and analytics on AWS. We will start with data ingestion and storage, then move through the processing tools that turn raw data into valuable assets for analytics, visualization, and reporting. Moreover, we will focus on data governance, discoverability, and collaboration, with an eye on security and access control.

Follow this article to learn how to democratize data access through your self-service Data Platform: using AWS LakeFormation, you can ensure data governance and properly structure data, access, and visibility. Don't forget to keep an eye on the website for part 3!

This article is a sequel to our description of data platforms and their data pipelines, building on top of those concepts. If you are still getting familiar with them, or just need a refresher, here is **part 1**.

## TL;DR

Ingest your data sources into S3 buckets and register the data locations in AWS LakeFormation. Catalog the data with databases, tables, and columns. Define LF-Tags and associate them with these catalog resources to perform attribute-based access control (ABAC). Define roles and grant them tag-based permissions to enable data access. Create an administrator with grantable permissions on specific areas, and use tags for data discoverability to democratize data and achieve self-service data access.
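As a preview of the steps detailed below, here is a minimal boto3 sketch of the tag-based flow; the tag, database, account, and role names are hypothetical placeholders, not values from our setup:

```python
import boto3

lf = boto3.client("lakeformation")

# Define an LF-Tag and attach it to a cataloged database
# (assumes the tag does not already exist; names are placeholders)
lf.create_lf_tag(TagKey="layer", TagValues=["bronze", "silver", "gold"])
lf.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "sales_gold_db"}},
    LFTags=[{"TagKey": "layer", "TagValues": ["gold"]}],
)

# Grant tag-based (ABAC) permissions to a role: it can now SELECT
# from any table carrying the layer=gold tag
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={"LFTagPolicy": {
        "ResourceType": "TABLE",
        "Expression": [{"TagKey": "layer", "TagValues": ["gold"]}],
    }},
    Permissions=["SELECT"],
)
```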

## The Challenge of Data Democratization

In today's data-driven world, organizations face a critical paradox: they are swimming in vast oceans of data, yet most struggle to use this valuable resource effectively.

Traditional data management approaches tended to organize data in separate, disconnected structures: silos. In these approaches, each silo is usually accessible only by its own technical department, creating several problems along the way.

The challenges of data democratization extend beyond technical limitations. This separation into silos creates barriers that keep analysts from widespread data access: even the most basic request means submitting time-consuming tickets to IT or data teams. Users operate with incomplete information, having a very hard time seeing the "bigger picture", and the potential competitive advantage of data-driven decision-making remains unrealized.

Many companies find themselves trapped in a cycle of manual access management, where data access requests require multiple approvals, complex permission configurations, and ongoing maintenance. This not only creates a significant administrative burden but also slows down innovation.

The data lake architecture helps solve this challenge by concentrating all data in a single place: everyone needing access to data knows where to look. But all that glitters is not gold! Aggregating all data into a single place creates a new, different challenge: user access management. Even though everyone can now, potentially, access the data, is it safe?

Organizations must simultaneously balance two competing priorities: enabling broad data access while maintaining rigorous data governance and security protocols. The risk of exposing sensitive information, coupled with compliance requirements like GDPR, CCPA, and industry-specific regulations, creates a significant overhead in managing data permissions.

Here is where **AWS LakeFormation** can become a very handy tool!

## What is AWS LakeFormation?

AWS Lake Formation is a fully managed service that simplifies the creation, security, and management of data lakes.

At its core, the service simplifies the traditionally complex and time-consuming process of consolidating data from multiple sources into a unified and secure repository, the data lake, within a few days instead of months or years. Unlike traditional data management approaches that require extensive manual configuration and complex infrastructure setup, AWS LakeFormation automates critical tasks such as data ingestion, metadata cataloging, and access control. It is a centralized platform that abstracts away technical complexities, allowing data engineers, analysts, and business leaders to focus on what really matters: extracting insights and real value from data.

Moreover, AWS LakeFormation provides robust governance and security capabilities, essential features for data governance in data-driven enterprises. The service offers granular, attribute-based access controls that enable organizations to define precise data access policies at the database, table, column, and even row level. This means businesses can implement fine-grained security mechanisms that protect sensitive information while still achieving data democratization. By seamlessly integrating with other AWS services like Amazon S3, AWS Glue, and Amazon Athena, AWS LakeFormation creates a comprehensive ecosystem that supports the entire data lifecycle, from raw data ingestion and transformation to analysis and visualization. Its ability to centralize metadata management, automate data discovery, and provide consistent security across diverse data sources makes it a pivotal tool for enterprises seeking to leverage their data assets efficiently and securely.
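To make column-level control concrete, here is a hedged boto3 sketch; the database, table, role, and column names are invented for illustration. It grants SELECT on a table while excluding PII columns:

```python
import boto3

lf = boto3.client("lakeformation")

# Grant SELECT on every column of the table except the PII ones
# (database, table, role, and column names are hypothetical placeholders)
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/InternalTeamRole"},
    Resource={"TableWithColumns": {
        "DatabaseName": "gold_db",
        "Name": "customers",
        "ColumnWildcard": {"ExcludedColumnNames": ["email", "phone_number"]},
    }},
    Permissions=["SELECT"],
)
```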

## Govern your Data Lake

Now that we've described the challenges and the toolset, let's get our hands dirty and put it all into action!

If you read the first article of this series, you already know what we are working with, but to get everyone on the same page, here is a very brief overview of the setup.

Acting as a data engineer for a fictional company that helps its customers increase their revenues, you created a data platform following the standard **medallion architecture**. You developed ingestion and transformation logic to gather data and move it through the increasingly refined layers of the data platform.


The company now asks you to govern the data platform, making data accessible to internal teams and customers. Customers just want to see and query their data. Meanwhile, internal teams need to visualize data and use it to train machine learning models that help customers achieve their goals. Additionally, you need to keep an eye on data access and security: customers must see only their own data! Moreover, customers' data contains PII, which is not useful to internal teams and should not be visible to them.

#### Data Ingestion

We already have all raw data ingested into the bronze layer bucket; however, here is a quick tip that may be useful for readers implementing their own ingestion.

AWS LakeFormation offers blueprints to ingest data from relational databases, CloudTrail, and load balancer logs. Blueprints are pre-defined templates that create all the resources needed to perform ingestion from your sources. Under the hood, a blueprint creates a Glue workflow, composed of Glue jobs and crawlers, that ingests data into your S3 buckets and updates the Glue Data Catalog.
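Once a blueprint has generated its workflow, you can trigger and inspect it programmatically. A small sketch, assuming a hypothetical workflow name:

```python
import boto3

glue = boto3.client("glue")

# Start a run of the workflow generated by the blueprint
# ("lf-blueprint-db-ingestion" is a hypothetical name)
run = glue.start_workflow_run(Name="lf-blueprint-db-ingestion")
print("Started workflow run:", run["RunId"])

# Inspect the jobs, crawlers, and triggers the blueprint wired together
graph = glue.get_workflow(Name="lf-blueprint-db-ingestion", IncludeGraph=True)
for node in graph["Workflow"]["Graph"]["Nodes"]:
    print(node["Type"], node["Name"])
```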

#### Register Data Lake Locations

First, we need to make AWS LakeFormation aware of the assets composing our data lake. To do so, we register their S3 locations; you can register entire buckets or specific paths inside them (a minimal registration call is sketched below). Following the medallion architecture, we created 3 buckets:
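Here is the registration step as a hedged boto3 sketch; the bucket name is a hypothetical placeholder:

```python
import boto3

lf = boto3.client("lakeformation")

# Register an S3 location so Lake Formation can manage access to it
# (bucket name is a hypothetical placeholder)
lf.register_resource(
    ResourceArn="arn:aws:s3:::example-bronze-layer",
    UseServiceLinkedRole=True,  # or pass RoleArn="..." to use a custom role
)
```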