{"id":5171,"date":"2022-11-25T14:08:22","date_gmt":"2022-11-25T13:08:22","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=5171"},"modified":"2022-11-25T14:08:25","modified_gmt":"2022-11-25T13:08:25","slug":"on-demand-data-lakes-on-amazon-s3-how-to-tackle-etl-and-datamart-at-scale","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/on-demand-data-lakes-on-amazon-s3-how-to-tackle-etl-and-datamart-at-scale\/","title":{"rendered":"On-Demand Data lakes on Amazon S3: how to tackle ETL and DataMart at scale"},"content":{"rendered":"\n
Big Data started with a bang in the mid-2000s with the development of the MapReduce methodology at Google, and it has kept growing at a breakneck pace thanks to a steady stream of ever-better tools: Apache Hadoop, Spark, and Pandas were all developed in this timeframe.
Concurrently, a growing number of Cloud providers and service integrators have started offering managed Big Data and data lake solutions, from Cloudera to AWS Glue, to meet the demand of companies increasingly eager to analyze and monetize their data.
In the meantime, Big Data has stopped being a buzzword, supplanted in that role by newer and more appealing ones (such as Blockchain and Quantum Computing), but the need for companies to leverage data to better target their customers, optimize their products, and refine their processes has markedly increased.
In this short article, we will describe how to create a **dynamic, decentralized, multi-account data lake on AWS**, leveraging **AWS Glue**, **Athena**, and **Redshift**.

## The Problem at hand

Usually, data lakes are unstructured data repositories that collect input data from heterogeneous data sources such as legacy SQL databases, document databases (e.g. MongoDB), key-value databases (e.g. Cassandra), and raw files from various origins (SFTP servers, Samba shares, object storage).

In this case, our requirement was to split the data lake so that each internal project in the customer's company structure can only access its own segregated data silo and set up the ETL operations it needs. A selected group of users from the general administration must be able to read and aggregate data from several silos for company-wide analytics, business intelligence, and general reporting.

To meet these requirements and ensure the strongest possible **segregation** between the different silos, we decided to split the projects into several AWS accounts using AWS Organizations. The account structure is represented in the diagram below:

[Diagram: multi-account structure, with one AWS account per project silo under an AWS Organizations organizational unit]

To keep things interesting, another requirement was the ability to create credentials for third parties so they can send data directly to the data lake, either through APIs or SFTP.

This means that each account contains not only the S3 bucket with the data and the Glue/Step Functions jobs needed to transform it (which differ from silo to silo), but also an admin web application to manage third-party access through temporary IAM credentials, and a frontend deployed on CloudFront that gives users a simple interface for loading data directly onto S3.

[Diagram: serverless upload web application, with a CloudFront frontend and a Lambda backend]

If you are wondering how we managed the SFTP part, it is nearly straightforward: we simply activated the AWS Transfer Family service for SFTP with a custom Lambda to authenticate users with the same IAM credentials used for web application access (a minimal sketch of such an authentication Lambda is shown at the end of this section).

Thus, by developing a relatively straightforward web application, we created a fully serverless interface to our S3 buckets: internal users can create temporary credentials that allow external users to upload new files to an S3 dropzone.

If you are wondering what kind of black magic the Lambda backend in the diagram above performs in order to do this without a database, the answer is very simple: the state of our credentials vending machine is a collection of AWS resources (users, roles, buckets), so we create them directly with CloudFormation and preserve the state directly in the template!

Using cross-account pipelines, all the infrastructure and application components can be deployed automatically on each account in a self-service way: when an account is created in the relevant AWS Organizations organizational unit, a CloudFormation StackSet deploys the basic infrastructure components into it, together with an AWS CodePipeline that fetches the application code from the master data lake account and deploys it in the target silo account. The sketches below illustrate some of these building blocks.
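As an illustration of the SFTP authentication flow described above, here is a minimal sketch (in Python) of a custom identity provider Lambda for AWS Transfer Family. The `validate_user` helper, the environment variable names, and the per-user prefix layout are assumptions made for the example, not details from the original setup: Transfer Family only requires that the function return an empty response to deny access, or a role and home directory to grant it.

```python
import json
import os


def validate_user(username: str, password: str) -> bool:
    """Placeholder: check the credentials against the same user store the
    web application uses. The article does not show this logic."""
    raise NotImplementedError


def lambda_handler(event, context):
    # AWS Transfer Family passes the login attempt in the event:
    # username, password (empty for SSH-key auth), serverId, protocol, sourceIp.
    username = event.get("username", "")
    password = event.get("password", "")

    # Returning an empty dict tells Transfer Family to reject the session.
    if not password or not validate_user(username, password):
        return {}

    bucket = os.environ["DROPZONE_BUCKET"]         # assumed env var
    role_arn = os.environ["SFTP_ACCESS_ROLE_ARN"]  # assumed env var

    return {
        "Role": role_arn,  # IAM role Transfer Family assumes for S3 access
        # Chroot the user into a per-user prefix of the dropzone bucket.
        "HomeDirectoryType": "LOGICAL",
        "HomeDirectoryDetails": json.dumps(
            [{"Entry": "/", "Target": f"/{bucket}/uploads/{username}"}]
        ),
    }
```

Using a logical home directory keeps each external party locked into its own prefix without having to generate a different bucket policy per user.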
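The temporary credentials that internal users hand out for the S3 dropzone can be produced in several ways; a common one is an STS `AssumeRole` call with an inline session policy that narrows access to a single prefix. The sketch below assumes a pre-existing upload role and a `dropzone/<partner>/` prefix convention, both of which are illustrative rather than taken from the article.

```python
import json

import boto3

sts = boto3.client("sts")


def issue_dropzone_credentials(partner_name: str, bucket: str,
                               upload_role_arn: str,
                               duration_seconds: int = 3600):
    """Return temporary credentials that only allow uploads under one prefix.

    upload_role_arn is a role that already allows s3:PutObject on the dropzone
    bucket; the inline session policy below further restricts it.
    """
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::{bucket}/dropzone/{partner_name}/*",
        }],
    }
    resp = sts.assume_role(
        RoleArn=upload_role_arn,
        RoleSessionName=f"dropzone-{partner_name}",
        Policy=json.dumps(session_policy),  # intersected with the role's policy
        DurationSeconds=duration_seconds,
    )
    # AccessKeyId, SecretAccessKey, SessionToken, Expiration
    return resp["Credentials"]
```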
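To make the "CloudFormation as the only state store" idea concrete, the following sketch shows a backend operation that adds a third-party IAM user to the stack holding all vending-machine resources. The stack name, the resource naming, and the assumption that the template is stored as JSON are illustrative; the key point is that listing, adding, and removing users happens through `get_template`/`update_stack`, so no separate database is needed.

```python
import json

import boto3

cfn = boto3.client("cloudformation")

STACK_NAME = "third-party-access"  # assumed stack name, one per silo account


def add_third_party_user(username: str, bucket: str) -> None:
    """Add an IAM user to the CloudFormation stack that acts as the state store."""
    template = cfn.get_template(StackName=STACK_NAME)["TemplateBody"]
    if isinstance(template, str):  # boto3 returns a dict for JSON templates
        template = json.loads(template)

    # The template itself records which third-party users exist.
    template.setdefault("Resources", {})[f"User{username.capitalize()}"] = {
        "Type": "AWS::IAM::User",
        "Properties": {
            "UserName": username,
            "Policies": [{
                "PolicyName": "dropzone-upload",
                "PolicyDocument": {
                    "Version": "2012-10-17",
                    "Statement": [{
                        "Effect": "Allow",
                        "Action": "s3:PutObject",
                        "Resource": f"arn:aws:s3:::{bucket}/dropzone/{username}/*",
                    }],
                },
            }],
        },
    }

    cfn.update_stack(
        StackName=STACK_NAME,
        TemplateBody=json.dumps(template),
        Capabilities=["CAPABILITY_NAMED_IAM"],  # required for named IAM users
    )
```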
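Finally, a sketch of how the self-service account bootstrap can be wired up with a service-managed CloudFormation StackSet: with auto-deployment enabled and the silo organizational unit as the deployment target, every new account created in that OU automatically receives the baseline stack (data bucket, roles, and the CodePipeline that pulls the application code from the master data lake account). This call has to run from the Organizations management account or a delegated administrator; the template file name and OU id below are placeholders, not values from the original project.

```python
import boto3

cfn = boto3.client("cloudformation")

# Baseline template (data bucket, roles, CodePipeline bootstrap), assumed to be
# stored alongside the master data lake account's deployment code.
with open("silo-baseline.yaml") as f:
    baseline_template = f.read()

# A service-managed stack set targeting an OU auto-deploys to every account
# that is later created in (or moved into) that OU.
cfn.create_stack_set(
    StackSetName="datalake-silo-baseline",
    TemplateBody=baseline_template,
    PermissionModel="SERVICE_MANAGED",
    AutoDeployment={"Enabled": True, "RetainStacksOnAccountRemoval": False},
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

cfn.create_stack_instances(
    StackSetName="datalake-silo-baseline",
    DeploymentTargets={"OrganizationalUnitIds": ["ou-xxxx-datalake"]},  # placeholder OU id
    Regions=["eu-west-1"],
)
```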