{"id":5171,"date":"2022-11-25T14:08:22","date_gmt":"2022-11-25T13:08:22","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=5171"},"modified":"2022-11-25T14:08:25","modified_gmt":"2022-11-25T13:08:25","slug":"on-demand-data-lakes-on-amazon-s3-how-to-tackle-etl-and-datamart-at-scale","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/on-demand-data-lakes-on-amazon-s3-how-to-tackle-etl-and-datamart-at-scale\/","title":{"rendered":"On-Demand Data lakes on Amazon S3: how to tackle ETL and DataMart at scale"},"content":{"rendered":"\n
Big Data started with a bang in the mid-2000s with the development of the MapReduce methodology at Google, and it has kept growing at a breakneck pace thanks to a continuous stream of ever-better tools: Apache Hadoop, Spark, and Pandas were all developed in this timeframe. <\/p>\n\n\n\n
Concurrently, more and more Cloud providers and system integrators have started offering managed Big Data and Data Lake solutions (from Cloudera to AWS Glue) to meet the growing demand from companies eager to analyze and monetize their data.<\/p>\n\n\n\n
While Big Data has meanwhile stopped being a buzzword, supplanted in that role by newer and more appealing terms (such as Blockchain and Quantum Computing), the need for companies to leverage data to better target their customers, optimize their products, and refine their processes has markedly increased.<\/p>\n\n\n\n
In this short article, we will describe how to create a dynamic, decentralized, multi-account Data Lake on AWS<\/strong>, leveraging AWS Glue<\/strong>, Athena<\/strong>, and Redshift<\/strong>.<\/p>\n\n\n\n
The Problem at hand<\/strong><\/h2>\n\n\n\n
Usually, data lakes are unstructured data repositories that collect input data from heterogeneous data sources such as legacy SQL databases, document databases (e.g. MongoDB), key-value databases (e.g. Cassandra), and raw files from various sources (SFTP servers, Samba shares, object storages).<\/p>\n\n\n\n
In this case, our requirement is to split the data lake so that each internal project in the customer's company structure can only access its own segregated data silo and set up the ETL operations it needs. A selected number of users from the general administration will be able to access data from several silos in order to read and aggregate it for company-wide analytics, business intelligence, and general reporting.<\/p>\n\n\n\n
In order to meet these requirements and ensure the strongest possible segregation<\/strong> between the different silos, we decided to split the projects into several AWS accounts using AWS Organizations. The account structure is represented in the diagram below:<\/p>\n\n\n\n
[Account structure diagram]<\/p>\n\n\n
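To give a concrete feel for this kind of per-silo segregation, the sketch below builds an S3 bucket policy that grants one project account read/write access to its own silo bucket while a general-administration account gets read-only access. This is an illustrative sketch only, not the actual policies used in this setup: the function name, bucket name, and account IDs are all hypothetical.

```python
import json


def silo_bucket_policy(bucket: str, project_account_id: str, admin_account_id: str) -> dict:
    """Build a hypothetical S3 bucket policy for one data silo:
    the project account gets read/write, the administration
    account gets read-only access for company-wide reporting."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ProjectReadWrite",
                "Effect": "Allow",
                # Root principal delegates fine-grained control to IAM in the project account
                "Principal": {"AWS": f"arn:aws:iam::{project_account_id}:root"},
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            },
            {
                "Sid": "AdminReadOnly",
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{admin_account_id}:root"},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            },
        ],
    }


# Hypothetical bucket name and account IDs, for illustration only
policy = silo_bucket_policy("project-a-datalake", "111111111111", "999999999999")
print(json.dumps(policy, indent=2))
```

Because each silo lives in its own account, a policy like this is the cross-account boundary: no IAM user in another project account can reach the bucket unless its account appears as a principal here.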