{"id":5761,"date":"2023-04-14T09:00:00","date_gmt":"2023-04-14T07:00:00","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=5761"},"modified":"2023-04-13T16:29:58","modified_gmt":"2023-04-13T14:29:58","slug":"efficiently-stream-dml-events-on-aws","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/efficiently-stream-dml-events-on-aws\/","title":{"rendered":"Efficiently stream DML events on AWS"},"content":{"rendered":"\n

In the past, many workloads shipped with their own large, monolithic database, to which not only the application but also reporting tools and technical support connected and ran queries. <\/p>\n\n\n\n

While this is still true today, companies are moving towards distributing information across multiple data sources and servers. Only the core application should be able to access the database directly, reporting tools should use data stored on a separate instance, and monitoring and data analytics should aggregate data coming from different sources.<\/p>\n\n\n\n

To do this, we need to stream the changes occurring in our database to one or more destinations. Today we are going to take a look at how to do this on AWS.<\/p>\n\n\n\n

AWS Database Migration Service (DMS)<\/strong> is a powerful tool for migrating data between various database platforms. One of the standout features of AWS DMS is its Change Data Capture (CDC) functionality, which allows for real-time streaming of changes made to a source database to a target database.<\/p>\n\n\n\n

When using AWS DMS, you have the option to attach a target database directly as an endpoint or use Amazon Kinesis Data Streams to capture and process the streaming data. <\/p>\n\n\n\n
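To make the second option concrete, here is a minimal sketch of registering a Kinesis Data Stream as a DMS target endpoint with the AWS CLI; the endpoint identifier, stream ARN, and role ARN below are placeholders, not values from this article:

```shell
# Register an existing Kinesis Data Stream as a DMS target endpoint.
# The identifier and both ARNs are hypothetical examples.
aws dms create-endpoint \
  --endpoint-identifier cdc-kinesis-target \
  --endpoint-type target \
  --engine-name kinesis \
  --kinesis-settings "StreamArn=arn:aws:kinesis:eu-west-1:123456789012:stream/dms-cdc-stream,MessageFormat=json,ServiceAccessRoleArn=arn:aws:iam::123456789012:role/dms-kinesis-role"
```

With this setup, DMS serializes each captured change as a JSON record and puts it on the stream, where any number of consumers can pick it up.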

Here are some differences between the two approaches:<\/p>\n\n\n\n

    \n
  1. Latency: when streaming data directly to a target database, there may be some latency involved in processing and writing the data. With Kinesis Data Streams, the data is captured and processed in near real-time, so processing delay is minimal.<\/li>\n\n\n\n
  2. Scalability: Kinesis Data Streams is designed to handle large volumes of streaming data, and can automatically scale to accommodate increased traffic. When streaming data directly to a target database, you may need to manually scale the database to handle increased traffic.<\/li>\n\n\n\n
  3. Flexibility: with Kinesis Data Streams, you can easily process and analyze the streaming data using various AWS services, such as AWS Glue or AWS Lambda. When streaming data directly to a target database, you may have limited options for processing and analyzing the data.<\/li>\n\n\n\n
  4. Cost: using Kinesis Data Streams may incur additional costs for processing and storing the streaming data, as well as any associated AWS services used for processing and analysis. Streaming data directly to a target database may not have any additional costs, but you may need to consider the cost of scaling the database to handle increased traffic.<\/li>\n<\/ol>\n\n\n\n

    Overall, both approaches have their advantages and disadvantages, and the best choice depends on your specific use case and requirements. In this article, we are going to explore how to process insert\/update\/delete events in flight with the help of Amazon Kinesis Data Streams.<\/p>\n\n\n\n

    Set up efficient DML event streaming on AWS<\/h2>\n\n\n\n

    Now, let\u2019s build a proof of concept to test out the CDC streaming solution with DMS and Kinesis Data Streams. The idea is to have an automated process that gives us an easy way to replicate changes that happen on a source database to one or more destination engines.<\/p>\n\n\n\n

    This is a diagram of what we\u2019re going to build:<\/p>\n\n\n\n

    <\/p>\n\n\n

    \n
    \"stream<\/figure><\/div>\n\n\n

    <\/p>\n\n\n\n

    The ingestion<\/h3>\n\n\n\n

    The first thing we need to do, if we want to enable CDC, is configure our source database<\/strong> to make available all the information DMS needs to capture new events. For many engines, this means running a few queries that are well-described in the official AWS documentation<\/a>. <\/p>\n\n\n\n
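As an illustration, for an Amazon RDS for MySQL source these prerequisites boil down to enabling row-based binary logging and extending binlog retention so DMS can read past events. The parameter group name and database endpoint below are placeholders:

```shell
# Switch the source's parameter group to row-based binary logging
# (hypothetical parameter group name).
aws rds modify-db-parameter-group \
  --db-parameter-group-name my-source-params \
  --parameters "ParameterName=binlog_format,ParameterValue=ROW,ApplyMethod=immediate"

# Keep binary logs for 24 hours so the DMS task can catch up after a stop
# (RDS-specific stored procedure; hypothetical endpoint).
mysql -h my-source-db.example.com -u admin -p \
  -e "CALL mysql.rds_set_configuration('binlog retention hours', 24);"
```

Other engines (Oracle, PostgreSQL, SQL Server, and so on) have their own prerequisites, so always check the documentation for your specific source.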

    After the source database has been configured, let\u2019s create our Kinesis Data Stream<\/strong>. <\/p>\n\n\n\n

    This step is pretty straightforward, as we don\u2019t need to provide many parameters. We only need to decide on our data stream\u2019s capacity mode: <\/p>\n\n\n\n