{"id":2584,"date":"2021-02-04T12:49:38","date_gmt":"2021-02-04T11:49:38","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=2584"},"modified":"2023-03-24T18:30:16","modified_gmt":"2023-03-24T17:30:16","slug":"iot-ingestion-and-ml-analytics-pipeline-with-aws-iot-kinesis-and-sagemaker","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/iot-ingestion-and-ml-analytics-pipeline-with-aws-iot-kinesis-and-sagemaker\/","title":{"rendered":"IoT ingestion and ML analytics pipeline with AWS IoT, Kinesis and SageMaker"},"content":{"rendered":"\n
Machine Learning is rapidly becoming part of our daily life: it lets software and devices manage routines without human intervention, and it gives us the ability to automate, standardize, and simplify many everyday tasks. Home automation is one interesting example: we now have intelligent lights, smart heating, and autonomous robots that clean floors even in complex home layouts filled with obstacles. <\/p>\n\n\n\n
Generally speaking, the information retrievable from connected devices is nearly infinite. The low cost of data acquisition, together with the computational power now available to manage big data, has made Machine Learning accessible to many use cases. One of the most interesting is the ingestion and real-time analysis of data coming from IoT connected devices.<\/p>\n\n\n\n
In this article, we would like to share a solution that takes advantage of AWS managed services to handle high volumes of real-time data coming from one or more IoT connected devices. We\u2019ll show in detail how to set up a pipeline that gives potential users access to near real-time forecasting results based on the received IoT data. <\/p>\n\n\n\n
The solution will also explore some key concepts related to Machine Learning, ETL jobs, Data Cleaning, and data lake preparation.<\/p>\n\n\n\n
But before jumping into code and infrastructure design, a little recap on ML, IoT, and ETL is needed. Let\u2019s dive together into it!<\/p>\n\n\n\n
The Internet of Things (IoT) is a common way to describe a set of interconnected physical devices \u2014 \u201cthings\u201d \u2014 fitted with sensors, that exchange data with each other and over the Internet.<\/p>\n\n\n\n
IoT has evolved rapidly due to the decreasing cost of smart sensors, and to the convergence of multiple technologies like real-time analytics, machine learning, and embedded systems.<\/p>\n\n\n\n
Of course, traditional fields of embedded systems, wireless sensor networks, control systems, and automation, also contribute to the IoT world.<\/p>\n\n\n\n
ML was born as an evolution of Artificial Intelligence<\/strong>. Traditionally, programmers had to write complex and difficult-to-maintain heuristics in order to carry out a typically human task (e.g. text recognition in images) using a computer.<\/p>\n\n\n\n With Machine Learning, it is the system itself that learns relationships between data.<\/p>\n\n\n\n For example, in a chess game, there is no longer a hand-written algorithm that plays chess: by providing a dataset of features describing chess games, the model learns to play by itself. <\/p>\n\n\n\n Machine Learning also makes sense in a distributed context<\/strong> where predictions must scale<\/strong>.<\/p>\n\n\n\n In a Machine Learning pipeline, the data must be uniform, i.e. standardized. Differences in the data can result from heterogeneous sources, such as different DB table schemas or different data ingestion workflows. <\/p>\n\n\n\n Transformation (ETL: Extract, Transform, Load) of the data is thus an essential step in every ML pipeline. Standardized data is not only essential for training the ML model, but is also much easier to analyse and visualize in the preliminary data discovery<\/strong> step.<\/p>\n\n\n\n In general, for data cleaning and formatting, NumPy, Pandas, SciKit-Learn, and similar libraries are used:<\/p>\n\n\n\n – NumPy<\/strong>: library for the management of multidimensional arrays; it is mainly used in the importing and reading phase of a dataset.<\/p>\n\n\n\n – Pandas<\/strong> Dataframe<\/strong>: library for managing data in table format. It takes data points from CSV<\/strong>, JSON<\/strong>, Excel<\/strong>, and pickle<\/strong> files and transforms them into tables. <\/p>\n\n\n\n – SciKit-Learn<\/strong>: library for final data manipulation and model training.<\/p>\n\n\n\n To achieve our result, we will make extensive use of what AWS offers in terms of managed services. 
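To make the cleaning and standardization step concrete, here is a minimal sketch using the libraries mentioned above. The CSV contents, file name, and column names are invented for the example, not taken from the article's dataset:

```python
# Minimal sketch of a cleaning/standardization step with Pandas,
# NumPy, and SciKit-Learn. The columns below are hypothetical.
import io

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for a CSV produced by heterogeneous ingestion workflows.
raw = io.StringIO(
    "device_id,temperature,humidity\n"
    "dev-1,21.5,40\n"
    "dev-2,,55\n"       # a row with a missing reading
    "dev-1,23.0,38\n"
)

df = pd.read_csv(raw)

# Basic cleaning: drop incomplete rows and enforce numeric dtypes.
df = df.dropna().astype({"temperature": np.float64, "humidity": np.float64})

# Standardize the features to zero mean and unit variance,
# so differences in scale between sources don't bias the model.
scaled = StandardScaler().fit_transform(df[["temperature", "humidity"]])

print(scaled.mean(axis=0))  # close to [0, 0]
print(scaled.std(axis=0))   # close to [1, 1]
```

The same pattern applies unchanged when the input comes from JSON, Excel, or pickle files: only the `pd.read_*` call changes.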
Here is a simple sketch, showing the main actors involved in our Machine Learning pipeline.<\/p>\n\n\n\nData Transformation<\/h3>\n\n\n\n
Cleaning and formatting the data is essential to give the model the best chance to converge well<\/strong> to the desired solution.<\/p>\n\n\n\nThe Pipeline<\/h2>\n\n\n\n
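On the ingestion side of such a pipeline, device readings typically end up as records on a Kinesis Data Stream. Here is a minimal, illustrative sketch using boto3; the stream name, payload fields, and device identifiers are assumptions for the example, not values from this article, and the call requires AWS credentials configured in the environment:

```python
import json
import time


def encode_reading(device_id, temperature):
    """Serialize one sensor reading as the JSON bytes pushed to the stream."""
    payload = {
        "device_id": device_id,
        "temperature": temperature,
        "timestamp": int(time.time()),
    }
    return json.dumps(payload).encode("utf-8")


def send_to_kinesis(stream_name, device_id, temperature):
    """Put one reading on a Kinesis Data Stream (stream name is hypothetical)."""
    import boto3  # needs AWS credentials and a region configured

    kinesis = boto3.client("kinesis")
    return kinesis.put_record(
        StreamName=stream_name,
        Data=encode_reading(device_id, temperature),
        # Using the device id as partition key keeps each device's
        # readings ordered within a shard.
        PartitionKey=device_id,
    )


if __name__ == "__main__":
    # Example usage against a hypothetical stream:
    # send_to_kinesis("iot-ingestion-stream", "dev-1", 21.5)
    print(encode_reading("dev-1", 21.5))
```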