{"id":2584,"date":"2021-02-04T12:49:38","date_gmt":"2021-02-04T11:49:38","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=2584"},"modified":"2023-03-24T18:30:16","modified_gmt":"2023-03-24T17:30:16","slug":"iot-ingestion-and-ml-analytics-pipeline-with-aws-iot-kinesis-and-sagemaker","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/iot-ingestion-and-ml-analytics-pipeline-with-aws-iot-kinesis-and-sagemaker\/","title":{"rendered":"IoT ingestion and ML analytics pipeline with AWS IoT, Kinesis and SageMaker"},"content":{"rendered":"\n

Introduction<\/h2>\n\n\n\n

Machine Learning is rapidly becoming part of our daily life: it lets software and devices run routines without human intervention and gives us the ability to automate, standardize, and simplify many everyday tasks. One interesting example is home automation, where it is now possible to have intelligent lights, smart heating, and autonomous robots that clean floors even in complex home layouts filled with obstacles. <\/p>\n\n\n\n

Generally speaking, the information retrievable from connected devices is nearly limitless. The low cost of data acquisition and the computational power available to manage big data have made Machine Learning accessible to many use cases. One of the most interesting is the ingestion and real-time analysis of data coming from IoT connected devices.<\/p>\n\n\n\n

In this article, we would like to share a solution that takes advantage of AWS managed services to handle high volumes of data coming in real-time from one or more IoT connected devices. We\u2019ll show in detail how to set up a pipeline that gives users access to near real-time forecasting results based on the received IoT data. <\/p>\n\n\n\n

The solution will also explore some key concepts related to Machine Learning, ETL jobs, Data Cleaning, and data lake preparation.<\/p>\n\n\n\n

But before jumping into code and infrastructure design, a little recap on ML, IoT, and ETL is needed. Let\u2019s dive into it together!<\/p>\n\n\n\n

IoT, Machine Learning and Data Transformation: key concepts<\/h2>\n\n\n\n

IoT<\/h3>\n\n\n\n

The Internet of Things (IoT) is a common way to describe a set of interconnected physical devices \u2014 \u201cthings\u201d \u2014 fitted with sensors that exchange data with each other and over the Internet.<\/p>\n\n\n\n

IoT has evolved rapidly due to the decreasing cost of smart sensors, and to the convergence of multiple technologies like real-time analytics, machine learning, and embedded systems.<\/p>\n\n\n\n

Of course, the traditional fields of embedded systems, wireless sensor networks, control systems, and automation also contribute to the IoT world.<\/p>\n\n\n\n

Machine Learning<\/h3>\n\n\n\n

ML was born as an evolution of Artificial Intelligence<\/strong>. Traditionally, programmers had to write complex and difficult-to-maintain heuristics in order to carry out a typically human task (e.g. text recognition in images) using a computer.<\/p>\n\n\n\n

With Machine Learning it is the system itself that learns relationships between data.<\/p>\n\n\n\n

For example, in a chess game, there is no longer a hand-written algorithm that decides the moves: by providing a dataset of features concerning chess games, the model learns to play by itself. <\/p>\n\n\n\n

Machine Learning also makes sense in a distributed context<\/strong> where the prediction must scale<\/strong>.<\/p>\n\n\n\n

Data Transformation<\/h3>\n\n\n\n

In a Machine Learning pipeline, the data must be uniform, i.e. standardized. Differences in the data can result from heterogeneous sources, such as different DB table schemas or different data ingestion workflows. <\/p>\n\n\n\n

Transformation (ETL: Extract, Transform, Load) of data is thus an essential step in all ML pipelines. Standardized data are not only essential for training the ML model but are also much easier to analyse and visualize in the preliminary data discovery<\/strong> step.<\/p>\n\n\n\n

In general, data cleaning and formatting are carried out with NumPy, Pandas, SciKit-Learn, and similar libraries.<\/p>\n\n\n\n

NumPy<\/strong> – a library for managing multidimensional arrays; it is mainly used in the importing and reading phase of a dataset.<\/p>\n\n\n\n

Pandas<\/strong> – a library for managing data in table format; its DataFrame<\/strong> takes data points from CSV<\/strong>, JSON<\/strong>, Excel<\/strong>, and pickle<\/strong> files and transforms them into tables. <\/p>\n\n\n\n

SciKit-Learn<\/strong> – a library for final data manipulation and model training.<\/p>\n\n\n\n

Cleaning and formatting the data is essential to give the model the best chance to converge well <\/strong>to the desired solution.<\/p>\n\n\n\n
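As a minimal sketch of this cleaning step (the file name and column names below are hypothetical), a typical Pandas workflow could look like this:<\/p>\n\n\n\n

<pre><code>import pandas as pd

# Load raw sensor readings (hypothetical CSV produced by the ingestion step)
df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

# Drop duplicates and rows with missing readings
df = df.drop_duplicates().dropna(subset=["temperature"])

# Standardize the numeric column: zero mean, unit variance
mean, std = df["temperature"].mean(), df["temperature"].std()
df["temperature"] = (df["temperature"] - mean) / std

# Resample to a uniform 1-minute frequency, interpolating small gaps
df = df.set_index("timestamp").resample("1min").mean().interpolate(limit=5)<\/code><\/pre>\n\n\n\n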

The Pipeline<\/h2>\n\n\n\n

To achieve our result, we will make extensive use of AWS managed services. Here is a simple sketch showing the main actors involved in our Machine Learning pipeline.<\/p>\n\n\n\n

\"The<\/figure>\n\n\n\n

Let\u2019s take a look at the purpose of each component before going into the details of each one.<\/p>\n\n\n\n

The pipeline is organized into 5 main phases: ingestion<\/strong>, data lake preparation<\/strong>, transformation<\/strong>, training<\/strong>, and inference<\/strong>.<\/p>\n\n\n\n

The ingestion phase <\/strong>will receive data from our connected devices using AWS IoT Core<\/strong>, which lets us connect them to AWS services without managing servers and communication complexities<\/a>. Data from the devices will be sent using the MQTT protocol<\/a> to minimize code footprint and network bandwidth. Should you need it, AWS IoT Core can also manage device authentication<\/strong>.<\/p>\n\n\n\n

\"AWS
AWS IoT Core – Courtesy of AWS <\/em><\/figcaption><\/figure>\n\n\n\n
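Before wiring up the rest of the pipeline, it helps to see what the device side looks like. Below is a minimal sketch using the AWS IoT Device SDK for Python v2 (later in this article we pick the Node.js SDK in the wizard; the endpoint, certificate paths, topic, and simulated reading here are all placeholders from the connection kit):<\/p>\n\n\n\n

<pre><code>import json
import random
import time

from awscrt import mqtt
from awsiot import mqtt_connection_builder

# Endpoint and certificate paths come from the connection kit (placeholders here)
connection = mqtt_connection_builder.mtls_from_path(
    endpoint="xxxxxxxxxxxxxx-ats.iot.eu-west-1.amazonaws.com",
    cert_filepath="certs/device.pem.crt",
    pri_key_filepath="certs/private.pem.key",
    ca_filepath="certs/AmazonRootCA1.pem",
    client_id="my-test-thing",
)
connection.connect().result()

# Publish a simulated sensor reading every 5 seconds over MQTT/TLS (port 8883)
while True:
    payload = {
        "device_id": "my-test-thing",
        "temperature": round(20 + random.random() * 3, 2),
        "ts": int(time.time()),
    }
    connection.publish(
        topic="devices/my-test-thing/data",
        payload=json.dumps(payload),
        qos=mqtt.QoS.AT_LEAST_ONCE,
    )
    time.sleep(5)<\/code><\/pre>\n\n\n\n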

To send information to our Amazon S3 data lake we will use Amazon Kinesis Data Firehose<\/a>, which AWS IoT Core can target directly through a built-in rule action.
To transform data and make it available to Amazon SageMaker we will use
AWS Glue<\/a>: a serverless data integration service that makes it easy to find, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all the capabilities needed for data integration, so you can start analyzing and using data in minutes rather than months.<\/p>\n\n\n\n
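One way to wire this up (a sketch with boto3; the topic, delivery stream name, and role ARN below are hypothetical) is an AWS IoT topic rule with the built-in Firehose action:<\/p>\n\n\n\n

<pre><code>import boto3

iot = boto3.client("iot")

# Forward every message published on "devices/+/data" to a
# Kinesis Data Firehose delivery stream feeding the S3 data lake
iot.create_topic_rule(
    ruleName="ForwardToFirehose",
    topicRulePayload={
        "sql": "SELECT * FROM 'devices/+/data'",
        "awsIotSqlVersion": "2016-03-23",
        "actions": [
            {
                "firehose": {
                    "deliveryStreamName": "iot-to-datalake",  # hypothetical stream
                    "roleArn": "arn:aws:iam::123456789012:role/iot-firehose-role",  # hypothetical role
                    "separator": "\n",  # one JSON record per line in S3
                }
            }
        ],
    },
)<\/code><\/pre>\n\n\n\n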

Finally, to train and then deploy our model for online inference, we will show how to leverage built-in algorithms from Amazon SageMaker, in particular DeepAR<\/strong>.<\/p>\n\n\n\n
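To anticipate what that will look like, here is a minimal training-and-deployment sketch with the SageMaker Python SDK; the role ARN, S3 bucket, instance types, and hyperparameter values are placeholders:<\/p>\n\n\n\n

<pre><code>import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/sagemaker-execution-role"  # hypothetical role

# Retrieve the container image of the built-in DeepAR algorithm
image = image_uris.retrieve("forecasting-deepar", session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-ml-bucket/deepar/output",  # hypothetical bucket
    sagemaker_session=session,
)

# Hyperparameters for hourly time series; values are illustrative only
estimator.set_hyperparameters(
    time_freq="H",
    context_length="24",
    prediction_length="24",
    epochs="100",
)

estimator.fit({"train": "s3://my-ml-bucket/deepar/train/"})

# Deploy a real-time endpoint for online inference
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")<\/code><\/pre>\n\n\n\n

Note that DeepAR expects its training channel to contain time series in JSON Lines format, which is one of the reasons the Glue transformation step matters.<\/p>\n\n\n\n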

Ingestion: AWS IoT Core to Amazon Kinesis Data Firehose<\/h2>\n\n\n\n

To connect our test device to AWS we used AWS IoT Core capabilities. In the following, we assume that the reader already has an AWS account ready. <\/p>\n\n\n\n

AWS IoT Core<\/h3>\n\n\n\n

Go to your account and search for \u201cIoT Core\u201d. On the service page, choose \u201cGet started\u201d from the sidebar menu, then select \u201cOnboard a device\u201d. <\/p>\n\n\n

\n
\"Onboarding\"
Connecting a new device<\/figcaption><\/figure><\/div>\n\n\n

Follow the wizard to connect a device as we did. The purpose is to:<\/p>\n\n\n\n

    \n
  1. Create an AWS IoT Thing<\/strong><\/li>\n\n\n\n
  2. Download the requested code directly to your device to allow connection to AWS.<\/li>\n<\/ol>\n\n\n\n

This is important because we will also connect Amazon Kinesis Data Firehose to read the messages sent through AWS IoT Core. As a side note, remember that you need access to the device and that the device must have a TCP connection to the public internet on port 8883 (the standard port for MQTT over TLS).<\/p>\n\n\n\n
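For reference, the wizard\u2019s first two steps can also be scripted with boto3; the thing name below is hypothetical, and the returned certificate and key pair are what the connection kit ships to the device:<\/p>\n\n\n\n

<pre><code>import boto3

iot = boto3.client("iot")

# 1. Create the AWS IoT Thing
iot.create_thing(thingName="my-test-thing")

# 2. Create an active certificate with its key pair and bind it to the thing
cert = iot.create_keys_and_certificate(setAsActive=True)
iot.attach_thing_principal(
    thingName="my-test-thing",
    principal=cert["certificateArn"],
)<\/code><\/pre>\n\n\n\n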

    Following the wizard, select Linux as the OS and an SDK (in our case Node.js):<\/p>\n\n\n

    \n
    \"Platform<\/figure><\/div>\n\n\n

After that, we gave a name to the new \u201cthing\u201d and got the connection kit, which contains:<\/p>\n\n\n\n