{"id":2584,"date":"2021-02-04T12:49:38","date_gmt":"2021-02-04T11:49:38","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=2584"},"modified":"2023-03-24T18:30:16","modified_gmt":"2023-03-24T17:30:16","slug":"iot-ingestion-and-ml-analytics-pipeline-with-aws-iot-kinesis-and-sagemaker","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/iot-ingestion-and-ml-analytics-pipeline-with-aws-iot-kinesis-and-sagemaker\/","title":{"rendered":"IoT ingestion and ML analytics pipeline with AWS IoT, Kinesis and SageMaker"},"content":{"rendered":"\n
Machine Learning is rapidly becoming part of our daily life: it lets software and devices manage routines without human intervention and gives us the ability to automate, standardize, and simplify many daily tasks. One interesting example is home automation, where it is now possible to have intelligent lights, smart heating, and autonomous robots that clean floors even in complex home layouts filled with obstacles. <\/p>\n\n\n\n
Generally speaking, the information retrievable from connected devices is nearly infinite. The low cost of data acquisition, together with the computational power needed to manage big data, has made Machine Learning accessible to many use cases. One of the most interesting is the ingestion and real-time analysis of data from IoT-connected devices.<\/p>\n\n\n\n
In this article, we would like to share a solution that takes advantage of AWS managed services to handle high volumes of real-time data coming from one or more IoT-connected devices. We\u2019ll show in detail how to set up a pipeline that gives users access to near real-time forecasting results based on the received IoT data. <\/p>\n\n\n\n
The solution will also explore some key concepts related to Machine Learning, ETL jobs, Data Cleaning, and data lake preparation.<\/p>\n\n\n\n
But before jumping into code and infrastructure design, a little recap on ML, IoT, and ETL is needed. Let\u2019s dive together into it!<\/p>\n\n\n\n
The Internet of Things (IoT) is a common way to describe a set of interconnected physical devices \u2014 \u201cthings\u201d \u2014 fitted with sensors, that exchange data with each other and over the Internet.<\/p>\n\n\n\n
IoT has evolved rapidly due to the decreasing cost of smart sensors, and to the convergence of multiple technologies like real-time analytics, machine learning, and embedded systems.<\/p>\n\n\n\n
Of course, traditional fields of embedded systems, wireless sensor networks, control systems, and automation, also contribute to the IoT world.<\/p>\n\n\n\n
ML was born as an evolution of Artificial Intelligence<\/strong>. Before Machine Learning, programmers had to write complex and difficult-to-maintain heuristics in order to carry out a traditionally human task (e.g. text recognition in images) using a computer.<\/p>\n\n\n\n With Machine Learning it is the system itself that learns relationships between data.<\/p>\n\n\n\n For example, in a chess game, there is no longer a hand-written algorithm that plays chess: by providing a dataset of features concerning chess games, the model learns to play by itself. <\/p>\n\n\n\n Machine Learning also makes sense in a distributed context<\/strong> where the prediction must scale<\/strong>.<\/p>\n\n\n\n In a Machine Learning pipeline, the data must be uniform, i.e. standardized. Differences in the data can result from heterogeneous sources, such as different DB table schemas, or different data ingestion workflows. <\/p>\n\n\n\n Transformation (ETL: Extract, transform, load) of data is thus an essential step in all ML pipelines. Standardized data are not only essential for training the ML model but are also much easier to analyse and visualize in the preliminary data discovery<\/strong> step.<\/p>\n\n\n\n In general, for data cleaning and formatting, NumPy, Pandas, and similar libraries are usually used:<\/p>\n\n\n\n – NumPy<\/strong>: <\/em> a library for the management of multidimensional arrays, mainly used in the importing and reading phase of a dataset.<\/p>\n\n\n\n – Pandas<\/strong> Dataframe<\/strong>: a library for managing data in table format. It takes data points from CSV<\/strong>, JSON<\/strong>, Excel<\/strong>, and pickle<\/strong> files and transforms them into tables. <\/p>\n\n\n\n – SciKit-Learn<\/strong>: a library for final data manipulation and training. To achieve our result, we will make extensive use of what AWS gives us in terms of managed services. 
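As a minimal illustration of this cleaning step, the snippet below (a sketch with made-up sensor columns, not code from the pipeline described later) uses Pandas to fill a missing reading and standardize a feature:

```python
import pandas as pd

# Toy readings with one missing value (hypothetical sensor columns)
df = pd.DataFrame({
    "ozone": [69.0, 56.0, None, 88.0],
    "carbon_monoxide": [12.0, 15.0, 11.0, 14.0],
})

# Fill gaps so downstream training doesn't fail on NaNs
df["ozone"] = df["ozone"].fillna(df["ozone"].mean())

# Standardize to zero mean and unit variance
df["ozone_std"] = (df["ozone"] - df["ozone"].mean()) / df["ozone"].std()
print(df)
```

In a real pipeline the same transformations must be applied consistently at training and inference time.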
Here is a simple sketch, showing the main actors involved in our Machine Learning pipeline.<\/p>\n\n\n\n Let\u2019s take a look at the purpose of each component before going into the details of each one of them.<\/p>\n\n\n\n The pipeline is organized into 5 main phases: ingestion<\/strong>, datalake preparation<\/strong>, transformation<\/strong>, training<\/strong>, inference<\/strong>.<\/p>\n\n\n\n The ingestion phase <\/strong>will receive data from our connected devices using AWS IoT Core<\/strong>, which allows connecting them with AWS services without managing servers and communication complexities<\/a>. Data from the devices will be sent using the MQTT protocol<\/a> to minimize code footprint and network bandwidth. Should you need it, AWS IoT Core can also manage device authentication<\/strong>.<\/p>\n\n\n\n To send information to our Amazon S3 data lake we will use Amazon Kinesis Data Firehose<\/a>, which comes with a built-in action for reading AWS IoT Core messages. Finally, to train and then deploy our model for online inference, we will show how to leverage built-in algorithms from Amazon SageMaker, in particular DeepAR<\/strong>.<\/p>\n\n\n\n To connect our test device to AWS we used AWS IoT Core capabilities. In the following, we assume that the reader already has an AWS account ready. <\/p>\n\n\n\n Go to your account and search for \u201cIoT Core\u201d; then, on the service page, in the sidebar menu, choose \u201cGet started\u201d and select \u201cOnboard a device\u201d. <\/p>\n\n\n Follow the wizard to connect a device as we did. The purpose is to:<\/p>\n\n\n\n This is important because we also connect Amazon Kinesis Data Firehose to read the messages sent from AWS IoT Core. 
As a side note, remember that you need access to the device and that the device must have a TCP connection to the public internet on port 8883.<\/p>\n\n\n\n Following the wizard, select Linux as the OS and an SDK (in our case Node.js):<\/p>\n\n\n After that, we gave a name to the new \u201cthing\u201d and got the connection kit which contains:<\/p>\n\n\n\n Once downloaded, initialize a new Node.js project and install the AWS-IoT-device-SDK<\/strong>. This will install the required node modules; after that, it is possible to run the included start.sh<\/strong> script, by including all the certificates downloaded with the kit in the same project directory. We required the Node.js modules necessary to connect the device to AWS and to publish to a relevant topic. You can read data from your sensor in any way you want; for example, if the device can write sensor data to a specific location, just read and stringify that data using device.publish('<YOUR_TOPIC>', JSON.stringify(payload))<\/strong>.<\/p>\n\n\n\n The last part of the code just calls the main function to start sending information to the console.<\/p>\n\n\n\n To run the script use the start.sh script included in the development kit; be sure to point to your code and not the example one from AWS<\/strong>. Leave the certificates and client ID as they are because they were generated from your previous setup.<\/p>\n\n\n\n Note: for the sake of this article the device code is oversimplified; don\u2019t use it like this in production environments.<\/em><\/p>\n\n\n\n To test that everything is working as intended, access the AWS IoT console, go to the Test<\/strong> section in the left sidebar and, when asked, type the name of your topic and click \u201cSubscribe to topic\u201d. 
If everything is set up correctly you should see something like the screenshot below:<\/p>\n\n\n\n Keeping the data lake up to date with the data sent by the device is extremely important to avoid a problem called Concept Drift, which happens when there is a gradual misalignment of the deployed model with respect to the world of real data<\/strong>; this happens because the historical data can no longer represent the problem, which has evolved.\u00a0<\/p>\n\n\n\n To overcome this problem we must ensure efficient logging and the means to understand when to intervene on the model, e.g. by retraining it or by upgrading and redeploying an updated version. To help with this kind of \u201cproblem\u201d we define an Amazon Kinesis Data Firehose action, specifically to automatically register and transport every MQTT message delivered from the device directly to Amazon S3, so as to keep our data lake supplied with fresh data.<\/p>\n\n\n\n To create a Firehose stream search for \u201cKinesis firehose\u201d in the service search bar, select it, then \u201cCreate delivery stream\u201d like in the figure:<\/p>\n\n\n\n To use the stream we must connect it with AWS IoT by means of an IoT Rule<\/strong>; this rule will allow Amazon Kinesis Data Firehose to receive messages and write them to an Amazon S3 bucket. To configure AWS IoT to send to Firehose we followed these steps:<\/p>\n\n\n\n This is an example of how such a rule will then be created:<\/p>\n\n\n\n If everything is ok, you\u2019ll start seeing data showing up in your bucket soon, like in the example below:<\/p>\n\n\n\n Amazon Simple Storage Service is an object storage service ideal for building a datalake<\/a>. With almost unlimited scalability, an Amazon S3 datalake provides many benefits when developing analytics for Big Data. 
<\/p>\n\n\n\n The centralized data architecture of Amazon S3 makes it simple to build a multi-tenant environment where multiple users can bring their own Big Data analytics tool to analyze a common set of data.\u00a0<\/p>\n\n\n\n Moreover, Amazon S3 integrates seamlessly with other Amazon Web Services such as Amazon Athena, Amazon Redshift, and, like in the case presented, AWS Glue.\u00a0<\/p>\n\n\n\n Amazon S3 also enables storage to be decoupled from compute and data processing to optimize costs and data processing workflows, as well as to keep the solution DRY, scalable, and maintainable.<\/p>\n\n\n\n Additionally, Amazon S3 allows you to store any type of structured, semi-structured, or even unstructured data in its native format. In our case, we are simply interested in saving mocked data from a test device to make some simple forecasting predictions.\u00a0<\/p>\n\n\n\n Even if the data is saved on Amazon S3 on a near-real-time basis, that alone is not enough to allow us to train an Amazon SageMaker model. In fact, as we explained in the introduction, the data must be prepared, and when dealing with predefined Amazon SageMaker algorithms<\/strong> some defaults must be kept in mind.<\/p>\n\n\n\n For example, Amazon SageMaker doesn\u2019t accept headers, and in case we want to define a supervised training<\/strong>, we also need to put the ground truth as the first column of the dataset.<\/p>\n\n\n\n In this simple example we used AWS Glue Studio to transform the raw data in the input Amazon S3 bucket into structured parquet files to be saved in a dedicated output bucket. The output bucket will be used by Amazon SageMaker as the data source.<\/p>\n\n\n\n At first, we need a Crawler to read from the source bucket\u00a0to generate an AWS Glue Schema. 
To create it, go to the AWS Glue page, select Crawlers under \u201cGlue console\u201d, and add a new crawler by simply giving it a name and selecting the source Amazon S3 bucket and the root folder created by Amazon Kinesis Data Firehose. A new schema will be created from this information. Leave all other options as default.<\/p>\n\n\n Activate the Crawler once created, by clicking on \u201cRun crawler\u201d.<\/p>\n\n\n The next step is to set up an AWS Glue Studio job using the Catalog as the data source.<\/p>\n\n\n\n An AWS Glue Studio job consists of at least 3 main nodes, which are source<\/strong>, transform<\/strong>, and target<\/strong>. We need to configure all three nodes to define a job<\/strong> capable of reading and transforming data on the fly.<\/p>\n\n\n\n To do so, here are the steps we followed:<\/p>\n\n\n\n Now you\u2019ll see a three-node graph displayed that represents the steps involved in the ETL process. When AWS Glue is instructed to read from an Amazon S3 data source, it will also create an internal schema, called the Glue Data Catalog<\/strong>.<\/p>\n\n\n To configure the source node, click on it in the graph:<\/p>\n\n\n\n The same can be done for the transform node: by clicking on it, it is possible to define what kind of transformation we want to apply to the input data. Here you can also verify that the JSON data is imported correctly:<\/p>\n\n\n Finally, we can select the target node, specifying, again, Amazon S3 as the target, and using .parquet as the output format.<\/p>\n\n\n Now we need to set the ETL job parameters for the workflow just created. Go to the \u201cJob details\u201d tab on the right, give it a name, and select a role capable of managing data and deploying again on Amazon S3.\u00a0<\/p>\n\n\n\n Leave the rest as default. 
<\/p>\n\n\n\n Note that you must have this snippet on the \u201cTrust Relationship\u201d tab of the role to let it be assumed by AWS Glue:<\/p>\n\n\n\n If everything is defined correctly, the job will start and begin converting your data into parquet format. The files will be put in your output directory in the bucket of your choice.<\/p>\n\n\n <\/p>\n\n\n\n We chose to use parquet instead of the CSV data format for the target dataset. Parquet is a highly compressed columnar format which uses the record shredding and assembly algorithm, vastly superior to the simple flattening of nested namespaces. It has the following advantages:<\/p>\n\n\n\n Also, compared to files stored in .csv format, we have these advantages in terms of cost savings:<\/p>\n\n\n\n Amazon SageMaker offers 17 prebuilt algorithms out-of-the-box that cover a plethora of topics related to Machine Learning problems. In our case, we wanted to simplify the development of a forecasting model for the data retrieved from our device, so instead of showing how to bring your own algorithm, like in our previous article, this time we\u2019ll be using a pre-cooked one.<\/p>\n\n\n\n As explained before, apart from cleaning the data, our ETL process was done to transform the data to be compatible with ready-made Amazon SageMaker algorithms.<\/p>\n\n\n\n The Amazon SageMaker API and the Sklearn library offer methods to retrieve the data, call the training method, save the model, and deploy it to production for online or batch inference.<\/p>\n\n\n\n Start by going to the Amazon SageMaker page and creating a new notebook instance; for this article we chose an ml.t3.medium<\/strong>. Add a name and create a new IAM role<\/strong>.<\/p>\n\n\n\n Leave the rest as default and click \u201cCreate notebook\u201d.<\/p>\n\n\n\n Access it via either Jupyter or JupyterLab; we chose the latter. 
We managed to put up a simple notebook illustrating all the steps involved in using the pre-baked DeepAR algorithm by Amazon SageMaker.\u00a0<\/p>\n\n\n\n Note: the code is made solely for this article and is not meant for production, as there is no preliminary investigation on the data and no validation of the results. Still, all the code presented is tested and usable for use cases similar to the one presented.<\/em><\/p>\n\n\n\n We start by importing all the necessary libraries:<\/p>\n\n\n\n We also set the seed for our random methods to ensure reproducibility. After that, we need to recover our parquet<\/strong> files from Amazon S3<\/strong> and obtain a Pandas Dataframe from them.<\/p>\n\n\n\n At first, we prepare all the Amazon S3 \u201cpaths\u201d that will be used in the notebook, and we generate an Amazon SageMaker Session<\/strong> and a valid IAM<\/strong> Role<\/strong> with get_execution_role()<\/strong>. As you can see, Amazon SageMaker takes care of all these aspects for us.<\/p>\n\n\n\n In the step above we recover our forecasting Estimator, DeepAR. <\/strong>An estimator is a class in Amazon SageMaker capable of generating and testing a model, which will then be saved on Amazon S3.<\/p>\n\n\n\n Before starting to read the parquet files we also add a couple of constants to our experiment:<\/p>\n\n\n\n With freq <\/strong>(frequency) we say that we want to analyze the TimeSeries by hourly metrics. Prediction and context length are set to 1 day; they are, respectively, how many hours we want to predict in the future and how many in the past we\u2019ll use for the prediction. Usually, these values are defined in terms of days, as the dataset is much bigger.<\/p>\n\n\n\n We created two helper methods to read from parquet files:<\/p>\n\n\n\n Then we actually read the datasets:<\/p>\n\n\n\n Here we manipulate the dataset to make it usable with DeepAR, which has its own proprietary input format. 
We use df.iloc[:, :8]<\/mark> to keep only the original columns, without the ones produced by the AWS Glue Schema. We generate a new hour<\/strong> column to speed things up; finally, we split the dataset in an 80\/20 proportion for training and testing.<\/p>\n\n\n\n We then write data back to Amazon S3 temporarily, as required by DeepAR, by building JSON files with the series in them.<\/p>\n\n\n\n The generated JSON documents are structured in a format like this:<\/p>\n\n\n\n After that, we can write our JSON files to Amazon S3.<\/p>\n\n\n\n We use sagemaker_session.upload_data()<\/strong> for that, passing the output location. Now we can finally define the estimator:<\/p>\n\n\n\n We\u2019ll pass the Amazon SageMaker session, the algorithm image, the instance type, and the model output path to the estimator. We also need to configure some hyperparameters:<\/p>\n\n\n\n These values are taken directly from the official AWS examples on DeepAR. We also need to pass the two channels, training and test, to the estimator to start the fitting process<\/strong>.<\/p>\n\n\n\n After training and testing a model, we can deploy it using a Real-time Predictor<\/strong>.<\/p>\n\n\n\n The predictor generates an endpoint that is visible from the AWS console.<\/p>\n\n\n\n The endpoint can be called by any REST-enabled application passing a request with a format like the one below:<\/p>\n\n\n\n The \u201ctargets\u201d are some sample values starting from the period set in \u201cstart\u201d from which we want to generate the prediction.<\/p>\n\n\n\n Finally, if we don\u2019t need the endpoint anymore, we can delete it with:<\/p>\n\n\n\n Real-time inference refers to the prediction given in real-time by some models. This is the typical use case of many recommendation systems, or generally when the prediction concerns a single use. 
It is used when:<\/p>\n\n\n\n It\u2019s a bit more complex to manage than what we have done in the notebook and is usually defined in a separate pipeline, due to its requirements of high availability and fast response times.<\/p>\n\n\n\n When deploying using the Amazon SageMaker API it is possible to create a deployment process that is very similar to how a web application is deployed or upgraded, taking into account things like traffic rerouting and deployment techniques like Blue\/Green or Canary. We want to share with you a summary guide for both methods so that you can try them by yourself!<\/p>\n\n\n\n Note: through production variants we can implement different deployment strategies such as A\/B and Blue\/Green.<\/em><\/p>\n\n\n\n Deploy Blue \/ Green<\/strong><\/p>\n\n\n\n Deploy A \/ B<\/strong><\/p>\n\n\n\n To be used if we want to measure the performance of different models with respect to a high-level metric.<\/p>\n\n\n\n In the end, one or more models can be excluded.<\/p>\n\n\n\n Note: the multi-model endpoint property allows for managing multiple models at the same time; the machine memory is managed automatically based on traffic. This approach can save money through the optimized use of resources.<\/em><\/p>\n\n\n\nData Transformation<\/h3>\n\n\n\n
Cleaning and formatting the data is essential to ensure the best chance for the model to converge well<\/strong> to the desired solution.<\/p>\n\n\n\nThe Pipeline<\/h2>\n\n\n\n
<\/figure>\n\n\n\n
To transform data and make it available for Amazon SageMaker we will use AWS Glue<\/a>: a serverless data integration service that makes it easy to find, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all the capabilities needed for data integration, so you can start analyzing and using data in minutes rather than months.<\/p>\n\n\n\nIngestion: AWS IoT Core to Amazon Kinesis Data Firehose<\/h2>\n\n\n\n
AWS IoT Core<\/h3>\n\n\n\n
\n
<\/figure><\/div>\n\n\n
\n
We developed our example using device-example.js<\/strong> as a simple base to understand how to connect a device to AWS IoT. <\/p>\n\n\n\nconst deviceModule = require('aws-iot-device-sdk').device;\nconst cmdLineProcess = require('aws-iot-device-sdk\/examples\/lib\/cmdline');\nprocessPollutionData = (args) => {\n \/\/ Device properties which are needed\n const device = deviceModule({\n keyPath: args.privateKey,\n certPath: args.clientCert,\n caPath: args.caCert,\n clientId: args.clientId,\n region: args.region,\n baseReconnectTimeMs: args.baseReconnectTimeMs,\n keepalive: args.keepAlive,\n protocol: args.Protocol,\n port: args.Port,\n host: args.Host,\n debug: args.Debug\n });\n const minimumDelay = 250; \/\/ ms\n const interval = Math.max(args.delay, minimumDelay);\n \/\/ Send device information\n setInterval(function() {\n \/\/ Prepare Data to be sent by the device\n const payload = {\n ozone: Math.round(Math.random() * 100),\n particullate_matter: Math.round(Math.random() * 100),\n carbon_monoxide: Math.round(Math.random() * 100),\n sulfure_dioxide: Math.round(Math.random() * 100),\n nitrogen_dioxide: Math.round(Math.random() * 100),\n longitude: 10.250786139881143,\n latitude: 56.20251117218925,\n timestamp: new Date()\n };\n device.publish('
<\/figure>\n\n\n\n
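Independently of the SDK you use on the device, the payload handed to device.publish() is just a serialized JSON document. Here is a hedged Python sketch of the same idea (the field names are illustrative, mirroring the mock reading in the Node.js example; no AWS connection is made):

```python
import json
import random
from datetime import datetime, timezone

def build_payload():
    """Build one mock air-quality reading, analogous to the Node.js payload."""
    return {
        "ozone": random.randint(0, 100),
        "carbon_monoxide": random.randint(0, 100),
        "nitrogen_dioxide": random.randint(0, 100),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# An MQTT client (AWS IoT Device SDK, paho-mqtt, ...) would publish this string:
message = json.dumps(build_payload())
print(message)
```

The publishing call itself is the only SDK-specific part; the serialization logic stays the same.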
Now we need to connect Amazon Kinesis Data Firehose to start sending data to Amazon S3.<\/p>\n\n\n\nAmazon Kinesis Data Firehose<\/h3>\n\n\n\n
Create the Firehose stream<\/h3>\n\n\n\n
<\/figure>\n\n\n\n
Select a valid name under \u201cDelivery stream name\u201d, choose \u201cDirect PUT or other sources\u201d under \u201cSources\u201d, and then, on the next page, leave everything as it is (we will convert data in Amazon S3 later). Finally, on the last page, select Amazon S3<\/strong> as the destination and optionally add a prefix for the data inserted in the bucket. Click \u201cNext\u201d and create the stream.<\/p>\n\n\n\nCreate the IoT Rule<\/h3>\n\n\n\n
\n
<\/figure>\n\n\n\n
\n
{\n \"topicRulePayload\": {\n \"sql\": \"SELECT * FROM '
<\/figure>\n\n\n\n
And opening one of these will show the data generated from our device!<\/p>\n\n\n\n<\/figure>\n\n\n\n
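The same IoT Rule can also be created programmatically. Below is a hedged boto3 sketch: the topic, stream, role ARN, and rule name are placeholders, not the resources created in the console above, and the live call is left commented out because it requires AWS credentials.

```python
# Placeholders: replace with your topic, Firehose stream, and IAM role ARN
TOPIC = "pollution/data"
STREAM = "pollution-delivery-stream"
ROLE_ARN = "arn:aws:iam::123456789012:role/iot-firehose-role"

topic_rule_payload = {
    "sql": f"SELECT * FROM '{TOPIC}'",
    "awsIotSqlVersion": "2016-03-23",
    "actions": [{
        "firehose": {
            "deliveryStreamName": STREAM,
            "roleArn": ROLE_ARN,
            "separator": "\n",  # newline-delimit the records written to S3
        }
    }],
}

# To actually create the rule (requires boto3 and valid AWS credentials):
#   import boto3
#   boto3.client("iot").create_topic_rule(
#       ruleName="pollution_to_firehose",
#       topicRulePayload=topic_rule_payload,
#   )
```

The newline separator keeps one message per line in S3, which simplifies the Glue crawling step later.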
Datalake: Amazon S3<\/h2>\n\n\n\nETL process: AWS Glue<\/h2>\n\n\n\n
AWS Glue Crawler<\/h3>\n\n\n\n
ETL job<\/h3>\n\n\n\n
\n
<\/figure>\n\n\n\n
\n
\n
{ \n \"Version\": \"2012-10-17\", \n \"Statement\": [ \n { \n \"Effect\": \"Allow\", \n \"Principal\": { \"Service\": \"glue.amazonaws.com\" }, \n \"Action\": \"sts:AssumeRole\" \n } \n ]\n}\n<\/pre>\n\n\n\n
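If you prefer scripting the role setup, the same trust policy can be attached at role-creation time. A hedged boto3 sketch (the role name is a placeholder, and the live call is commented out since it needs AWS credentials):

```python
import json

# Trust policy allowing AWS Glue to assume the role (same document as above)
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

policy_document = json.dumps(trust_policy)

# To actually create the role (requires boto3 and valid AWS credentials):
#   import boto3
#   boto3.client("iam").create_role(
#       RoleName="glue-etl-job-role",  # placeholder name
#       AssumeRolePolicyDocument=policy_document,
#   )
```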
Dataset optimization: why parquet over CSV<\/h2>\n\n\n\n
\n
\n
\n
\n
\n
\n
<\/li>\n<\/ul>\n\n\n\nThe Machine Learning step: Forecasting with Amazon SageMaker<\/h2>\n\n\n\n
import time\nimport io\nimport math\nimport random\nimport numpy as np\nimport pandas as pd\nimport json\nimport matplotlib.pyplot as plt\nimport boto3\nimport sagemaker\nfrom sagemaker import get_execution_role\n\n# set random seeds for reproducibility\nnp.random.seed(42)\nrandom.seed(42)\n<\/pre>\n\n\n\n
bucket = \"
from sagemaker.amazon.amazon_estimator import get_image_uri\nimage_uri = get_image_uri(boto3.Session().region_name, \"forecasting-deepar\")\n<\/pre>\n\n\n\n
freq = \"H\"\nprediction_length = 24\ncontext_length = 24 # usually prediction and context are set equal or similar\n<\/pre>\n\n\n\n
# Read single parquet file from S3\ndef pd_read_s3_parquet(key, bucket, s3_client=None, **args):\n if not s3_client:\n s3_client = boto3.client('s3')\n obj = s3_client.get_object(Bucket=bucket, Key=key)\n return pd.read_parquet(io.BytesIO(obj['Body'].read()), **args)\n\n# Read multiple parquets from a folder on S3 generated by spark\ndef pd_read_s3_multiple_parquets(filepath, bucket, **args):\n if not filepath.endswith('\/'):\n filepath = filepath + '\/' # Add '\/' to the end\n \n s3_client = boto3.client('s3') \n s3 = boto3.resource('s3')\n s3_keys = [item.key for item in s3.Bucket(bucket).objects.filter(Prefix=filepath)\n if item.key.endswith('.parquet')]\n if not s3_keys:\n print('No parquet found in', bucket, filepath)\n \n dfs = [pd_read_s3_parquet(key, bucket=bucket, s3_client=s3_client, **args) \n for key in s3_keys]\n return pd.concat(dfs, ignore_index=True)\n<\/pre>\n\n\n\n
# get all retrieved parquet in a single dataframe with helpers functions\ndf = pd_read_s3_multiple_parquets(data, bucket)\ndf = df.iloc[:, :8] # get only relevant columns\ndf['hour'] = pd.to_datetime(df['timestamp']).dt.hour #add hour column for the timeseries format\n# split in test and training\nmsk = np.random.rand(len(df)) < 0.8 # 80% mask\n# Dividing in test and training\ntraining_df = df[msk]\ntest_df = df[~msk]\n<\/pre>\n\n\n\n
# We need to resave our data in JSON because this is how DeepAR works\n# Note: we know this is redundant, but it shows how many ways\n# there are to transform the dataset back and forth from when data is acquired\n\ntrain_key = 'deepar_training.json'\ntest_key = 'deepar_test.json'\n\n# Write data in DeepAR format\ndef writeDataset(filename, data):\n    with open(filename, 'w') as file:  # context manager flushes and closes the file\n        previous_hour = -1\n        for hour in data['hour']:\n            if not math.isnan(hour) and hour != previous_hour:\n                previous_hour = hour\n                # One JSON sample per line\n                line = f\"\\\"start\\\":\\\"2021-02-05 {int(hour)}:00:00\\\",\\\"target\\\":{data[data['hour'] == hour]['ozone'].values.tolist()}\"\n                file.write('{' + line + '}\\n')\n<\/pre>\n\n\n\n
{\"start\":\"2021-02-05 13:00:00\",\"target\":[69.0, 56.0, 2.0, \u2026]}\n<\/pre>\n\n\n\n
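Before uploading, it can be worth sanity-checking the generated files: each line must parse as a standalone JSON object with a start timestamp and a numeric target series. A small hedged helper (not part of the original notebook):

```python
import json

def validate_deepar_file(path):
    """Return the number of valid series lines; raise if a line is malformed."""
    count = 0
    with open(path) as f:
        for line in f:
            sample = json.loads(line)  # each line must be valid JSON on its own
            assert "start" in sample and "target" in sample
            assert all(isinstance(v, (int, float)) for v in sample["target"])
            count += 1
    return count
```

Calling validate_deepar_file('deepar_training.json') before the upload catches formatting mistakes earlier than a failed training job would.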
writeDataset(train_key, training_df) \nwriteDataset(test_key, test_df)\n\ntrain_prefix = 'model\/train'\ntest_prefix = 'model\/test'\n\ntrain_path = sagemaker_session.upload_data(train_key, bucket=bucket, key_prefix=train_prefix)\ntest_path = sagemaker_session.upload_data(test_key, bucket=bucket, key_prefix=test_prefix)\n<\/pre>\n\n\n\n
estimator = sagemaker.estimator.Estimator(\n sagemaker_session=sagemaker_session,\n image_uri=image_uri,\n role=role,\n instance_count=1,\n instance_type=\"ml.c4.xlarge\",\n base_job_name=\"pollution-deepar\",\n output_path=f\"s3:\/\/{s3_output_path}\",\n)\n<\/pre>\n\n\n\n
hyperparameters = {\n \"time_freq\": freq,\n \"context_length\": str(context_length),\n \"prediction_length\": str(prediction_length),\n \"num_cells\": \"40\",\n \"num_layers\": \"3\",\n \"likelihood\": \"gaussian\",\n \"epochs\": \"20\",\n \"mini_batch_size\": \"32\",\n \"learning_rate\": \"0.001\",\n \"dropout_rate\": \"0.05\",\n \"early_stopping_patience\": \"10\",\n}\n\nestimator.set_hyperparameters(**hyperparameters)\n<\/pre>\n\n\n\n
data_channels = {\"train\": train_path, \"test\": test_path}\nestimator.fit(inputs=data_channels)\n<\/pre>\n\n\n\n
# Deploy for real time prediction\njob_name = estimator.latest_training_job.name\n\nendpoint_name = sagemaker_session.endpoint_from_job(\n job_name=job_name,\n initial_instance_count=1,\n instance_type='ml.m4.xlarge',\n role=role\n)\n\npredictor = sagemaker.predictor.RealTimePredictor(\n endpoint_name, \n sagemaker_session=sagemaker_session, \n content_type=\"application\/json\")\n<\/pre>\n\n\n\n
<\/figure>\n\n\n\n
{\n \"instances\": [ \n {\n \"start\": \"2021-02-05 00:00:00\",\n \"target\": [88.3, 85.4, ...]\n }\n ],\n \"configuration\": {\n \"output_types\": [\"mean\", \"quantiles\", \"samples\"],\n \"quantiles\": [\"0.1\", \"0.9\"], \n \"num_samples\": 100\n }\n}\n<\/pre>\n\n\n\n
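From outside the notebook, the deployed endpoint can be invoked with the plain AWS SDK rather than the SageMaker Predictor class. A hedged sketch of building such a request (the endpoint name is a placeholder, and the live call is commented out since it needs AWS credentials):

```python
import json

request = {
    "instances": [
        {"start": "2021-02-05 00:00:00", "target": [88.3, 85.4, 90.1]}
    ],
    "configuration": {
        "output_types": ["mean", "quantiles"],
        "quantiles": ["0.1", "0.9"],
        "num_samples": 100,
    },
}
body = json.dumps(request)

# To actually invoke the endpoint (requires boto3 and valid AWS credentials):
#   import boto3
#   response = boto3.client("sagemaker-runtime").invoke_endpoint(
#       EndpointName="pollution-deepar-endpoint",  # placeholder
#       ContentType="application/json",
#       Body=body,
#   )
#   print(json.loads(response["Body"].read()))
```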
sagemaker_session.delete_endpoint(endpoint_name)<\/pre>\n\n\n\n
Real-Time Inference: from concept to production<\/h2>\n\n\n\n
\n
How to deploy<\/h3>\n\n\n\n
\n
\n\n
\n
\n
\n
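The A\/B setup described above maps to SageMaker production variants. Below is a hedged boto3 sketch of an endpoint configuration splitting traffic between two models; all names are placeholders, and the live call is commented out because it requires AWS credentials:

```python
# Two production variants sharing one endpoint; InitialVariantWeight
# controls the traffic split (here 50/50 for an A/B test).
variants = [
    {
        "VariantName": "model-a",
        "ModelName": "pollution-deepar-a",  # placeholder model name
        "InitialInstanceCount": 1,
        "InstanceType": "ml.m4.xlarge",
        "InitialVariantWeight": 0.5,
    },
    {
        "VariantName": "model-b",
        "ModelName": "pollution-deepar-b",  # placeholder model name
        "InitialInstanceCount": 1,
        "InstanceType": "ml.m4.xlarge",
        "InitialVariantWeight": 0.5,
    },
]

# To actually create the config (requires boto3 and valid AWS credentials):
#   import boto3
#   boto3.client("sagemaker").create_endpoint_config(
#       EndpointConfigName="pollution-deepar-ab",  # placeholder
#       ProductionVariants=variants,
#   )
```

Shifting the weights (e.g. 0.9\/0.1) turns the same mechanism into a Canary rollout, while swapping a variant in and out implements Blue\/Green.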
References<\/h2>\n\n\n\n
\n