The beauty of Firehose is that you can also add it as a subsequent step! In our simple coffee application we are not using it: images and analyses are saved directly to Amazon S3 and Amazon Aurora Serverless MySQL by Lambda Functions. However, if the app grows bigger, we can integrate it seamlessly!
## Analysis Step

Once your data is in storage, it is time to analyze it. Methodologies can differ greatly: common examples range from simple queries run in relational databases, to long and complex analytical jobs run in Redshift data warehouses, to near real-time processing using Amazon Kinesis connected to EMR or Elasticsearch.
In our case, we can just run simple queries using our web application backend and display the results in the browser.
However, in the future we may be interested in running much more advanced queries on our data, and maybe in doing some data quality inspection and machine learning training. So we need to get these data out of Amazon Aurora and into Amazon S3 in order to analyze them with AWS Glue jobs and DataBrew and, if needed, to load them easily with Apache Spark, either from AWS Glue or Amazon EMR. To do this, we can follow several paths: for example, we could use AWS Database Migration Service to move the data to Amazon S3 as Parquet files, or we could create an AWS Glue Job, load the data using an AWS Glue Connection to RDS and Spark, and then write them into Amazon S3.
After this, we would need to run an AWS Glue crawler in order to create the Data Catalog that will be used by Amazon Athena and AWS Glue for queries and jobs.
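As a rough, hypothetical sketch of that second path (the Glue Connection name and the bucket below are placeholders, not values from our application), a Glue PySpark job could look like this:

```python
# Sketch of the "AWS Glue Job + JDBC Connection" export path.
# Connection name and bucket are placeholders for illustration.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the coffees table from Aurora MySQL through a pre-created Glue Connection
coffees = glue_context.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "iot-aurora-connection",  # hypothetical Glue Connection
        "dbtable": "coffees",
    },
)

# Write the data to Amazon S3 as Parquet, ready for Athena, Spark, or DataBrew
glue_context.write_dynamic_frame.from_options(
    frame=coffees,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/coffees_parquet/"},
    format="parquet",
)

job.commit()
```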
Here, however, we will show you a different and sometimes much more **flexible** path to export, clean, and catalog our data from a relational database: an **Amazon Athena custom data source**.

By default, Amazon Athena comes with the Amazon S3 – AWS Glue Data Catalog integration, but AWS recently added the possibility to add custom data sources such as JDBC-connected databases and AWS CloudWatch, or to query Amazon S3 using a custom Apache Hive metastore. In our case we are interested in connecting to MySQL Amazon Aurora Serverless, so we need to go to the Amazon Athena home, configure a workgroup named AmazonAthenaPreviewFunctionality, and then add an Amazon S3 query output path:
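If you prefer to script this setup step, the same workgroup can also be created with boto3 (the results bucket below is a placeholder):

```python
# Sketch: create the preview workgroup and give it an S3 query output location.
# The results bucket name is a placeholder.
import boto3

athena = boto3.client("athena")

athena.create_work_group(
    Name="AmazonAthenaPreviewFunctionality",
    Configuration={
        "ResultConfiguration": {
            "OutputLocation": "s3://my-athena-results-bucket/queries/"
        }
    },
    Description="Workgroup used for Athena federated data sources",
)
```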
After this, we can go back to the Amazon Athena home and select Connect Data Source:
We are presented with a web page where we need to select the type of data source: we go for Query a data source (beta) and MySQL:
After that, you are asked to enter the name and description of the new catalog and to select or create a Lambda Function to manage the connection. Choose the name you like the most and click Configure new AWS Lambda Function.
You are presented with a page where you need to enter the JDBC connection URI for Amazon Aurora and select the subnet and security group for the Lambda function that Amazon Athena will use to establish the JDBC connection. Choose them **wisely**, otherwise the Lambda won’t reach the Amazon Aurora instance!
The Secret Name prefix is used to store the DB credentials in AWS Secrets Manager. This is essential for a production environment; leaving it blank means no integration will be created. After you click Deploy and select the Lambda you just created, in the Amazon Athena dashboard you’ll see a new catalog alongside the standard AwsGlueCatalog:
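A quick note on the Secret Name prefix mentioned above: for the Secrets Manager integration to work, the database credentials must live in a secret whose name starts with the chosen prefix. As a hedged illustration (secret name, user, and password are placeholders), such a secret could be created with boto3:

```python
# Sketch: store the Aurora credentials in AWS Secrets Manager so the Athena
# connector can reference them. Secret name and values are placeholders.
import json

import boto3

secrets = boto3.client("secretsmanager")

secrets.create_secret(
    Name="AthenaJdbc/iotarticolo",  # the name should start with the configured prefix
    SecretString=json.dumps({"username": "<db-user>", "password": "<db-password>"}),
)
```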
Note that at first, databases and tables won’t appear. Fear not: if you go to the Lambda function you’ll see failures, and in Amazon CloudWatch you’ll see an error like:
```
Catalog is not supported in multiplexer. After registering the catalog in Athena, must set 'iotarticolo_connection_string' environment variable in Lambda. See JDBC connector README for further details.: java.lang.RuntimeException
```

Go on and set the required Lambda function environment variable, using the same JDBC connection string entered as DefaultConnection string in the preceding step. After this, the connection will work and you’ll be able to query your DB directly from Amazon Athena! Sweet!
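For instance (the function name is hypothetical and the JDBC URI is only a placeholder: the exact format expected by the connector is described in its README), the variable can be set with boto3:

```python
# Sketch: set the connection string environment variable on the Athena connector Lambda.
# Function name and JDBC URI are placeholders; check the JDBC connector README
# for the exact URI format. Note: this call replaces the whole Environment,
# so include any other variables the function already has.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_function_configuration(
    FunctionName="iotarticolo-athena-connector",
    Environment={
        "Variables": {
            "iotarticolo_connection_string": "mysql://jdbc:mysql://<aurora-endpoint>:3306/iot?user=<user>&password=<password>"
        }
    },
)
```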
However, at a closer look we immediately notice that something is amiss with the data: here is a screen of what we can read directly from MySQL:
As you can see, Amazon Athena was smart enough to convert tinyint(1) data to bool, but could not fetch MySQL datetime columns. This is due to a very well-known problem with the JDBC connector, and the easiest fix is to just create a new field containing the datetime as a string in Java datetime format:
```sql
ALTER TABLE coffees ADD COLUMN coffee_hour_str VARCHAR(255) AFTER coffee_hour;
UPDATE coffees SET coffees.coffee_hour_str=DATE_FORMAT(coffee_hour, '%Y-%m-%d %H:%i:%s');
```

At this point, Amazon Athena will be able to read the new field. **And now we are ready for a beautiful trick:** let’s just go to the AWS Glue dashboard and create a new Database. A Database is just a logical container for metadata. You can choose the name you prefer:
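The same Database can also be created programmatically; here is a minimal boto3 sketch using the name `iotarticologlue`, the one referenced by the query in the next step:

```python
# Sketch: create the AWS Glue database (a logical container for table metadata).
# The name matches the one used in the CTAS query below.
import boto3

glue = boto3.client("glue")

glue.create_database(
    DatabaseInput={
        "Name": "iotarticologlue",
        "Description": "Catalog database for the coffee IoT dataset",
    }
)
```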
At this point we can go back to Amazon Athena and run a query like this:
```sql
CREATE TABLE iotarticologlue.coffees
WITH (
    format = 'PARQUET',
    external_location = 's3://besharp-athena/coffees_parquet',
    parquet_compression = 'GZIP'
) AS
SELECT photo_url, smile, beard, mustache, glasses, coffee_hour_str
FROM "iotarticolo"."iot"."coffees"
WHERE photo_url LIKE 'https://%';
```
This will create a new Table in the Database we just added to our AWS Glue Data Catalog and save all the data to Amazon S3 as GZIP-compressed Parquet files. Furthermore, you could also change the compression (e.g. Snappy or BZIP) if you like.
The query will also filter out the rows with a bad Amazon S3 URL in photo_url!
So we now have a super-fast way to export our DB to Amazon S3 as Parquet while automatically creating the AWS Glue catalog entry (the query also does that for free under the hood).
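From now on, the Parquet copy behaves like any other Athena table; as a small illustrative example (the results bucket is a placeholder), it can also be queried programmatically:

```python
# Sketch: query the new Parquet-backed Glue table from boto3.
# The results bucket name is a placeholder.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT glasses, count(*) AS coffees FROM coffees GROUP BY glasses",
    QueryExecutionContext={"Database": "iotarticologlue"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/queries/"},
)
print(response["QueryExecutionId"])
```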
And now it is trivial to visualize this new catalog in AWS Glue DataBrew: go to the dashboard and create a new project:
Now create a new dataset in the Add dataset section:
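The dataset can also be registered from code; here is a minimal boto3 sketch pointing DataBrew at the Glue table created above (the dataset name is arbitrary):

```python
# Sketch: register the Glue-cataloged table as an AWS Glue DataBrew dataset.
# The dataset name is arbitrary; database and table match the ones created above.
import boto3

databrew = boto3.client("databrew")

databrew.create_dataset(
    Name="coffees-parquet",
    Input={
        "DataCatalogInputDefinition": {
            "DatabaseName": "iotarticologlue",
            "TableName": "coffees",
        }
    },
)
```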
And create the project!
If you encounter an error, try setting the object name to parquet in Amazon S3 and crawling the table again with AWS Glue crawlers (DataBrew is pretty new too!).
And voilà: a beautiful data visualization of our dataset, complete with column statistics!
## Conclusion

In this article, we described a very simple **IoT application** using **Amazon Rekognition** and **Amazon Aurora**. We explained how it can be enhanced with Amazon Kinesis Data Firehose and, finally, we used Amazon Athena to transform and clean the collected data and save it very easily to Parquet, ready to be analyzed with AWS Glue DataBrew, Amazon Athena, and other AWS tools such as EMR.

Have you ever tried something similar for your data analysis process?
Feel free to write to us about your solutions: we’ll be glad to offer you a “connected” coffee 😀
That’s all for today.
**Keep reading and see you in 14 days on #Proud2beCloud!**