"There are some secrets which do not permit themselves to be told" – Edgar Allan Poe, The Man of the Crowd

Sometimes an infrastructure can be perfectly designed and still be abused.
This is the case of an e-commerce application replatforming gone berserk.
After an initial assessment and design phase, we suggested some modifications to better integrate the application into a cloud environment.
Stateless sessions were implemented with ElastiCache for Redis, database scalability with Aurora Serverless for MySQL, and, finally, application security was improved by using IAM roles and Secrets Manager to store database credentials and external API keys instead of plain-text configuration files.
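A stateless session layer like the one above boils down to a thin wrapper around a Redis client: the app servers keep nothing in memory, so any of them can handle any request. A minimal sketch, assuming a redis-py-style client (the `SessionStore` name, key prefix, and TTL are illustrative, not the project's actual code):

```python
import json
import uuid

class SessionStore:
    """Keep HTTP session state in Redis so any app server can serve any request.

    `client` is expected to expose `setex` and `get` like redis-py's
    `redis.Redis` (e.g. one pointed at an ElastiCache endpoint).
    """

    def __init__(self, client, ttl_seconds=1800, prefix="session:"):
        self.client = client
        self.ttl = ttl_seconds
        self.prefix = prefix

    def create(self, data):
        # Random session id; the TTL makes abandoned sessions expire on their own
        session_id = uuid.uuid4().hex
        self.client.setex(self.prefix + session_id, self.ttl, json.dumps(data))
        return session_id

    def load(self, session_id):
        raw = self.client.get(self.prefix + session_id)
        return json.loads(raw) if raw is not None else None
```

With redis-py this would be used as `SessionStore(redis.Redis(host="my-cluster.xxxxxx.cache.amazonaws.com"))`; since the instances hold no session state, they can be scaled out or replaced freely.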
Everything was fine until, after a deployment, the application started slowing down and failing load balancer health checks.
It turned out that, since Secrets Manager was so easy and handy to use, it had been used for everything, even ordinary application parameters like bucket names, endpoints, and configuration thresholds. Every page request fetched at least one or two different secrets and, to make things worse, there were traffic spikes due to marketing campaigns. At the end of the month, the AWS billing dashboard recorded 25,000,000 API calls, adding $150 to the running costs.
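The order of magnitude is easy to verify against Secrets Manager's list prices ($0.05 per 10,000 API calls, plus $0.40 per secret per month; actual prices vary by region, and the number of secrets below is a hypothetical figure):

```python
api_calls = 25_000_000
price_per_10k_calls = 0.05       # USD, Secrets Manager API pricing
secrets_stored = 50              # hypothetical number of stored "secrets"
price_per_secret_month = 0.40    # USD per secret per month

api_cost = api_calls / 10_000 * price_per_10k_calls
storage_cost = secrets_stored * price_per_secret_month
print(api_cost, storage_cost, api_cost + storage_cost)
```

Roughly $125 of pure API calls plus the per-secret storage fee: right in the ballpark of the $150 on the bill.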
But there's more: sometimes the application stopped responding altogether because the instance metadata service (used to authenticate API calls made with IAM roles) throttled access. After all, from its point of view, the application looked like it was attempting a DoS.
After we explained the issue and proposed a solution based on Parameter Store and environment variables, everything worked like a charm again, and the porting went fine.
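The fix boils down to fetching configuration once and serving it from memory, instead of calling an AWS API on every request. A minimal sketch, where the TTL and helper names are illustrative, not the project's actual code:

```python
import os
import time

class CachedValue:
    """Fetch a value lazily, then serve it from memory until the TTL expires."""

    def __init__(self, loader, ttl_seconds=300):
        self.loader = loader       # e.g. a callable wrapping an SSM get_parameter call
        self.ttl = ttl_seconds
        self._value = None
        self._expires_at = 0.0

    def get(self):
        now = time.time()
        if now >= self._expires_at:
            # Only hit the remote API when the cached copy is stale
            self._value = self.loader()
            self._expires_at = now + self.ttl
        return self._value

def config(name, default=None):
    """Plain parameters (bucket names, thresholds) belong in environment
    variables, not in Secrets Manager."""
    return os.environ.get(name, default)
```

With boto3 the loader would be something like `lambda: ssm.get_parameter(Name="/app/db-endpoint")["Parameter"]["Value"]`; either way, a traffic spike no longer translates into millions of `GetSecretValue` calls.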
High Unavailability

"Congratulations. You are still alive. Most people are so ungrateful to be alive. But not you. Not anymore." – Saw
Speaking of cloud-oriented application refactoring, here is a simple mistake we sometimes see: if an application must be highly available, please don't tie vital requests to a single point of failure, like an FTP server running on a single EC2 instance just to retrieve configuration files.
This is a short story, but there's a lot of suffering behind it.
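The lesson fits in a few lines: any vital remote fetch needs a timeout and a fallback, so that losing one endpoint degrades the application instead of killing it. A sketch of the fallback pattern, with hypothetical names:

```python
def load_config(fetch_remote, local_cache_path="/tmp/app-config.cache"):
    """Try the remote source; on any failure, fall back to the last good copy."""
    try:
        data = fetch_remote()                      # e.g. an HTTPS/S3 call with a short timeout
        with open(local_cache_path, "w") as f:     # refresh the local fallback copy
            f.write(data)
        return data
    except Exception:
        try:
            with open(local_cache_path) as f:      # last known good configuration
                return f.read()
        except FileNotFoundError:
            raise RuntimeError("no configuration available: remote and cache both failed")
```

Better still, bake the configuration into the deployment artifact or keep it in a managed, replicated store (S3, Parameter Store) rather than on a lone FTP server.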
Infernal backup
"We all go a little mad sometimes." – Psycho (1960)

Disaster recovery of on-premises workloads to the Cloud is a hot topic, especially for enterprises.
Designing a resilient solution is not easy. The first step is to find a backup strategy that fits the workload and guarantees the required RPO and RTO.
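As a back-of-the-envelope check: with periodic backups, the worst-case RPO is the backup interval, while the RTO is bounded below by the time needed to retrieve and restore the data. The figures below are purely illustrative:

```python
backup_interval_h = 4                  # a backup every 4 hours...
worst_case_rpo_h = backup_interval_h   # ...means up to 4 hours of data lost

backup_size_gb = 2_000                 # hypothetical dataset size
restore_throughput_gbph = 500          # hypothetical restore speed, GB/hour
min_rto_h = backup_size_gb / restore_throughput_gbph
print(worst_case_rpo_h, min_rto_h)  # 4 4.0
```

If the business needs a 1-hour RPO, a 4-hour backup schedule is ruled out before any software is even evaluated.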
Once the backup strategy and requirements are defined, it's time to find the right software for the job.
In a large enterprise, the first choice is usually to keep the existing solution and adapt it to run in the Cloud. No problem, as long as there is at least some form of integration, typically with Amazon Glacier or Amazon S3. This was our case, but…
To make things more resilient, someone installed the software on an EC2 instance in a dedicated AWS account and configured it to back up on-premises machines over the existing Direct Connect connection, through a Transit Gateway attachment.
You can already tell in which direction the AWS bill is going! To make things worse, all VPC endpoints were centralized in a dedicated networking account for better manageability and observability.
So, to summarize, every single gigabyte backed up from on-premises was:

- Using the Direct Connect connection
- Traversing the Transit Gateway attachment to reach the EC2 instance in the backup account
- Traversing the Transit Gateway again to reach the networking account
- Traversing the centralized S3 interface endpoint

At the end of the month, a $6,000 spike showed up in the networking section of the AWS bill.
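A rough per-gigabyte estimate explains the spike. Taking typical list prices of $0.02/GB for Transit Gateway data processing (charged at each attachment the traffic crosses) and $0.01/GB for interface-endpoint data processing (prices vary by region, and Direct Connect port and transfer fees come on top; the backup volume below is a hypothetical figure):

```python
tgw_processing_per_gb = 0.02       # USD/GB, per Transit Gateway attachment crossed
endpoint_processing_per_gb = 0.01  # USD/GB, S3 interface endpoint

# Each backed-up GB crosses two TGW attachments, then the centralized endpoint
cost_per_gb = 2 * tgw_processing_per_gb + endpoint_processing_per_gb

backup_tb = 120                    # hypothetical monthly backup volume
total = backup_tb * 1_000 * cost_per_gb
print(round(cost_per_gb, 4), round(total, 2))
```

At roughly $0.05 of pure network processing per gigabyte, a month of heavy backups is all it takes to reach a four-digit surprise, while a gateway endpoint in the backup account's VPC (or backing up straight to S3 from on-premises) would have skipped most of those hops.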
Conclusion

What have we learned? If something seems easy and trivial in a complex environment, you should look more carefully for clues.
Even this year, we had our fair share of scary horror stories. As always, no mistake is intentional, and the good news is that we can always learn something new and avoid repeating it in the future.
I want to thank everyone at the AWS Community Day in Rome who shared their stories after my speech about last year's article. Next year's episode is already written! 🙂
What kind of horror did you find in your AWS accounts? Let us know in the comments! (We also accept anonymous e-mails :D)
About Proud2beCloud

Proud2beCloud is a blog by beSharp, an Italian APN Premier Consulting Partner expert in designing, implementing, and managing complex Cloud infrastructures and advanced services on AWS. Before being writers, we are Cloud Experts working daily with AWS services since 2007. We are hungry readers, innovative builders, and gem-seekers. On Proud2beCloud, we regularly share our best AWS pro tips, configuration insights, in-depth news, tips&tricks, how-tos, and many other resources. Take part in the discussion!