{"id":6408,"date":"2023-10-27T09:00:00","date_gmt":"2023-10-27T07:00:00","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=6408"},"modified":"2024-02-02T11:58:12","modified_gmt":"2024-02-02T10:58:12","slug":"nightmare-infrastructures-episode-2-besharps-halloween-special","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/nightmare-infrastructures-episode-2-besharps-halloween-special\/","title":{"rendered":"Nightmare Infrastructures episode 2 – beSharp\u2019s Halloween special"},"content":{"rendered":"\n

Boys and girls of every age<\/em>
Wouldn’t you like to see something strange?<\/em>
Come with us, and you will see<\/em>
This, our town of Halloween<\/em> – The Nightmare Before Christmas.<\/p>\n\n\n\n

\"beSharp\u2019s<\/figure>\n\n\n\n

Last year, we saw some scary infrastructures<\/a>. Are you ready for the new episode? <\/p>\n\n\n\n

In this article, we\u2019ll see some strange infrastructure designs and practices we have encountered, telling stories about Cloud anti-patterns that will become an absolute nightmare in the long term<\/strong>.<\/p>\n\n\n\n

Hold your breath; we\u2019re about to start! <\/p>\n\n\n\n

The undead<\/h2>\n\n\n
\n
\"\"<\/figure><\/div>\n\n\n

<\/p>\n\n\n\n

The modern concept of zombies is influenced by the Haitian Vodou religion, where some people believe that a witch doctor can revive a dead person as their slave using magic or a secret potion. <\/p>\n\n\n\n

Sometimes, ECS tasks are resurrected by a service, even if they should be buried (and stopped).<\/p>\n\n\n\n

We saw it happen in a development environment when a small, pretty PHP container broke because of an error in a failed pipeline. <\/p>\n\n\n\n

The task struggled to stay alive, but the brutal Application Load Balancer health check shot the container in the head; each time, the ECS service did its best to revive the task, pulling its image from the ECR repository all over again. <\/p>\n\n\n\n

No one noticed the issue until the end of the month, when the bill rose to over $800 due to 16 terabytes of traffic through the NAT Gateway.<\/p>\n\n\n\n

In this case, the ECS circuit breaker was willing to help, but no one asked him. <\/p>\n\n\n\n

To avoid making ECS zombies, please involve him next time you deploy a container! <\/p>\n\n\n\n
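
For reference, this is a minimal sketch of how the circuit breaker could be switched on for an existing service. It only builds the arguments for the ECS UpdateService call; the cluster and service names are hypothetical, and we assume a service that uses the rolling update deployment type.<\/p>\n\n\n\n

```python
def circuit_breaker_update(cluster, service, rollback=True):
    '''Build keyword arguments for the ECS UpdateService call.'''
    return {
        'cluster': cluster,
        'service': service,
        'deploymentConfiguration': {
            'deploymentCircuitBreaker': {
                'enable': True,        # stop a deployment whose tasks keep failing
                'rollback': rollback,  # optionally return to the last good deployment
            },
        },
    }


if __name__ == '__main__':
    # Hypothetical names. With boto3 installed and credentials configured,
    # the call would be: boto3.client('ecs').update_service(**kwargs)
    kwargs = circuit_breaker_update('dev-cluster', 'php-app')
    print(kwargs)
```

With the circuit breaker enabled, ECS marks a deployment as failed when its tasks cannot reach a steady state, instead of resurrecting them forever.<\/p>\n\n\n\n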

It\u2019s a matter of trust<\/h2>\n\n\n
\n
\"\"<\/figure><\/div>\n\n\n

<\/p>\n\n\n\n

I’m lying when I say, “Trust me”<\/em>
I can’t believe this is true…<\/em>
Trust hurts, why does trust equal suffering?<\/em>
Megadeth, Trust<\/p>\n\n\n\n

This is a short but even scarier episode. While investigating a problem with a failed pipeline deployment, we saw this trust policy in a role with administrator permissions on every resource: <\/p>\n\n\n\n

{\n\n  \"Version\": \"2012-10-17\",\n\n  \"Statement\": [\n\n    {\n\n      \"Effect\": \"Allow\",\n\n      \"Principal\": {\n\n        \"AWS\": \"*\"\n\n      },\n\n      \"Action\": \"sts:AssumeRole\"\n\n    }\n\n  ]\n\n}<\/code><\/pre>\n\n\n\n

We first detached the administrator permissions policy from the role. Then, as a proof of concept during a call with the customer, we used our personal AWS account to assume that role. Pretty scary, huh? <\/p>\n\n\n\n
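
A quick way to spot this kind of trust policy is to scan for wildcard principals. The sketch below is a simplified check, not a full policy analyzer: the function name is our own, it only looks at Allow statements, and it treats any Condition block as a mitigation without evaluating it.<\/p>\n\n\n\n

```python
def wildcard_trust_statements(trust_policy):
    '''Return the Allow statements that let any AWS principal assume the role.'''
    statements = trust_policy.get('Statement', [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    risky = []
    for stmt in statements:
        if stmt.get('Effect') != 'Allow':
            continue
        principal = stmt.get('Principal')
        # Principal can be the bare string '*' or a mapping such as {'AWS': ...}
        aws = principal.get('AWS', []) if isinstance(principal, dict) else principal
        values = aws if isinstance(aws, list) else [aws]
        # Flag wildcards only when no Condition narrows who can use them.
        if '*' in values and not stmt.get('Condition'):
            risky.append(stmt)
    return risky


if __name__ == '__main__':
    scary_policy = {
        'Version': '2012-10-17',
        'Statement': [
            {'Effect': 'Allow', 'Principal': {'AWS': '*'}, 'Action': 'sts:AssumeRole'},
        ],
    }
    print(len(wildcard_trust_statements(scary_policy)), 'risky statement(s) found')
```

Run against the trust policy above, the check flags its only statement.<\/p>\n\n\n\n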

I see dead lambdas<\/h1>\n\n\n\n

Like in \u201cThe Sixth Sense\u201d…<\/p>\n\n\n\n

<\/p>\n\n\n

\n
\"\"<\/figure><\/div>\n\n\n

<\/p>\n\n\n\n

It was a cold winter night, and, during a storm, our on-duty cell phone started ringing desperately. A serverless application was struggling to survive, and its API Gateway kept returning 5xx errors. <\/p>\n\n\n\n

Our colleague started investigating, and strangely, everything was quiet. Too quiet. CloudWatch had recorded no logs for the Lambda function behind the troubled application route. <\/p>\n\n\n\n

When making requests with Postman or curl, everything worked like a charm. <\/p>\n\n\n\n

Since everything was working again, the investigation was postponed until the next morning, but… After an hour, the phone started ringing again. And, still, no traces of failures, even in the logs. <\/p>\n\n\n\n


Determined to solve the mystery, another colleague joined the investigation and, while reviewing the configuration… It suddenly appeared! <\/p>\n\n\n\n

In the past, our customer had been having trouble with the Lambda function timing out, so they \u201creserved some capacity\u201d. It turned out that the \u201creserved concurrency\u201d was set to 1. <\/p>\n\n\n\n

According to the AWS documentation: \u201cReserved concurrency is the maximum number of concurrent instances you want to allocate to your function. When a function has reserved concurrency, no other function can use that concurrency<\/em>\u201d. <\/p>\n\n\n\n

But there\u2019s a catch: reserved concurrency is also the maximum number of concurrent instances of the function that can run. Setting this value to one effectively throttles the Lambda function: if two users call the API route simultaneously, the second invocation is rejected and API Gateway returns a 5xx error. Throttled invocations never run the function code, which is why nothing showed up in its logs. <\/p>\n\n\n\n
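
To see the effect, here is a toy simulation of synchronous invocations against a reserved concurrency limit. It is a simplified model under our own assumptions (no retries, no queueing), and the function name and timings are made up for illustration.<\/p>\n\n\n\n

```python
def simulate(requests, reserved_concurrency):
    '''Count served and throttled invocations.

    requests: list of (start_time, duration) tuples, in seconds.
    '''
    running = []  # end times of the invocations still in flight
    served = throttled = 0
    for start, duration in sorted(requests):
        # Drop the invocations that finished before this request arrived.
        running = [end for end in running if end > start]
        if len(running) < reserved_concurrency:
            running.append(start + duration)
            served += 1
        else:
            throttled += 1  # never runs, so it leaves no function logs
    return served, throttled


if __name__ == '__main__':
    # Two overlapping requests, then a third one after both have finished.
    traffic = [(0.0, 1.0), (0.5, 1.0), (2.0, 1.0)]
    print(simulate(traffic, reserved_concurrency=1))
    print(simulate(traffic, reserved_concurrency=2))
```

With a limit of 1, the overlapping request is throttled; with a limit of 2, all three are served. A single tester with Postman or curl never overlaps with anyone, so everything looks healthy.<\/p>\n\n\n\n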

After removing the reserved concurrency setting, everything worked fine. Consider using provisioned concurrency if you want to have Lambda instances ready to serve requests. This article explains how these two parameters impact execution and performance: Lambda function scaling – AWS Lambda (amazon.com)<\/a>.<\/p>\n\n\n\n

Too many secrets<\/h1>\n\n\n
\n
\"\"<\/figure><\/div>\n\n\n

<\/p>\n\n\n\n

“There are some secrets which do not permit themselves to be told”<\/em>
– Edgar Allan Poe – The Man of the Crowd<\/p>\n\n\n\n

Sometimes, infrastructures can be perfect, but they still can be abused. <\/p>\n\n\n\n

This is the case of an e-commerce application replatforming gone berserk. <\/p>\n\n\n\n

After an initial assessment and design, we suggested modifications so that the application would integrate better with a cloud environment. <\/p>\n\n\n\n

Stateless sessions were implemented with ElastiCache for Redis, database scalability came from Aurora Serverless for MySQL, and, finally, application security improved thanks to IAM roles and Secrets Manager, which stored database credentials and external API keys instead of plain-text config files. <\/p>\n\n\n\n

Everything was fine until, after a deployment, the application started slowing down and failing load balancer health checks. <\/p>\n\n\n\n

It turned out that, since Secrets Manager was so easy and handy to use, it was used for everything, even ordinary application parameters like bucket names, endpoints, and configuration thresholds. Every page request fetched at least one or two different secrets, and, to make things worse, there were traffic spikes due to marketing campaigns. At the end of the month, the AWS billing dashboard recorded 25,000,000 API calls, resulting in $150 added to the running costs. <\/p>\n\n\n\n

But there\u2019s more: sometimes, the application stopped responding because the metadata service (used to authenticate API calls using IAM roles) throttled access. After all, it thought the application was attempting a DoS attack.<\/p>\n\n\n\n

After we explained the issue and proposed a solution involving Parameter Store and environment variables, everything worked like a charm again, and the migration went fine. <\/p>\n\n\n\n
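
As an illustration of the environment variable half of that solution, here is a minimal sketch that reads ordinary, non-secret settings once at startup. The variable names, defaults, and the load_config helper are hypothetical and not taken from the customer application.<\/p>\n\n\n\n

```python
import os


def load_config(env=None):
    '''Read ordinary, non-secret settings from environment variables.'''
    env = os.environ if env is None else env
    return {
        'bucket_name': env['APP_BUCKET_NAME'],  # required: KeyError if missing
        'api_endpoint': env.get('APP_API_ENDPOINT', 'https://api.example.com'),
        'max_retries': int(env.get('APP_MAX_RETRIES', '3')),
    }


if __name__ == '__main__':
    # Loaded once at startup: no API call is made on each page request.
    config = load_config({'APP_BUCKET_NAME': 'assets', 'APP_MAX_RETRIES': '5'})
    print(config)
```

Database credentials and API keys stay in Secrets Manager; only plain configuration moves out.<\/p>\n\n\n\n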

High Unavailability<\/h1>\n\n\n\n

\u201cCongratulations. You are still alive. Most people are so ungrateful to be alive. But not you. Not anymore\u201d<\/em> –  Saw.<\/p>\n\n\n\n

<\/p>\n\n\n

\n
\"\"<\/figure><\/div>\n\n\n

<\/p>\n\n\n\n

Speaking of cloud-oriented application refactoring, this is a simple mistake that we sometimes see: if an application is highly available, please don\u2019t tie vital requests to an endpoint that is a single point of failure (like an FTP server running on an EC2 instance to retrieve configuration files). <\/p>\n\n\n\n

This is a short story, but there’s a lot of suffering behind it.<\/p>\n\n\n\n

Infernal backup<\/h1>\n\n\n
\n
\"\"<\/figure><\/div>\n\n\n

<\/p>\n\n\n\n

\u201cWe all go a little mad sometimes.\u201d<\/em>
\u2013 Psycho (1960) <\/p>\n\n\n\n

Disaster recovery of on-premises workloads to the Cloud is a hot topic, especially for enterprises. <\/p>\n\n\n\n

Designing a resilient solution is not easy. The first step is to find a backup strategy that fits the workload and guarantees the required RPO and RTO.<\/p>\n\n\n\n

Once the backup strategy and requirements are defined, it’s time to find the right software that fits our needs. <\/p>\n\n\n\n

In a large enterprise, the first choice is to use the existing solution but adapt it to run in the Cloud. No problem, as long as there is at least a form of integration, typically with Amazon Glacier or Amazon S3. This was our case, but\u2026 <\/p>\n\n\n\n

To make things more resilient, someone installed the software on an EC2 instance in a dedicated AWS account and configured it to back up on-premises machines using the existing Direct Connect connection through the Transit Gateway attachment. <\/p>\n\n\n\n

You can already tell in which direction the AWS bill was heading! To make things worse, all endpoints were centralized in a dedicated networking account for better manageability and observability. <\/p>\n\n\n\n

So, to summarize, a single gigabyte backed up from on-premises was: <\/p>\n\n\n\n