{"id":7350,"date":"2024-10-30T17:42:02","date_gmt":"2024-10-30T16:42:02","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=7350"},"modified":"2024-10-30T17:42:06","modified_gmt":"2024-10-30T16:42:06","slug":"nightmare-cloud-infrastructures-episode-3","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/nightmare-cloud-infrastructures-episode-3\/","title":{"rendered":"Nightmare Cloud Infrastructures: episode 3"},"content":{"rendered":"\n
\n

“You’ve seen horrible things. An army of nightmare creatures. But they are nothing compared to what came before. What lies below. It’s our task to placate the Ancient Ones. As it’s yours to be offered up to them. Forgive us. And let us get it over with.”

The Cabin in the Woods

Hello, dear horror lovers, and welcome back to our little scary corner!

As many of you already know, at this time of year, we tell you stories about errors and design problems that turned out to be monsters to fight.

In this article, we will see what went wrong, why, and how to avoid certain things in the future…

As always, we're not pointing fingers at anyone; sometimes, bad decisions depend on context and timing.

So, sit back, relax, and keep reading!

If you missed them, here are the first and second episodes.

Complex complexity

“Sometimes dead is better.”

Pet Sematary


What can you do when you find a potential problem that feeds itself and grows without anyone trying to stop this behavior?

It started on a sunny summer day during an interview for a Well-Architected review. We stumbled upon a critical workload running on EKS on EC2 and a second EKS cluster running on Fargate. As always, we were looking for improvements in manageability and elasticity, so we asked if there was a plan to adopt Fargate for the EKS workload running on EC2 instances.

There was no plan: the workload could run on Fargate, but Karpenter managed the cluster's elasticity. Fair enough: Karpenter is a great tool, and it had just come out of beta. We were pretty happy with this finding because automatic resource provisioning and decommissioning had been considered, even if manageability could still be improved. We briefly discussed the elasticity requirements and the R&D done to evaluate solutions, since the application's load was pretty unpredictable; good job!

So we had a look at the EKS cluster running on Fargate and found out that it was used... to run Karpenter!

We had to ask: did they migrate the EC2 cluster from Cluster Autoscaler to Karpenter? The answer was no: they built a second cluster to migrate the workload.

We were puzzled: if a migration was needed, why not use the Fargate platform (or, even better, add a Fargate profile and start running pods on it)? The answer was simple: “Because we use Karpenter!” We had to move on, so we noted this for the report (and for this article!).
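For reference, the Fargate-profile route we suggested is only a few lines of CDK. Here is a minimal sketch in TypeScript, assuming the cluster is defined in the same stack; the stack, cluster, and namespace names are all hypothetical:

```typescript
// Minimal CDK sketch: run a namespace on Fargate inside the existing
// cluster instead of standing up a second cluster. All names are
// hypothetical; the cluster definition is reduced to the essentials.
import * as cdk from 'aws-cdk-lib';
import * as eks from 'aws-cdk-lib/aws-eks';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'EksStack');

const cluster = new eks.Cluster(stack, 'Cluster', {
  version: eks.KubernetesVersion.V1_29,
});

// Pods matching this selector are scheduled on Fargate: no nodes to
// manage, no second cluster to babysit.
cluster.addFargateProfile('ToolingProfile', {
  selectors: [{ namespace: 'tooling' }],
});
```

A profile like this could have hosted the workload, or even Karpenter itself, with no second cluster to maintain.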

Running an extra tool, and maintaining its configuration, to scale resources that could scale without any particular intervention will sooner or later lead to errors and problems (especially during upgrades and changes). Instead of managing one component, you now have to take care of an additional one and evaluate how the two interact.

In this case, we saw a pattern: what if someone adds a GUI to Karpenter in the future? We can bet we will find that GUI running during the next Well-Architected review! It's a never-ending loop: adding complexity to a system to manage its complexity.

When we add a component, we have to evaluate its impact. Even if it seems to ease our job, we should always ask ourselves: is there a managed service designed for that?

CI/CR (Continuous Integration, Continuous Rollback)

“Diabolical forces are formidable. These forces are eternal, and they exist today. The fairy tale is true. The devil exists. God exists. And for us, as people, our very destiny hinges upon which one we elect to follow.”

The Conjuring


“For our project, we need three different environments and, of course, CI/CD pipelines for our front end and back end.” Nothing makes us happier than hearing this requirement in a project's kick-off call! Finally, the “development = test = production” scenario was a distant memory.

Since there was no particular technology-stack constraint, we used CodeBuild and CodePipeline and agreed on three different branches for the three environments: dev, test, and production, each associated with a pipeline running in the corresponding environment.

To “promote” code between environments, a merge has to be made from dev to test, and so on. One could argue that there are better flows, but this one adapted well to the development workflow already in place.
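As a sketch of the setup, here is roughly what one of the three pipelines looked like in CDK (TypeScript). CodeCommit as the source, the repository name, and the construct names are assumptions for illustration; the real pipelines had more stages:

```typescript
// One pipeline per environment, each watching its own branch.
// Promotion = merging dev -> test -> production.
import * as cdk from 'aws-cdk-lib';
import * as codecommit from 'aws-cdk-lib/aws-codecommit';
import * as codebuild from 'aws-cdk-lib/aws-codebuild';
import * as codepipeline from 'aws-cdk-lib/aws-codepipeline';
import * as actions from 'aws-cdk-lib/aws-codepipeline-actions';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'DevPipelineStack');

const repo = codecommit.Repository.fromRepositoryName(stack, 'Repo', 'backend');
const source = new codepipeline.Artifact();

new codepipeline.Pipeline(stack, 'DevPipeline', {
  stages: [
    {
      stageName: 'Source',
      actions: [
        new actions.CodeCommitSourceAction({
          actionName: 'Source',
          repository: repo,
          branch: 'dev', // 'test' and 'production' in the other pipelines
          output: source,
        }),
      ],
    },
    {
      stageName: 'Build',
      actions: [
        new actions.CodeBuildAction({
          actionName: 'Build',
          project: new codebuild.PipelineProject(stack, 'Build'),
          input: source,
        }),
      ],
    },
  ],
});
```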

After some IaC and CDK development, the dev and test environments were ready for both the front end and the back end. We were asked to identify releases by tagging ECR images and S3 assets with the commit ID. It was easily done, but we had no idea what was about to happen.
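The tagging itself is a one-liner in the build: CodeBuild exposes the commit that triggered the run as CODEBUILD_RESOLVED_SOURCE_VERSION. A hedged sketch follows; the REPO_URI variable is an assumption we'd pass in from the pipeline:

```typescript
import * as codebuild from 'aws-cdk-lib/aws-codebuild';

// CODEBUILD_RESOLVED_SOURCE_VERSION is set by CodeBuild to the commit ID
// that triggered the build, so every image is traceable to a commit.
const buildSpec = codebuild.BuildSpec.fromObject({
  version: '0.2',
  phases: {
    build: {
      commands: [
        'docker build -t "$REPO_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION" .',
        'docker push "$REPO_URI:$CODEBUILD_RESOLVED_SOURCE_VERSION"',
      ],
    },
  },
});
```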

Production was ready after some time (and development), and the team requested a manual approval step that would trigger a git tag on the repository, to better identify releases. Only after a week or two did the situation become clear: we received a request to modify the test and production pipelines so that one could select which tag to deploy artifacts from.
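The approval gate itself was the easy part. A minimal sketch of such an action in CDK (the follow-up step that pushed the git tag is omitted):

```typescript
import * as actions from 'aws-cdk-lib/aws-codepipeline-actions';

// The pipeline stops at this action until a human approves; only then
// does the release proceed (and a later step pushed the git tag).
const approval = new actions.ManualApprovalAction({
  actionName: 'ApproveRelease',
  additionalInformation: 'Approving deploys this commit and tags it as a release.',
});
```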

This was to accommodate a request to cherry-pick which version of the front end and back end to deploy in each environment, to ease rollbacks “just in case”.

Needless to say, this is a recipe for the perfect disaster: rolling back arbitrary versions of different components can lead to unknown and unexpected interactions. What if you have a bug in the testing environment (or, worse, in production) that you need to investigate? You will have to roll back the entire environment chain, even local development environments.

There are plenty of articles on the cost of rollbacks and diverging environments. In some cases, strict corporate policies forbid deploying anything that is not currently running and validated in the testing environment.

And we haven't even taken human error into consideration: manually selecting the wrong tag is a much easier mistake to make than with a pipeline that always deploys the latest commit from a branch…

If something goes wrong in a deployment, it's easier to fix it and move on than to maintain a complicated workflow “just in case”!

After some time and explanations, we continued to use the “normal” workflow, and guess what? No rollback was ever made!

Bad Luck for the bad

“Despite my ghoulish reputation, I really have the heart of a small boy. I keep it in a jar on my desk.”

Robert Bloch


Sometimes, bad luck also happens to villains.

A company contacted us in a hurry after finding an unknown EC2 instance running after a weekend. The IT team was small, so they ran an internal check: no one had started that huge instance.

They were worried about a possible leak of administrative credentials, so we fired up CloudTrail and found that an unknown IP address from a foreign country had used an access key first to describe account quotas, limits, and regions, and then to start the instance.
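For the curious, the query boils down to something like this with the AWS SDK for JavaScript v3. RunInstances is the real CloudTrail event name for instance launches; everything else here is illustrative:

```typescript
import { CloudTrailClient, LookupEventsCommand } from '@aws-sdk/client-cloudtrail';

// Look up who launched instances in the region where the rogue one appeared.
const client = new CloudTrailClient({ region: 'eu-west-3' });

const response = await client.send(
  new LookupEventsCommand({
    LookupAttributes: [
      { AttributeKey: 'EventName', AttributeValue: 'RunInstances' },
    ],
  }),
);

for (const event of response.Events ?? []) {
  // The raw event includes the source IP address and the access key used.
  console.log(event.EventTime, event.Username, event.CloudTrailEvent);
}
```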

The chosen region was eu-west-3 (Paris), an uncommon one. Luckily for the customer, that region was the workload's main region, and with only a few instances running, spotting that kind of anomaly was easy and took a matter of minutes.

After more investigation, we found that the “usual” bots had exploited a vulnerability in a website's PHP framework, obtained the access key (which unfortunately had broad permissions), and used the credentials to spawn a crypto-miner in a region that seemed a good idea to them precisely because it was uncommon.

In the end, being a villain is not an easy task; bad luck can happen to anyone.

As for our newly acquired customer: after an emergency review, we made a plan to fix many things. We are still working through all the issues, but relying on good practices is better than relying on luck!

By the way, temporary credentials handled by IAM roles can save you many headaches; trust us!
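A minimal CDK sketch of what we mean, assuming the website runs on EC2 (names and permissions are hypothetical): attach a role to the instance, and the SDK picks up short-lived credentials automatically, so there is no long-lived access key to steal.

```typescript
import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as iam from 'aws-cdk-lib/aws-iam';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'WebStack');
const vpc = new ec2.Vpc(stack, 'Vpc');

// The role is assumed by the instance; credentials are rotated
// automatically and never live in application code or config files.
const role = new iam.Role(stack, 'WebRole', {
  assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'),
});

// Grant only what the application needs (the bucket name is hypothetical).
role.addToPolicy(
  new iam.PolicyStatement({
    actions: ['s3:GetObject'],
    resources: ['arn:aws:s3:::my-assets-bucket/*'],
  }),
);

new ec2.Instance(stack, 'Web', {
  vpc,
  role, // CDK wires up the instance profile
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MICRO),
  machineImage: ec2.MachineImage.latestAmazonLinux2023(),
});
```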

Home disautomation

“You know, I think this Christmas thing is not as tricky as it seems! But why should they have all the fun? It should belong to anyone! Not anyone, in fact, but me! Why, I could make a Christmas tree! And there’s not a reason I can find, I couldn’t have a Christmastime! I bet I could improve it, too! And that’s exactly what I’ll do!”

Jack Skellington, The Nightmare Before Christmas


Hey, this is me! Damiano from the past, do you remember? You are making fun of a lot of disasters, but you are not perfect either. As a nerd, you are vulnerable to the lure of new technologies, eager to try them at every opportunity, even at home. So be brave, and tell them about the home automation project that turned the lights in your house into rebels.

Ok, fair enough. This story is about how my laziness and passion for new technologies turned against me.

It was 2018, and I had purchased a Sonoff smart switch to use with Alexa to turn on a lamp in my living room where no wall switch was available. It wasn't long before I learned about home automation with ESP8266 Wi-Fi chips, Home Assistant, Tasmota, and the MQTT protocol. All of this was in its early days, and the best way to learn is always to experiment, even with what you use daily.

And, as always, before grasping the big picture and how things really work, you have to make mistakes.

In July, I was on holiday in Sardinia when my parents phoned me: they had seen a light turned on in my kitchen, entered my home, and were not able to turn it off. Since they knew I was tinkering with the lights at home, they immediately suspected that something wasn't working as expected.

It was obviously my fault, and here's why.

Having discovered ESP8266 chips and their capabilities, I started prototyping a “Wi-Fi remote switch extender” to control multiple light bulbs with a single switch, triggering different events for different actions, like double clicks, long presses, etc. The idea was to extend the capabilities of a simple light switch by embedding an ESP8266 chip and triggering webhooks to other smart devices (even custom-made ones).

With this kind of device, I could turn everything on or off, and even control lights that weren't in the same room!

Here is an (ugly) prototype from that year:

[Photo: the 2018 prototype]

The problem(s) with the light that wouldn't turn off were the following: