How to Achieve Operational Excellence in the Cloud


Sometimes, everything in our IT seems to go wrong. 

Things don't go the way we want them to, and the world seems to revolve around Murphy's Law. 

Instead, I like to think that we can be the architects of our (Cloud) destiny: by acting wisely on what is under our control, things can still go the right way... and what breaks can be fixed!

For example, those who decide to transform their business by embracing the Cloud paradigm often fall into recurring anti-patterns that prevent them from effectively managing disruptions, quickly restoring services, or mitigating negative impacts on customers. In these cases, failure is caused neither by fate nor by the wrong technology; more often, the reason is the lack of a corporate culture focused on Operational Excellence.

Operational excellence focuses on identifying and implementing practices and processes that maximize value for our customers while reducing waste, costs, and inefficiencies. 

This means that, in addition to designing and implementing resilient services, we must also consider maintenance, modification, restoration, and continuous improvement for them. It becomes, therefore, crucial to have trained people and a structured approach to Operations.

Operational Excellence: The Fundamentals


Operational Excellence is based on some fundamental pillars; reducing human error is one of these. This can be achieved by automating processes, simplifying procedures, and eliminating non-essential activities.

The Cloud has made it easier for us to achieve this goal: infrastructures are defined through code, which makes them more replicable, and processes can be automated to become deterministic, limiting human error.
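To make the idea of replicable, deterministic infrastructure more concrete, here is a minimal sketch in Python. It is not tied to any real tool: the `plan` and `apply` functions and the resource names (`web`, `db`, `cache`) are hypothetical and only illustrate the principle of describing a desired state in code and letting automation work out the changes.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare the desired state with the actual one and list the changes needed."""
    return {
        "create": sorted(desired.keys() - actual.keys()),
        "update": sorted(
            name for name in desired.keys() & actual.keys()
            if desired[name] != actual[name]
        ),
        "delete": sorted(actual.keys() - desired.keys()),
    }


def apply(desired: dict, actual: dict) -> dict:
    """Return the new actual state after carrying out the plan."""
    changes = plan(desired, actual)
    result = {name: dict(spec) for name, spec in actual.items()}
    for name in changes["delete"]:
        del result[name]
    for name in changes["create"] + changes["update"]:
        result[name] = dict(desired[name])
    return result


# Hypothetical resources, used only for illustration.
desired_state = {"web": {"size": "small"}, "db": {"size": "large"}}
actual_state = {"web": {"size": "medium"}, "cache": {"size": "small"}}

first_plan = plan(desired_state, actual_state)
new_state = apply(desired_state, actual_state)
second_plan = plan(desired_state, new_state)  # nothing left to change
```

Running the same definition a second time produces an empty plan. That repeatability is what takes manual, error-prone steps out of the loop.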


Efficiency, however, should not lead us to lower the bar on quality. 

Aiming for high-quality products and services and meeting (or exceeding!) customers' expectations is another key principle. This involves rigorous quality control, implementing continuous improvement practices, and preventing defects or errors.

A good practice is to rely on microservices. Using decoupled components limits how far a failure can spread and allows us to operate on a single component without affecting the others.

Also, it is important to introduce small and regular changes. This practice ensures incremental (and more secure) updates that, on the one hand, reduce the likelihood of significant errors and, on the other hand, enable quick adaptation to changes in market conditions, thereby increasing the confidence of our consumers (or users).

Continuous Improvement

Once efficiency and quality have been achieved, the focus must shift to evolution. 

It is essential to understand how to continuously improve what has been done and innovate while always keeping the focus on the customer.

The Plan-Do-Check-Act practice - also known as the Deming Cycle - is a fundamental process for improving operations. It consists of analyzing processes, setting new goals, implementing changes, and verifying results. This can help highlight weaknesses and, as a consequence, lead to updating and improving infrastructure management and maintenance procedures.

Learning from mistakes is necessary to enhance procedures and avoid repeating the same risky behavior. Two helpful actions we can take are documenting disruptions and creating failure simulations. The latter, in particular, aims to test a company's mitigation capabilities and to understand the impact of the most critical scenarios. Regular "Pre-mortem" exercises based on real-world disruptions can be leveraged to grow and share practices and to educate teams.

In addition to increasing the efficiency and effectiveness of internal processes, the product itself must be able to change and adapt to market preferences. Creating short experimentation cycles to obtain rapid feedback is the safest and most cost-effective way to innovate and steer activities toward customer satisfaction.

In this scenario, the Cloud comes to our aid again with the "Fail Fast" principle. The Cloud makes it possible to quickly identify and resolve problems, or to test new product features, with minimal effort and wasted resources.

For example, testing and evaluating developments involving small portions of our audience (e.g. selecting user groups to validate changes, confirm them, or revert to the previous state) helps reduce the impact of malfunctions.
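As a rough illustration of selecting a small user group, the sketch below assigns each user to a test cohort in a deterministic way. It is only one possible approach, not a prescribed method: the function name `in_test_group`, the feature label, and the user IDs are all hypothetical.

```python
import hashlib


def in_test_group(user_id: str, feature: str, percent: float) -> bool:
    """Return True if the user belongs to the test cohort for this feature.

    The same user always gets the same answer for the same feature, so the
    experience stays stable while the change is being validated.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in the range [0, 10000).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


# Hypothetical usage: expose a new feature to roughly 10% of users.
users = [f"user-{n}" for n in range(10_000)]
selected = [u for u in users if in_test_group(u, "new-checkout", 10)]
```

Raising `percent` step by step confirms the change for a growing audience, while setting it back to 0 reverts everyone to the previous behavior.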


Gathering insights is essential to obtain detailed information about the performance of Operations, including processes, resources, and other aspects of the operational model.

The importance of insights lies in their direct positive impact on enabling continuous improvement, operational efficiency, and the company's overall success.

Defining Key Performance Indicators (KPIs) that assist in making informed and conscious decisions is crucial. For example, measuring the time required to complete a process (cycle time) and the time elapsed between customer request and delivery (lead time) is valuable. These KPIs provide an overview of the speed and readiness of the company to meet market needs.

In the case of software development, "cycle time" would represent the average time needed to complete a single user story or task, from assignment to delivery. "Lead time" refers to the total time from the beginning of a project or release planning to customer delivery.
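The two KPIs can be computed from three timestamps per work item. The following Python sketch assumes a hypothetical `WorkItem` record with `requested_at`, `assigned_at`, and `delivered_at` fields; real trackers name these differently, and the sample dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class WorkItem:
    requested_at: datetime  # customer request or start of planning
    assigned_at: datetime   # the task is assigned and work begins
    delivered_at: datetime  # the result reaches the customer


def cycle_time(item: WorkItem) -> timedelta:
    """Time from assignment to delivery."""
    return item.delivered_at - item.assigned_at


def lead_time(item: WorkItem) -> timedelta:
    """Time from the initial request to delivery."""
    return item.delivered_at - item.requested_at


def average(durations: list) -> timedelta:
    """Average a list of timedelta values."""
    return timedelta(seconds=mean(d.total_seconds() for d in durations))


# Made-up sample data.
items = [
    WorkItem(datetime(2024, 1, 1), datetime(2024, 1, 3), datetime(2024, 1, 8)),
    WorkItem(datetime(2024, 1, 2), datetime(2024, 1, 4), datetime(2024, 1, 7)),
]

avg_cycle = average([cycle_time(i) for i in items])
avg_lead = average([lead_time(i) for i in items])
```

With this sample, the average cycle time is 4 days and the average lead time is 6 days. The gap between the two shows how long requests wait before anyone starts working on them.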

Another aspect to always keep an eye on refers to the readiness and responsiveness of a system to emergencies or crises. Getting insights into how the organization handles unforeseen events is crucial to achieving safety and resilience.

In summary, insights for Operational Excellence drive continuous improvement, enable adaptation to market needs, and contribute to a more efficient, resilient, and customer-oriented organization.


Operational Excellence - focusing on efficiency, quality, and operational continuity - creates a solid foundation for daily operations, ensuring maximum performance and minimum waste. Operational Excellence, together with innovation, contributes significantly to the success of companies.

A company with a strong interest in Operational Excellence creates fertile ground for effective experimentation, cross-functional collaboration, and the adoption of new technologies; in other words, for speeding up innovation. This helps companies remain competitive in a fast-evolving landscape.

When Operational Excellence is implemented correctly, it shifts us from the mindset of "if something can go wrong, it will" to "expect the best, prepare for the worst." This emphasizes the importance of meticulous planning, preparation, and awareness of challenges while simultaneously fostering an optimistic expectation for success.

Have you considered how Operational Excellence can elevate your organization? Share your thoughts or reach out for more insights!

About Proud2beCloud

Proud2beCloud is a blog by beSharp, an Italian APN Premier Consulting Partner expert in designing, implementing, and managing complex Cloud infrastructures and advanced services on AWS. Before being writers, we are Cloud Experts working daily with AWS services since 2007. We are hungry readers, innovative builders, and gem-seekers. On Proud2beCloud, we regularly share our best AWS pro tips, configuration insights, in-depth news, tips&tricks, how-tos, and many other resources. Take part in the discussion!

Nicola Ferrari
Cloud Infrastructure Line Manager @ beSharp and AWS authorized instructor champion. I live my life one level at a time, getting superpowers by collecting caffeine hidden here and there in my daily map. I’m a hardened internet surfer (yes, I surfed the whole internet… twice!) and a tech addict with a passion for computers and networking. Building great IT things all nice and tidy contributes to achieving my main goal: the pursuit of perfection!
