{"id":6458,"date":"2023-11-10T09:00:00","date_gmt":"2023-11-10T08:00:00","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=6458"},"modified":"2023-11-07T15:50:29","modified_gmt":"2023-11-07T14:50:29","slug":"business-continuity-in-the-cloud-the-importance-of-fault-tolerance","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/business-continuity-in-the-cloud-the-importance-of-fault-tolerance\/","title":{"rendered":"Business continuity in the Cloud: the importance of Fault Tolerance"},"content":{"rendered":"\n
One of the principles that we strive to convey to our partners every day is that the Cloud is not a magical place where we put resources, pay a bill, and automatically obtain all the advantages commonly listed in articles, white papers, and provider presentation pages.<\/p>\n\n\n\n
The Cloud involves a shift in mindset and, consequently, a new way of making investments and viewing the “digital” aspects of companies. Business continuity is becoming increasingly crucial, especially for those who provide services to consumers: we must be ready to fulfill requests at any time and under any conditions.<\/p>\n\n\n\n
It’s not only about understanding the financial impact of unexpected events but also about implementing the procedures and practices that maximize our uptime.<\/p>\n\n\n\n
I would like to start with some data: here are some examples of service interruptions at big corporations, which show that downtime is not only a matter of customer trust but also a financial one.<\/p>\n\n\n\n
In 2015, a 12-hour Apple Store outage cost the company $25 million. In 2016, a five-hour power outage at a Delta Air Lines operations center resulted in an estimated loss of $150 million. In 2019, a 14-hour outage cost Facebook approximately $90 million. In addition to this data, the twelfth Annual Global Data Center Survey (2022) from the Uptime Institute provides a critical overview of the industry and an idea of its future trajectory. Among the key findings of the report, I would like to focus on two specific points:<\/p>\n\n\n\n
When it comes to service outages, I believe no one was surprised by the results, but it’s certainly interesting to note that the bar of expectations is continually rising, and the consequences of downtime are having an increasingly significant impact on earnings and user trust.<\/p>\n\n\n\n
The second bullet point, in my view, relates to using Cloud technology without fully understanding its fundamentals. As we mentioned, the Cloud is not magical; simply moving a workload to the Cloud doesn’t automatically increase uptime. <\/p>\n\n\n\n
This article attempts to explain how the Cloud empowers us and what we need to do to achieve fault tolerance<\/strong>, improve uptime<\/strong>, and address some aspects of a structured business continuity<\/strong> plan.<\/p>\n\n\n\nWhat is Fault Tolerance?<\/h2>\n\n\n\n If we were in school, we would start with the definition: <\/p>\n\n\n\n Fault tolerance refers to the ability of a system or application to continue functioning reliably even when failures or malfunctions occur in one or more of its components. <\/em><\/p>\n\n\n\n Being fault-tolerant is important for meeting many of the objectives defined in business continuity plans. A well-designed fault-tolerant infrastructure brings benefits such as reduced downtime, improved reliability, lower risk, and consistent service quality.<\/p>\n\n\n\n It’s important to clarify the concept of fault tolerance and how it differs from concepts like HA (High Availability), redundancy, and DR (Disaster Recovery), terms that are often mistakenly used interchangeably:<\/p>\n\n\n\n High Availability<\/strong> (HA) refers to a system’s ability to remain operational, without significant interruptions, for extended periods of time. This goal can be achieved through continuous monitoring, redundancy, and other measures. A high-availability system is designed to avoid failures and ensure continuous service availability.<\/p>\n\n\n\n Redundancy<\/strong> is used to enhance a system’s reliability. Designing infrastructures that include redundant components means that if one component fails, another can take its place to prevent interruptions. Redundancy can be applied at both the hardware and software levels.<\/p>\n\n\n\n Disaster Recovery<\/strong> (DR) is a plan aimed at restoring a system or application in the event of catastrophic events or large-scale failures. 
This usually involves data backup, business continuity planning, and specific procedures to restore the system at an alternate location or with alternative resources.<\/p>\n\n\n\n High availability, redundancy, and DR are all important components in achieving fault tolerance, but they alone may not be sufficient to guarantee it<\/strong>. Fault tolerance requires a thoughtful design that considers a wide range of failure scenarios and provides adequate measures to manage them without significant disruptions.<\/p>\n\n\n\n Fault tolerance is, therefore, a design and implementation strategy that significantly contributes to continued reliable operation, thus protecting a company’s reputation and maintaining customer trust.<\/p>\n\n\n\nShared Responsibility Model<\/h2>\n\n\n\n Those approaching the Cloud should keep in mind that providers are not, do not want to be, and cannot be responsible for our applications in their entirety. Furthermore, the boundary of responsibility shifts depending on the service or building block we use for our infrastructure. Taking AWS as an example, as shown in the image, IaaS services shift more of the responsibility towards the customer, while managed services shift more of it towards the provider.<\/p>\n\n\n\n <\/p>\n\n\n\n