For Azure customers, the good news is that Microsoft’s global cloud infrastructure has delivered an average uptime of 99.995% for its core compute services over the past 12 months.
The not-so-good news for those same customers is that over those same 12 months, the Azure cloud has “experienced three unique and significant incidents that impacted customers.”
In a blog entry posted earlier this week, Azure CTO Mark Russinovich addressed the issue head-on. Russinovich spelled out the broad nature of those failures—what he referred to as “incidents”—as well as the remedial steps Microsoft is taking to ensure that such problems become even less frequent.
The Three Azure Failures
Here’s how Russinovich described the three failures:
“However, at the scale Azure operates, we recognize that uptime alone does not tell the full story. We experienced three unique and significant incidents that impacted customers during this time period, a datacenter outage in the South Central US region in September 2018, Azure Active Directory (Azure AD) Multi-Factor Authentication (MFA) challenges in November 2018, and DNS maintenance issues in May 2019.”
As we all know, breakdowns, challenges, issues, outages, failures, incidents and imperfections are inescapable until we humans achieve a state of full perfection, which Gartner predicts will occur at 12:37pm Pacific Time on Oct. 17 in the year 5852.
Until then, customers and prospects need to push back hard on Microsoft—the world’s #1 cloud vendor—and the entire tech community to do everything possible to deliver relentlessly enhanced reliability, security and availability.
Toward that end, here are some of the steps Russinovich outlined in his blog post:
“Improve our understanding”
“Outages and other service incidents are a challenge for all public cloud providers, and we continue to improve our understanding of the complex ways in which factors such as operational processes, architectural designs, hardware issues, software flaws, and human factors can align to cause service incidents.”
“Multiple failures” and “intricate interactions”
“All three of the incidents mentioned were the result of multiple failures that only through intricate interactions led to a customer-impacting outage. In response, we are creating better ways to mitigate incidents through steps such as redundancies in our platform, quality assurance throughout our release pipeline, and automation in our processes.
The capability of continuous, real-time improvement is one of the great advantages of cloud services, and while we will never eliminate all such risks, we are deeply focused on reducing both the frequency and the impact of service issues while being transparent with our customers, partners, and the broader industry.”
Within Russinovich’s CTO office, Microsoft has created a Quality Engineering team that will work closely with the existing Site Reliability Engineering team to explore and create innovative reliability solutions.
Safe deployment
This initiative, which ensures "that all code and configuration changes go through a cycle of specific stages," has been expanded to include software-defined infrastructure changes such as networking and DNS, Russinovich wrote.
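To picture what such a staged pipeline looks like, here is a minimal sketch of ring-based rollout logic. To be clear, this is not Microsoft's internal tooling: the ring names, the deploy() and healthy() helpers, and the bake time are hypothetical stand-ins for the general practice of promoting a change through progressively larger stages and halting on bad telemetry.

```python
"""Illustrative sketch of ring-based safe deployment.

Not Microsoft's tooling: ring names, deploy(), healthy(), and the
bake time are all hypothetical stand-ins for the staged-rollout idea.
"""
import time

RINGS = ["canary", "pilot", "broad"]  # hypothetical stage names
BAKE_SECONDS = 60                     # shortened for illustration

def deploy(ring: str, build: str) -> None:
    """Push the build to one ring (placeholder)."""
    print(f"deploying {build} to {ring}")

def healthy(ring: str) -> bool:
    """Check error rates and telemetry for the ring (placeholder)."""
    return True

def safe_rollout(build: str) -> None:
    for ring in RINGS:
        deploy(ring, build)
        time.sleep(BAKE_SECONDS)      # let the change "bake" under watch
        if not healthy(ring):
            raise RuntimeError(f"halting rollout: {ring} unhealthy")
    print("rollout complete")
```

The point of the pattern is that a bad change is caught while it affects a small ring, not the whole fleet.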
Storage-account level failover
This one’s worth reading in full:
“During the September 2018 datacenter outage, several storage stamps were physically damaged, requiring their immediate shut down. Because it is our policy to prioritize data retention over time-to-restore, we chose to endure a longer outage to ensure that we could restore all customer data successfully. A number of you have told us that you want more flexibility to make this decision for your own organizations, so we are empowering customers by previewing the ability to initiate your own failover at the storage-account level.”
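For readers who want to try the preview, a customer-initiated failover can be kicked off programmatically. Below is a minimal sketch using the Python management SDK (azure-mgmt-storage); the subscription, resource group, and account names are placeholders, the method name reflects recent versions of the SDK, and since the feature is in preview, availability may vary, so check the docs for your SDK version.

```python
# Minimal sketch of customer-initiated storage-account failover via
# the Python management SDK (azure-mgmt-storage). All names below are
# placeholders; the feature is in preview at the time of writing.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

subscription_id = "<subscription-id>"           # placeholder
client = StorageManagementClient(DefaultAzureCredential(), subscription_id)

# Start failover of the account to its secondary region and block
# until the long-running operation completes.
poller = client.storage_accounts.begin_failover(
    resource_group_name="my-rg",                # placeholder
    account_name="mystorageaccount",            # placeholder
)
poller.wait()
print("failover complete; the secondary endpoint is now primary")
```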
Expanding availability zones
Availability zones in Azure's 10 largest regions already provide "an additional reliability option for the majority of our customers," the blog post says, and Microsoft plans to bring them to its next 10 largest Azure regions over the coming 18 months.
Project Tardigrade
This effort aims to spot warning signs of hardware failures or memory leaks before they cause downtime, letting Azure freeze a virtual machine for a few seconds and shift its workload to a healthy system, Russinovich wrote.
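The mechanics are easiest to see in code form. The sketch below is purely illustrative of the idea, not of Azure's implementation: predict_failure(), freeze(), and migrate() are hypothetical stand-ins for telemetry-based prediction and the freeze-and-move step.

```python
# Purely illustrative sketch of the Project Tardigrade idea: watch for
# signals that hardware is about to fail, briefly freeze the affected
# VM, and move its workload to a healthy host. These functions are
# hypothetical stand-ins, not Azure APIs.
import time

def predict_failure(host: str) -> bool:
    """Placeholder for telemetry-based hardware-failure prediction."""
    return False

def freeze(vm: str) -> None:
    """Placeholder: pause the VM for a few seconds, not a reboot."""
    print(f"freezing {vm}")

def migrate(vm: str, target_host: str) -> None:
    """Placeholder: move the frozen VM onto healthy hardware."""
    print(f"moving {vm} to {target_host}")

def watch(vm: str, host: str, spare: str) -> None:
    """Monitor loop: freeze and relocate the VM if failure looks imminent."""
    while True:
        if predict_failure(host):
            freeze(vm)             # seconds-long pause for the customer
            migrate(vm, spare)     # workload resumes on a healthy host
            break
        time.sleep(5)              # poll telemetry periodically
```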
Low- to zero-impact maintenance
These novel approaches, which include hot patching, live migration and in-place migration, aim to let Microsoft perform platform maintenance with little or no downtime for customers.
Fault injection and stress testing
I also recommend reading this one in Russinovich’s own words:
“Validating that systems will perform as designed in the face of failures is possible only by subjecting them to those failures. We’re increasingly fault injecting our services before they go to production, both at a small scale with service-specific load stress and failures, but also at regional and AZ scale with full region and AZ failure drills in our private canary regions. Our plan is to eventually make these fault injection services available to customers so that they can perform the same validation on their own applications and services.”
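While Microsoft's fault-injection services aren't yet available to customers, the underlying idea is simple enough to sketch: deliberately fail some fraction of calls to a dependency and verify that the caller's retry logic holds up. The snippet below is a generic illustration in Python, not Azure's tooling; the fault rate and helper names are invented for the example.

```python
# Generic illustration of fault injection (not Azure tooling): wrap a
# dependency call so a configurable fraction of requests fail, then
# confirm that retry logic still produces a successful result.
import random

FAULT_RATE = 0.3   # inject a failure into roughly 30% of calls

def flaky(func):
    """Decorator that randomly raises to simulate a dependency outage."""
    def wrapper(*args, **kwargs):
        if random.random() < FAULT_RATE:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper

@flaky
def fetch(key: str) -> str:
    return f"value-for-{key}"      # stand-in for a real dependency call

def fetch_with_retries(key: str, attempts: int = 5) -> str:
    # With 5 attempts at a 30% fault rate, success is overwhelmingly likely.
    for _ in range(attempts):
        try:
            return fetch(key)
        except ConnectionError:
            continue               # real code would back off before retrying
    raise RuntimeError("dependency unavailable after retries")

print(fetch_with_retries("customer-42"))
```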
Clearly, pushing reliability upward from 99.995% is a big challenge. But implicit as well as explicit in Microsoft’s promise to customers of its Azure cloud is that Microsoft’s size, scale, technological expertise and financial resources will shield those customers from the disruptive chaos of modern enterprise technology.
And if Microsoft intends to retain its #1 spot in the Cloud Wars, all of its cloud customers, both those affected by the three "incidents" and those who weren't, will be demanding that the plans outlined by Russinovich become reality.
And that they do so quickly.