Microsoft's latest cloud authentication outage: What went wrong

Microsoft published a preliminary analysis of the root cause of its Azure Active Directory outage on March 15, which brought down Office, Teams, Dynamics 365, Xbox Live and other Microsoft and third-party applications that rely on Azure AD for authentication. The approximately 14-hour outage affected a “subset” of Microsoft customers worldwide, officials said.

Microsoft’s preliminary analysis of the incident, published on March 16 on its Azure Status History page, indicated that “there was an error in the rotation of the keys used to support the use of Azure AD OpenID and other standard identity protocols for cryptographic signing operations.”

Officials said that as part of normal security practices, an automated system removes keys that are no longer in use. In recent weeks, however, one key had been marked as “retained” for longer than normal to support a complex cross-cloud migration. This exposed a bug that caused the retained key to be removed. Microsoft publishes metadata about the signing keys to a global location, its analysis notes. Once that metadata changed at around 3 p.m. ET (the start of the outage), applications using these protocols with Azure AD began picking up the new metadata and stopped trusting tokens and assertions signed with the removed key.
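The failure mode described above can be illustrated with a minimal sketch. It is not Microsoft's implementation: the key names, the `sign` and `is_trusted` helpers and the use of HMAC as a stand-in for the asymmetric signatures used by real identity protocols are all assumptions made for illustration. The point is only that a token which was valid a moment ago becomes untrusted as soon as its key disappears from the published metadata.

```python
import hashlib
import hmac


def sign(payload: str, kid: str, secret: bytes) -> dict:
    """Issue a toy token that records which key (kid) signed it."""
    sig = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return {"kid": kid, "payload": payload, "sig": sig}


def is_trusted(token: dict, published_keys: dict) -> bool:
    """Trust a token only if its key is still listed in the published metadata."""
    secret = published_keys.get(token["kid"])
    if secret is None:
        # The key is no longer published, so the signature cannot be checked.
        return False
    expected = hmac.new(secret, token["payload"].encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, token["sig"])


# Hypothetical key metadata before the faulty removal.
published = {"key-old": b"old-secret", "key-new": b"new-secret"}
token = sign("user=alice", "key-old", published["key-old"])
before = is_trusted(token, published)

# The retained key is removed from the metadata by mistake.
after_removal = {k: v for k, v in published.items() if k != "key-old"}
after = is_trusted(token, after_removal)

print(before, after)
```

In this sketch the token itself never changes; only the published key set does, which is why applications that picked up the new metadata began rejecting sign-ins.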

Microsoft engineers rolled the system back to its previous state at about 5 p.m. ET, but applications took a while to pick up the rolled-back, correct metadata and refresh. A subset of storage features required an update to invalidate the incorrect entries and force a refresh.
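Why recovery lagged behind the rollback can be sketched with a simple cache. Microsoft's analysis does not describe how individual applications cache key metadata, so the `CachedKeyMetadata` class, the one-hour lifetime and the timestamps below are hypothetical. The sketch assumes a client that re-fetches metadata only when its cached copy expires or is explicitly invalidated.

```python
class CachedKeyMetadata:
    """Toy client-side cache of published key IDs (illustrative only)."""

    def __init__(self, fetch, ttl_seconds):
        self._fetch = fetch          # callable returning the current key IDs
        self._ttl = ttl_seconds      # assumed cache lifetime
        self._keys = None
        self._fetched_at = None

    def keys(self, now):
        # Re-fetch only when nothing is cached or the cached copy has expired.
        if self._keys is None or now - self._fetched_at >= self._ttl:
            self._keys = set(self._fetch())
            self._fetched_at = now
        return self._keys

    def invalidate(self):
        # Drop the cached copy so the next lookup forces a refresh.
        self._keys = None


# Published metadata during the incident: the retained key is missing.
published = {"keys": {"key-new"}}
cache = CachedKeyMetadata(lambda: published["keys"], ttl_seconds=3600)
cache.keys(now=0)  # the client caches the bad metadata

# The rollback restores the removed key at the source.
published["keys"] = {"key-old", "key-new"}

stale = "key-old" in cache.keys(now=1800)      # cached copy not yet expired
cache.invalidate()
refreshed = "key-old" in cache.keys(now=1800)  # forced refresh sees the fix

print(stale, refreshed)
```

Under these assumptions, a client keeps rejecting tokens signed with the restored key until its cached copy expires or something invalidates it, which matches the delayed recovery described above.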

Microsoft’s post explains that Azure AD is undergoing a multi-phase effort to apply additional safeguards to the back-end Secure Deployment Process to prevent these types of problems. The key-removal component is in the second phase of the process, which is not scheduled to be completed until mid-year. Microsoft officials said the Azure AD authentication outage that took place at the end of September belongs to the same class of risks, which they believe will be avoided once the multi-phase project is completed.

“We understand how incredibly impactful and unacceptable this is and deeply apologize. We are continually taking steps to improve the Microsoft Azure platform and our processes to help ensure that such incidents do not occur in the future,” said the blog post.

A full analysis of the root cause will be published as soon as the investigation is completed, officials said.
