Microsoft outage caused by overloaded Azure DNS servers


Microsoft revealed that Thursday’s global outage was caused by a defect in the code that allowed the Azure DNS service to become overloaded and not respond to DNS queries.

At approximately 5:21 pm EST on Thursday, Microsoft experienced a global outage that prevented users from accessing or signing in to various services, including Xbox Live, Microsoft Office, SharePoint Online, Microsoft Intune, Dynamics 365, Microsoft Teams, Skype, Exchange Online, OneDrive, Yammer, Power BI, Power Apps, OneNote, Microsoft Managed Desktop and Microsoft Stream.

The outage was so widespread across Microsoft’s infrastructure that even the Azure status page, which is used to provide outage information, was inaccessible.

Azure status page inaccessible
Source: Twitter

Microsoft resolved the outage at approximately 6:30 pm EST, though some services took a little longer to return to normal.

At the time, Microsoft said the outage was caused by a DNS problem, but did not provide further information.

Azure DNS service became overloaded

Last night, Microsoft published a root cause analysis (RCA) for this week’s outage and explained that it was caused by an overloaded Azure DNS service.

Microsoft’s Azure DNS is a global network of redundant name servers that provide high availability and fast DNS services.

According to Microsoft, the Azure DNS service began to receive an anomalous surge of DNS queries from around the world that targeted a set of domains hosted on Azure. Although Microsoft does not explain what caused this surge, it may have been a DDoS attack targeting those domains.

Microsoft says that its DNS service can normally absorb a large number of requests through DNS caches and traffic shaping. However, a code defect prevented its DNS Edge caches from working properly.
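To illustrate why a working cache layer matters, the following is a minimal sketch of a time-to-live (TTL) cache placed in front of a slower lookup. It is not Microsoft's implementation: the class name `TtlDnsCache`, the `resolve` callback and the `ttl_seconds` parameter are all hypothetical names chosen for this example.

```python
import time


class TtlDnsCache:
    """Toy TTL cache in front of a backend lookup (hypothetical, for illustration)."""

    def __init__(self, resolve, ttl_seconds, clock=time.monotonic):
        self._resolve = resolve      # backend lookup, called only on a miss
        self._ttl = ttl_seconds      # how long an answer stays valid
        self._clock = clock          # injectable clock, useful for testing
        self._entries = {}           # name -> (answer, expires_at)
        self.hits = 0
        self.misses = 0

    def lookup(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and entry[1] > now:
            # Fresh entry: answer from the cache without touching the backend.
            self.hits += 1
            return entry[0]
        # Missing or expired entry: ask the backend and remember the answer.
        self.misses += 1
        answer = self._resolve(name)
        self._entries[name] = (answer, now + self._ttl)
        return answer
```

In this toy model, repeated queries for the same name reach the backend only once per TTL window. If the cache becomes less effective, as the RCA describes for the DNS Edge caches, more of the query volume falls through to the servers behind it.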

“Azure DNS servers experienced an anomalous increase in DNS queries worldwide targeting a set of domains hosted on Azure. Typically, the Azure caching and traffic shaping layers would mitigate this increase. In this incident, a specific sequence of events exposed a code defect in our DNS service that reduced the efficiency of our DNS Edge caches.”

“As our DNS service became overloaded, DNS clients began to retry their requests frequently, which added workload to the DNS service. Since client retries are considered legitimate DNS traffic, this traffic was not dropped by our volumetric spike mitigation systems. This increase in traffic led to decreased availability of our DNS service,” Microsoft explained in its RCA for this week’s outage.
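The feedback loop described in the RCA, where unanswered queries come back as retries and add to the load, can be sketched with a small deterministic model. This is a toy simulation under stated assumptions, not a model of Azure DNS: the function name `simulate_retry_storm` and its parameters are hypothetical, every failed query is assumed to be retried exactly one time step later, and the service is assumed to answer queries uniformly up to a fixed capacity.

```python
def simulate_retry_storm(new_qps, capacity_qps, max_retries, steps):
    """Return the offered load per time step when failed queries are retried.

    new_qps      -- fresh queries arriving each step (assumed constant)
    capacity_qps -- queries the service can answer per step
    max_retries  -- how many times a client retries before giving up
    steps        -- number of time steps to simulate
    """
    # attempts[k] holds the number of queries currently on their k-th retry.
    attempts = [0.0] * (max_retries + 1)
    history = []
    for _ in range(steps):
        attempts[0] = float(new_qps)
        offered = sum(attempts)
        answered = min(offered, capacity_qps)
        fail_fraction = 1.0 - answered / offered if offered > 0 else 0.0
        history.append(offered)
        # Failed queries move to the next retry slot; queries that have
        # already used their last retry are dropped from the model.
        attempts = [0.0] + [count * fail_fraction for count in attempts[:-1]]
    return history
```

When `new_qps` stays below `capacity_qps`, the offered load in this model never grows. Once it exceeds capacity, each step adds retries on top of the fresh queries, so the offered load climbs above the original demand even though every individual request looks legitimate.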

Since almost all Microsoft domains are resolved using Azure DNS, it was no longer possible to resolve hostnames in those domains and access associated services when the DNS service was overloaded.

For example, the xboxlive.com domain uses the following Azure DNS name servers to resolve hostnames in this domain.

NS1-205.AZURE-DNS.COM
NS2-205.AZURE-DNS.NET
NS3-205.AZURE-DNS.ORG
NS4-205.AZURE-DNS.INFO

Since xboxlive.com is hosted on Azure DNS and that service was unavailable, users were no longer able to sign in to Xbox Live.
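A short sketch of how one might check whether a domain's delegation depends entirely on Azure DNS, given a list of its name servers obtained separately. The hostname pattern is an assumption inferred from the four name servers listed above, not an official specification, and the function names are hypothetical.

```python
import re

# Assumed pattern, inferred from names such as NS1-205.AZURE-DNS.COM.
AZURE_DNS_NS = re.compile(
    r"^ns[1-4]-\d+\.azure-dns\.(com|net|org|info)\.?$",
    re.IGNORECASE,
)


def is_azure_dns_name_server(host):
    """Return True if the hostname looks like an Azure DNS name server."""
    return AZURE_DNS_NS.match(host.strip()) is not None


def depends_only_on_azure_dns(name_servers):
    """Return True if every listed name server looks like Azure DNS."""
    servers = list(name_servers)
    return bool(servers) and all(is_azure_dns_name_server(ns) for ns in servers)


XBOXLIVE_NAME_SERVERS = [
    "NS1-205.AZURE-DNS.COM",
    "NS2-205.AZURE-DNS.NET",
    "NS3-205.AZURE-DNS.ORG",
    "NS4-205.AZURE-DNS.INFO",
]
```

A domain for which `depends_only_on_azure_dns` returns true has no name server outside the affected service, which is consistent with the article's explanation of why hostnames in xboxlive.com could not be resolved during the outage.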

To avoid this type of outage in the future, Microsoft says it is fixing the code defect in Azure DNS so that the DNS cache can properly handle large volumes of requests. It also plans to improve the monitoring and mitigation of anomalous traffic.

BleepingComputer contacted Microsoft to find out more about this anomalous spike, but had not received a response at the time of publication.
