Oct 20th AWS Outage

AWS had an outage on Oct 20th, caused by a routine update for DynamoDB that disrupted internal DNS services.

Root Cause

  • The problem began when a routine API update for DynamoDB inadvertently disrupted internal DNS services, which translate domain names into IP addresses for communication between AWS's internal services.
  • This caused widespread DNS lookup failures, preventing various AWS components from locating the DynamoDB endpoints required to authenticate and store data (see the sketch after this list).
  • AWS confirmed that the fault was entirely internal, not due to cyberattacks or external interference.
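To make the failure mode concrete, here is a minimal Python sketch (not AWS code) of what a client sees when an endpoint's DNS records are gone: the lookup itself raises, so the caller never even gets to open a connection. The hostname is the real public regional endpoint; the retry helper and its parameters are illustrative assumptions.

```python
import socket
import time

def resolve_with_retry(hostname: str, attempts: int = 3, backoff_s: float = 1.0):
    """Resolve a hostname, retrying with simple exponential backoff.

    When the authoritative DNS records are missing (as during the
    outage), every attempt raises socket.gaierror and the caller
    ultimately sees a hard failure, not a degraded response.
    """
    for attempt in range(1, attempts + 1):
        try:
            # getaddrinfo performs the same lookup an SDK does before
            # it can open a connection to the endpoint.
            return socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        except socket.gaierror as err:
            if attempt == attempts:
                raise RuntimeError(f"DNS lookup failed for {hostname}") from err
            time.sleep(backoff_s * 2 ** (attempt - 1))

# The public regional endpoint; internal AWS services resolve
# analogous internal names the same way.
resolve_with_retry("dynamodb.us-east-1.amazonaws.com")
```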

Problems like this come from where we least expect them. Systems are only as reliable as the weakest link, and usually that is a human somewhere …

More analysis:

https://www.perplexity.ai/search/how-long-was-the-aws-outage-on-dkLfX4nCSgeepjSyuD3RFg?0=d#0

Sometimes tribal knowledge helps in cases like this: folks who know where things can go wrong can debug them quickly. But not always.

Distributed systems are hard to grapple with.

Detailed analysis from AWS

The root issue was a race condition in DynamoDB's automated DNS management system, which resulted in the DNS records for the regional endpoint being deleted and left unrecoverable by automation, requiring manual intervention. The sketch below replays that interleaving.
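Here is a deterministic toy reconstruction of the race, with the plan-apply-then-cleanup flow AWS described reduced to a few lines. All names and the plan-numbering scheme are illustrative assumptions, not the real system's; the point is the check-then-act gap between installing a plan and garbage-collecting "older" ones.

```python
# Toy reconstruction of the interleaving AWS described; everything
# here is illustrative, not the real system's code.

records = {"dynamodb.region": "plan-100"}  # live DNS plan for the endpoint

def apply_plan(plan: str) -> None:
    """Enactor step: unconditionally install a plan as the live record."""
    records["dynamodb.region"] = plan

def cleanup_older_than(plan: str) -> None:
    """Enactor step: delete the live record if it is older than `plan`.

    Safe only if no stale apply can land between the newer apply and
    this cleanup -- the check-then-act gap that bit the real system.
    """
    live = records.get("dynamodb.region")
    if live is not None and live < plan:
        del records["dynamodb.region"]

# The race, replayed step by step:
apply_plan("plan-200")          # fast enactor installs the new plan
apply_plan("plan-100")          # slow enactor's stale apply finally lands
cleanup_older_than("plan-200")  # fast enactor's GC sees an "old" plan...
print(records)                  # {} -- the endpoint's records are gone
```

Once the record set is empty, the automation has nothing left to reconcile against, which is why recovery required manual intervention.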

These are the hard things. It is amazing that even in the most redundant/reliable systems in the world, you still eventually come back to some resource that is difficult to make redundant. Simplicity is the only reliable solution.