AT2k Design BBS Message Area
Casually read the BBS message area using an easy-to-use interface. Messages are categorized exactly as they are on the BBS. You may post new messages or reply to existing messages!
From:      VRSS
To:        All
Subject:   A Single Point of Failure Triggered the Amazon Outage Affecting Millions
Date/Time: October 24, 2025 5:40 PM
Feed: Slashdot
Feed Link: https://slashdot.org/

Title: A Single Point of Failure Triggered the Amazon Outage Affecting Millions
Link: https://slashdot.org/story/25/10/24/2212255/a...

An anonymous reader quotes a report from Ars Technica: The outage that hit Amazon Web Services and took out vital services worldwide was the result of a single failure that cascaded from system to system within Amazon's sprawling network, according to a post-mortem from company engineers. [...] Amazon said the root cause of the outage was a software bug in the software running the DynamoDB DNS management system. The system monitors the stability of load balancers by, among other things, periodically creating new DNS configurations for endpoints within the AWS network.

A race condition is an error that makes a process dependent on the timing or sequence of events that are variable and outside the developers' control. The result can be unexpected behavior and potentially harmful failures. In this case, the race condition resided in the DNS Enactor, a DynamoDB component that constantly updates domain lookup tables in individual AWS endpoints to optimize load balancing as conditions change. As the enactor operated, it "experienced unusually high delays needing to retry its update on several of the DNS endpoints." While that enactor was playing catch-up, a second DynamoDB component, the DNS Planner, continued to generate new plans, and a separate DNS Enactor began to implement them. The timing of these two enactors triggered the race condition, which ended up taking out the entire DynamoDB. [...]

The failure caused systems that relied on DynamoDB in Amazon's US-East-1 regional endpoint to experience errors that prevented them from connecting. Both customer traffic and internal AWS services were affected. The damage resulting from the DynamoDB failure then put a strain on Amazon's EC2 services located in the US-East-1 region. The strain persisted even after DynamoDB was restored, as EC2 in this region worked through a "significant backlog of network state propagations [that] needed to be processed." The engineers went on to say: "While new EC2 instances could be launched successfully, they would not have the necessary network connectivity due to the delays in network state propagation." In turn, the delay in network state propagations spilled over to a network load balancer that AWS services rely on for stability. As a result, AWS customers experienced connection errors from the US-East-1 region. AWS network functions affected included creating and modifying Redshift clusters, Lambda invocations, and Fargate task launches, as well as Managed Workflows for Apache Airflow, Outposts lifecycle operations, and the AWS Support Center.

Amazon has temporarily disabled its DynamoDB DNS Planner and DNS Enactor automation globally while it fixes the race condition and adds safeguards against incorrect DNS plans. Engineers are also updating EC2 and its network load balancer.

Further reading: Amazon's AWS Shows Signs of Weakness as Competitors Charge Ahead

Read more of this story at Slashdot.

---
VRSS v2.1.180528
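
To make the race condition concrete, below is a minimal Python sketch of the failure mode described above. All names (DnsPlan, EndpointTable, apply_unsafe, apply_checked) are hypothetical illustrations, not AWS code: two enactor threads apply DNS plans to a shared endpoint table, and without an ordering check a delayed enactor carrying an old plan overwrites the newer configuration; checking a plan generation number before writing drops the stale update instead.

import threading
import time
from dataclasses import dataclass


@dataclass
class DnsPlan:
    generation: int   # monotonically increasing plan number from the planner
    records: dict     # hostname -> list of IP addresses in this plan


class EndpointTable:
    """Shared DNS state for a single endpoint."""

    def __init__(self):
        self.lock = threading.Lock()
        self.applied_generation = 0
        self.records = {}

    def apply_unsafe(self, plan, delay):
        # Buggy enactor: no ordering check, so a delayed write carrying an
        # old plan silently overwrites a newer one (last writer wins).
        time.sleep(delay)              # simulate the "unusually high delays"
        with self.lock:
            self.records = plan.records
            self.applied_generation = plan.generation

    def apply_checked(self, plan, delay):
        # Safer enactor: refuse to apply any plan older than what is live.
        time.sleep(delay)
        with self.lock:
            if plan.generation <= self.applied_generation:
                return                 # stale plan, drop it
            self.records = plan.records
            self.applied_generation = plan.generation


def demo(apply):
    table = EndpointTable()
    old_plan = DnsPlan(1, {"db.example.internal": ["10.0.0.1"]})
    new_plan = DnsPlan(2, {"db.example.internal": ["10.0.0.2"]})

    # The slow enactor picked up the old plan first but finishes last.
    slow = threading.Thread(target=apply, args=(table, old_plan, 0.2))
    fast = threading.Thread(target=apply, args=(table, new_plan, 0.0))
    slow.start(); fast.start(); slow.join(); fast.join()
    print(apply.__name__, "->", table.records)


if __name__ == "__main__":
    demo(EndpointTable.apply_unsafe)   # typically left serving the stale plan
    demo(EndpointTable.apply_checked)  # keeps the newer plan

Running it, apply_unsafe usually leaves the endpoint pointing at the stale 10.0.0.1 record, while apply_checked keeps the newer plan; the post-mortem's mention of adding safeguards against incorrect DNS plans points at protections of this general kind.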
