AT2k Design BBS Message Area

From: VRSS
To: All
Subject: Google Cloud Caused Outage By Ignoring Its Usual Code Quality Pr
Date/Time: June 16, 2025 8:00 PM

Feed: Slashdot
Feed Link: https://slashdot.org/
---

Title: Google Cloud Caused Outage By Ignoring Its Usual Code Quality
Protections

Link: https://tech.slashdot.org/story/25/06/16/2141...

Google Cloud has attributed last week's widespread outage to a flawed code
update in its Service Control system that triggered a global crash loop due
to missing error handling and lack of feature flag protection. The Register
reports: Google's explanation of the incident opens by informing readers that
its APIs, and Google Cloud's, "are served through our Google API management
and control planes." Those two planes are distributed regionally and "are
responsible for ensuring each API request that comes in is authorized, has
the policy and appropriate checks (like quota) to meet their endpoints." The
core binary that is part of this policy check system is known as "Service
Control." On May 29, Google added a new feature to Service Control, to enable
"additional quota policy checks." "This code change and binary release went
through our region by region rollout, but the code path that failed was never
exercised during this rollout due to needing a policy change that would
trigger the code," Google's incident report explains. The search monopolist
appears to have had concerns about this change as it "came with a red-button
to turn off that particular policy serving path." But the change "did not
have appropriate error handling nor was it feature flag protected. Without
the appropriate error handling, the null pointer caused the binary to crash."
Google uses feature flags to catch issues in its code. "If this had been flag
protected, the issue would have been caught in staging." That unprotected
code ran inside Google until June 12th, when the company changed a policy
that contained "unintended blank fields." Here's what happened next: "Service
Control, then regionally exercised quota checks on policies in each regional
datastore. This pulled in blank fields for this respective policy change and
exercised the code path that hit the null pointer causing the binaries to go
into a crash loop. This occurred globally given each regional deployment."
Google's post states that its Site Reliability Engineering team saw and
started triaging the incident within two minutes, identified the root cause
within 10 minutes, and was able to commence recovery within 40 minutes. But
in some larger Google Cloud regions, "as Service Control tasks restarted, it
created a herd effect on the underlying infrastructure it depends on ...
overloading the infrastructure." Service Control wasn't built to handle this,
which is why it took almost three hours to resolve the issue in its larger
regions. The teams running Google products that went down due to this mess
then had to perform their own recovery chores. Going forward, Google has
promised a couple of operational changes to prevent this mistake from
happening again: "We will improve our external communications, both automated
and human, so our customers get the information they need asap to react to
issues, manage their systems and help their customers. We'll ensure our
monitoring and communication infrastructure remains operational to serve
customers even when Google Cloud and our primary monitoring products are
down, ensuring business continuity."
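
To make the "feature flag" and "error handling" points concrete, here is a
minimal Python sketch. It is not Google's code: the names QuotaPolicy, FLAGS,
and check_quota are hypothetical, and the choice to allow the request when a
field is blank is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    """Hypothetical policy record; limit=None models a blank field."""
    name: str
    limit: Optional[int]


# Hypothetical flag store. The new code path stays off until the flag is set.
FLAGS = {"extra_quota_checks": False}


def check_quota(policy: QuotaPolicy, usage: int) -> bool:
    """Return True if the request is allowed under the policy."""
    if not FLAGS["extra_quota_checks"]:
        # Flag off: the new check is skipped and behavior is unchanged.
        return True
    if policy.limit is None:
        # Blank field: handle it explicitly instead of crashing.
        # Allowing the request here is an illustrative assumption.
        return True
    return usage <= policy.limit
```

With the flag off, a policy containing a blank field never reaches the new
code path. With the flag on, the blank field is handled as an ordinary case
rather than an unhandled error.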
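
The "herd effect" described above happens when many tasks restart and hit a
shared dependency at the same moment. One common mitigation is randomized
exponential backoff. The summary does not say what Google uses, so the Python
sketch below is only an illustration; backoff_delays and its parameters are
hypothetical names.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    The ceiling doubles with each attempt up to `cap`, and each delay is
    drawn uniformly between 0 and that ceiling, so restarting tasks spread
    out instead of retrying in lockstep.
    """
    if rng is None:
        rng = random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

A restarting task would sleep for each delay in turn before retrying its
dependency. Because every task draws its own random delays, the retries
arrive spread over time rather than all at once.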

Read more of this story at Slashdot.

---
VRSS v2.1.180528