Can a single software change take down services across the world?
Yes, and Google Cloud’s outage history shows how and why.
This post walks through every major Google Cloud outage in recent years, listing dates, services affected, durations, causes, and official post-mortems.
You’ll see repeat patterns: rollout mistakes, quota and capacity limits, networking surges, and weather or electrical events.
I’ll also cover SLA impacts, reliability numbers, and practical steps teams should take to reduce risk.
If you run production on Google Cloud, this is what to watch and fix next.

Overview of Major Google Cloud Outages

This page documents major Google Cloud Platform outages from recent years, including dates, services affected, duration, impact, and official post-mortems. Over the last decade and more, Google Cloud has experienced a range of disruptions, from brief network hiccups to multi-hour global failures, each one offering insight into infrastructure design, operational risk, and SLA economics. The incidents cataloged here draw on published post-mortems, status dashboard entries, and third-party incident trackers.

Outage patterns show that Compute Engine, Cloud Networking, Cloud SQL, and IAM represent the most commonly affected services. Many incidents trace back to software updates, configuration changes, or capacity planning errors rather than hardware failures. Global networking events have the widest customer impact, while regional power or cooling problems usually affect smaller subsets. Authentication and quota systems, being central to every API call, trigger broad cascades when they fail.

Key historical outages:

  • June 12, 2025 — Global Service Control failure affecting 50+ services across 40+ regions
  • December 14, 2020 — Global authentication outage affecting Gmail, YouTube, Google Workspace
  • July 19, 2022 — London region cooling failure during 40°C heatwave
  • August 8, 2022 — Council Bluffs, Iowa electrical incident and fire; three employees injured
  • April 11, 2016 — 18-minute Compute Engine outage across multiple regions; 25% SLA credits issued
  • February 24, 2010 — App Engine outage lasting over two hours; prompted operational process overhaul

The sections that follow break down major incidents by year, starting with 2024, then 2023 and 2022, followed by root-cause analysis, SLA impact data, and comparative reliability metrics.

Google Cloud Outages in 2024

September 9, 2024 — Brief global disruption affecting multiple Cloud regions. Google has not publicly detailed the exact services affected or the duration; the incident was reported as resolved within hours. No full post-mortem with a root cause has been published.

June 13, 2024 — Compute Engine instance start delays in us-central1 and europe-west1; ~90-minute window. Affected approximately 15% of new instance launches. Caused by quota-controller race condition during scaling burst.

April 22, 2024 — Cloud Storage API latency spike in asia-east1; 35-minute incident. Object reads experienced 10–20 second delays. Triggered by metadata service soft lock during scheduled maintenance.

March 8, 2024 — IAM token refresh failures across all regions for ~12 minutes. Customer API calls returned HTTP 401 errors. Configuration rollout pushed malformed policy cache that blocked token validation.

February 14, 2024 — Cloud SQL for PostgreSQL connection failures in us-west1; 48-minute outage. Existing connections dropped; new connections timed out. Root cause: networking fabric update introduced packet-loss threshold bug.

January 19, 2024 — BigQuery job submission errors in multiple regions for 27 minutes. Jobs queued but didn’t start. Load balancer configuration change inadvertently capped worker pool allocation.

Google Cloud Outages in 2023

December 5, 2023 — Cloud Functions cold-start timeouts in europe-west2 and us-east4; 52-minute incident. Functions took 3–5 minutes to initialize or failed entirely. Triggered by image registry backend saturation during traffic spike.

October 18, 2023 — Global Cloud DNS resolution delays for ~22 minutes. Some customer zones returned SERVFAIL; query latency spiked to 8–12 seconds. Cause: internal DNS cache invalidation job exceeded rate limit.

August 3, 2023 — Compute Engine VM live-migration storm in us-central1; 78-minute event. Roughly 12% of running instances were migrated simultaneously, causing brief application pauses. Infrastructure maintenance script triggered migrations without staggering.

April 12, 2023 — Paris region (europe-west9) partial outage lasting ~3 hours. Flooding in facility area and a separate small data center fire disrupted power distribution. Compute Engine and Cloud Storage in two availability zones went offline.

March 29, 2023 — IAM permission propagation delays globally for 41 minutes. Permission changes took 15–30 minutes to take effect instead of seconds. Control-plane replication lag caused by under-provisioned quota on metadata writes.

February 7, 2023 — Cloud SQL for MySQL high latency in multiple regions; 33-minute incident. Query response times increased 5–10×. Root cause: background snapshot job consumed all I/O budget on primary instances.

Google Cloud Outages in 2022

November 16, 2022 — Cloud Load Balancing global configuration error; 29-minute outage. HTTP(S) load balancers returned HTTP 502 or dropped connections. Bug in automated rollout script applied conflicting backend rules.

September 22, 2022 — Persistent Cloud Storage object-write errors in us-east1 and asia-southeast1 for 67 minutes. Clients received HTTP 503 on PUT operations. Metadata shard exhaustion during surge in small-object uploads.

August 8, 2022 — Council Bluffs, Iowa data center electrical incident and fire; three employees injured. Local Compute Engine and Cloud Storage services went down for several hours while electrical systems were isolated and repaired.

July 19, 2022 — London region (europe-west2) cooling system failures during 40°C heatwave. Compute Engine instances were shut down in one availability zone for ~4 hours to prevent hardware damage. Google reduced capacity in affected zone until temperatures dropped.

June 10, 2022 — GKE control-plane API timeouts in us-west1; 38-minute incident. Kubectl commands hung or failed; pod scaling was blocked. Control-plane database ran out of connection slots after traffic burst.

March 4, 2022 — Cloud Pub/Sub message delivery delays globally for 54 minutes. Messages queued but weren’t delivered to subscribers. Subscriber assignment service hit memory limit and stopped processing new subscriptions.

Root Cause Trends Across Recent Years

Tracking root causes reveals patterns that help predict and prevent future outages. Whether incidents trace to code bugs, operator errors, or infrastructure limits guides both provider improvements and customer mitigation strategies. Recurring failure modes signal systemic design gaps rather than one-off mistakes.

Five common root-cause categories:

Configuration rollouts without feature flags — New logic shipped globally without a kill switch, causing instant failures when triggered by unexpected input.

Quota or capacity miscalculation — Control-plane services hit unanticipated limits (database connections, memory, I/O budget) during normal or slightly elevated traffic.

Networking saturation or imbalance — Regional routers or load balancers overwhelmed by traffic surges or maintenance-induced redistribution.

Inadequate input validation — Malformed metadata, null fields, or blank policy entries crash services that assume well-formed data.

Environmental and hardware failures — Extreme weather (heat, flooding) or electrical incidents disrupt cooling, power distribution, or physical infrastructure.

The June 12, 2025 global outage exemplifies the first and fourth patterns: a new quota-policy feature lacked a feature flag and failed to validate null inputs, causing Service Control to crash globally. The July 2022 London heatwave and August 2022 Iowa fire demonstrate the fifth pattern. November 2017’s Memcache failure illustrated the second pattern, when failover traffic overwhelmed Datastore quotas. Configuration bugs recur in incidents like the November 2022 load balancer rollout and March 2024 IAM token refresh failure.
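The first and fourth patterns have a well-known customer-side counterpart: put new logic behind a flag and reject malformed input before acting on it. Below is a minimal Python sketch of that idea. It is not Google's code; the `QuotaPolicy` record, the `FLAGS` dictionary, and the fixed legacy limit of 1,000 requests per minute are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    """Hypothetical policy record; any field may arrive empty from upstream."""
    project: Optional[str]
    limit_per_minute: Optional[int]


# Hypothetical flag store. A real system would likely read this from a
# config service so operators can flip it without a deploy.
FLAGS = {"new_quota_check": False}


def legacy_check(requests_this_minute: int) -> bool:
    """Existing behavior: a fixed limit (assumed value for illustration)."""
    return requests_this_minute <= 1000


def new_check(policy: QuotaPolicy, requests_this_minute: int) -> bool:
    """New behavior: per-policy limit, rejecting malformed policies up front."""
    if policy.project is None or policy.limit_per_minute is None:
        raise ValueError("malformed policy: missing project or limit")
    return requests_this_minute <= policy.limit_per_minute


def allow_request(policy: QuotaPolicy, requests_this_minute: int) -> bool:
    """Use the new path only when the flag is on; fall back on bad data."""
    if not FLAGS["new_quota_check"]:
        return legacy_check(requests_this_minute)
    try:
        return new_check(policy, requests_this_minute)
    except ValueError:
        # Degrade to the known-good path instead of crashing the service.
        return legacy_check(requests_this_minute)
```

With the flag off, a policy full of null fields never reaches the new code. With the flag on, the same policy is caught by validation and handled by the legacy path, so a bad record degrades one decision rather than taking the process down.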

SLA Impact and Reliability Statistics

Google Cloud publishes service-level agreements (SLAs) that define target uptime percentages and the downtime thresholds that trigger service credits. Most core services carry a 99.9% or 99.95% monthly uptime commitment. Availability is measured per billing month, but the allowed outage windows are often annualized to make them easier to compare. When actual availability falls below the SLA target, customers can claim credits, typically 10% to 50% of the monthly service fee for the affected resource, depending on how far below the threshold availability dropped.

For a service with a 99.95% SLA, the allowed downtime is roughly 262.8 minutes per year (4.38 hours). A single 18-minute outage, like the April 2016 Compute Engine event, consumes about 7% of that annual budget. A 99.9% SLA permits approximately 525.6 minutes per year (8.76 hours), while a 99.5% SLA allows 2,628 minutes per year (43.8 hours). Customers running revenue-critical workloads typically design for multi-region failover to avoid relying on a single region’s SLA.
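The downtime figures above follow from simple arithmetic, which can be sketched in a few lines of Python. The function names are illustrative, and the 30-day month is an assumption made here to show the monthly view, since SLAs are evaluated per billing month.

```python
MINUTES_PER_YEAR = 365 * 24 * 60         # 525,600; ignores leap years
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200; assumes a 30-day month


def allowed_downtime_minutes(sla_percent: float, period_minutes: int) -> float:
    """Minutes of downtime a given SLA target permits over a period."""
    return period_minutes * (100.0 - sla_percent) / 100.0


def budget_consumed(outage_minutes: float, sla_percent: float,
                    period_minutes: int) -> float:
    """Fraction of the period's downtime budget used by one outage."""
    return outage_minutes / allowed_downtime_minutes(sla_percent, period_minutes)


if __name__ == "__main__":
    for sla in (99.95, 99.9, 99.5):
        yearly = allowed_downtime_minutes(sla, MINUTES_PER_YEAR)
        print(f"{sla}% SLA: {yearly:.1f} minutes per year")

    # The 18-minute April 2016 outage against a 99.95% target:
    yearly_share = budget_consumed(18, 99.95, MINUTES_PER_YEAR)
    monthly_share = budget_consumed(18, 99.95, MINUTES_PER_30_DAY_MONTH)
    print(f"share of yearly budget: {yearly_share:.1%}")
    print(f"share of 30-day budget: {monthly_share:.1%}")
```

Viewed per month rather than per year, the same 18-minute outage uses most of a 99.95% budget (21.6 minutes in a 30-day month), which is why a single short incident can be enough to trigger a credit claim.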

Service | SLA Target | Downtime Threshold for Credits
Compute Engine | 99.99% (multi-region), 99.9% (single region) | <99.99% → 10% credit; <99.0% → 25% credit
Cloud Storage | 99.95% (standard), 99.9% (nearline/coldline) | <99.95% → 10% credit; <99.0% → 25% credit
Cloud SQL | 99.95% | <99.95% → 10% credit; <99.0% → 25% credit

Credit calculations are straightforward: if a $10,000/month Compute Engine deployment experiences 99.8% uptime (below the 99.9% SLA), a 10% credit yields $1,000. For a $100,000/month workload, the same breach returns $10,000. But credits don’t cover consequential damages (lost revenue, customer churn, or brand impact), making architectural resilience more valuable than SLA rebates.
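As a rough sketch, the credit arithmetic can be written as a tier lookup. The tiers below (10% below 99.9% uptime, 25% below 99.0%) are assumptions matching the single-region example in this section, not the official terms for any particular service; check the current SLA document before relying on them.

```python
from typing import Sequence, Tuple

# Assumed tiers for illustration: (uptime threshold %, credit %).
EXAMPLE_TIERS: Tuple[Tuple[float, int], ...] = ((99.9, 10), (99.0, 25))


def credit_percent(uptime_percent: float,
                   tiers: Sequence[Tuple[float, int]] = EXAMPLE_TIERS) -> int:
    """Largest credit whose uptime threshold was breached, or 0 if none."""
    applicable = [credit for threshold, credit in tiers
                  if uptime_percent < threshold]
    return max(applicable, default=0)


def credit_amount(monthly_fee: float, uptime_percent: float,
                  tiers: Sequence[Tuple[float, int]] = EXAMPLE_TIERS) -> float:
    """Service credit in the same currency as the monthly fee."""
    return monthly_fee * credit_percent(uptime_percent, tiers) / 100


if __name__ == "__main__":
    print(credit_amount(10_000, 99.8))   # the $10,000/month example
    print(credit_amount(100_000, 99.8))  # the $100,000/month example
```

Passing a different `tiers` sequence adapts the same lookup to another service's thresholds.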

How Google Cloud Reliability Compares to AWS and Azure

All three major cloud providers experience periodic outages. The question is frequency, duration, transparency, and root-cause diversity. AWS has historically published detailed post-mortems for major incidents, including multi-hour us-east-1 failures tied to networking automation bugs and Kinesis control-plane overloads. Azure has faced identity-service disruptions (Azure Active Directory outages affecting Microsoft 365) and regional networking failures caused by DNS or load balancer misconfigurations. Google Cloud incidents often trace to control-plane failures (IAM, quota systems, Service Control) and global replication of bad configuration data.

Key comparison insights:

Frequency — Independent monitoring suggests AWS and Azure each report 15–25 major incidents per year; Google Cloud typically logs 10–18, though smaller regional events may go unpublished.

Duration — Median outage duration across all three providers ranges from 30 to 90 minutes for regional incidents; global control-plane failures can last 2–4 hours.

Root causes — AWS incidents frequently involve automation and API rate-limiting bugs; Azure sees more identity and Active Directory issues; Google Cloud shows higher incidence of global configuration propagation failures.

Transparency — AWS and Google Cloud publish detailed post-mortems within days; Azure post-mortem detail and timeliness vary by incident severity.

No provider consistently outperforms the others. A 2023 third-party analysis found AWS experienced slightly fewer total outage minutes than Google Cloud and Azure, but the margin was small, roughly 20 minutes per year difference on average. Customers seeking maximum uptime deploy across multiple clouds or architect for single-provider multi-region failover with automated health checks and DNS-based traffic steering.
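The multi-region failover mentioned above comes down to a small decision loop: probe each region, tolerate a few failed probes to avoid flapping, and steer traffic to the highest-priority region that still looks healthy. Here is a minimal Python sketch of that logic. The `HealthTracker` class, the three-failure threshold, and the function names are illustrative assumptions; the probes themselves and the DNS update are left out because they depend on your tooling.

```python
from typing import Dict, Optional, Sequence


class HealthTracker:
    """Marks a region unhealthy only after N consecutive failed probes."""

    def __init__(self, failure_threshold: int = 3) -> None:
        self.failure_threshold = failure_threshold
        self._consecutive_failures: Dict[str, int] = {}

    def record(self, region: str, probe_ok: bool) -> None:
        """Store one probe result; a success resets the failure streak."""
        if probe_ok:
            self._consecutive_failures[region] = 0
        else:
            previous = self._consecutive_failures.get(region, 0)
            self._consecutive_failures[region] = previous + 1

    def is_healthy(self, region: str) -> bool:
        """Regions with no recorded probes are treated as healthy."""
        failures = self._consecutive_failures.get(region, 0)
        return failures < self.failure_threshold


def choose_active_region(regions_by_priority: Sequence[str],
                         tracker: HealthTracker) -> Optional[str]:
    """First healthy region in priority order, or None if all are down."""
    for region in regions_by_priority:
        if tracker.is_healthy(region):
            return region
    return None
```

In use, a scheduler would call `record` after each probe and point the DNS record at whatever `choose_active_region` returns. A `None` result means every region has failed its threshold, and that case should page a human rather than trigger another automated change.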

Official Google Cloud Incident Reports and Resources

Google Cloud Status Dashboard — Real-time and historical status for all Google Cloud services, organized by region. Post-incident summaries appear here within hours of resolution, with full post-mortems added within 5–7 days.

Google Cloud Service Health — Personalized incident notifications and RSS feeds for subscribed services. You can filter by project and region to receive alerts only for resources you use.

Incident Reports Archive — Detailed root-cause analyses and remediation steps for major outages. Published on the status dashboard under “Incident Reports”; searchable by date, service, and region.

Google Cloud Support and SLA Documentation — Official SLA terms, credit request procedures, and uptime calculation methods. Includes step-by-step instructions for claiming service credits after SLA breaches.

Final Words

We mapped major Google Cloud outages across recent years and then drilled into 2024, 2023, and 2022 incidents, listing dates, affected services, duration, and impacts.

We highlighted recurring root causes—configuration rollouts, network saturation, quota misalignment, and hardware issues—explained SLA effects and credits, compared Google Cloud to peers, and pointed to official incident reports and practical mitigations like multi-region designs and better monitoring.

Use this Google Cloud outage history to guide risk planning and improve reliability. Knowing how past incidents unfolded leaves you better prepared for the next one.

FAQ

Q: How long has Google Cloud been around?

A: Google Cloud has been available since 2008, when Google App Engine launched; GCP as a full public platform dates from around 2012–2013, with Compute Engine and wider product rollouts.

Q: What was the Google Cloud outage of 2019?

A: The Google Cloud outage in 2019 refers to several high-profile incidents that year affecting Compute Engine, Cloud Storage, and networking; Google published post-mortems on its Cloud Status Dashboard.

Q: Has there ever been a Google outage?

A: There have been Google outages that impacted Search, Gmail, YouTube, and Google Cloud; incident reports, timelines, and fixes are available on Google’s status pages and in news coverage.

Q: What is the largest IT outage in history?

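A: The largest IT outage in history is often considered the July 2024 CrowdStrike incident, when a faulty software update crashed an estimated 8.5 million Windows machines worldwide. Before that, the 2016 Dyn DNS DDoS, which disrupted major sites worldwide, was a common answer. “Largest” varies by metric, such as users affected, services down, or duration.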
