What Major Outages Actually Teach Us - Opengear
What Major Outages Actually Teach Us

Every major outage exposes an assumption that held under normal conditions and broke under stress. The assumption is rarely the obvious one. That’s what makes these outages worth studying carefully rather than cataloging. 

The industry only learns from incidents like these when the companies involved choose to share what happened. That choice is not automatic, and it is not easy. When operators at Cloudflare’s scale publish detailed post-mortems, every network engineering team benefits from the analysis. This post is written in that spirit. The goal is to extract patterns that help all of us build more resilient infrastructure, not to grade anyone’s response. 

IT and networking issues accounted for roughly 23% of impactful outages in 2024, according to Uptime Institute’s Annual Outage Analysis. And 80% of data center operators believe better management and processes would have prevented their most recent significant outage. That’s not a hardware problem. It’s an operational one. 

Two outages from 2025 show why. 

__________________________________________________________________________________

Cloudflare, November 18: a routine change that reached everywhere before anyone could stop it

Last year, Cloudflare published one of the most transparent outage analyses the industry has seen, and the lessons in this section come directly from the clarity of their write-up. That kind of openness is what makes shared learning possible.

The November 18 incident started with a routine database permissions change, the kind of operational action that happens constantly in large infrastructure environments. That change caused a configuration file used by Cloudflare’s Bot Management systems to double in size unexpectedly. When the oversized file propagated across Cloudflare’s global network, it exceeded a hard limit that had lived quietly in the system as a performance optimization. That limit had never been tested at that scale because the file had never been that large. The system crashed.

The configuration file refreshes every few minutes and propagates to the entire fleet by design, because it needs to react quickly to changes in internet traffic. That same architecture meant the new version reached every machine before the issue could be identified. Recovery required stopping the propagation, deploying a known-good version of the file, and restarting affected systems across the network. Core traffic was largely restored a few hours after the initial failure. Cloudflare has since rolled out additional validation and kill-switch capabilities, which is exactly the kind of forward response the industry should be taking note of. 

The broader lessons apply to every operator running systems at scale: 

Internally generated content deserves the same validation as external input. Configuration files produced by internal systems often bypass the validation layers applied to user-facing inputs, simply because “internal” feels synonymous with “trusted.” The reality is that any content flowing into a production system at machine speed can produce surprising outcomes. Treating internal data flows with the same rigor as external ones is one of the cleanest resilience upgrades a team can make. 

Hard limits designed as performance optimizations can become unexpected boundaries. Limits written years earlier for efficiency reasons may still be in effect long after the conditions that justified them have changed. Periodically auditing those limits, and pairing them with explicit error handling, turns latent failure modes into known, managed ones. 

Global propagation systems benefit enormously from independent validation gates and instant rollback paths. The capability to disable a feature or roll back a change across the entire network without a code deployment is one of the most valuable resilience capabilities an operator can invest in. Cloudflare’s response shows exactly why.
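
The three lessons above can be sketched together in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of Cloudflare's implementation: the names `ConfigLoader`, `validate_features`, and `ConfigRejected` are hypothetical, and the limit of 200 is an arbitrary illustrative value. The sketch treats an internally generated feature list as untrusted input, turns the hard limit into an explicit rejection rather than a crash, keeps the last known-good version when validation fails, and exposes a kill switch that stops updates without a code deployment.

```python
from dataclasses import dataclass

MAX_FEATURES = 200  # illustrative hard limit; an assumption, not a real value


class ConfigRejected(Exception):
    """Raised when a candidate configuration fails validation."""


def validate_features(features, max_features=MAX_FEATURES):
    """Validate an internally generated feature list like external input."""
    if not isinstance(features, list):
        raise ConfigRejected("feature list must be a list")
    if not all(isinstance(name, str) for name in features):
        raise ConfigRejected("feature names must be strings")
    if len(features) > max_features:
        # The limit becomes an explicit, handled error instead of a crash.
        raise ConfigRejected(
            f"{len(features)} features exceeds limit of {max_features}"
        )
    if len(set(features)) != len(features):
        raise ConfigRejected("duplicate feature names")
    return features


@dataclass
class ConfigLoader:
    active: list          # last known-good configuration
    enabled: bool = True  # kill switch: set to False to freeze updates

    def apply(self, candidate):
        """Apply a candidate config; return True only if it was accepted."""
        if not self.enabled:
            return False
        try:
            self.active = validate_features(candidate)
        except ConfigRejected:
            # Keep serving the last known-good version.
            return False
        return True


loader = ConfigLoader(active=["score", "ua_entropy"])
accepted = loader.apply(["score", "ua_entropy", "score"])  # duplicate entry
print(accepted, loader.active)
```

The design choice worth noting is that a rejected file changes nothing: the loader keeps running on the previous version, so a bad input degrades freshness rather than availability.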

__________________________________________________________________________________

Starlink, July 24: distributed physical infrastructure with a centralized software dependency 

The Starlink outage is a fascinating case of architectural paradox, and the engineering team’s public acknowledgment of the root cause has given the industry useful material to learn from. 

The network has over 7,600 satellites in low Earth orbit, making it physically the most distributed internet infrastructure on (and off) the planet. And it went down globally because of a software failure in the ground-based systems that coordinate the constellation. 

Those ground-based systems handle traffic flows, satellite handoffs, load balancing, and routing. User terminals do not maintain persistent connections to individual satellites. They are constantly handed off between clusters as the constellation moves overhead, and all of that coordination runs through a central software layer. When a flawed software update propagated to those systems, the coordination layer failed. The satellites were functioning. The physical infrastructure was intact. But without the software telling the infrastructure how to operate, the distribution of the physical layer did not translate into distributed resilience. 

Network monitoring firm NetBlocks confirmed connectivity dropped to 16% of normal levels at peak. Some terminals couldn’t connect to satellites at all. Others connected to ground station infrastructure but couldn’t route traffic beyond it. The physical link existed, but the software layer couldn’t direct it. Recovery took approximately 2.5 hours, with Starlink’s VP of Engineering confirming the cause as a failure of key internal software services. 

The architectural takeaways are valuable for any team designing distributed systems: 

Physical distribution and software centralization are independent resilience properties. A network can be physically distributed across thousands of nodes and still concentrate risk in the software layer coordinating those nodes. Designing the resilience of the coordination layer with the same care as the physical layer is what turns distribution into actual redundancy. 

Blast radius is a function of deployment scope, not change size. A small update pushed to every coordination system at once can affect every endpoint simultaneously. Staggered rollouts, canary deployments, and deployment-scope controls are some of the highest-leverage investments an operator can make.

Independent management access pays back most when production is unavailable. When a production network is degraded, the tools teams need to diagnose and recover can be affected by the same failure. Building management paths that do not depend on the production network being healthy is one of the most durable resilience patterns in infrastructure engineering. 
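
The point about deployment scope can be made concrete with a short Python sketch of a staggered rollout. It is an illustration only, not how Starlink or any other operator deploys software: the function `staged_rollout`, its `deploy` and `healthy` callbacks, and the stage percentages are all hypothetical. The change goes to a small canary group first, then to progressively larger groups, and the rollout halts at the first stage that reports an unhealthy host.

```python
def staged_rollout(hosts, deploy, healthy, stage_percents=(1, 10, 50, 100)):
    """Deploy to widening stages and stop at the first unhealthy stage.

    hosts:   ordered list of host identifiers
    deploy:  callable that applies the change to one host
    healthy: callable that returns True if a host is healthy after the change
    Returns the list of hosts that received the change.
    """
    deployed = []
    for percent in stage_percents:
        # Cumulative number of hosts that should have the change by now.
        target = max(1, len(hosts) * percent // 100)
        batch = hosts[len(deployed):target]
        for host in batch:
            deploy(host)
            deployed.append(host)
        if not all(healthy(host) for host in batch):
            break  # halt: the blast radius is limited to the stages so far
    return deployed


fleet = [f"node-{i}" for i in range(100)]
touched = staged_rollout(
    fleet,
    deploy=lambda host: None,     # stand-in for a real deployment step
    healthy=lambda host: False,   # simulate a flawed update
)
print(len(touched), "of", len(fleet), "hosts received the flawed update")
```

With these assumed stages, a flawed update reaches one host out of a hundred before the rollout stops, instead of reaching all of them at once.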

__________________________________________________________________________________

The pattern beneath both incidents 

The surface causes are different. The underlying patterns are the same, and they are patterns every operator can learn from. 

Routine operational actions can surface non-routine conditions. Permissions changes and software updates happen constantly. The failures they occasionally trigger are usually latent conditions in the architecture that have been waiting for a specific input. Building systems, and runbooks, that account for this reality is where operational maturity shows up. 

Blast radius is shaped more by propagation architecture than by the fault itself. Thinking carefully about how quickly and widely a change can spread, and where the gates are, is one of the most productive resilience conversations a team can have. 

Recovery is a control problem as much as a restoration problem. In both of these incidents, the path back to normal required teams to execute control actions under pressure. The speed and confidence with which those actions can be taken depends on having reliable access to the systems that need to be touched, regardless of what is happening on the production network. 

__________________________________________________________________________________

The operational dimension most plans underweight 

The Uptime Institute’s data puts a number on the operational dimension: 85% of human error-related outages stem from staff failing to follow procedures, or from flaws in the procedures themselves. That’s not primarily a technical problem; it’s a preparation problem. Procedures written for how systems are supposed to work are less useful than procedures written for how systems actually fail. Preparation means testing under realistic conditions rather than idealized ones, and building recovery paths that don’t assume the production network is available.

Both the Cloudflare and Starlink incidents were recovered well, and both teams have been open about what they learned. That openness is what lets the rest of the industry design for the conditions these events exposed. The organizations that benefit most are the ones already thinking carefully about what they would do if normal conditions did not apply.

__________________________________________________________________________________

Interested in how independent management access can strengthen your incident response? Schedule a demo with Opengear. 
