Carl Lundgren is a Principal Technical Support Engineer at Opengear. He spends his days, and occasionally his nights, helping network engineers regain access to infrastructure they can no longer reach. It is, by his own description, a job that requires patience, a working knowledge of cellular failover protocols, and a tolerance for situations where the person on the other end of the phone is not having a good time.

He has troubleshot enough unusual configurations to know that the outages that look simple on the surface often aren’t. Most of the time, the problem requires extra digging. He keeps a mental list of questions worth asking early: What changed recently? What does the customer actually have access to right now? Is this affecting one device or many? “You’re not just fixing a problem,” he says. “You’re figuring out what the problem actually is first. Those are two different things.”

When something unusual comes in, it doesn’t sit in a queue. The team works directly with engineering and product, something Carl says has changed significantly from how support used to work. “We used to operate in silos,” he says. “Today, communication is probably the biggest key to a lot of recent successes. We’re not problem-solving in isolation.”

__________________________________________________________________________________

Cloudflare, November 18: a routine change that reached everywhere before anyone could stop it

Last year, Cloudflare published one of the most transparent outage analyses the industry has seen, and the lessons in this section come directly from the clarity of that write-up. That kind of openness is what makes shared learning possible.

The November 18 incident started with a routine database permissions change. The kind of operational action that happens constantly in large infrastructure environments. That change caused a configuration file used by Cloudflare’s Bot Management systems to double in size unexpectedly. When the oversized file propagated across Cloudflare’s global network, it exceeded a hard limit that had lived quietly in the system as a performance optimization. A limit that had never been tested at that scale because the file had never been that large. The system crashed.

The configuration file refreshes every few minutes and propagates to the entire fleet by design, because it needs to react quickly to changes in internet traffic. That same architecture meant the new version reached every machine before the issue could be identified. Recovery required stopping the propagation, deploying a known-good version of the file, and restarting affected systems across the network. Core traffic was largely restored a few hours after the initial failure. Cloudflare has since rolled out additional validation and kill-switch capabilities, which is exactly the kind of forward-looking response the industry should take note of.

The broader lessons apply to every operator running systems at scale: 

Internally generated content deserves the same validation as external input. Configuration files produced by internal systems often bypass the validation layers applied to user-facing inputs, simply because “internal” feels synonymous with “trusted.” The reality is that any content flowing into a production system at machine speed can produce surprising outcomes. Treating internal data flows with the same rigor as external ones is one of the cleanest resilience upgrades a team can make. 
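To make that concrete, here is a minimal Python sketch of the idea: an internally generated feature file has to pass the same kind of checks an external upload would before it is allowed anywhere near production. The limits, field names, and file format are illustrative assumptions, not a description of Cloudflare's actual pipeline.

    import json

    MAX_FEATURES = 200          # assumed ceiling the consuming service is known to handle
    MAX_FILE_BYTES = 1_000_000  # assumed upper bound on a sane file size

    def validate_feature_file(path: str) -> list:
        """Reject an internally generated file that is malformed or suspiciously large."""
        with open(path, "rb") as f:
            raw = f.read()
        if len(raw) > MAX_FILE_BYTES:
            raise ValueError(f"feature file is {len(raw)} bytes, over the {MAX_FILE_BYTES} cap")
        features = json.loads(raw)  # malformed output fails loudly here, not on the fleet
        if not isinstance(features, list):
            raise ValueError("feature file did not decode to a list")
        if len(features) > MAX_FEATURES:
            raise ValueError(f"expected at most {MAX_FEATURES} features, got {len(features)}")
        for entry in features:
            if not isinstance(entry, dict) or "name" not in entry or "value" not in entry:
                raise ValueError(f"feature entry missing required fields: {entry}")
        return features

The specific checks matter less than the posture: the file earns its way onto the network instead of being trusted because it came from inside.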

Hard limits designed as performance optimizations can become unexpected boundaries. Limits written years earlier for efficiency reasons may still be in effect long after the conditions that justified them have changed. Periodically auditing those limits, and pairing them with explicit error handling, turns latent failure modes into known, managed ones. 
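A hedged sketch of that pairing, with the limit and the fallback behavior invented for illustration:

    MEMORY_LIMIT_ENTRIES = 10_000   # assumed limit chosen long ago as a performance optimization
    LAST_KNOWN_GOOD = []            # cached copy of the last configuration that loaded cleanly

    def load_candidate_config(entries):
        """Apply a new configuration only if it fits within the documented limit."""
        global LAST_KNOWN_GOOD
        if len(entries) > MEMORY_LIMIT_ENTRIES:
            # Exceeding the limit is now a logged, managed event that keeps serving
            # the previous configuration instead of taking the process down.
            print(f"config has {len(entries)} entries, over the {MEMORY_LIMIT_ENTRIES} limit; "
                  "keeping last known good")
            return LAST_KNOWN_GOOD
        LAST_KNOWN_GOOD = entries
        return entries

The limit itself does not change; what changes is that crossing it produces an alert and a safe fallback rather than a crash.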

Global propagation systems benefit enormously from independent validation gates and instant rollback paths. The capability to disable a feature or roll back a change across the entire network without a code deployment is one of the most valuable resilience capabilities an operator can invest in. Cloudflare’s response shows exactly why.
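What that can look like in practice, as a minimal sketch: a kill switch read from a small control store on every request, so operators can disable a subsystem fleet-wide without shipping code. The flag name, file path, and fallback behavior here are assumptions for illustration, not any vendor's actual mechanism.

    import json

    CONTROL_FILE = "flags.json"   # illustrative; a real deployment would use a replicated control plane

    def feature_enabled(name, default):
        """Return the current state of a kill switch, failing toward the safe default."""
        try:
            with open(CONTROL_FILE) as f:
                return bool(json.load(f).get(name, default))
        except (OSError, ValueError):
            return default

    def handle(request):
        # The risky subsystem runs only while its switch is on; flipping the flag in
        # the control store changes behavior everywhere with no deployment.
        if feature_enabled("bot_scoring", default=False):
            request["bot_score"] = 1   # placeholder for the real scoring call
        return request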

__________________________________________________________________________________

Starlink, July 24: distributed physical infrastructure with a centralized software dependency 

The Starlink outage is a fascinating architectural paradox, and the engineering team’s public acknowledgment of the root cause has given the industry useful material to learn from.

The network has over 7,600 satellites in low Earth orbit, making it physically the most distributed internet infrastructure on (and off) the planet. And it went down globally because of a software failure in the ground-based systems that coordinate the constellation. 

Those ground-based systems handle traffic flows, satellite handoffs, load balancing, and routing. User terminals do not maintain persistent connections to individual satellites. They are constantly handed off between clusters as the constellation moves overhead, and all of that coordination runs through a central software layer. When a flawed software update propagated to those systems, the coordination layer failed. The satellites were functioning. The physical infrastructure was intact. But without the software telling the infrastructure how to operate, the distribution of the physical layer did not translate into distributed resilience. 

Network monitoring firm NetBlocks confirmed connectivity dropped to 16% of normal levels at peak. Some terminals couldn’t connect to satellites at all. Others connected to ground station infrastructure but couldn’t route traffic beyond it. The physical link existed, but the software layer couldn’t direct it. Recovery took approximately 2.5 hours, with Starlink’s VP of Engineering confirming the cause as a failure of key internal software services. 

The architectural takeaways are valuable for any team designing distributed systems: 

Physical distribution and software centralization are independent resilience properties. A network can be physically distributed across thousands of nodes and still concentrate risk in the software layer coordinating those nodes. Designing the resilience of the coordination layer with the same care as the physical layer is what turns distribution into actual redundancy. 

Blast radius is a function of deployment scope, not change size. A small update pushed simultaneously to every coordination system can affect every endpoint at once. Staggered rollouts, canary deployments, and deployment-scope controls are some of the highest-leverage investments an operator can make.
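A minimal sketch of what deployment-scope control can look like, with the wave sizes, soak time, and health check all stood in for illustration:

    import time

    WAVES = [
        ["canary-1"],                          # a single canary node first
        ["site-a-1", "site-a-2"],              # then one site
        ["site-b-1", "site-b-2", "site-c-1"],  # then wider, still not everything
    ]

    def deploy(node, version):
        print(f"deploying {version} to {node}")   # stand-in for the real deploy call

    def healthy(node):
        return True   # stand-in for a real error-rate or health check

    def staged_rollout(version):
        for wave in WAVES:
            for node in wave:
                deploy(node, version)
            time.sleep(1)   # soak time; real rollouts would wait far longer
            if not all(healthy(node) for node in wave):
                print(f"halting rollout of {version}; wave {wave} failed health checks")
                return False
        return True

The structure is the lesson: no single push reaches every coordination system at once, and each expansion of scope has to earn it.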

Independent management access pays back most when production is unavailable. When a production network is degraded, the tools teams need to diagnose and recover can be affected by the same failure. Building management paths that do not depend on the production network being healthy is one of the most durable resilience patterns in infrastructure engineering.
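As a rough sketch of that pattern, here is a recovery helper that tries the production management address first and then an independent out-of-band path. The hostnames and the use of plain ssh are illustrative assumptions; the principle is simply that the second path must not share fate with the first.

    import subprocess

    def run_on_device(command):
        """Run a command on a device, preferring in-band access but not depending on it."""
        paths = [
            "admin@router1.prod.example.com",   # in-band management address
            "admin@router1.oob.example.net",    # out-of-band path, e.g. a console server on cellular
        ]
        for target in paths:
            try:
                result = subprocess.run(
                    ["ssh", "-o", "ConnectTimeout=5", target, command],
                    capture_output=True, text=True, timeout=30, check=True,
                )
                return result.stdout
            except (subprocess.SubprocessError, OSError):
                continue   # this path is unreachable or failed; try the next one
        raise RuntimeError("device unreachable on both in-band and out-of-band paths")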

__________________________________________________________________________________

The pattern beneath both incidents 

The surface causes are different. The underlying patterns are the same, and they are patterns every operator can learn from. 

Routine operational actions can surface non-routine conditions. Permissions changes and software updates happen constantly. The failures they occasionally trigger are usually latent conditions in the architecture that have been waiting for a specific input. Building systems, and runbooks, that account for this reality is where operational maturity shows up. 

Blast radius is shaped more by propagation architecture than by the fault itself. Thinking carefully about how quickly and widely a change can spread, and where the gates are, is one of the most productive resilience conversations a team can have. 

Recovery is a control problem as much as a restoration problem. In both of these incidents, the path back to normal required teams to execute control actions under pressure. The speed and confidence with which those actions can be taken depends on having reliable access to the systems that need to be touched, regardless of what is happening on the production network. 

__________________________________________________________________________________

The operational dimension most plans underweight 

The Uptime Institute’s data puts a number on the operational dimension: 85% of human error-related outages stem from staff failing to follow procedures, or from flaws in the procedures themselves. That’s not primarily a technical problem; it’s a preparation problem. Preparation means writing procedures for how systems actually fail rather than how they are supposed to work, testing under realistic conditions rather than idealized ones, and building recovery paths that don’t assume the production network is available.

Both the Cloudflare and Starlink incidents were recovered well, and both teams have been open about what they learned. That openness is what lets the rest of the industry design for the conditions these events exposed. The organizations that benefit most are the ones already thinking carefully about what they would do if normal conditions did not apply.

__________________________________________________________________________________

Interested in how independent management access can strengthen your incident response? Schedule a demo with Opengear.