Something went wrong. That’s the message that billions of users saw last week when the world’s largest social media giant went dark for hours.
What might have seemed like a minor inconvenience for anyone who likes to scroll through social media for a few minutes during the day actually had much larger repercussions. Enterprises use these platforms to stay connected, and advertising is one of the biggest draws. The outage affected the more than 10 million brands and businesses that use the platform to promote their products. Organizations using the tech titan’s advertising services reported sales drops of between 30% and 70% compared to the same period a week earlier.1
Network outages aren’t uncommon, but what caused this disruption and how could it have been prevented?
Understanding The Outage
The outage was triggered by the system that manages the company’s global backbone routers, which coordinate network traffic between its data centers. The backbone connects all of its computing facilities and, as you can imagine, consists of thousands of miles of fiber optic cable all over the world. During a routine maintenance job, a faulty configuration change was made: a command was issued that caused a complete disconnection between the company’s servers, its data centers and the internet… that sounds bad, but it gets worse.
This faulty configuration change also blocked the ability of devices and employees to communicate, creating a cascade of network failures. The BGP routes for the company’s DNS nameservers were withdrawn, making it seem like its domains didn’t exist even though those servers were still operational. As a result, the rest of the internet could neither resolve the company’s domain names nor route traffic to them. This still sounds bad… and it still gets worse.
The tech titan’s data centers couldn’t be accessed because their networks were down. The loss of all DNS broke internal systems and many of the tools engineers would use to try to remediate the outage. With both the primary and Out-of-Band networks down, engineers were sent onsite to debug the issue, but like many employees of the social media giant, they were locked out of the buildings by the disruption. Once they were in, the many security layers protecting the hardware made it difficult to modify, even with physical access.
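The route withdrawal described in this section can be illustrated with a toy model. The sketch below is only an illustration, not a model of the company’s actual network: the function names `has_route` and `resolve`, the documentation-range addresses and the `example.com` zone are all assumptions made for the example. It shows how a nameserver that is still running becomes unreachable, so its domains appear not to exist, once the prefix covering it is no longer announced.

```python
import ipaddress

def has_route(routes, address):
    """Return True if any announced prefix covers the given address."""
    addr = ipaddress.ip_address(address)
    return any(addr in ipaddress.ip_network(prefix) for prefix in routes)

def resolve(name, nameserver_ip, routes, zone):
    """Toy resolver (hypothetical): answers only if the nameserver is routable.

    `zone` stands in for the data held by the nameserver, which stays
    intact whether or not the nameserver can be reached.
    """
    if not has_route(routes, nameserver_ip):
        return None  # query never arrives, so the name looks nonexistent
    return zone.get(name)

# Announced prefixes and zone data use reserved documentation ranges.
routes = ["192.0.2.0/24"]
zone = {"example.com": "198.51.100.10"}

before = resolve("example.com", "192.0.2.53", routes, zone)

# The faulty change withdraws the route; the zone data is untouched.
routes.clear()
after = resolve("example.com", "192.0.2.53", routes, zone)

print(before, after)
```

The nameserver’s data never changes between the two lookups; only the route to it disappears, which is why the servers could be healthy and invisible at the same time.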
After 6 hours and about $100 million in lost revenue, the social media giant was back online, but this outage could’ve been resolved a lot quicker with Opengear.2
Reducing Downtime With Opengear
When a disruption occurs, engineers need remote visibility of their entire network. Not being able to log on, or even badge into their buildings, was a major challenge. This could’ve been overcome by using an Opengear device with Smart Out-of-Band and Failover to Cellular.
Failover to Cellular provides continued internet connectivity for remote LANs and equipment by automatically activating a secondary connection over high-speed 4G LTE when the primary link is unavailable. This re-establishes inbound and outbound network access without manual intervention. Once failover is enabled, Opengear devices detect failures by sending ICMP ping requests from the primary network interface to a primary and a secondary remote address. If these requests fail, the primary connection is deemed to have failed. When the primary connection has been restored, the devices automatically fail forward and resume normal operations. In this case, that would have meant restoring access to devices and BGP routes.
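The detection logic described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Opengear’s implementation: the `FailoverMonitor` class, its `check` method and the injected `probe` callable are hypothetical names, and the probe stands in for an ICMP ping sent from the primary interface. The sketch assumes the primary link is declared down only when both probe addresses fail to answer.

```python
from typing import Callable

class FailoverMonitor:
    """Hypothetical sketch of probe-based failover between two uplinks."""

    def __init__(self, probe: Callable[[str], bool],
                 primary_addr: str, secondary_addr: str) -> None:
        # `probe` is assumed to return True when an ICMP echo sent over
        # the primary interface to the given address is answered.
        self.probe = probe
        self.primary_addr = primary_addr
        self.secondary_addr = secondary_addr
        self.active = "primary"

    def check(self) -> str:
        """Run one probe cycle and return the uplink that should be active."""
        link_up = self.probe(self.primary_addr) or self.probe(self.secondary_addr)
        # Fail over to cellular when both probes fail; return to the
        # primary link as soon as either probe succeeds again.
        self.active = "primary" if link_up else "cellular"
        return self.active

# Simulated reachability, using reserved documentation addresses.
reachable = {"192.0.2.1", "198.51.100.1"}
monitor = FailoverMonitor(lambda addr: addr in reachable,
                          "192.0.2.1", "198.51.100.1")

print(monitor.check())   # both probe addresses answer
reachable.clear()
print(monitor.check())   # neither answers, so cellular takes over
```

Probing two addresses rather than one reduces the chance that a single unreachable probe target is mistaken for a failed link.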
The Opengear Network Resilience Platform could’ve been leveraged to back up device configuration files before any network changes were made. This would’ve enabled the social media giant to restore the known-good configuration files immediately upon discovering that the change had caused the outage. Pushing the saved configuration files from the Opengear device back to the affected equipment would have restored the network quickly.
Engineers would have had another set of tools, on a separate network, to remediate the issue. Having this immediate access would’ve significantly shortened the duration of the outage. The Network Resilience Platform is based on the presence and proximity of a NetOps or Smart Out-of-Band console server at every location and is centrally orchestrated through Lighthouse software. Because it provides an independent management plane, organizations have secure, remote access to all their devices, even during an outage, and engineers can remotely identify and remediate issues.
It can be good to make the headlines, but not for something like this. A resilient network means your customers are always connected. Learn how we can help keep your network up and running, because Opengear means business.
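The back-up-then-restore workflow above follows a general pattern that can be sketched as follows. This is an illustrative sketch only and does not use any Opengear or Lighthouse API: `FakeDevice`, `change_with_rollback`, the `is_healthy` check and the configuration strings are all hypothetical stand-ins.

```python
from typing import Callable, Dict

class FakeDevice:
    """Hypothetical stand-in for a router whose config is a plain string."""

    def __init__(self, name: str, config: str) -> None:
        self.name = name
        self.config = config

def change_with_rollback(device: FakeDevice, new_config: str,
                         is_healthy: Callable[[FakeDevice], bool],
                         backups: Dict[str, str]) -> bool:
    """Apply a change, restoring the saved config if the health check fails."""
    # Save the known-good configuration before touching the device.
    backups[device.name] = device.config
    device.config = new_config
    if is_healthy(device):
        return True
    # The change broke something: push the saved configuration back.
    device.config = backups[device.name]
    return False

backups: Dict[str, str] = {}
router = FakeDevice("backbone-1", "announce dns routes")

# Assumed health check: the device must still announce its routes.
ok = change_with_rollback(router, "withdraw all routes",
                          lambda d: "announce" in d.config, backups)
print(ok, router.config)
```

The key design point is that the backup is stored away from the device being changed, so it remains available even when the change cuts off the production network.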
1 https://www.cnbc.com/2021/10/06/facebook-outage-lost-ad-revenue-advertisers-could-seek-refunds.html
2 https://www.indy100.com/science-tech/facebook-losing-money-instagram-whatsapp-outage-b1932252