The third installment in the Road to Resilience webinar series covered why network resilience is a necessity when maintaining uptime in the Data Center. The discussion was hosted by Roy Chua, Principal at AvidThink, and featured Mark Harris, SVP of Marketing at The Uptime Institute and Ryan Hogg, Senior Product Manager at Opengear.
Multi-layer Approach to Uptime
The first topic discussed was how network resilience depends upon more than just the blinking lights on boxes in the data center. Harris explained that you need to take a multi-layer approach to evaluating risk to uptime. Everything under the business services umbrella is part of the risk equation. The Uptime Institute also looks at everything below the box and everything above the box. Opengear appliances live in the middle layer and are focused on performance in this IT equipment layer. Hogg points out that in this middle layer, “Opengear provides a point of presence in the data center so if you don’t have a physical person hands-on, you do have that ability to log in remotely and be able to access the equipment.”
Harris said, “You have to take all three of those layers and essentially have to multiply that risk. You have to consider the availability at the top layer where you’re doing a logical movement of workloads, and then multiply that by the ability for the middle layer, the boxes, to do work, and then multiply that by the ability for the platform itself to do work. The multiplication of those three layers is the actual risk factor for business service delivery.” Harris helps his customers state in business terms what results they desire. You need to be able to specify something like, “I desire the data center to be able to operate under these conditions with this kind of performance.”
It’s the ability to calculate the risks in all three of those layers that tells you whether you can resiliently deliver business services.
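To make the multiplication Harris describes concrete, here is a minimal sketch in Python. The layer names and the 99.9% figures are illustrative assumptions, not numbers from the webinar, and treating each layer as a single independent availability value is a simplification.

```python
from math import prod


def composite_availability(layer_availabilities):
    """Multiply per-layer availabilities (each a fraction between 0 and 1)."""
    values = list(layer_availabilities)
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"availability must be between 0 and 1, got {value}")
    return prod(values)


# Hypothetical figures: each layer is assumed to be available 99.9% of the time.
layers = {
    "workload": 0.999,      # top layer: logical movement of workloads
    "it_equipment": 0.999,  # middle layer: the boxes
    "facility": 0.999,      # bottom layer: the platform itself
}

overall = composite_availability(layers.values())
downtime_hours_per_year = (1 - overall) * 365 * 24

print(f"Overall availability: {overall:.4%}")
print(f"Expected downtime: {downtime_hours_per_year:.1f} hours per year")
```

Under these assumptions, three layers that each look strong on their own combine into a noticeably weaker figure for the business service, which is the point of evaluating all three together.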
Evaluating Risk and Creating Priorities
At a high level, Harris helps his customers clarify their needs. They should be able to state, “I need this set of applications all the time and the value to me is much higher than this other set of applications.” He works with customers on which tier level they want to implement, and they literally pull the plug on data centers as part of the evaluation involved in the certification process.
Harris offered a quick litmus test for evaluating middle-layer risk more specifically. He said, “Ask the technologists in your organization which business applications are being serviced by the boxes that they maintain. If they don’t know that answer, they probably haven’t done a good enough job connecting those dots. The person dealing with the boxes should absolutely know what business service is affected by the uptime of that box.”
Over time, the information about what’s going on in a network changes. If it’s not continually documented, you probably have a higher risk profile than your business expects. If you’re afraid to touch what’s in your data center because you don’t know what’s in there, that’s a red flag.
Network Failure Trends
The conversation shifted to recent trends in network failures. I was surprised to learn that the number of publicly reported outages is increasing, the duration is increasing, and the cost to customers is increasing across the board. Hogg said, “With less access to the data center due to travel restrictions and health restrictions there’s less appetite to do resiliency testing.” Harris pointed out that today, companies often have redundancy built in, but that’s not enough.
In the past when a data center failed, things just stopped. These days other systems take over, but it raises your risk profile because the second failure has a much higher likelihood of occurring than the first failure. Harris added, “Also, your capacity goes down. Although you may have designed something to handle a million transactions a minute, when failures happen and things reconfigure and workloads get migrated and boxes get rebooted you’re now at 900,000 transactions per minute. How well is your business going to operate when you need a million transactions but are running at 900,000?”
For years there’s been a push to the cloud, but a recent survey by The Uptime Institute showed that cloud adoption is growing very slowly. Harris said that for many years to come, more than half of all work being done in business is going to continue to happen in non-cloud environments. He said, “You can’t just write a check to get out of the issue of having to run good management and good IT.”
Movement to the edge is increasing more rapidly. Harris reported that, “More than half of our survey respondents said that the edges are going to be part of their game plan moving forward.”
Hogg discussed the challenges of the edge and how Opengear addresses them. There’s a lot of environmental risk to IT equipment, such as temperature changes, corrosives from diesel generators, or dust. All of that impacts the lifetime and the reliability of equipment at the edge. He added, “Being able to not only reach the equipment when it’s operating on site but deploying equipment at multiple sites requires a lot of travel. Opengear lets you do secure zero-day provisioning without requiring skilled hands on-site.” A contractor with a pallet of equipment can set it up, plug it in, and have it brought up remotely.
Watch the Full Webinar with The Uptime Institute Now
The second installment in the Road to Resilience webinar series covered why network resilience is a necessity when managing distributed infrastructure. The discussion was hosted by Roy Chua, Principal at AvidThink, and featured Rick Sloot, the COO of i3D.net and Alan Stewart-Brown, VP EMEA at Opengear.
How gamers benefit from an unseen high-performance global network
Rick Sloot is the Chief Operating Officer of i3D.net, a global low latency service provider. He explained that i3D.net has over 40 locations worldwide and, among other things, supports several AAA game publishers. I found out that AAA games have the biggest development and advertising budgets and are generally considered to be high quality, bestselling games. These publishers, like EA and Ubisoft, who produce massively popular games like FIFA, can’t afford to take chances with the networking behind the scenes. i3D.net also supports Discord, a communications platform popular with gamers that has hundreds of millions of registered users.
Rick Sloot said that i3D.net started in 2005 with one center in the Netherlands. Now they’re up to 40 locations and will be opening 10 more in the next six months. I wasn’t surprised to hear that the gaming industry is doing very well during the pandemic. Rick mentioned that in the six months since Covid started, games that have been around for five years are now breaking all-time records in terms of numbers of players.
In the gaming industry, it’s even more critical to avoid and mitigate outages. Rick said, “If a website is offline for 10 seconds you just wait and you get connected, but if a game is offline or a game can’t connect for even two seconds it means a million players drop off. That’s why we consider ourselves performance hosting.”
Research on evolving attitudes of network resilience
Next, Alan Stewart-Brown presented some results from a 2019 survey of 500 global IT leaders. They were asked how the number of outages has changed over the last 5 years. I was surprised to learn that instead of decreasing, on average outages have increased.
Another interesting question was what the biggest challenges in resolving network issues are. Surprisingly, even in the pre-Covid world, travel time was the biggest. I’m sure that’s even more of a challenge now. The second biggest issue was a lack of in-house engineering resources.
The end result of the survey was that organizations are increasing network resilience in a number of ways, the top three being bringing in new security systems, increasing automation, and enhancing network automation. These are all things that i3D.net depends on Opengear for.
The benefits of an independent Out-of-Band management network
Alan talked about why many Opengear customers needed an independent Out-of-Band management network and chose Opengear for it. He said, “i3D.net and many of our customers chose Smart OOB because they recognized that they just couldn’t get to places quickly if there was a problem.” It’s even harder now with Covid and a worldwide shortage of skilled engineering resources, particularly data center people.
But Opengear’s benefits don’t stop there. Alan related a recent customer story.
“A large global service provider that operates in 50 countries and uses Opengear and Lighthouse extensively, initially invested in Opengear to reduce the necessity of on-site visits. During an SD-WAN software update to one appliance in one location there was a human error that got sent to several hundred of the SD-WAN appliances and took the whole network out. They had the foresight to be using Opengear’s Network Resilience Platform. They were able to quickly remediate. It took less than an hour what would have taken days and days to fix.”
Not only that, they were able to roll back to a previous version of the SD-WAN firmware and do the whole thing through Lighthouse.
Alan brought up Opengear’s newest next generation Console Server. It’s a NetOps enabled device that includes a TPMC (Trusted Platform Module Chip) to allow secure day one provisioning, not just for emergency use. He concluded, “Customers can put their various configs and their boot images on the device, safe in the knowledge that it cannot be tampered with because the data is encrypted at rest and in transit and can only work via Lighthouse so it adds another layer of protection and security.”
Roy Chua introduced the topic for the next in the series, “Network Resilience: The Key to Uptime,” featuring Mark Harris of the Uptime Institute.
Watch the Full Webinar with i3D.net Now
The Road to Resilience webinar series kicked off with “Out-of-Band: More Than A Safety Net?” The discussion was hosted by Roy Chua, Principal at AvidThink, and featured Gary Marks, President of Opengear, and Todd Rychecky, VP Sales Americas.
The panel covered the evolution of Out-of-Band management (OOB), considered OOB’s broader everyday use as the basis of a more resilient network, and finally discussed the value of implementing a coherent Network Resilience solution.
Here are some key ideas discussed in this webinar:
How Out-of-Band has Evolved
After Roy had outlined some of the more public risks of a non-resilient network, Todd recalled the history of Out-of-Band management. He explained how it started out as simple terminal servers on point of service devices and grew to more complex console servers for admin access to serial ports. As networks grew, the drive for OOB shifted to network engineers who wanted out-of-band access to router switches and data centers. Cellular OOB adoption in the data center was another major shift. And now the movement of compute out to the network edge is driving out-of-band growth. It shouldn’t matter if it’s a mile away, a hundred miles away, or around the world. Networks need the same level of uptime, reliability, and resilience to respond to any node, including IoT devices and sensors, autonomous vehicles, or perhaps someday soon, remote surgery. Network resilience is vital.
Liberate the Management Plane
Roy then asked Gary Marks to explain the importance of the management plane. Gary described how, in the early days of telecom, there was a separation of the data plane from the control plane and the management plane. But as networking expanded, there was a shift to purpose-built appliances with vendor-controlled hardware and software which brought these planes together. Over the last few years, with the introduction of SDN there’s been an effort to separate the data plane and the control plane and to have the ability to put the control plane in the cloud. But there’s still the issue of the management plane being captive. You can’t manage the network when the network goes down.
The idea of separating, or liberating, the management plane is to have an accessible network that allows you to remediate issues when there’s an interruption in the production network.
Gary Marks summed it up. “If you’re using the network to manage the network, you’re not truly resilient. You need to separate the network management plane from the rest of your network. And to reach this level of network resilience, you need a complete platform that includes hardware and software to gain true liberation of the management plane from the production and data planes.”
Site Reliability and Network Resilience
Todd pointed out that, with more edge network points and remote data centers, truck rolls have become really expensive. These days, you might not even be able to fly somewhere to reach your remote locations. You really need to build resilience into the foundation at the very beginning, as opposed to trying to retrofit after problems occur.
Google has popularized the job role of Site Reliability Engineer (SRE), and Roy suggested that there is now potential for an equivalent role of Network Resilience Engineer as networks achieve a larger scale and more locations. Todd agreed, explaining that the SRE sits between the DevOps team and the network team, and is often the primary user of the out-of-band network. They are particularly interested in the introduction of more advanced features, such as the ability to run Docker containers and Python scripts.
Gary noted that Opengear has been following the evolution of DevOps into NetOps, and is reflecting that with the new NetOps Console Server and added capabilities to the Lighthouse software. The added benefit of network reliability engineering plays to this NetOps evolution.
The Road to Resilience
This was the first in the series of webinars which will feature a variety of panelists discussing the latest approaches to Network Resilience.
If you weren’t able to attend but would like to view the recording, be sure to check out the video here.
Console servers with internal cellular modems give you a wireless broadband IP connection for high-speed access to your remote sites. And Opengear Failover to Cellular™ provides continued Internet connectivity for remote LANs and equipment over high-speed 4G LTE or 3G networks when the primary Internet link is unavailable. While cellular wireless gives you great flexibility and availability, you need to make sure that you address potential security weaknesses from the start.
Here’s a brief look at the vulnerabilities and practical steps you can take to mitigate your risks. For more detailed suggestions, download our Whitepaper, Keeping Your Cellular Modem Secure.
When your cellular IP address is publicly available, typing that IP address in a browser gives you password protected access to a remote cellular modem. While this remote access is convenient, it is also available to an attacker.
The best option is to avoid using a publicly available cellular modem IP. If you can, restrict inbound connections to a VPN client to secure remote access. Access to the IP is then only available through an authenticated and encrypted network tunnel. And if the VPN is configured with strong keys and ciphers, it will be virtually uncrackable. The trade-off is that you lose the convenience and minimal configuration of direct SSH, and that a VPN client has to be configured at each remote access location. If that’s not feasible, there are other steps you can take to decrease exposure.
You can keep your public cellular connection enabled only when you need it. Your Opengear device can automatically turn the cellular connection on when failover occurs and off when the issue is resolved. The cellular connection is completely inactive during normal operation. This keeps the interface off the Internet as much as possible to both avoid high data charges and lessen exposure of this potential target to malicious actors.
When your IP is public, the username and password are the only things protecting it from malicious actors. Make sure you configure an admin group account and require strong passwords. Also enable brute force protection (fail2ban) for the cellular IP, which limits the number of authentication attempts a user can make before a temporary ban is put in place.
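To show the idea behind limiting authentication attempts, here is a minimal sketch in Python. It is not how fail2ban or the Opengear feature is implemented; the class name LoginThrottle, its parameters, and the default values are all hypothetical, chosen only to illustrate a temporary ban after repeated failures inside a time window.

```python
from collections import defaultdict, deque


class LoginThrottle:
    """Illustrative temporary-ban logic; names and defaults are assumptions."""

    def __init__(self, max_failures=5, window_seconds=600, ban_seconds=900):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.ban_seconds = ban_seconds
        self._failures = defaultdict(deque)  # source address -> failure timestamps
        self._banned_until = {}              # source address -> ban expiry time

    def is_banned(self, source, now):
        """Return True while the source's temporary ban has not expired."""
        return self._banned_until.get(source, float("-inf")) > now

    def record_failure(self, source, now):
        """Record a failed login; return True if this failure triggers a ban."""
        recent = self._failures[source]
        recent.append(now)
        # Forget failures that have aged out of the counting window.
        while recent and recent[0] <= now - self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_failures:
            self._banned_until[source] = now + self.ban_seconds
            recent.clear()
            return True
        return False


# Example with explicit timestamps (in seconds) so the behavior is repeatable.
throttle = LoginThrottle(max_failures=3, window_seconds=60, ban_seconds=300)
for t in (0, 10, 20):
    banned_now = throttle.record_failure("198.51.100.7", now=t)
print(banned_now, throttle.is_banned("198.51.100.7", now=100))
```

Passing the current time in explicitly keeps the sketch easy to test; a real deployment would rely on the device's built-in protection rather than custom code.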
Lock down the firewall. Start with more restrictions than you think you’ll need. You should disable ping responses to thwart ping sweeps of public address space by attackers seeking targets. Disable unencrypted or insecure services and limit listening ports.
Consider running services on alternate ports to provide some degree of security by obscurity against attacks specifically targeting ports 22 and 443. And whenever possible, restrict access to trusted source networks only. If you have a known range of addresses, block all connections originating from other networks.
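The "trusted source networks only" rule boils down to a simple membership check. The sketch below uses Python's standard ipaddress module to show that decision. The address ranges are documentation-only example networks and the function name is hypothetical. On a real device you would express this as firewall rules, not as a script.

```python
import ipaddress

# Hypothetical trusted ranges (reserved documentation networks, not real sites).
TRUSTED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.16/28"),
]


def is_trusted_source(address, networks=TRUSTED_NETWORKS):
    """Return True if the address falls inside any trusted network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)


for candidate in ("203.0.113.45", "198.51.100.32", "192.0.2.1"):
    verdict = "allow" if is_trusted_source(candidate) else "block"
    print(f"{candidate}: {verdict}")
```

Anything that does not match a known range is blocked by default, which mirrors the advice to start with more restrictions than you think you will need.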
Finally, keep an eye on your data usage. Runaway data usage may indicate, among other things, a malicious worm script, so you need to monitor data usage in as many ways as possible.
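One simple way to spot runaway usage is to compare each day's total against a recent baseline. The Python sketch below is only an illustration of that idea. The function name, the seven-day baseline, the 3x multiplier, the 50 MB floor, and the sample figures are all assumptions, and how you collect the daily totals depends on your carrier and device.

```python
from statistics import median


def flag_runaway_usage(daily_mb, baseline_days=7, factor=3.0, floor_mb=50.0):
    """Return indexes of days whose usage far exceeds the recent median.

    A day is flagged when its usage is above both `factor` times the median
    of the preceding `baseline_days` days and the absolute `floor_mb`. The
    floor keeps quiet links (baseline near zero) from flagging trivial use.
    """
    flagged = []
    for i in range(baseline_days, len(daily_mb)):
        baseline = median(daily_mb[i - baseline_days:i])
        threshold = max(factor * baseline, floor_mb)
        if daily_mb[i] > threshold:
            flagged.append(i)
    return flagged


# Made-up daily totals in megabytes; day 8 is the suspicious spike.
usage = [12, 15, 11, 14, 13, 12, 16, 14, 240, 13]
print(flag_runaway_usage(usage))
```

A flagged day is a prompt to investigate, not proof of an attack, since a legitimate failover event can also push usage well above normal.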
Feel secure in using your cellular modem by taking a few key steps, including exposing your cellular IP only when you need to, creating an admin group and having users with strong passwords, locking down the firewall, and using a VPN to restrict access. Check out our Keeping Your Cellular Modem Secure Whitepaper for specific steps you can take to ensure a secure and reliable network.