The State of Reliability Engineering

Posted on April 20, 2021 by Stephanie Norcini

A Q&A with the Director of Site Operations at PayPal

TJ Gibson, Director of Site Operations at PayPal

As Director of Site Operations at PayPal, TJ Gibson runs the company’s network command center based in Scottsdale, Arizona. The 600-person organization is responsible for site reliability and cloud engineering. They handle functions including incident response, network operations, monitoring, and alerting. I recently interviewed TJ for an episode of the Living on the Edge podcast where we talked about site reliability engineering (SRE), among other things. Here are some of the highlights.

What does Site Reliability mean at PayPal and how has that changed over the last few years?

At PayPal, SRE really started before it was even a term. It was seen as a bug fixing team, there to clean up messes. In the last decade, the industry has come a long way in terms of technology to bolster resiliency and defining SRE as a best practice.

Our mission is to ensure that all of the products that we deliver and all of the capabilities exposed to customers have reliability, resiliency, fault tolerance, and usability baked into them from the beginning.

What does the phrase network resilience mean to you?

Network resilience means that a company can survive faults while still meeting business needs and customer expectations efficiently and effectively. It is what allows us to respond, react, absorb, and grow. From an industry perspective, I see an opportunity to make SRE more prominent, perhaps similar to how information security practitioners have been able to up-level and make a tighter connection to a company’s policy or regulatory obligations. I think SRE professionals will increasingly be able to step up and link what they do in the network stack more directly to business objectives.

Is security interwoven with SRE or do you still see that as a separate entity?

It is interwoven with everything we do and something that we hold ourselves accountable to our customers for. I think we’re in a little bit of a transition period here where SRE has kind of come into its own. It’s become more of a mature discipline within the industry. And I think that will stay true as we go forward. But I think some of those things that SRE is bringing today will become part of core workflows for products, network architecture, and data centers, as opposed to always being a centralized IT function.

Do you see anything over the next few years that will change the way that SRE is implemented or do you think it will evolve more slowly?

More and more large enterprises are looking to the public cloud and moving workloads there. I think I saw recently that Capital One essentially was declaring victory in their cloud journey. That brings an entirely new perspective on large-scale applications and SRE. Going forward there are things that I think we haven’t quite yet accounted for.

Machine learning and artificial intelligence will bring aspects to our technology stacks that we just don’t fully understand from a resiliency and operations perspective. I think the frameworks, structure, and accountability with SRE will start to be baked into our cloud applications.

I know you’ve championed the importance of understanding the different phases in a technical career. How would you describe them?

I think early on in your career, the value that the business perceives from you as an individual really boils down to how much you know. And the deeper you know a particular technology or domain area, the higher your value is to the organization.

I think you hit a point usually after five to ten years when your value becomes more about who you know. It’s more about your ability to bring people together; to find the right answers and the right resources within the organization. That’s where a lot of people tend to struggle. It’s about being able to bring your experiences and skillset to bear on specific problems and understand the context quickly, get to relevance quickly, and be able to help find technology solutions.

Some IT veterans tend to fall back on the things that have worked for them in the past without learning the latest and greatest. They need to dive deeper into whatever technology stack is in front of them. But mostly, more senior roles are about being able to layer on their relationship skill set, their understanding of the business and how to translate requirements into business outcomes and technology solutions.

What studies or forums would you recommend for people interested in SRE?

There are several certifications. The Cisco Certified Internetwork Expert (CCIE) is the most well-known one. And recently Cisco added a DevOps section. I’m not aware of anything similar that focuses on SRE or reliability or even site operations.

USENIX, the Advanced Computing Systems Association, has a yearly conference they call SREcon. They have some very deep tracks on how to use machine learning to bring better insights from your observability platform. Other tracks cover how to build networks for resilience on a global scale and how to provide SRE with hybrid cloud. There are also disciplines around problem management, root cause analysis, and how to distinguish business logic failures from systemic technology failures.

I don’t think it’s enough to have only an application development background or only a networking background and be able to step cleanly into an SRE career field and be successful from day one. You really have to understand all of the constituent parts and how they play together to contribute to reliability, operability, and performance.

The pendulum over the last couple of years seems to have swung back from everybody wanting to be a specialist in something to an understanding of the value of being a generalist, where you can combine parts of different skill sets or backgrounds to get something done. And I think the world of SRE is probably a perfect situation where that makes sense.

It’s not enough to understand how to build the most reliable, most resilient, most scalable network if the application on top of it doesn’t know how to consume and use those benefits.

I’ve got to believe, in your role, you have a hundred “oh crap” stories. Do you have one that you’d like to share with us?

I had some advice early on in my career when I was doing some consulting and I was very nervous about standing up in front of a boardroom full of executives and trying to tell them where their vulnerabilities were. I just had no idea where the questions were going to come from and what the agendas were of the people in the room. One of my colleagues sat me down and he said, “Look, consulting is 10% technical and 90% people.” It’s true. My uh-oh moments really come down mostly to people things mixed with technology.

In the Air Force, my boss pushed a configuration file to every machine on our network of about 180 nodes spread across 40 countries. That file set the same IP address on every device on the network. It was the mid-1990s, so a lot of our automation that we take for granted today did not exist.

I spent three days talking to pilots and people loading airplanes all over the world about how to change the IP address to something that matched. That to me was one of the hairiest experiences, not just because it was so complex and so people-focused but because it highlighted how fragile some of these things are that we take for granted. One simple human mistake essentially shut the network down for three days. The irony was that my boss got a medal for fixing that problem.

When you recover from a situation like that, you can look for opportunities to build automation or gates or controls that would prevent that mistake from ever happening again. So it’s a win. The fact that we found it when we weren’t getting shot at was also fortunate.
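As a rough illustration of the kind of gate described above, here is a minimal Python sketch that refuses to push a configuration if it would assign the same IP address to more than one device. The function names (`find_duplicate_ips`, `check_before_push`) and the host-to-address mapping are hypothetical, chosen for this example; they are not taken from the interview or from any particular tool.

```python
from collections import defaultdict
from ipaddress import ip_address


def find_duplicate_ips(assignments):
    """Return {ip: [hosts]} for every address assigned to more than one host.

    `assignments` is a hypothetical mapping of host name -> IP address string.
    """
    hosts_by_ip = defaultdict(list)
    for host, ip in assignments.items():
        # ip_address() normalizes the string and raises ValueError if it is malformed.
        hosts_by_ip[ip_address(ip)].append(host)
    return {
        str(ip): sorted(hosts)
        for ip, hosts in hosts_by_ip.items()
        if len(hosts) > 1
    }


def check_before_push(assignments):
    """Raise ValueError instead of pushing when any address is shared."""
    duplicates = find_duplicate_ips(assignments)
    if duplicates:
        raise ValueError(f"refusing to push: duplicate addresses {duplicates}")
    return True
```

A deployment script could call `check_before_push` on the planned assignments and only proceed when it returns without raising. A real gate would likely check more than duplicates, such as subnet membership or reachability, but even a check this small would have caught the mistake in the story.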

Bonus question: Would you ever hire a former hacker?

I truly believe that a person’s background really has nothing to do with who they are today. We all change. If we were held to account for all the things we did when we were 17, or if we had had social media when we were in high school, for example, we would probably be second-guessing a lot of decisions we made. It’s good to be open to evaluating every person’s talent and capabilities regardless of their past.

Watch the full webinar now.
