Today’s digital users have incredibly high expectations: studies show that apps have just one shot at winning users. And nothing hurts user experience like poor performance. If an app is too slow or unreliable, users quickly ditch it.
As a result, companies are starting to place more emphasis on ensuring high reliability within their applications and services. To support this goal, the practice of Site Reliability Engineering (SRE) has grown more prominent in recent years. Having originated at Google, the SRE doctrine introduces automation to maintain high performance and stability of digital services. But although the SRE role is gaining popularity, filling these positions amid talent shortages, even in a time of layoffs, isn’t always easy.
The DevOps Institute recently found that 85% of respondents cited a lack of staff with the necessary skills as their biggest challenge when implementing SRE. Not only that, but the SRE role requires a specific type of person with the right skill set.
I met with Narayanan Raghavan, senior director of site reliability engineering for OpenShift at Red Hat, to discover the benefits of adopting the SRE practice and some tips on filling SRE positions. According to Raghavan, the SRE role requires a skilled engineer with a systems mindset and empathy for end users.
Below, we’ll consider how CXOs, CSOs, CISOs, and other IT leaders can facilitate SRE practices and what traits they should look for in prospective engineers.
Benefits of Adopting an SRE Culture
SRE is loosely defined. It typically falls under DevOps but has been embraced in varying forms. Some everyday SRE activities include setting service level agreements, responding to incidents, and performing postmortems. Other hallmarks of the SRE role are reducing toil by “automating your job away” and promoting cross-department collaboration.
Site Reliability Engineering can bring many benefits to your software support practices, first and foremost by increasing the observability of end-to-end outcomes. Whereas DevOps looks at individual “layers of the cake,” such as networking, storage, or the application layer, SREs attempt to see the entire slice, all the way to understanding end customer outcomes, says Raghavan.
When things aren’t so segmented by technical layers, you begin to have more “skin in the game,” says Raghavan. And as the end customer experience grows in importance, so does building with customer empathy in mind. This can result in more stable applications and fewer bugs. Also, since SREs can see the end-to-end picture, they can identify features in one use case that might apply to other development workflows. This can lead to sharing reusable internal tools that automate things like development workflows, configurations, or delivery models.
Skills and Experience That Matter Most for SRE
So, when team leaders are trying to fill SRE positions, what experiences should they look for in prospective candidates? According to Raghavan, finding technically capable people is the easy part. The more challenging part is finding people who have soft skills and the mental flexibility to place themselves in another’s shoes. He describes this as the ability to “flex and parachute into the unknown.”
Another element is systems thinking. “I hire software engineers with a systems mindset, or I hire systems engineers who can write code,” said Raghavan. The goal is to find people who can improve reliability at scale to make the overall platform more boring. By boring, we mean fewer wake-up calls at 2 AM to mitigate nasty surprises and outages.
Organizations are producing exponentially growing volumes of data, including a sea of observability data. Therefore, the ability to see trends in data is a necessary element of the SRE position. Raghavan describes this skill as “being able to sift through digital exhaust to get to the nuggets that matter.” Problem-solvers with a knack for data analysis will prove helpful.
Lastly, since the SRE role interfaces with many types of teams, the position requires self-starters who can communicate comfortably with stakeholders, be they internal developers, partner businesses, or end-consumers. The role, therefore, requires an intuitive collaborator.
Tips for Recruiting SREs
So far, we’ve established that an ideal SRE has empathy, technical DevOps skills, a systems mindset, data-savviness, and natural communication skills. If this theoretical candidate sounds too good to be true, that’s because they probably are. According to Raghavan, leaders shouldn’t be looking for candidates who hit all the bullets in a job description. Instead, he encourages keeping an open mind to folks who don’t meet all these requirements but have the potential to grow.
“This kind of role is not for everybody,” said Raghavan. The SRE position requires you to wear multiple hats, which can be an emotional burden. It also involves engineering toil and building systems that scale, but not everybody is cut out for both worlds. When recruiting for the role, Raghavan walks interviewees through an extreme case to see how they react. Suppose multiple components have failed and alerts are going haywire: how someone responds can demonstrate how they operate under stress. It can also test their ability to differentiate symptom-based alerts from cause-based alerts.
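To make the distinction between symptom-based and cause-based alerts concrete, here is a minimal Python sketch. The alert names, the AlertKind labels, and the rule that only symptom alerts page a human are illustrative assumptions for this example, not something Raghavan described.

```python
from dataclasses import dataclass
from enum import Enum


class AlertKind(Enum):
    SYMPTOM = "symptom"  # what users experience, e.g. errors or slow pages
    CAUSE = "cause"      # an internal condition that may explain a symptom


@dataclass(frozen=True)
class Alert:
    name: str
    kind: AlertKind


def triage(alerts):
    """Split firing alerts into those worth paging on and supporting context.

    Assumption for this sketch: symptom alerts page a human because they
    reflect user impact, while cause alerts are kept as diagnostic clues.
    """
    page = [a.name for a in alerts if a.kind is AlertKind.SYMPTOM]
    context = [a.name for a in alerts if a.kind is AlertKind.CAUSE]
    return page, context


# Hypothetical incident: several components fail at once.
firing = [
    Alert("checkout_error_rate_high", AlertKind.SYMPTOM),
    Alert("database_disk_nearly_full", AlertKind.CAUSE),
    Alert("cache_node_restarting", AlertKind.CAUSE),
    Alert("page_load_latency_high", AlertKind.SYMPTOM),
]

page, context = triage(firing)
print("Page on:", page)
print("Investigate with:", context)
```

In an interview scenario like the one above, a strong candidate would typically start from the user-facing symptoms and then use the cause-level alerts to narrow down what went wrong.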
Another way to build a healthy SRE culture is to give people the opportunity to make mistakes. This means celebrating failures and not playing the blame game. Blameless retrospectives, for example, are a good practice to embrace, as they help teams discover the root cause of incidents without pointing fingers. “Being able to fail is part of the learning process that many people miss out on,” said Raghavan.
SRE Team-Building: Final Thoughts
Many skill sets are integral to the SRE practice, but one especially important trait is being able to see the bigger picture. Since SREs focus on scale and cut across the business, they can spot those “golden eggs” that all teams share, said Raghavan. Repurposing these capabilities eliminates the need to reinvent the wheel across teams. “Once you do that, then it simplifies systems and processes across the board.”
Regarding team dynamics, we must also consider the nature of today’s software teams, which are often geographically distributed or involve a hybrid of remote and in-person work. As such, where you build your team matters. Leaders may have to arrange schedules creatively to ensure that distributed employees work only during their own business hours. Working asynchronously also underscores the need for soft skills and strong communication.
But finding the right talent with these skills is one thing — retaining it is another. Thankfully, the position has many qualifications that look good on a resume, which could entice engineers. “What engineer wouldn’t be excited to work on Kubernetes, OpenShift, running on AWS, GCP, and Azure and thinking about things like service mesh, and Istio?” said Raghavan. If you take the excitement around modern technologies and pair that with a supportive team dynamic, you can create a positive, sustainable SRE working environment that is hard to leave.