Site Reliability Engineering (SRE) is still going strong in the world of software development. As a bridge between development and operations, it’s a necessary part of any organization that wants to work like a well-oiled machine. Simply put, SRE tries to fix a widespread problem in organizations: siloing.
But the job requirements for becoming a site reliability engineer are not widely understood. Because this is such a crucial, high-skill position, this guide for up-and-coming SREs aims to give you an understanding of the skills and tools you need to rock the job.
Now, let’s see what it takes to become a site reliability engineer who can serve as a bridge between teams.
What does a site reliability engineer do?
In short, they are a mix of a developer and an IT operations expert. A site reliability engineer has tremendous skills for identifying blind spots and structural weaknesses in infrastructure and systems.
Unlike many IT operations professionals, however, they also have the skill, will, and authority to proactively write and deploy code that addresses potential problems and resolves or prevents incidents.
With this kind of non-siloed thinking, SRE connects perfectly with the efficiency-first culture of DevOps and fixes many blind spots in that framework. Just by taking the time to identify reliability concerns, and building teams with them in mind, you’ll very likely integrate both reliability and testing earlier in the software development pipeline.
Plus, site reliability engineering doesn’t just make things easier earlier. It lets IT teams voice concerns to the development teams, creating a constant feedback loop. You’ve probably heard operations teams complain endlessly about developers not being accountable for the apps they make. SRE addresses precisely this by letting SRE teams add value both proactively and retroactively, creating a much tighter, more cohesive bond.
And what’s this about DevOps vs. SRE? They’re not mutually exclusive. Companies using DevOps have also adopted an SRE approach for enhanced reliability. The main reason is that SRE brings better observability and more metrics for dynamic, automation-reliant applications. The two share methodologies that can increase each other’s effectiveness.
SRE skills and qualifications
Site reliability engineers must master several skills before being fit for the job. Some of these are:
- Building software and systems while managing the platform infrastructure and applications.
- Having a Bachelor's degree in computer science or some equivalent, highly technical discipline. Previous success in technical engineering is going to be preferable.
- Understanding a variety of operating systems — most commonly, but not limited to, Linux — as you will be using them regularly.
- Managing the continuous integration/continuous delivery (CI/CD) pipeline. You’ll probably be tasked with building this pipeline from scratch.
- Experience with distributed storage technologies such as Ceph, HDFS, NFS, and S3, as well as dynamic resource management frameworks (like Kubernetes, Mesos, or YARN).
- Deep knowledge of version control (such as Git) and monitoring tools like Grafana, as well as a variety of databases (both relational ones such as MySQL and NoSQL stores).
- Managing and partnering with development teams through taxing testing and release cycles.
- And lastly, there are the soft skills that you need to master: communicating effectively with a variety of people and teams, in a multitude of situations. There are no real qualifications for these skills, but trust us, you’ll know if you don’t have them, or worse, your employers will.
Daily roles and responsibilities of a Site Reliability Engineer
So, before you even try to figure out if the job is for you, you need to have a good look at the daily responsibilities of an SRE:
- Building software services for DevOps, ITOps & customer support teams. This means you will be proactively working in SRE teams to make the lives of IT and support staff easier. You’ll be tasked with creating in-house tools to manage incidents.
- Patching up support escalation cases. While you’ll see fewer critical incidents in production, this will still be a significant part of your day-to-day job. Plus, since you’ll know so much about what goes on in the software development pipeline, you’ll be great at routing people and tools to where they’re needed the most.
- Making on-call rotations and processes the best they can be. You’ll be tasked with plenty of on-call responsibilities yourself, so you’d better keep that cellphone on. Site reliability engineers also typically update runbooks, tools, and documentation, so that they (or others) are prepared to respond to incidents.
- Documenting share-ready knowledge. SREs are exposed to the complete development cycle, so they’re able to document it across teams and over time. This also means that teams have knowledge bases when they need them.
- Conducting post-incident reviews that help. Site reliability engineers can help teams think through incidents and learn from their mistakes, so the same thing doesn’t happen again. This ability to nip things in the bud is one of the most valuable optimizations of the software development lifecycle.
- Delivering solutions that leverage the best automation tools on offer. And, in some cases, delivering in-house, bespoke applications that improve employees' lives by reducing menial labor.
Tips to prepare for an SRE interview
Not every organization is going to operate in the same way. Yet, there are some pretty typical questions you can expect to have flung at you during an SRE interview.
Let’s treat this as a role-playing exercise and walk through some possible answers.
“What’s the difference between SRE and DevOps?”
The answer to this question is going to change from team to team. But it’s a chance for you to put your past SRE experience in a positive light.
For instance, some organizations will have their own dedicated DevOps teams and engineers, while others will just follow DevOps principles as a general methodology. Regardless, if you can talk positively about how you’ve bridged the gap between development and operations and increased overall efficiency, and share ideas on how to do it in the future, you’ll breeze through this question.
“What’s the most appealing prospect of becoming an SRE?”
You won’t get very far unless you show you’re happy about the possibility of landing the job. While SREs aren’t always seen as exciting roles, they are vital and require skills few possess. This is your moment to speak at length about how you will build services that improve system reliability and increase customer and employee satisfaction.
And indeed, being part of an SRE team is exciting because you can create a lasting impact throughout the development lifecycle, from researchers to end-users.
“Are there issues with your current development pipeline?”
It can be a bit of a trick question, where the interviewer may be trying to determine your ability to assess how well your deployment pipeline is working and whether you can make intelligent decisions to change it for the better.
You’ll use all the site reliability engineering strengths, like identifying deployment bottlenecks and monitoring deficiencies, letting stakeholders know about reliability concerns, and determining where your team can improve resiliency (without tanking productivity).
What you want here is to demonstrate your ability for high-level problem-solving.
“What is a success for you and your team, and how do you track it?”
This question ascertains how well you’ve been using monitoring and alerting tools. Plus, it lets the interviewers know how you determine what a “healthy” system looks like.
An SRE team must leverage internal and external outputs to judge a system’s overall health. Accordingly, you should be able to convert that information into actionable insights for the IT and technical folk.
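One common way to make “healthy” concrete is to compare a measured success ratio against an agreed target and track how much failure the target still tolerates. The snippet below is a minimal sketch of that idea, not a description of any particular monitoring tool: the names `SloReport` and `evaluate_slo`, the request counts, and the 99.9% target are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    sli: float             # observed fraction of successful requests
    budget_total: int      # failed requests the target tolerates
    budget_remaining: int  # tolerated failures not yet used up
    healthy: bool          # True while the budget is not overspent


def evaluate_slo(total_requests: int, failed_requests: int,
                 slo_target: float) -> SloReport:
    """Compare observed reliability against a target (hypothetical helper)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be in (0, 1]")

    # Observed success ratio for the measurement window.
    sli = (total_requests - failed_requests) / total_requests
    # Number of failures the target allows over the same window.
    budget_total = round(total_requests * (1 - slo_target))
    budget_remaining = budget_total - failed_requests
    return SloReport(sli, budget_total, budget_remaining, budget_remaining >= 0)
```

With these assumed numbers, `evaluate_slo(1_000_000, 400, 0.999)` tolerates 1,000 failed requests and leaves 600 of them unspent, which is the kind of figure you can hand to developers and stakeholders as an actionable insight.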
“What’s your background with programming languages and other tools?”
They will want to know about your background first. The interviewer will want to get this out of the way quickly because you won’t be cut out for the job if you don't know your stuff.
“How are your IT and operations teams getting on, anyway?”
A bit of a trick question as well, because as you know, site reliability engineering is about building bridges between otherwise separated areas. Thus, the interviewer wants to know, first and foremost, whether you’ve been good at your job. Then again, productivity bottlenecks happen in the best of families, so they may simply be asking where yours are. It is also one of the times you can mesmerize them with your SRE know-how and tell them how you’re working to increase efficiency across the board.
Most of the time, you’ll find, it’s about making information accessible to all concerned parties and increasing visibility between departments.
“What’s your on-call setup like? And how would you structure that ideally?”
As we’ve mentioned above, on-call efficiency and quality of life are some of the chief daily concerns of a site reliability engineer. So, an SRE interview will require you to demonstrate how you would set up an efficient, empathetic on-call experience.
When asked about this, stress the importance of providing a humane experience. While it’s true that tools and processes are essential, people are the primary concern, not how much automation is improving your response times or resolution rates.
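If the interviewer asks for specifics, it can help to describe a simple, predictable rotation. The snippet below is one possible sketch, assuming weekly shifts with a primary and a secondary responder. The function name `build_rotation`, the engineer names, and the start date are hypothetical.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return a list of (shift_start, primary, secondary) weekly shifts.

    Engineers take turns in list order. Each week's secondary becomes
    the next week's primary, so nobody is primary two weeks in a row.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary and secondary")

    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        shift_start = start + timedelta(weeks=week)
        schedule.append((shift_start, primary, secondary))
    return schedule


if __name__ == "__main__":
    for shift_start, primary, secondary in build_rotation(
        ["ana", "bo", "cy"], date(2024, 1, 1), 4
    ):
        print(shift_start, "primary:", primary, "secondary:", secondary)
```

A fixed order like this is easy to reason about and lets people plan their time off well in advance. A real setup would also need to handle swaps, holidays, and time zones, which this sketch leaves out.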
What does site reliability engineering bring to the table? We believe it’s an integrated meta-team, a cross-team collaboration that makes everyone pull in the same direction. We live in an integrated world, and technology is not isolating us but optimizing us. It’s no different in software development.
Another significant part of SRE is autonomy, meaning that site reliability engineers have a degree of freedom and independence that they don’t see in many other roles. If being able to make organizational changes or run experiments that lead to greater system reliability tickles your fancy, this is a career for you. In addition, you’ll probably improve the lives of your colleagues by orders of magnitude, and that’s no small thing.
And not just that, you’ll be learning about the whole gamut of software development and IT operations disciplines. This means that you’ll not only link together diverse teams, but you’ll also be constantly building your skill set. This will make you a better developer, and a much better manager as well.
Frequently Asked Questions
What other titles for Site Reliability Engineer are there?
Aside from site reliability engineer, some other common, almost synonymous names for the role are DevOps engineer, automation engineer, or reliability engineer.
What’s the difference between SRE and a software engineer?
Site reliability engineers treat maintaining reliability as their chief concern. Software engineers, meanwhile, focus primarily on developing software rather than on systems-wide issues. This is only a general guideline, and it doesn’t mean that the roles never overlap: both want the same things but focus on different aspects of the process.