What is it?
SRE stands for Site Reliability Engineering.
The concept of SRE was created in 2003 by Ben Treynor, a Google employee. In his own words, SRE is “what happens when you ask a software engineer to design an operations team.”
SRE teams focus on using software (code) to solve problems, improve systems, services, and applications in production, and automate as many manual tasks as possible.
The goal of SRE teams is to enhance the continuous delivery cycle of a product by collaborating with development teams to understand and apply operational processes throughout the entire application development lifecycle. All of this is aimed at ensuring that any developed application or service is highly reliable and scalable.
This may sound familiar, and you may be wondering whether SRE is the same as DevOps. The answer is no. For one thing, the concept of SRE emerged before the DevOps philosophy. The two share many ideas and aim to achieve similar goals, but they address different questions:
- DevOps focuses on what needs to be done to bridge the gap between development and operations.
- SRE focuses on how things should be done to bridge that same gap.
At Kiteris, we like to explain SRE as a framework within DevOps, in the same way that Scrum is a framework for applying Agile principles.
The pillars of SRE.
As we have mentioned before, SRE seeks to answer the question of how things should be done. Therefore, SRE defines fundamental pillars that aim to provide an answer to this question.
Reducing organisational silos.
To achieve this goal, SRE promotes the distribution of ownership and responsibility among all teams in the company. The aim is for teams to feel ownership of the project and to see all tasks as an indivisible set that must succeed together. This prevents typical situations like “this is not mine,” “my part is already done,” or “this is not my problem.”
Accepting failures as normal.
SRE starts from the premise that things fail, especially when human intervention is involved. Therefore, SRE teams dedicate a significant amount of time to resolving critical issues and making systems highly fault-tolerant.
To achieve this goal, SRE teams work on three distinct approaches:
- Anticipating problems before they occur. Systems generally degrade gradually before a service outage, so that degradation can be detected and acted on early.
- Runbooks and postmortems. SRE teams document common operations and analyze incidents that recur or that have caused system instability. Documentation is crucial for spreading knowledge and reducing incident resolution times.
- Automating. SRE teams focus on automating problem resolution as much as possible without impacting users. This significantly reduces average resolution times and minimizes high-stress situations for technical teams.
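The first of these approaches, spotting gradual degradation before it becomes an outage, can be illustrated with a small Python sketch. This is only one possible way to do it under simple assumptions, not a method prescribed by SRE: the function name `is_degrading`, the window size, the 1.5 ratio and the latency values are all hypothetical. It compares the average of the most recent measurements with the average of the window just before them.

```python
from statistics import mean


def is_degrading(samples, window=5, max_ratio=1.5):
    """Return True when the mean of the latest `window` samples exceeds
    the mean of the preceding `window` samples by more than `max_ratio`.

    `window` and `max_ratio` are illustrative defaults, not recommended values.
    """
    if len(samples) < 2 * window:
        return False  # not enough history to compare two windows
    baseline = mean(samples[-2 * window:-window])
    recent = mean(samples[-window:])
    return baseline > 0 and recent / baseline > max_ratio


# Made-up response times in milliseconds, oldest first.
healthy = [100, 102, 98, 101, 99, 103, 100, 97, 102, 101]
degrading = [100, 102, 98, 101, 99, 140, 155, 170, 190, 210]

print(is_degrading(healthy))    # False: recent mean is close to the baseline
print(is_degrading(degrading))  # True: recent mean is 1.73 times the baseline
```

In practice a check like this would feed an alert or an automated remediation step, so that the team can react while users are still unaffected.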
Gradual changes.
As you already know, SRE was born within Google. Like any cutting-edge company, one of its main business needs is to release product changes and improvements very frequently. SRE embraces change and continuous improvement, while always providing the level of quality and scalability a product needs to be as reliable as possible.
SRE teams will help development teams implement these best practices and understand the need to apply the necessary operational processes to achieve this goal.
Similarly, at Kiteris, we see SRE teams as a possible evolution of AMS teams, adding DevOps knowledge and focusing primarily on critical applications in 24×7 environments.
Automation.
We have mentioned that things fail, especially when human intervention is involved. Therefore, SRE aims to automate as many manual tasks as possible to provide value to development and operations teams. With less human intervention, the system becomes more reliable.
Measure, measure, and measure…
In order for an SRE team to know if things are going well, they need to have the necessary tools to understand what is happening and what will happen in their systems. This can be achieved by setting up monitoring alerts, implementing a comprehensive testing cycle before each deployment, or conducting code reviews to ensure best practices in development. Additionally, to help anticipate problems, SRE includes the concept of Observability, which we will delve into in more detail later on.
As we have repeatedly mentioned, SRE focuses on providing maximum reliability. To achieve this, it is essential to define service levels that make reliability measurable. SRE distinguishes three types of service level:
- SLA (Service Level Agreement) is the best known and usually carries contractual consequences if it is not met.
- SLO (Service Level Objective) is the most important level for any organisation looking to implement SRE, as it defines the target that must be met to keep customers and users satisfied. SLOs are based on more specific aspects than SLAs, such as the total response time for a user registration process on a website.
- SLIs (Service Level Indicators) are the metrics that make up an SLO. They help SRE teams focus their efforts on the specific measurements that determine whether the promised SLOs are achieved. For example, an SLI could be the latency of a specific query within a user registration form on a website.
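To make the relationship between an SLI and an SLO more concrete, here is a minimal Python sketch based on the user registration example. The objective itself is an assumption made up for illustration (99% of registration requests completing within 500 ms), as are the names `LatencySLO`, `latency_sli` and `meets_slo` and the sample latencies. The SLI is computed as the fraction of requests that finish within the threshold, and the SLO is met when that fraction reaches the target.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    """Hypothetical SLO: `target` fraction of requests within `threshold_ms`."""
    threshold_ms: float
    target: float  # e.g. 0.99 means 99% of requests


def latency_sli(latencies_ms, threshold_ms):
    """SLI: fraction of requests whose latency is within the threshold."""
    if not latencies_ms:
        raise ValueError("need at least one measurement")
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms, slo):
    """The SLO is met when the measured SLI reaches the target."""
    return latency_sli(latencies_ms, slo.threshold_ms) >= slo.target


# Illustrative numbers only: invented latencies for the registration process.
registration_slo = LatencySLO(threshold_ms=500, target=0.99)
samples = [120, 180, 230, 310, 450, 470, 480, 490, 520, 900]

sli = latency_sli(samples, registration_slo.threshold_ms)
print(f"SLI: {sli:.0%}, SLO met: {meets_slo(samples, registration_slo)}")
# prints "SLI: 80%, SLO met: False"
```

Here 8 of the 10 sample requests finish within 500 ms, so the indicator is 80% and the 99% objective is missed, which tells the team where to focus its effort.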
We recommend reading the SRE books published by Google, which describe the practices discussed here in detail and with examples.