SRE. Understanding what it is & how it can help your business

SRE. Site Reliability Engineering

What is it?  

SRE stands for Site Reliability Engineering.  

The concept of SRE was created at Google in 2003 by Ben Treynor, who has described it as what happens when a software engineer is asked to design an operations function.

SRE teams focus on using software (code) to solve problems, improve systems, services, and applications in production, and automate as many manual tasks as possible.

The goal of SRE teams is to enhance the continuous delivery cycle of a product by  collaborating with development teams to understand and apply operational processes  throughout the entire application development lifecycle. All of this is aimed at ensuring  that any developed application or service is highly reliable and scalable. 

This may sound familiar, and you may be wondering whether SRE is the same as DevOps. The answer is no. For a start, the concept of SRE emerged before the DevOps philosophy. The two share many ideas and pursue similar goals, but they address different questions:

  • DevOps focuses on what needs to be done to bridge the gap between development and operations.
  • SRE focuses on how things should be done to bridge that gap.

At Kiteris, we like to explain SRE as a framework within DevOps, in the same way that Scrum is a framework for applying Agile principles.

The pillars of SRE. 

As mentioned above, SRE seeks to answer the question of how things should be done. To that end, it defines a set of fundamental pillars.

Reducing organisational silos. 

To achieve this goal, SRE promotes the distribution of ownership and responsibility among all teams in the company. This aims to make teams feel ownership of the project and see all tasks as an indivisible set to achieve success. This prevents typical situations like “this is not mine,” “my part is already done,” or “this is not my problem.”

Accepting failures as normal. 

It starts from the premise that things fail, especially when human intervention is involved.  Therefore, SRE teams dedicate a significant amount of time to resolving critical issues  and improving systems to be highly fault-tolerant.  

To achieve this goal, SRE teams work on three distinct approaches: 

  • Anticipating problems before they occur. Systems generally degrade gradually before a problem leads to a service outage.
  • Runbooks and postmortems. All SRE teams document common operations and analyze incidents that recur or have caused system instability. Documentation is crucial for spreading knowledge and improving incident resolution times.
  • Automating. SRE teams focus on automating problem resolution as much as  possible without impacting users. This significantly reduces average resolution  times and minimizes high-stress situations for technical teams. 
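
As an illustration of the first approach, the sketch below flags gradual degradation by comparing the average of the most recent latency samples with an earlier baseline. It is a minimal Python example, not a description of any particular monitoring tool: the function name is_degrading, the window sizes, the 20% tolerance, and the sample values are all assumptions made for the example.

```python
from statistics import mean


def is_degrading(latencies_ms, baseline_window=5, recent_window=5, tolerance=1.2):
    """Return True when the recent average latency exceeds the baseline
    average by more than the tolerance factor (1.2 means 20% slower)."""
    if len(latencies_ms) < baseline_window + recent_window:
        return False  # not enough samples to compare the two windows
    baseline = mean(latencies_ms[:baseline_window])
    recent = mean(latencies_ms[-recent_window:])
    return recent > baseline * tolerance


# Hypothetical response times in milliseconds, oldest first.
degrading_samples = [100, 102, 98, 101, 99, 105, 112, 120, 131, 140]
stable_samples = [100, 102, 98, 101, 99, 100, 101, 99, 102, 100]

print(is_degrading(degrading_samples))  # True: recent average is 121.6 ms vs. a 100 ms baseline
print(is_degrading(stable_samples))     # False: recent average is 100.4 ms
```

In practice the samples would come from a monitoring system and the thresholds would be tuned per service. The point is only that a slow drift can be detected before it turns into an outage.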

Gradual changes. 

As mentioned, SRE was born at Google. Like any cutting-edge company, Google needs to release product updates and improvements very frequently. SRE embraces change and continuous improvement, while always providing the level of quality and scalability a product needs to be as reliable as possible.

SRE teams will help development teams implement these best practices and understand  the need to apply the necessary operational processes to achieve this goal.  

Similarly, at Kiteris, we see SRE teams as a possible evolution of AMS teams, adding  DevOps knowledge and focusing primarily on critical applications in 24×7 environments. 

Automation. 

We have mentioned that things fail, especially when human intervention is involved.  Therefore, SRE aims to automate as many manual tasks as possible to provide value to  development and operations teams. With less human intervention, the system becomes  more reliable.
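
To make the idea concrete, here is a minimal Python sketch of an automated remediation loop: it checks a service, applies a fix if the check fails, and escalates to a person only when the fix does not help. Everything in it is hypothetical, including the auto_remediate function, the FakeService class that stands in for a real system, and the limit of three attempts.

```python
def auto_remediate(check, remediate, max_attempts=3):
    """Call remediate() until check() passes or the attempts run out.

    Returns "healthy" if no action was needed, "recovered" if a
    remediation worked, and "escalate" if a human has to step in.
    """
    if check():
        return "healthy"
    for _ in range(max_attempts):
        remediate()
        if check():
            return "recovered"
    return "escalate"


class FakeService:
    """Simulated service that becomes healthy after a number of restarts."""

    def __init__(self, restarts_needed):
        self.restarts_needed = restarts_needed

    def is_healthy(self):
        return self.restarts_needed <= 0

    def restart(self):
        self.restarts_needed -= 1


service = FakeService(restarts_needed=2)
print(auto_remediate(service.is_healthy, service.restart))  # recovered
```

A real implementation would replace the simulated check and restart with calls to the actual platform, and would record each action so that it can be reviewed in a postmortem.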

Measure, measure, and measure… 

In order for an SRE team to know if things are going well, they need to have the  necessary tools to understand what is happening and what will happen in their systems.  This can be achieved by setting up monitoring alerts, implementing a comprehensive  testing cycle before each deployment, or conducting code reviews to ensure best  practices in development. Additionally, to help anticipate problems, SRE includes the  concept of Observability, which we will delve into in more detail later on. 

As we have repeatedly mentioned, SRE focuses on providing maximum reliability. To achieve this, it is essential to define a series of service levels that make reliability measurable. SRE distinguishes three types of service level:

  • SLA (Service Level Agreement) is the best known and usually carries contractual consequences depending on whether it is met.
  • SLO (Service Level Objective) is the most important level for any organisation looking to implement SRE, as it defines the optimal availability goal to ensure customer/user satisfaction. SLOs are based on more specific aspects than SLAs, such as the total response time for a user registration process on a website.
  • SLIs (Service Level Indicators) are the metrics that make up an SLO. They help SRE teams focus their efforts on correcting the specific metrics that contribute to achieving the promised SLOs. For example, an SLI could be the latency of a specific query within a user registration form on a website.
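
The relationship between an SLI and an SLO can be sketched in a few lines of Python. The example below computes a latency SLI (the fraction of registration requests answered within a threshold) and compares it with an SLO target. The function names, the 250 ms threshold, the 95% target, and the latency values are assumptions chosen for illustration, not recommended values.

```python
def latency_sli(latencies_ms, threshold_ms):
    """SLI: fraction of requests served within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, slo_target):
    """The SLO is met when the measured SLI reaches the target."""
    return sli >= slo_target


# Hypothetical response times (ms) for a user registration request.
registration_latencies_ms = [120, 180, 95, 240, 310, 150, 130, 170, 160, 140]

sli = latency_sli(registration_latencies_ms, threshold_ms=250)
print(f"SLI: {sli:.0%}, SLO met: {meets_slo(sli, slo_target=0.95)}")
# SLI: 90%, SLO met: False
```

Here nine of the ten requests finish within 250 ms, so the SLI is 90% and the 95% objective is missed. That is the kind of signal that tells an SRE team where to focus.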

We recommend reading the books published by Google, which describe the practices discussed here in detail and with examples.

https://sre.google/books/

 


    Author: Daniel Amado
    Manager at Kiteris