Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations by Vladyslav Ukis

Author: Vladyslav Ukis
Language: eng
Format: epub, pdf
Publisher: Addison-Wesley Professional
Published: 2022-09-01T16:00:00+00:00


Figure 5.6 Extended view of the user experience encompassing product reliability

Product management already has a user experience focus in UX and UI design. The UX and UI design process generates artifacts that serve as input to product development. The output of product development is the product deployed to production, which is where users access it. Their entire user experience is based on the product in production. Product management therefore needs an additional experience focus: product reliability in production. Combining the two focuses, UX/UI design and product reliability in production, yields the best experience for the user in production.

But how do you extend the user experience focus to include product reliability in production? This is where SRE has its strength. It allows the product owner to take part in a structured process together with operations engineers and developers that leads to sufficient product reliability. What do the product owners need to do?

1. Take part in the definition of indicators and objectives for services in production (SLIs and SLOs).

2. Take part in the definition of policies for violating the objectives (error budget policies).

3. Make prioritization decisions based on the policies (error budget–based decision-making).

4. Take part in the definition of the on-call setup for services in production.
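
The arithmetic behind the first three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a method taken from the book: the `Slo` class, the function names, the service name, and the request counts are all hypothetical, and a real setup would read the counts from a monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical availability SLO: target fraction of good requests."""
    name: str
    target: float  # e.g. 0.99 means 99% of requests should succeed


def sli(good: int, total: int) -> float:
    """Availability SLI: fraction of requests that were good."""
    if total == 0:
        return 1.0  # no traffic, so nothing failed
    return good / total


def error_budget_remaining(slo: Slo, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_bad = (1.0 - slo.target) * total  # failures the SLO tolerates
    actual_bad = total - good
    if allowed_bad == 0:
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Assumed example values for one reporting window.
    checkout = Slo(name="checkout", target=0.99)
    print(sli(9950, 10000))
    print(error_budget_remaining(checkout, 9950, 10000))
```

With a 99% target, 10,000 requests tolerate about 100 failures, so 50 failed requests leave roughly half of the error budget. The remaining budget is the number a product owner would look at when making the prioritization decisions in step 3.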

The head of product management may ask how much time product owners need to invest to take part in the SRE activities. The question arises because product owners are typically very busy juggling numerous strands of work within product management. Adding a new type of work, however important, will chip away at the time available for other work areas. The answer is that the time needed to take part in SRE activities is rather limited. The time investment breaks down as follows:

1. SLI and SLO definitions

❍ Several meetings with the team to set up initial SLIs and SLOs from the user’s point of view

❍ Ongoing adjustments of SLOs based on feedback from SLO breaches

2. Error budget policy definition

❍ A meeting with the team to set up an initial error budget policy

❍ Occasional adjustments of the error budget policy based on feedback after incidents

3. Prioritization decisions based on error budget policies

❍ Use of SRE metrics to guide backlog prioritization decisions in terms of reliability

4. On-call setup definition

❍ A meeting laying down how on-call support will be ensured for a service: which roles will be going on call, at what times of the day, and with what kind of rotation frequency

❍ In the meeting, agreement with operations engineers and developers on the prioritization of the incident backlog by the people on call
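
Item 3 in the breakdown, using SRE metrics to guide backlog prioritization, can be pictured as a small lookup from the remaining error budget to an agreed action. The sketch below assumes a policy with a single 50% threshold; that threshold, the action wording, and the `backlog_guidance` function are hypothetical and stand in for whatever the team writes into its own error budget policy.

```python
# Hypothetical threshold and wording; a real policy is agreed with the team.
HEALTHY_THRESHOLD = 0.5  # fraction of the error budget still unspent

PLAN_AS_USUAL = "prioritize the backlog as planned"
FAVOR_RELIABILITY = "move reliability work up the backlog"
RELIABILITY_FIRST = "put reliability work first until the SLO is met again"


def backlog_guidance(budget_remaining: float) -> str:
    """Map the unspent error budget fraction to a prioritization action."""
    if budget_remaining >= HEALTHY_THRESHOLD:
        return PLAN_AS_USUAL
    if budget_remaining > 0.0:
        return FAVOR_RELIABILITY
    # Budget exhausted or overspent: the SLO is currently being violated.
    return RELIABILITY_FIRST


if __name__ == "__main__":
    for remaining in (0.8, 0.2, -0.5):
        print(remaining, "->", backlog_guidance(remaining))
```

Writing the policy down this explicitly is a design choice: the decision is made once, in a meeting, and afterward the production data decides which branch applies.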

Thus, with a modest time investment on the product owner side, much can be achieved:

• Influence on the reliability objectives for services (SLOs); for example, by product importance, criticality, customer segments, and so on

• Influence on the policy enacted when the service objectives are not fulfilled

• Reliability work prioritization decision-making based on real data from production reporting on


