NOTE: SRE concepts are partly rooted in human psychology, which is why they are also called cultural concepts.
Cultural concepts >> DevOps philosophy >> SRE methodology
Section 3: Applying site reliability engineering practices to applications (~23% of
the exam)
3.1 Balancing change, velocity, and reliability of the service. Considerations include:
● Defining SLIs (e.g., availability, latency), SLOs, and SLAs
● Error budgets
● Opportunity cost of risk and reliability (e.g., number of “nines”)
3.2 Managing service lifecycle. Considerations include:
● Service management (e.g., introduction of a new service by using a pre-service
onboarding checklist, launch plan, or deployment plan, deployment, maintenance, and
retirement)
● Capacity planning (e.g., quotas, limits)
● Autoscaling (e.g., managed instance groups, Cloud Run, GKE)
3.3 Mitigating incident impact on users. Considerations include:
● Draining/redirecting traffic
● Adding capacity
● Rollback strategies
Psychology, SRE, and DevOps relation
Note: information summarized from (Developing a Google SRE Culture) for GCP DevOps exam purposes
Module 1: Welcome to Developing a Google SRE Culture (just an intro)
Module 2: DevOps, SRE, and Why They Exist
Module 3: SLOs with Consequences
Module 4: Make Tomorrow Better than Today (postmortems)
Module 5: Regulate Workload
Module 6: Apply SRE in Your Organization
DevOps is a philosophy, and SRE is a development methodology or technology.
SRE is a way to implement DevOps.
DevOps: reduce organizational silos, accept failure as normal, implement gradual change, leverage tooling and automation, measure everything.
SRE: share ownership, blamelessness, reduce cost of failure, toil automation, measure toil and reliability.
Comparison of DevOps and SRE principles:
Principle | DevOps philosophy | SRE Methodology | SRE cultural concepts |
---|---|---|---|
Collaboration | Reduce organization silos | Share ownership | shared vision and knowledge/ collaboration and communications |
Failure Handling | Accept failure as normal | Blamelessness, postmortem | psychological safety and blamelessness |
Change Management | Implement gradual change (CI/CD) | Reduce cost of failure (canary deployment) | design thinking and prototyping |
Automation | Leverage tooling & automation | Toil automation | psychology of and resistance to change |
Measurement | Measure everything | Measure toil and reliability | goal setting, transparency, and data-driven decision making. |
Blamelessness is the notion of switching responsibility from people to systems
and processes. Focus on systems and processes, not people. Innovation requires some
degree of risk taking.
Blameless postmortem
Components of a postmortem
● Details of the incident and its timeline
● The actions taken to mitigate or resolve the incident
● The incident’s impact
● Its trigger and root cause or causes
● The follow-up actions to prevent its recurrence
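The components listed above can be captured as a simple record so that an incomplete postmortem is easy to spot. This is only a minimal sketch: the `Postmortem` class, its field names, and the `missing_sections` helper are hypothetical names chosen for illustration, not part of any Google tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Postmortem:
    """Hypothetical record with one field per postmortem component."""
    timeline: List[str] = field(default_factory=list)            # incident details and timeline
    mitigation_actions: List[str] = field(default_factory=list)  # actions taken to mitigate or resolve
    impact: str = ""                                              # the incident's impact
    trigger: str = ""                                             # what set the incident off
    root_causes: List[str] = field(default_factory=list)         # root cause or causes
    follow_up_actions: List[str] = field(default_factory=list)   # actions to prevent recurrence

    def missing_sections(self) -> List[str]:
        """Return the names of components that are still empty."""
        return [name for name, value in vars(self).items() if not value]


draft = Postmortem(impact="Checkout errors for some users", trigger="Config push")
print(draft.missing_sections())
```

Note that the record has no field for naming a responsible person, which keeps the focus on systems and processes.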
SRE Cultural factors
Specifically, organizations developing SRE culture should focus on:
● Creating a unified vision
● Determining what collaboration looks like
● Sharing knowledge among teams
1. Unified vision
Most companies have a vision statement that serves as a guide for the work they
do. To give a sense of direction, your IT team’s vision should support the company’s
vision.
A team’s vision is everything about what drives its work, and includes core values,
purpose, mission, strategy, and goals.
Values
- Your response to others
- Your commitments to personal and organizational goals
- The way you spend your time
- The way you operate as a team
By developing core values, you can help your team:
● Build trust and psychological safety with each other
● Be more willing to take risks
● Be more open to learning and growing
● Feel a greater sense of inclusion and commitment
A team purpose
- Explains why your team exists.
- Improves life and work satisfaction.
- Creates stronger connections.
- Helps reduce conflict.
Mission (long-term goal / overall direction)
A mission is a clear and compelling goal the team wants to achieve.
Google’s mission: To organize the world’s information and make it universally accessible and useful.
Strategy
- Can be a single initiative.
- Can be leveraged.
- Requires change
Your team’s strategy is how your team will realize its mission.
Strategic action takes on many forms.
● It can be a single initiative designed to meet a specific future goal.
● It can be leveraged: a single project can meet multiple future goals.
● It requires change: a change in investment of resources and people, and a
change in habits and how work gets done
Strategy building blocks
Some basic building blocks of strategy are to:
● Look outside to identify threats and opportunities
● Look inside to understand resources, capabilities, and practices
● Consider strategies for addressing threats and opportunities
● Create alignment on communicating and coordinating work processes
Goal
A goal is what you strive to attain.
Google uses OKRs (objectives and key results), which play a role similar to KPIs (key performance indicators).
OKRs are used to set ambitious goals and track progress. OKRs can encourage
people to try new things, prioritize work, and learn from both success and failure.
And while the team may not reach every OKR, it gives them something to strive for together.
2. Collaboration and communication
Effective communication has
always been a high priority in SRE. Collaboration between SRE teams has its
challenges, but it has potentially great rewards, including common approaches to
platforms for solving problems, which let teams focus on solving more difficult
problems.
Service-oriented meetings
● Review state of service
● Weekly 30-60 minutes
● Designated lead
● Compulsory attendance
● Set agenda
A service-oriented meeting is where an SRE team reviews the state of the service or services in their charge to
increase awareness of all stakeholders involved and to improve the operation of the
service or services.
Service-oriented meetings usually happen weekly for 30 to 60 minutes and should
have a designated lead.
Attendance should be compulsory for all the team members, because this is a
major opportunity to interact as a group. Setting a defined agenda is important. For
example, your agenda could be to cover upcoming production changes, metrics,
outages, paging events, non-paging events, and prior action items.
Team composition roles
● Tech lead who sets the technical direction of the team
● Manager who runs performance management
● Project manager who comments on a design doc and writes code
SRE team compositions and skills are covered in Module 6.
3. Sharing knowledge among teams
Cross-training
● Trains team members to be flexible.
● Helps reduce costs.
● Improves morale.
● Reduces turnover.
● Boosts productivity.
● Enhances scheduling flexibility.
● Increases job satisfaction.
Job shadowing
Thirdly, job-shadowing is an effective means of generating knowledge sharing.
Some examples of job-shadowing benefits in IT teams include:
● Expert knowledge and exposure for new hires to what others in the team do
every day
● Hands-on experience for how the system should be maintained
● An opportunity to ask an expert any questions
● A good introduction to the concept of gradual change and a broader
explanation of what it means to the team
● A way to spot opportunities for cross-functional collaboration
● A great way to understand the nuances of what a particular job role entails
● A psychologically safe environment where it’s normal to ask questions and
learn
● A way to pair up your team members that helps to scale and retain knowledge
Benefits of collaboration technology
Postmortem workflows include collaboration and knowledge sharing at every stage.
It’s crucial to use technology, such as Google Docs, that enables some key features:
● Real-time collaboration: Enables data and ideas collection
● An open commenting/annotation system: Makes crowdsourcing solutions
easy and improves coverage
● Email notifications: Directs collaborators or is used to loop others to provide
input
Psychological safety = The belief that a person will not be punished or humiliated
for speaking up with ideas, questions, concerns, or mistakes
Blamelessness fosters psychological safety. Two human tendencies work against it: hindsight bias and discomfort discharge.
Hindsight bias is the tendency of people to overestimate their ability to have predicted an unpredictable
outcome: after an event, they convince themselves that they accurately predicted it before it happened ("I knew it would happen"). In working environments, it can lead to blaming the person in charge.
Discomfort discharge is when people blame others to discharge discomfort and pain at a neurobiological level; assigning blame brings a feeling of relief.
Culture of design thinking and prototyping
Design thinking combines creativity and structure to solve complex problems.
Google uses design thinking as one method to teach teams and individuals to think creatively, which is an important step in the process of innovation.
Design thinking methodology has five phases.
- First, empathize. In this phase you want to observe and engage with your intended users to learn more about them and immerse yourself in their environments, so that you gain insight into your users and their needs.
- Second, define the problem you are attempting to solve. Express the problem from the point of view of the user, rather than in terms of what you want to accomplish.
- Third, ideate. Now that you’ve defined the problem, you can start generating ideas for solutions. This is a time to “think outside the box.”
- Fourth, it’s time to prototype. In this phase, you can get the ideas out of your head and into the real world. It’s meant to be experimental, so you can identify the best possible solution before committing.
- And finally, test. You’ll want to test your prototype solutions in a real-world setting with your intended users.
Ways to prototype
- Physical prototyping (e.g., Lego)
- Paper and drawing (pen and paper)
- Clickable (a software mockup)
- Role play
- Video
Google has seen, based on its customer interactions, that with the help of
simple prototypes, customers are able to improve the most complex processes. By
activating imaginative thinking, individuals feel motivated and encouraged to have
audacious ideas that they might not have by sitting at their desks or during a regular
meeting. Some of the examples of prototypes we’ve seen from customers include a
video of a panel discussion, a heatmap, and a banner with post-it notes.
Toil
So what exactly do we mean by toil? Toil is work directly tied to a service that is
manual, repetitive, automatable, tactical, or without enduring value, or that scales
linearly as the service grows.
- Manual
- Repetitive
- Automatable
- Tactical
- Without enduring value
- Scales linearly as the service grows
Toil isn’t just administrative work or work you don’t want to do, because that kind of work can still be very important.
By eliminating toil, SREs can focus the majority of their time on work that will either reduce future toil or add service features, which generally focuses on improving reliability, performance, or utilization.
Google recommends keeping the time your SREs spend on toil below 50%.
Excessive toil
1. Career stagnation
2. Low morale
3. Confusion
4. Slower progress
5. Precedence
6. Attrition
7. Breach of faith
Toil can lead to career stagnation. Individual team members’ career progress will
slow down or stop if they spend too little time on projects. You
can’t make a career out of toil.
It promotes low morale. People have different levels of tolerance for how much toil
they can do, but everyone has a limit. Too much toil leads to burnout, boredom, and
discontent.
It creates confusion. At Google, they work hard to ensure that everyone who works in
or with the SRE organization understands that they are an engineering organization.
Individuals or teams within SRE that engage in too much toil undermine the clarity of
that communication and confuse people about the SRE role.
Toil slows progress. Excessive toil makes a team less productive. A product’s feature
velocity will slow if the SRE team is too busy with manual and reactionary work to roll
out new features promptly.
It sets precedent. If you’re too willing to take on toil, your developer counterparts will
have incentives to load you down with even more toil, sometimes shifting operational
tasks that should rightfully be performed by developers to SRE. Other teams may also
start expecting SREs to take on such work, further perpetuating the issue.
It promotes attrition. Even if you’re not personally unhappy with toil, your current or
future teammates might like it much less. If you build too much toil into your team’s
procedures, you motivate the team’s best engineers to start looking elsewhere for a
more rewarding job.
Lastly, toil causes breach of faith. New hires or transfers who join SRE with the
promise of project work will feel cheated, which is bad for morale.
Value of automation
1. Consistency
2. A platform (automated systems provide a platform that can be extended and applied to more systems)
3. Quicker resolutions
4. Faster action
5. Time saved
Consistency: Any action performed by a human is prone to error,
especially the same action performed hundreds of times. A person isn’t likely to be as
consistent as a machine. Lack of consistency leads to mistakes, oversights, issues
with data quality, and even reliability problems. Automation remedies this by creating
consistency.
Next, automated systems provide a platform that can be extended and applied to
more systems. A platform also provides a way to centralize mistakes, so that a bug is
fixed once in one place. With humans, you’d have to communicate that fix across
multiple people, and there is more room for error and for the bug to be reintroduced.
Additionally, a platform can execute additional tasks faster and with more accuracy
than humans, and can also export performance metrics more easily than a manual
system.
Another value of automation is faster action. Machines react faster than humans, so
for large production services, automating is necessary for survival since the amount of
work required is usually beyond a manageable manual threshold.
Finally, automation saves time. Even though it may be a significant time investment
to code a particular automated process, once done there is no need for continual
training of humans and maintenance of the process. Once a task is automated,
anyone can execute it.
Psychology of and resistance to change
When you introduce the SRE practice of automating to eliminate toil, some individuals will probably begin to resist. Individuals may feel as though their jobs are in jeopardy.
Human reaction to loss is much stronger than human reaction to gain.
Broadly, people and their emotions fall into four categories.
Navigators: These are the people who will make teams and businesses successful. As leaders, spot them and celebrate their behaviors. Use them as champions for the change.
Critics: They have passion and energy. Critics care, and they have valid fears, so it’s important not
to ignore them. As a leader, spend some time with them, because they will be very powerful advocates if you can persuade them.
Victims: Often, this type of individual just needs to get their emotions out. Victims tend to take organizational change very personally. Your role as a leader is to listen to them and empathize. Once they feel heard, then they can start to listen.
Bystanders: These people are tricky because you never know what they are thinking.
Often, bystanders have no idea what’s going on, and they will just continue as if
nothing, or no change, is happening. You should try to communicate with them to ascertain their feelings.
Sometimes one person can fall into several categories. Remember that it’s likely
you’ve experienced all of these faces of change at some point in your career.
Brains are hard-wired to reflect emotions.
People experience reactions to change not because they are trying to be difficult, but because it’s natural.
1. When you experience the feeling of being excluded from something, it triggers
a response in the anterior cingulate (dorsal portion), which is the same part of the brain
that deals with physical pain.
2. When you realize that something you were told in the past is unrealistic or untrue,
the prefrontal cortex switches to high alert, looks for other signs of deception, and
triggers feelings of heightened anxiety.
3. When you solve your own problems, you get a rush of adrenaline (positivity/natural
high).
4. The prefrontal cortex can only deal with a few concepts at a time. When you are
overwhelmed by unfamiliar concepts, your amygdala is triggered, making you feel
anxious, afraid, depressed, tired, or angry.
5. When you pay a greater amount of attention (attention density) to something, you
find it easier to adapt.
6. Habitual tasks feel easy and comfortable because they are hard wired and require
little conscious thought (controlled by the basal ganglia).
Google has some recommended ways to manage and account for people’s reactions to change in the teams you lead.
1. Exclusion is painful: Involve people in the change.
2. Deception anxiety: Set realistic expectations.
3. Self-solve adrenaline: Identify opportunities for co-creation and provide
coaching rather than solutions.
4. Amygdala hijack: Simplify messaging and focus on key concepts per user
group.
5. Attention density: Ensure that communications are engaging and training is
interactive.
6. Unconscious habit: Allow people time to build new habits.

Keeping the neuroscience of change in mind, consider the stages of transition that
individuals experience when going through change. There are different versions of the
change curve, but each has a beginning, a middle, and an end, and it is completely normal
for people to move backward and forward at different times.
Present change as an opportunity, not a threat, to your teams. To do
this, you’ll want to connect with individuals on three levels:
1. Head, which is rational (talk about the strategic mission, vision, and rationale behind the change).
2. Heart, which is emotional (talk about why people should care).
3. Feet, which is behavioral (talk about the knowledge, skills, and resources you will
provide to make sure they are successful in this change).
Handling resistance to change
- Are all your leaders and managers role modeling the new processes and behaviors?
- Do people understand the reason for the change?
- Do people care about the change being successful?
- Do people have the knowledge and ability to be successful in your new world?
- Are the right reinforcement and recognition programs in place?
Goals of measuring everything:
Understand the status, analyze the data, and make decisions.
- First, the IT team and the business can understand the current status of the service objectively. You’ve already learned how you can measure reliability with SLIs and SLOs.
- Second, the team can analyze the data and identify necessary actions to improve the status.
- Third, the IT team can collaborate with the business to start making better decisions and impact across the broader organization.
You can’t improve what you don’t measure. In Site Reliability Engineering, there are
three core practices that align to this pillar: measuring reliability, measuring toil,
and monitoring.
Reliability is quantified with error budgets, SLIs, and SLOs. For example, a latency SLI could track the share of requests answered within 300 ms, with an SLO that 95% of requests meet that threshold.
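The relationship between an SLI, an SLO, and the error budget is simple arithmetic, so a short sketch may help. The function names and the 30-day window below are assumptions made for illustration; the idea is that the error budget is whatever the SLO leaves over (1 minus the SLO), and each extra "nine" shrinks it tenfold.

```python
def sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of good events (for example, requests served fast enough)."""
    if total_events == 0:
        return 1.0  # no traffic means nothing has violated the objective
    return good_events / total_events


def error_budget_remaining(measured_sli: float, slo: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0.0 is spent, below 0 is overspent."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1.0 - (1.0 - measured_sli) / budget


def allowed_downtime_minutes(slo: float, days: int = 30) -> float:
    """Downtime an availability SLO permits over a window (30 days is an assumed window)."""
    return (1.0 - slo) * days * 24 * 60


# 960 of 1000 requests met the latency threshold, against a 95% SLO.
current = sli(960, 1000)
print(round(error_budget_remaining(current, 0.95), 3))
print(round(allowed_downtime_minutes(0.999), 1))
```

Comparing `allowed_downtime_minutes(0.999)` with `allowed_downtime_minutes(0.9999)` shows why each additional nine has a real opportunity cost.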
Toil: identify it and measure it in minutes or hours.
Start simple. Count the number of tickets you receive. Count the number of
alerts. Collect alert stats on cause and action. You can also measure actual human time spent on toil by
collecting data, either in the ticketing system directly or by asking your team to
estimate the time spent on toil every day or week.
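As a rough sketch of measuring toil from time data, the snippet below sums logged minutes and reports the share spent on toil. The `WorkEntry` record, its fields, and the sample entries are hypothetical; the 50% default mirrors the toil limit mentioned earlier in these notes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkEntry:
    """Hypothetical time-log entry, e.g. exported from a ticketing system."""
    description: str
    minutes: int
    is_toil: bool


def toil_fraction(entries: List[WorkEntry]) -> float:
    """Share of logged time that was spent on toil (0.0 when nothing is logged)."""
    total = sum(e.minutes for e in entries)
    if total == 0:
        return 0.0
    return sum(e.minutes for e in entries if e.is_toil) / total


def over_toil_limit(entries: List[WorkEntry], limit: float = 0.5) -> bool:
    """True when toil has reached or passed the limit (50% assumed as default)."""
    return toil_fraction(entries) >= limit


week = [
    WorkEntry("Manually restart stuck jobs", 120, True),
    WorkEntry("Handle quota tickets", 60, True),
    WorkEntry("Build restart automation", 180, False),
]
print(toil_fraction(week), over_toil_limit(week))
```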
Benefits of measuring toil
First, it triggers a reduction effort. Identifying and quantifying toil can lead to
eliminating it at the source. And second, it empowers your teams to think about
toil. A toil-laden team should make data-driven decisions about how best to spend
their time and engineering efforts.
Monitoring
Measuring everything also involves monitoring. Monitoring allows you to gain
visibility into a system. The best practice is to alert on symptoms rather than causes, and it is better to have fewer, symptom-based alerts. Google recommends alerting based on error budget burn.
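Alerting on error budget burn can be sketched as a burn-rate check: the observed error rate divided by the error rate the SLO allows. A burn rate of 1 uses the budget exactly over the SLO window, and higher values exhaust it sooner. The function names and the threshold value used here are illustrative assumptions, not recommended settings.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the error budget is being consumed relative to what the SLO allows."""
    budget = 1.0 - slo
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return error_rate / budget


def should_alert(error_rate: float, slo: float, threshold: float) -> bool:
    """Alert when the burn rate reaches a chosen threshold (the threshold is an assumption)."""
    return burn_rate(error_rate, slo) >= threshold


# 0.5% of requests failing against a 99.9% SLO burns budget about five times too fast.
print(round(burn_rate(0.005, 0.999), 2))
print(should_alert(0.005, 0.999, threshold=2.0))
```

This alerts on a symptom that users feel (errors eating the budget) rather than on a specific cause.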
Google recommends monitoring for the four golden signals:
1. Latency
2. Traffic
3. Errors
4. Saturation
Goal-setting, transparency, and data-driven decision making
Let’s start with goal-setting. For this, you’ll want to create a data-driven goal setting
process. You should look at KPIs—for who and for what you are measuring—and an
approach—what to measure and how.
Google uses OKRs, objectives and key results, as KPIs. OKRs are usually graded on a scale of 0.0 to 1.0, where 1.0 indicates a fully
achieved objective.
Consider these things when grading OKRs:
● The optimal point for an OKR grade is 60-70%. Think big when developing
your OKRs!
● OKRs are not synonymous with performance evaluation. Instead, they
show individuals’ contributions and impact.
● Organizational OKRs are graded publicly so everyone can see their
progress.
● Frequent check-ins throughout the quarter help teams and individuals
maintain progress.
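The grading scale above can be illustrated with a small sketch. Averaging key result grades to get the objective's grade is an assumption made here for simplicity, and the function names are hypothetical; the 0.6 to 0.7 band reflects the optimal range noted in the list.

```python
from typing import List


def grade_objective(key_result_grades: List[float]) -> float:
    """Objective grade as the plain average of its key result grades (an assumed convention)."""
    if not key_result_grades:
        raise ValueError("an objective needs at least one key result")
    for grade in key_result_grades:
        if not 0.0 <= grade <= 1.0:
            raise ValueError("grades must be between 0.0 and 1.0")
    return sum(key_result_grades) / len(key_result_grades)


def in_optimal_range(grade: float, low: float = 0.6, high: float = 0.7) -> bool:
    """True when the grade lands in the 60-70% band that signals an ambitious goal."""
    return low <= grade <= high


quarter = grade_objective([0.5, 0.75, 0.75])
print(round(quarter, 2), in_optimal_range(quarter))
```

A grade of 1.0 falls outside the band, which suggests the objective may not have been ambitious enough.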
2. Transparency
Transparency “is the only way to demonstrate to your employees that you believe they
are trustworthy adults and have good judgment. And giving them more context about
what is happening (and how, and why) will enable them to do their jobs more
effectively.”
In terms of SRE practice and culture, there are some specific ways of promoting
transparency: sharing monitoring tools and sharing communications and
feedback loops.
Google uses an issue tracker called Buganizer. Everyone in the
company can access this tool. Specifically, the development and operations teams
can review issues and see progress towards their resolutions. Shared tools for
development and operations teams promote transparency.
However it’s not just tool-sharing that will help your organization maintain
transparency. A culture of sharing information needs to come from your systems as
well.
Feedback loops
In communication transparency, it’s important to remember feedback loops.
Feedback loops are simple to understand: you produce something, measure
information on the production, and use that information to improve production. It is a
constant cycle of monitoring and improvement.
3. Data-driven decision making
Data-driven decision making is another important aspect of SRE culture. To make
truly data-driven decisions, you need to remove any unconscious biases.
Unconscious biases are social stereotypes about particular groups that people form
without realization.
Affinity Bias. This is a bias toward those who are similar to you. This could mean
similar in many different ways, such as race, gender, socioeconomic background, or
education level. People tend to gravitate toward those who are like them.
Confirmation Bias. This bias is the tendency to find information, input, or data that
supports your preconceived notions.
Labeling Bias. This bias is making opinions based on how people look, dress, or
show up externally.
Selective Attention Bias. This bias is when you pay attention to things, ideas, and
input from people whom you tend to gravitate toward.
All of these biases can create environments that are not very diverse or inclusive.
They can also impact the creativity, innovation, morale, engagement, and turnover in
an organization. When employees see decisions being made based on any of these
biases, it corrupts their work environment.
Remove bias
So how can you help remove these unconscious biases?
First, question your first impressions. Don’t stop with the first decision that comes
to your mind, especially when you’re determining whom to promote, hire, or add to the
team.
Next, justify decisions. If you’re held accountable, you’ll be less unconsciously
biased. Tell people why you decided what you did. If no one will listen to you, write it
down.
Lastly, make decisions collectively. Ask people to repeat back what they heard, and
keep each other’s unconscious biases in check.
Calling out unconscious bias can be a learning moment. So when you think you see
it, speak up! Sometimes you will be wrong, and that’s okay.
When you create a culture of data-driven decision making and removing unconscious
bias, it makes decision-making easier
Went through the SRE-SE loop recently, bookmarked the following in preparation
- https://syedali.net/engineer-interview-questions/
- https://sre.google/sre-book/effective-troubleshooting/
- https://sre.google/workbook/non-abstract-design/
- https://github.com/donnemartin/system-design-primer
- http://highscalability.com/blog/2011/1/26/google-pro-tip-use-back-of-the-envelope-calculations-to-choo.html
- https://www.aosabook.org/en/distsys.html
- https://www.brendangregg.com/Perf/linux_perf_tools_full.png
- https://netflixtechblog.com/linux-performance-analysis-in-60-000-milliseconds-accc10403c55
- https://www.brendangregg.com/usemethod.html
- https://github.com/mxssl/sre-interview-prep-guide
- https://jg.gg/2016/07/31/architecture-and-systems-design-interview/
- https://gist.github.com/hellerbarde/2843375