[uberconf 2023] Architect’s Guide to Site Reliability Engineering

Speaker: Nathaniel Schutta

@ntschutta

For more, see the table of contents


General

  • Agile – do more of what works
  • Conflicting incentives – ops – “don’t change if works”
  • Monoliths to service oriented to microservices – solves some problems and created new ones
  • SRE (site reliability engineer) – new role
  • We are good at giving something a new name and pretending have never done before. Ex: cloud computing – big pile of commute and slice of what need – ex: mainframe
  • Everything we do involve other people. Most problems are people problems, which we tend to ignore

History

  • Goes back to Apollo problem. First SRE was Margret Hamilton. She wanted to add error checking and did update the docs
  • Phone autocorrects on map – recalculating….
  • Traditionally systems run by system admins.
  • Now have hundreds/thousands of services
  • CORBA – facilitate communicate for disparate systems on diverse platforms. Also a good definition for microservices. EJBs too. Then SOA
  • APIs exploded because we all have smartphones/supercomputers in our pocket
  • Amazon had policy of everything being an API

Challenges

  • Who page if go through 20 service. It’s clearly another team’s problem
  • How monitor
  • How debug
  • How even find services
  • We argue about definition of made up words. Ex: microservice. Nate likes definition that it can be written in two weeks. How many services can a team support? If change alot, 4-6. If stable, 15-20
  • How do we define an application?
  • Conflicting incentives – release often
  • The way we do things might be the first way we tried vs a better way.

SRE

  • What if we asked software devs to define an ops team?
  • Software engineering applied to operations
  • Did that with testing too; made more like dev
  • Replace manual tasks with automation
  • Many SREs former software engineers.
  • Helpful to understand Linux
  • Can’t reply on quarterly “Review Board”. This is a very slow quality gate. However, most orgs don’t audit gate to see if useful.
  • Goal: move fast, safely
  • Doesn’t happen in spare cycles “when have time”. Can’t be on call all the time or be doing tickets/manual work all the time.
  • Humans can’t do the same thing twice. Ex: golf

What does SRE do/consider

  • Availability
  • Stability
  • Latency
  • Performance
  • Monitoring
  • Capacity planning
  • Emergency response
  • Understand SLOs
  • Embrace/manage/mitigate risk. Risk is a continuum and a business decision
  • Short term vs long term thinking. Heroics works for a while, but isn’t sustainable. Often better to lower SLOs for a short time to come up with better solution.
  • Focus on mean time to recovery. No such thing as 100% uptime.
  • Runbook is helpful. Not everyone is an expert on the system. Even if do know, brain doesn’t work well in middle of the night or under pressure. Playbook produces 2x improvement in mean time to recovery
  • People fall to level of training, react worse when stressed.
  • Alerting. Need to know what is important/critical and when it is important. Ex: can ignore car oil change message for a bit but not for too long
  • Logging best practices. Logs tend to be nothing or repeating the same thing 10x.
  • Four golden signals – latency, traffic level, error rate, saturation
  • Automate everything; manual toil drives people out of SRE
  • Which services most important
  • Establish an error budget. Can experiment when more stable. Can’t deploy when error budget used up. Helps understand tradeoffs.
  • Production readiness reviews.
  • Get everyone on same page with what service does – dev, archs, etc. Improves understanding and can find bottlenecks
  • Checklists – quantifiable and measurable items.
  • Think about how it can fail and what happens if it does
  • Chaos engineering

Outage impact

  • What do customers expect? Used to be 5×12 – 6am-6pm M-F. Most things are 24×7 now
  • What do competitors provide? Need to do same
  • Cost – more failover is super expensive
  • When cloud goes down, it is news
  • Depends if needs redundant backup. How much venue lose than cost of being down?

Postmortems

  • Don’t want to make same mistakes. Learn from yours and others. Avoid them becoming a blame session
  • Outages will still happen; must learn from them so that bad thing doesn’t happen again
  • Living documents – status as the outage is happening, impact to business, root causes
  • Tactical vs long term/strategic fix
  • Action items to avoid in future
  • Cultural issues
  • Wheel of misfortune – role play disaster, practice
  • Recognize people for participating
  • Need senior management to encourage
  • Provide a retro on postmortem to improve process
  • Education – if you already understood that, we’d give you something harder to do

SLO/SLA

  • 99% – 7.2 hours a month, 14.4 minutes a day
  • 99.9% – 8.76 hours a year, 1.44 minutes a day
  • 99.99% – 4.38 minutes a month; 8.66 seconds a day
  • 99.999% – 4.38 minutes a month; 864 millis a day
  • Google K8S Engine availability is 99.5%. Can;t exceed service provider
  • SLA (service level agreement) means financial consequence of missing. Otherwise, it is a SLO
  • More to always better; can’t be infinity
  • Can tighten later; hard to losen
  • If a system exceeds their SLA, can’t rely on that. Could stop at any time.
  • Might have internal SLO that is tighter than the advertised one.
  • Everyone wants five nines until they see the cost. “If you have to ask, you can’t afford it”

Fitness functions

  • Tests to make sure architecture still does what want it to do
  • If know when breaks, can tie back to code change and fix

Next steps

  • Build an SRE team if don’t have one
  • Applications changing rapidly
  • Need to enable environment to move fast and safely
  • Must work well together

My take

Nate’s style is a ton of slides with a mostly few words/sentence on each. It’s a fun style. It also means the font size is super large and I don’t need to wear my glasses for most slides. I had to step out for a few minutes for the restroom. [I’ve been doing an excellent version hydrating!] Hard to step out, but easy to catch when came back.