Speaker: Shradha Khard
For more, see the table of contents
Notes
- Site Reliability Engineering
- Operations is a software problem.
- SRE is what you get when you treat ops as software and staff it with software engineers
- Software dev: idea -> strategy -> dev (design, code, test) -> ops (build, deploy, support) -> deliver (real world)
- Ops – maintenance, system upgrades and installs, security, compliance, cost, support help desk escalations, vendor contracts
- Conflict – dev wants new features, ops wants to make sure nothing breaks
DevOps
- SRE implements DevOps.
- SRE is a substream
- Ensures durable focus on engineering. Need to make sure the product is up and running. Spend 50% of time on automation to make sure that happens
- ex: augment S3 bucket
- See how fast you can make changes without violating the SLO
- Error budget – metric for how unreliable a system is allowed to be
- Monitoring is not just logging in the system. Need to alert and ticket too
- Change management
- Demand forecasting/capacity planning
- Provisioning
- Efficiency and Performance
- SRE doesn’t replace DevOps people who deploy to cloud
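The error budget idea above can be made concrete with a little arithmetic. This is only a minimal sketch, assuming a request-based SLO; the function names and the sample numbers are made up for illustration and were not part of the talk.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Failed requests the SLO tolerates in the measurement window.

    slo is a fraction, e.g. 0.999 for a 99.9% success target.
    """
    return round((1 - slo) * total_requests)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is violated)."""
    budget = (1 - slo) * total_requests
    return 1 - failed_requests / budget


# Hypothetical window: 1,000,000 requests against a 99.9% SLO, 250 failures so far.
allowed = error_budget(0.999, 1_000_000)
remaining = budget_remaining(0.999, 1_000_000, 250)
print(f"allowed failures: {allowed}, budget remaining: {remaining:.0%}")
```

While budget remains, changes can keep shipping. Once it is spent, the focus shifts to reliability work.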
Enabling SRE/How to Start
- Centralized SRE team (core platform, networking)
- Embedded (full team members of project team, teach devs how to manage, work with core team)
- Need same skillset as dev to be SRE
Metrics
- MTTR – mean time to recovery – how long to get system healthy again. Emergency response helps with this
- Lead time to release or rollback
- Improve monitoring to catch and detect issues earlier
- Establish error budget to have budget-based risk management
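As a rough illustration of the MTTR metric above, it is the average time between an incident starting and the system being healthy again. This is a sketch only; the incident timestamps are invented and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (recovered - started) across incidents."""
    durations = [recovered - started for started, recovered in incidents]
    return sum(durations, timedelta()) / len(durations)


# Made-up incident log: (started, recovered)
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 15, 30)),
]
print(mean_time_to_recovery(incidents))
```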
Service levels
- SLA (service level agreement) – legal agreement. Often involves compensation if not met
- SLO (service level objective) – target value the SLI should meet. Falling short means improvement is needed
- SLI (service level indicator) – metric over time. Quantitative measure – ex: throughput, latency, error rate, utilization
- 3 nines (99.9%) – 10 minutes per week, 8.8 hours per year
- 4 nines – 1 minute per week, 52 minutes per year
- 5 nines – 6 seconds per week, 5 minutes per year
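The downtime figures for the nines follow directly from the availability percentage. Here is a small sketch of the arithmetic, assuming a 365-day year; the function name is mine and was not from the talk.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Downtime permitted in the period at the given availability."""
    return period * (1 - availability_percent / 100)


week = timedelta(weeks=1)
year = timedelta(days=365)  # assumption: non-leap year

for nines in (99.9, 99.99, 99.999):
    print(nines, allowed_downtime(nines, week), allowed_downtime(nines, year))
```

For 99.9% this works out to roughly 10 minutes per week and just under 8.8 hours per year, which matches the numbers above.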
Incident Management
- Goals: Restore service to normal and minimize business impact
- Be able to get the people who can help solve it
- Log of events so can see when started
- Blameless post mortems
Books
- Google book “Seeking SRE”
- Google book “The Site Reliability Workbook”
- Book: Implementing Service Level Objectives
My take
There was a lot of info, but it was easy to follow. It was great to see a structured intro vs the random things I’ve read online