Jason Hand from Microsoft – @jasonhand
For other QCon blog posts, see QCon live blog table of contents
Definitions
- We use terms without everyone being on the same page about what they mean.
- Ex: what does “complex” mean?
- Types of systems
- Classified by whether you can determine cause and effect
- Ordered vs unordered
- Ordered – obvious (can take it apart and put it back together; you know how it works. ex: bicycle), complicated (ex: motorcycle)
- Unordered – obvious, complicated, complex (ex: people on road, human body), chaotic (ex: NYC)
- Sociotechnical systems – the people part is hard
Complex system
- Causality can only be examined/understood/determined in hindsight
- Specialists, but lack broad understanding of system
- Imperfect information
- Constantly changing
- Users good at surprising us with what system can/can’t do
Learning
- Takes time
- Takes success and failure. Need both
- Learning opportunities not evenly distributed
- Sample learning opportunities – code commits, config changes, feature releases and incident response. Commits occur much more often than incidents
- However, the cost of recovery is low for the more frequent opportunities
- High opportunity – low stakes and high frequency. Git push is muscle memory
- Low opportunity – high stakes and low frequency
- Frequency is what creates the opportunity
Incident
- Everyone would agree impacting the customer is an incident
- If it didn’t affect the customer, it’s not always called an incident.
- If it’s not called an incident, there’s no incident review.
- Missed learning opportunity
- We view incidents as bad.
- Incidents are unplanned work.
- Near misses save the day, but don’t get recognized or learned from
- Systems are continuously changing; will never be able to remove all problems from system
Techniques to learn
- Root cause analysis is insufficient. Like a post mortem, it is just about what went wrong.
- Needs to be a learning review
- Discuss language barriers, tools, confidence level, what people tried
- Discuss what happened by time and the impact
- ChatOps is better than a phone bridge because it captures what happened. Nobody is going to transcribe later. Having a clean channel for communication helps.
- However, incidents are not linear.
- Book: Overcomplicated
- If someone just does one thing, the learning doesn’t transfer. Need operational knowledge and mental models
Learning Reviews
- Set context – not looking for answers/fixes. Looking for ways to learn even if no action items
- Set aside time/effort to be curious
- Asking linear questions (ex: five whys) doesn’t get you to the reality of the system
- Invite people who weren’t part of incident response. They should still learn and can provide info about system
- Understand and reduce blind spots
My impression
Good talk. It’s definitely thought-provoking, and it suggests small things one can do to start making things better.