State of Chaos Engineering
Speaker: Nora Jones
Known ways to test for availability
- Unit test
- Integration test
- Regression test
- Chaos Engineering – much less common. Doesn’t replace need for “traditional” tests
Chaos Engineering
- Most people know what Chaos Monkey is; far fewer know about Chaos Engineering. The former is a tool; the latter is a strategy. The former is mature; the latter is emerging
- “chaos” means different things to different companies. Common things: experimenting, distributed system, make system stronger through experiments
- Goal is to run chaos all the time, not just on deployment
Why to start
- Can’t keep blaming your cloud provider. Need to own failure
- Failures will happen anyway. Why are we afraid of that?
- Computers are complicated and they will break
“Chaos Carol”
Introducing chaos
- Think about where you are now and the expected response
- How many people should know the chaos is intentional? It helps if people know an experiment is running.
- Define “normal” system and behavior
- Relate chaos to automated tests, SLAs and customer experiments
- Start in QA, not Prod. This establishes a baseline
- Only run during the business day
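The "only run during business day" rule can be enforced with a small guard before any experiment starts. This is a minimal sketch, not something shown in the talk: the function name `in_business_hours` and the Monday to Friday, 9:00 to 17:00 window are assumptions chosen for illustration.

```python
from datetime import datetime


def in_business_hours(now, start_hour=9, end_hour=17):
    """Return True on Monday-Friday between start_hour (inclusive) and end_hour (exclusive).

    The default window is an assumption; adjust it to your team's on-call hours.
    """
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start_hour <= now.hour < end_hour


# Example: check the current local time before starting an experiment.
if in_business_hours(datetime.now()):
    print("OK to run the experiment")
else:
    print("Outside business hours; skipping")
```

Passing `now` in as a parameter keeps the check easy to test with fixed dates.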
Ways to create chaos
- Start small – graceful restarts or degradation
- Randomly turn things off
- Recreate things that have already happened – good once you reach a steady state
Culture and implementation
- People need to understand revealing problems is good (vs causing problems)
- Start with opt in so people have control
- Monitoring is important. Use dashboards to communicate
- Automatically shut down the experiment if it goes too far astray
- Have your incident/Jira/PagerDuty tickets gone down?
- Don’t forget about your company’s customers. Focus on business goals and not causing customer pain
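The automatic shutdown idea above can be sketched as a loop that watches a metric and reverts the injected failure as soon as the metric strays too far from the baseline. Everything here is hypothetical and for illustration only: `run_experiment`, the callables `inject`, `revert` and `read_error_rate`, and the `max_increase` threshold are assumptions, not part of any tool named in the talk.

```python
def run_experiment(inject, revert, read_error_rate, baseline,
                   max_increase=0.05, checks=10):
    """Inject a failure, poll a metric, and stop early if it strays too far.

    Returns True if every check passed, False if the experiment was aborted.
    A real loop would also wait between checks; that is omitted here.
    """
    inject()
    try:
        for _ in range(checks):
            if read_error_rate() - baseline > max_increase:
                return False  # aborted: error rate rose too far above baseline
        return True
    finally:
        revert()  # always undo the injected failure, even on abort or error


# Example with stand-in callables; the third reading exceeds the threshold.
readings = iter([0.01, 0.02, 0.20])
completed = run_experiment(
    inject=lambda: print("failure injected"),
    revert=lambda: print("failure reverted"),
    read_error_rate=lambda: next(readings),
    baseline=0.01,
)
print("completed" if completed else "aborted")
```

Putting `revert()` in a `finally` block means the system is restored whether the experiment finishes, aborts, or raises.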
Cascading failure
- Try later on
- Start in QA
- May fail in unexpected ways – the tool broke QA for a week
- Problems lie dormant for a long time
Testing
- FIT – Failure Injection Testing
- F# library: https://github.com/norajones/FailureInjectionLibrary
- Types of chaos failures – exceptions, latency
- After FIT, focus on minimizing blast radius and concentrating failures
- Targeted chaos – important to have a steady state before you introduce it, so you know what was caused by the introduction
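The two failure types mentioned above, exceptions and latency, can be illustrated with a small wrapper around a function call. This is a rough Python sketch of the general idea only; it is not the API of the F# library linked above, and the names `inject_faults`, `InjectedFailure`, `failure_rate` and `latency_seconds` are assumptions made up for the example.

```python
import functools
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real error when a fault is injected."""


def inject_faults(failure_rate=0.0, latency_seconds=0.0, rng=None):
    """Decorator that adds a fixed delay and raises with a given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if latency_seconds > 0:
                time.sleep(latency_seconds)  # simulate a slow dependency
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Example: a call that is slightly delayed and fails about a quarter of the time.
@inject_faults(failure_rate=0.25, latency_seconds=0.01)
def fetch_profile(user_id):
    return {"id": user_id}


try:
    print(fetch_profile(42))
except InjectedFailure as exc:
    print(f"caller must handle: {exc}")
```

Keeping `failure_rate` low and applying the wrapper to one call at a time is one way to keep the blast radius small while the steady state is still being established.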
The choose-your-own-adventure segment was a fun series of choices for thinking about viable options, or in some cases non-viable ones.