failover | Down Home Country Coding With Scott Selikoff and Jeanne Boyarsky

Managing Millions of Data Services @ Heroku
Speaker: Gabe Enslein

See the list of all blog posts from the conference

AWS S3 failure

February 28, 2017 – AWS S3 outage – pager duty failed to deliver the alert
Down for about 6 hours
Heroku recovered before everyone went to bed (10pm Eastern)
Avoid failure by having failover strategies
Recovery would have taken 35 years if every task had to be done manually
No Heroku customers lost any data

Concepts

Layers of abstraction simplify development
Everything runs on hardware at some level down
Abstractions can hide the real problem
Abstractions can make problems harder to reproduce
Can model many tasks as state machines – both deterministic and non-deterministic models

“Just” implies a task is easy. Be skeptical. How easy is it to repeat? How often does that “just” come up?

Automate yourself out of a job – recurring and one off work

If you haven’t received a heartbeat in a while, you don’t know the system’s health.
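The heartbeat rule can be sketched in a few lines of Python. This is a minimal illustration, not anything shown in the talk: the `HeartbeatMonitor` class, the 30 second timeout and the service id are all assumptions. The point is that a missing or stale heartbeat maps to "uncertain" rather than to "unavailable", because silence tells you nothing either way.

```python
import time


class HeartbeatMonitor:
    """Hypothetical monitor: tracks the last heartbeat seen per service."""

    def __init__(self, timeout_seconds=30.0, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # assumed threshold, tune per system
        self._clock = clock                     # injectable so tests can fake time
        self._last_seen = {}

    def record(self, service_id):
        """Call whenever a heartbeat arrives from a service."""
        self._last_seen[service_id] = self._clock()

    def health(self, service_id):
        """Return 'available' only if a recent heartbeat was seen."""
        last = self._last_seen.get(service_id)
        if last is None:
            return "uncertain"  # never heard from it
        if self._clock() - last > self.timeout_seconds:
            return "uncertain"  # heartbeat is stale, health is unknown
        return "available"
```

Passing the clock in as a parameter keeps the staleness logic testable without sleeping.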

States
Not all states used by all systems

installing
available
uncertain
unavailable
retiring
retired
archived
terminated
restart
upgrading
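
One way to model these states is an explicit state machine with a table of allowed transitions. The sketch below is in Python and uses the state names from the list above; the transition table and the `DataService` class are assumptions made for illustration, not Heroku's actual model.

```python
from enum import Enum


class State(Enum):
    INSTALLING = "installing"
    AVAILABLE = "available"
    UNCERTAIN = "uncertain"
    UNAVAILABLE = "unavailable"
    RETIRING = "retiring"
    RETIRED = "retired"
    ARCHIVED = "archived"
    TERMINATED = "terminated"
    RESTART = "restart"
    UPGRADING = "upgrading"


# Hypothetical transition table: which states may follow which.
ALLOWED = {
    State.INSTALLING: {State.AVAILABLE, State.UNAVAILABLE},
    State.AVAILABLE: {State.UNCERTAIN, State.UPGRADING, State.RESTART, State.RETIRING},
    State.UNCERTAIN: {State.AVAILABLE, State.UNAVAILABLE},
    State.UNAVAILABLE: {State.RESTART, State.RETIRING},
    State.RESTART: {State.AVAILABLE, State.UNAVAILABLE},
    State.UPGRADING: {State.AVAILABLE, State.UNAVAILABLE},
    State.RETIRING: {State.RETIRED},
    State.RETIRED: {State.ARCHIVED, State.TERMINATED},
    State.ARCHIVED: {State.TERMINATED},
    State.TERMINATED: set(),
}


class DataService:
    """A service that can only move along transitions listed in ALLOWED."""

    def __init__(self, name):
        self.name = name
        self.state = State.INSTALLING
        self.history = [State.INSTALLING]

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.name}: {self.state.value} -> {new_state.value} is not allowed"
            )
        self.state = new_state
        self.history.append(new_state)
```

Rejecting transitions that are not in the table turns an unexpected jump into a loud error, and the recorded history helps when a problem is hard to reproduce.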

Check on

Backups
Replication
Security
Performance

Manual fixes can cause more problems than you started with. Immutable infrastructure enforces the “just”. Script the exceptions; don’t tinker manually. “Break glass” in-case-of-emergency procedures still help. Modeling emergency remedies helps too: the computer can apply the fix when it detects the problem instead of waking someone up.
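
Scripted remedies could be organized as a registry that maps a detected condition to a fix, with a human paged only when no script exists. The Python sketch below is an assumption about how that might look; the condition names, the remedy functions and the `page_human` callback are all hypothetical.

```python
# Hypothetical registry: detected condition -> scripted remedy.
REMEDIES = {}


def remedy(condition):
    """Decorator that registers a function as the fix for a condition."""
    def register(func):
        REMEDIES[condition] = func
        return func
    return register


@remedy("replica_lagging")
def rebuild_replica(service):
    # A real script would rebuild the replica; here we only describe the action.
    return f"rebuilt replica for {service}"


@remedy("disk_full")
def fail_over_to_standby(service):
    return f"failed over {service} to standby"


def handle(condition, service, page_human):
    """Run the scripted remedy if one exists; otherwise break glass and page."""
    action = REMEDIES.get(condition)
    if action is None:
        page_human(f"{service}: no scripted remedy for {condition}")
        return None
    return action(service)
```

Each time someone is paged for an unknown condition, the fix they apply is a candidate for the next registered remedy.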

Infrastructure is code, not a second class citizen. Test it for functionality, performance and regression.

Then, on March 15, 2017, there was a Linux denial-of-service and admin privilege escalation vulnerability. Heroku needed to confirm that none of its images were affected. The image can be fixed so that customers get the fix when they start up.

Key Takeaways

Automate yourself out of regular operations
Have emergency automation in place – scripts, jobs, etc
Make routine failover strategies
Treat infrastructure as full units
Abstractions have their limits

