Title: DevOps at Scale; Greek Tragedy in Three Acts
Speakers: Baruch Sadogursky (JFrog), Alena Prokharchyk (Rancher Inc)
See my live blog table of contents from Oracle Cloud
Slides: https://jfrog.com/shownotes/
General notes
- DevOps – intersection of dev + ops + QA
- DevOps – intersection of tech, process and people
Reactive ops
- Imaginary three-person company, eager to learn
- Buzzwords – serverless, no ops. Both sound like you don’t need to do anything. But…
- Basic tools all in cloud – Jira, GitHub, Travis CI, Oracle Cloud
- Challenge: many logs because of microservices
- Challenge: time zone of cloud provider vs your time zone
- Two years ago “the internet broke” – a dependency (left-pad) was taken down from the NPM registry
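The two log challenges above (many services, mismatched time zones) are often handled by normalizing every timestamp to UTC before merging. Here is a minimal sketch of that idea, not something shown in the talk. The `merge_logs` function, the service names and the log entries are all hypothetical, and it assumes each timestamp is an ISO 8601 string with an explicit UTC offset.

```python
from datetime import datetime, timezone


def merge_logs(*streams):
    """Merge log entries from several services into one UTC-ordered timeline.

    Each entry is a (timestamp, message) pair whose timestamp is an
    ISO 8601 string carrying an explicit UTC offset (an assumption).
    """
    merged = []
    for stream in streams:
        for stamp, message in stream:
            # Parse the offset-aware timestamp and convert it to UTC.
            utc_time = datetime.fromisoformat(stamp).astimezone(timezone.utc)
            merged.append((utc_time, message))
    # Sort on the normalized time so entries from all services interleave.
    merged.sort(key=lambda entry: entry[0])
    return merged


# Hypothetical entries: one service logs in local time, one in UTC.
orders = [("2018-10-22T09:15:00-04:00", "orders: request received")]
billing = [
    ("2018-10-22T13:14:30+00:00", "billing: card charged"),
    ("2018-10-22T13:16:00+00:00", "billing: receipt sent"),
]

for utc_time, message in merge_logs(orders, billing):
    print(utc_time.isoformat(), message)
```

A log tool like the ones mentioned later does this for you; the sketch only shows why the offset has to travel with each timestamp.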
Scale
- Hire ops person
- Grow staff
- Add Scrum
- Add exploratory testing [surprised hardly anyone knew what this was; it’s just trying stuff and testing]
- Developer on call
- More tools – Confluence, Artifactory, Sumologic (analyze logs; like Splunk), Pingdom (monitoring)
More maturity
- Root cause analysis – includes symptoms, what happened leading up to the problem, and next steps to prevent it from happening again
- Want to have new problems; not the same one happening over and over
- Retrospective
- Importance of disclosure – ex: GitLab lost 6 hours of data last year. They were forgiven because they were so transparent
Perfect storm
- More scale – 5 ops people, 1 performance engineer, 74 developers, chief architect, customer success team (bridge between developers and customers; like developer on call, but a bunch of them who know the system)
- SAFe (Scaled Agile Framework) – scrum at scale
- System testing
- Ops team – two ways to do devops: 1) Hire the brightest engineers in the world (like Netflix), who know dev and ops perfectly. Rare to be able to do this. 2) Specializations exist. The ops people set up everything and then evangelize it so devops can happen. Often called the tools team.
- Escalation path: SME and manager on call. The manager can work on relationships. Also makes fixing faster since people know it will be escalated to management.
- SOC 2 – regulation/audit for service organizations. Requires separation of controls so people who write code cannot deploy to prod. Can’t write code and control the system that deploys it. Interesting. Ex: the tooling team doesn’t allow skipping integration tests. So the people writing apps don’t control the deployment pipeline
- Problem: Need to find out if you have any code that uses a certain license. Lots of work to do manually. [easy if you have IQServer! Or a JFrog product; I didn’t catch which one]
- Problem: Guessing how much to spend on servers for next year. Guess. Nobody will shut down server if need more resources
- Problem: Will it scale. Guess. False confidence. It didn’t. Greek tragedy; everyone dies.
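The license problem above gets much easier once dependency metadata is in a machine-readable form. This is a rough sketch of the query, not how any of the tools mentioned actually work. The `find_by_license` function, the package names and the inventory are hypothetical, and it assumes each dependency has a single license identifier.

```python
def find_by_license(inventory, wanted):
    """Return the names of dependencies whose license matches `wanted`.

    `inventory` maps a dependency name to a license identifier.
    The match is exact but case-insensitive, so "GPL-3.0" does not
    accidentally match "LGPL-3.0".
    """
    wanted = wanted.strip().lower()
    return sorted(
        name
        for name, license_id in inventory.items()
        if license_id.strip().lower() == wanted
    )


# Hypothetical inventory; a real one would come from a build tool or scanner.
inventory = {
    "lib-a": "MIT",
    "lib-b": "GPL-3.0",
    "lib-c": "Apache-2.0",
    "lib-d": "LGPL-3.0",
}

print(find_by_license(inventory, "gpl-3.0"))
```

The hard part in practice is building and maintaining the inventory, including transitive dependencies, which is why doing it by hand is so much work.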
To avoid problems
- Performance/scalability testing
- License and security management
- Code in monitoring. Ex: Docker. By extending a base image, you get things built in.
- Tools support process – JFrog commercial :).
- Showed pie chart of where time goes. Interesting way of looking at fragmentation. The pie chart had a lot of slices!
- Can’t have a non-functional definition of done
- Majority of industry is in fire alarm/reactive improvement mode. But still strive for proactive improvement. It is hard and expensive.
Takeaways:
- Must be responsible for what you build. “You build it; you own it”
- Data is the key. Even if it is in Excel.
- Pain is instructional. Results in improvements. Continuous improvement. If something hurts, do it more often.
My take
This was a fun start. It tells a great, entertaining story and includes information. Watch the video. Neither my notes nor the slides do it justice!