[uberconf 2023] Architect’s Guide to Site Reliability Engineering

Speaker: Nathaniel Schutta

@ntschutta

For more, see the table of contents


General

  • Agile – do more of what works
  • Conflicting incentives – ops says “don’t change it if it works”
  • Monoliths to service oriented to microservices – solved some problems and created new ones
  • SRE (site reliability engineer) – new role
  • We are good at giving something a new name and pretending we have never done it before. Ex: cloud computing – a big pile of compute and you get a slice of what you need – ex: mainframe
  • Everything we do involves other people. Most problems are people problems, which we tend to ignore

History

  • Goes back to the Apollo program. First SRE was Margaret Hamilton. She wanted to add error checking and did update the docs
  • Phone autocorrects on map – recalculating….
  • Traditionally, systems were run by system admins.
  • Now have hundreds/thousands of services
  • CORBA – facilitates communication between disparate systems on diverse platforms. Also a good definition for microservices. EJBs too. Then SOA
  • APIs exploded because we all have smartphones/supercomputers in our pocket
  • Amazon had policy of everything being an API

Challenges

  • Who do you page if a request goes through 20 services? It’s clearly another team’s problem
  • How to monitor
  • How to debug
  • How to even find services
  • We argue about the definition of made up words. Ex: microservice. Nate likes the definition that it can be written in two weeks. How many services can a team support? If they change a lot, 4-6. If stable, 15-20
  • How do we define an application?
  • Conflicting incentives – release often
  • The way we do things might be the first way we tried vs a better way.

SRE

  • What if we asked software devs to define an ops team?
  • Software engineering applied to operations
  • Did that with testing too; made more like dev
  • Replace manual tasks with automation
  • Many SREs former software engineers.
  • Helpful to understand Linux
  • Can’t rely on a quarterly “Review Board”. This is a very slow quality gate. However, most orgs don’t audit the gate to see if it is useful.
  • Goal: move fast, safely
  • Doesn’t happen in spare cycles “when have time”. Can’t be on call all the time or be doing tickets/manual work all the time.
  • Humans can’t do the same thing twice. Ex: golf

What does SRE do/consider

  • Availability
  • Stability
  • Latency
  • Performance
  • Monitoring
  • Capacity planning
  • Emergency response
  • Understand SLOs
  • Embrace/manage/mitigate risk. Risk is a continuum and a business decision
  • Short term vs long term thinking. Heroics work for a while, but aren’t sustainable. Often better to lower SLOs for a short time to come up with a better solution.
  • Focus on mean time to recovery. No such thing as 100% uptime.
  • Runbook is helpful. Not everyone is an expert on the system. Even if you do know it, the brain doesn’t work well in the middle of the night or under pressure. Playbook produces 2x improvement in mean time to recovery
  • People fall to level of training, react worse when stressed.
  • Alerting. Need to know what is important/critical and when it is important. Ex: can ignore car oil change message for a bit but not for too long
  • Logging best practices. Logs tend to be nothing or repeating the same thing 10x.
  • Four golden signals – latency, traffic level, error rate, saturation
  • Automate everything; manual toil drives people out of SRE
  • Which services most important
  • Establish an error budget. Can experiment when more stable. Can’t deploy when error budget used up. Helps understand tradeoffs.
  • Production readiness reviews.
  • Get everyone on same page with what service does – dev, archs, etc. Improves understanding and can find bottlenecks
  • Checklists – quantifiable and measurable items.
  • Think about how it can fail and what happens if it does
  • Chaos engineering
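
The error budget bullet above can be sketched in a few lines of Go. This is only an illustration of the idea, not something shown in the talk: the ErrorBudget type, its fields, and the sample numbers are all made up for the example.

```go
package main

import "fmt"

// ErrorBudget is a hypothetical type for illustration only.
// It tracks how many failed requests an SLO tolerates in a window.
type ErrorBudget struct {
	SLO      float64 // target success ratio, e.g. 0.999
	Total    int     // requests seen in the window
	Failures int     // failed requests in the window
}

// Allowed returns the number of failures the SLO permits for the traffic seen so far.
func (b ErrorBudget) Allowed() float64 {
	return (1 - b.SLO) * float64(b.Total)
}

// Remaining returns the unspent part of the budget; negative means overspent.
func (b ErrorBudget) Remaining() float64 {
	return b.Allowed() - float64(b.Failures)
}

// CanDeploy reports whether there is budget left to risk a release.
func (b ErrorBudget) CanDeploy() bool {
	return b.Remaining() > 0
}

func main() {
	// Made-up numbers: one million requests, 400 of them failed.
	b := ErrorBudget{SLO: 0.999, Total: 1000000, Failures: 400}
	fmt.Printf("allowed=%.0f remaining=%.0f deploy=%v\n", b.Allowed(), b.Remaining(), b.CanDeploy())
}
```

A stable service leaves budget unspent, which is the room to experiment. Once failures exceed the allowance, CanDeploy turns false and releases wait.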

Outage impact

  • What do customers expect? Used to be 5×12 – 6am-6pm M-F. Most things are 24×7 now
  • What do competitors provide? Need to do same
  • Cost – more failover is super expensive
  • When cloud goes down, it is news
  • Depends on whether it needs a redundant backup. How much revenue would you lose compared to the cost of avoiding the downtime?

Postmortems

  • Don’t want to make same mistakes. Learn from yours and others. Avoid them becoming a blame session
  • Outages will still happen; must learn from them so that bad thing doesn’t happen again
  • Living documents – status as the outage is happening, impact to business, root causes
  • Tactical vs long term/strategic fix
  • Action items to avoid in future
  • Cultural issues
  • Wheel of misfortune – role play disaster, practice
  • Recognize people for participating
  • Need senior management to encourage
  • Provide a retro on postmortem to improve process
  • Education – if you already understood that, we’d give you something harder to do

SLO/SLA

  • 99% – 7.2 hours a month, 14.4 minutes a day
  • 99.9% – 8.76 hours a year, 1.44 minutes a day
  • 99.99% – 4.38 minutes a month; 8.64 seconds a day
  • 99.999% – 26.3 seconds a month; 864 millis a day
  • Google K8S Engine availability is 99.5%. Can’t exceed your service provider
  • SLA (service level agreement) means financial consequence of missing. Otherwise, it is an SLO
  • More is not always better; can’t be infinite
  • Can tighten later; hard to loosen
  • If a system exceeds its SLA, can’t rely on that. Could stop at any time.
  • Might have internal SLO that is tighter than the advertised one.
  • Everyone wants five nines until they see the cost. “If you have to ask, you can’t afford it”
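
The downtime figures above are plain arithmetic: allowed downtime is (1 - availability) times the length of the period. A minimal sketch in Go, assuming a helper named AllowedDowntime that is my own and not from the talk:

```go
package main

import (
	"fmt"
	"time"
)

// AllowedDowntime returns how long a service may be down within period
// while still meeting the given availability (e.g. 0.999 for three nines).
// The function name is hypothetical.
func AllowedDowntime(availability float64, period time.Duration) time.Duration {
	return time.Duration((1 - availability) * float64(period))
}

func main() {
	day := 24 * time.Hour
	for _, a := range []float64{0.99, 0.999, 0.9999, 0.99999} {
		d := AllowedDowntime(a, day).Round(time.Millisecond)
		fmt.Printf("%.3f%% allows %v of downtime per day\n", a*100, d)
	}
}
```

Swapping day for a month or year length gives the other columns. Each extra nine divides the allowance by ten, which is why the cost climbs so quickly.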

Fitness functions

  • Tests to make sure the architecture still does what you want it to do
  • If you know when it breaks, can tie it back to a code change and fix it
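
One way to picture a fitness function is as a check on which layers may depend on which. The sketch below is my own illustration, not from the talk: the layer names, the forbidden rule table, and the Violations helper are all hypothetical, and a real project would collect the dependency map from its build tooling.

```go
package main

import "fmt"

// forbidden lists layer dependencies the architecture does not allow.
// The layer names are hypothetical.
var forbidden = map[string][]string{
	"domain": {"web", "database"},
}

// Violations returns every "from -> to" dependency that breaks a rule.
func Violations(deps map[string][]string) []string {
	var out []string
	for from, tos := range deps {
		for _, to := range tos {
			for _, bad := range forbidden[from] {
				if to == bad {
					out = append(out, from+" -> "+to)
				}
			}
		}
	}
	return out
}

func main() {
	deps := map[string][]string{
		"web":    {"domain"},
		"domain": {"database"}, // breaks the rule above
	}
	fmt.Println(Violations(deps))
}
```

Run on every build, a check like this flags the commit that introduced the unwanted dependency.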

Next steps

  • Build an SRE team if don’t have one
  • Applications changing rapidly
  • Need to enable environment to move fast and safely
  • Must work well together

My take

Nate’s style is a ton of slides with mostly a few words or a sentence on each. It’s a fun style. It also means the font size is super large and I don’t need to wear my glasses for most slides. I had to step out for a few minutes for the restroom. [I’ve been doing an excellent job hydrating!] Hard to step out, but easy to catch up when I came back.

[uberconf 2023] go(lang) for java developers workshop

Speaker: Ken Sipe

For more, see the table of contents

@kensipe


General

  • Normally this is a workshop. Today’s session is abridged
  • Lots in lab that aren’t in the slides/presentation
  • Labs: https://github.com/kensipe/go-labs
  • Solutions: https://github.com/codementor/wman

Go

  • “Golang” is only for searching. Call it Go when talking
  • Google made Go in 2009 and started using it in 2010
  • Killer app – advanced cloud/k8s
  • 1.0 released in 2012
  • Release every six months
  • Can run online – https://go.dev/tour/welcome/1
  • VSCode has good plugin. IntelliJ makes a Go IDE – GoLand. IntelliJ plugin for Go not same as GoLand

Value proposition

  • Easy to use – productivity
  • Efficient execution
  • Type safe
  • Latency free garbage collection
  • Fast compilation
  • Native binary, but run on multiple platforms

More about Go

  • Some inspiration from C but not derived from the C family
  • Language simplicity – no inheritance/assertions/method overloading/classes (generics were also absent until Go 1.18)
  • Implement interface
  • Type inference on assignment
  • Have structs, but not OO

Syntax

  • format is defined. {} must be positioned right. { must be on same line as predecessor
  • No semicolons

Differences from Java

  • No classes
  • Main doesn’t have ceremony – func main
  • In Java, the project contains the src tree. In Go, you clone the whole project into your src tree (GOPATH). In later versions of Go, can scatter if want.
  • Functions in files – less of a convention
  • return used a lot more than in Java – want to fail fast/get out of method as quickly as possible
  • Naming controls behavior. name is private; Name is public
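
Two of the differences above, early returns and naming that controls visibility, are easy to see side by side. A small sketch, with Describe and format as hypothetical names of my own:

```go
package main

import (
	"errors"
	"fmt"
)

// Describe is exported because its name starts with an upper-case letter.
// It returns early on each failure instead of nesting if/else blocks.
func Describe(name string, age int) (string, error) {
	if name == "" {
		return "", errors.New("name is required")
	}
	if age < 0 {
		return "", errors.New("age must not be negative")
	}
	return format(name, age), nil
}

// format is unexported (lower-case), so only this package can call it.
func format(name string, age int) string {
	return fmt.Sprintf("%s (%d)", name, age)
}

func main() {
	fmt.Println(Describe("Ada", 36))
	fmt.Println(Describe("", 36))
}
```

The happy path ends up at the bottom of the function with no indentation, which is the usual shape of Go code.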

Go commands and files

  • go mod init – creates a go project in the current directory (assuming in GOPATH)
  • go mod init github.com/user/proj – creates a new go module
  • Creates go.mod file listing the Go version by default
  • go.mod lists required dependencies once you add them
  • go.sum – checksums for dependencies
  • go doc – command line help. ex: go doc fmt or go doc fmt.Println
  • many linters as options
  • go mod verify – checks checksums
  • go mod tidy – remove any unused dependencies

Hello World

package main

import "fmt"

func main() {
	fmt.Println("Hello World")
}

Packages

  • Can import one at a time or a list – like namespaces
  • Built in packages: https://pkg.go.dev/std
  • Reusable packages: http://awesome-go.com
  • Ken makes three sections, ordered by likelihood of changing (dependencies most likely to change are listed later) and alphabetized within each group. GoLand handles this ordering.
import "fmt"
import "net/http"

vs

import (
  "fmt"
  "net/http"
)

Types

  • bool
  • string
  • int, int8, int16, int32, int64
  • uint, uint8, uint16, uint32, uint64, uintptr
  • byte (alias for uint8)
  • rune (alias for int32)
  • float32, float64
  • complex64, complex128
  • Array
  • Slice – initially, think of as a way of getting to an array. Can pass part of an array via a slice. Window view onto array.
  • Struct – create own structure
  • Pointer
  • Function
  • Interface
  • Map
  • Channel – stream of things, can be buffered (bounded) or unbuffered, can add events to channel
  • if just use “int”, get default for your platform
  • defining size matters for certain apps

Variables

  • var name string
  • name = "value"
  • var a, b, c int
  • var name = "hello"
  • var a, b, c = 1, 2, 3
  • var a, b = 1, true – don’t mix types on one line
  • Common to have terse variable names when very small scope
  • Leave out the type if it can be inferred
  • message := "hello" – most common way for creating a variable, can shadow variable if exists in different scope
  • a, b, c := 1, 2, 3
  • x, y := FunctionReturningTwoThings()
  • Will not compile if define variable that is not used
  • _, y := FunctionReturningTwoThingsWhereDoNotCareAboutFirstOne()
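
Putting the declaration forms above into one runnable file makes them easier to compare. MinMax is a hypothetical helper standing in for the “function returning two things”:

```go
package main

import "fmt"

// MinMax is a hypothetical helper that returns two values.
func MinMax(a, b int) (int, int) {
	if a < b {
		return a, b
	}
	return b, a
}

func main() {
	var name string // declared with the zero value ""
	name = "value"
	var count = 3      // type inferred as int
	message := "hello" // short declaration, the most common form
	lo, hi := MinMax(7, 2)
	_, biggest := MinMax(4, 9) // blank identifier discards the first result
	fmt.Println(name, count, message, lo, hi, biggest)
}
```

Deleting the final Println makes the file stop compiling, since every declared variable would then be unused.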

Pass by reference

  • Pass by value by default. If want pass by reference, need to pass a pointer
  • foo(&a) – to pass pointer
  • func foo(count *int) – takes pointer
  • *count – use the thing the pointer references (ex: int)
  • count – gives you actual pointer
  • Strongly typed – can’t use nil where a non-pointer type (ex: string) is declared. Can if it is a pointer
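
A quick sketch of the difference between passing a value and passing a pointer. The two function names are my own, for illustration:

```go
package main

import "fmt"

// incrementCopy receives a copy, so the caller's variable is unchanged.
func incrementCopy(count int) {
	count++
}

// incrementPointer receives a pointer and changes the caller's variable.
func incrementPointer(count *int) {
	*count++ // *count is the int the pointer refers to
}

func main() {
	a := 1
	incrementCopy(a)
	fmt.Println(a)
	incrementPointer(&a) // &a takes the address of a
	fmt.Println(a)
}
```

Only the second call is visible to the caller, because only it was handed the address of a.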

User Defined types

  • type MyType string – new named type based on an existing type. Can add functions to your type.
type MyType struct {
  name string
  value string
}

var mine = MyType { "n", "v" }
var longer = MyType { name: "n", value: "v" }
var evenLonger = MyType{}
evenLonger.name = "n"
fmt.Println(mine.name)
  • When use longer format to create, order doesn’t matter when passing in values and can leave some out.

Const

const (
 PI = 3.14
 other1 = iota
 other2 = iota
)

iota is successive integers, for when you don’t care about the actual numbers
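
One detail worth seeing in code: iota counts lines within the const block starting at zero, even when the first line does not use it. A small sketch with made-up constant names:

```go
package main

import "fmt"

// iota is the index of each line in the block, starting at zero, so it is
// 1 on the second line and 2 on the third even though the first line
// does not use it.
const (
	Pi     = 3.14
	Other1 = iota
	Other2 = iota
)

// The more common form: start with iota and let later lines repeat it.
const (
	Red = iota
	Green
	Blue
)

func main() {
	fmt.Println(Pi, Other1, Other2)
	fmt.Println(Red, Green, Blue)
}
```

The second block is how enum-like constants are usually written in Go.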

Map

myMap := map[string]int {
  "n": 4,
}

myMap["n"]

Last comma needed to compile. This avoids changing lines in version control that weren’t affected by the change.
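
A slightly longer map sketch, showing a lookup of a missing key and the two-value form of a lookup. Count is a hypothetical helper, not from the session:

```go
package main

import "fmt"

// Count returns how often each word appears.
func Count(words []string) map[string]int {
	counts := map[string]int{}
	for _, w := range words {
		counts[w]++ // a missing key reads as the zero value, 0
	}
	return counts
}

func main() {
	counts := Count([]string{"go", "java", "go"})
	fmt.Println(counts["go"], counts["rust"])
	// The second result reports whether the key was present.
	if n, ok := counts["java"]; ok {
		fmt.Println("java seen", n)
	}
}
```

Reading a key that is not in the map does not fail; it yields the zero value for the value type.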

Function

  • os.Args – to get args passed to main
  • fmt.Println(a, b, c)
  • func Mine(name string) string – return one string
  • func Mine(name string) (string, string) – return two strings
  • return "a", "b" – return two things
  • func Mine(name ...string) – takes a slice of strings; can call like varargs in Java
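
The multiple-return and varargs bullets above, as a runnable sketch. Split and Join are hypothetical names of my own; strings.Cut and strings.Join come from the standard library:

```go
package main

import (
	"fmt"
	"strings"
)

// Split returns two strings: the text before and after the first dash.
func Split(s string) (string, string) {
	before, after, _ := strings.Cut(s, "-")
	return before, after
}

// Join takes any number of strings; inside the function names is a []string.
func Join(names ...string) string {
	return strings.Join(names, ", ")
}

func main() {
	a, b := Split("left-right")
	fmt.Println(a, b)
	fmt.Println(Join("x", "y", "z"))
	parts := []string{"p", "q"}
	fmt.Println(Join(parts...)) // spread an existing slice
}
```

The trailing ... on parts is the one piece with no Java equivalent: it passes an existing slice as the variadic arguments.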

Can name the return variables. A bare “return” then returns their current values. Any you don’t set keep their zero value (nil for pointers and interfaces).

func Mine() (x string, y string) {
  x = "x"
  y = "y"
  return
}

Methods

  • Function only accessible on a type
func (x MyType) MethodName() {
}
var m = MyType{}
m.MethodName()

Interfaces

  • uses duck typing – if a type has the methods of an interface, it is that interface
  • interface{} – no functions so everything implements
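
Methods and duck typing fit together in a few lines. In this sketch the Greeter interface, the Greet method, and the Welcome function are hypothetical names. MyType follows the struct from the user defined types section, trimmed to one field:

```go
package main

import "fmt"

// Greeter is satisfied by any type with a Greet() string method.
type Greeter interface {
	Greet() string
}

// MyType never mentions Greeter; there is no "implements" keyword.
type MyType struct {
	name string
}

// Greet is a method: a function with a receiver of type MyType.
func (m MyType) Greet() string {
	return "hello " + m.name
}

// Welcome accepts anything that implements Greeter.
func Welcome(g Greeter) string {
	return g.Greet()
}

func main() {
	fmt.Println(Welcome(MyType{name: "n"}))
	var anything interface{} = 42 // the empty interface holds any value
	fmt.Println(anything)
}
```

If Greet were removed, the call to Welcome would stop compiling, so the check still happens at compile time even though nothing is declared.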

My take

Good intro. I was puzzled at the beginning about scope. I thought it was a talk but he said workshop. Then the instructions about creating a key were there. Then it was a talk. Which is what I was expecting. It would have been clearer to not include “workshop” in the title slide to match both the abstract and session. Avoids that confusion. I do like that the lab was available for “later”. The references to “we will cover that later” with 20 minutes left felt like false promises.

The actual material and flow were great. I learned a lot and the callouts for what is different from Java were helpful. The examples were good for learning. I’m happy with the session and glad I attended. I just wish it was designed to be a 90 minute presentation with “see here for full” rather than trying to shoehorn something clearly much longer into 90 minutes.

He went a little over. I stopped blogging at the official end time because the rest was flying through a lot of the remaining slides too fast to follow, let alone understand/type.