how containers have panned out – adrian trenaman – qcon

For more QCon posts, see my live blog table of contents. Adrian is from Gilt.

History

  • No off the shelf software to run a flash sale business. Therefore Gilt has to do something custom.
  • Started with Ruby on Rails in 2007. Didn’t scale well enough
  • Moved to Java in 2011
  • Moved to microservices in 2015
  • In a 30 day period, moved bulk of Gilt to Amazon

Problems

  • Isolation problem – nobody should be able to take down someone else’s work
  • A noon outage in 2013 – what happened
  • Impedance mismatch problems. “Developers often think of machines as something that’s all theirs, magically provided by the hardware fairy.”

Machines for Gilt Japan

  • Run 20-40 containers per machine.
  • Load balancer between two racks of three boxes each.
  • Separate machines for the database and email.
  • From developer’s point of view, a machine is a machine.

What did Gilt Japan learn

  • Scalable by time of day
  • Solves impedance mismatch – developers see “a machine”
  • Limits damage one person can do
  • Infra/Devops engineer embedded into engineering team
  • Outstanding potential problems
    • Static infrastructure
    • Resource hogging

Docker topology

  • Dark canary – only for internal use
  • Canary – First prod install. Let it run for a while (e.g., through a noon cycle for Gilt)
  • Release – Once happy with canary, roll it out to other nodes
  • Gilt has a lot of read only traffic which limits damage you can do and reduces need for staging environment.
  • Gilt has one container per host/EC2 instance
  • Want to have as few moving parts/risk points in deployment process
  • “We could solve this now, or just wait six months and Amazon will provide a solution”
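
The dark canary, canary, release progression above can be sketched as a tiny state machine. This is only an illustration, not Gilt's tooling: the stage names come from the talk, while `Stage`, `next_stage`, and the `healthy` flag are hypothetical names made up for this sketch.

```python
from enum import Enum


class Stage(Enum):
    DARK_CANARY = 1  # internal use only
    CANARY = 2       # first production install
    RELEASE = 3      # rolled out to the other nodes


def next_stage(current: Stage, healthy: bool) -> Stage:
    """Promote one stage at a time, and only while the current stage is healthy."""
    if not healthy or current is Stage.RELEASE:
        return current  # hold position: unhealthy, or already fully released
    return Stage(current.value + 1)
```

In practice "healthy" would mean something like "survived a noon cycle without problems"; how that is measured is left open here.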

Projects

  • ION Roller
    • Immutable deployment – Destroy original cluster when done with this process for Docker upgrades.
    • Slow to set up/tear down environments.
    • Can be expensive for continuous deployment
    • Open source, but in house.
  • Nova
    • Uses YAML to deploy
    • No Docker registry. Base images are on Docker. Releases aren’t needed on there so go straight to Amazon
    • Less boilerplate
    • Immutable deployment on mutable infrastructure. Docker container is immutable.
  • Fighting bit rot, chaos-monkey style
    • Don’t want things to run forever in Prod.
    • What if there is a security vulnerability
    • Every day, randomly kill an instance running the oldest AMI. This forces a replacement on the latest AMI (with fixes) and makes failures show up early.
    • Doesn’t solve vulnerability in Docker container. Would need new release with new base image for that. Hasn’t happened to Gilt yet.
  • Sundial
    • For running batch jobs
    • Automatically reschedules a job if it fails
    • Define a process – group of tasks with dependencies between them
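
The Sundial idea of a process (a group of tasks with dependencies, retried on failure) can be sketched with Python's standard `graphlib`. This is not Sundial's API: `run_process`, `tasks`, `deps`, and `max_attempts` are hypothetical names for illustration, and retrying immediately is a simplification of real rescheduling.

```python
from graphlib import TopologicalSorter


def run_process(tasks, deps, max_attempts=3):
    """Run tasks in dependency order, retrying a task that fails.

    tasks: {name: zero-argument callable}
    deps:  {name: set of task names that must finish first}
    Returns the task names in the order they completed.
    """
    graph = {name: deps.get(name, set()) for name in tasks}
    completed = []
    # static_order() yields each task only after all of its dependencies.
    for name in TopologicalSorter(graph).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the last attempt
        completed.append(name)
    return completed
```

A process with a `load` task that depends on `extract` would be declared as `run_process({"extract": ..., "load": ...}, {"load": {"extract"}})`.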

EC2

  • Less configuration
  • Automatic rollout
  • Integrations
  • IAM roles are at instance level, not container level

Using Docker as a local build platform

  • Different projects use different versions of build tools
  • Docker can be used as a versioned build container.
  • A year from now, you will still have everything needed to run the code

Lessons

  • Containers let you separate what you deploy from how/where you deploy it
  • Still the wild west on how containers are deployed
  • Seek immutability in the container, not in the stack
  • The competitive advantage for Gilt is being able to deploy quickly/frequently/safely to production and therefore innovate faster. Gilt lets engineers deploy whenever they want without asking permission.

unprivileged containers – jessie frazelle – qcon

Today

  • Docker typically runs as a privileged user.
  • Containers are meant to limit the damage from a compromise. (The world an attacker can see inside the container is a limited one.)
  • Want unprivileged containers so you don’t need sudo/privileged access to launch a container in the first place.

Chrome sandbox on Linux

  • uses seccomp, namespaces, AppArmor.
  • doesn’t need to be run as root.
  • each tab is in its own namespace – process only knows about itself
  • if Chrome can do this, why not Docker

General notes

  • cgroups (control groups) limit what resources a process can use and how much.
  • Each time you docker build something it spawns a new container. Just blocking things wholesale would cause issues here.
  • I had trouble following what was current/future in the examples.

Future

  • Won’t need to run as root
  • Can customize sandboxes from defaults, better UX for dealing with security policies.
  • “postgres should maintain a postgres profile”

Impression: A lot of this was recorded demos (typed commands shown as a graphic/video that plays out). For the namespaces, it was helpful seeing the examples. For the Docker part, some of it went over my head. I only know a little Docker. And my system admin Linux isn’t strong enough to understand the implications of everything she brought up either. I still got something out of it though. And learned things that would be interesting to read more about.

the bad things happen when you’re not looking – ryan huber – qcon

See the live blog table of contents. Gist is posted at https://goo.gl/ZAxCnH (GitHub login required)

Ryan was the first security employee at Slack. He is doing an experiment where a red slide means don’t take pictures or tweet about the slide. I really like that idea. It makes speaker intent clear.

How you find out about a problem

  • Don’t want to find out from Brian Krebs that you’ve been breached
  • Don’t want hackers to tell you something strange is going on. They are done at that point and are showing off
  • Even worse – don’t notice

General Notes

  • Time to detect is important metric
  • Credential theft is the biggest (or one of the biggest) threats
  • Goal – watch as many things as possible, but don’t be a dashboard. Want as little as possible on the dashboard. If it is mostly empty, things will get noticed when they are there.
  • Bad model – NetCool – trains people to acknowledge all alerts, so they miss things out of habit
  • The defender’s advantage – if the attackers don’t know what you are looking for (your trip wires), they don’t know what to avoid
  • “Zero days are not invisibility cloaks” – other boxes can pick up on it
  • The hypothetical malicious insider – a former security team member has a lot of knowledge. And an insider with credentials has access
  • Don’t overwhelm users. Confirm bulk actions in bulk not one at a time.
  • Canaries – need to validate monitoring, recording, etc.
  • Do table top red team exercises if not doing real ones.

Slack Security

  • Set up a reliable logging platform
    • RELP (reliable event logging protocol)
    • streamstash/logstash -> Elasticsearch (Splunk is superior but costs more)
    • Two weeks of data is about 2 terabytes of logged data. Almost never sits on disk
  • auditd – part of Linux. Run auditctl commands and kernel looks for matching events.
  • audisp – works with auditd to transform data
  • osquery – Facebook project for system monitoring using SQL
  • ElastAlert – Yelp project to pick up on Elasticsearch events. Runs queries on a timer against Elasticsearch.
  • AlertCenter – have SecurityBot looking at alerts. SecurityBot posts to Slack asking the user to type “acknowledge” on their phone to confirm the action. That way, you know they have the phone and not just the Slack account. If no reply in X hours, goes to PagerDuty. Automated triage to avoid a flood of data. Instead of the security team looking at all alerts, the whole company is helping. This means the security team responds to fewer than 5 alerts a day.
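
The acknowledge-or-escalate flow described for AlertCenter can be sketched as a single decision function. This is an assumption-laden illustration, not Slack's implementation: `triage` and its parameters are hypothetical names, and the timeout is left as a parameter because the talk only said "X hours".

```python
from datetime import datetime, timedelta
from typing import Optional


def triage(alert_time: datetime, ack_time: Optional[datetime],
           now: datetime, timeout: timedelta) -> str:
    """Decide what to do with an alert the bot sent to a user.

    "resolved" - the user acknowledged before the deadline
    "escalate" - the deadline passed without a timely acknowledgement
    "waiting"  - still inside the window, nothing to do yet
    """
    deadline = alert_time + timeout
    if ack_time is not None and ack_time <= deadline:
        return "resolved"
    if now > deadline:
        return "escalate"  # hand off to the paging system
    return "waiting"
```

Only the "escalate" outcome needs a human on the security team, which is the point of the automated triage.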

Rules

  • Listeners – specific events
  • Time awake – nobody is awake for 24 hours. Trigger an alert when this happens
  • GeoIP – Doesn’t work perfectly. T-Mobile has a feature that lets you travel abroad without paying roaming. This works by routing some traffic through Texas, so your location keeps jumping between Texas and abroad
  • IPs – fewer unique IPs than you’d think. Worth looking at when a user comes from a new IP.
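
Two of the rules above, "time awake" and "new IP", can be sketched in a few lines. This is a rough illustration rather than the rules Slack runs: the function names are hypothetical, and the 4-hour gap that counts as "the user slept" is an assumed threshold, not a figure from the talk.

```python
from datetime import datetime, timedelta


def awake_too_long(event_times, max_gap=timedelta(hours=4),
                   limit=timedelta(hours=24)):
    """Return True if activity runs for `limit` or longer with no pause
    longer than `max_gap` between consecutive events."""
    streak_start = previous = None
    for t in sorted(event_times):
        if previous is None or t - previous > max_gap:
            streak_start = t  # a long pause: assume the user slept
        previous = t
        if t - streak_start >= limit:
            return True
    return False


def is_new_ip(known_ips, ip):
    """Return True the first time `ip` is seen for a user, and remember it."""
    if ip in known_ips:
        return False
    known_ips.add(ip)
    return True
```

A `True` from either function would be a reason to alert, not proof of compromise; as the GeoIP note shows, legitimate behaviour can look odd too.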