[2024 dev2next] distributed consensus algorithms

Speakers: Mykyta Protsenko, Alex Borysov

For more, see the table of contents


Scenario: five people need to meet weekly: two in Ukraine, one on the US east coast, and two on the US west coast.

Consensus properties

  • Fault tolerance – not reliant on one person to see the result
  • Safety – only one value is chosen
  • Liveness – consensus is reached in a finite amount of time; can take multiple iterations

2 phase commit

  • One person asks; each person says yes only if they can attend. In the happy path, everyone says yes.
  • Commit, and now it's official.
  • Simple, but has downsides. When the group is big, there are a lot of acks. If even one person says no, the transaction is aborted and you start over. Also, you're always waiting for the slowest node to reply.
  • If the coordinator loses internet, everything is blocked
  • Fails on fault tolerance and liveness (see the sketch after this list)
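
Here's a rough sketch of the flow in Python (my own illustration of the meeting analogy; the `Participant` class and `two_phase_commit` function are made up, not from the talk):

```python
# Illustrative two-phase commit: one coordinator, five participants.

class Participant:
    def __init__(self, name, can_attend):
        self.name = name
        self.can_attend = can_attend  # callable: proposal -> bool

    def prepare(self, proposal):
        # Phase 1 (voting): say yes only if this person can attend.
        return self.can_attend(proposal)

    def commit(self, proposal):
        print(f"{self.name}: {proposal} is official")

    def abort(self, proposal):
        print(f"{self.name}: {proposal} aborted, start over")


def two_phase_commit(proposal, participants):
    # The coordinator waits for every ack, so it runs at the speed of
    # the slowest participant, and everything blocks if it goes offline.
    votes = [p.prepare(proposal) for p in participants]
    if all(votes):
        for p in participants:
            p.commit(proposal)  # Phase 2: commit, now official
        return True
    for p in participants:
        p.abort(proposal)       # one "no" aborts the whole transaction
    return False


people = [Participant(n, lambda t: True) for n in
          ["ua1", "ua2", "east", "west1", "west2"]]
two_phase_commit("Tuesday 16:00 UTC", people)
```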

Paxos Protocol

  • Everyone has a ballot with a unique ballot number
  • Propose a time with the next ballot
  • Participants submit a LastVote message (null if they have never voted) as a promise that they will vote. The vote itself can be for or against.
  • Once a majority promise to vote, the proposer sends the actual BeginBallot message
  • Then people actually vote
  • Consensus is a majority of votes. The proposer sends a message announcing that consensus was reached.
  • Each participant must track the last ballot tried, the promise to vote, and the actual vote
  • Choosing the proposer and choosing the ballot don't have to be done by the same person
  • A participant can only vote if it has confirmed a promise to vote in the cluster
  • Learners can observe to be notified when consensus is reached
  • Fault tolerant because a majority is enough
  • Safety because it is majority-based: any two majorities overlap in at least one node
  • Doesn't ensure liveness: two proposers can keep preempting each other with ever-higher ballots
  • Can elect member as leader who can be the only one to propose
  • 2 round trips for consensus (a rough sketch of both round trips follows this list)
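
A rough single-decree Paxos sketch, simplified to in-process calls with no networking or retries (the `Acceptor` and `propose` names are mine, not the speakers'):

```python
# Illustrative single-decree Paxos: two round trips (prepare/promise,
# then begin-ballot/vote), majority-based throughout.

class Acceptor:
    def __init__(self):
        self.promised = 0      # highest ballot we promised to honor
        self.voted_ballot = 0  # ballot of our last actual vote
        self.voted_value = None

    def prepare(self, ballot):
        # Round trip 1: promise to vote, reporting our last vote
        # (None if we have never voted).
        if ballot > self.promised:
            self.promised = ballot
            return (self.voted_ballot, self.voted_value)
        return None  # already promised a higher ballot

    def begin_ballot(self, ballot, value):
        # Round trip 2: actually vote, unless a higher ballot was promised.
        if ballot >= self.promised:
            self.promised = self.voted_ballot = ballot
            self.voted_value = value
            return True
        return False


def propose(acceptors, ballot, value):
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(ballot) for a in acceptors) if p]
    if len(promises) < majority:
        return None  # no majority promised; retry with a higher ballot
    # Safety: adopt the value of the highest-numbered previous vote, if any.
    prior_ballot, prior_value = max(promises, key=lambda p: p[0])
    if prior_value is not None:
        value = prior_value
    votes = sum(a.begin_ballot(ballot, value) for a in acceptors)
    return value if votes >= majority else None


cluster = [Acceptor() for _ in range(5)]
print(propose(cluster, ballot=1, value="Tuesday 16:00 UTC"))
```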

Cassandra

  • Uses Paxos
  • Need to know the order of data – linearizable consistency
  • Don't mix transaction types, e.g., use IF EXISTS / IF NOT EXISTS consistently
  • Lightweight transactions are faster than two-phase commit
  • They incur a performance penalty by design because of the extra Paxos interactions (example after this list)
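
A hedged example of a lightweight transaction from Python, using the DataStax cassandra-driver (the contact point, keyspace, and table are made up for illustration):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")

# IF NOT EXISTS makes this a lightweight transaction: Cassandra runs
# Paxos rounds among the replicas, which is the by-design penalty
# versus a plain write. Use IF EXISTS / IF NOT EXISTS consistently.
result = session.execute(
    "INSERT INTO meetings (id, slot) VALUES (%s, %s) IF NOT EXISTS",
    ("weekly-sync", "Tuesday 16:00 UTC"),
)
print("applied:", result.was_applied)  # whether the conditional write won
```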

Raft

  • Two message types: vote requests and append entries (also used as heartbeats)
  • Leader based. All other nodes are followers
  • No re-elections; the leader stays leader until it disappears
  • Like Paxos, uses increasing numbers (terms)
  • Every node starts as a follower. In term 1, followers notice there is no leader. One or more volunteer and increase the term number. The others vote on a leader. There is only one vote per term, so a node can't vote twice.
  • Once a leader is elected, followers send requests to the leader, who propagates them
  • Log replication – an entry can be appended (not yet final) or committed
  • Uses a commit index as a tracker of what data was committed, which lets you see the state
  • All followers have a heartbeat tracker. If the leader disappears, the one who hasn't heard from the leader for the longest time becomes a candidate to be the new leader. If a node was merely away and requests leadership, it gets rejected because there already is a leader.
  • If you're outside the cluster and want to know the status, ask the leader
  • Fault tolerance – yes, a leader or follower can drop
  • Safety – guarantees one choice. Also only commits data from the current term.
  • Liveness – in practice yes, but in theory no (a toy election sketch follows this list)
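
A toy sketch of the leader election described above, assuming in-process nodes (no networking; the heartbeat timeout winner is modeled as a random choice, and the names are mine):

```python
import random

class Node:
    def __init__(self, name):
        self.name = name
        self.term = 1
        self.voted_for = None  # at most one vote per term

    def request_vote(self, term, candidate):
        if term > self.term:            # a newer term resets our vote
            self.term = term
            self.voted_for = None
        if term == self.term and self.voted_for is None:
            self.voted_for = candidate  # only one vote for this term
            return True
        return False                    # stale or duplicate request rejected


def run_election(nodes):
    # The follower whose heartbeat timeout fires first volunteers;
    # randomized timeouts are why liveness holds "in practice".
    candidate = random.choice(nodes)
    candidate.term += 1
    candidate.voted_for = candidate.name  # votes for itself
    votes = 1 + sum(n.request_vote(candidate.term, candidate.name)
                    for n in nodes if n is not candidate)
    return candidate if votes > len(nodes) // 2 else None


nodes = [Node(n) for n in ["ua1", "ua2", "east", "west1", "west2"]]
leader = run_election(nodes)
print(leader.name if leader else "no leader this term; try again")
```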

MongoDB

  • Uses Raft
  • If a slow member is the leader, there is a write bottleneck.
  • Can horizontally scale by replica set. Can hash keys so the majority of requests aren't all on one replica (sketch after this list).
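
A sketch of spreading requests with a hashed shard key via pymongo (the connection string, database, and key names are made up; this assumes a sharded cluster is already running):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# Hashing the shard key spreads writes across shards (each shard is a
# replica set), so the majority of requests don't land on one replica set.
client.admin.command("enableSharding", "meetings")
client.admin.command("shardCollection", "meetings.events",
                     key={"user_id": "hashed"})
```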

Accord

  • New algorithm; not widely available
  • Leader-based protocols create a bottleneck
  • Fast and slow paths
  • If the fast path can get a majority, the slower nodes can be told later, even asynchronously
  • Some node must be part of every fast-path majority so it can share state with the others when they come back online
  • The fast path should be 3/4 of the nodes to guarantee someone has the latest state
  • The slow path remains a simple majority
  • ACID
  • Uses a reorder buffer to put transactions back in order based on time differentials (a quorum-size sketch follows this list)
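
A hedged sketch of the quorum arithmetic as I understood it from the talk (my interpretation of the 3/4 figure, not Accord's actual implementation):

```python
import math

def quorums(n):
    slow = n // 2 + 1            # slow path: simple majority
    fast = math.ceil(3 * n / 4)  # fast path: ~3/4 of the nodes
    return fast, slow

for n in (5, 7, 9):
    fast, slow = quorums(n)
    # Two fast-path quorums overlap in at least 2*fast - n nodes, which
    # is more than half, so some node in every fast-path majority has
    # the latest state and can catch the others up later, even async.
    print(f"n={n}: fast path {fast}, slow path {slow}, "
          f"fast-path overlap >= {2 * fast - n}")
```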

My take

I knew what two-phase commit was. Everything else was new to me. Excellent start to the morning! The five people voting made it easier to follow. The reasons for them disappearing (Ukrainian soldier, Californian losing power) also helped me pay attention. (I left a few minutes early to answer a phone call.)
