-
Notifications
You must be signed in to change notification settings - Fork 2
replication
2020 Sep 05
Simplify epaxos.
Major simplifications and optimizations with comparison to epaxos:
-
Get rid of the livelock issue with infinite SCC.
-
This protocol is very simple, only two message: Prepare and Accept. epaxos has 2 for replication, 3 for recovery.
-
Recovery is the as simple as a Prepare and standard replication protocol. Removed
defered recoveryfrom recovery. Instead, one Prepare+Accept is quite enough. -
Move
seqintodepsthus recovery doesnot need to considerseq. -
Accept-ed, Commit-ed status now are per-dep field.
-
In this paper
Ballotis a replica attribute, instead of an instance attribute. -
Simplify the proof of correctness.
-
R0,R1... orR[0],R[1]... : replica. -
a,b...x,y... : instance. -
a₀: the value ofawhenais initiated on its leader. -
La,Lb: is the leader replica of an instanceaorb. -
f: number of max allowed failed replica that. -
n: number of replicas,n = 2f+1. -
Qc: classic quorum:|Qc| = f+1. -
Qf: quorum of fast round:|Qf| = f+⌊|Qc|/2⌋ = ⌊(3n-1)/4⌋ -
a ~ b: interfere:aandbcan not exchange execution order in any instance(command) sequence. -
a → b: relation depends-on:adepends-onb. -
a < b: relation do-not-know:adoesnot seeb. -
a: → b: instance status depends-on: instanceathat depends-onb. -
a: < b: instance status do-not-know: instanceathat doesnot seeb. -
a ↦ b: relation exec-depends-on:aexecute afterb.
-
G-consistency: there has to be at least one relation between two interfering instances
a, bcommitted. Otherwisea, bmay be executed in different order two replicas. -
G-exec-linearizability: If
ais proposed afterbis committed,ais always executed afterb. -
Safety
The entire instance space is a 3d array:
R[i][j][idx]
Fields:
- i: replicaID.
- j: replicaID.
- idx: index of a instance.
Explain:
-
R[i]: all data on replicai -
R[i][j]: instances initiated by replicajthose are stored on replicai. -
R[i][j][idx]:idx-th instance initiated by replicaj.
| |
| |
| |
| c f c f c f |
| a b e a b e a b e |
| ---------------- ---------------- ---------------- |
| leader: [0] [1] [2] [0] [1] [2] [0] [1] [2] |
| ================ ================ ================ |
| replica: R[0] R[1] R[2] |
Ballot is same as the ballot in paxos, except it is a vector and is a barrier of both two instance and the relation between them.
Every replica has a attribute term, and a ballot is vector of the two leaders of a relation:
blt = {Ra: Ra.term, Rb:Rb.term}.
A replica stores terms of all replicas, playing the same role as last_blt
in paxos: Ri.last_blt = {R0: R0.term, R1: R1.term, ...}.
If the request blt is not greater-or-equal the replica last_blt, the reqeust
is declined, i.e., blt[a] >= last_blt[a] and blt[b] >= last_blt[b].
Ballot is used to identify a leader, just in the way paxos does.
A recovery process Prepare with a higher blt on a quorum to become the new leader.
E.g.:
Initially La Lb have ballot (1, 2), for Prepare and Accept request.
If a recovery process P1 starts to recover b, it increment ballot of Rb to 3.
Then it tries to recover b with ballot (1, 3).
Another recovery process P2 with ballot (2, 2) would block P1, and P1
also blocks P2.
type Ballot HashMap<ReplicaID, i64>;
An instance is the internal representation of a client request.
type InstanceID(ReplicaID, i64)
type View HashMap<InstanceID>
type Dep {
id: InstanceID;
accepted: Ballot;
committed: bool;
view: View;
}
type Instance {
id: InstanceID;
ballot: Ballot;
accepted_ballot: Ballot;
cmds: Vec<Commands>;
view: View;
deps: HashMap<InstanceID, Dep>;
}
-
id: is the instance id. -
ballot: is the highest seen ballot, -
accepted_ballot: is the highest ballot of Accept-ed request. -
cmdsis the commands a client wants to execute. -
viewis all instances an instance see and is used to determines order between instances when there is cycle of depends-on.a.view = {a} ∪ a.deps[0].view ∪ a.deps[1].view ∪ ...In practice, a simple int typeseqis able to play the same role asview. -
depsis the set of intefering instances thatadepends-on:a.deps = {x : a → x}with additional paxos related info.Two dependencies are different if:
x.id != y.id or x.view != y.view
TODO when handling Prepare, the seq must be the max of all interfering instances, not only the max one. Because former instance may be committed with a higher seq than latter instance. This would break the G-exec-linearizability.
Instances proposed by a same leader have a natural depends-on relation: A newer instance depends-on all preceding ones.
This doesnot affect the out-of-order commit.
If b is committed before a is proposed,
a see at least one b.view that is committed.
∴ If a.view ⊃ b.view then executing a after b satisfies G-exec-linearizability.
Because subset relation ⊃ is monotonic, thus there is no cycle.
And we define that a newer instance always executes after an older one. Obviously this still have G-exec-linearizability held.
A value of something is safe if: no other process(leader or recovery process) would choose a different value to commit.
∴ A safe value has to be forwarded to a quorum(Qf or Qc) so that other
process can see the previous safe value.
The action commit is to broadcast about what value is safe.
For the order of two interfering instances a ~ b,
there are 3 legal relation pairs:
a: → b and b: < a a depends-on b and b doesnot know of a;
a: → b and b: → a a b depends-on each other;
a: < b and b: → a a doesnot know of b and b depends-on a;
From G-consistency,
a: < b and b: < a is not allowed.
A leader of an instance a is the only process that can commit a.
This is guaranteed by Ballot: the process Prepare-ed on a quorum with the
highest Ballot is the only one to proceed.
The process that initiates a is the default leader of it.
E.g., La initiates a is leader of a then it tries to commit one of a: → b or a: < b.
With this protocol we need to ensure two things about an instance a to be safe, before committing it:
-
What to execute:
a.cmds.To commit
a.cmds, just simply forward it to aQc, becausea.cmdsnever changes. -
and the relations of
aand all other interfering instances.
-
Round-1: a leader, e.g.,
Lasends RPC, i.e., Prepare to read the depends-on relation betweenaand all other instances(a: → x, a: < y, a: → z, ...) from some replicas.If
Laobserved that the relations ofaare safe, commit it. Thus this commit with only one round of RPC is called a FastCommit. And this procedure is so called FastPath -
If the relations is not safe, continue with Round-2, aka, the SlowPath. On SlowPath,
Lafollows the paxos protocol, choose a relation foraand then send Accept then commit.
When received Prepare request of a,
from Def-relation:
A replica decides the relation of a, b by the order a and b
arrives, if it didnot yet receive a Commit message.
A replica replies a do-not-know b, i.e., a: < b if
b does not exist on this replica. Otherwise it has to
reply b depends-on a, i.e., b: → a, to satisfy
G-consistency.
If a replica has received Commit message with a: → b,
it replies b: < a to Prepare request of b, without
breaking G-consistency.
∴ On a replica, the initial relations are exclusive: a: < b and b: → a or a: → b and b: < a.
If the relations La received in Round-1 are not safe,
La needs another round of RPC to make the value safe.
In Round-2, the leader choose a relation and broadcast it.
The relation it chose must satisfy
Def-relation:
i.e., a: < b and b: < a can not both chosen.
∴ If a: → b is seen, the leader has to choose a: → b.
The chosen relation also needs to satisfy Def-safe in order to commit.
∴ The chosen relation have to constitute a quorum: Qc, it must
be chosen by any other future leader(recovery).
A recovery process re-commits an instance if the previous leader fails(e.g., timeout) to complete the commit.
Our goal is to recover an instance with only f+1 active replicas.
Recovery have to guarantee that: an already committed value must be chosen.
Paxos is quite enough to recover an Accept-ed value, thus this protocol only need to consider recovering a FastCommit-ed value.
If a value is FastCommit-ed in the Round-1, the protocol have to guarantee recovery always choose this value.
∴ The FastCommit-ed value must be seen by any other quorum, i.e., A FastQuorum is at least a Qc.
The values replied from a FastQuorum Qf must be all the same.
Proof:
If a recovery process can choose the FastCommit-ed value v from a Qc,
when it saw some v and some other different value v₁,
then the other values are not needed.
∴ all values in FastQuorum are all the identical.
If two interfering instances a ~ b are both FastCommit-ed,
the value of b can be determined from the value of a, i.e., a: < b ⇔ b: → a.
Proof:
a ~ b are both FastCommit-ed,
from Lemma-FastQuorum-at-least-Qc,
there is at least one replica on which the relation is a: → b and b: < a.
And from Lemma-FastQuorum-same-value,
a: → b is FastCommit-ed implies that b: < a is the only value FastCommit-ed and vice versa.
If b: < a is committed, then a can only be FastCommit-ed with a: → b.
Proof:
If b: < a is committed,
from Lemma-FastQuorum-at-least-Qc,
and Def-Accept,
there are at least |Qc| replicas that have b: < a.
∴ There are at least |Qc| replicas that have b: < a.
La always sees b: < a.
Thus from Def-Accept, La can only commit a: → b.
If a recovery process reached either of leader of a(La) or leader of b(Lb),
It is able to find the relation between a, b from the leader.
From Lemma-entangled-FastCommit,
We only need to recover the relation when recovery process doesnot reach either of La or Lb.
This places demands on the definition of FastQuorum:
A FastQuorum is a set of replicas to which La forwarded a and guarantees Recovery-safety.
A FastQuorum requires: |Qf'(a, b)| > |Qc|/2 + |Qc| - 3,
where Qf'(a, b) = Qf(a) \ {La, Lb}, i.e., Qf(a) excluding leader of a or b.
Proof:
Assumes
La and Lb have chosen their FastQuorum as Qf(a) and Qf(b).
Recovery process has chosen a classic quorum Qc for recovery.
From Recovery-FastCommit-without-leaders, assumes {La, Lb} ∩ Qc = ø.
To satisfy Recovery-safety, a Qf have to satisfies that:
in Qc, only one of a: < b or b: < a can be chosen.
∴ Qf'(a, b) ∩ Qc and Qf'(b, a) ∩ Qc must have intersection.
I.e., |Qf'(a, b) ∩ Qc| > |Qc|/2 for any Qc: {La, Lb} ∩ Qc = ø. Then we have: |Qf'(a, b)| - (n-2-|Qc|) > |Qc|/2 = |Qc|/2 + |Qc| - 3.
E.g.:
- For 5 replicas,
Qc=3,|Qf'(a, b)|=2. - For 7 replicas,
Qc=4,|Qf'(a, b)|=4. - For 9 replicas,
Qc=5,|Qf'(a, b)|=5.
∴ With |Qc| = f+1, and including La,
The minimal FastQuorum size is: |Qf| >= f + |Qc|/2.
In order to recovery a FastCommit-ed value, another guarantee must be met:
If b: → a is committed, a: < b is the only relation that can be FastCommit-ed.
Proof:
From Def-FastQuorum, |Qf(a)| = f + |Qc|/2.
If Lb ∉ Qf(a), a FastCommit-ed value may not be chosen.
To achieve this optimal size of |Qf(a)| = f + |Qc|/2, we need to recover a from a committed b.
From Lemma-entangled-LT, we already have: b: < a implies only a: → b can be FastCommit-ed.
Then we need that if b: → a is committed, a: → b can not be FastCommit-ed:
If La tries to FastCommit a: → b, La has to be sure that b: < a is committed.
i.e., La has seen at least one committed b: < a.
With the above constrain, we establish another entangled status:
If b: → a is committed, a: < b is the only relation that can be FastCommit-ed.
From Lemma-entangled-FastCommit, Lemma-entangled-LT and Lemma-entangled-GT,
we have that if a ~ b and b is committed,
a can only be FastCommit-ed with the opposite value, i.e.,
- Committed
b: < a⇒a: → b - Committed
b: → a⇒a: < b
Conditions must be satisified to FastCommit a:
a: < x constitutes a FastQuorum: |Qf| = f + |Qc|/2, or:
a: → x constitutes a FastQuorum: |Qf| = f + |Qc|/2 and
the leader received at least one reply with committed x.
-
Lasends Prepare request toQf(a). So doesLb. -
If
Lareceived identical replies fromQf(a), e.g.:-
a: → bandb: < ais committed, - or
a: < b, commit it.
-
-
Otherwise, send Accept with
a: → b. Then commit it if received OK replies from someQc.
From Def-exec-order,
we use view to determine execution order.
With view, if b₁.view != b₂.view, b₁, b₂ are different instances.
This pseudo code describes an unoptimized impl: it commits relations one by one. In practice, relations are batch processed.
fn interferings(a):
rst = {}
for x in local_instances:
if x ~ a:
rst.insert(x)
return rst
fn new_dep(x):
return Dep{
id: x.id,
accepted: None,
committed: false,
view: x.view,
}
fn update_dep(a, b):
a.deps[b.id].view ∪= b.view
if b.committed:
a.deps[b.id].committed = true
if b.accepted > a.deps[b.id]:
a.deps[b.id].accepted = b.accepted
fn union(a, b):
c = {}
c.deps.keys = a.deps.keys ∪ b.deps.keys
for id in c.deps:
update_dep(c.deps[id], a.deps[id])
update_dep(c.deps[id], b.deps[id])
return c
-
Leader: Initiate instance
a₀: filla₀.depswith all local instances that interferes witha₀. Then forwarda₀to all other replicas. -
NonLeader: handle Prepare request: If
req.ballot >= last_blt, the replica rejects the request and replieslast_blt. Otherwise, initiatea'the same way theLadoes witha₀. Then uniona'anda₀:a₁ = union(a', a₀).Reply
a₁. -
Leader: Handle Prepare replies: For every instance
xin replieda.deps:x: x ∈ a.deps, if it satisfies FP-condition: commitx. Otherwise enter SlowPath.
-
Leader:
union()at least|Qc|replies to builda₂. Send Accept request witha₂to replicas. -
NonLeader: handle Accept request: If
req.ballot >= last_blt, the replica rejects the request and replieslast_blt. Otherwise, storea₂. Then reply thelast_blt. -
Leader: Handle Accept reply: if at least
|Qc|OK replies are received, commit it.
| |
| Leader | Non Leader
| -------------------------------------- | ---
| fn init(a₀): | ---
| a₀.deps = {} |
| |
| for x in interferings(a₀): |
| if a.id in x.deps and x.committed: |
| continue |
| a₀.deps.insert(x.id, new_dep(x)) |
| |
| init(a₀) |
| send_prepare(a₀) |
| --- | ---
| | fn handle_prepare_request(req):
| | if not req.blt >= last_blt:
| | reply(last_blt)
| | a₀ = req.a₀
| | a' = a₀; a'.deps = {}
| | init(a')
| | a₁ = union(a', a₀)
| | reply(a₁)
| --- | ---
| fn handle_prepare_replies(replies): |
| |
| dep_ids = {x.id: x ∈ replies[i]} |
| for dep_id in dep_ids: |
| same = true |
| committed = false |
| r0 = replies[0].deps[dep_id] |
| for repl in replies: |
| r = repl.deps[dep_id] |
| same = same and r0 == r |
| if r.committed: |
| committed = true |
| |
| if same: |
| if (r0 == a₀.deps[dep_id] |
| or committed): |
| commit(a, replies, dep_id) |
| continue |
| slowpath(a, replies, dep_id) |
| |
| --- | ---
| fn slowpath(a, replies, dep_id): | ---
| |
| for repl in replies: |
| update_dep(a, repl.deps[dep_id]) |
| a.deps[dep_id].accepted = my_blt |
| |
| send_accept(a, dep_id) |
| --- | ---
| | fn handle_accept_request(req):
| |
| | a = req.q; dep_id = req.dep_id
| | if req.blt >= last_blt:
| | save(a.deps[dep_id])
| | last_blt = req.blt
| | reply(OK)
| | else:
| | reply(last_blt)
| |
| --- | ---
| fn handle_accept_replies(replies): |
| ok = 0 |
| for repl in replies: |
| if repl.ok: |
| ok++ |
| if ok >= Qc: |
| commit(a, dep_id) |
| |
| --- | ---
| commit(a, dep_id) |
| a.deps[dep_id].committed = true |
| send_commit(a, dep_id) |
Assumes:
- The instance to recover is
a. The recovery process tries to determine the relation ofawithb. - The recovery process is
P(P != La).
In the following steps,
if conflict ballot is returned in reply.last_blt, update local ballot and retry.
Increment ballot: blt₁ = blt₀; blt₁[La]++ and send Prepare of a with blt₁ to take leadership from La.
If a is not on local replica P running on, send empty Prepare to a Qc to
find out a then retry.
If no a is found, commit a nil instance on SlowPath and quit.
After Prepare-ed on a quorum Qc,
commit a or b if it can be committed:
- If
Lbis seen, wait for it to commitb. - Accept-ed
aorbis seen, choose the one with greatestbltand commit it on SlowPath and quit.
If a is commited, quit.
If no a is seen, commit a nil instance on SlowPath and quit.
If only b is committed, from Lemma-entangled to choose the
value of a and commit it on SlowPath and quit.
Otherwise, a and b can only be FastCommit-ed:
From FP-condition, a: → b can be FastCommit-ed only when b: < a is committed.
∴ The next step is to determine the first FastCommit-ed: a: < b or b: < a.
From Def-Prepare, Lb wont reply a: < b if b is not committed.
∴ If a: < b is FastCommit-ed, then Lb ∉ Qf(a).
From Def-FastQuorum, a FastCommit-ed value always occupies a majority in any Qc if Qc: Qc ∩ {La, Lb} = ø.
∴ Choose the one that counts more than |Qc|/2, commit it on SlowPath and quit.
If no value of a has more than half of |Qc|,
choose the union of all seen value, commit it on SlowPath and quit.
See exec-update-accumulated.md
Too long; Do not read the following sections.
There is only one value could be chosen to be safe.
∴ finally an instance is committed the same value on all replias.
∴ All replicas have the same set of instances and relations.
On implementation, a.deps is split into N subset, where N is number of replicas.
Every subset contains only instances from leader Ri:
a.deps[Ri] = {x | x.replicaID == Ri and a → x}.
And a.deps[Ri] records only the max instance id.
This way,
Dep.id Dep.accepted Dep.committed are all recorded with a int.
Because only Accept-ed a implies all instances on La before a are also
Accept-ed.
So is Commit-ed.
type DepBatch {
id: InstanceID;
accepted: Ballot;
committed: bool;
view: View;
}
optimization maybe:
a a a→x a←x a←x x x
a always view x view a, so a does not need to know x.
When handling Prepare, need to exclude instance known but not on local replica. This guarantees only need to check the highest interfering instance when recovery.
No way. a after b requires a view of all b view of.
Prepare and Prepare can be done together Because Accept only check ballot. Prepare does not change existent Prepare.
inexistent Prepare-ed value can be treat as existent value but dont know what it is.
∵ Prepare never change a value. ∴ reading an old value does not need to be protected with Ballot.
seeing a Accept-ed instance: A recovery always choose a possibly FastCommit-ed value. ∴ accept this Prepare request wont let any other one to choose a different value.
As an optimization, respect the accepted value is more efficient. re-populating an Accept-ed value does not break consistency.
Instance.view is a virtual attribute and is not need to be persisted on disk.
Because it can be calculated dynamically from deps:
a.view = {a} ∪ a.deps[0].view ∪ a.deps[1].view ∪ ...
From Def-leader-order,
Because newer instance on La always depends-on older instance,
∴ a.deps[La] will never change thus do not need to recover it.
If x is FastCommit-ed:
-
If
areachedLx, thenaknow ifxis committed, becauseLxis the first to commit. Although there is chancexis committed afterareachesLx,Lxbroadcastsx is committedvery likely earlier than another instance bringsx is committedthrough its Prepare request. -
If
adid not reachLx, thenamust have reachedg - {La, Lx}, this prevent other value ofa > yto commit. ∴a > xis safe to FastCommit.