Three layers, in increasing order of realism. All run from the Go test toolchain; the development environment provides go and a real etcd binary (e.g. from system PATH, Homebrew, or Nix store) — no mocks of etcd anywhere, ever. The client-side recipes are exactly where bugs live; mocking etcd would test the mock.
For code with no I/O: semaphore ranking ("is rank r < N; which key do I watch"),
key encode/decode, holder JSON, schedule → next-tick math (table-driven, includes DST
edges), backoff sequence. Plain go test, milliseconds.
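As an illustration of what this layer covers, here is a minimal sketch of the semaphore ranking check as a pure function. The `Waiter` type, the `Rank` name, and the ordering by create revision are assumptions made for the example, not the actual conch types; the watch-key decision is left out because it depends on the spec.

```go
package main

import "sort"

// Waiter is a hypothetical view of one queue key: its name and the etcd
// create revision it was written at.
type Waiter struct {
	Key       string
	CreateRev int64
}

// Rank returns the zero-based position of key among the waiters ordered by
// create revision, and whether that position holds a slot (rank < limit).
// It returns rank -1 when key is not queued.
func Rank(waiters []Waiter, key string, limit int) (rank int, holds bool) {
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]Waiter(nil), waiters...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].CreateRev < sorted[j].CreateRev
	})
	for i, w := range sorted {
		if w.Key == key {
			return i, i < limit
		}
	}
	return -1, false
}
```

Because the function takes plain values and returns plain values, a table-driven test can cover holder, first waiter, and not-queued cases without touching etcd.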
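TestMain execs a single-node etcd from $PATH with its data directory in a fresh temp
dir (os.MkdirTemp, since TestMain receives a *testing.M and has no t.TempDir()),
listening on a unix socket or random localhost port, and tears it down after. Each
test gets a unique key prefix so tests parallelize within one etcd.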
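The setup described above could be sketched roughly as follows. The helper names (`freePort`, `etcdArgs`, `startEtcd`, `uniquePrefix`), the node name `test`, and the `/test/` prefix layout are all hypothetical; the etcd flags are the standard single-node ones. The sketch uses `os.MkdirTemp` for the data directory and a random localhost port rather than a unix socket.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"os/exec"
	"sync/atomic"
)

// freePort asks the kernel for an unused localhost TCP port.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// etcdArgs builds the flags for a throwaway single-node etcd.
func etcdArgs(dataDir string, clientPort, peerPort int) []string {
	client := fmt.Sprintf("http://127.0.0.1:%d", clientPort)
	peer := fmt.Sprintf("http://127.0.0.1:%d", peerPort)
	return []string{
		"--name", "test",
		"--data-dir", dataDir,
		"--listen-client-urls", client,
		"--advertise-client-urls", client,
		"--listen-peer-urls", peer,
		"--initial-advertise-peer-urls", peer,
		"--initial-cluster", "test=" + peer,
	}
}

// startEtcd launches etcd from $PATH and returns its client endpoint plus a
// stop function that kills the process and removes the data directory.
func startEtcd() (endpoint string, stop func(), err error) {
	bin, err := exec.LookPath("etcd")
	if err != nil {
		return "", nil, err
	}
	dir, err := os.MkdirTemp("", "conch-etcd-")
	if err != nil {
		return "", nil, err
	}
	clientPort, err := freePort()
	if err != nil {
		os.RemoveAll(dir)
		return "", nil, err
	}
	peerPort, err := freePort()
	if err != nil {
		os.RemoveAll(dir)
		return "", nil, err
	}
	cmd := exec.Command(bin, etcdArgs(dir, clientPort, peerPort)...)
	if err := cmd.Start(); err != nil {
		os.RemoveAll(dir)
		return "", nil, err
	}
	stop = func() {
		_ = cmd.Process.Kill()
		_ = cmd.Wait()
		_ = os.RemoveAll(dir)
	}
	return fmt.Sprintf("http://127.0.0.1:%d", clientPort), stop, nil
}

var prefixSeq int64

// uniquePrefix returns a key prefix no other test in this process will get.
func uniquePrefix(testName string) string {
	return fmt.Sprintf("/test/%s/%d/", testName, atomic.AddInt64(&prefixSeq, 1))
}
```

In this sketch, TestMain would call `startEtcd` before `m.Run()` and `stop` afterwards, and each test would call `uniquePrefix(t.Name())`. Waiting for etcd to become ready is not shown.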
Per tool, the contract table from its spec is the test list. Non-obvious must-haves:
- elect: winner runs child with correct CONCH_REV; Resign on clean exit hands over immediately (< 1s, not TTL); --who against vacant office exits 1; --restart backoff resets after a 60s-stable child.
- sema: N+1 contenders ⇒ exactly N children running (poll --who); releasing one promotes exactly the next-by-revision waiter; mismatched --max lands in a separate prefix; --nonblock exits 75 fast; waiter timeout removes its queue key.
- cron: three conchd instances against one etcd, @every 1s job ⇒ each tick has exactly one result; killed winner ⇒ tick claimed, no result, no rerun; rm during run lets the run finish; restart doesn't claim past ticks.
- core: child gets SIGTERM then SIGKILL after kill-after when lease is revoked (use etcdctl lease revoke as the partition stand-in); wrapper exits 70; child process group dies (spawn a grandchild, assert it's gone).
Property-style tests for the only promise that matters: bounded concurrency with honest overlap accounting.
Harness: K competing wrapper processes on one machine; each child appends
(holder, start-ns, end-ns) records to a shared file (O_APPEND). After a randomized
schedule of kills, lease revocations, and SIGSTOP/SIGCONT of wrappers (the GC-pause /
partition simulator), assert over the interval log:
- At every instant, ≤ N intervals overlap except during a window of ≤ TTL after each induced loss event (the documented overlap window — we assert it, including that it closes).
- After the last fault, the system converges: exactly the expected holders, all waiters either promoted or exited 75/70.
- Fencing tokens strictly increase per office/slot across successions.
These run with -race, are seeded (-run Invariant -seed=N reproduces a given run),
and gate any change to internal/core.
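The overlap assertion over the interval log might look like the sweep below. The `Interval` and `Span` types and the `Violations` function are hypothetical, and the sketch makes one simplifying assumption: an over-limit span is excused only if it lies entirely inside the window [fault, fault+TTL] of a single fault, which also checks that the overlap closes.

```go
package main

import "sort"

// Interval is one (holder, start-ns, end-ns) record from the shared log.
type Interval struct {
	Holder     string
	Start, End int64
}

// Span is a maximal period during which more than n intervals overlapped.
type Span struct{ From, To int64 }

// Violations returns every over-limit span that is not fully contained in
// the window [f, f+ttl] of some fault time f.
func Violations(log []Interval, n int, faults []int64, ttl int64) []Span {
	type event struct {
		at    int64
		delta int
	}
	events := make([]event, 0, 2*len(log))
	for _, iv := range log {
		events = append(events, event{iv.Start, +1}, event{iv.End, -1})
	}
	// At equal timestamps, ends sort before starts, so a back-to-back
	// handover does not count as overlap.
	sort.Slice(events, func(i, j int) bool {
		if events[i].at != events[j].at {
			return events[i].at < events[j].at
		}
		return events[i].delta < events[j].delta
	})
	excused := func(s Span) bool {
		for _, f := range faults {
			if s.From >= f && s.To <= f+ttl {
				return true
			}
		}
		return false
	}
	var bad []Span
	active := 0
	var from int64
	for _, e := range events {
		prev := active
		active += e.delta
		switch {
		case prev <= n && active > n: // crossed above the limit
			from = e.at
		case prev > n && active <= n: // dropped back to the limit
			s := Span{From: from, To: e.at}
			if !excused(s) {
				bad = append(bad, s)
			}
		}
	}
	return bad
}
```

An empty result means the bounded-concurrency promise held for that run; each returned span is a concrete counterexample with timestamps to look up in the log.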
conch-smoke.sh in the repo, run after each deploy against the real 3-node cluster:
conch elect smoke-$$ --who # expects vacant, exit 1
conch elect smoke-$$ -- true # acquire + clean exit on real quorum
conch sema smoke-$$ --max 2 --who
conch cron add smoke-$$ --schedule '@every 1m' -- logger "conch smoke"
# ... one tick later: cron ls shows exactly one result; then cron rm
Plus the two drills worth doing by hand once per schema version: power off the
fire-key-winning node mid-run (expect: no rerun, visible zombie row in ls), and stop
etcd on 2 of 3 nodes while an elect --restart service runs (expect: child killed ≤
TTL, restarts when quorum returns).
The CI pipeline runs code formatting checks, go vet, and test layers 1–3. Layer 3 (invariant tests) is executed with a fixed seed in CI for reproducibility, and can be run with a randomized seed in nightly jobs, so that over time the nightly runs exercise fault schedules the fixed seed never reaches. A failing nightly seed can then be replayed with -seed=N.