
Harden release registration and binary hash policy#99

Merged
Gajesh2007 merged 4 commits into master from codex/harden-release-binary-hash on Apr 30, 2026
Conversation

@anupsv
Contributor

@anupsv anupsv commented Apr 28, 2026

Summary

This PR hardens the pre-existing release/binary-hash trust path before the verifier work:

  • fail closed when a known-good binary hash policy is configured but provider registration/challenge omits a usable attested binary hash
  • normalize and validate release metadata before storing it
  • verify release artifacts by downloading the configured R2 bundle, checking bundle hash, and checking the bundled bin/darkbloom hash
  • use constant-time release-key comparison and reject malformed release payloads
  • add regression coverage for release registration and provider binary-hash policy behavior
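The two headline fixes in this summary — constant-time release-key comparison and the fail-closed binary-hash policy — can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function names `releaseKeyMatches` and `binaryHashAllowed` and the `map[string]struct{}` hash set are invented here, and the real handlers in `coordinator/internal/api` carry more state.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// releaseKeyMatches compares the presented key against the configured one in
// constant time, so comparison cost does not leak how many leading bytes match.
// (ConstantTimeCompare still returns early on length mismatch, which is
// acceptable for fixed-length keys.)
func releaseKeyMatches(presented, configured string) bool {
	return subtle.ConstantTimeCompare([]byte(presented), []byte(configured)) == 1
}

// binaryHashAllowed fails closed: once a known-good hash policy is configured,
// a provider whose registration or challenge omits a usable attested hash is
// rejected rather than waved through.
func binaryHashAllowed(attestedHash string, knownHashes map[string]struct{}) bool {
	if len(knownHashes) == 0 {
		return true // no policy configured; nothing to enforce
	}
	if attestedHash == "" {
		return false // policy configured but no usable hash: reject
	}
	_, ok := knownHashes[attestedHash]
	return ok
}

func main() {
	known := map[string]struct{}{"deadbeef": {}}
	fmt.Println(binaryHashAllowed("", known))         // false: fail closed
	fmt.Println(binaryHashAllowed("deadbeef", known)) // true
	fmt.Println(releaseKeyMatches("k1", "k1"))        // true
}
```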

Validation

  • go test ./internal/api

@vercel

vercel Bot commented Apr 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project                     Status  Actions  Updated (UTC)
d-inference                 Ready   Preview  Apr 30, 2026 1:26pm
d-inference-console-ui-dev  Ready   Preview  Apr 30, 2026 1:26pm
d-inference-landing         Ready   Preview  Apr 30, 2026 1:26pm


@anupsv anupsv marked this pull request as ready for review April 29, 2026 00:05
@anupsv anupsv changed the title [codex] Harden release registration and binary hash policy Harden release registration and binary hash policy Apr 29, 2026
@hankbobtheresearchoor
Contributor

Hermes Agent Review

Verdict: Minor suggestions (solid security hardening, a few gaps to address)

✅ Looks Good

  • Fail-closed binary hash policy — providers now rejected when policy is configured but hash is missing or unsigned. This is the right call. Previously resp.BinaryHash != "" && len(s.knownBinaryHashes) > 0 was open-mode-adjacent.
  • Constant-time release key comparison via subtle.ConstantTimeCompare — fixes the timing leak in the old token != s.releaseKey string comparison.
  • Release artifact verification — downloading the R2 bundle, verifying bundle_hash, extracting bin/darkbloom, and verifying binary_hash before storing is excellent trust-on-first-use hardening. Prevents a compromised release key alone from whitelisting arbitrary binaries.
  • URL origin validation — off-origin URLs, HTTP origins, credentialed URLs all rejected. Redirect following disabled via CheckRedirect: return ErrUseLastResponse. Solid.
  • Path traversal protection in cleanReleaseTarPath — .. components and absolute paths rejected.
  • Non-regular file rejection — symlinks for bin/darkbloom rejected. TypeRegA accepted for legacy compat.
  • Strict JSON decoding — DisallowUnknownFields() prevents store-only fields like active/created_at from being injected.
  • Hash normalization — normalizeSHA256Hex ensures lowercase, and validates length and hex encoding before comparison.
  • MaxBytesReader on request body, bundle size limits (2GiB), binary size limits (512MiB).
  • Attestation cross-check — challenge binary hash must match registration attestation hash. Prevents hash rotation after registration.
  • Good test coverage — 10+ new tests covering hash mismatches, oversized binaries, redirects, unsafe paths, symlinks, off-origin URLs, credentialed URLs, missing hashes, unsigned hashes.

⚠️ Warnings

  1. verifyReleaseArtifact writes to disk — os.CreateTemp("", "darkbloom-release-*.tar.gz") writes the entire bundle to a temp file before verifying. For a 2GiB max bundle, this could fill /tmp under concurrent release registrations. Consider streaming verification: hash while downloading, then stream the gzip/tar without materializing to disk (you already hash while writing via MultiWriter, so the temp file is only needed for the Seek(0) to re-read for tar extraction — could use io.TeeReader + bytes.Buffer for the hash portion, or restructure to avoid the seek).

  2. Bundle download has no TLS pinning — the coordinator trusts the system CA pool when downloading from R2. If R2 CDN is compromised or MITM'd at the CA level, a malicious bundle could be served. Consider pinning the expected R2 certificate or at least validating the TLS certificate hostname matches the configured CDN URL. (Low priority given R2's CDN architecture, but worth noting for defense-in-depth.)

  3. SyncBinaryHashes marks policy as configured when any active release exists — even a release with an empty/invalid BinaryHash (which gets logged and skipped) will set releaseBinaryHashPolicyConfigured = true. This means adding a single bad release record could lock out all providers. Consider only setting policyConfigured = true when at least one valid hash is actually added to the set.

💡 Suggestions

  1. maxReleaseArtifactBytes = 2 << 30 — this is 2 * (1 << 30) = 2GiB, which is correct, but 2 << 30 is an unusual way to write it. 2 * 1024 * 1024 * 1024 or 2 * (1 << 30) would read more clearly; at minimum a comment clarifying the intent would help future readers.

  2. releaseVersionPattern allows sloppy prerelease suffixes — ^[0-9]+\.[0-9]+\.[0-9]+(?:[-+][0-9A-Za-z.-]+)?$ accepts strings like 1.0.0-foo..bar-- (empty dot-separated identifiers, trailing hyphens). Consider tightening to a proper semver pattern if that's the intent.

  3. eigeninference-bundle-macos-arm64.tar.gz in expectedReleaseArtifactURL — hardcodes the old name. If PR #98 (Rename all EigenInference references to Darkbloom) merges, this path will need updating. Consider making the bundle name prefix configurable or derived from a constant.

  4. Missing test: concurrent release registrations — the temp file + disk I/O path could race under concurrent POST /v1/releases. Worth a stress test.
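Suggestion 2 is easy to demonstrate. Below, the loose pattern quoted from the review is compared against the pattern published at semver.org (which Go's RE2 engine handles, since it uses no backreferences); the loose pattern accepts a suffix with empty identifiers and trailing hyphens that strict semver rejects. The variable names here are illustrative, not the coordinator's.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// loose is the pattern quoted in the review.
	loose = regexp.MustCompile(`^[0-9]+\.[0-9]+\.[0-9]+(?:[-+][0-9A-Za-z.-]+)?$`)
	// strict is the official semver.org pattern (capture groups elided where
	// unneeded), which forbids empty identifiers and leading zeros.
	strict = regexp.MustCompile(`^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)` +
		`(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?` +
		`(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$`)
)

func main() {
	for _, v := range []string{"1.0.0", "1.2.3-rc.1", "1.0.0-foo..bar--"} {
		fmt.Printf("%-18s loose=%v strict=%v\n", v, loose.MatchString(v), strict.MatchString(v))
	}
	// 1.0.0              loose=true strict=true
	// 1.2.3-rc.1         loose=true strict=true
	// 1.0.0-foo..bar--   loose=true strict=false
}
```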


Reviewed by Hermes Agent (hankbob)

@Gajesh2007
Member

@codex go ahead review the code my boy

@Gajesh2007 Gajesh2007 merged commit b5dd048 into master Apr 30, 2026
17 checks passed
@Gajesh2007 Gajesh2007 deleted the codex/harden-release-binary-hash branch April 30, 2026 14:12
ethenotethan added a commit that referenced this pull request May 14, 2026
* Fix Darkbloom analytics tracking

* Harden release workflow protections (#103)

* Harden release registration and binary hash policy (#99)

* Harden release registration and binary hash policy

* derive release download URL from allowlist

* Stabilize provider coordinator test

---------

Co-authored-by: Gajesh Naik <26431906+Gajesh2007@users.noreply.github.com>

* Remove stale Python integration test (#109)

* e2e: add local simulation environment skeleton

Introduces scripts/e2e-runner.py, a Python orchestrator that spins up the
real coordinator binary with test-friendly configuration (in-memory store,
mock billing, no trust requirements) alongside a simulated or real
provider, and runs HTTP/WebSocket-level assertions against the live stack.

Key components:
- Coordinator class: builds and spawns coordinator with EIGENINFERENCE_MIN_TRUST=none,
  EIGENINFERENCE_BILLING_MOCK=true, and in-memory store
- SimulatedProvider: pure-Python WebSocket client speaking the full provider protocol
  (register, attestation challenge/response, heartbeat, inference request/response)
- Test framework: decorator-based test registration, pass/fail summary, signal-safe
  cleanup via atexit + signal handlers
- Test stubs: test_basic (registration + discovery), test_inference (consumer
  request routing), test_multi_provider (two providers, same model)

TODO:
- RealProvider wrapper around darkbloom serve --coordinator
- Coordination between provider challenge cycle and consumer request timing
- API key handling for consumer vs admin routes
- Python dependency management (websockets, cryptography)

* Revert "e2e: add local simulation environment skeleton"

This reverts commit d02074e. The Python E2E runner adds noise on top of
the existing Go integration tests (internal/api/integration_test.go +
fullstack_integration_test.go) which already cover the full coordinator
protocol surface. The cross-language orchestration doesn't buy anything
over what httptest.Server + simulated providers already provide.

* Remove stale Python integration test

@ethenotethan

tests/integration_test.py is superseded by the Go-based coordinator
integration tests at coordinator/internal/api/:

- Test coverage for coordinator protocol (register, challenge, heartbeat,
  inference) is covered by integration_test.go using httptest.Server +
  Go simulated providers — same coverage, no binary build needed
- Full-stack GPU inference is covered by fullstack_integration_test.go
  with real vllm-mlx backends (gated behind LIVE_FULLSTACK_TEST=1)
- The Python test uses stale binary names ('eigeninference-provider'),
  old flags ('--backend mlx-lm'), and predates attestation challenges,
  E2E encryption, and the vllm-mlx backend migration
- No external dependency coverage (Postgres, Stripe, etc.) is lost — the
  coordinator main.go wiring for those is trivially tested elsewhere
- The Python SDK tests (4.5.x) belong in the SDK repo, not the infra repo

---------

Co-authored-by: Hank Bob <hankbob@researchoors.com>

* chore: remove unused dependencies (#112)

* chore: remove unused dependencies

* test: fix console ui test isolation

* chore: prune repo-wide dead code findings

* ci: run CI on any PR, not just master/main (#119)

* ci: remove racing deploy-dev-coordinator workflow (#137)

Cloud Build (deploy/gcp/cloudbuild.yaml) already deploys the coordinator
on the same trigger (push to master touching coordinator/** or deploy/gcp/**).
Having both paths active creates a race condition where two CI systems
simultaneously deploy to the same dev VM — see #115.

* feat: add Datadog observability stack for dev coordinator

Install Datadog Agent on the dev GCE VM (DogStatsD, APM, journald logs)
and wire the coordinator to emit structured metrics, split attestation
counters, model_type tags, reactive provider-count gauges, and a
completion-tokens counter. Rebuild the dev dashboard with 7 sections
covering metrics, logs, traces, and system health.

* fix: prevent double-decrement when untrusted provider disconnects

Disconnect now checks StatusUntrusted before decrementing the online
counter and model-provider gauges, since MarkUntrusted already
decremented them.

* feat: add fleet version and binary hash observability

New metrics:
- providers.per_version gauge (per provider binary version)
- providers.per_binary_hash gauge (per attested binary hash)
- coordinator.min_provider_version_set gauge (1 when configured)
- provider_version_below_minimum counter (tagged by gate and version)

Gates instrumented:
- registration (provider.go)
- challenge revalidation (provider.go)
- manifest sync (server.go)

Registry additions:
- ProviderCountByVersion()
- ProviderCountByBinaryHash()

Dashboard: Fleet Version & Binary Hash group with providers by version,
providers by binary hash, min provider version, below-minimum events,
and top binary hashes toplist.

* fix: update Dockerfile + cloudbuild for go.mod at repo root

go.mod moved from coordinator/ to repo root during the swift-provider
merge. Build context is now repo root, Dockerfile copies coordinator/
subdir explicitly.

* fix: chmod +x coordinator binary in Dockerfile

* fix: ensure coordinator binary is executable in builder stage

* fix: rename coordinator source dir in builder to avoid colliding with binary path

* fix: copy full repo in Dockerfile builder so go.mod resolves all packages

* fix: remove unused modelTypeTag and format Go files for CI

* fix: skip python/dangerous-modules check for swift runtime in private text gate

* billing telemetry + MarkUntrusted race fix + Swift routing tests

- Add Datadog histogram metrics for reservation amounts, settlement
  refunds, provider credits, and platform fees
- Add store.debit/credit.latency_ms histograms for DB operation timing
- Add billing.cost_clamped and billing.reservation_refunds counters
- Fix race in MarkUntrusted: hold r.mu write lock through counter
  decrement to prevent double-decrement with Disconnect
- Add unit tests for Swift provider privacy caps (with/without Python)
- Add E2E test for Swift provider routing via challenge-verified path
- Update dev-network-dashboard.json with Billing & Store group

* fix Heartbeat reviving untrusted providers causing onlineCount double-decrement

* revert orthogonal landing/console-ui/provider changes

* remove unbounded binary_hash cardinality, add input token metrics + store latency, fix dashboard group-by

* fix review feedback: ModelType() untrusted filter, routing.cost_ms by provider, billing in cents, dead comment

---------

Co-authored-by: Gajesh Naik <26431906+Gajesh2007@users.noreply.github.com>
Co-authored-by: anupsv <6407789+anupsv@users.noreply.github.com>
Co-authored-by: hankbob <hankbobtheresearchoor@gmail.com>
Co-authored-by: Hank Bob <hankbob@researchoors.com>
