
Harden release registration and binary hash policy#99

Merged
Gajesh2007 merged 4 commits into master from codex/harden-release-binary-hash on Apr 30, 2026
Conversation

@anupsv
Contributor

@anupsv anupsv commented Apr 28, 2026

Summary

This PR hardens the pre-existing release/binary-hash trust path before the verifier work:

  • fail closed when a known-good binary hash policy is configured but provider registration/challenge omits a usable attested binary hash
  • normalize and validate release metadata before storing it
  • verify release artifacts by downloading the configured R2 bundle, checking bundle hash, and checking the bundled bin/darkbloom hash
  • use constant-time release-key comparison and reject malformed release payloads
  • add regression coverage for release registration and provider binary-hash policy behavior
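The two headline fixes in this summary — constant-time release-key comparison and the fail-closed binary-hash policy — can be sketched as follows. This is a minimal illustration, not the PR's actual code: the function names `releaseKeyMatches` and `binaryHashAllowed` and the `map[string]struct{}` hash set are invented here, and the real handlers in `coordinator/internal/api` carry more state.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// releaseKeyMatches compares the presented key against the configured one in
// constant time, so comparison cost does not leak how many leading bytes match.
// (ConstantTimeCompare still returns early on length mismatch, which is
// acceptable for fixed-length keys.)
func releaseKeyMatches(presented, configured string) bool {
	return subtle.ConstantTimeCompare([]byte(presented), []byte(configured)) == 1
}

// binaryHashAllowed fails closed: once a known-good hash policy is configured,
// a provider whose registration or challenge omits a usable attested hash is
// rejected rather than waved through.
func binaryHashAllowed(attestedHash string, knownHashes map[string]struct{}) bool {
	if len(knownHashes) == 0 {
		return true // no policy configured; nothing to enforce
	}
	if attestedHash == "" {
		return false // policy configured but no usable hash: reject
	}
	_, ok := knownHashes[attestedHash]
	return ok
}

func main() {
	known := map[string]struct{}{"deadbeef": {}}
	fmt.Println(binaryHashAllowed("", known))         // false: fail closed
	fmt.Println(binaryHashAllowed("deadbeef", known)) // true
	fmt.Println(releaseKeyMatches("k1", "k1"))        // true
}
```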

Validation

  • go test ./internal/api

@vercel

vercel Bot commented Apr 28, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project                     Status  Actions  Updated (UTC)
d-inference                 Ready   Preview  Apr 30, 2026 1:26pm
d-inference-console-ui-dev  Ready   Preview  Apr 30, 2026 1:26pm
d-inference-landing         Ready   Preview  Apr 30, 2026 1:26pm


@anupsv anupsv marked this pull request as ready for review April 29, 2026 00:05
@anupsv anupsv changed the title [codex] Harden release registration and binary hash policy Harden release registration and binary hash policy Apr 29, 2026
@hankbobtheresearchoor
Contributor

Hermes Agent Review

Verdict: Minor suggestions (solid security hardening, a few gaps to address)

✅ Looks Good

  • Fail-closed binary hash policy — providers now rejected when policy is configured but hash is missing or unsigned. This is the right call. Previously resp.BinaryHash != "" && len(s.knownBinaryHashes) > 0 was open-mode-adjacent.
  • Constant-time release key comparison via subtle.ConstantTimeCompare — fixes the timing leak in the old token != s.releaseKey string comparison.
  • Release artifact verification — downloading the R2 bundle, verifying bundle_hash, extracting bin/darkbloom, and verifying binary_hash before storing is excellent trust-on-first-use hardening. Prevents a compromised release key alone from whitelisting arbitrary binaries.
  • URL origin validation — off-origin URLs, HTTP origins, credentialed URLs all rejected. Redirect following disabled via CheckRedirect: return ErrUseLastResponse. Solid.
  • Path traversal protection in cleanReleaseTarPath — .. components and absolute paths rejected.
  • Non-regular file rejection — symlinks for bin/darkbloom rejected. TypeRegA accepted for legacy compat.
  • Strict JSON decoding — DisallowUnknownFields() prevents store-only fields like active/created_at from being injected.
  • Hash normalization — normalizeSHA256Hex ensures lowercase, and validates length and hex encoding before comparison.
  • MaxBytesReader on request body, bundle size limits (2GiB), binary size limits (512MiB).
  • Attestation cross-check — challenge binary hash must match registration attestation hash. Prevents hash rotation after registration.
  • Good test coverage — 10+ new tests covering hash mismatches, oversized binaries, redirects, unsafe paths, symlinks, off-origin URLs, credentialed URLs, missing hashes, unsigned hashes.

⚠️ Warnings

  1. verifyReleaseArtifact writes to disk — os.CreateTemp("", "darkbloom-release-*.tar.gz") writes the entire bundle to a temp file before verifying. For a 2GiB max bundle, this could fill /tmp under concurrent release registrations. Consider streaming verification: hash while downloading, then stream the gzip/tar without materializing to disk (you already hash while writing via MultiWriter, so the temp file is only needed for the Seek(0) to re-read for tar extraction — could use io.TeeReader + bytes.Buffer for the hash portion, or restructure to avoid the seek).

  2. Bundle download has no TLS pinning — the coordinator trusts the system CA pool when downloading from R2. If R2 CDN is compromised or MITM'd at the CA level, a malicious bundle could be served. Consider pinning the expected R2 certificate or at least validating the TLS certificate hostname matches the configured CDN URL. (Low priority given R2's CDN architecture, but worth noting for defense-in-depth.)

  3. SyncBinaryHashes marks policy as configured when any active release exists — even a release with an empty/invalid BinaryHash (which gets logged and skipped) will set releaseBinaryHashPolicyConfigured = true. This means adding a single bad release record could lock out all providers. Consider only setting policyConfigured = true when at least one valid hash is actually added to the set.

💡 Suggestions

  1. maxReleaseArtifactBytes = 2 << 30 — this is 2 * (1 << 30) = 2GiB, which is correct, but 2 << 30 is an unusual way to write it. 2 * 1024 * 1024 * 1024 or 2 * (1 << 30) would read more clearly; at minimum a comment clarifying the intent would help future readers.

  2. releaseVersionPattern allows sloppy prerelease suffixes — ^[0-9]+\.[0-9]+\.[0-9]+(?:[-+][0-9A-Za-z.-]+)?$ accepts strings like 1.0.0-foo..bar-- (empty dot-separated identifiers, trailing hyphens). Consider tightening to a proper semver pattern if that's the intent.

  3. eigeninference-bundle-macos-arm64.tar.gz in expectedReleaseArtifactURL — hardcodes the old name. If PR #98 (Rename all EigenInference references to Darkbloom) merges, this path will need updating. Consider making the bundle name prefix configurable or derived from a constant.

  4. Missing test: concurrent release registrations — the temp file + disk I/O path could race under concurrent POST /v1/releases. Worth a stress test.
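Suggestion 2 is easy to demonstrate. Below, the loose pattern quoted from the review is compared against the pattern published at semver.org (which Go's RE2 engine handles, since it uses no backreferences); the loose pattern accepts a suffix with empty identifiers and trailing hyphens that strict semver rejects. The variable names here are illustrative, not the coordinator's.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// loose is the pattern quoted in the review.
	loose = regexp.MustCompile(`^[0-9]+\.[0-9]+\.[0-9]+(?:[-+][0-9A-Za-z.-]+)?$`)
	// strict is the official semver.org pattern (capture groups elided where
	// unneeded), which forbids empty identifiers and leading zeros.
	strict = regexp.MustCompile(`^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)` +
		`(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?` +
		`(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$`)
)

func main() {
	for _, v := range []string{"1.0.0", "1.2.3-rc.1", "1.0.0-foo..bar--"} {
		fmt.Printf("%-18s loose=%v strict=%v\n", v, loose.MatchString(v), strict.MatchString(v))
	}
	// 1.0.0              loose=true strict=true
	// 1.2.3-rc.1         loose=true strict=true
	// 1.0.0-foo..bar--   loose=true strict=false
}
```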


Reviewed by Hermes Agent (hankbob)

@Gajesh2007
Member

@codex go ahead review the code my boy

@Gajesh2007 Gajesh2007 merged commit b5dd048 into master Apr 30, 2026
17 checks passed
@Gajesh2007 Gajesh2007 deleted the codex/harden-release-binary-hash branch April 30, 2026 14:12
ethenotethan added a commit that referenced this pull request May 14, 2026
* Fix Darkbloom analytics tracking

* Harden release workflow protections (#103)

* Harden release registration and binary hash policy (#99)

* Harden release registration and binary hash policy

* derive release download URL from allowlist

* Stabilize provider coordinator test

---------

Co-authored-by: Gajesh Naik <26431906+Gajesh2007@users.noreply.github.com>

* Remove stale Python integration test (#109)

* e2e: add local simulation environment skeleton

Introduces scripts/e2e-runner.py, a Python orchestrator that spins up the
real coordinator binary with test-friendly configuration (in-memory store,
mock billing, no trust requirements) alongside a simulated or real
provider, and runs HTTP/WebSocket-level assertions against the live stack.

Key components:
- Coordinator class: builds and spawns coordinator with EIGENINFERENCE_MIN_TRUST=none,
  EIGENINFERENCE_BILLING_MOCK=true, and in-memory store
- SimulatedProvider: pure-Python WebSocket client speaking the full provider protocol
  (register, attestation challenge/response, heartbeat, inference request/response)
- Test framework: decorator-based test registration, pass/fail summary, signal-safe
  cleanup via atexit + signal handlers
- Test stubs: test_basic (registration + discovery), test_inference (consumer
  request routing), test_multi_provider (two providers, same model)

TODO:
- RealProvider wrapper around darkbloom serve --coordinator
- Coordination between provider challenge cycle and consumer request timing
- API key handling for consumer vs admin routes
- Python dependency management (websockets, cryptography)

* Revert "e2e: add local simulation environment skeleton"

This reverts commit d02074e. The Python E2E runner adds noise on top of
the existing Go integration tests (internal/api/integration_test.go +
fullstack_integration_test.go) which already cover the full coordinator
protocol surface. The cross-language orchestration doesn't buy anything
over what httptest.Server + simulated providers already provide.

* Remove stale Python integration test

@ethenotethan

tests/integration_test.py is superseded by the Go-based coordinator
integration tests at coordinator/internal/api/:

- Test coverage for coordinator protocol (register, challenge, heartbeat,
  inference) is covered by integration_test.go using httptest.Server +
  Go simulated providers — same coverage, no binary build needed
- Full-stack GPU inference is covered by fullstack_integration_test.go
  with real vllm-mlx backends (gated behind LIVE_FULLSTACK_TEST=1)
- The Python test uses stale binary names ('eigeninference-provider'),
  old flags ('--backend mlx-lm'), and predates attestation challenges,
  E2E encryption, and the vllm-mlx backend migration
- No external dependency coverage (Postgres, Stripe, etc.) is lost — the
  coordinator main.go wiring for those is trivially tested elsewhere
- The Python SDK tests (4.5.x) belong in the SDK repo, not the infra repo

---------

Co-authored-by: Hank Bob <hankbob@researchoors.com>

* chore: remove unused dependencies (#112)

* chore: remove unused dependencies

* test: fix console ui test isolation

* chore: prune repo-wide dead code findings

* ci: run CI on any PR, not just master/main (#119)

* ci: remove racing deploy-dev-coordinator workflow (#137)

Cloud Build (deploy/gcp/cloudbuild.yaml) already deploys the coordinator
on the same trigger (push to master touching coordinator/** or deploy/gcp/**).
Having both paths active creates a race condition where two CI systems
simultaneously deploy to the same dev VM — see #115.

* feat: add Datadog observability stack for dev coordinator

Install Datadog Agent on the dev GCE VM (DogStatsD, APM, journald logs)
and wire the coordinator to emit structured metrics, split attestation
counters, model_type tags, reactive provider-count gauges, and a
completion-tokens counter. Rebuild the dev dashboard with 7 sections
covering metrics, logs, traces, and system health.

* fix: prevent double-decrement when untrusted provider disconnects

Disconnect now checks StatusUntrusted before decrementing the online
counter and model-provider gauges, since MarkUntrusted already
decremented them.

* feat: add fleet version and binary hash observability

New metrics:
- providers.per_version gauge (per provider binary version)
- providers.per_binary_hash gauge (per attested binary hash)
- coordinator.min_provider_version_set gauge (1 when configured)
- provider_version_below_minimum counter (tagged by gate and version)

Gates instrumented:
- registration (provider.go)
- challenge revalidation (provider.go)
- manifest sync (server.go)

Registry additions:
- ProviderCountByVersion()
- ProviderCountByBinaryHash()

Dashboard: Fleet Version & Binary Hash group with providers by version,
providers by binary hash, min provider version, below-minimum events,
and top binary hashes toplist.

* fix: update Dockerfile + cloudbuild for go.mod at repo root

go.mod moved from coordinator/ to repo root during the swift-provider
merge. Build context is now repo root, Dockerfile copies coordinator/
subdir explicitly.

* fix: chmod +x coordinator binary in Dockerfile

* fix: ensure coordinator binary is executable in builder stage

* fix: rename coordinator source dir in builder to avoid colliding with binary path

* fix: copy full repo in Dockerfile builder so go.mod resolves all packages

* fix: remove unused modelTypeTag and format Go files for CI

* fix: skip python/dangerous-modules check for swift runtime in private text gate

* billing telemetry + MarkUntrusted race fix + Swift routing tests

- Add Datadog histogram metrics for reservation amounts, settlement
  refunds, provider credits, and platform fees
- Add store.debit/credit.latency_ms histograms for DB operation timing
- Add billing.cost_clamped and billing.reservation_refunds counters
- Fix race in MarkUntrusted: hold r.mu write lock through counter
  decrement to prevent double-decrement with Disconnect
- Add unit tests for Swift provider privacy caps (with/without Python)
- Add E2E test for Swift provider routing via challenge-verified path
- Update dev-network-dashboard.json with Billing & Store group

* fix Heartbeat reviving untrusted providers causing onlineCount double-decrement

* revert orthogonal landing/console-ui/provider changes

* remove unbounded binary_hash cardinality, add input token metrics + store latency, fix dashboard group-by

* fix review feedback: ModelType() untrusted filter, routing.cost_ms by provider, billing in cents, dead comment

---------

Co-authored-by: Gajesh Naik <26431906+Gajesh2007@users.noreply.github.com>
Co-authored-by: anupsv <6407789+anupsv@users.noreply.github.com>
Co-authored-by: hankbob <hankbobtheresearchoor@gmail.com>
Co-authored-by: Hank Bob <hankbob@researchoors.com>
