Ingestor Update — declarative YAML, official image, scoped tokens #85

@saadqbal

Goal

Redesign data ingestion to be declarative, secure, and future-proof. Customers should be able to ingest a new dataset by writing an ingest.yaml and running helm install — no Dockerfile, no docker push, no user-built images for the common path.
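To make the target workflow concrete, here is a sketch of what such an ingest.yaml could look like. All field names are hypothetical placeholders for illustration — the real schema is defined by the children of this epic:

```yaml
# Hypothetical ingest.yaml sketch — field names are illustrative, not the final schema.
dataset:
  name: chest-xray-v2
  task: regression
source:
  pvc_path: /data/chest-xray    # PVC remains the data plane
data_id:
  strategy: uuid                # default; mapping a source column requires explicit opt-in
label_policy:
  mode: bucket                  # recommended for regression-class tasks
  strategy: quantile
  bins: 10
```

The customer would then run helm install my-dataset tracebloc/ingestor -f ingest.yaml and the chart submits the run to jobs-manager.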

Why now

Current flow (per-customer Dockerfile + image build + manual kubectl apply of ingestor-job.yaml) has accumulated a number of problems:

  • Hardcoded credentials in tracebloc_ingestor/config.py
  • :latest tag and unsigned images in the job manifest
  • Username/password auth at job start (training pods already use the better pre-minted-token pattern via jobs-manager)
  • data_id column-mapping that can leak source PII (e.g., patient_id) to the central backend
  • Silent passthrough of regression target values (the label field for regression / time-series / time-to-event tasks is the prediction target itself, not metadata)
  • Per-dataset images = supply-chain surface that's hard to scan, sign, and govern

Architecture (locked)

  1. Declarative YAML config with a Python escape hatch for genuine edge cases.
  2. PVC stays as the data plane (no object-store migration in this epic).
  3. Option A: jobs-manager (in client-runtime) owns ingestor Job creation and mints scoped tokens. The Helm chart becomes a thin client that submits the run to jobs-manager.
  4. data_id defaults to UUID; mapping a source column requires explicit opt-in in the YAML.
  5. label_policy field, required for regression-class tasks:
    • bucket (recommended): bin index + boundaries; default strategy: quantile, bins: 10
    • passthrough: customer explicitly accepts that exact target values leave the cluster
    • Classification / object-detection / text-classification tasks default to passthrough (categorical metadata, no behavior change)
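The bucket policy in point 5 can be sketched as follows. This is a minimal illustration of the quantile strategy with stdlib Python, not the actual ingestor implementation — exact target values stay in-cluster; only bin indices and boundaries would leave:

```python
# Sketch of label_policy: bucket (strategy: quantile, bins: 10).
# Illustrative only — not the real ingestor code.
import statistics

def bucket_labels(values, bins=10):
    """Map raw regression targets to quantile-bin indices plus boundaries."""
    # statistics.quantiles returns the bins-1 interior cut points
    boundaries = statistics.quantiles(values, n=bins)

    def index(v):
        # bin index = number of boundaries the value exceeds, in [0, bins-1]
        return sum(v > b for b in boundaries)

    return [index(v) for v in values], boundaries
```

With this shape, the central backend sees only coarse bin membership, which is what makes bucket safe to recommend as the default for regression-class tasks.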

What does NOT change

  • PVC + cluster-local MySQL remain the data plane.
  • The six existing central-backend endpoints stay as they are; the token minted by jobs-manager is scoped exactly to those.
  • Customer feature columns and file contents still never leave the cluster.

Cross-cutting acceptance criteria

  • A customer can ingest a new dataset with: helm install my-dataset tracebloc/ingestor -f ingest.yaml. No Dockerfile.
  • Ingestor pod gets BACKEND_TOKEN from jobs-manager, never reads CLIENT_PASSWORD.
  • Official ingestor image is on GHCR, semver-tagged, cosign-signed, pinned by digest in the chart.
  • YAML schema is JSON Schema-validated client-side before the run is submitted.
  • Existing user-built images keep working during the deprecation window.
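The client-side validation criterion could look roughly like this. The snippet below is a hand-rolled stand-in for the real JSON Schema check, with hypothetical field and task names, to show the kind of errors that should be caught before a run is ever submitted to jobs-manager:

```python
# Hypothetical pre-submit validation sketch — field/task names are illustrative,
# and a real implementation would validate against the published JSON Schema.
REGRESSION_TASKS = {"regression", "time_series", "time_to_event"}  # assumed names

def validate_ingest(cfg: dict) -> list[str]:
    """Return a list of human-readable errors; empty list means the config is valid."""
    errors = []
    for key in ("dataset", "task", "source"):
        if key not in cfg:
            errors.append(f"missing required field: {key}")
    # label_policy is mandatory for regression-class tasks (architecture point 5)
    if cfg.get("task") in REGRESSION_TASKS and "label_policy" not in cfg:
        errors.append("label_policy is required for regression-class tasks")
    # data_id column mapping must be explicit opt-in (architecture point 4)
    mapping = cfg.get("data_id") or {}
    if mapping.get("from_column") and not mapping.get("allow_source_column"):
        errors.append("mapping data_id to a source column requires explicit opt-in")
    return errors
```

Failing fast in the Helm client keeps bad configs from ever reaching jobs-manager or the cluster.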

Sequencing

#43, #44, #45, #21 can run in parallel. #86 blocks on all four. #46 follows #86. #87 stays deferred.

Out of scope (for follow-up epics)

  • Object-store data plane (Layer 3 from design discussion)
  • Dataset/DatasetVersion CRD with reconciler
  • Folding the four /global_meta/* calls into a single backend webhook
  • Cluster-local label orchestration (would let regression default to omit)
