Goal
Redesign data ingestion to be declarative, secure, and future-proof. Customers should be able to ingest a new dataset by writing an `ingest.yaml` and running `helm install` — no Dockerfile, no `docker push`, no user-built images for the common path.
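To make the goal concrete, the common path could look roughly like this — a hypothetical sketch only, with illustrative field names, not a final schema:

```yaml
# ingest.yaml — hypothetical sketch, not the final schema
dataset: chest-xray-2024
task: regression
source:
  path: /data/chest-xray          # PVC-mounted path; data never leaves the cluster
# data_id defaults to UUID; mapping a source column is explicit opt-in
# data_id_column: patient_id     # deliberately NOT set here
label_policy:
  mode: bucket                    # only bin index + boundaries leave the cluster
  strategy: quantile
  bins: 10
```

The customer then runs `helm install my-dataset tracebloc/ingestor -f ingest.yaml` and never touches a Dockerfile.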
Why now
Current flow (per-customer Dockerfile + image build + manual `kubectl apply` of `ingestor-job.yaml`) has accumulated:
- Hardcoded credentials in `tracebloc_ingestor/config.py`
- `:latest` tag and unsigned images in the job manifest
- Username/password auth at job start (training pods already use the better pre-minted-token pattern via jobs-manager)
- `data_id` column-mapping that can leak source PII (e.g., `patient_id`) to the central backend
- Silent passthrough of regression target values (the `label` field for regression / time-series / time-to-event tasks is the prediction target itself, not metadata)
- Per-dataset images = supply-chain surface that's hard to scan, sign, and govern
Architecture (locked)
- Declarative YAML config with a Python escape hatch for genuine edge cases.
- PVC stays as the data plane (no object-store migration in this epic).
- Option A: jobs-manager (in client-runtime) owns ingestor Job creation and mints scoped tokens. The Helm chart becomes a thin client that submits the run to jobs-manager.
- `data_id` defaults to UUID; mapping a source column requires explicit opt-in in the YAML.
- `label_policy` field, required for regression-class tasks:
  - `bucket` (recommended): bin index + boundaries; default `strategy: quantile`, `bins: 10`
  - `passthrough`: customer explicitly accepts that exact target values leave the cluster
- Classification / object-detection / text-classification: defaults to `passthrough` (categorical metadata, no behavior change)
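The `bucket` policy above can be sketched in a few lines of stdlib Python. This is an illustrative sketch of the technique, not the shipped implementation — function and variable names are assumptions:

```python
# Sketch: what `label_policy: bucket` with `strategy: quantile, bins: 10`
# could compute before any label metadata leaves the cluster.
from bisect import bisect_right
from statistics import quantiles


def bucket_labels(values, bins=10):
    """Replace raw regression targets with (bin index, bin boundaries)."""
    # n-1 interior cut points split the data into `bins` quantile buckets
    boundaries = quantiles(values, n=bins, method="inclusive")
    indices = [bisect_right(boundaries, v) for v in values]
    return indices, boundaries


labels = [1.2, 3.4, 2.2, 9.9, 5.0, 4.1, 7.3, 0.5, 6.6, 8.8]
idx, bounds = bucket_labels(labels)
# Only `idx` and `bounds` would be sent to the central backend;
# the raw `labels` list never leaves the cluster.
```

The backend still gets enough signal to stratify and report per-bucket metrics, while exact target values (which can themselves be sensitive, e.g. a survival time) stay local.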
What does NOT change
- PVC + cluster-local MySQL remain the data plane.
- The six existing central-backend endpoints stay as they are; the token minted by jobs-manager is scoped exactly to those.
- Customer feature columns and file contents continue to never leave the cluster.
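Under Option A, the Job that jobs-manager creates could look roughly like the fragment below — a sketch with placeholder names, digest, and Secret layout (none of these are the real manifest): official image pinned by digest, scoped token injected from a Secret, no `CLIENT_PASSWORD` anywhere.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ingest-my-dataset              # hypothetical name
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ingestor
          # official GHCR image, semver-tagged and pinned by digest (placeholder digest)
          image: ghcr.io/tracebloc/ingestor:1.4.2@sha256:abc123...
          env:
            - name: BACKEND_TOKEN      # pre-minted, scoped token from jobs-manager
              valueFrom:
                secretKeyRef:
                  name: ingest-my-dataset-token
                  key: token
          volumeMounts:
            - name: data
              mountPath: /data         # PVC stays the data plane
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: dataset-pvc
```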
Cross-cutting acceptance criteria
- A customer can ingest a new dataset with `helm install my-dataset tracebloc/ingestor -f ingest.yaml`. No Dockerfile.
- Ingestor pod gets `BACKEND_TOKEN` from jobs-manager, never reads `CLIENT_PASSWORD`.
- Official ingestor image is on GHCR, semver-tagged, cosign-signed, pinned by digest in the chart.
- YAML schema is JSON Schema-validated client-side before the run is submitted.
- Existing user-built images keep working during deprecation window.
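The client-side validation criterion can be sketched as follows. This is a minimal stdlib illustration of the shape of the check — the real chart would run a full JSON Schema validator against the actual schema, and the field names here are assumptions:

```python
# Sketch of the client-side sanity check on a parsed ingest.yaml.
# Illustrative only; field names and task list are assumptions.
REGRESSION_TASKS = {"regression", "time_series", "time_to_event"}


def validate_ingest(cfg: dict) -> list[str]:
    """Return a list of human-readable errors; empty list means valid."""
    errors = []
    for field in ("dataset", "task"):
        if field not in cfg:
            errors.append(f"missing required field: {field}")
    if cfg.get("task") in REGRESSION_TASKS and "label_policy" not in cfg:
        errors.append("label_policy is required for regression-class tasks")
    if cfg.get("label_policy") not in (None, "bucket", "passthrough"):
        errors.append("label_policy must be 'bucket' or 'passthrough'")
    return errors


errors = validate_ingest({"dataset": "chest-xray", "task": "regression"})
# → ["label_policy is required for regression-class tasks"]
```

Failing fast on the client keeps bad configs from ever reaching jobs-manager, so the ingestor Job is only created for runs that can succeed.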
Children
- `BACKEND_TOKEN` auth and remove hardcoded credentials (P1, prerequisite)
Sequencing
#43, #44, #45, #21 can run in parallel. #86 blocks on all four. #46 follows #86. #87 stays deferred.
Out of scope (for follow-up epics)
- Object-store data plane (Layer 3 from design discussion)
- Dataset/DatasetVersion CRD with reconciler
- Folding the four `/global_meta/*` calls into a single backend webhook
- Cluster-local label orchestration (would let regression default to `omit`)