Ingestor Update — declarative YAML, official image, scoped tokens #85

@saadqbal

Goal

Redesign data ingestion to be declarative, secure, and future-proof. Customers should be able to ingest a new dataset by writing an ingest.yaml and running helm install — no Dockerfile, no docker push, no user-built images for the common path.
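To make the target workflow concrete, here is a sketch of what such an ingest.yaml could look like. All field names are hypothetical placeholders for illustration — the real schema is defined by the children of this epic:

```yaml
# Hypothetical ingest.yaml sketch — field names are illustrative, not the final schema.
dataset:
  name: chest-xray-v2
  task: regression
source:
  pvc_path: /data/chest-xray    # PVC remains the data plane
data_id:
  strategy: uuid                # default; mapping a source column requires explicit opt-in
label_policy:
  mode: bucket                  # recommended for regression-class tasks
  strategy: quantile
  bins: 10
```

The customer would then run helm install my-dataset tracebloc/ingestor -f ingest.yaml and the chart submits the run to jobs-manager.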

Why now

Current flow (per-customer Dockerfile + image build + manual kubectl apply of ingestor-job.yaml) has accumulated a number of problems:

  • Hardcoded credentials in tracebloc_ingestor/config.py
  • :latest tag and unsigned images in the job manifest
  • Username/password auth at job start (training pods already use the better pre-minted-token pattern via jobs-manager)
  • data_id column-mapping that can leak source PII (e.g., patient_id) to the central backend
  • Silent passthrough of regression target values (the label field for regression / time-series / time-to-event tasks is the prediction target itself, not metadata)
  • Per-dataset images = supply-chain surface that's hard to scan, sign, and govern

Architecture (locked)

  1. Declarative YAML config with a Python escape hatch for genuine edge cases.
  2. PVC stays as the data plane (no object-store migration in this epic).
  3. Option A: jobs-manager (in client-runtime) owns ingestor Job creation and mints scoped tokens. The Helm chart becomes a thin client that submits the run to jobs-manager.
  4. data_id defaults to UUID; mapping a source column requires explicit opt-in in the YAML.
  5. label_policy field, required for regression-class tasks:
    • bucket (recommended): bin index + boundaries; default strategy: quantile, bins: 10
    • passthrough: customer explicitly accepts that exact target values leave the cluster
    • Classification / object-detection / text-classification tasks default to passthrough (categorical metadata, no behavior change)
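The bucket policy in point 5 can be sketched as follows. This is a minimal illustration of the quantile strategy with stdlib Python, not the actual ingestor implementation — exact target values stay in-cluster; only bin indices and boundaries would leave:

```python
# Sketch of label_policy: bucket (strategy: quantile, bins: 10).
# Illustrative only — not the real ingestor code.
import statistics

def bucket_labels(values, bins=10):
    """Map raw regression targets to quantile-bin indices plus boundaries."""
    # statistics.quantiles returns the bins-1 interior cut points
    boundaries = statistics.quantiles(values, n=bins)

    def index(v):
        # bin index = number of boundaries the value exceeds, in [0, bins-1]
        return sum(v > b for b in boundaries)

    return [index(v) for v in values], boundaries
```

With this shape, the central backend sees only coarse bin membership, which is what makes bucket safe to recommend as the default for regression-class tasks.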

What does NOT change

  • PVC + cluster-local MySQL remain the data plane.
  • The six existing central-backend endpoints stay as they are; the token minted by jobs-manager is scoped exactly to those.
  • Customer feature columns and file contents still never leave the cluster.

Cross-cutting acceptance criteria

  • A customer can ingest a new dataset with: helm install my-dataset tracebloc/ingestor -f ingest.yaml. No Dockerfile.
  • Ingestor pod gets BACKEND_TOKEN from jobs-manager, never reads CLIENT_PASSWORD.
  • Official ingestor image is on GHCR, semver-tagged, cosign-signed, pinned by digest in the chart.
  • YAML schema is JSON Schema-validated client-side before the run is submitted.
  • Existing user-built images keep working during the deprecation window.
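The client-side validation criterion could look roughly like this. The snippet below is a hand-rolled stand-in for the real JSON Schema check, with hypothetical field and task names, to show the kind of errors that should be caught before a run is ever submitted to jobs-manager:

```python
# Hypothetical pre-submit validation sketch — field/task names are illustrative,
# and a real implementation would validate against the published JSON Schema.
REGRESSION_TASKS = {"regression", "time_series", "time_to_event"}  # assumed names

def validate_ingest(cfg: dict) -> list[str]:
    """Return a list of human-readable errors; empty list means the config is valid."""
    errors = []
    for key in ("dataset", "task", "source"):
        if key not in cfg:
            errors.append(f"missing required field: {key}")
    # label_policy is mandatory for regression-class tasks (architecture point 5)
    if cfg.get("task") in REGRESSION_TASKS and "label_policy" not in cfg:
        errors.append("label_policy is required for regression-class tasks")
    # data_id column mapping must be explicit opt-in (architecture point 4)
    mapping = cfg.get("data_id") or {}
    if mapping.get("from_column") and not mapping.get("allow_source_column"):
        errors.append("mapping data_id to a source column requires explicit opt-in")
    return errors
```

Failing fast in the Helm client keeps bad configs from ever reaching jobs-manager or the cluster.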

Sequencing

#43, #44, #45, #21 can run in parallel. #86 blocks on all four. #46 follows #86. #87 stays deferred.

Out of scope (for follow-up epics)

  • Object-store data plane (Layer 3 from design discussion)
  • Dataset/DatasetVersion CRD with reconciler
  • Folding the four /global_meta/* calls into a single backend webhook
  • Cluster-local label orchestration (would let regression default to omit)
