Move your data into the tracebloc training environment — validated, clean, and ready for model evaluation. Your raw data never leaves your infrastructure.
```
Your raw data
      │
      ▼
┌──────────────────┐     ┌──────────────────────────────────┐
│ Data ingestor    │────►│ Your Kubernetes cluster          │
│                  │     │                                  │
│ Validates        │     │ Validated dataset                │
│ Preprocesses     │     │ (ready for training)             │
│ Transfers        │     │                                  │
└──────────────────┘     └──────────────┬───────────────────┘
                                        │
                                 Metadata only
                                        │
                                        ▼
                           ┌──────────────────────────┐
                           │    tracebloc web app     │
                           │ (dataset management UI)  │
                           └──────────────────────────┘
```
Only metadata (schema, statistics, structure) syncs to the web app. Raw data stays put.
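For intuition, the synced metadata is on the order of a schema plus summary statistics, never row-level data. A minimal, purely illustrative sketch (the field names here are hypothetical, not the actual sync payload):

```python
import csv
import io

def dataset_metadata(csv_text: str) -> dict:
    """Illustrative only: derive a schema and basic statistics from a
    CSV label file -- the kind of metadata that syncs. Raw rows stay put."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {
        "columns": list(rows[0].keys()) if rows else [],
        "num_rows": len(rows),
        "label_values": sorted({r["label"] for r in rows if "label" in r}),
    }

sample = "image_path,label\nimg_001.jpg,cat\nimg_002.jpg,dog\n"
meta = dataset_metadata(sample)
# meta describes the dataset's shape; no file contents are included
```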
| Type | Templates |
|---|---|
| Image | `image_classification`, `object_detection` |
| Text / NLP | `text_classification` |
| Tabular | `tabular_classification`, `tabular_regression` |
| Time series | `time_series_forecasting`, `time_to_event_prediction` |
Each template is a runnable starting point — copy it, point it at your data, ship it.
Install the package and copy a template:

```shell
pip install tracebloc-ingestor
cp templates/image_classification/ingestor.py .
```

Each template builds on the same primitives — `BaseIngestor`, `CSVIngestor`, validators — and overrides the parts that vary by data type.
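To make the override pattern concrete, here is a dependency-free sketch. `BaseIngestor` and `FileTypeValidator` below are simplified stand-ins for the package's classes, not their real implementations:

```python
class FileTypeValidator:
    """Stand-in validator: accept only files with allowed extensions."""
    def __init__(self, allowed):
        self.allowed = tuple(ext.lower() for ext in allowed)

    def __call__(self, path):
        return path.lower().endswith(self.allowed)

class BaseIngestor:
    """Stand-in base: run validators, then transform each record."""
    validators = []

    def transform(self, record):
        return record  # templates override this per data type

    def ingest(self, records):
        valid = [r for r in records
                 if all(v(r["path"]) for v in self.validators)]
        return [self.transform(r) for r in valid]

class ImageClassificationIngestor(BaseIngestor):
    # Only the type-specific parts change, as in the shipped templates
    validators = [FileTypeValidator(allowed=[".jpg", ".png"])]

ingested = ImageClassificationIngestor().ingest(
    [{"path": "img_001.jpg"}, {"path": "notes.txt"}]
)
```

The record that fails validation (`notes.txt`) is dropped before `transform` runs; the real classes add database and API plumbing on top of this shape.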
The ingestor runs inside your cluster, next to a tracebloc client. The provided `Dockerfile` and `ingestor-job.yaml` are the canonical pattern:
```shell
docker build -t <your-registry>/<image-name>:latest .
docker push <your-registry>/<image-name>:latest
kubectl apply -f ingestor-job.yaml
```

The Job needs these environment variables (set in `ingestor-job.yaml`):
| Variable | What it is |
|---|---|
| `CLIENT_ID`, `CLIENT_PASSWORD` | tracebloc client credentials |
| `CLIENT_PVC` | PVC name shared with the client (must match `values.yaml`) |
| `MYSQL_HOST` | Hostname of the client's MySQL service |
| `SRC_PATH` | Where your raw data is mounted in the ingestor pod |
| `LABEL_FILE` | Path to labels (e.g. `Xy_train.csv`) |
| `TABLE_NAME` | Destination table name in the client database |
| `TITLE` | (optional) Human-readable dataset name |
| `LOG_LEVEL` | (optional) `INFO`, `WARNING`, or `ERROR` |
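A small startup check inside the pod can fail fast on a misconfigured manifest. A hedged sketch (the variable names come from the table above; `load_config` itself is illustrative, not part of the package):

```python
import os

REQUIRED = ["CLIENT_ID", "CLIENT_PASSWORD", "CLIENT_PVC",
            "MYSQL_HOST", "SRC_PATH", "LABEL_FILE", "TABLE_NAME"]
OPTIONAL = {"TITLE": None, "LOG_LEVEL": "INFO"}

def load_config(env=os.environ):
    """Fail fast if a required variable is unset; default the optionals."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required env vars: {missing}")
    cfg = {name: env[name] for name in REQUIRED}
    for name, default in OPTIONAL.items():
        cfg[name] = env.get(name, default)
    return cfg
```

Running this first turns a cryptic mid-ingest failure into a one-line error in the Job's logs.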
If the namespace you're deploying into enforces the `restricted` Pod Security Standard (OpenShift, hardened clusters, many managed-Kubernetes namespaces), the stock `Dockerfile` and `ingestor-job.yaml` won't be admitted. Two changes are needed.
Check first:

```shell
kubectl get ns <namespace> -o jsonpath='{.metadata.labels}' | jq
```

Look for `pod-security.kubernetes.io/enforce: restricted`. If absent, the stock files admit fine and you can skip this section.
1. Dockerfile — drop root. Append these lines before `ENTRYPOINT`:

```dockerfile
# OpenShift-compatible: grant group write via GID 0
RUN chgrp -R 0 /app && chmod -R g=u /app
USER 1001
```

2. ingestor-job.yaml — add a hardened `securityContext`, at both the pod level and the container level:

```yaml
spec:
  template:
    spec:
      securityContext:              # pod-level
        runAsNonRoot: true
        runAsUser: 1001
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: api
          # ... existing container spec ...
          securityContext:          # container-level
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
```

For data that doesn't fit a template, subclass `BaseIngestor`:
```python
from tracebloc_ingestor import BaseIngestor, FileTypeValidator

class MyIngestor(BaseIngestor):
    validators = [FileTypeValidator(allowed=[".parquet"])]

    def transform(self, record):
        # your preprocessing
        return record

if __name__ == "__main__":
    MyIngestor().ingest()
```

The package exports `BaseIngestor`, `CSVIngestor`, `JSONIngestor`, plus validators (`FileTypeValidator`, `ImageResolutionValidator`, `TableNameValidator`) and the `Database` / `APIClient` helpers. See `examples/` for working scripts.
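As an example of what `transform` typically holds, here is a dependency-free sketch; the record shape and the cleanup rules are hypothetical:

```python
def transform(record: dict) -> dict:
    """Hypothetical preprocessing: normalize keys, strip string values."""
    return {
        key.strip().lower(): (value.strip() if isinstance(value, str) else value)
        for key, value in record.items()
    }

cleaned = transform({" Label ": " cat ", "Width": 224})
```

Keeping `transform` a pure record-in, record-out function like this makes it easy to unit-test before running the full ingest Job.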
- Python 3.8+
- A tracebloc account
- A running tracebloc client on your infrastructure
Platform · Docs · Data preparation guide · Discord
Apache 2.0 — see LICENSE.
Questions? support@tracebloc.io or open an issue.