Data Ingestors 📊

Move your data into the tracebloc training environment — validated, clean, and ready for model evaluation. Your raw data never leaves your infrastructure.

How it works

Your raw data
     │
     ▼
┌──────────────────┐     ┌──────────────────────────────────┐
│  Data ingestor   │────►│  Your Kubernetes cluster         │
│                  │     │                                  │
│  Validates       │     │  Validated dataset               │
│  Preprocesses    │     │  (ready for training)            │
│  Transfers       │     │                                  │
└──────────────────┘     └──────────────┬───────────────────┘
                                        │
                               Metadata only
                                        │
                                        ▼
                         ┌──────────────────────────┐
                         │  tracebloc web app       │
                         │  (dataset management UI) │
                         └──────────────────────────┘

Only metadata (schema, statistics, structure) syncs to the web app. Raw data stays put.
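
To make "metadata only" concrete, here is a minimal sketch of the kind of summary that could be derived from a dataset without including any raw rows. It is not the tracebloc implementation: the `summarize` function and the field names in its output are hypothetical, and the real ingestor may compute different statistics.

```python
import csv
import io
import statistics


def summarize(csv_text):
    """Return schema and per-column statistics, but no raw rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {"row_count": len(rows), "columns": {}}
    if not rows:
        return summary
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        try:
            numbers = [float(v) for v in values]
        except (TypeError, ValueError):
            # Non-numeric column: report only the type and cardinality.
            summary["columns"][name] = {"type": "string", "distinct": len(set(values))}
        else:
            summary["columns"][name] = {
                "type": "number",
                "min": min(numbers),
                "max": max(numbers),
                "mean": statistics.fmean(numbers),
            }
    return summary


# Hypothetical sample data, used only for illustration.
sample = "age,label\n31,cat\n45,dog\n38,cat\n"
metadata = summarize(sample)
print(metadata)
```

The summary carries column names, types, and aggregates, which is enough to drive a dataset management UI, while the individual records stay where they are.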

Supported data types

| Type | Templates |
| --- | --- |
| Image | `image_classification`, `object_detection` |
| Text / NLP | `text_classification` |
| Tabular | `tabular_classification`, `tabular_regression` |
| Time series | `time_series_forecasting`, `time_to_event_prediction` |

Each template is a runnable starting point — copy it, point it at your data, ship it.

Quickstart

1. Install

pip install tracebloc-ingestor

2. Pick a template

cp templates/image_classification/ingestor.py .

Each template builds on the same primitives — BaseIngestor, CSVIngestor, validators — and overrides the parts that vary by data type.

3. Deploy as a Kubernetes Job

The ingestor runs inside your cluster, next to a tracebloc client. The provided Dockerfile and ingestor-job.yaml are the canonical pattern:

docker build -t <your-registry>/<image-name>:latest .
docker push <your-registry>/<image-name>:latest
kubectl apply -f ingestor-job.yaml

The Job needs these environment variables (set in ingestor-job.yaml):

| Variable | What it is |
| --- | --- |
| `CLIENT_ID`, `CLIENT_PASSWORD` | Tracebloc client credentials |
| `CLIENT_PVC` | PVC name shared with the client (must match `values.yaml`) |
| `MYSQL_HOST` | Hostname of the client's MySQL service |
| `SRC_PATH` | Where your raw data is mounted in the ingestor pod |
| `LABEL_FILE` | Path to labels (e.g. `Xy_train.csv`) |
| `TABLE_NAME` | Destination table name in the client database |
| `TITLE` | (optional) Human-readable dataset name |
| `LOG_LEVEL` | (optional) `INFO`, `WARNING`, `ERROR` |
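
A misconfigured Job tends to fail late, so it can help to check the environment at the top of your ingestor script. The sketch below is an assumption about how you might do that, not part of the package: `load_config` is a hypothetical helper, it treats every variable not marked optional above as required, and the `INFO` default for `LOG_LEVEL` is a guess.

```python
import os

# Variables the table lists without "(optional)"; assumed to be required.
REQUIRED = [
    "CLIENT_ID",
    "CLIENT_PASSWORD",
    "CLIENT_PVC",
    "MYSQL_HOST",
    "SRC_PATH",
    "LABEL_FILE",
    "TABLE_NAME",
]
# Defaults here are assumptions, not documented behaviour.
OPTIONAL_DEFAULTS = {"TITLE": "", "LOG_LEVEL": "INFO"}


def load_config(env=None):
    """Collect the Job's settings, failing fast if a required one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    for name, default in OPTIONAL_DEFAULTS.items():
        config[name] = env.get(name, default)
    return config


# Illustration with a made-up environment; in the pod you would call load_config().
example_env = {name: "example" for name in REQUIRED}
print(load_config(example_env)["LOG_LEVEL"])
```

Passing the environment in as a dict keeps the check easy to test outside the cluster.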

Running under Pod Security Standards (`restricted`)

If the namespace you're deploying into enforces the restricted Pod Security Standard (OpenShift, hardened clusters, many managed-Kubernetes namespaces), pods created from the stock Dockerfile and ingestor-job.yaml will be rejected at admission. Two changes are needed.

Check first:

kubectl get ns <namespace> -o jsonpath='{.metadata.labels}' | jq

Look for pod-security.kubernetes.io/enforce: restricted. If the label is absent, the stock files are admitted as they are and you can skip this section.
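
If you would rather script the check, the label lookup is a one-liner once the labels are parsed. This sketch assumes you pass it the JSON text printed by the kubectl command above; `enforced_level` is a hypothetical helper and the sample label sets are made up.

```python
import json

ENFORCE_LABEL = "pod-security.kubernetes.io/enforce"


def enforced_level(labels_json):
    """Return the enforced Pod Security level, or None if the label is absent."""
    text = labels_json.strip()
    # A namespace with no labels produces empty output, so treat that as {}.
    labels = json.loads(text) if text else {}
    return labels.get(ENFORCE_LABEL)


# Made-up label sets for illustration.
hardened_ns = '{"kubernetes.io/metadata.name": "ml", "pod-security.kubernetes.io/enforce": "restricted"}'
plain_ns = '{"kubernetes.io/metadata.name": "default"}'
print(enforced_level(hardened_ns))
print(enforced_level(plain_ns))
```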

1. Dockerfile: drop root. Insert these lines before ENTRYPOINT:

# OpenShift-compatible: grant group write via GID 0
RUN chgrp -R 0 /app && chmod -R g=u /app
USER 1001

2. ingestor-job.yaml: add a hardened securityContext at both the pod level and the container level:

spec:
  template:
    spec:
      securityContext:                    # pod-level
        runAsNonRoot: true
        runAsUser: 1001
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: api
        # ... existing container spec ...
        securityContext:                  # container-level
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
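
Before applying the Job, you can sanity-check the manifest against the four settings shown above. The sketch below works on the manifest as a plain dict (load it however you like) and is only a partial check: `restricted_violations` is a hypothetical helper, and the restricted profile also constrains things it does not look at, such as volume types and host namespaces.

```python
def restricted_violations(job):
    """List the ways a Job manifest (as a dict) misses the four settings above."""
    pod_spec = job["spec"]["template"]["spec"]
    pod_sc = pod_spec.get("securityContext") or {}
    problems = []

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext") or {}

        # Container-level values take precedence over pod-level ones.
        run_as_non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot"))
        seccomp = (sc.get("seccompProfile") or pod_sc.get("seccompProfile") or {}).get("type")
        dropped = (sc.get("capabilities") or {}).get("drop") or []

        if run_as_non_root is not True:
            problems.append(f"{name}: runAsNonRoot must be true")
        if seccomp not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile.type must be RuntimeDefault or Localhost")
        if sc.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in dropped:
            problems.append(f"{name}: capabilities.drop must include ALL")

    return problems


# The hardened manifest from this section, written as a dict.
hardened_job = {"spec": {"template": {"spec": {
    "securityContext": {
        "runAsNonRoot": True,
        "runAsUser": 1001,
        "seccompProfile": {"type": "RuntimeDefault"},
    },
    "containers": [{
        "name": "api",
        "securityContext": {
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }],
}}}}
# A manifest with no securityContext at all.
stock_job = {"spec": {"template": {"spec": {"containers": [{"name": "api"}]}}}}

print(restricted_violations(hardened_job))
for problem in restricted_violations(stock_job):
    print(problem)
```

An empty list means the four fields are in place; the admission controller remains the final authority.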

Writing a custom ingestor

For data that doesn't fit a template, subclass BaseIngestor:

from tracebloc_ingestor import BaseIngestor, FileTypeValidator

class MyIngestor(BaseIngestor):
    validators = [FileTypeValidator(allowed=[".parquet"])]

    def transform(self, record):
        # your preprocessing
        return record

if __name__ == "__main__":
    MyIngestor().ingest()

The package exports BaseIngestor, CSVIngestor, JSONIngestor, plus validators (FileTypeValidator, ImageResolutionValidator, TableNameValidator) and the Database / APIClient helpers. See examples/ for working scripts.
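
If the validate-then-transform flow is new to you, the following self-contained sketch shows the general pattern. It deliberately does not import tracebloc_ingestor: `SketchIngestor` and `ExtensionValidator` are hypothetical stand-ins written for this illustration, and the real BaseIngestor and FileTypeValidator may have different method names, arguments, and behaviour.

```python
from pathlib import PurePath


class ExtensionValidator:
    """Stand-in validator: accepts paths whose suffix is in an allow-list."""

    def __init__(self, allowed):
        self.allowed = {ext.lower() for ext in allowed}

    def validate(self, path):
        return PurePath(path).suffix.lower() in self.allowed


class SketchIngestor:
    """Stand-in base class: run every validator, then transform what passes."""

    validators = []

    def transform(self, record):
        return record

    def ingest(self, paths):
        accepted, rejected = [], []
        for path in paths:
            if all(v.validate(path) for v in self.validators):
                accepted.append(self.transform({"path": path}))
            else:
                rejected.append(path)
        return accepted, rejected


class ParquetSketch(SketchIngestor):
    # Subclasses override only what varies: the validators and the transform.
    validators = [ExtensionValidator(allowed=[".parquet"])]

    def transform(self, record):
        record["name"] = PurePath(record["path"]).stem
        return record


accepted, rejected = ParquetSketch().ingest(["a.parquet", "notes.txt"])
print(accepted)
print(rejected)
```

The base class owns the loop and the subclass supplies the data-specific parts, which is the same division of labour the templates describe.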

Prerequisites

Python 3.8+
A tracebloc account
A running tracebloc client on your infrastructure

Links

Platform · Docs · Data preparation guide · Discord

License

Apache 2.0 — see LICENSE.

Questions? support@tracebloc.io or open an issue.
