tracebloc/data-ingestors

Data Ingestors 📊

Move your data into the tracebloc training environment — validated, clean, and ready for model evaluation. Your raw data never leaves your infrastructure.

How it works

Your raw data
     │
     ▼
┌──────────────────┐     ┌──────────────────────────────────┐
│  Data ingestor   │────►│  Your Kubernetes cluster         │
│                  │     │                                  │
│  Validates       │     │  Validated dataset               │
│  Preprocesses    │     │  (ready for training)            │
│  Transfers       │     │                                  │
└──────────────────┘     └──────────────┬───────────────────┘
                                        │
                               Metadata only
                                        │
                                        ▼
                         ┌──────────────────────────┐
                         │  tracebloc web app       │
                         │  (dataset management UI) │
                         └──────────────────────────┘

Only metadata (schema, statistics, structure) syncs to the web app. Raw data stays put.
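To make "metadata only" concrete, here is a minimal sketch of the kind of summary (schema, statistics, structure) that can be derived from a CSV without forwarding any rows. It uses only the Python standard library. The function name summarize_csv and the shape of the returned dict are illustrative assumptions, not the payload tracebloc actually sends.

```python
import csv
import io
import statistics


def summarize_csv(text):
    """Compute schema and summary statistics for CSV text; no rows are returned."""
    rows = list(csv.DictReader(io.StringIO(text)))
    summary = {"row_count": len(rows), "columns": {}}
    for name in (rows[0].keys() if rows else []):
        values = [row[name] for row in rows]
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            # At least one value is not numeric: treat the column as strings.
            summary["columns"][name] = {"type": "string", "distinct": len(set(values))}
        else:
            summary["columns"][name] = {
                "type": "number",
                "min": min(numbers),
                "max": max(numbers),
                "mean": statistics.fmean(numbers),
            }
    return summary


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = "age,label\n34,cat\n29,dog\n41,cat\n"
    print(summarize_csv(sample))
```

The summary describes column names, inferred types and aggregates; the individual records stay wherever the CSV lives.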

Supported data types

Type          Templates
Image         image_classification, object_detection
Text / NLP    text_classification
Tabular       tabular_classification, tabular_regression
Time series   time_series_forecasting, time_to_event_prediction

Each template is a runnable starting point — copy it, point it at your data, ship it.

Quickstart

1. Install

pip install tracebloc-ingestor

2. Pick a template

cp templates/image_classification/ingestor.py .

Each template builds on the same primitives — BaseIngestor, CSVIngestor, validators — and overrides the parts that vary by data type.

3. Deploy as a Kubernetes Job

The ingestor runs inside your cluster, next to a tracebloc client. The provided Dockerfile and ingestor-job.yaml are the canonical pattern:

docker build -t <your-registry>/<image-name>:latest .
docker push <your-registry>/<image-name>:latest
kubectl apply -f ingestor-job.yaml

The Job needs these environment variables (set in ingestor-job.yaml):

Variable                     What it is
CLIENT_ID, CLIENT_PASSWORD   tracebloc client credentials
CLIENT_PVC                   PVC name shared with the client (must match values.yaml)
MYSQL_HOST                   Hostname of the client's MySQL service
SRC_PATH                     Where your raw data is mounted in the ingestor pod
LABEL_FILE                   Path to labels (e.g. Xy_train.csv)
TABLE_NAME                   Destination table name in the client database
TITLE (optional)             Human-readable dataset name
LOG_LEVEL (optional)         INFO, WARNING, ERROR
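A missing variable is easier to diagnose when the Job fails fast with a clear message. The sketch below is a pre-flight check you could run at the top of your own ingestor script. The variable names come from the table above; the load_config helper and the default values for the optional variables are assumptions made for illustration and are not part of tracebloc-ingestor.

```python
import os

# Names taken from the table above. The required/optional split follows
# the "(optional)" markers there.
REQUIRED = (
    "CLIENT_ID", "CLIENT_PASSWORD", "CLIENT_PVC", "MYSQL_HOST",
    "SRC_PATH", "LABEL_FILE", "TABLE_NAME",
)
# Assumed defaults, chosen for this sketch only.
OPTIONAL_DEFAULTS = {"TITLE": "", "LOG_LEVEL": "INFO"}


def load_config(env=None):
    """Return the ingestor settings as a dict, or raise if a required one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    for name, default in OPTIONAL_DEFAULTS.items():
        config[name] = env.get(name, default)
    return config
```

Calling load_config() with no argument reads os.environ; passing a plain dict makes the check easy to exercise outside the cluster.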

Running under Pod Security Standards (restricted)

If the namespace you're deploying into enforces the restricted Pod Security Standard (OpenShift, hardened clusters, many managed-Kubernetes namespaces), pods built from the stock Dockerfile and ingestor-job.yaml will be rejected at admission. Two changes are needed.

Check first:

kubectl get ns <namespace> -o jsonpath='{.metadata.labels}' | jq

Look for pod-security.kubernetes.io/enforce: restricted. If the label is absent, the stock files are admitted as-is and you can skip this section.
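The same check can be scripted, for example in a deployment pipeline. This is a sketch that assumes kubectl is on the PATH and already authenticated against the cluster; the helper names enforces_restricted and namespace_labels are made up for this example.

```python
import json
import subprocess

ENFORCE_LABEL = "pod-security.kubernetes.io/enforce"


def enforces_restricted(labels):
    """Return True if a namespace's label dict enforces the restricted standard."""
    return (labels or {}).get(ENFORCE_LABEL) == "restricted"


def namespace_labels(namespace):
    """Fetch a namespace's labels via kubectl (needs kubectl and cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "ns", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["metadata"].get("labels", {})
```

enforces_restricted(namespace_labels("<namespace>")) tells you whether the two changes below are needed.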

1. Dockerfile: drop root. Insert these lines before ENTRYPOINT:

# OpenShift-compatible: grant group write via GID 0
RUN chgrp -R 0 /app && chmod -R g=u /app
USER 1001

2. ingestor-job.yaml: add a hardened securityContext at both the pod level and the container level.

spec:
  template:
    spec:
      securityContext:                    # pod-level
        runAsNonRoot: true
        runAsUser: 1001
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: api
        # ... existing container spec ...
        securityContext:                  # container-level
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
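Before applying the Job, you can sanity-check the manifest against the four settings shown above. The sketch below works on a manifest that has already been parsed into a Python dict (for example with a YAML library of your choice). It is a partial check under stated assumptions: it looks only at containers, not initContainers, and does not cover the other rules of the restricted profile such as volume types. The function name restricted_violations is hypothetical.

```python
def restricted_violations(job):
    """List which of the four hardening settings are missing from a Job manifest dict."""
    pod_spec = job["spec"]["template"]["spec"]
    pod_ctx = pod_spec.get("securityContext") or {}
    problems = []
    for container in pod_spec.get("containers", []):
        ctx = container.get("securityContext") or {}
        name = container.get("name", "<unnamed>")
        # Container-level values take precedence over pod-level ones.
        if ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot")) is not True:
            problems.append(f"{name}: runAsNonRoot must be true")
        seccomp = ctx.get("seccompProfile") or pod_ctx.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            problems.append(f"{name}: seccompProfile.type must be RuntimeDefault or Localhost")
        # These two fields exist only at container level.
        if ctx.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in (ctx.get("capabilities") or {}).get("drop", []):
            problems.append(f"{name}: capabilities.drop must include ALL")
    return problems
```

An empty list means the four settings are present; the API server's admission check remains the authority on whether the pod is actually accepted.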

Writing a custom ingestor

For data that doesn't fit a template, subclass BaseIngestor:

from tracebloc_ingestor import BaseIngestor, FileTypeValidator

class MyIngestor(BaseIngestor):
    validators = [FileTypeValidator(allowed=[".parquet"])]

    def transform(self, record):
        # your preprocessing
        return record

if __name__ == "__main__":
    MyIngestor().ingest()

The package exports BaseIngestor, CSVIngestor, JSONIngestor, plus validators (FileTypeValidator, ImageResolutionValidator, TableNameValidator) and the Database / APIClient helpers. See examples/ for working scripts.

Prerequisites

Links

Platform · Docs · Data preparation guide · Discord

License

Apache 2.0 — see LICENSE.

Questions? support@tracebloc.io or open an issue.
