Overview
Make your data available to the Kubernetes cluster so it can be used for training and evaluation. Regardless of whether your client runs on Azure, AWS, Google Cloud, or a local Minikube setup, the process of ingesting datasets works the same way. The data ingestor is a lightweight service that bridges your raw data and the cluster's persistent storage. It comes with ready-made templates (CSV, images, text) that you can use as starting points and customize for your own dataset. Because the ingestion step is containerized, the ingestor validates data format and schema, enforces consistency, and transfers the dataset securely into the cluster's SQL storage, where it becomes accessible to all training and evaluation jobs.

This guide covers:
- Customizing ingestor templates for different data types (CSV, images, text)
- Deploying the data ingestor for training and test data using Kubernetes
- Managing datasets through the tracebloc interface
Quick Setup
Use this quick setup if you already have an ingestor configured and just want to switch datasets or toggle between training and testing. If you are setting up for the first time, go to the next section for the detailed walkthrough.

Steps
- Pick a template script and edit it, e.g. `/templates/tabular_classification/tabular_classification.py`
- Update `csv_options` and `data_path`
- Only for tabular data: Update schema
- Set `schema` and `CSVIngestor()` parameters such as `category`, `intent`, and `label_column` to match your data type, task, and train/test purpose
- Build and push the Docker image
- Edit `ingestor-job.yaml`:
  - `metadata.name`: Unique job name (e.g. `ingestor-job-train` and `ingestor-job-test`)
  - `image`: The tag you built and pushed
  - `LABEL_FILE`: Path inside the container (e.g. `/data/train.csv`); for tabular data, this points to the CSV file with labels and/or data
  - `TABLE_NAME`: Unique table name (no spaces, one per dataset); title is optional
  - `PATH_TO_LOCAL_DATASET_FILE`: Path to your dataset file within the container
  - `SRC_PATH`: Root inside the container where your files are mounted
- Deploy to Kubernetes
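Assuming a DockerHub account and a configured `kubectl` context (the image name and tag below are illustrative placeholders), the quick-setup build-push-deploy cycle might look like:

```
# Build for the cluster architecture (most cloud nodes are linux/amd64)
docker build --platform linux/amd64 -t <your-dockerhub-user>/data-ingestor:train .

# Push to the registry so the cluster can pull the image
docker push <your-dockerhub-user>/data-ingestor:train

# Deploy the job manifest
kubectl apply -f ingestor-job.yaml
```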
Detailed Setup
1. Configure a Template
This section walks you through the step-by-step setup of a data ingestor. You will clone the repository, select the right template for your data type, and customize it to match your task. Follow this guide if you are setting up an ingestor for the first time or need full control beyond the quick setup.

Clone the Data Ingestor Repository
Clone the public Data Ingestor GitHub repository. The templates live in the `/templates/` folder; in most cases you only need to make minimal adjustments.
IMPORTANT: Datasets must be cleaned and preprocessed before ingestion. Participants cannot view, clean, or fix raw data, so model performance will only be as good as the data you provide.
Choose a Template
Select the appropriate template from the `/templates/` folder based on your data and task type.
Each template is already configured with the correct data category and format:
| Data Type | Template File | Data Category | Data Format |
|---|---|---|---|
| Tabular | templates/tabular_classification/tabular_classification.py | TaskCategory.TABULAR_CLASSIFICATION | DataFormat.TABULAR |
| Image | templates/image_classification/image_classification.py | TaskCategory.IMAGE_CLASSIFICATION | DataFormat.IMAGE |
| Image | templates/object_detection/object_detection.py | TaskCategory.OBJECT_DETECTION | DataFormat.IMAGE |
| Text | templates/text_classification/text_classification.py | TaskCategory.TEXT_CLASSIFICATION | DataFormat.TEXT |
High-Level Template Structure
All templates follow the same structure and come with a corresponding `ingestor_job.yaml`. The key configuration values are:
- `config.LABEL_FILE`: Path to the local CSV label file
- `config.BATCH_SIZE`: Batch size used during ingestion
Customize a Template
Templates provide a starting point, but every dataset has its own format and labels. In this step you adapt the template to your data by tuning CSV ingestion options and setting the ingestor parameters (category, label column, intent, data path, and schema). The following example in `templates/tabular_classification/tabular_classification.py` shows how to ingest a tabular dataset, but the setup works the same way for image or text data.
Needed for Tabular Data: Define Schema
Define the dataset schema as a Python dictionary, mapping each column to its SQL type and constraints. Do not include IDs or the label column in the schema.

Needed for Image Classification Data: Define Image Options
Define the image size and file extension.

Needed for Object Detection Data: Define Image Options
Define the file extension.

Needed for Text Data: Define File Extension
Define the file extensions.

Set CSV ingestion options
Customize parsing, memory handling, and data cleaning with the `csv_options` dictionary.

Set Up the Ingestor
Define the Ingestor instance with the required configuration. See the tabular data example below:
- `category`: choose the ML task type (`TABULAR_CLASSIFICATION`, `IMAGE_CLASSIFICATION`, `OBJECT_DETECTION`)
- `label_column`: target column or class labels
- `intent`: set to `TRAIN` or `TEST` depending on the dataset purpose
- include `file_options` or `schema` depending on the data type
Full working examples can be found in the `templates/` folder.
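As a rough sketch, the pieces described above might fit together as follows. The column names, SQL types, and option values are illustrative, and the exact `CSVIngestor` signature is defined in the template scripts, so check the template for the authoritative version:

```python
# Illustrative schema: map each feature column to its SQL type and
# constraints. IDs and the label column are deliberately left out.
schema = {
    "age": "INT NOT NULL",
    "annual_income": "FLOAT",
    "region": "VARCHAR(32)",
}

# Illustrative CSV parsing options; tune these to your file's dialect.
csv_options = {
    "sep": ",",
    "encoding": "utf-8",
    "na_values": ["", "NA"],
}

# The ingestor itself is then configured in the template script,
# roughly along these lines (parameter names follow this guide;
# the exact constructor signature lives in the template):
#
# ingestor = CSVIngestor(
#     category=TaskCategory.TABULAR_CLASSIFICATION,
#     intent="TRAIN",              # or "TEST"
#     label_column="claim_approved",
#     schema=schema,
#     csv_options=csv_options,
# )
```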
2. Build Docker Image
With your template configured, the next step is to package it into a Docker image so it can run inside the Kubernetes cluster.

Edit Dockerfile
Before building the image, update your `Dockerfile` so that both the dataset and the ingestion script are copied into the container. This ensures the ingestor has everything it needs at runtime, independent of your local file system.
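A minimal sketch of the relevant `Dockerfile` lines, with illustrative file names and target paths (match them to the paths you set in `ingestor-job.yaml`):

```dockerfile
# Copy the customized ingestion script into the container
COPY tabular_classification.py /app/tabular_classification.py

# Copy the label file (and, for non-tabular data, the data files)
COPY train.csv /data/train.csv
```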
Copy data files
For all use cases except tabular data (where labels and features are contained within a single `labels.csv` file), copy the data files into the Docker container.

Build Docker Image
You need a Docker username and password to proceed with the next step. Most cloud platforms (AWS, Azure, GCP) run on Linux AMD64. Specifying `--platform linux/amd64` guarantees compatibility, particularly if you build images on Apple Silicon (M1/M2) or other ARM-based systems. Pick a setup, then build and deploy the image:
For Local Development/Testing
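For a local cluster, a build might look like this (the image name and tag are illustrative):

```
# Build the image locally
docker build -t data-ingestor:local .

# If you run Minikube, load the image into the cluster's runtime
# so imagePullPolicy: IfNotPresent can find it
minikube image load data-ingestor:local
```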
For Cloud Deployment (AWS, Azure, GCP)
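For a cloud cluster pulling from DockerHub, build for `linux/amd64` and push (username and tag are illustrative placeholders):

```
docker build --platform linux/amd64 -t <dockerhub-user>/data-ingestor:v1 .
docker push <dockerhub-user>/data-ingestor:v1
```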
3. Configure Kubernetes
With the image built and pushed to the registry, edit `ingestor-job.yaml` with your settings:
- `JOBNAME`: distinguish between train and test data jobs
- `NAMESPACE`: use the same namespace as your client
- `image`: your Docker image (`imagePullPolicy: Always` for DockerHub, `IfNotPresent` for local)
- `CLIENT_ID`, `CLIENT_PASSWORD`: from the tracebloc client view
- `TABLE_NAME`: unique per dataset, no spaces; train and test data must use different names
- `LABEL_FILE`: path inside the ingestor container; for images this is usually a CSV with file path and label columns. Ensure it matches the copy path in the `Dockerfile`
- `PATH_TO_LOCAL_DATASET_FILE`: path to your dataset file within the container
- `SRC_PATH`: root inside the container where your files are mounted
- `YOUR_COMPANY_OR_ORGANISATION_NAME`: choose a suitable company or organisation name
- `BATCH_SIZE`: number of entries sent per request. Depends on available CPU memory, not data size (e.g. image dimensions). Too large a value can exhaust memory. Tested up to 10,000, but 5,000 is a safe default for most systems
- `LOG_LEVEL`: "WARNING" for all warnings and errors, "INFO" for all logs, "ERROR" for errors only
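Putting these settings together, the relevant parts of `ingestor-job.yaml` might look roughly like this. All names and values are illustrative; keep the fields and structure your template's manifest already defines:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ingestor-job-train
  namespace: <your-client-namespace>
spec:
  template:
    spec:
      containers:
        - name: ingestor
          image: <dockerhub-user>/data-ingestor:v1
          imagePullPolicy: Always
          env:
            - name: TABLE_NAME
              value: insurance-claims-train
            - name: LABEL_FILE
              value: /data/train.csv
            - name: BATCH_SIZE
              value: "5000"
            - name: LOG_LEVEL
              value: "INFO"
      restartPolicy: Never
```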
4. Deploy
Run the ingestor as a Kubernetes Job. Change the `JOBNAME` and `TABLE_NAME` values for each run (e.g. `ingestor-job-train` / `ingestor-job-test`), and set `intent` to `TRAIN` or `TEST` accordingly in your template script.
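With standard kubectl tooling (the job name below is illustrative), deploying and checking on the job might look like:

```
kubectl apply -f ingestor-job.yaml

# Check that the job and its pod are running
kubectl get jobs,pods -n <NAMESPACE>

# Follow the ingestor logs
kubectl logs -f job/ingestor-job-train -n <NAMESPACE>
```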
The data ingestor always runs a validation step before ingesting data and moving files.
Verify Deployment
Verify that the jobs and pods are deployed successfully and running.

Dataset Management Interface
View your datasets at ai.tracebloc.io/data after successful deployment. The interface displays:
- Dataset name, ID, and record count
- Data type (Tabular, Image, Text) and purpose (Training/Testing)
- Namespace and GPU requirements
Best Practices
- Deploy jobs for training and testing simultaneously using different job names
- Use consistent, descriptive table names (e.g., `insurance-claims-train`, `insurance-claims-test`)
- Validate data schemas before deployment to prevent ingestion failures
- Clean data before ingestion - Participants cannot view, clean, or fix raw data, so model performance depends entirely on the quality of data you provide
Troubleshooting
Recommended for debugging: Use k9s, a terminal-based Kubernetes dashboard, to monitor jobs, pods, and logs in real time. Run `k9s -n <NAMESPACE>` to get a live view of resources, switch between them instantly, and inspect logs or events with a few keystrokes. Compared to kubectl, it is faster and more convenient.
Stale Kubernetes Job preventing new Job execution:
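Kubernetes Jobs are immutable, so a completed or failed Job with the same name blocks a re-deployment. Assuming the job names used in this guide, deleting the stale Job before re-applying usually resolves this:

```
kubectl delete job ingestor-job-train -n <NAMESPACE>
kubectl apply -f ingestor-job.yaml
```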
Next Steps
- Define and publish your use case: Define Use Case
Need Help?
- Email us at [email protected]