Installer Options
Override defaults by setting environment variables before the install command. Useful for a custom cluster name, extra worker nodes, or a different data directory.
| Variable | Default | Description |
|---|---|---|
| CLUSTER_NAME | tracebloc | Name of the k3d cluster |
| SERVERS | 1 | Number of control-plane nodes |
| AGENTS | 1 | Number of worker nodes |
| K8S_VERSION | v1.29.4-k3s1 | k3s image tag |
| HOST_DATA_DIR | ~/.tracebloc | Persistent data directory on the host. Must be a local disk (NFS/CIFS/SMB is rejected; the database corrupts on network storage). |
| HOST_DATASET_DIR | (unset) | Optional. Place the large dataset volume on a separate mount (e.g. a network/NFS share); the database + logs stay on HOST_DATA_DIR. Must already exist and be writable. |
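
As an illustration, a session that overrides a few of these variables might look like the sketch below. Only the variable names come from the table; the values (cluster name, agent count, paths) are hypothetical, and the install command itself is not repeated here. Run it in the same shell once the variables are exported.

```shell
# Hypothetical values; only the variable names come from the table above.
export CLUSTER_NAME="tracebloc-lab"          # custom k3d cluster name
export AGENTS=2                              # one extra worker node (default is 1)
export HOST_DATA_DIR="$HOME/tracebloc-data"  # must be on a local disk

# Optional: put the large dataset volume on a separate mount.
# The directory must already exist and be writable.
# export HOST_DATASET_DIR="/mnt/datasets"

# Print the effective settings, then run the install command
# in this same shell session so it inherits them.
printf 'CLUSTER_NAME=%s\nAGENTS=%s\nHOST_DATA_DIR=%s\n' \
  "$CLUSTER_NAME" "$AGENTS" "$HOST_DATA_DIR"
```

Variables left unset keep the defaults listed in the table.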
Cluster Management
The installer creates a k3d cluster that runs inside Docker. You can stop it to free resources, start it again later, or delete it entirely. Your data persists in HOST_DATA_DIR between stop/start cycles.
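The stop, start, and delete operations described above map onto the standard k3d cluster subcommands. The sketch below wraps them in a small helper; the function name tb_cluster is hypothetical, and the cluster name falls back to the installer default when CLUSTER_NAME is not set.

```shell
# Cluster name defaults to the installer's default, tracebloc.
CLUSTER_NAME="${CLUSTER_NAME:-tracebloc}"

# Hypothetical helper: tb_cluster stop|start|delete|list
tb_cluster() {
  case "$1" in
    stop)   k3d cluster stop "$CLUSTER_NAME" ;;    # free CPU and memory
    start)  k3d cluster start "$CLUSTER_NAME" ;;   # resume a stopped cluster
    delete) k3d cluster delete "$CLUSTER_NAME" ;;  # remove the cluster entirely
    list)   k3d cluster list ;;                    # show clusters and their state
    *)      echo "usage: tb_cluster stop|start|delete|list" >&2; return 2 ;;
  esac
}
```

For example, tb_cluster stop before closing a laptop and tb_cluster start afterwards.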
View logs
The jobs manager is the main tracebloc process. Check its logs when debugging connectivity or job execution issues.
Useful commands
Common kubectl commands help you inspect cluster state. Installer logs are written to ~/.tracebloc/install-*.log.
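A minimal sketch of the log and inspection steps above, using only standard kubectl subcommands. The namespace value and the assumption that the jobs manager pod name contains jobs-manager are guesses, not confirmed by this page; adjust both to match your install. The helper names are hypothetical.

```shell
NS="${NS:-tracebloc}"   # assumption: namespace the client runs in

# Follow the jobs manager logs.
# Assumption: the pod name contains "jobs-manager".
tb_logs() {
  pod="$(kubectl get pods -n "$NS" -o name | grep jobs-manager | head -n 1)"
  [ -n "$pod" ] || { echo "no jobs-manager pod found in $NS" >&2; return 1; }
  kubectl logs -n "$NS" -f "$pod"
}

# Quick overview of cluster state.
tb_status() {
  kubectl get nodes -o wide
  kubectl get pods -n "$NS"
  kubectl get events -n "$NS" --sort-by=.lastTimestamp
}

# Path of the most recent installer log, if any.
tb_install_log() {
  ls -t "$HOME"/.tracebloc/install-*.log 2>/dev/null | head -n 1
}
```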
GPU Support
GPU is automatic on Linux: the installer detects your hardware and sets up drivers, the container toolkit, and the Kubernetes device plugin.
NVIDIA (Linux)
Fully automatic. The installer:
- Detects NVIDIA GPUs via nvidia-smi or lspci
- Installs drivers if missing (Ubuntu, RHEL/CentOS, Arch)
- Installs the NVIDIA Container Toolkit and configures Docker
- Deploys the NVIDIA k8s device plugin into the cluster
- Passes --gpus=all to k3d
AMD (Linux)
Auto-detected. ROCm is installed automatically on Ubuntu and RHEL/CentOS. A logout/login may be needed for full GPU access.
macOS
CPU only. Docker Desktop on macOS does not support GPU passthrough. For GPU workloads, deploy on a Linux machine with NVIDIA GPUs or use AWS (EKS).
Windows
The installer does not install GPU drivers on Windows. Pre-install NVIDIA drivers before running the installer. The installer detects them via nvidia-smi and configures the cluster to use them.
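To confirm that NVIDIA GPU setup worked, one option is to check the driver on the host and then look at what each node advertises to Kubernetes. This is a sketch with hypothetical helper names; nvidia.com/gpu is the resource name the NVIDIA device plugin registers, and nodes without it show no value in the GPU column.

```shell
# Host check: succeeds only if the NVIDIA driver responds.
gpu_host_ok() {
  nvidia-smi -L >/dev/null 2>&1
}

# Cluster check: print each node with its allocatable nvidia.com/gpu count.
gpu_cluster_report() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

A usage example: gpu_host_ok && gpu_cluster_report.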
Manual Deployment
Skip the installer entirely. Use this if you already have a Kubernetes cluster, need custom resource limits, or want full control over the Helm deployment. A single chart, tracebloc/client, supports AKS, EKS, bare-metal, and OpenShift; choose your platform via values overrides. Reference defaults live at client/ci/{aks,eks,bm,oc}-values.yaml.
Add the Helm repository
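A sketch of this step using standard Helm commands. The repository alias tracebloc is inferred from the chart reference tracebloc/client, and the URL is the one the auto-upgrade job is described as polling; treat both as assumptions if your setup differs. The helper name is hypothetical.

```shell
# Assumptions: alias "tracebloc" (from the chart reference tracebloc/client)
# and the repository URL mentioned in the auto-upgrade section.
tb_add_repo() {
  helm repo add tracebloc https://tracebloc.github.io/client
  helm repo update
  helm search repo tracebloc/client   # confirm the chart is visible
}
```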
Get default values
Export the chart’s default configuration to customize it.
Configure values.yaml
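One possible workflow is to export the defaults for reference and keep your changes in a small override file. The sketch below assumes the repository alias tracebloc; clientId and clientPassword are the top-level keys named in the migration notes later in this document, while the file names, helper names, and placeholder values are hypothetical.

```shell
# Export the chart defaults for reference (helm show values is standard Helm).
tb_export_defaults() {
  helm show values tracebloc/client > "${1:-values.yaml}"
}

# Write a small override file. Replace the placeholders with the
# credentials from the tracebloc client view.
tb_write_override() {
  cat > "${1:-my-values.yaml}" <<'EOF'
clientId: "<your-client-id>"
clientPassword: "<your-client-password>"
EOF
}
```

Keeping overrides separate from the exported defaults makes later diffs against new chart versions easier to read.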
Authentication
Set your Client ID and password from the tracebloc client view.
Resource Limits for Training Jobs
Defaults are sized for typical workloads. Override per job size; for GPU support, requests and limits must be equal.
Storage
Configure the storage class and PVC sizes.
The database must stay on local disk. MySQL/InnoDB is unsafe on NFS/CIFS, so the database and logs always use the local /tracebloc tree. To place large datasets on a network mount, set the installer’s HOST_DATASET_DIR; it relocates only the dataset volume (the chart’s hostPath.datasetPath → /tracebloc-data) and runs ingestion as the mount’s owner uid, so writes succeed under NFS root_squash.
Docker Registry
The chart pulls the client image from a container registry, and credentials are required in production. Use a token, not a plaintext password. The pull secret is named {{ .Release.Name }}-regcred. Omit the dockerRegistry block entirely to skip pull-secret creation (e.g. when using a public mirror).
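After deploying, you can check that the pull secret exists under the naming pattern described above. This is a sketch: the release name, namespace, and helper names are hypothetical, and the command only reads the secret's type (a registry pull secret normally reports kubernetes.io/dockerconfigjson).

```shell
RELEASE="${RELEASE:-tracebloc}"  # hypothetical release name
NS="${NS:-tracebloc}"            # hypothetical namespace

# The pull secret follows the <release>-regcred naming pattern.
tb_regcred_name() { printf '%s-regcred\n' "$RELEASE"; }

# Print the secret's type; fails if the secret does not exist.
tb_check_regcred() {
  kubectl get secret "$(tb_regcred_name)" -n "$NS" \
    -o jsonpath='{.type}{"\n"}'
}
```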
Proxy (optional)
Auto-upgrade (on by default)
Releases of chart 1.3.0+ install a <release>-auto-upgrade CronJob that polls https://tracebloc.github.io/client daily and runs helm upgrade --reset-then-reuse-values whenever a newer chart version is published, so clients auto-update instead of staying pinned to the version they were installed with.
The upgrade job runs with the cluster-admin ClusterRole because the chart templates cluster-scoped resources (PriorityClass, StorageClass, ClusterRoleBinding, optionally Namespace). Disable auto-upgrade if you need a manual approval gate on upgrades.
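To see what the auto-upgrade job is doing, you can inspect the CronJob or trigger a one-off run with standard kubectl commands. A sketch, assuming a hypothetical release name and namespace; the helper names are illustrative and the manual job name is arbitrary.

```shell
RELEASE="${RELEASE:-tracebloc}"  # hypothetical release name
NS="${NS:-tracebloc}"            # hypothetical namespace

# CronJob name follows the <release>-auto-upgrade pattern.
tb_autoupgrade_name() { printf '%s-auto-upgrade\n' "$RELEASE"; }

# Show schedule, suspend state, and last run time.
tb_autoupgrade_status() {
  kubectl get cronjob "$(tb_autoupgrade_name)" -n "$NS"
}

# Start a one-off Job from the CronJob template without waiting for the schedule.
tb_autoupgrade_run_now() {
  kubectl create job "manual-upgrade-$(date +%s)" \
    --from="cronjob/$(tb_autoupgrade_name)" -n "$NS"
}
```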
NetworkPolicy hardening for training pods
Training pods run untrusted ML code. The chart can apply a NetworkPolicy that denies ingress and restricts egress to DNS + external HTTPS only, blocking pod-to-pod, MySQL, and Kubernetes API access from the training pod. Enforcement depends on the cluster's CNI:
| Platform | Notes |
|---|---|
| AKS | needs --network-policy azure (Azure NPM) or Calico at cluster create |
| EKS | needs Calico or Cilium add-on (the default AWS VPC CNI alone does not enforce) |
| Bare-metal | needs Calico / Cilium / kube-router (Flannel alone does not enforce) |
| OpenShift | OVN-Kubernetes enforces by default |
Set enabled: false on clusters without an enforcing CNI; silently having no protection is worse than explicitly disabling it.
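If you are unsure whether your cluster enforces NetworkPolicy, a rough heuristic is to look for DaemonSets belonging to the CNIs listed in the table. The sketch below is only that: the name patterns are assumptions about how these components are usually deployed, a match does not prove enforcement, and the helper names are hypothetical.

```shell
NS="${NS:-tracebloc}"   # hypothetical namespace

# Heuristic: list DaemonSets whose names suggest an enforcing CNI.
# Returns 1 when nothing matches.
tb_enforcing_cni() {
  kubectl get daemonsets --all-namespaces -o name \
    | grep -E -i 'calico|cilium|kube-router|azure-npm|ovnkube' \
    || { echo "no enforcing CNI DaemonSet found" >&2; return 1; }
}

# List the NetworkPolicy objects present in the release namespace.
tb_list_policies() {
  kubectl get networkpolicy -n "$NS"
}
```

A policy object can exist without being enforced, so it helps to run both checks.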
Resource Monitor and node-agents namespace
The tracebloc-resource-monitor DaemonSet collects node-level CPU/memory metrics. It mounts hostPath volumes (/proc, /sys), which Pod Security Admission’s restricted profile bans, so the chart isolates it in a dedicated privileged namespace (default tracebloc-node-agents).
With create: false, create the namespace yourself with the required PSA labels.
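Creating that namespace by hand could look like the sketch below. The label keys are the standard Pod Security Admission keys; applying the privileged level to all three modes is an assumption based on the description above, and the helper name is hypothetical.

```shell
# Default namespace name from the section above.
NODE_AGENTS_NS="${NODE_AGENTS_NS:-tracebloc-node-agents}"

# Create the namespace and label it for the privileged PSA level.
# Assumption: all three modes (enforce, audit, warn) use "privileged".
tb_create_node_agents_ns() {
  kubectl create namespace "$NODE_AGENTS_NS"
  kubectl label namespace "$NODE_AGENTS_NS" --overwrite \
    pod-security.kubernetes.io/enforce=privileged \
    pod-security.kubernetes.io/audit=privileged \
    pod-security.kubernetes.io/warn=privileged
}
```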
The resource monitor requires metrics-server. It is bundled on k3d/k3s/AKS, present on OpenShift, and must be installed manually on EKS (kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml).
Pod Security Admission labels
Training jobs run untrusted user-supplied ML code. The chart can create the release namespace with Pod Security Admission warn/audit/enforce labels at the restricted profile for defense-in-depth.
If you keep create: false (default) and want PSA labels on an existing namespace, label the namespace yourself.
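Labeling an existing namespace can be sketched with kubectl label. The label keys are the standard Pod Security Admission keys; the helper names are hypothetical. The preview helper uses a server-side dry run, which reports warnings for pods that would violate the profile without changing anything.

```shell
# Preview: server-side dry run, no changes are persisted.
tb_preview_restricted() {
  kubectl label --dry-run=server --overwrite namespace "$1" \
    pod-security.kubernetes.io/enforce=restricted
}

# Apply the restricted profile in all three modes.
tb_label_restricted() {
  kubectl label namespace "$1" --overwrite \
    pod-security.kubernetes.io/enforce=restricted \
    pod-security.kubernetes.io/audit=restricted \
    pod-security.kubernetes.io/warn=restricted
}
```

Running the preview first is a cautious choice on namespaces that already host workloads.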
Image digest pinning
Pin images by content hash for reproducible deploys. When digest is set, tag is ignored and imagePullPolicy drops to IfNotPresent.
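To obtain a digest for pinning, one approach is to read it from an image you have already pulled. The sketch below uses docker inspect, which prints references in the form repository@sha256:..., and strips everything up to the @. The helper name is hypothetical, and where the digest goes in values.yaml depends on the chart's image block, which is not shown here.

```shell
# Print the sha256 digest of a locally pulled image.
# The image must have been pulled from a registry, otherwise
# RepoDigests is empty and docker inspect reports an error.
tb_image_digest() {
  docker inspect --format '{{index .RepoDigests 0}}' "$1" | sed 's/^.*@//'
}
```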
PriorityClass and PodDisruptionBudgets
The chart pins the MySQL pod with a tracebloc-data-plane PriorityClass (value 1000000) so it survives node-level OOM and scheduling pressure, and applies PDBs to MySQL and the jobs manager. Override only if you run a multi-replica MySQL externally.
Deploy
Install the chart into a new namespace.
Upgrade
The auto-upgrade CronJob handles routine version bumps. To upgrade manually, run helm upgrade for the release. When upgrading into chart 1.3.0 from 1.2.x, use --reset-then-reuse-values (not plain --reuse-values): the new autoUpgrade block did not exist in 1.2.x, and a plain reuse fails template rendering.
Uninstall
Some resources carry the annotation helm.sh/resource-policy: keep so your data and shared cluster resources survive uninstall. To remove them too, delete them explicitly.
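The deploy, upgrade, and uninstall steps above can be sketched with standard Helm commands. The release name, namespace, values file name, and helper names are hypothetical, and the repository alias tracebloc is assumed. The --reset-then-reuse-values flag needs a reasonably recent Helm 3 release. The last helper only lists candidates for cleanup; it deletes nothing.

```shell
RELEASE="${RELEASE:-tracebloc}"  # hypothetical release name
NS="${NS:-tracebloc}"            # hypothetical namespace
CHART="tracebloc/client"

# Install into a new namespace using your values file.
tb_deploy() {
  helm install "$RELEASE" "$CHART" -n "$NS" --create-namespace \
    -f "${1:-values.yaml}"
}

# Manual upgrade: refresh the repo index, then upgrade in place.
tb_upgrade() {
  helm repo update
  helm upgrade "$RELEASE" "$CHART" -n "$NS" --reset-then-reuse-values
}

# Remove the release; resources annotated with "keep" stay behind.
tb_uninstall() {
  helm uninstall "$RELEASE" -n "$NS"
}

# List what may have survived uninstall, for manual review and deletion.
tb_list_kept() {
  kubectl get pvc -n "$NS"
  kubectl get priorityclass,storageclass
}
```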
Migrating from legacy charts
If you installed before chart 1.3.x using tracebloc/aks, tracebloc/eks, or tracebloc/bm, see the migration guide in the client repo. Key changes:
- 4 charts → 1 chart (tracebloc/client) with platform values overrides
- Auth keys flattened: jobsManager.env.CLIENT_ID + secrets.clientPassword → top-level clientId + clientPassword
- PVC keys flattened: clientData / clientLogsPvc / mysqlPvc (with name, storage, hostPath) → pvc.{data,logs,mysql} (size only) + hostPath.enabled for bare-metal
- ServiceAccount renamed from default to {{ .Release.Name }}-jobs-manager
- Pull-secret renamed from hard-coded regcred to {{ .Release.Name }}-regcred
- The namespace value in the legacy values.yaml is gone; use helm install -n <ns> instead
Security
Tracebloc is designed so your data never has to leave your network. Here’s how:
- Data stays local. Training data never leaves your infrastructure. Only metadata and metrics are shared with the platform.
- Encrypted. All communication between client and platform is TLS-encrypted.
- Isolated. Training runs in containers with restricted system access. Kubernetes namespaces separate workloads from each other.
- Scanned. Submitted models are analyzed for vulnerabilities before execution on your infrastructure.
- Minimal footprint. The installer only modifies ~/.tracebloc/ and Docker. No system-wide changes.