The installer uses sensible defaults. This page covers everything you can change, from cluster naming and port mapping to GPU configuration, manual Helm deployment, and day-to-day cluster management.
Installer Options
Override defaults by setting environment variables before the install command. This is useful when you need a custom cluster name, multiple worker nodes, or non-standard ports.

| Variable | Default | Description |
|---|---|---|
| CLUSTER_NAME | tracebloc | Name of the k3d cluster |
| SERVERS | 1 | Number of control-plane nodes |
| AGENTS | 1 | Number of worker nodes |
| K8S_VERSION | v1.29.4-k3s1 | k3s image tag |
| HTTP_PORT | 80 | Host port mapped to cluster HTTP ingress |
| HTTPS_PORT | 443 | Host port mapped to cluster HTTPS ingress |
| HOST_DATA_DIR | ~/.tracebloc | Persistent data directory on host |
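As a sketch of how the overrides are applied, the snippet below exports a few of the variables from the table before the installer runs. The values (tracebloc-dev, two agents, ports 8080/8443) are illustrative only. The install command itself is not repeated here; run it from the same shell session so that it inherits these variables.

```shell
# Illustrative overrides. The variable names come from the table above,
# the values are examples only.
export CLUSTER_NAME="tracebloc-dev"   # custom k3d cluster name
export AGENTS=2                       # two worker nodes instead of one
export HTTP_PORT=8080                 # avoid clashing with a local service on port 80
export HTTPS_PORT=8443                # same for port 443

# Show which installer variables are set. Anything not listed keeps its default.
env | grep -E '^(CLUSTER_NAME|SERVERS|AGENTS|K8S_VERSION|HTTP_PORT|HTTPS_PORT|HOST_DATA_DIR)='

# Now run the install command in this same shell.
```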
Cluster Management
The installer creates a k3d cluster that runs inside Docker. You can stop it to free resources, start it again later, or delete it entirely. Your data persists in HOST_DATA_DIR between stop/start cycles.
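The stop, start, and delete operations map onto the standard k3d cluster subcommands. A minimal sketch, assuming the default cluster name tracebloc (substitute your own CLUSTER_NAME if you overrode it):

```shell
# Cluster name: falls back to the installer default when CLUSTER_NAME is unset.
CLUSTER_NAME="${CLUSTER_NAME:-tracebloc}"

if command -v k3d >/dev/null 2>&1; then
  k3d cluster list                     # show clusters and their node counts
  k3d cluster stop "$CLUSTER_NAME"     # stop the node containers, free CPU and RAM
  k3d cluster start "$CLUSTER_NAME"    # bring the same cluster back up

  # Deleting removes the cluster itself. Left commented out on purpose.
  # k3d cluster delete "$CLUSTER_NAME"
else
  echo "k3d not found on PATH" >&2
fi
```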
View logs
The jobs manager is the main tracebloc process. Check its logs when debugging connectivity or job execution issues:
Useful commands
Common kubectl commands for inspecting cluster state:
Installer logs are written to ~/.tracebloc/install-*.log.
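The log and inspection steps above can be sketched with standard kubectl subcommands. The namespace tracebloc is a placeholder, and matching pods on the substring jobs-manager is an assumption about how the jobs manager pod is named; adjust both to your deployment.

```shell
# NAMESPACE is a placeholder: use the namespace your client runs in.
NAMESPACE="${NAMESPACE:-tracebloc}"

if command -v kubectl >/dev/null 2>&1; then
  kubectl get nodes -o wide                    # node status and versions
  kubectl get pods -n "$NAMESPACE" -o wide     # pod status and placement
  kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp | tail -n 20

  # Print recent logs from every pod whose name contains "jobs-manager"
  # (assumed naming, check the pod list above).
  for pod in $(kubectl get pods -n "$NAMESPACE" -o name | grep jobs-manager); do
    kubectl logs -n "$NAMESPACE" "$pod" --tail=100
  done
fi

# Most recent installer log, if one exists.
ls -t "$HOME"/.tracebloc/install-*.log 2>/dev/null | head -n 1
```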
GPU Support
The installer auto-detects GPU hardware and configures the cluster accordingly. No manual setup is required on Linux: the installer handles drivers, the container toolkit, and the Kubernetes device plugin.
NVIDIA (Linux)
Fully automatic. The installer:
- Detects NVIDIA GPUs via nvidia-smi or lspci
- Installs drivers if missing (Ubuntu, RHEL/CentOS, Arch)
- Installs the NVIDIA Container Toolkit and configures Docker
- Deploys the NVIDIA k8s device plugin into the cluster
- Passes --gpus=all to k3d
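To check that the steps above took effect, one option is to look for the GPU on the host and then in the cluster. This is a sketch, not part of the installer: it relies on the fact that the NVIDIA device plugin advertises GPUs under the nvidia.com/gpu resource name.

```shell
# Resource name advertised by the NVIDIA Kubernetes device plugin.
GPU_RESOURCE="nvidia.com/gpu"

# 1. Host level: the driver is loaded and lists the GPUs it sees.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi -L
fi

# 2. Cluster level: nodes report GPU capacity once the device plugin is running.
if command -v kubectl >/dev/null 2>&1; then
  kubectl describe nodes | grep -F "$GPU_RESOURCE" \
    || echo "no $GPU_RESOURCE resource reported by any node"
fi
```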
AMD (Linux)
Auto-detected. ROCm is installed automatically on Ubuntu and RHEL/CentOS. A logout/login may be needed for full GPU access.
macOS
CPU only. Docker Desktop on macOS does not support GPU passthrough. For GPU workloads, deploy on a Linux machine with NVIDIA GPUs or use AWS (EKS).
Windows
The installer does not install GPU drivers on Windows. Pre-install NVIDIA drivers before running the installer. The installer detects them via nvidia-smi and configures the cluster to use them.
Manual Deployment
Skip the installer entirely. Use this if you already have a Kubernetes cluster, need custom resource limits, or want full control over the Helm deployment. A single unified chart, tracebloc/client, supports AKS, EKS, bare-metal, and OpenShift. Platform behaviour is selected via values overrides; reference defaults live in the repo at client/ci/{aks,eks,bm,oc}-values.yaml.
Add the Helm repository
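A minimal sketch of this step. The alias tracebloc matches the chart reference tracebloc/client used on this page. The URL is the one the auto-upgrade job is described as polling, and treating it as the Helm repository URL is an assumption.

```shell
# Repo alias and URL (URL assumed to be the Helm repo index, see note above).
HELM_REPO_NAME="tracebloc"
HELM_REPO_URL="https://tracebloc.github.io/client"

if command -v helm >/dev/null 2>&1; then
  helm repo add "$HELM_REPO_NAME" "$HELM_REPO_URL"
  helm repo update
  # List the most recent published chart versions.
  helm search repo "$HELM_REPO_NAME/client" --versions | head -n 5
else
  echo "helm not found on PATH" >&2
fi
```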
Get default values
Export the chart’s default configuration to customize it:
Configure values.yaml
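As a starting point for the configuration, the chart defaults can be exported with helm show values. A sketch, assuming the repository was added under the alias tracebloc; note that the redirect overwrites any existing values.yaml in the current directory.

```shell
CHART="tracebloc/client"
VALUES_FILE="values.yaml"

if command -v helm >/dev/null 2>&1; then
  # Dump the chart's default values into a local file for editing.
  helm show values "$CHART" > "$VALUES_FILE"
  wc -l "$VALUES_FILE"
else
  echo "helm not found on PATH" >&2
fi
```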
Authentication
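One way to supply the credentials is a small override file kept separate from the full values.yaml. The sketch below assumes the top-level clientId and clientPassword keys of the unified chart; the file name auth-values.yaml and the placeholder values are hypothetical. Confirm the key names against your exported defaults.

```shell
# Placeholder credentials: replace with the values from the tracebloc client view.
CLIENT_ID="your-client-id"
CLIENT_PASSWORD="your-client-password"

# Keep the file readable by the current user only.
umask 077

# Write the override file. Values containing double quotes would need escaping.
cat > auth-values.yaml <<EOF
clientId: "${CLIENT_ID}"
clientPassword: "${CLIENT_PASSWORD}"
EOF
```

The file can then be passed to helm with an extra -f auth-values.yaml after the main values file.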
Set your Client ID and password from the tracebloc client view:
Resource Limits for Training Jobs
Defaults are sized for typical workloads. Override per job size; for GPU support, requests and limits must be equal:
Storage
Storage class and PVC sizes:
Docker Registry
The chart pulls the client image from a container registry, and credentials are required in production. Use a token, not a plaintext password.
The pull secret is named {{ .Release.Name }}-regcred. Omit the dockerRegistry block entirely to skip pull-secret creation (e.g. when using a public mirror).
Proxy (optional)
Auto-upgrade (on by default)
Chart releases 1.3.0 and later install a <release>-auto-upgrade CronJob that polls https://tracebloc.github.io/client daily and runs helm upgrade --reset-then-reuse-values whenever a newer chart version is published. This closes tracebloc/client#69: older deployed clients no longer drift from the latest secure release.
The CronJob runs with the cluster-admin ClusterRole because the chart templates cluster-scoped resources (PriorityClass, StorageClass, ClusterRoleBinding, optionally Namespace). Disable auto-upgrade if you need a manual approval gate on upgrades.
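A sketch of inspecting the auto-upgrade job and opting out of it. The release name and namespace are placeholders. The CronJob name follows the <release>-auto-upgrade pattern described above, while the autoUpgrade.enabled key is an assumption about the toggle name; verify it in your exported values.yaml.

```shell
# RELEASE and NAMESPACE are placeholders for your Helm release and namespace.
RELEASE="${RELEASE:-tracebloc}"
NAMESPACE="${NAMESPACE:-tracebloc}"
CRONJOB="${RELEASE}-auto-upgrade"

if command -v kubectl >/dev/null 2>&1; then
  kubectl get cronjob "$CRONJOB" -n "$NAMESPACE"        # schedule and last run time
  kubectl get jobs -n "$NAMESPACE" | grep "$CRONJOB" \
    || echo "no upgrade runs recorded yet"
fi

# Override file that switches the feature off.
# ASSUMPTION: the toggle lives at autoUpgrade.enabled.
cat > no-auto-upgrade.yaml <<'EOF'
autoUpgrade:
  enabled: false
EOF
```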
NetworkPolicy hardening for training pods
Training pods run untrusted ML code. The chart can apply a NetworkPolicy that denies ingress and restricts egress to DNS and external HTTPS only, blocking pod-to-pod, MySQL, and Kubernetes API access from the training pod. The policy only takes effect if the cluster's CNI enforces NetworkPolicy:

| Platform | Notes |
|---|---|
| AKS | needs --network-policy azure (Azure NPM) or Calico at cluster create |
| EKS | needs Calico or Cilium add-on (the default AWS VPC CNI alone does not enforce) |
| Bare-metal | needs Calico / Cilium / kube-router (Flannel alone does not enforce) |
| OpenShift | OVN-Kubernetes enforces by default |

Set enabled: false on clusters without an enforcing CNI. Silently having no protection is worse than explicitly disabling it.
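Before relying on the policy, it can help to check whether an enforcing CNI is actually running. The sketch below is a heuristic only: it searches pod names for the CNIs named in the table, and the patterns (including azure-npm and ovnkube) are assumptions about typical pod naming that may not match every cluster.

```shell
# Name fragments of CNIs that enforce NetworkPolicy (heuristic, not exhaustive).
CNI_PATTERN='calico|cilium|kube-router|azure-npm|ovnkube'

if command -v kubectl >/dev/null 2>&1; then
  # Any hit suggests an enforcing CNI is deployed.
  kubectl get pods --all-namespaces -o name | grep -Ei "$CNI_PATTERN" \
    || echo "no known enforcing CNI pods found; consider disabling the policy"

  # Policies currently defined in the cluster.
  kubectl get networkpolicy --all-namespaces
fi
```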
Resource Monitor and node-agents namespace
The tracebloc-resource-monitor DaemonSet collects node-level CPU/memory metrics. It mounts hostPath volumes (/proc, /sys), which Pod Security Admission’s restricted profile bans, so the chart isolates it in a dedicated privileged namespace (default tracebloc-node-agents).
If you set create: false, create the namespace yourself with the required PSA labels:
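A sketch of creating that namespace by hand, using the standard Pod Security Admission label keys at the privileged level. Which labels the chart expects exactly is an assumption here; compare with the chart's own namespace template before relying on it.

```shell
# Default namespace name from the chart; change it if you overrode the default.
NODE_AGENTS_NS="tracebloc-node-agents"

if command -v kubectl >/dev/null 2>&1; then
  kubectl create namespace "$NODE_AGENTS_NS"

  # Privileged level so that the hostPath mounts are admitted.
  kubectl label namespace "$NODE_AGENTS_NS" --overwrite \
    pod-security.kubernetes.io/enforce=privileged \
    pod-security.kubernetes.io/audit=privileged \
    pod-security.kubernetes.io/warn=privileged
else
  echo "kubectl not found on PATH" >&2
fi
```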
The resource monitor requires metrics-server. It is bundled on k3d/k3s/AKS, present on OpenShift, and must be installed manually on EKS (kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml).
Pod Security Admission labels
Training jobs run untrusted user-supplied ML code. The chart can create the release namespace with Pod Security Admission warn/audit/enforce labels at the restricted profile for defense-in-depth:
If you keep create: false (default) and you want PSA labels on an existing namespace:
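For an existing namespace, the same three labels can be applied with kubectl label. A sketch, with the namespace name as a placeholder; the server-side dry run first reports any running pods that would violate the restricted profile, without changing anything.

```shell
NAMESPACE="${NAMESPACE:-tracebloc}"   # placeholder: your release namespace
PSA_LEVEL="restricted"

if command -v kubectl >/dev/null 2>&1; then
  # Dry run: surfaces warnings for existing pods that break the profile.
  kubectl label --dry-run=server --overwrite namespace "$NAMESPACE" \
    "pod-security.kubernetes.io/enforce=$PSA_LEVEL"

  # Apply warn, audit, and enforce at the same level.
  kubectl label --overwrite namespace "$NAMESPACE" \
    "pod-security.kubernetes.io/warn=$PSA_LEVEL" \
    "pod-security.kubernetes.io/audit=$PSA_LEVEL" \
    "pod-security.kubernetes.io/enforce=$PSA_LEVEL"
else
  echo "kubectl not found on PATH" >&2
fi
```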
Image digest pinning
Pin images by content hash for reproducible deploys. When digest is set, tag is ignored and imagePullPolicy drops to IfNotPresent.
PriorityClass and PodDisruptionBudgets
The chart pins the MySQL pod with a tracebloc-data-plane PriorityClass (value 1000000) so it survives node-level OOM and scheduling pressure, and applies PDBs to MySQL and the jobs manager. Override only if you run a multi-replica MySQL externally:
Deploy
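A minimal sketch of the install step, assuming the repository alias tracebloc and a customized values.yaml in the current directory. The release name and namespace are placeholders, and --create-namespace assumes you are letting Helm, rather than the chart, create the namespace.

```shell
RELEASE="tracebloc"          # placeholder release name
NAMESPACE="tracebloc"        # placeholder namespace
CHART="tracebloc/client"
VALUES_FILE="values.yaml"    # your customized copy of the chart defaults

if command -v helm >/dev/null 2>&1; then
  helm install "$RELEASE" "$CHART" \
    --namespace "$NAMESPACE" --create-namespace \
    -f "$VALUES_FILE"
  # Confirm the release was recorded and check its status.
  helm status "$RELEASE" -n "$NAMESPACE"
else
  echo "helm not found on PATH" >&2
fi
```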
Install the chart into a new namespace:
Upgrade
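A sketch of a manual upgrade, with the release name and namespace as placeholders. The --reset-then-reuse-values flag is only available in Helm 3.14 and later, so check helm version first on older installations.

```shell
RELEASE="tracebloc"; NAMESPACE="tracebloc"; CHART="tracebloc/client"

if command -v helm >/dev/null 2>&1; then
  helm version --short                 # the flag below needs Helm 3.14 or newer
  helm repo update                     # fetch the latest chart index
  helm upgrade "$RELEASE" "$CHART" -n "$NAMESPACE" --reset-then-reuse-values
  helm history "$RELEASE" -n "$NAMESPACE" | tail -n 3
else
  echo "helm not found on PATH" >&2
fi
```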
The auto-upgrade CronJob handles routine version bumps. To upgrade manually:
When upgrading into chart 1.3.0 from 1.2.x, use --reset-then-reuse-values (not plain --reuse-values). The autoUpgrade block did not exist in 1.2.x, so a plain reuse fails template rendering.
Uninstall
Some resources carry the helm.sh/resource-policy: keep annotation so your data and shared cluster resources survive uninstall. To remove them too:
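A cautious sketch of the uninstall and cleanup flow. The release name and namespace are placeholders, and the exact set of retained resources is not listed here, so the snippet only lists likely candidates (PVCs, PriorityClasses, StorageClasses) and leaves the destructive command commented out.

```shell
RELEASE="tracebloc"; NAMESPACE="tracebloc"

if command -v helm >/dev/null 2>&1 && command -v kubectl >/dev/null 2>&1; then
  helm uninstall "$RELEASE" -n "$NAMESPACE"

  # Inspect what was left behind before deleting anything.
  kubectl get pvc -n "$NAMESPACE"
  kubectl get priorityclass,storageclass

  # Deleting the claims destroys the stored data. Left commented out on purpose.
  # kubectl delete pvc --all -n "$NAMESPACE"
else
  echo "helm and kubectl are both required" >&2
fi
```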
Migrating from legacy charts
If you installed before chart 1.3.x using tracebloc/aks, tracebloc/eks, or tracebloc/bm, see the migration guide in the client repo. Key changes:
- 4 charts → 1 chart (tracebloc/client) with platform values overrides
- Auth keys flattened: jobsManager.env.CLIENT_ID + secrets.clientPassword → top-level clientId + clientPassword
- PVC keys flattened: clientData/clientLogsPvc/mysqlPvc (with name, storage, hostPath) → pvc.{data,logs,mysql} (size only) + hostPath.enabled for bare-metal
- ServiceAccount renamed from default to {{ .Release.Name }}-jobs-manager
- Pull-secret renamed from hard-coded regcred to {{ .Release.Name }}-regcred
- The namespace value in the legacy values.yaml is gone; use helm install -n <ns> instead
Security
Tracebloc is designed so your data never has to leave your network. Here’s how:
- Data stays local. Training data never leaves your infrastructure. Only metadata and metrics are shared with the platform.
- Encrypted. All communication between client and platform is TLS-encrypted.
- Isolated. Training runs in containers with restricted system access. Kubernetes namespaces separate workloads from each other.
- Scanned. Submitted models are analyzed for vulnerabilities before execution on your infrastructure.
- Minimal footprint. The installer only modifies ~/.tracebloc/ and Docker. No system-wide changes.