The installer uses sensible defaults; this page covers what you can change.
Installed with the one-liner? See Installer Options, Cluster Management, and GPU Support. Deploying into your own cluster with Helm (EKS, AKS, bare-metal)? Jump to Manual Deployment.
Installer Options
Override defaults by setting environment variables before the install command. Useful for a custom cluster name, extra worker nodes, or a different data directory.
| Variable | Default | Description |
|---|---|---|
| CLUSTER_NAME | tracebloc | Name of the k3d cluster |
| SERVERS | 1 | Number of control-plane nodes |
| AGENTS | 1 | Number of worker nodes |
| K8S_VERSION | v1.29.4-k3s1 | k3s image tag |
| HOST_DATA_DIR | ~/.tracebloc | Persistent data directory on host |
Example — custom cluster name with two worker nodes:
CLUSTER_NAME=my-cluster AGENTS=2 bash <(curl -fsSL https://tracebloc.io/i.sh)
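The override mechanism is plain shell: variables set in front of the command reach the installer's environment, and anything left unset falls back to the default in the table. The sketch below is illustrative only and is not the installer's actual code; the function name show_install_options is hypothetical. It mirrors the table using the standard ${VAR:-default} expansion so you can preview the effective settings before installing.

```shell
# Hypothetical helper (not part of the installer): print the settings the
# installer would see, using the documented defaults for unset variables.
show_install_options() {
  printf 'CLUSTER_NAME=%s\n'  "${CLUSTER_NAME:-tracebloc}"
  printf 'SERVERS=%s\n'       "${SERVERS:-1}"
  printf 'AGENTS=%s\n'        "${AGENTS:-1}"
  printf 'K8S_VERSION=%s\n'   "${K8S_VERSION:-v1.29.4-k3s1}"
  # The table lists ~/.tracebloc; $HOME is used here so the path expands in quotes.
  printf 'HOST_DATA_DIR=%s\n' "${HOST_DATA_DIR:-$HOME/.tracebloc}"
}
```

Run it in a subshell with the same variables you plan to pass to the installer, for example with CLUSTER_NAME=my-cluster and AGENTS=2 set, and compare the output against the table.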
Cluster Management
The installer creates a k3d cluster that runs inside Docker. You can stop it to free resources, start it again later, or delete it entirely. Your data persists in HOST_DATA_DIR between stop/start cycles. The commands below use the default cluster name; substitute your own if you set CLUSTER_NAME.
# Stop — frees CPU/RAM, data persists
k3d cluster stop tracebloc
# Start — resume where you left off
k3d cluster start tracebloc
# Delete — removes the cluster entirely
k3d cluster delete tracebloc
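If you installed with a custom CLUSTER_NAME, a small wrapper can keep the three commands pointed at the right cluster. This is a sketch under assumptions: the function name tb_cluster and the DRY_RUN switch are hypothetical and not provided by the installer. With DRY_RUN=1 it only prints the k3d command it would run.

```shell
# Hypothetical wrapper around the k3d commands shown above.
# Uses CLUSTER_NAME if set, otherwise the default name "tracebloc".
tb_cluster() {
  action=$1
  name=${CLUSTER_NAME:-tracebloc}
  case "$action" in
    stop|start|delete) ;;
    *) echo "usage: tb_cluster stop|start|delete" >&2; return 2 ;;
  esac
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it, so it can be reviewed first.
    echo "k3d cluster $action $name"
  else
    k3d cluster "$action" "$name"
  fi
}
```

Previewing with DRY_RUN=1 is worth the extra step before a delete, since that action removes the cluster entirely.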
View logs
The jobs manager is the main tracebloc process. Check its logs when debugging connectivity or job execution issues:
kubectl logs -n <workspace> -l app=manager
Useful commands
Common kubectl commands for inspecting cluster state:
kubectl get nodes -o wide # Node status and IPs
kubectl get pods -A # All pods across namespaces
kubectl get pods -n <workspace> # Pods in your workspace
kubectl get pvc -n <workspace> # Persistent volume claims
kubectl get services -n <workspace> # Services and endpoints
Install logs are saved to ~/.tracebloc/install-*.log.
GPU Support
GPU setup is automatic on Linux: the installer detects your hardware and sets up drivers, the container toolkit, and the Kubernetes device plugin.
NVIDIA (Linux)
Fully automatic. The installer:
- Detects NVIDIA GPUs via nvidia-smi or lspci
- Installs drivers if missing (Ubuntu, RHEL/CentOS, Arch)
- Installs the NVIDIA Container Toolkit and configures Docker
- Deploys the NVIDIA k8s device plugin into the cluster
- Passes --gpus=all to k3d
A reboot may be required after driver installation. Re-run the installer afterward — it picks up where it left off.
AMD (Linux)
Auto-detected. ROCm is installed automatically on Ubuntu and RHEL/CentOS. A logout/login may be needed for full GPU access.
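To check what the detection step is likely to find on your machine, you can classify lspci output yourself. The following is a rough sketch, not the installer's actual detection logic; the function name classify_gpu and the matched vendor strings are assumptions based on typical lspci output.

```shell
# Hypothetical helper: read lspci output on stdin and print
# "nvidia", "amd", or "none" based on display-class devices only.
classify_gpu() {
  # Keep display-class lines; "|| true" avoids a non-zero status when nothing matches.
  gpus=$(grep -Ei 'vga compatible controller|3d controller|display controller' || true)
  if printf '%s\n' "$gpus" | grep -qi 'nvidia'; then
    echo nvidia
  elif printf '%s\n' "$gpus" | grep -Eqi 'advanced micro devices|amd/ati'; then
    echo amd
  else
    echo none
  fi
}
```

On a Linux host, piping lspci into classify_gpu gives a quick preview. Non-display devices such as GPU audio functions are ignored, so only actual graphics adapters count.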
macOS
CPU only. Docker Desktop on macOS does not support GPU passthrough. For GPU workloads, deploy on a Linux machine with NVIDIA GPUs or use AWS (EKS).
Windows
The installer does not install GPU drivers on Windows. Pre-install NVIDIA drivers before running the installer. The installer detects them via nvidia-smi and configures the cluster to use them.
Manual Deployment
Skip the installer entirely. Use this if you already have a Kubernetes cluster, need custom resource limits, or want full control over the Helm deployment.
A single chart — tracebloc/client — supports AKS, EKS, bare-metal, and OpenShift; choose your platform via values overrides. Reference defaults live at client/ci/{aks,eks,bm,oc}-values.yaml.
Add the Helm repository
helm repo add tracebloc https://tracebloc.github.io/client
helm repo update
Get default values
Export the chart’s default configuration to customize it:
helm show values tracebloc/client > values.yaml
Authentication
Set your Client ID and password from the tracebloc client view:
clientId: "<YOUR_CLIENT_ID>"
clientPassword: "<YOUR_CLIENT_PASSWORD>"
Resource Limits for Training Jobs
Defaults are sized for typical workloads. Override them to match your job size. For GPU support, requests and limits must be equal:
env:
  RESOURCE_REQUESTS: "cpu=2,memory=8Gi"
  RESOURCE_LIMITS: "cpu=2,memory=8Gi"
  GPU_REQUESTS: "" # "nvidia.com/gpu=1" for GPU
  GPU_LIMITS: "" # "nvidia.com/gpu=1" for GPU
  RUNTIME_CLASS_NAME: "" # "nvidia" for k3s GPU
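Kubernetes requires extended resources such as nvidia.com/gpu to have equal requests and limits, so it can help to compare the two strings before deploying. The check below is a minimal sketch; the function name check_requests_equal_limits is hypothetical and the chart does not ship it.

```shell
# Hypothetical pre-flight check: compare a requests string with its limits string.
# Returns 0 when they match; returns 1 and prints a message on stderr otherwise.
check_requests_equal_limits() {
  label=$1
  requests=$2
  limits=$3
  if [ "$requests" != "$limits" ]; then
    echo "$label: requests '$requests' differ from limits '$limits'" >&2
    return 1
  fi
  return 0
}
```

Call it once per pair, for example with the GPU_REQUESTS and GPU_LIMITS values you intend to put in values.yaml. It compares the strings literally, so "cpu=2,memory=8Gi" and "memory=8Gi,cpu=2" count as different.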
Storage
Storage class and PVC sizes:
storageClass:
  create: true
  provisioner: "" # set per platform (see ci/*-values.yaml)
  allowVolumeExpansion: true
  parameters: {}
# Bare-metal only: hostPath-backed PVs at /tracebloc/{data,logs,mysql}
hostPath:
  enabled: false
pvc:
  mysql: 2Gi
  logs: 10Gi
  data: 50Gi
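Platform-specific storage values (such as the provisioner) are in the reference files client/ci/{aks,eks,bm,oc}-values.yaml; copy the relevant block into your values file.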
Docker Registry
The chart pulls the client image from a container registry — credentials are required in production. Use a token, not a plaintext password.
dockerRegistry:
  server: https://index.docker.io/v1/
  username: <DOCKER_USERNAME>
  password: <DOCKER_PASSWORD or DOCKER_TOKEN>
  email: <DOCKER_EMAIL>
The chart auto-creates a secret named {{ .Release.Name }}-regcred. Omit the dockerRegistry block entirely to skip pull-secret creation (e.g. when using a public mirror).
Proxy (optional)
env:
  HTTP_PROXY_HOST: "your-proxy.company.com"
  HTTP_PROXY_PORT: "8080"
  HTTP_PROXY_USERNAME: ""
  HTTP_PROXY_PASSWORD: ""
Auto-upgrade (on by default)
Chart 1.3.0 and later install a <release>-auto-upgrade CronJob that polls https://tracebloc.github.io/client daily and runs helm upgrade --reset-then-reuse-values whenever a newer chart version is published. Clients therefore update automatically instead of staying pinned to the version they were installed with.
autoUpgrade:
  enabled: true # set false to opt out
  schedule: "23 2 * * *" # daily at 02:23 UTC
  suspend: false # one-shot pause without removing resources
  repoUrl: "https://tracebloc.github.io/client"
  repoName: "tracebloc"
  chartName: "client"
  timeout: "10m"
The CronJob’s ServiceAccount is bound to the built-in cluster-admin ClusterRole because the chart templates cluster-scoped resources (PriorityClass, StorageClass, ClusterRoleBinding, optionally Namespace). Disable if you need a manual approval gate on upgrades.
NetworkPolicy hardening for training pods
Training pods run untrusted ML code. The chart can apply a NetworkPolicy that denies ingress and restricts egress to DNS + external HTTPS only — blocking pod-to-pod, MySQL, and Kubernetes API access from the training pod.
networkPolicy:
  training:
    enabled: true
    dnsNamespace: kube-system
    dnsSelector: {} # empty falls back to {k8s-app: kube-dns}
    clusterCidrs:
      - "10.0.0.0/8"
      - "172.16.0.0/12"
      - "192.168.0.0/16"
Requires a CNI that enforces NetworkPolicy:
| Platform | Notes |
|---|---|
| AKS | needs --network-policy azure (Azure NPM) or Calico at cluster create |
| EKS | needs Calico or Cilium add-on (the default AWS VPC CNI alone does not enforce) |
| Bare-metal | needs Calico / Cilium / kube-router (Flannel alone does not enforce) |
| OpenShift | OVN-Kubernetes enforces by default |
Leave enabled: false on clusters without an enforcing CNI — silently having no protection is worse than explicitly disabling it.
The chart’s training-pod egress lockdown only blocks traffic if your CNI enforces NetworkPolicy. Verify your CNI before relying on it.
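The default clusterCidrs are the three RFC 1918 private ranges. Assuming the policy treats addresses in clusterCidrs as in-cluster destinations that training pods may not reach (an inference from the description above, not confirmed chart behavior), the sketch below shows which IPv4 addresses fall inside those defaults. The function name in_default_cluster_cidrs is hypothetical.

```shell
# Hypothetical helper: return 0 if an IPv4 address lies in one of the default
# clusterCidrs (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), 1 if it does not,
# and 2 if the input is not a dotted-quad address.
in_default_cluster_cidrs() {
  ip=$1
  case "$ip" in
    ''|*[!0-9.]*) return 2 ;;
  esac
  # Split the address on dots into the positional parameters.
  old_ifs=$IFS
  IFS=.
  set -- $ip
  IFS=$old_ifs
  [ "$#" -eq 4 ] || return 2
  a=$1
  b=$2
  if [ "$a" -eq 10 ]; then return 0; fi
  if [ "$a" -eq 172 ] && [ "$b" -ge 16 ] && [ "$b" -le 31 ]; then return 0; fi
  if [ "$a" -eq 192 ] && [ "$b" -eq 168 ]; then return 0; fi
  return 1
}
```

If your cluster's pod or service network uses a range outside these three, add it to clusterCidrs. Otherwise, under the same assumption, the policy would not cover those addresses.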
Resource Monitor and node-agents namespace
The tracebloc-resource-monitor DaemonSet collects node-level CPU/memory metrics. It mounts hostPath volumes (/proc, /sys) which Pod Security Admission’s restricted profile bans — so the chart isolates it in a dedicated privileged namespace (default tracebloc-node-agents).
resourceMonitor: true # set false on clusters where metrics-server cannot be installed
nodeAgents:
  namespace:
    create: true
    name: tracebloc-node-agents
When create: false, create the namespace yourself with the required PSA labels:
kubectl create namespace tracebloc-node-agents
kubectl label namespace tracebloc-node-agents \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=privileged \
  pod-security.kubernetes.io/audit=privileged
The DaemonSet requires metrics-server. It is bundled on k3d/k3s/AKS, present on OpenShift, and must be installed manually on EKS (kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml).
Pod Security Admission labels
Training Jobs run untrusted user-supplied ML code. The chart can create the release namespace with Pod Security Admission warn/audit/enforce labels at the restricted profile for defense-in-depth:
namespace:
  create: false # true only on greenfield installs
  podSecurity:
    warn: restricted
    audit: restricted
    enforce: restricted # set "" for bare-metal hostPath installs
When create: false (default) and you want PSA labels on an existing namespace:
kubectl label namespace <workspace> \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/enforce=restricted
Image digest pinning
Pin images by content hash for reproducible deploys. When digest is set, tag is ignored and imagePullPolicy drops to IfNotPresent.
images:
  jobsManager: { digest: "sha256:..." }
  podsMonitor: { digest: "sha256:..." }
  resourceMonitor: { digest: "sha256:..." }
  requestsProxy: { digest: "sha256:..." }
  mysqlClient: { tag: "", digest: "" }
  busybox: { tag: "1.35", digest: "" }
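A malformed digest typically surfaces only when the image pull fails, so a quick format check before deploying can save a round trip. A SHA-256 image digest is the prefix sha256: followed by 64 lowercase hexadecimal characters. The sketch below checks only that shape; the function name is_valid_digest is hypothetical, and it does not verify that the digest exists in any registry.

```shell
# Hypothetical helper: return 0 if the argument looks like a sha256 image
# digest ("sha256:" followed by exactly 64 lowercase hex characters), else 1.
is_valid_digest() {
  case "$1" in
    sha256:*) hex=${1#sha256:} ;;
    *) return 1 ;;
  esac
  [ "${#hex}" -eq 64 ] || return 1
  case "$hex" in
    *[!0-9a-f]*) return 1 ;;
  esac
  return 0
}
```

The "sha256:..." placeholders in the example above fail this check, which is a useful reminder to replace them with real digests.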
PriorityClass and PodDisruptionBudgets
The chart pins the MySQL pod with a tracebloc-data-plane PriorityClass (value 1000000) so it survives node-level OOM and scheduling pressure, and applies PDBs to MySQL and the jobs manager. Override only if you run a multi-replica MySQL externally:
priorityClass:
  create: true
  name: tracebloc-data-plane
  value: 1000000
podDisruptionBudget:
  mysql: { create: true }
  jobsManager: { create: true }
Deploy
Install the chart into a new namespace:
helm upgrade --install <workspace> tracebloc/client \
  --namespace <workspace> \
  --create-namespace \
  --values values.yaml
Upgrade
The auto-upgrade CronJob handles routine version bumps. To upgrade manually:
helm repo update
helm upgrade <workspace> tracebloc/client \
  --namespace <workspace> \
  --reset-then-reuse-values \
  --values values.yaml
When upgrading to chart 1.3.0 from 1.2.x, use --reset-then-reuse-values rather than plain --reuse-values: the autoUpgrade block did not exist in 1.2.x, so a plain reuse fails template rendering.
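The --reset-then-reuse-values flag was, as far as we know, introduced in Helm 3.14, and older clients reject it as an unknown flag; confirm with helm upgrade --help on your machine. The sketch below compares a Helm version string against that assumed minimum. The function name helm_supports_reset_then_reuse is hypothetical.

```shell
# Hypothetical helper: return 0 if the given Helm version string (for example
# "v3.14.2" or "3.15.0+g1234abc") is at least 3.14, the release assumed here to
# have added --reset-then-reuse-values. Returns 1 if older, 2 if unparseable.
helm_supports_reset_then_reuse() {
  version=${1#v}        # strip a leading "v"
  major=${version%%.*}  # text before the first dot
  rest=${version#*.}    # text after the first dot
  minor=${rest%%.*}     # text before the next dot
  case "$major$minor" in
    ''|*[!0-9]*) return 2 ;;
  esac
  if [ "$major" -gt 3 ]; then return 0; fi
  if [ "$major" -eq 3 ] && [ "$minor" -ge 14 ]; then return 0; fi
  return 1
}
```

You can feed it the output of helm version --template '{{.Version}}' before running the manual upgrade command.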
Uninstall
helm uninstall <workspace> -n <workspace>
PVCs and the PriorityClass are annotated helm.sh/resource-policy: keep so your data and shared cluster resources survive uninstall. To remove them too:
kubectl delete pvc --all -n <workspace>
kubectl delete namespace <workspace>
Migrating from legacy charts
If you installed before chart 1.3.x using tracebloc/aks, tracebloc/eks, or tracebloc/bm, see the migration guide in the client repo. Key changes:
- 4 charts → 1 chart (tracebloc/client) with platform values overrides
- Auth keys flattened: jobsManager.env.CLIENT_ID + secrets.clientPassword → top-level clientId + clientPassword
- PVC keys flattened: clientData / clientLogsPvc / mysqlPvc (with name, storage, hostPath) → pvc.{data,logs,mysql} (size only) + hostPath.enabled for bare-metal
- ServiceAccount renamed from default to {{ .Release.Name }}-jobs-manager
- Pull-secret renamed from hard-coded regcred to {{ .Release.Name }}-regcred
- The namespace value in the legacy values.yaml is gone; use helm install -n <ns> instead
Security
Tracebloc is designed so your data never has to leave your network. Here’s how:
- Data stays local. Training data never leaves your infrastructure. Only metadata and metrics are shared with the platform.
- Encrypted. All communication between client and platform is TLS-encrypted.
- Isolated. Training runs in containers with restricted system access. Kubernetes namespaces separate workloads from each other.
- Scanned. Submitted models are analyzed for vulnerabilities before execution on your infrastructure.
- Minimal footprint. The installer only modifies ~/.tracebloc/ and Docker. No system-wide changes.