Configuration

The installer uses sensible defaults. This page covers everything you can change — from cluster naming and port mapping to GPU configuration, manual Helm deployment, and day-to-day cluster management.

Installer Options

Override defaults by setting environment variables before the install command. Useful when you need a custom cluster name, multiple worker nodes, or non-standard ports.

Variable	Default	Description
`CLUSTER_NAME`	`tracebloc`	Name of the k3d cluster
`SERVERS`	`1`	Number of control-plane nodes
`AGENTS`	`1`	Number of worker nodes
`K8S_VERSION`	`v1.29.4-k3s1`	k3s image tag
`HTTP_PORT`	`80`	Host port mapped to cluster HTTP ingress
`HTTPS_PORT`	`443`	Host port mapped to cluster HTTPS ingress
`HOST_DATA_DIR`	`~/.tracebloc`	Persistent data directory on host

Example — custom cluster name with two worker nodes:

CLUSTER_NAME=my-cluster AGENTS=2 bash <(curl -fsSL https://tracebloc.io/i.sh)
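
The port variables from the table can be overridden the same way, for example on a host where 80 and 443 are already taken. The sketch below is a hypothetical pre-flight check, not part of the installer: `valid_port` and `check_ports` are illustrative names, and whether the installer performs its own validation is not stated here.

```shell
# Hypothetical helper: succeeds only for an integer in the range 1-65535.
valid_port() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
  esac
  [ "$1" -ge 1 ] 2>/dev/null && [ "$1" -le 65535 ] 2>/dev/null
}

# Hypothetical helper: checks both overrides, falling back to the documented defaults.
check_ports() {
  valid_port "${HTTP_PORT:-80}" && valid_port "${HTTPS_PORT:-443}"
}
```

With the helpers defined, a guarded install on alternative ports (8080 and 8443 are example values) could be run as: export HTTP_PORT=8080 HTTPS_PORT=8443; check_ports && bash <(curl -fsSL https://tracebloc.io/i.sh)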

Cluster Management

The installer creates a k3d cluster that runs inside Docker. You can stop it to free resources, start it again later, or delete it entirely. Your data persists in HOST_DATA_DIR between stop/start cycles.

# Stop — frees CPU/RAM, data persists
k3d cluster stop tracebloc

# Start — resume where you left off
k3d cluster start tracebloc

# Delete — removes the cluster entirely
k3d cluster delete tracebloc

View logs

The jobs manager is the main tracebloc process. Check its logs when debugging connectivity or job execution issues:

kubectl logs -n <workspace> -l app=tracebloc-jobs-manager
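
When a label selector (-l) is used, kubectl logs prints only the last 10 lines per pod by default, which is often too little for debugging. A small wrapper can raise that limit with --tail. The name `tb_logs` and the default of 200 lines are illustrative choices, not something the installer provides.

```shell
# Hypothetical wrapper around the documented command.
# $1 = workspace namespace, $2 = lines per pod (optional, default 200).
tb_logs() {
  kubectl logs -n "$1" -l app=tracebloc-jobs-manager --tail="${2:-200}"
}
```

For example, tb_logs <workspace> 500 prints the last 500 lines per matching pod.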

Useful commands

Common kubectl commands for inspecting cluster state:

kubectl get nodes -o wide          # Node status and IPs
kubectl get pods -A                # All pods across namespaces
kubectl get pods -n <workspace>    # Pods in your workspace
kubectl get pvc -n <workspace>     # Persistent volume claims
kubectl get services -n <workspace> # Services and endpoints

Install logs are saved to ~/.tracebloc/install-*.log.

GPU Support

The installer auto-detects GPU hardware and configures the cluster accordingly. No manual setup required on Linux — the installer handles drivers, container toolkit, and Kubernetes device plugin.

NVIDIA (Linux)

Fully automatic. The installer:

Detects NVIDIA GPUs via nvidia-smi or lspci
Installs drivers if missing (Ubuntu, RHEL/CentOS, Arch)
Installs the NVIDIA Container Toolkit and configures Docker
Deploys the NVIDIA k8s device plugin into the cluster
Passes --gpus=all to k3d

A reboot may be required after driver installation. Re-run the installer afterward — it picks up where it left off.

AMD (Linux)

Auto-detected. ROCm is installed automatically on Ubuntu and RHEL/CentOS. A logout/login may be needed for full GPU access.

macOS

CPU only. Docker Desktop on macOS does not support GPU passthrough. For GPU workloads, deploy on a Linux machine with NVIDIA GPUs or use AWS (EKS).

Windows

The installer does not install GPU drivers on Windows. Pre-install NVIDIA drivers before running the installer. The installer detects them via nvidia-smi and configures the cluster to use them.

Manual Deployment

Skip the installer entirely. Use this if you already have a Kubernetes cluster, need custom resource limits, or want full control over the Helm deployment. A single unified chart — tracebloc/client — supports AKS, EKS, bare-metal, and OpenShift. Platform behaviour is selected via values overrides; reference defaults live in the repo at client/ci/{aks,eks,bm,oc}-values.yaml.

Add the Helm repository

helm repo add tracebloc https://tracebloc.github.io/client
helm repo update

Get default values

Export the chart’s default configuration to customize it:

helm show values tracebloc/client > values.yaml

Configure values.yaml

Authentication

Set your Client ID and password from the tracebloc client view:

clientId: "<YOUR_CLIENT_ID>"
clientPassword: "<YOUR_CLIENT_PASSWORD>"

Resource Limits for Training Jobs

Defaults are sized for typical workloads; override them to match your job sizes. For GPU workloads, requests and limits must be equal:

env:
  RESOURCE_REQUESTS: "cpu=2,memory=8Gi"
  RESOURCE_LIMITS: "cpu=2,memory=8Gi"
  GPU_REQUESTS: ""             # "nvidia.com/gpu=1" for GPU
  GPU_LIMITS: ""               # "nvidia.com/gpu=1" for GPU
  RUNTIME_CLASS_NAME: ""       # "nvidia" for k3s GPU
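
One way to keep GPU settings separate from the base configuration is a small overlay file passed to Helm after values.yaml; Helm merges multiple --values files, with later files taking precedence. The sketch below writes such an overlay using the values named in the comments above. The function name `write_gpu_values` and the file name gpu-values.yaml are illustrative assumptions.

```shell
# Hypothetical helper: writes a GPU overlay file to the path given as $1.
write_gpu_values() {
  cat > "$1" <<'EOF'
env:
  GPU_REQUESTS: "nvidia.com/gpu=1"
  GPU_LIMITS: "nvidia.com/gpu=1"      # keep equal to GPU_REQUESTS
  RUNTIME_CLASS_NAME: "nvidia"        # "nvidia" for k3s GPU, per the chart comment
EOF
}
```

After running write_gpu_values gpu-values.yaml, append --values gpu-values.yaml after --values values.yaml in the helm upgrade --install command from the Deploy section.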

Storage

Storage class and PVC sizes:

storageClass:
  create: true
  provisioner: ""              # set per platform (see ci/*-values.yaml)
  allowVolumeExpansion: true
  parameters: {}

# Bare-metal only — hostPath-backed PVs at /tracebloc/{data,logs,mysql}
hostPath:
  enabled: false

pvc:
  mysql: 2Gi
  logs: 10Gi
  data: 50Gi

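Platform-specific storage settings (provisioner and parameters) are in the reference values files at client/ci/{aks,eks,bm,oc}-values.yaml; copy the relevant block into your values file.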

Docker Registry

The chart pulls the client image from a container registry — credentials are required in production. Use a token, not a plaintext password.

dockerRegistry:
  server: https://index.docker.io/v1/
  username: <DOCKER_USERNAME>
  password: <DOCKER_PASSWORD or DOCKER_TOKEN>
  email: <DOCKER_EMAIL>

The chart auto-creates a secret named {{ .Release.Name }}-regcred. Omit the dockerRegistry block entirely to skip pull-secret creation (e.g. when using a public mirror).

Proxy (optional)

env:
  HTTP_PROXY_HOST: "your-proxy.company.com"
  HTTP_PROXY_PORT: "8080"
  HTTP_PROXY_USERNAME: ""
  HTTP_PROXY_PASSWORD: ""

Auto-upgrade (on by default)

Chart releases 1.3.0 and later install a <release>-auto-upgrade CronJob that polls https://tracebloc.github.io/client daily and runs helm upgrade --reset-then-reuse-values whenever a newer chart version is published. This keeps deployed clients from drifting behind the latest secure release (tracebloc/client#69).

autoUpgrade:
  enabled: true                # set false to opt out
  schedule: "23 2 * * *"       # daily at 02:23 UTC
  suspend: false               # one-shot pause without removing resources
  repoUrl: "https://tracebloc.github.io/client"
  repoName: "tracebloc"
  chartName: "client"
  timeout: "10m"

The CronJob’s ServiceAccount is bound to the built-in cluster-admin ClusterRole because the chart templates cluster-scoped resources (PriorityClass, StorageClass, ClusterRoleBinding, optionally Namespace). Disable if you need a manual approval gate on upgrades.
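
To exercise the auto-upgrade path without waiting for the schedule, kubectl can create a one-off Job from the CronJob. The sketch below assumes the release name equals the namespace, as in the Deploy command on this page; the function name and the -manual job suffix are illustrative. Note that this runs the real upgrade logic, so it will upgrade the release if a newer chart is available.

```shell
# Hypothetical helper: $1 = release name, assumed to match the namespace.
auto_upgrade_run_now() {
  kubectl create job "$1-auto-upgrade-manual" \
    --namespace "$1" \
    --from="cronjob/$1-auto-upgrade"
}
```

The result can then be read with kubectl logs -n <workspace> job/<workspace>-auto-upgrade-manual. Delete the Job afterward if you want to reuse the name.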

NetworkPolicy hardening for training pods

Training pods run untrusted ML code. The chart can apply a NetworkPolicy that denies ingress and restricts egress to DNS + external HTTPS only — blocking pod-to-pod, MySQL, and Kubernetes API access from the training pod.

networkPolicy:
  training:
    enabled: true
    dnsNamespace: kube-system
    dnsSelector: {}            # empty falls back to {k8s-app: kube-dns}
    clusterCidrs:
      - "10.0.0.0/8"
      - "172.16.0.0/12"
      - "192.168.0.0/16"

Requires a CNI that enforces NetworkPolicy:

Platform	Notes
AKS	needs `--network-policy azure` (Azure NPM) or Calico at cluster create
EKS	needs Calico or Cilium add-on (the default AWS VPC CNI alone does not enforce)
Bare-metal	needs Calico / Cilium / kube-router (Flannel alone does not enforce)
OpenShift	OVN-Kubernetes enforces by default

Leave enabled: false on clusters without an enforcing CNI. The egress lockdown only blocks traffic when the CNI enforces NetworkPolicy, and a policy that is applied but silently ignored is worse than one that is explicitly disabled. Verify your CNI before relying on it.
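
As a quick first check, the DaemonSets in the cluster can be scanned for the enforcing CNIs named in the table. This is only a heuristic sketch: the DaemonSet names in the pattern are assumptions about common installations, a match does not prove that policies are enforced, and some distributions enforce NetworkPolicy without any of these DaemonSets. The reliable test is to apply a deny policy and confirm that traffic is actually blocked.

```shell
# Heuristic: lists DaemonSets whose names match common policy-enforcing CNIs.
# The name list is an assumption; exit status is non-zero when nothing matches.
find_policy_cni() {
  kubectl get daemonsets --all-namespaces -o name |
    grep -E '/(calico-node|cilium|azure-npm|kube-router|ovnkube-node)$'
}
```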

Resource Monitor and node-agents namespace

The tracebloc-resource-monitor DaemonSet collects node-level CPU/memory metrics. It mounts hostPath volumes (/proc, /sys) which Pod Security Admission’s restricted profile bans — so the chart isolates it in a dedicated privileged namespace (default tracebloc-node-agents).

resourceMonitor: true          # set false on clusters where metrics-server cannot be installed
nodeAgents:
  namespace:
    create: true
    name: tracebloc-node-agents

When create: false, create the namespace yourself with the required PSA labels:

kubectl create namespace tracebloc-node-agents
kubectl label namespace tracebloc-node-agents \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=privileged \
  pod-security.kubernetes.io/audit=privileged

The DaemonSet requires metrics-server. It is bundled on k3d/k3s/AKS, present on OpenShift, and must be installed manually on EKS (kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml).
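
Before enabling the resource monitor, it can help to confirm that the metrics API is registered. metrics-server serves the v1beta1.metrics.k8s.io APIService, so its presence is a reasonable signal; the helper name `has_metrics_api` is illustrative, and a registered APIService is not a guarantee that it is healthy.

```shell
# Succeeds if the metrics API (served by metrics-server) is registered.
has_metrics_api() {
  kubectl get apiservice v1beta1.metrics.k8s.io >/dev/null 2>&1
}
```

A functional follow-up is has_metrics_api && kubectl top nodes, which only returns data when metrics-server is actually serving.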

Pod Security Admission labels

Training Jobs run untrusted user-supplied ML code. The chart can create the release namespace with Pod Security Admission warn/audit/enforce labels at the restricted profile for defense-in-depth:

namespace:
  create: false                # true only on greenfield installs
  podSecurity:
    warn: restricted
    audit: restricted
    enforce: restricted        # set "" for bare-metal hostPath installs

With create: false (the default), apply the PSA labels to an existing namespace yourself:

kubectl label namespace <workspace> \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/enforce=restricted
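
Before enforcing the restricted profile on a namespace that already runs workloads, a server-side dry run shows which existing pods would violate it without changing anything. The wrapper name `psa_preview` is illustrative; the kubectl flags are standard.

```shell
# $1 = namespace. Server-side dry run: nothing is modified, but the API server
# returns warnings for existing pods that would violate the restricted profile.
psa_preview() {
  kubectl label --dry-run=server --overwrite namespace "$1" \
    pod-security.kubernetes.io/enforce=restricted
}
```

If psa_preview <workspace> prints no warnings, applying the enforce label should not block the pods currently running there.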

Image digest pinning

Pin images by content hash for reproducible deploys. When digest is set, tag is ignored and imagePullPolicy drops to IfNotPresent.

images:
  jobsManager:    { digest: "sha256:..." }
  podsMonitor:    { digest: "sha256:..." }
  resourceMonitor: { digest: "sha256:..." }
  requestsProxy:  { digest: "sha256:..." }
  mysqlClient:    { tag: "", digest: "" }
  busybox:        { tag: "1.35", digest: "" }
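
One way to obtain a digest for pinning is to read it from an image that has already been pulled with Docker. The sketch below is an assumption about workflow rather than a documented procedure: `image_digest` is a hypothetical helper, and the image reference passed to it is whatever your deployment currently uses.

```shell
# $1 = image reference that has already been pulled from a registry.
# Prints the sha256:... portion of its first RepoDigest.
image_digest() {
  ref="$(docker inspect --format '{{index .RepoDigests 0}}' "$1")" || return 1
  printf '%s\n' "${ref##*@}"   # strip everything up to and including '@'
}
```

The printed value can be pasted into the matching digest field above.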

PriorityClass and PodDisruptionBudgets

The chart pins the MySQL pod with a tracebloc-data-plane PriorityClass (value 1000000) so it survives node-level OOM and scheduling pressure, and applies PDBs to MySQL and the jobs manager. Override only if you run a multi-replica MySQL externally:

priorityClass:
  create: true
  name: tracebloc-data-plane
  value: 1000000

podDisruptionBudget:
  mysql:       { create: true }
  jobsManager: { create: true }

Deploy

Install the chart into a new namespace:

helm upgrade --install <workspace> tracebloc/client \
  --namespace <workspace> \
  --create-namespace \
  --values values.yaml

Upgrade

The auto-upgrade CronJob handles routine version bumps. To upgrade manually:

helm repo update
helm upgrade <workspace> tracebloc/client \
  --namespace <workspace> \
  --reset-then-reuse-values \
  --values values.yaml

When upgrading to chart 1.3.0 from 1.2.x, use --reset-then-reuse-values (not plain --reuse-values): the autoUpgrade block did not exist in 1.2.x, so a plain reuse fails template rendering.
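
The --reset-then-reuse-values flag was added in Helm 3.14, so older Helm binaries reject the upgrade command. A version check along these lines can catch that early; the function name is illustrative, and the parsing assumes the v<major>.<minor>.<patch> format printed by helm version --short.

```shell
# Succeeds if the local Helm is 3.14 or newer.
helm_has_reset_then_reuse() {
  v="$(helm version --short)" || return 1   # e.g. v3.14.2+gc309b6f
  v="${v#v}"                                # drop the leading "v"
  major="${v%%.*}"
  rest="${v#*.}"
  minor="${rest%%.*}"
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 14 ]; }
}
```

For example: helm_has_reset_then_reuse || echo "Upgrade Helm to 3.14 or later first".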

Uninstall

helm uninstall <workspace> -n <workspace>

PVCs and the PriorityClass are annotated helm.sh/resource-policy: keep so your data and shared cluster resources survive uninstall. To remove them too:

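kubectl delete pvc --all -n <workspace>
kubectl delete namespace <workspace>
# PriorityClass is cluster-scoped (default name shown); skip if other releases share it
kubectl delete priorityclass tracebloc-data-plane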

Migrating from legacy charts

If you installed before chart 1.3.x using tracebloc/aks, tracebloc/eks, or tracebloc/bm, see the migration guide in the client repo. Key changes:

4 charts → 1 chart (tracebloc/client) with platform values overrides
Auth keys flattened: jobsManager.env.CLIENT_ID + secrets.clientPassword → top-level clientId + clientPassword
PVC keys flattened: clientData / clientLogsPvc / mysqlPvc (with name, storage, hostPath) → pvc.{data,logs,mysql} (size only) + hostPath.enabled for bare-metal
ServiceAccount renamed from default to {{ .Release.Name }}-jobs-manager
Pull-secret renamed from hard-coded regcred to {{ .Release.Name }}-regcred
The namespace value in the legacy values.yaml is gone — use helm install -n <ns> instead

Security

Tracebloc is designed so your data never has to leave your network. Here’s how:

Data stays local. Training data never leaves your infrastructure. Only metadata and metrics are shared with the platform.
Encrypted. All communication between client and platform is TLS-encrypted.
Isolated. Training runs in containers with restricted system access. Kubernetes namespaces separate workloads from each other.
Scanned. Submitted models are analyzed for vulnerabilities before execution on your infrastructure.
Minimal footprint. The installer only modifies ~/.tracebloc/ and Docker. No system-wide changes.
