The installer uses sensible defaults. This page covers everything you can change — from cluster naming and port mapping to GPU configuration, manual Helm deployment, and day-to-day cluster management.

Installer Options

Override defaults by setting environment variables before the install command. Useful when you need a custom cluster name, multiple worker nodes, or non-standard ports.
| Variable | Default | Description |
| --- | --- | --- |
| CLUSTER_NAME | tracebloc | Name of the k3d cluster |
| SERVERS | 1 | Number of control-plane nodes |
| AGENTS | 1 | Number of worker nodes |
| K8S_VERSION | v1.29.4-k3s1 | k3s image tag |
| HTTP_PORT | 80 | Host port mapped to cluster HTTP ingress |
| HTTPS_PORT | 443 | Host port mapped to cluster HTTPS ingress |
| HOST_DATA_DIR | ~/.tracebloc | Persistent data directory on host |
Example — custom cluster name with two worker nodes:
CLUSTER_NAME=my-cluster AGENTS=2 bash <(curl -fsSL https://tracebloc.io/i.sh)
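When overriding ports, a quick sanity check before launching the installer can catch typos. The sketch below is illustrative only: `valid_port` is a hypothetical helper, not part of the installer, and 8080/8443 are arbitrary example ports.

```shell
# Hypothetical pre-flight helper: succeed only for an integer in 1..65535.
valid_port() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}

# Example override values (assumptions, not installer defaults).
HTTP_PORT=8080
HTTPS_PORT=8443

if valid_port "$HTTP_PORT" && valid_port "$HTTPS_PORT"; then
  echo "HTTP_PORT=$HTTP_PORT HTTPS_PORT=$HTTPS_PORT look valid"
else
  echo "invalid port override" >&2
fi
```

Once the values check out, they are passed the same way as in the example above, for instance `HTTP_PORT=8080 HTTPS_PORT=8443 bash <(curl -fsSL https://tracebloc.io/i.sh)`.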

Cluster Management

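The installer creates a k3d cluster that runs inside Docker. You can stop it to free resources, start it again later, or delete it entirely. Your data persists in HOST_DATA_DIR between stop/start cycles. The commands below use the default cluster name; if you overrode CLUSTER_NAME, substitute that name for tracebloc.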
# Stop — frees CPU/RAM, data persists
k3d cluster stop tracebloc

# Start — resume where you left off
k3d cluster start tracebloc

# Delete — removes the cluster entirely
k3d cluster delete tracebloc
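If the cluster was created with a non-default CLUSTER_NAME, the commands above need that name instead of tracebloc. A minimal sketch of a wrapper that handles this; `tb_cluster` is a hypothetical function name, and `list` is added here because `k3d cluster list` is a convenient way to check whether the cluster is running.

```shell
# Hypothetical wrapper: run a k3d lifecycle action against the cluster named
# by CLUSTER_NAME, falling back to the installer default "tracebloc".
tb_cluster() {
  action="$1"
  name="${CLUSTER_NAME:-tracebloc}"
  case "$action" in
    stop|start|delete|list)
      k3d cluster "$action" "$name" ;;
    *)
      echo "usage: tb_cluster stop|start|delete|list" >&2
      return 2 ;;
  esac
}
```

Usage would look like `tb_cluster stop`, or `CLUSTER_NAME=my-cluster tb_cluster start` for a renamed cluster.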

View logs

The jobs manager is the main tracebloc process. Check its logs when debugging connectivity or job execution issues:
kubectl logs -n <workspace> -l app=tracebloc-jobs-manager
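When a label selector is used, `kubectl logs` shows only the last 10 lines per pod by default, so debugging usually benefits from explicit flags. A sketch, where `tb_logs` is a hypothetical helper and the 200-line and 1h defaults are arbitrary choices:

```shell
# Hypothetical helper: follow recent jobs-manager logs in a workspace.
# --tail, --since and -f are standard kubectl logs flags.
tb_logs() {   # usage: tb_logs <workspace> [lines] [since]
  workspace="$1"
  kubectl logs -n "$workspace" -l app=tracebloc-jobs-manager \
    --tail="${2:-200}" --since="${3:-1h}" -f
}
```

For example, `tb_logs my-workspace` follows the last 200 lines from the past hour, and `tb_logs my-workspace 50 10m` narrows that to 50 lines from the past ten minutes.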

Useful commands

Common kubectl commands for inspecting cluster state:
kubectl get nodes -o wide          # Node status and IPs
kubectl get pods -A                # All pods across namespaces
kubectl get pods -n <workspace>    # Pods in your workspace
kubectl get pvc -n <workspace>     # Persistent volume claims
kubectl get services -n <workspace> # Services and endpoints
Install logs are saved to ~/.tracebloc/install-*.log.

GPU Support

The installer auto-detects GPU hardware and configures the cluster accordingly. No manual setup required on Linux — the installer handles drivers, container toolkit, and Kubernetes device plugin.

NVIDIA (Linux)

Fully automatic. The installer:
  1. Detects NVIDIA GPUs via nvidia-smi or lspci
  2. Installs drivers if missing (Ubuntu, RHEL/CentOS, Arch)
  3. Installs the NVIDIA Container Toolkit and configures Docker
  4. Deploys the NVIDIA k8s device plugin into the cluster
  5. Passes --gpus=all to k3d
A reboot may be required after driver installation. Re-run the installer afterward — it picks up where it left off.
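After the installer finishes (and after any reboot), it can be worth confirming that the device plugin actually advertises GPUs to the scheduler. A sketch under assumptions: `gpu_nodes` is a hypothetical helper, and it relies on the `nvidia.com/gpu` resource name that the NVIDIA device plugin registers.

```shell
# Hypothetical helper: print "<node> <gpu-count>" for every node that
# advertises at least one allocatable nvidia.com/gpu.
gpu_nodes() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
    awk '$2 ~ /^[0-9]+$/ && $2 > 0 { print $1, $2 }'
}
```

Empty output from `gpu_nodes` suggests the device plugin is not running or the node has not registered its GPUs yet.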

AMD (Linux)

Auto-detected. ROCm is installed automatically on Ubuntu and RHEL/CentOS. A logout/login may be needed for full GPU access.

macOS

CPU only. Docker Desktop on macOS does not support GPU passthrough. For GPU workloads, deploy on a Linux machine with NVIDIA GPUs or use AWS (EKS).

Windows

The installer does not install GPU drivers on Windows. Pre-install NVIDIA drivers before running the installer. The installer detects them via nvidia-smi and configures the cluster to use them.

Manual Deployment

Skip the installer entirely. Use this if you already have a Kubernetes cluster, need custom resource limits, or want full control over the Helm deployment. A single unified chart — tracebloc/client — supports AKS, EKS, bare-metal, and OpenShift. Platform behaviour is selected via values overrides; reference defaults live in the repo at client/ci/{aks,eks,bm,oc}-values.yaml.

Add the Helm repository

helm repo add tracebloc https://tracebloc.github.io/client
helm repo update

Get default values

Export the chart’s default configuration to customize it:
helm show values tracebloc/client > values.yaml

Configure values.yaml

Authentication

Set your Client ID and password from the tracebloc client view:
clientId: "<YOUR_CLIENT_ID>"
clientPassword: "<YOUR_CLIENT_PASSWORD>"

Resource Limits for Training Jobs

Defaults are sized for typical workloads. Override them to match your job sizes. For GPU support, requests and limits must be equal:
env:
  RESOURCE_REQUESTS: "cpu=2,memory=8Gi"
  RESOURCE_LIMITS: "cpu=2,memory=8Gi"
  GPU_REQUESTS: ""             # "nvidia.com/gpu=1" for GPU
  GPU_LIMITS: ""               # "nvidia.com/gpu=1" for GPU
  RUNTIME_CLASS_NAME: ""       # "nvidia" for k3s GPU
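For a GPU job profile, one option is to keep the GPU settings in a separate override file and layer it on top of values.yaml. The sketch below uses only the keys and example values shown above; the file name gpu-values.yaml and the single-GPU sizing are assumptions.

```shell
# Write a GPU override file (hypothetical file name). Requests and limits
# match, as required for GPU support.
cat > gpu-values.yaml <<'EOF'
env:
  GPU_REQUESTS: "nvidia.com/gpu=1"
  GPU_LIMITS: "nvidia.com/gpu=1"
  RUNTIME_CLASS_NAME: "nvidia"   # only for k3s GPU setups
EOF
```

Helm merges multiple values files from left to right, so passing `--values values.yaml --values gpu-values.yaml` at deploy time keeps the CPU and memory settings from the base file.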

Storage

Storage class and PVC sizes:
storageClass:
  create: true
  provisioner: ""              # set per platform (see ci/*-values.yaml)
  allowVolumeExpansion: true
  parameters: {}

# Bare-metal only — hostPath-backed PVs at /tracebloc/{data,logs,mysql}
hostPath:
  enabled: false

pvc:
  mysql: 2Gi
  logs: 10Gi
  data: 50Gi
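Platform-specific settings (provisioner and parameters) are in the reference files at client/ci/{aks,eks,bm,oc}-values.yaml; copy the relevant block into your values file.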

Docker Registry

The chart pulls the client image from a container registry — credentials are required in production. Use a token, not a plaintext password.
dockerRegistry:
  server: https://index.docker.io/v1/
  username: <DOCKER_USERNAME>
  password: <DOCKER_PASSWORD or DOCKER_TOKEN>
  email: <DOCKER_EMAIL>
The chart auto-creates a secret named {{ .Release.Name }}-regcred. Omit the dockerRegistry block entirely to skip pull-secret creation (e.g. when using a public mirror).

Proxy (optional)

env:
  HTTP_PROXY_HOST: "your-proxy.company.com"
  HTTP_PROXY_PORT: "8080"
  HTTP_PROXY_USERNAME: ""
  HTTP_PROXY_PASSWORD: ""

Auto-upgrade (on by default)

Chart versions 1.3.0 and later install a <release>-auto-upgrade CronJob that polls https://tracebloc.github.io/client daily and runs helm upgrade --reset-then-reuse-values whenever a newer chart version is published. This keeps older deployed clients from drifting behind the latest secure release (see tracebloc/client#69).
autoUpgrade:
  enabled: true                # set false to opt out
  schedule: "23 2 * * *"       # daily at 02:23 UTC
  suspend: false               # one-shot pause without removing resources
  repoUrl: "https://tracebloc.github.io/client"
  repoName: "tracebloc"
  chartName: "client"
  timeout: "10m"
The CronJob’s ServiceAccount is bound to the built-in cluster-admin ClusterRole because the chart templates cluster-scoped resources (PriorityClass, StorageClass, ClusterRoleBinding, optionally Namespace). Disable if you need a manual approval gate on upgrades.
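To inspect the CronJob or trigger an upgrade check without waiting for the schedule, something like the following may help. This is a sketch: the function names and the manual Job name are hypothetical, and it assumes the CronJob lives in the release namespace.

```shell
# Hypothetical helpers; <release>-auto-upgrade is the CronJob name described above.
upgrade_status() {   # usage: upgrade_status <release> <namespace>
  kubectl get cronjob "$1-auto-upgrade" -n "$2"
}

upgrade_now() {      # usage: upgrade_now <release> <namespace>
  # Create a one-off Job from the CronJob template instead of waiting
  # for the next scheduled run.
  kubectl create job "$1-upgrade-manual-$(date +%s)" \
    --from="cronjob/$1-auto-upgrade" -n "$2"
}
```

`upgrade_status` shows the schedule, suspend state, and last run time; `upgrade_now` uses the standard `kubectl create job --from=cronjob/...` mechanism.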

NetworkPolicy hardening for training pods

Training pods run untrusted ML code. The chart can apply a NetworkPolicy that denies ingress and restricts egress to DNS + external HTTPS only — blocking pod-to-pod, MySQL, and Kubernetes API access from the training pod.
networkPolicy:
  training:
    enabled: true
    dnsNamespace: kube-system
    dnsSelector: {}            # empty falls back to {k8s-app: kube-dns}
    clusterCidrs:
      - "10.0.0.0/8"
      - "172.16.0.0/12"
      - "192.168.0.0/16"
Requires a CNI that enforces NetworkPolicy:
| Platform | Notes |
| --- | --- |
| AKS | Needs --network-policy azure (Azure NPM) or Calico at cluster creation |
| EKS | Needs the Calico or Cilium add-on (the default AWS VPC CNI alone does not enforce) |
| Bare-metal | Needs Calico / Cilium / kube-router (Flannel alone does not enforce) |
| OpenShift | OVN-Kubernetes enforces by default |
Leave enabled: false on clusters without an enforcing CNI. The training-pod egress lockdown only blocks traffic if your CNI enforces NetworkPolicy, and silently having no protection is worse than explicitly disabling it. Verify your CNI before relying on it.
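One rough way to check for an enforcing CNI is to look for its pods. The sketch below is a heuristic only: `enforcing_cni` is a hypothetical helper, and the name patterns are assumptions about how Calico, Cilium, kube-router, Azure NPM, and OVN-Kubernetes pods are commonly named.

```shell
# Heuristic only: list CNIs known to enforce NetworkPolicy whose pods are
# present in the cluster. Pod name patterns are assumptions about common
# installs, not a feature of the chart.
enforcing_cni() {
  kubectl get pods -A -o name |
    grep -E -o 'calico|cilium|kube-router|azure-npm|ovnkube' |
    sort -u
}
```

Empty output is a hint to keep `networkPolicy.training.enabled: false` until the CNI is confirmed. A definitive check is to apply a deny-all policy in a scratch namespace and test connectivity.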

Resource Monitor and node-agents namespace

The tracebloc-resource-monitor DaemonSet collects node-level CPU/memory metrics. It mounts hostPath volumes (/proc, /sys) which Pod Security Admission’s restricted profile bans — so the chart isolates it in a dedicated privileged namespace (default tracebloc-node-agents).
resourceMonitor: true          # set false on clusters where metrics-server cannot be installed
nodeAgents:
  namespace:
    create: true
    name: tracebloc-node-agents
When create: false, create the namespace yourself with the required PSA labels:
kubectl create namespace tracebloc-node-agents
kubectl label namespace tracebloc-node-agents \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=privileged \
  pod-security.kubernetes.io/audit=privileged
The DaemonSet requires metrics-server. It is bundled on k3d/k3s/AKS, present on OpenShift, and must be installed manually on EKS (kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml).
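Whether metrics-server is present can be checked through the metrics API registration. A sketch, assuming the standard `v1beta1.metrics.k8s.io` APIService that metrics-server registers; `metrics_ready` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the metrics API is registered and
# reports the Available condition as True.
metrics_ready() {
  status=$(kubectl get apiservice v1beta1.metrics.k8s.io \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
  [ "$status" = "True" ]
}
```

For example: `metrics_ready || echo "install metrics-server or set resourceMonitor: false"`.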

Pod Security Admission labels

Training Jobs run untrusted user-supplied ML code. The chart can create the release namespace with Pod Security Admission warn/audit/enforce labels at the restricted profile for defense-in-depth:
namespace:
  create: false                # true only on greenfield installs
  podSecurity:
    warn: restricted
    audit: restricted
    enforce: restricted        # set "" for bare-metal hostPath installs
When create is false (the default), add the PSA labels to an existing namespace yourself:
kubectl label namespace <workspace> \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/enforce=restricted
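Before enforcing the restricted profile on a namespace that already runs pods, a server-side dry run can show which workloads would violate it. A sketch using standard kubectl flags; `psa_preview` is a hypothetical helper name.

```shell
# Server-side dry run: the API server evaluates existing pods against the
# restricted profile and returns warnings, without changing the namespace.
psa_preview() {   # usage: psa_preview <workspace>
  kubectl label --dry-run=server --overwrite namespace "$1" \
    pod-security.kubernetes.io/enforce=restricted
}
```

Any warnings printed by `psa_preview <workspace>` name the pods and the fields that would be rejected once enforcement is switched on.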

Image digest pinning

Pin images by content hash for reproducible deploys. When digest is set, tag is ignored and imagePullPolicy drops to IfNotPresent.
images:
  jobsManager:    { digest: "sha256:..." }
  podsMonitor:    { digest: "sha256:..." }
  resourceMonitor: { digest: "sha256:..." }
  requestsProxy:  { digest: "sha256:..." }
  mysqlClient:    { tag: "", digest: "" }
  busybox:        { tag: "1.35", digest: "" }
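Digests usually come from an image reference of the form `name@sha256:<hex>`, for example the output of `docker inspect --format '{{index .RepoDigests 0}}' <image>`. A small sketch for pulling out the part that goes into values.yaml; `image_digest` is a hypothetical helper.

```shell
# Hypothetical helper: extract the "sha256:..." part from an image reference
# of the form repo/name@sha256:<hex>; fail if the reference has no digest.
image_digest() {
  case "$1" in
    *@sha256:*) printf '%s\n' "${1##*@}" ;;
    *) return 1 ;;
  esac
}
```

The printed value is what the `digest:` fields above expect.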

PriorityClass and PodDisruptionBudgets

The chart pins the MySQL pod with a tracebloc-data-plane PriorityClass (value 1000000) so it survives node-level OOM and scheduling pressure, and applies PDBs to MySQL and the jobs manager. Override only if you run a multi-replica MySQL externally:
priorityClass:
  create: true
  name: tracebloc-data-plane
  value: 1000000

podDisruptionBudget:
  mysql:       { create: true }
  jobsManager: { create: true }

Deploy

Install the chart into a new namespace:
helm upgrade --install <workspace> tracebloc/client \
  --namespace <workspace> \
  --create-namespace \
  --values values.yaml
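Before installing, it can help to render the chart locally and review the manifests it would create. A sketch using `helm template`, which renders without touching the cluster; `tb_render` is a hypothetical helper name.

```shell
# Hypothetical helper: render the chart locally so the manifests can be
# reviewed before anything is applied to the cluster.
tb_render() {   # usage: tb_render <workspace> [values-file]
  helm template "$1" tracebloc/client \
    --namespace "$1" \
    --values "${2:-values.yaml}"
}
```

For example, `tb_render my-workspace > rendered.yaml` writes the manifests to a file for inspection, including the cluster-scoped resources the chart templates.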

Upgrade

The auto-upgrade CronJob handles routine version bumps. To upgrade manually:
helm repo update
helm upgrade <workspace> tracebloc/client \
  --namespace <workspace> \
  --reset-then-reuse-values \
  --values values.yaml
When upgrading to chart 1.3.0 from 1.2.x, use --reset-then-reuse-values (not plain --reuse-values): the new autoUpgrade block did not exist in 1.2.x, so a plain reuse fails template rendering.
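The --reset-then-reuse-values flag is not available in older Helm clients. The sketch below checks the local version first; it assumes the flag arrived in Helm 3.14 (worth confirming against the Helm release notes), and `helm_supports_reset_then_reuse` is a hypothetical helper.

```shell
# Hypothetical helper: succeed if the local Helm client is at least 3.14,
# the version assumed here to have introduced --reset-then-reuse-values.
helm_supports_reset_then_reuse() {
  ver=$(helm version --short 2>/dev/null) || return 1
  ver=${ver#v}              # "v3.14.0+g3fc9f4b" -> "3.14.0+g3fc9f4b"
  major=${ver%%.*}          # "3"
  rest=${ver#*.}            # "14.0+g3fc9f4b"
  minor=${rest%%.*}         # "14"
  case "$major$minor" in ''|*[!0-9]*) return 1 ;; esac
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 14 ]; }
}
```

For example: `helm_supports_reset_then_reuse || echo "upgrade the Helm client before running the manual upgrade"`.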

Uninstall

helm uninstall <workspace> -n <workspace>
PVCs and the PriorityClass are annotated helm.sh/resource-policy: keep so your data and shared cluster resources survive uninstall. To remove them too:
kubectl delete pvc --all -n <workspace>
kubectl delete namespace <workspace>
kubectl delete priorityclass tracebloc-data-plane   # cluster-scoped; skip if another release still uses it

Migrating from legacy charts

If you installed before chart 1.3.x using tracebloc/aks, tracebloc/eks, or tracebloc/bm, see the migration guide in the client repo. Key changes:
  • 4 charts → 1 chart (tracebloc/client) with platform values overrides
  • Auth keys flattened: jobsManager.env.CLIENT_ID + secrets.clientPassword → top-level clientId + clientPassword
  • PVC keys flattened: clientData / clientLogsPvc / mysqlPvc (with name, storage, hostPath) → pvc.{data,logs,mysql} (size only) + hostPath.enabled for bare-metal
  • ServiceAccount renamed from default to {{ .Release.Name }}-jobs-manager
  • Pull-secret renamed from hard-coded regcred to {{ .Release.Name }}-regcred
  • The namespace value in the legacy values.yaml is gone — use helm install -n <ns> instead

Security

Tracebloc is designed so your data never has to leave your network. Here’s how:
  • Data stays local. Training data never leaves your infrastructure. Only metadata and metrics are shared with the platform.
  • Encrypted. All communication between client and platform is TLS-encrypted.
  • Isolated. Training runs in containers with restricted system access. Kubernetes namespaces separate workloads from each other.
  • Scanned. Submitted models are analyzed for vulnerabilities before execution on your infrastructure.
  • Minimal footprint. Apart from the GPU driver and container toolkit setup described under GPU Support, the installer only modifies ~/.tracebloc/ and Docker. No other system-wide changes.