# Configuration

> Customize your tracebloc workspace — environment variables, cluster management, GPU support, and manual Helm deployment.

The installer uses sensible defaults; this page covers what you can change.

**Installed with the one-liner?** See [Installer Options](#installer-options), [Cluster Management](#cluster-management), and [GPU Support](#gpu-support). **Deploying into your own cluster with Helm** (EKS, AKS, bare-metal)? Jump to [Manual Deployment](#manual-deployment).

## Installer Options

Override defaults by setting environment variables before the install command. Useful for a custom cluster name, extra worker nodes, or a different data directory.

| Variable           | Default        | Description                                                                                                                                                               |
| ------------------ | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `CLUSTER_NAME`     | `tracebloc`    | Name of the k3d cluster                                                                                                                                                   |
| `SERVERS`          | `1`            | Number of control-plane nodes                                                                                                                                             |
| `AGENTS`           | `1`            | Number of worker nodes                                                                                                                                                    |
| `K8S_VERSION`      | `v1.29.4-k3s1` | k3s image tag                                                                                                                                                             |
| `HOST_DATA_DIR`    | `~/.tracebloc` | Persistent data directory on host — **must be a local disk** (NFS/CIFS/SMB is rejected; the database corrupts on network storage).                                        |
| `HOST_DATASET_DIR` | *(unset)*      | Optional. Place the large dataset volume on a separate mount (e.g. a network/NFS share); the database + logs stay on `HOST_DATA_DIR`. Must already exist and be writable. |

Example — custom cluster name with two worker nodes:

```bash theme={null}
CLUSTER_NAME=my-cluster AGENTS=2 bash <(curl -fsSL https://tracebloc.io/i.sh)
```
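Before pointing the installer at a custom `HOST_DATA_DIR` or `HOST_DATASET_DIR`, it can help to check the constraints from the table yourself. The sketch below is not part of the installer: the helper names `is_network_fs` and `check_dir` are hypothetical, the list of network filesystem types is an assumption, and `stat -f -c %T` assumes GNU coreutils on Linux. The installer's own check may differ.

```bash
# Pre-flight sketch for HOST_DATA_DIR / HOST_DATASET_DIR (helper names are hypothetical).

# Return 0 if the given filesystem type name looks like network storage.
is_network_fs() {
  case "$1" in
    nfs*|cifs|smb*) return 0 ;;
    *) return 1 ;;
  esac
}

# Check that a directory exists and is writable, then print its filesystem type.
check_dir() {
  dir=$1
  [ -d "$dir" ] || { echo "missing: $dir" >&2; return 1; }
  [ -w "$dir" ] || { echo "not writable: $dir" >&2; return 1; }
  stat -f -c %T "$dir"   # GNU stat: filesystem type, e.g. ext2/ext3 or nfs
}

# Example:
#   dir="${HOST_DATA_DIR:-$HOME/.tracebloc}"
#   if fstype=$(check_dir "$dir") && ! is_network_fs "$fstype"; then
#     echo "ok: $dir ($fstype)"
#   fi
```

For `HOST_DATASET_DIR`, only the `check_dir` part applies, since a network mount is allowed there.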

## Cluster Management

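The installer creates a k3d cluster that runs inside Docker. You can stop it to free resources, start it again later, or delete it entirely. Your data persists in `HOST_DATA_DIR` between stop/start cycles. The commands below use the default cluster name `tracebloc`; substitute your own if you set `CLUSTER_NAME`.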

```bash theme={null}
# Stop — frees CPU/RAM, data persists
k3d cluster stop tracebloc

# Start — resume where you left off
k3d cluster start tracebloc

# Delete — removes the cluster entirely
k3d cluster delete tracebloc
```

### View logs

The jobs manager is the main tracebloc process. Check its logs when debugging connectivity or job execution issues:

```bash theme={null}
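# With a label selector, kubectl prints only the last 10 lines per pod unless --tail is set
kubectl logs -n <workspace> -l app=manager --tail=200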
```

### Useful commands

Common kubectl commands for inspecting cluster state:

```bash theme={null}
kubectl get nodes -o wide          # Node status and IPs
kubectl get pods -A                # All pods across namespaces
kubectl get pods -n <workspace>    # Pods in your workspace
kubectl get pvc -n <workspace>     # Persistent volume claims
kubectl get services -n <workspace> # Services and endpoints
```

Install logs are saved to `~/.tracebloc/install-*.log`.

## GPU Support

GPU setup is automatic on Linux: the installer detects your hardware and sets up drivers, the container toolkit, and the Kubernetes device plugin.

### NVIDIA (Linux)

Fully automatic. The installer:

1. Detects NVIDIA GPUs via `nvidia-smi` or `lspci`
2. Installs drivers if missing (Ubuntu, RHEL/CentOS, Arch)
3. Installs the NVIDIA Container Toolkit and configures Docker
4. Deploys the NVIDIA k8s device plugin into the cluster
5. Passes `--gpus=all` to k3d

A reboot may be required after driver installation. Re-run the installer afterward — it picks up where it left off.
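Once the installer finishes, one way to confirm that the device plugin registered the GPUs is to read each node's allocatable `nvidia.com/gpu` count. This is a minimal sketch that assumes `kubectl` already points at the cluster; the helper names `list_node_gpus` and `count_gpus` are illustrative only.

```bash
# Print "NAME GPU" per node. The dots in the resource name are escaped for custom-columns.
list_node_gpus() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Sum the GPU column read from stdin. Nodes without GPUs report "<none>" and count as 0.
count_gpus() {
  awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Example:
#   list_node_gpus | count_gpus
```

A total of `0` after a successful install usually means the reboot mentioned above is still pending or the device plugin pod is not running yet.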

### AMD (Linux)

Auto-detected. ROCm is installed automatically on Ubuntu and RHEL/CentOS. A logout/login may be needed for full GPU access.

### macOS

CPU only. Docker Desktop on macOS does not support GPU passthrough. For GPU workloads, deploy on a Linux machine with NVIDIA GPUs or use [AWS (EKS)](/environment-setup/eks-client-deployment-guide).

### Windows

The installer does **not** install GPU drivers on Windows. Pre-install NVIDIA drivers before running the installer. The installer detects them via `nvidia-smi` and configures the cluster to use them.

## Manual Deployment

Skip the installer entirely. Use this if you already have a Kubernetes cluster, need custom resource limits, or want full control over the Helm deployment.

A single chart — **`tracebloc/client`** — supports AKS, EKS, bare-metal, and OpenShift; choose your platform via values overrides. Reference defaults live at [`client/ci/{aks,eks,bm,oc}-values.yaml`](https://github.com/tracebloc/client/tree/main/client/ci).

### Add the Helm repository

```bash theme={null}
helm repo add tracebloc https://tracebloc.github.io/client
helm repo update
```

### Get default values

Export the chart's default configuration to customize it:

```bash theme={null}
helm show values tracebloc/client > values.yaml
```

### Configure values.yaml

#### Authentication

Set your Client ID and password from the [tracebloc client view](https://ai.tracebloc.io/clients):

```yaml theme={null}
clientId: "<YOUR_CLIENT_ID>"
clientPassword: "<YOUR_CLIENT_PASSWORD>"
```

#### Resource Limits for Training Jobs

Defaults are sized for typical workloads. Override them to match your job size. For GPU jobs, requests and limits **must** be equal:

```yaml theme={null}
env:
  RESOURCE_REQUESTS: "cpu=2,memory=8Gi"
  RESOURCE_LIMITS: "cpu=2,memory=8Gi"
  GPU_REQUESTS: ""             # "nvidia.com/gpu=1" for GPU
  GPU_LIMITS: ""               # "nvidia.com/gpu=1" for GPU
  RUNTIME_CLASS_NAME: ""       # "nvidia" for k3s GPU
```
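One way to keep GPU settings separate from the base configuration is a small overlay file passed after `values.yaml`; Helm merges multiple `--values` files and the right-most file wins. The sketch below writes such an overlay using the values from the comments in the block above. The file name `gpu-values.yaml` and the helper `write_gpu_overlay` are arbitrary choices, not chart conventions.

```bash
# Write a GPU overlay file to the path given as $1 (file name is arbitrary).
write_gpu_overlay() {
  cat > "$1" <<'EOF'
env:
  GPU_REQUESTS: "nvidia.com/gpu=1"
  GPU_LIMITS: "nvidia.com/gpu=1"     # keep equal to GPU_REQUESTS
  RUNTIME_CLASS_NAME: "nvidia"       # value suggested for k3s GPU nodes
EOF
}

# Example:
#   write_gpu_overlay gpu-values.yaml
#   helm upgrade --install <workspace> tracebloc/client -n <workspace> \
#     --values values.yaml --values gpu-values.yaml
```

Because only the `env` keys listed in the overlay are overridden, CPU and memory settings from `values.yaml` stay in effect.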

#### Storage

Storage class and PVC sizes:

```yaml theme={null}
storageClass:
  create: true
  provisioner: ""              # set per platform (see ci/*-values.yaml)
  allowVolumeExpansion: true
  parameters: {}

# Bare-metal only — hostPath-backed PVs at /tracebloc/{data,logs,mysql}
hostPath:
  enabled: false

pvc:
  mysql: 2Gi
  logs: 10Gi
  data: 50Gi
```

<Note>
  **The database must stay on local disk.** MySQL/InnoDB is unsafe on NFS/CIFS, so the database and logs always use the local `/tracebloc` tree. To place large **datasets** on a network mount, set the installer's `HOST_DATASET_DIR` — it relocates only the dataset volume (the chart's `hostPath.datasetPath` → `/tracebloc-data`) and runs ingestion as the mount's owner uid, so writes succeed under NFS `root_squash`.
</Note>

Platform snippets (drop into your values file):

<details>
  <summary>AKS</summary>

  ```yaml theme={null}
  storageClass:
    create: true
    provisioner: file.csi.azure.com
    parameters:
      skuName: Standard_LRS
    mountOptions:
      - dir_mode=0750
      - file_mode=0640
      - uid=999
      - gid=999
      - mfsymlinks
      - cache=strict
      - actimeo=30
  clusterScope: true
  ```
</details>

<details>
  <summary>EKS</summary>

  ```yaml theme={null}
  storageClass:
    create: true
    provisioner: efs.csi.aws.com
    volumeBindingMode: Immediate
    reclaimPolicy: Retain
    mountOptions: [actimeo=30]
    parameters:
      directoryPerms: "700"
      uid: "999"
      gid: "999"
      fileSystemId: <YOUR_EFS_FILESYSTEM_ID>
      provisioningMode: efs-ap
  clusterScope: true
  ```
</details>

<details>
  <summary>Bare-metal / k3s / k3d</summary>

  ```yaml theme={null}
  hostPath:
    enabled: true
  pvcAccessMode: ReadWriteOnce
  storageClass:
    create: true
    provisioner: kubernetes.io/no-provisioner
  namespace:
    podSecurity:
      enforce: ""        # hostPath needs the privileged init-mysql-data container
      enforceVersion: ""
  clusterScope: true
  ```
</details>

<details>
  <summary>OpenShift</summary>

  ```yaml theme={null}
  storageClass:
    create: false
    name: ocs-storagecluster-cephfs
  clusterScope: false
  openshift:
    scc:
      enabled: true
  networkPolicy:
    training:
      enabled: true
      dnsNamespace: openshift-dns
      dnsSelector:
        dns.operator.openshift.io/daemonset-dns: default
      clusterCidrs:
        - "10.128.0.0/14"
        - "172.30.0.0/16"
  ```
</details>

#### Docker Registry

The chart pulls the client image from a container registry — credentials are required in production. Use a token, not a plaintext password.

```yaml theme={null}
dockerRegistry:
  server: https://index.docker.io/v1/
  username: <DOCKER_USERNAME>
  password: <DOCKER_PASSWORD or DOCKER_TOKEN>
  email: <DOCKER_EMAIL>
```

The chart auto-creates a secret named `{{ .Release.Name }}-regcred`. Omit the `dockerRegistry` block entirely to skip pull-secret creation (e.g. when using a public mirror).

#### Proxy (optional)

```yaml theme={null}
env:
  HTTP_PROXY_HOST: "your-proxy.company.com"
  HTTP_PROXY_PORT: "8080"
  HTTP_PROXY_USERNAME: ""
  HTTP_PROXY_PASSWORD: ""
```

#### Auto-upgrade (on by default)

Releases of chart `1.3.0+` install a `<release>-auto-upgrade` CronJob that polls `https://tracebloc.github.io/client` daily and runs `helm upgrade --reset-then-reuse-values` whenever a newer chart version is published — so clients auto-update instead of staying pinned to the version they were installed with.

```yaml theme={null}
autoUpgrade:
  enabled: true                # set false to opt out
  schedule: "23 2 * * *"       # daily at 02:23 UTC
  suspend: false               # one-shot pause without removing resources
  repoUrl: "https://tracebloc.github.io/client"
  repoName: "tracebloc"
  chartName: "client"
  timeout: "10m"
```

The CronJob's ServiceAccount is bound to the built-in `cluster-admin` ClusterRole because the chart templates cluster-scoped resources (PriorityClass, StorageClass, ClusterRoleBinding, optionally Namespace). Disable if you need a manual approval gate on upgrades.
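To test the upgrade path without waiting for the schedule, a one-off Job can be created from the CronJob. Since that Job runs with `cluster-admin`, the sketch below only prints the command so it can be reviewed first. It assumes the CronJob lives in the release namespace; the helper name `auto_upgrade_trigger_cmd` and the `-manual-upgrade` Job suffix are hypothetical.

```bash
# Print (do not run) the kubectl command that starts a one-off auto-upgrade Job.
#   $1 = Helm release name, $2 = namespace
auto_upgrade_trigger_cmd() {
  release=$1
  ns=$2
  printf 'kubectl create job --from=cronjob/%s-auto-upgrade %s-manual-upgrade -n %s\n' \
    "$release" "$release" "$ns"
}

# Example:
#   auto_upgrade_trigger_cmd <workspace> <workspace>
```

Run the printed line once it looks right, then follow the Job's pod logs to see what the upgrade did.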

#### NetworkPolicy hardening for training pods

Training pods run untrusted ML code. The chart can apply a NetworkPolicy that denies ingress and restricts egress to DNS + external HTTPS only — blocking pod-to-pod, MySQL, and Kubernetes API access from the training pod.

```yaml theme={null}
networkPolicy:
  training:
    enabled: true
    dnsNamespace: kube-system
    dnsSelector: {}            # empty falls back to {k8s-app: kube-dns}
    clusterCidrs:
      - "10.0.0.0/8"
      - "172.16.0.0/12"
      - "192.168.0.0/16"
```

Requires a CNI that **enforces** NetworkPolicy:

| Platform   | Notes                                                                              |
| ---------- | ---------------------------------------------------------------------------------- |
| AKS        | needs `--network-policy azure` (Azure NPM) or Calico at cluster create             |
| EKS        | needs Calico or Cilium add-on (the default AWS VPC CNI alone does **not** enforce) |
| Bare-metal | needs Calico / Cilium / kube-router (Flannel alone does **not** enforce)           |
| OpenShift  | OVN-Kubernetes enforces by default                                                 |

Leave `enabled: false` on clusters without an enforcing CNI — silently having no protection is worse than explicitly disabling it.

<Warning>
  The chart's training-pod egress lockdown only blocks traffic if your CNI enforces NetworkPolicy. Verify your CNI before relying on it.
</Warning>
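A rough first check is to look for a known policy-enforcing CNI among the system pods. The sketch below is a heuristic only: the pod-name prefixes are common defaults rather than guarantees, some installs place the CNI in another namespace (for example `calico-system`), and the helper name `detect_policy_cni` is hypothetical. A result of `none` does not prove that policies are unenforced, and a match does not prove they are.

```bash
# Guess the policy-enforcing CNI from pod names read on stdin (one name per line).
detect_policy_cni() {
  awk '
    /^calico-node-/ { found = "calico" }
    /^cilium-/      { found = "cilium" }
    /^kube-router-/ { found = "kube-router" }
    /^azure-npm-/   { found = "azure-npm" }
    END { print (found ? found : "none") }
  '
}

# Example:
#   kubectl get pods -n kube-system --no-headers \
#     -o custom-columns=NAME:.metadata.name | detect_policy_cni
```

For a definitive answer, apply a deny-all policy in a scratch namespace and confirm that traffic between two test pods is actually blocked.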

#### Resource Monitor and node-agents namespace

The `tracebloc-resource-monitor` DaemonSet collects node-level CPU/memory metrics. It mounts `hostPath` volumes (`/proc`, `/sys`) which Pod Security Admission's `restricted` profile bans — so the chart isolates it in a dedicated **privileged** namespace (default `tracebloc-node-agents`).

```yaml theme={null}
resourceMonitor: true          # set false on clusters where metrics-server cannot be installed
nodeAgents:
  namespace:
    create: true
    name: tracebloc-node-agents
```

When `create: false`, create the namespace yourself with the required PSA labels:

```bash theme={null}
kubectl create namespace tracebloc-node-agents
kubectl label namespace tracebloc-node-agents \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=privileged \
  pod-security.kubernetes.io/audit=privileged
```

The DaemonSet **requires** `metrics-server`. It is bundled on k3d/k3s/AKS, present on OpenShift, and **must be installed manually on EKS** (`kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml`).

#### Pod Security Admission labels

Training Jobs run untrusted user-supplied ML code. The chart can create the release namespace with Pod Security Admission `warn`/`audit`/`enforce` labels at the `restricted` profile for defense-in-depth:

```yaml theme={null}
namespace:
  create: false                # true only on greenfield installs
  podSecurity:
    warn: restricted
    audit: restricted
    enforce: restricted        # set "" for bare-metal hostPath installs
```

With `create: false` (the default), apply the PSA labels to an existing namespace yourself:

```bash theme={null}
kubectl label namespace <workspace> \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted \
  pod-security.kubernetes.io/enforce=restricted
```
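Before enforcing `restricted` on a namespace that already runs workloads, a server-side dry run shows which existing pods would violate the profile without changing anything. The sketch below builds the three label arguments with a hypothetical helper, `psa_labels`, so the same function also covers the `privileged` case used for the node-agents namespace.

```bash
# Print the three Pod Security Admission label arguments for the level given as $1.
psa_labels() {
  for mode in enforce warn audit; do
    printf 'pod-security.kubernetes.io/%s=%s\n' "$mode" "$1"
  done
}

# Example: dry run only, prints warnings for pods that would violate the profile
#   kubectl label --dry-run=server --overwrite namespace <workspace> $(psa_labels restricted)
```

The command substitution is left unquoted on purpose so that each label becomes a separate argument.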

#### Image digest pinning

Pin images by content hash for reproducible deploys. When `digest` is set, `tag` is ignored and `imagePullPolicy` drops to `IfNotPresent`.

```yaml theme={null}
images:
  jobsManager:     { digest: "sha256:..." }
  podsMonitor:     { digest: "sha256:..." }
  resourceMonitor: { digest: "sha256:..." }
  requestsProxy:   { digest: "sha256:..." }
  mysqlClient:     { tag: "", digest: "" }
  busybox:         { tag: "1.35", digest: "" }
```
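The digest for an image you have already pulled can be read from Docker's repo digest, which has the form `repository@sha256:<hex>`. The sketch below extracts and sanity-checks the `sha256:` part before you paste it into `values.yaml`. The helper names are hypothetical, and the image reference in the example is a placeholder.

```bash
# Extract the "sha256:..." part from a repo digest such as "repo/image@sha256:<64 hex>".
digest_of() {
  printf '%s\n' "${1##*@}"
}

# Return 0 only for a well-formed digest: "sha256:" followed by 64 lowercase hex characters.
is_sha256_digest() {
  printf '%s\n' "$1" | grep -Eq '^sha256:[0-9a-f]{64}$'
}

# Example (image reference is a placeholder):
#   ref=$(docker inspect --format '{{index .RepoDigests 0}}' <image>:<tag>)
#   d=$(digest_of "$ref") && is_sha256_digest "$d" && echo "$d"
```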

#### PriorityClass and PodDisruptionBudgets

The chart pins the MySQL pod with a `tracebloc-data-plane` PriorityClass (value `1000000`) so it survives node-level OOM and scheduling pressure, and applies PDBs to MySQL and the jobs manager. Override only if you run a multi-replica MySQL externally:

```yaml theme={null}
priorityClass:
  create: true
  name: tracebloc-data-plane
  value: 1000000

podDisruptionBudget:
  mysql:       { create: true }
  jobsManager: { create: true }
```

### Deploy

Install the chart into a new namespace:

```bash theme={null}
helm upgrade --install <workspace> tracebloc/client \
  --namespace <workspace> \
  --create-namespace \
  --values values.yaml
```

### Upgrade

The auto-upgrade CronJob handles routine version bumps. To upgrade manually:

```bash theme={null}
helm repo update
helm upgrade <workspace> tracebloc/client \
  --namespace <workspace> \
  --reset-then-reuse-values \
  --values values.yaml
```

<Note>
  When upgrading **into** chart 1.3.0 from 1.2.x, use `--reset-then-reuse-values` (not plain `--reuse-values`) — the new `autoUpgrade` block did not exist in 1.2.x and a plain reuse fails template rendering.
</Note>
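The `--reset-then-reuse-values` flag was added in Helm 3.14, so older Helm clients reject it. A small version check before upgrading avoids a confusing failure. This is a sketch with hypothetical helper names; it assumes a `sort` that supports `-V` (GNU coreutils and recent BSD/macOS).

```bash
# Return 0 if version $1 is greater than or equal to version $2 (leading "v" optional).
version_ge() {
  a=${1#v}
  b=${2#v}
  [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)" = "$b" ]
}

# Print the local Helm version without build metadata such as "+g1a2b3c".
helm_semver() {
  v=$(helm version --short) || return 1
  printf '%s\n' "${v%%+*}"
}

# Example:
#   version_ge "$(helm_semver)" 3.14.0 || echo "upgrade Helm before using --reset-then-reuse-values"
```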

### Uninstall

```bash theme={null}
helm uninstall <workspace> -n <workspace>
```

PVCs and the PriorityClass are annotated `helm.sh/resource-policy: keep` so your data and shared cluster resources survive uninstall. To remove them too:

```bash theme={null}
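kubectl delete pvc --all -n <workspace>
kubectl delete namespace <workspace>
kubectl delete priorityclass tracebloc-data-plane   # default name; skip if other releases still use it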
```

### Migrating from legacy charts

If you installed before chart 1.3.x using `tracebloc/aks`, `tracebloc/eks`, or `tracebloc/bm`, see the [migration guide in the client repo](https://github.com/tracebloc/client/blob/main/client/MIGRATION.md). Key changes:

* 4 charts → 1 chart (`tracebloc/client`) with platform values overrides
* Auth keys flattened: `jobsManager.env.CLIENT_ID` + `secrets.clientPassword` → top-level `clientId` + `clientPassword`
* PVC keys flattened: `clientData` / `clientLogsPvc` / `mysqlPvc` (with `name`, `storage`, `hostPath`) → `pvc.{data,logs,mysql}` (size only) + `hostPath.enabled` for bare-metal
* ServiceAccount renamed from `default` to `{{ .Release.Name }}-jobs-manager`
* Pull-secret renamed from hard-coded `regcred` to `{{ .Release.Name }}-regcred`
* The `namespace` value in the legacy `values.yaml` is gone — use `helm install -n <ns>` instead

## Security

Tracebloc is designed so your data never has to leave your network. Here's how:

* **Data stays local.** Training data never leaves your infrastructure. Only metadata and metrics are shared with the platform.
* **Encrypted.** All communication between client and platform is TLS-encrypted.
* **Isolated.** Training runs in containers with restricted system access. Kubernetes namespaces separate workloads from each other.
* **Scanned.** Submitted models are analyzed for vulnerabilities before execution on your infrastructure.
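* **Minimal footprint.** The installer only modifies `~/.tracebloc/` (or your `HOST_DATA_DIR`) and Docker. The one exception is GPU setup on Linux, which installs drivers and the container toolkit system-wide when they are missing.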
