Run k9s -n <workspace> to get a live view of pods, logs, and events.
Stuck? Generate a support bundle. Re-run the installer with --diagnose. It writes a redacted archive, ~/.tracebloc/tracebloc-diagnose-<timestamp>.tgz (logs, pod status, and versions, with credentials removed), that you can send to support. The first line of output shows your client version.
Quick Checks
| Symptom | Check | Fix |
|---|---|---|
| Pods not starting | kubectl describe pod <pod> -n <workspace> | Check resource limits, Docker status |
| Client shows Offline | kubectl logs -n <workspace> -l app=manager | Verify client ID/password, check network |
| Docker not running | docker info | Start Docker Desktop or daemon |
| Cluster not found | k3d cluster list | Re-run the installer |
| GPU not detected | nvidia-smi | Install NVIDIA drivers, reboot, re-run installer |
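
The checks in the table can also be run in one pass. The following is a minimal sketch, assuming a bash-compatible shell with docker, k3d, and kubectl on your PATH. The function name quick_checks and the WORKSPACE variable are illustrative and not part of the installer.

```shell
# Hypothetical helper: runs the quick checks from the table in order.
# Usage: quick_checks <workspace>   (falls back to $WORKSPACE if set)
quick_checks() {
  local ns="${1:-$WORKSPACE}"

  echo "== Docker =="
  if docker info >/dev/null 2>&1; then
    echo "docker: running"
  else
    echo "docker: NOT running (start Docker Desktop or the daemon)"
  fi

  echo "== Cluster =="
  k3d cluster list

  echo "== Pods in ${ns} =="
  kubectl get pods -n "$ns"

  echo "== Client logs (last 20 lines) =="
  kubectl logs -n "$ns" -l app=manager --tail=20

  echo "== GPU =="
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi
  else
    echo "nvidia-smi not found (no NVIDIA driver installed?)"
  fi
}
```

Call it as quick_checks <workspace> and read the output top to bottom: the first section that reports a problem is usually the one to fix first.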
Error Messages
General
These errors typically indicate storage pressure on your Kubernetes nodes:
| Error Message | Description | Resolution |
|---|---|---|
| ErrImagePull / The node was low on resource / Ephemeral storage | Kubernetes nodes have limited ephemeral storage by default. CUDA libraries and container layers can consume 8 GB+ on their own. | Increase node disk size (e.g. --disk-size 50 in aws eks create-nodegroup) |
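
Before resizing anything, it can help to confirm that ephemeral storage is really the constraint. A minimal sketch using standard kubectl queries; the function names are illustrative, and the namespace argument is assumed to be your workspace.

```shell
# Allocatable ephemeral storage per node (what the scheduler can hand out).
node_ephemeral_storage() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,EPHEMERAL:.status.allocatable.ephemeral-storage'
}

# Eviction events in a namespace, typically caused by disk or memory pressure.
# Usage: evicted_pod_events <workspace>
evicted_pod_events() {
  kubectl get events -n "$1" --field-selector reason=Evicted
}
```

If the allocatable value is close to the 8 GB+ that CUDA images can consume, a larger node disk is the likely fix.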
Local
Issues specific to local (k3d) deployments:
| Error Message | Description | Resolution |
|---|---|---|
| ServiceBus connection error after Docker restart | When Docker exhausts local resources and restarts, the ServiceBus connection may fail with NoneType errors. | Monitor resource usage in the Docker Dashboard. To restore the connection, restart the jobs manager pod (e.g. in k9s, select the pod and delete it with Ctrl+D so that Kubernetes recreates it). |
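
If you prefer kubectl to k9s, the same restart can be sketched as below. This assumes the jobs manager pod is managed by a controller such as a Deployment, so deleting it triggers a replacement; the function name is illustrative, and the pod name is whatever kubectl get pods shows in your workspace.

```shell
# Hypothetical helper: delete a pod so that its controller recreates it.
# Usage: restart_pod <workspace> <pod-name>
restart_pod() {
  if [ "$#" -ne 2 ]; then
    echo "usage: restart_pod <workspace> <pod-name>" >&2
    return 1
  fi
  kubectl delete pod "$2" -n "$1"
  # List pods again to confirm that a replacement is starting.
  kubectl get pods -n "$1"
}
```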
Debugging Commands
When the quick checks don’t resolve the issue, use these commands to dig deeper.
Pod status and logs
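For pod status and logs, the standard kubectl queries are a reasonable starting point. A minimal sketch follows; the function names are illustrative, and the first argument is assumed to be your workspace namespace.

```shell
# Pod phase, restart counts, and node placement.
# Usage: pod_status <workspace>
pod_status() { kubectl get pods -n "$1" -o wide; }

# Recent events, oldest first, to spot scheduling or image errors.
# Usage: pod_events <workspace>
pod_events() { kubectl get events -n "$1" --sort-by=.lastTimestamp; }

# Last 100 log lines from every container of one pod.
# Usage: pod_logs <workspace> <pod-name>
pod_logs() { kubectl logs "$2" -n "$1" --all-containers --tail=100; }

# Logs from the previous container instance, useful after a crash loop.
# Usage: pod_previous_logs <workspace> <pod-name>
pod_previous_logs() { kubectl logs "$2" -n "$1" --previous --tail=100; }
```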
Resource usage
Check whether your nodes or pods are running out of CPU or memory.
Storage
Check that persistent volume claims are bound and have enough capacity.
Image pull credentials
If pods fail with ErrImagePull, verify that the Docker registry secret exists.
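The resource, storage, and registry-secret checks described above can be sketched with standard kubectl queries. The function names are illustrative, the first argument is assumed to be your workspace namespace, and kubectl top only works if metrics-server is running in the cluster.

```shell
# CPU and memory in use per node and per pod (needs metrics-server).
# Usage: resource_usage <workspace>
resource_usage() {
  kubectl top nodes
  kubectl top pods -n "$1"
}

# PVC status (should be Bound) and requested capacity.
# Usage: storage_status <workspace>
storage_status() { kubectl get pvc -n "$1"; }

# Registry secrets of the docker-config type in the namespace.
# Usage: registry_secrets <workspace>
registry_secrets() {
  kubectl get secrets -n "$1" --field-selector type=kubernetes.io/dockerconfigjson
}

# Names of the pull secrets that a given pod actually references.
# Usage: pod_pull_secrets <workspace> <pod-name>
pod_pull_secrets() {
  kubectl get pod "$2" -n "$1" -o jsonpath='{.spec.imagePullSecrets[*].name}'
}
```

An empty result from registry_secrets, or a pod that references a secret name missing from that list, points to missing or mismatched pull credentials.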
CPU and Memory Optimization
Hitting resource limits during training? Two levers:
- Reduce data size: smaller batches, lower resolution, shorter sequences
- Smaller models: fewer layers, smaller hidden dimensions
Memory Consumption (RAM / GPU VRAM)
Understanding what drives memory usage helps you right-size your resource limits:
- Batch size: activation memory scales roughly linearly with batch size
- Model size: more parameters mean more memory for weights, activations, and gradients
- Precision: FP16/BF16 values take half the memory of FP32, so mixed-precision training helps significantly
- Optimizer: Adam stores two running averages per parameter, so it needs roughly 2-3x the memory of plain SGD
- Input dimensions: transformer attention memory grows quadratically with sequence length; CNN activation memory grows quadratically with image side length
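
To make these factors concrete, the sketch below estimates a rough lower bound for training memory from parameter count, precision, and optimizer. It is a back-of-the-envelope calculation under stated assumptions, not a measurement: it counts only weights, gradients, and optimizer state, ignores activations, and assumes all three are stored at the same precision. The function name and the 100M-parameter example are illustrative.

```shell
# Rough lower bound for training memory, in MiB.
# Counts weights + gradients + optimizer state only; activations are excluded.
# Simplifying assumption: all three are stored at the same precision.
# Usage: estimate_train_mib <params> <bytes_per_value> <optimizer_states_per_param>
#   bytes_per_value: 4 for FP32, 2 for FP16/BF16
#   optimizer_states_per_param: 0 for plain SGD, 1 for SGD with momentum, 2 for Adam
estimate_train_mib() {
  local params="$1" bytes="$2" states="$3"
  echo $(( params * bytes * (2 + states) / 1048576 ))
}

estimate_train_mib 100000000 4 2   # FP32 + Adam
estimate_train_mib 100000000 4 0   # FP32 + plain SGD
estimate_train_mib 100000000 2 2   # FP16 + Adam (same-precision assumption)
```

For a 100M-parameter model this gives about 1525 MiB for FP32 with Adam and about 762 MiB for either FP32 with plain SGD or FP16 with Adam. Under these assumptions Adam comes out at 2x plain SGD, the low end of the range quoted above. Real mixed-precision setups often keep FP32 copies of the weights, so the actual saving depends on the framework.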
Compute Consumption (CPU / GPU)
If training is slow, these are the factors to look at:
- Batch size: larger batches increase GPU utilization up to memory saturation
- Model complexity: transformer attention is O(seq_len² × hidden_dim); CNNs scale with kernel × feature map × filters
- Precision: FP16/BF16 can speed up training 2-3x on modern GPUs
- Data pipeline: slow CPU preprocessing (augmentation, tokenization) can bottleneck training
- Parallelization: data parallelism splits batches across GPUs; model parallelism splits the model itself
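
One way to tell which factor applies is to sample GPU utilization while training runs. The sketch below uses nvidia-smi query options to do that. The gpu_hint function and its 50% threshold are illustrative assumptions, a rule of thumb rather than a tracebloc setting.

```shell
# One line per GPU: utilization %, memory used (MiB), memory total (MiB).
gpu_snapshot() {
  nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
    --format=csv,noheader,nounits
}

# Hypothetical rule of thumb; the default 50% threshold is an assumption.
# Usage: gpu_hint <utilization_percent> [threshold_percent]
gpu_hint() {
  local util="$1" threshold="${2:-50}"
  if [ "$util" -lt "$threshold" ]; then
    echo "GPU at ${util}%: likely waiting on the data pipeline, or the batch size is too small"
  else
    echo "GPU at ${util}%: compute-bound, consider lower precision or a smaller model"
  fi
}

# Example: print a hint for each GPU.
#   gpu_snapshot | while IFS=', ' read -r util used total; do gpu_hint "$util"; done
```

Sustained low utilization with free GPU memory usually points at the data pipeline or batch size; sustained high utilization points at model complexity or precision.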