# Troubleshooting
Most issues fall into a few categories: pods not starting, the client not connecting, or resource limits being hit. Start with the quick checks below; they cover the most common problems. For real-time cluster monitoring, try k9s: run `k9s -n <workspace>` to get a live view of pods, logs, and events.
## Quick Checks
| Symptom | Check | Fix |
|---|---|---|
| Pods not starting | `kubectl describe pod <pod> -n <workspace>` | Check resource limits, Docker status |
| Client shows Offline | `kubectl logs -n <workspace> -l app=tracebloc-jobs-manager` | Verify client ID/password, check network |
| Docker not running | `docker info` | Start Docker Desktop or daemon |
| Cluster not found | `k3d cluster list` | Re-run the installer |
| GPU not detected | `nvidia-smi` | Install NVIDIA drivers, reboot, re-run installer |
## Error Messages

### General
These errors typically indicate storage pressure on your Kubernetes nodes:

| Error Message | Description | Resolution |
|---|---|---|
| `ErrImagePull` / "The node was low on resource: ephemeral-storage" | Kubernetes nodes have limited ephemeral storage by default. CUDA libraries and container layers can consume 8 GB+ on their own. | Increase node disk size (e.g. `--disk-size 50` in `aws eks create-nodegroup`) |
### Local

Issues specific to local (k3d) deployments:

| Error Message | Description | Resolution |
|---|---|---|
| ServiceBus connection error after Docker restart | When Docker overutilizes local resources and restarts, the ServiceBus connection may fail with `NoneType` errors. | Monitor resources via the Docker Dashboard. Restart the jobs manager pod (e.g. in k9s, exit with Ctrl+D) to restore the connection. |
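If you prefer plain kubectl over k9s, a sketch of the jobs manager restart (assuming the `app=tracebloc-jobs-manager` label used elsewhere in this guide, and requiring access to a live cluster):

```shell
# Deleting the pod forces its controller to recreate it,
# which re-establishes the ServiceBus connection.
kubectl delete pod -n <workspace> -l app=tracebloc-jobs-manager

# Watch the replacement pod come back up.
kubectl get pods -n <workspace> -l app=tracebloc-jobs-manager -w
```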
## Debugging Commands
When the quick checks don’t resolve the issue, use these commands to dig deeper.

### Pod status and logs
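For example (substitute your own pod name and workspace namespace; these require access to the running cluster):

```shell
# List pods and their current state (Pending, CrashLoopBackOff, ...).
kubectl get pods -n <workspace>

# Inspect events and scheduling details for a specific pod.
kubectl describe pod <pod> -n <workspace>

# Tail a pod's logs; add --previous if the container has restarted.
kubectl logs <pod> -n <workspace> --tail=100
```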
### Resource usage
See if your nodes or pods are running out of CPU or memory:

### Storage

Check that persistent volume claims are bound and have enough capacity:

### Image pull credentials

If pods fail with `ErrImagePull`, verify that the Docker registry secret exists:
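Concrete commands for the resource, storage, and credential checks above might look like this (assuming the `<workspace>` namespace used throughout this guide; `kubectl top` additionally requires the metrics-server addon):

```shell
# Resource usage: spot nodes or pods near their CPU/memory limits.
kubectl top nodes
kubectl top pods -n <workspace>

# Storage: confirm PVCs are Bound and sized appropriately.
kubectl get pvc -n <workspace>

# Image pull credentials: the registry secret must exist in the namespace.
kubectl get secrets -n <workspace>
```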
## CPU and Memory Optimization
Hitting resource limits during training? Two levers:

- Reduce data size — smaller batches, lower resolution, shorter sequences
- Smaller models — fewer layers, smaller hidden dimensions
### Memory Consumption (RAM / GPU VRAM)
Understanding what drives memory usage helps you right-size your resource limits:

- Batch size — memory scales roughly linearly with batch size
- Model size — more parameters = more memory for weights, activations, and gradients
- Precision — FP16/BF16 uses half the memory of FP32. Mixed-precision training helps significantly
- Optimizer — Adam requires ~2-3x the memory of SGD (stores running averages)
- Input dimensions — transformer attention grows quadratically with sequence length; CNN memory grows quadratically with image resolution
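To make the weights/gradients/optimizer arithmetic concrete, here is a back-of-envelope sketch. It uses illustrative rules of thumb only: it ignores activations, framework overhead, and the FP32 master weights that mixed-precision training typically keeps, so treat the numbers as relative, not exact.

```python
def training_memory_gb(params, bytes_per_value=4, optimizer="adam"):
    """Rough memory for weights + gradients + optimizer state, in GB."""
    weights = params * bytes_per_value
    grads = params * bytes_per_value
    # Adam keeps two running averages per parameter;
    # plain SGD (no momentum) keeps none.
    opt_state = params * bytes_per_value * (2 if optimizer == "adam" else 0)
    return (weights + grads + opt_state) / 1e9

# A 1B-parameter model:
fp32_adam = training_memory_gb(1e9, bytes_per_value=4, optimizer="adam")  # 16.0 GB
fp16_adam = training_memory_gb(1e9, bytes_per_value=2, optimizer="adam")  # 8.0 GB
fp32_sgd = training_memory_gb(1e9, bytes_per_value=4, optimizer="sgd")    # 8.0 GB
```

This shows both levers from the list above: halving precision halves the footprint, and swapping Adam for SGD removes the running-average state.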
### Compute Consumption (CPU / GPU)
If training is slow, these are the factors to look at:

- Batch size — larger batches increase GPU utilization up to memory saturation
- Model complexity — transformer attention is O(seq_len² × hidden_dim); CNNs scale with kernel size × feature map size × number of filters
- Precision — FP16/BF16 can speed up training 2-3x on modern GPUs
- Data pipeline — slow CPU preprocessing (augmentation, tokenization) can bottleneck training
- Parallelization — data parallelism splits batches across GPUs; model parallelism splits the model itself
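As a simplified illustration of the quadratic attention term (constants, head counts, and the feed-forward layers are omitted, so only the relative scaling is meaningful):

```python
def attention_flops(seq_len, hidden_dim):
    """Illustrative cost of self-attention: the QK^T score matrix and the
    attention-weighted sum each touch ~seq_len^2 * hidden_dim values."""
    return 2 * seq_len**2 * hidden_dim

base = attention_flops(512, 768)
doubled = attention_flops(1024, 768)
print(doubled / base)  # doubling seq_len quadruples the attention cost
```

This is why trimming sequence length is often a cheaper speedup than shrinking the hidden dimension, which the cost only scales with linearly.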