Troubleshooting

Most issues fall into a few categories: pods not starting, client not connecting, or resource limits being hit. Start with the quick checks below — they cover the most common problems. For real-time cluster monitoring, try k9s — run `k9s -n <workspace>` to get a live view of pods, logs, and events.

Quick Checks

| Symptom | Check | Fix |
| --- | --- | --- |
| Pods not starting | `kubectl describe pod <pod> -n <workspace>` | Check resource limits, Docker status |
| Client shows Offline | `kubectl logs -n <workspace> -l app=tracebloc-jobs-manager` | Verify client ID/password, check network |
| Docker not running | `docker info` | Start Docker Desktop or the daemon |
| Cluster not found | `k3d cluster list` | Re-run the installer |
| GPU not detected | `nvidia-smi` | Install NVIDIA drivers, reboot, re-run installer |

Error Messages

General

These errors typically indicate storage pressure on your Kubernetes nodes:
| Error Message | Description | Resolution |
| --- | --- | --- |
| `ErrImagePull` / `The node was low on resource` / `Ephemeral storage` | Kubernetes nodes have limited ephemeral storage by default; CUDA libraries and container layers can consume 8 GB+ on their own. | Increase node disk size (e.g. `--disk-size 50` in `aws eks create-nodegroup`) |

Local

Issues specific to local (k3d) deployments:
| Error Message | Description | Resolution |
| --- | --- | --- |
| ServiceBus connection error after Docker restart | When Docker exhausts local resources and restarts, the ServiceBus connection may fail with `NoneType` errors. | Monitor resources via the Docker Dashboard. Restart the jobs manager pod (e.g. select it in k9s and press Ctrl+D to delete it; Kubernetes recreates it) to restore the connection. |

Debugging Commands

When the quick checks don’t resolve the issue, use these commands to dig deeper.

Pod status and logs

```shell
kubectl get pods -n <workspace>
kubectl logs <pod-name> -n <workspace>
kubectl describe pod <pod-name> -n <workspace>
```

Resource usage

See if your nodes or pods are running out of CPU or memory:
```shell
kubectl top nodes
kubectl top pods -n <workspace>
```

Storage

Check that persistent volume claims are bound and have enough capacity:
```shell
kubectl get pvc -n <workspace>
kubectl get pv
```

Image pull credentials

If pods fail with ErrImagePull, verify that the Docker registry secret exists:
```shell
kubectl get secret regcred -n <workspace>
```

CPU and Memory Optimization

Hitting resource limits during training? Two levers:
  • Reduce data size — smaller batches, lower resolution, shorter sequences
  • Reduce model size — fewer layers, smaller hidden dimensions
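To get a feel for how much the first lever buys you, here is a rough sketch (not a profiler — the function name and the batch/image dimensions are illustrative assumptions) of how input-batch memory shrinks when you cut batch size and resolution:

```python
# Rough illustration: memory held by one input batch of images,
# assuming FP32 (4 bytes per element). All names and sizes here are
# hypothetical examples, not values from any real workload.

def batch_bytes(batch_size, channels, height, width, bytes_per_elem=4):
    """Bytes needed to hold one input batch in memory."""
    return batch_size * channels * height * width * bytes_per_elem

full = batch_bytes(64, 3, 224, 224)      # baseline batch
reduced = batch_bytes(32, 3, 112, 112)   # half the batch, half the resolution

print(f"baseline: {full / 2**20:.1f} MiB")
print(f"reduced:  {reduced / 2**20:.1f} MiB ({full // reduced}x smaller)")
```

Halving the batch halves memory linearly, while halving the resolution cuts it by four (both height and width shrink) — together an 8x reduction for the input tensors alone.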

Memory Consumption (RAM / GPU VRAM)

Understanding what drives memory usage helps you right-size your resource limits:
  • Batch size — memory scales roughly linearly with batch size
  • Model size — more parameters = more memory for weights, activations, and gradients
  • Precision — FP16/BF16 uses half the memory of FP32. Mixed-precision training helps significantly
  • Optimizer — Adam requires ~2-3x the memory of SGD (stores running averages)
  • Input dimensions — transformer attention grows quadratically with sequence length; CNN memory grows quadratically with image resolution
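The factors above can be combined into a back-of-envelope VRAM estimate. This is a sketch under a common rule of thumb (weights + gradients + optimizer state; Adam keeps two extra per-parameter copies) — the function, its defaults, and the 350M-parameter figure are illustrative assumptions, and real frameworks differ (e.g. mixed-precision training often keeps optimizer state in FP32):

```python
# Back-of-envelope VRAM estimate for training. Illustrative only:
# activations are excluded, and the constants are rule-of-thumb values.

def training_memory_gb(n_params, bytes_per_param=4, optimizer_states=2):
    """Approximate GiB for weights, gradients, and optimizer state.

    bytes_per_param: 4 for FP32, 2 for FP16/BF16.
    optimizer_states: 2 for Adam (running mean + variance), 0 for plain SGD.
    Activations are excluded; they scale with batch size and input dims.
    """
    copies = 1 + 1 + optimizer_states  # weights + gradients + optimizer state
    return n_params * bytes_per_param * copies / 2**30

params = 350e6  # e.g. a ~350M-parameter model
print(f"FP32 + Adam: {training_memory_gb(params):.1f} GiB")
print(f"FP32 + SGD:  {training_memory_gb(params, optimizer_states=0):.1f} GiB")
print(f"FP16 + Adam: {training_memory_gb(params, bytes_per_param=2):.1f} GiB")
```

Even this crude estimate shows why the precision and optimizer bullets matter: switching FP32 Adam to FP16, or Adam to SGD, roughly halves the parameter-related footprint before you touch batch size at all.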

Compute Consumption (CPU / GPU)

If training is slow, these are the factors to look at:
  • Batch size — larger batches increase GPU utilization up to memory saturation
  • Model complexity — transformer attention is O(seq_len² × hidden_dim); CNN cost scales with kernel size × feature-map size × number of filters
  • Precision — FP16/BF16 can speed up training 2-3x on modern GPUs
  • Data pipeline — slow CPU preprocessing (augmentation, tokenization) can bottleneck training
  • Parallelization — data parallelism splits batches across GPUs; model parallelism splits the model itself