Troubleshooting
The following sections cover the most common problems encountered when setting up the clusters locally or in the cloud.
Recommended for debugging: Use k9s, a terminal-based Kubernetes dashboard, to monitor jobs, pods, and logs in real time. Run `k9s -n <NAMESPACE>` to get a live view of resources, switch between them instantly, and inspect logs or events with a few keystrokes. Compared to kubectl, it is faster and more convenient.
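A few handy keystrokes to get started (assuming a recent k9s; bindings can vary slightly between versions):

```bash
# Start k9s scoped to your namespace.
k9s -n <NAMESPACE>
# Inside k9s: type ':pods' to list pods, press 'l' on a pod to tail its logs,
# 'd' to describe it, and Ctrl-D to delete it (forcing a restart).
```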
Understanding Error Messages
The following sections discuss some common error messages that developers may encounter.
General Error Messages
| Error Message | Description | Resolution |
|---|---|---|
| ErrImagePull / "The node was low on resource: ephemeral-storage" | Kubernetes nodes have limited ephemeral storage by default. CUDA libraries and container layers alone can consume 8 GB+, so nodes run out of space, leading to pod eviction or image pull failures. | Increase the node disk size, e.g. `--disk-size 50` in `aws eks create-nodegroup` (see the sketch below). |
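A minimal sketch of creating a node group with a larger disk (cluster, node group, role, and subnet values are placeholders for your setup):

```bash
# Provision 50 GB of ephemeral storage per node instead of the 20 GB default.
aws eks create-nodegroup \
  --cluster-name <CLUSTER> \
  --nodegroup-name <NODEGROUP> \
  --disk-size 50 \
  --node-role <NODE_ROLE_ARN> \
  --subnets <SUBNET_IDS>
```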
Cloud Specific
| Error Message | Description | Resolution |
|---|---|---|
Local Specific
| Error Message | Description | Resolution |
|---|---|---|
| python-container: Unexpected error occurred (TypeError("'NoneType' object is not subscriptable")) / ServiceBus connection error after Docker restart | When Docker overutilizes local resources and restarts itself, the ServiceBus connection may fail, causing NoneType errors in the Python container: the Azure ServiceBus handler cannot re-establish the connection on its own. | Monitor resources via the Docker Dashboard and the logs of the jobs-manager and pod-monitor pods. If the error occurs, increase the resources in the Docker Dashboard and restart the jobs manager (e.g. delete its pod in k9s with Ctrl-D) to restore the connection. |
| GPU setup | Only one experiment can run per GPU at a time. | Ask Shujaat. |
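If you prefer kubectl over k9s for the ServiceBus case above, a sketch for bouncing the jobs manager (this assumes it runs as a Deployment named jobs-manager; adjust the name to your setup):

```bash
# Recreate the jobs-manager pod so it re-establishes the ServiceBus connection.
kubectl rollout restart deployment jobs-manager -n <NAMESPACE>
```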
Helm Operations
To apply changes to an already deployed release, prefer `helm upgrade` over uninstalling and reinstalling: `helm upgrade` rolls out new chart versions or values in place, while `helm uninstall` removes the release and its history, so it is only needed when you want to start from scratch.
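A minimal sketch (release name, chart path, and values file are placeholders):

```bash
# Apply updated chart/values in place; --install creates the release if it does not exist yet.
helm upgrade --install <RELEASE> ./chart -n <NAMESPACE> -f values.yaml

# Only tear down completely when you really want to start fresh.
helm uninstall <RELEASE> -n <NAMESPACE>
```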
Environment
Pods Not Starting
List the pods, then inspect the logs and events of the failing pod:
kubectl get pods -n <NAMESPACE>
kubectl logs <POD_NAME> -n <NAMESPACE>
kubectl describe pod <POD_NAME> -n <NAMESPACE>
Insufficient Resources
Compare current node and pod consumption against capacity (requires the metrics server):
kubectl top nodes
kubectl top pods -n <NAMESPACE>
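To see how much of a node's capacity is already claimed by scheduled pods (node name is a placeholder):

```bash
# The "Allocated resources" section shows requests/limits already scheduled on the node.
kubectl describe node <NODE_NAME>
```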
Storage Issues
Check that persistent volume claims are bound and persistent volumes are available:
kubectl get pvc -n <NAMESPACE>
kubectl get pv
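If a claim is stuck in Pending, its events usually name the cause (PVC name is a placeholder):

```bash
# The Events section explains why a PVC is not binding (e.g. missing StorageClass).
kubectl describe pvc <PVC_NAME> -n <NAMESPACE>
```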
Image Pull Errors
Verify that the registry pull secret exists in the namespace:
kubectl get secret regcred -n <NAMESPACE>
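If the secret is missing, a sketch for recreating it (server and credentials are placeholders for your registry):

```bash
# Create a docker-registry pull secret that pods reference via imagePullSecrets.
kubectl create secret docker-registry regcred \
  --docker-server=<REGISTRY_SERVER> \
  --docker-username=<USERNAME> \
  --docker-password=<PASSWORD> \
  -n <NAMESPACE>
```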
Client
CPU or Memory Limits
Overview
When you encounter CPU or memory limits during deployment or training, there are several strategies you can employ to reduce resource usage, mainly:
- Reduce data size
- Use smaller model architectures
Memory Consumption (RAM / GPU VRAM)
Batch Size
- Larger batch = more data stored in memory at once (inputs, intermediate activations, gradients)
- Memory scales roughly linearly with batch size
Model Size (Number of Parameters)
- Bigger models (more layers, more hidden units, higher dimensional embeddings) require more memory to store parameters and gradients
- Memory also scales with activation size (depends on architecture, sequence length for transformers, image resolution for CNNs, etc.)
Precision (FP32 vs. FP16 / BF16 / Quantization)
- FP32 takes twice the memory of FP16 (4 bytes vs. 2 bytes per value)
- Mixed-precision training reduces memory footprint significantly
Optimizer Choice
- Adam requires ~2–3x the memory of SGD because it stores running averages of gradients and squared gradients
Sequence Length / Input Dimensionality
- For transformers: memory grows quadratically with sequence length, because attention materializes an L × L score matrix (see the sketch after this list)
- For images: memory grows quadratically with image side length, since doubling the height and width quadruples the pixel count
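A back-of-the-envelope sketch of where the quadratic term comes from, assuming standard (non-fused) attention with batch size $B$, $h$ heads, and sequence length $L$: each layer materializes an $L \times L$ score matrix per head, so

$$
\text{score memory} \;\propto\; B \cdot h \cdot L^2 ,
$$

and doubling the sequence length quadruples the memory needed for the attention scores alone.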
CPU / GPU Compute Consumption
Batch Size
- Larger batch size → more parallel work per step → higher GPU utilization, but each step is slower
- Often, throughput improves up to the point where memory is saturated
Model Complexity (Depth, Width, Architecture)
- Transformer attention complexity: O(sequence_length² × hidden_dim)
- CNNs scale with kernel size × feature map size × number of filters
Training Precision
- FP32 requires more compute per operation than FP16/BF16
- Mixed precision can speed up training by 2–3x on modern GPUs
Data Pipeline
- If CPU preprocessing (e.g., image augmentation, text tokenization) is too slow, it can bottleneck training
- When this happens, CPU and RAM usage spike while the GPU sits partly idle (a quick check is sketched below)
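A quick way to confirm an input-pipeline bottleneck (assuming NVIDIA GPUs with nvidia-smi available): low GPU utilization combined with pegged CPU usage suggests the pipeline is starving the GPU.

```bash
# Refresh GPU stats every second; persistently low %util during training points to a data bottleneck.
nvidia-smi -l 1
```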
Parallelization Strategy
- Data parallelism: more GPUs split the batch, but each GPU still needs memory for the full model
- Model parallelism / sharding: splits model across devices, reducing per-GPU load but increasing communication overhead
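For the data-parallel case, a minimal sketch using PyTorch's standard launcher (train.py is a hypothetical entry point; the script is assumed to initialize torch.distributed itself):

```bash
# Launch one process per GPU on a single node; each process handles a shard of every batch.
torchrun --nproc_per_node=4 train.py
```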