Troubleshooting
The following sections help with the most common problems when setting up the clusters locally or in the cloud.
Recommended for debugging: use k9s, a terminal-based Kubernetes dashboard, to monitor jobs, pods, and logs in real time. Run `k9s -n <NAMESPACE>`
to get a live view of resources, switch between them instantly, and inspect logs or events with a few keystrokes. It is faster and more convenient than running the equivalent kubectl commands by hand.
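A minimal usage sketch, assuming k9s is already installed; the namespace is a placeholder and the key bindings listed are the k9s defaults:

```bash
# Open k9s scoped to the project namespace (placeholder name).
k9s -n <NAMESPACE>

# Useful default key bindings inside k9s:
#   :pods   switch to the pod view
#   l       stream logs of the selected pod
#   d       describe the selected pod (events, resource limits)
#   0       show resources across all namespaces
```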
Understanding Error Messages
The following sections discuss some common error messages that developers can encounter.
General Error Messages
Error Message | Description | Resolution |
---|---|---|
`ErrImagePull` / "The node was low on resource: ephemeral-storage" | Kubernetes nodes have limited ephemeral storage by default. CUDA libraries and container layers alone can consume 8 GB+, causing nodes to run out of space and leading to pod eviction or image pull failures. | Increase the node disk size, e.g. `--disk-size 50` in `aws eks create-nodegroup` (see the example after this table). |
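A minimal sketch of creating a node group with a larger disk; the cluster name, node group name, IAM role ARN, subnets, and instance type below are placeholders, not values from this setup:

```bash
# Recreate the GPU node group with 50 GB of node disk / ephemeral storage.
# Every <...> value and the instance type are placeholders.
aws eks create-nodegroup \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <NODEGROUP_NAME> \
  --disk-size 50 \
  --instance-types g4dn.xlarge \
  --node-role <NODE_ROLE_ARN> \
  --subnets <SUBNET_ID_1> <SUBNET_ID_2> \
  --scaling-config minSize=1,maxSize=2,desiredSize=1
```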
Cloud Specific
Error Message | Description | Resolution |
---|---|---|
Local Specific
Error Message | Description | Resolution |
---|---|---|
`python-container Unexpected error occurred (TypeError("'NoneType' object is not subscriptable"))` / ServiceBus connection error after Docker restart | When Docker overutilizes local resources and restarts itself, the ServiceBus connection can fail, leading to NoneType errors in the Python container because the Azure ServiceBus handler cannot re-establish the connection properly. | Monitor resources via the Docker Dashboard and the logs of the jobs-manager and pod-monitor pods. If the error occurs, increase the resources in the Docker Dashboard and restart the jobs manager (e.g. delete its pod in k9s with Ctrl+D so it is recreated; see the sketch after this table) to restore the connection. |
GPU Setup | Only one experiment can run per GPU. | Ask Shujaat. |
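A minimal sketch of restarting the jobs manager from the command line instead of k9s, assuming it runs as a Deployment named jobs-manager (name and namespace are assumptions to adjust):

```bash
# Restart the jobs manager so it re-establishes the ServiceBus connection.
# The Deployment name "jobs-manager" is an assumption about this setup.
kubectl rollout restart deployment/jobs-manager -n <NAMESPACE>

# Alternatively, delete the pod directly; the Deployment recreates it.
kubectl delete pod <JOBS_MANAGER_POD_NAME> -n <NAMESPACE>
```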
Helm Operations
TODO: decide and document when to apply changes via helm upgrade and when to do a full helm uninstall followed by a reinstall. Both options are sketched below.
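A minimal sketch of the two options, with the release name, chart path, and values file as placeholders rather than this repository's actual values:

```bash
# Option 1: upgrade the existing release in place, keeping its history.
# <RELEASE>, <CHART_PATH>, and values.yaml are placeholders.
helm upgrade <RELEASE> <CHART_PATH> -n <NAMESPACE> -f values.yaml

# Option 2: remove the release and its Kubernetes resources, then reinstall.
helm uninstall <RELEASE> -n <NAMESPACE>
helm install <RELEASE> <CHART_PATH> -n <NAMESPACE> -f values.yaml
```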
Environment
Pods Not Starting
- List the pods and their current status: `kubectl get pods -n <NAMESPACE>`
- Show the logs of the failing pod: `kubectl logs <POD_NAME> -n <NAMESPACE>`
- Show events, resource requests, and scheduling details: `kubectl describe pod <POD_NAME> -n <NAMESPACE>`
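If describe gives little information, the namespace events often show why scheduling or image pulls fail; a minimal sketch:

```bash
# Recent events in the namespace, oldest first.
kubectl get events -n <NAMESPACE> --sort-by=.lastTimestamp
```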
Insufficient Resources
- Check node-level CPU and memory usage: `kubectl top nodes`
- Check per-pod usage in the namespace: `kubectl top pods -n <NAMESPACE>`
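If usage looks low but pods still stay Pending, compare what is already requested on each node against what is allocatable; a minimal sketch:

```bash
# Requested vs. allocatable CPU and memory per node.
kubectl describe nodes | grep -A 8 "Allocated resources"
```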
Storage Issues
- Check the persistent volume claims in the namespace: `kubectl get pvc -n <NAMESPACE>`
- Check the cluster-wide persistent volumes: `kubectl get pv`
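If a claim stays in Pending, its events usually name the cause (e.g. missing storage class or no matching volume); a minimal sketch, with the claim name as a placeholder:

```bash
# Inspect the events of a stuck PersistentVolumeClaim.
kubectl describe pvc <PVC_NAME> -n <NAMESPACE>
```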
Image Pull Errors
- Verify that the registry credentials secret exists: `kubectl get secret regcred -n <NAMESPACE>`
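If the secret is missing, it can be recreated; a minimal sketch, assuming a private registry whose server, username, and password are placeholders:

```bash
# Recreate the registry credentials secret used for pulling private images.
# Server, username, and password are placeholders for your registry.
kubectl create secret docker-registry regcred \
  --docker-server=<REGISTRY_SERVER> \
  --docker-username=<USERNAME> \
  --docker-password=<PASSWORD> \
  -n <NAMESPACE>
```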
Client
CPU or Memory Limits
Overview
When you encounter CPU or memory limits during deployment or training, there are a few main strategies for reducing resource usage:
- Reduce data size
- Use smaller model architectures
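Before changing anything, it helps to confirm which resource is actually the bottleneck. A minimal sketch, assuming the training pod's image ships nvidia-smi; pod and namespace names are placeholders:

```bash
# CPU and RAM usage of the training pod.
kubectl top pod <POD_NAME> -n <NAMESPACE>

# GPU memory and utilization inside the pod (requires nvidia-smi in the image).
kubectl exec -n <NAMESPACE> <POD_NAME> -- nvidia-smi
```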
Memory Consumption (RAM / GPU VRAM)
Batch Size
- Larger batch = more data stored in memory at once (inputs, intermediate activations, gradients)
- Memory scales roughly linearly with batch size
Model Size (Number of Parameters)
- Bigger models (more layers, more hidden units, higher dimensional embeddings) require more memory to store parameters and gradients
- Memory also scales with activation size (depends on architecture, sequence length for transformers, image resolution for CNNs, etc.)
Precision (FP32 vs. FP16 / BF16 / Quantization)
- FP32 takes twice the memory of FP16
- Mixed-precision training reduces memory footprint significantly
Optimizer Choice
- Adam requires ~2–3x the memory of SGD because it stores running averages of gradients and squared gradients
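A rough worked estimate that combines the precision and optimizer points above, ignoring activations; the 1-billion-parameter model is a hypothetical example:

$$
\begin{aligned}
\text{weights (FP32, 4 bytes/param)} &\approx 4\,\text{GB} \\
\text{gradients (FP32)} &\approx 4\,\text{GB} \\
\text{Adam state (two FP32 buffers)} &\approx 8\,\text{GB} \\
\text{total before activations} &\approx 16\,\text{GB}
\end{aligned}
$$

With FP16/BF16 weights and gradients the first two terms roughly halve, while the optimizer state typically stays in FP32.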
Sequence Length / Input Dimensionality
- For transformers: memory grows quadratically with sequence length because of attention
- For images: memory grows quadratically with resolution, since activations scale with the number of pixels
CPU / GPU Compute Consumption
Batch Size
- Larger batch size → more parallel work per step → higher GPU utilization, but each step is slower
- Often, throughput improves up to the point where memory is saturated
Model Complexity (Depth, Width, Architecture)
- Transformer attention complexity: O(sequence_length² × hidden_dim)
- CNNs scale with kernel size × feature map size × number of filters
Training Precision
- FP32 requires more compute per operation than FP16/BF16
- Mixed precision can speed up training by 2–3x on modern GPUs
Data Pipeline
- If CPU preprocessing (e.g., image augmentation, text tokenization) is too slow, it can bottleneck training
- When that happens, CPU and RAM usage spike while the GPU sits partly idle waiting for data
Parallelization Strategy
- Data parallelism: more GPUs split the batch, but each GPU still needs memory for the full model
- Model parallelism / sharding: splits model across devices, reducing per-GPU load but increasing communication overhead
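As a concrete example of the data-parallel option, a minimal sketch of launching one process per GPU with PyTorch's torchrun launcher; the script name and GPU count are assumptions:

```bash
# Launch 4 data-parallel worker processes, one per GPU, on a single node.
# train.py is a placeholder for the actual training entry point.
torchrun --nproc_per_node=4 train.py
```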