
Troubleshooting

The following sections cover the most common problems that come up when setting up the clusters locally or in the cloud.

Recommended for debugging: use k9s, a terminal-based Kubernetes dashboard, to monitor jobs, pods, and logs in real time. Run k9s -n <NAMESPACE> to get a live view of resources, switch between them instantly, and inspect logs or events with a few keystrokes. For day-to-day debugging it is faster and more convenient than issuing individual kubectl commands.

Understanding Error Messages

The following sections discuss some common error messages that developers can encounter.

General Error Messages

Error Message: ErrImagePull / "The node was low on resource: ephemeral-storage"
Description: Kubernetes nodes have limited ephemeral storage by default. CUDA libraries and container image layers can consume 8 GB+ on their own, causing nodes to run out of space, which leads to pod eviction or image pull failures.
Resolution: Increase the node disk size (e.g. --disk-size 50, in GiB, when running aws eks create-nodegroup).

Cloud Specific

No cloud-specific error messages are documented here yet.

Local Specific

Error Message: python-container: Unexpected error occurred (TypeError("'NoneType' object is not subscriptable")) / ServiceBus connection error after Docker restart
Description: When Docker overutilizes local resources and restarts itself, the ServiceBus connection may fail, leading to NoneType errors in the Python container. This happens because the Azure ServiceBus handler cannot re-establish the connection properly.
Resolution: Monitor resources using the Docker Dashboard and the logs of the jobs-manager and pod-monitor pods. If the error occurs, increase the resources in the Docker Dashboard and restart the jobs manager (e.g. in k9s, delete the pod with Ctrl+D so it gets recreated) to restore the connection.

Error Message: GPU Setup
Description: Only one experiment can run per GPU.
Resolution: Ask Shujaat.

Helm Operations

To apply changes to an already deployed release, prefer helm upgrade <RELEASE> <CHART> -n <NAMESPACE> (optionally with --install): it updates the release in place and keeps its revision history. helm uninstall followed by a fresh helm install removes all of the release's resources first and is mainly useful when the release is in a broken state and you want a clean start.

Environment

Pods Not Starting

kubectl get pods -n <NAMESPACE>                  # check pod status (Pending, CrashLoopBackOff, ...)
kubectl logs <POD_NAME> -n <NAMESPACE>           # application logs of the failing pod
kubectl describe pod <POD_NAME> -n <NAMESPACE>   # events, scheduling and image pull errors

Insufficient Resources

kubectl top nodes                 # per-node CPU/memory usage (requires the metrics-server add-on)
kubectl top pods -n <NAMESPACE>   # per-pod CPU/memory usage

Storage Issues

kubectl get pvc -n <NAMESPACE>   # PersistentVolumeClaims should be in STATUS "Bound"
kubectl get pv                   # PersistentVolumes backing those claims

Image Pull Errors

kubectl get secret regcred -n <NAMESPACE>   # the image pull secret must exist in the namespace

Client

CPU or Memory Limits

Overview

When you encounter CPU or memory limits during deployment or training, the main strategies to reduce resource usage are:

  • Reduce data size
  • Use smaller model architectures

Memory Consumption (RAM / GPU VRAM)

Batch Size

  • Larger batch = more data stored in memory at once (inputs, intermediate activations, gradients)
  • Memory scales roughly linearly with batch size
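
As a back-of-the-envelope illustration of that linear scaling, here is a small sketch in Python; the layer sizes and batch sizes are made-up example values, not taken from our setup:

BYTES_PER_FP32 = 4

def activation_bytes(batch_size, seq_len=512, hidden_dim=1024, n_layers=12):
    # one (batch, seq_len, hidden_dim) activation tensor kept per layer for the backward pass
    return batch_size * seq_len * hidden_dim * n_layers * BYTES_PER_FP32

for bs in (8, 16, 32, 64):
    print(f"batch={bs:>3}  ->  ~{activation_bytes(bs) / 2**30:.2f} GiB of activations")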

Model Size (Number of Parameters)

  • Bigger models (more layers, more hidden units, higher dimensional embeddings) require more memory to store parameters and gradients
  • Memory also scales with activation size (depends on architecture, sequence length for transformers, image resolution for CNNs, etc.)
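
A quick way to get a feel for this is to count parameters directly; the following sketch assumes a PyTorch model and uses an arbitrary toy architecture:

import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))   # toy example model
n_params = sum(p.numel() for p in model.parameters())
# weights plus gradients, both stored in FP32 by default
print(f"{n_params / 1e6:.1f}M parameters -> ~{2 * n_params * 4 / 2**20:.0f} MiB for weights + gradients")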

Precision (FP32 vs. FP16 / BF16 / Quantization)

  • FP32 takes twice the memory of FP16
  • Mixed-precision training reduces memory footprint significantly
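
For a sense of the numbers, the sketch below uses an illustrative parameter count (not one of our models) and only counts the weights themselves:

n_params = 350_000_000                                         # example parameter count
for name, nbytes in {"FP32": 4, "FP16/BF16": 2, "INT8 (quantized)": 1}.items():
    print(f"{name:>16}: ~{n_params * nbytes / 2**30:.2f} GiB just for the weights")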

Optimizer Choice

  • Adam requires ~2–3x the memory of SGD because it stores running averages of gradients and squared gradients
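
A rough sketch of the extra state each optimizer keeps per parameter (FP32 training, illustrative parameter count):

n_params = 350_000_000                                           # example parameter count
extra_state_copies = {"SGD": 0, "SGD + momentum": 1, "Adam": 2}  # full-size tensors kept per parameter
for opt_name, copies in extra_state_copies.items():
    print(f"{opt_name:>14}: +{copies * n_params * 4 / 2**30:.2f} GiB of optimizer state")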

Sequence Length / Input Dimensionality

  • For transformers: memory grows quadratically with sequence length because of attention
  • For images: higher resolution increases memory footprint quadratically with image size
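
The quadratic term for transformers comes from the attention score matrix itself; a small arithmetic sketch (batch size, head count, and precision are arbitrary example values):

def attention_score_bytes(batch=8, heads=16, seq_len=1024, bytes_per_value=2):   # FP16 values
    # the attention score matrix alone has shape (batch, heads, seq_len, seq_len)
    return batch * heads * seq_len * seq_len * bytes_per_value

for seq_len in (512, 1024, 2048, 4096):
    print(f"seq_len={seq_len:>4}  ->  ~{attention_score_bytes(seq_len=seq_len) / 2**30:.2f} GiB for attention scores")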

CPU / GPU Compute Consumption

Batch Size

  • Larger batch size → more parallel work per step → higher GPU utilization, but each step is slower
  • Often, throughput improves up to the point where memory is saturated
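
To see the batch-size/throughput trade-off on your own machine, a small timing sketch like the one below can help; it assumes PyTorch and a toy model, so treat the numbers as illustrative only:

import time

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))   # toy example model
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for bs in (8, 32, 128):
    x = torch.randn(bs, 512)
    start = time.perf_counter()
    for _ in range(10):                        # a few steps, just to average the timing
        opt.zero_grad()
        loss_fn(model(x), x).backward()
        opt.step()
    elapsed = time.perf_counter() - start
    print(f"batch={bs:>4}  ->  ~{10 * bs / elapsed:,.0f} samples/s")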

Model Complexity (Depth, Width, Architecture)

  • Transformer attention complexity: O(sequence_length² × hidden_dim)
  • CNNs scale with kernel size × feature map size × number of filters
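
Plugging numbers into the attention term makes the quadratic growth concrete; the sketch below ignores projections and multiple heads and uses an arbitrary hidden dimension:

def attention_flops(seq_len, hidden_dim=1024):
    # QK^T and (scores x V) each cost roughly seq_len^2 * hidden_dim multiply-adds
    return 2 * seq_len ** 2 * hidden_dim

for seq_len in (512, 1024, 2048):
    print(f"seq_len={seq_len:>4}: ~{attention_flops(seq_len) / 1e9:.1f} GFLOPs per attention layer")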

Training Precision

  • FP32 requires more compute per operation than FP16/BF16
  • Mixed precision can speed up training by 2–3x on modern GPUs
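
A minimal mixed-precision training step, assuming PyTorch with a CUDA GPU available (the model and data are toy placeholders):

import torch
import torch.nn as nn

device = "cuda"                                      # mixed precision pays off on the GPU
model = nn.Linear(1024, 1024).to(device)             # toy example model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 1024, device=device)
opt.zero_grad()
with torch.cuda.amp.autocast():                      # forward pass runs in reduced precision where safe
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()                        # loss scaling avoids FP16 gradient underflow
scaler.step(opt)
scaler.update()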

Data Pipeline

  • If CPU preprocessing (e.g., image augmentation, text tokenization) is too slow, it can bottleneck training
  • When this happens, CPU and RAM usage spike while the GPU waits for data
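
The usual first knob is the data loader itself; a sketch assuming PyTorch's DataLoader with dummy data (the worker counts are example values to tune per machine):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10_000, 512), torch.randint(0, 10, (10_000,)))   # dummy data
loader = DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,        # preprocessing runs in parallel CPU worker processes
    pin_memory=True,      # speeds up host-to-GPU copies
    prefetch_factor=2,    # batches prefetched per worker
)
for xb, yb in loader:
    pass                  # training step goes here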

Parallelization Strategy

  • Data parallelism: more GPUs split the batch, but each GPU still needs memory for the full model
  • Model parallelism / sharding: splits model across devices, reducing per-GPU load but increasing communication overhead
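
A minimal data-parallel sketch with PyTorch DistributedDataParallel, assuming it is launched via torchrun --nproc_per_node=<N_GPUS> train.py (the script name is hypothetical):

import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")               # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda()          # toy example model
    model = DDP(model, device_ids=[local_rank])   # each rank keeps a full replica of the model

    # training loop goes here: every rank processes its own shard of each batch and
    # gradients are all-reduced automatically during backward()

if __name__ == "__main__":
    main()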
