Troubleshooting
The following sections help with the most common problems when setting up the clusters locally or in the cloud.
Recommended for debugging: use k9s, a terminal-based Kubernetes dashboard, to monitor jobs, pods, and logs in real time. Run `k9s -n <NAMESPACE>`
to get a live view of resources, switch between them instantly, and inspect logs or events with a few keystrokes. It is faster and more convenient than running the equivalent kubectl commands by hand.
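A minimal usage sketch, assuming k9s is already installed; the namespace is a placeholder and the key bindings listed are the k9s defaults:

```bash
# Open k9s scoped to the project namespace (placeholder name).
k9s -n <NAMESPACE>

# Useful default key bindings inside k9s:
#   :pods   switch to the pod view
#   l       stream logs of the selected pod
#   d       describe the selected pod (events, resource limits)
#   0       show resources across all namespaces
```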
Understanding Error Messages
The following sections discuss some common error messages that developers can encounter.
General Error Messages
Error Message | Description | Resolution |
---|---|---|
`ErrImagePull` / "The node was low on resource: ephemeral-storage" | Kubernetes nodes have limited ephemeral storage by default. CUDA libraries and container layers alone can consume 8 GB+, causing nodes to run out of space and leading to pod eviction or image pull failures. | Increase the node disk size, e.g. `--disk-size 50` in `aws eks create-nodegroup` (see the example after this table). |
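A minimal sketch of creating a node group with a larger disk; the cluster name, node group name, IAM role ARN, subnets, and instance type below are placeholders, not values from this setup:

```bash
# Recreate the GPU node group with 50 GB of node disk / ephemeral storage.
# Every <...> value and the instance type are placeholders.
aws eks create-nodegroup \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <NODEGROUP_NAME> \
  --disk-size 50 \
  --instance-types g4dn.xlarge \
  --node-role <NODE_ROLE_ARN> \
  --subnets <SUBNET_ID_1> <SUBNET_ID_2> \
  --scaling-config minSize=1,maxSize=2,desiredSize=1
```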
Cloud Specific
Error Message | Description | Resolution |
---|---|---|
Local Specific
Error Message | Description | Resolution |
---|---|---|
`python-container Unexpected error occurred (TypeError("'NoneType' object is not subscriptable"))` / ServiceBus connection error after Docker restart | When Docker overutilizes local resources and restarts itself, the ServiceBus connection can fail, leading to NoneType errors in the Python container because the Azure ServiceBus handler cannot re-establish the connection properly. | Monitor resources via the Docker Dashboard and the logs of the jobs-manager and pod-monitor pods. If the error occurs, increase the resources in the Docker Dashboard and restart the jobs manager (e.g. delete its pod in k9s with Ctrl+D so it is recreated; see the sketch after this table) to restore the connection. |
GPU Setup | Only one experiment can run per GPU. | Ask Shujaat. |
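A minimal sketch of restarting the jobs manager from the command line instead of k9s, assuming it runs as a Deployment named jobs-manager (name and namespace are assumptions to adjust):

```bash
# Restart the jobs manager so it re-establishes the ServiceBus connection.
# The Deployment name "jobs-manager" is an assumption about this setup.
kubectl rollout restart deployment/jobs-manager -n <NAMESPACE>

# Alternatively, delete the pod directly; the Deployment recreates it.
kubectl delete pod <JOBS_MANAGER_POD_NAME> -n <NAMESPACE>
```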
Helm Operations
TODO: decide and document when to apply changes via helm upgrade and when to do a full helm uninstall followed by a reinstall. Both options are sketched below.
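A minimal sketch of the two options, with the release name, chart path, and values file as placeholders rather than this repository's actual values:

```bash
# Option 1: upgrade the existing release in place, keeping its history.
# <RELEASE>, <CHART_PATH>, and values.yaml are placeholders.
helm upgrade <RELEASE> <CHART_PATH> -n <NAMESPACE> -f values.yaml

# Option 2: remove the release and its Kubernetes resources, then reinstall.
helm uninstall <RELEASE> -n <NAMESPACE>
helm install <RELEASE> <CHART_PATH> -n <NAMESPACE> -f values.yaml
```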
Environment
Pods Not Starting
- List the pods and their current status: `kubectl get pods -n <NAMESPACE>`
- Show the logs of the failing pod: `kubectl logs <POD_NAME> -n <NAMESPACE>`
- Show events, resource requests, and scheduling details: `kubectl describe pod <POD_NAME> -n <NAMESPACE>`
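If describe gives little information, the namespace events often show why scheduling or image pulls fail; a minimal sketch:

```bash
# Recent events in the namespace, oldest first.
kubectl get events -n <NAMESPACE> --sort-by=.lastTimestamp
```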
Insufficient Resources
- Check node-level CPU and memory usage: `kubectl top nodes`
- Check per-pod usage in the namespace: `kubectl top pods -n <NAMESPACE>`
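If usage looks low but pods still stay Pending, compare what is already requested on each node against what is allocatable; a minimal sketch:

```bash
# Requested vs. allocatable CPU and memory per node.
kubectl describe nodes | grep -A 8 "Allocated resources"
```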
Storage Issues
- Check the persistent volume claims in the namespace: `kubectl get pvc -n <NAMESPACE>`
- Check the cluster-wide persistent volumes: `kubectl get pv`
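If a claim stays in Pending, its events usually name the cause (e.g. missing storage class or no matching volume); a minimal sketch, with the claim name as a placeholder:

```bash
# Inspect the events of a stuck PersistentVolumeClaim.
kubectl describe pvc <PVC_NAME> -n <NAMESPACE>
```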
Image Pull Errors
- Verify that the registry credentials secret exists: `kubectl get secret regcred -n <NAMESPACE>`
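If the secret is missing, it can be recreated; a minimal sketch, assuming a private registry whose server, username, and password are placeholders:

```bash
# Recreate the registry credentials secret used for pulling private images.
# Server, username, and password are placeholders for your registry.
kubectl create secret docker-registry regcred \
  --docker-server=<REGISTRY_SERVER> \
  --docker-username=<USERNAME> \
  --docker-password=<PASSWORD> \
  -n <NAMESPACE>
```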
Client
CPU or Memory Limits
Overview
When you encounter CPU or memory limits during deployment or training, there are a few main strategies for reducing resource usage:
- Reduce data size
- Use smaller model architectures
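Before changing anything, it helps to confirm which resource is actually the bottleneck. A minimal sketch, assuming the training pod's image ships nvidia-smi; pod and namespace names are placeholders:

```bash
# CPU and RAM usage of the training pod.
kubectl top pod <POD_NAME> -n <NAMESPACE>

# GPU memory and utilization inside the pod (requires nvidia-smi in the image).
kubectl exec -n <NAMESPACE> <POD_NAME> -- nvidia-smi
```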
Memory Consumption (RAM / GPU VRAM)
Batch Size
- Larger batch = more data stored in memory at once (inputs, intermediate activations, gradients)
- Memory scales roughly linearly with batch size
Model Size (Number of Parameters)
- Bigger models (more layers, more hidden units, higher dimensional embeddings) require more memory to store parameters and gradients
- Memory also scales with activation size (depends on architecture, sequence length for transformers, image resolution for CNNs, etc.)
Precision (FP32 vs. FP16 / BF16 / Quantization)
- FP32 takes twice the memory of FP16
- Mixed-precision training reduces memory footprint significantly
Optimizer Choice
- Adam requires ~2–3x the memory of SGD because it stores running averages of gradients and squared gradients
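A rough worked estimate that combines the precision and optimizer points above, ignoring activations; the 1-billion-parameter model is a hypothetical example:

$$
\begin{aligned}
\text{weights (FP32, 4 bytes/param)} &\approx 4\,\text{GB} \\
\text{gradients (FP32)} &\approx 4\,\text{GB} \\
\text{Adam state (two FP32 buffers)} &\approx 8\,\text{GB} \\
\text{total before activations} &\approx 16\,\text{GB}
\end{aligned}
$$

With FP16/BF16 weights and gradients the first two terms roughly halve, while the optimizer state typically stays in FP32.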
Sequence Length / Input Dimensionality
- For transformers: memory grows quadratically with sequence length because of attention
- For images: memory grows quadratically with resolution, since activations scale with the number of pixels
CPU / GPU Compute Consumption
Batch Size
- Larger batch size → more parallel work per step → higher GPU utilization, but each step is slower
- Often, throughput improves up to the point where memory is saturated
Model Complexity (Depth, Width, Architecture)
- Transformer attention complexity: O(sequence_length² × hidden_dim)
- CNNs scale with kernel size × feature map size × number of filters
Training Precision
- FP32 requires more compute per operation than FP16/BF16
- Mixed precision can speed up training by 2–3x on modern GPUs
Data Pipeline
- If CPU preprocessing (e.g., image augmentation, text tokenization) is too slow, it can bottleneck training
- When that happens, CPU and RAM usage spike while the GPU sits partly idle waiting for data
Parallelization Strategy
- Data parallelism: more GPUs split the batch, but each GPU still needs memory for the full model
- Model parallelism / sharding: splits model across devices, reducing per-GPU load but increasing communication overhead
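As a concrete example of the data-parallel option, a minimal sketch of launching one process per GPU with PyTorch's torchrun launcher; the script name and GPU count are assumptions:

```bash
# Launch 4 data-parallel worker processes, one per GPU, on a single node.
# train.py is a placeholder for the actual training entry point.
torchrun --nproc_per_node=4 train.py
```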