Setting Up Local Kubernetes Infrastructure on Linux Using K3s
Overview
Running machine learning workloads locally on Linux requires a lightweight but production-grade Kubernetes environment. This guide shows how to set up a single-node cluster with K3s, configure storage, and deploy the tracebloc client. K3s is a slim Kubernetes distribution designed for edge and local use, consuming fewer resources than Minikube or full kubeadm setups.
The result is a local platform for securely training and benchmarking AI models while keeping all data on your machine.
The entire setup can be completed in about 1 hour.
Note: GPU support on Linux with K3s requires extra configuration. GPU steps are marked as optional.
Prerequisites
Before proceeding with the local infrastructure setup, make sure the following requirements are met:
System Requirements
Hardware Specifications
- Minimum: 4 CPU cores, 8 GB RAM for basic workloads
- Recommended: 8+ CPU cores, 16+ GB RAM for heavy computer vision or NLP workloads
- Your machine must have sufficient disk space for container images, training data, and model artifacts
Required Tooling & Tracebloc Account
| Tool | Purpose | Installation Link |
|---|---|---|
| Docker | Build and run local containers; optional for K3s itself, useful for GPU test containers | Docker |
| K3s | Runs a local Kubernetes cluster | K3s |
| kubectl | Command-line tool to interact with Kubernetes clusters | kubectl |
| Helm 3.x | Package manager for Kubernetes, used to install and manage applications | Helm |
Note: K3s and Helm are installed in later steps of this guide. You do not need to pre-install them manually unless you prefer to manage versions yourself.
Tracebloc Account
- Client ID and password from the tracebloc client view
- Docker Hub registry credentials (username, token/password, email)
Install Helm
Install via script and verify the installation:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
Recommended for Monitoring: Use k9s
You can use k9s, a terminal-based Kubernetes dashboard, to monitor jobs, pods, and logs in real time. Run k9s -n <NAMESPACE>
to get a live view of resources, switch between them instantly, and inspect logs or events with a few keystrokes. Compared to kubectl, it is faster and more convenient.
Components
The local deployment consists of two main parts: Core Infrastructure and Client Deployment. Together, they provide a secure, scalable environment for running ML workloads with the tracebloc client.
Core Infrastructure
Local Kubernetes cluster: Single-node K3s cluster, lightweight but fully compatible with upstream Kubernetes.
Container runtime: containerd by default; GPU integration requires the NVIDIA runtime setup
Persistent Storage: hostPath-based volumes for datasets, logs, and MySQL metadata
Local Networking: cluster DNS for pod-to-pod and service communication.
Client Deployment
tracebloc Client Application: Deployed into the local K3s cluster using Helm, configured with credentials, registry access, and storage.
Monitoring and Verification: Kubernetes tools to confirm pods, services, and persistent volumes are running correctly.
Process Flow
- Infrastructure Setup – Create local cluster, configure storage and networking
- Client Deployment – Configure and deploy the tracebloc client application
- Verification – Validate that everything is running correctly
Security
Security is a core consideration even when running workloads on a local machine. This setup follows Kubernetes best practices to ensure your data remains protected and workloads are isolated, while still allowing external models to be evaluated safely.
Data Protection
- Data Locality: Training data stays on your local machine.
- Model Encryption: Model weights are encrypted and never leave your environment.
- Secure Communication: All communication is protected with TLS.
Tracebloc Backend Security
- Code Analysis: Submitted models are scanned with Bandit for vulnerabilities.
- Input Validation: Training scripts and code are validated before execution.
- Container Isolation: Training runs inside Docker containers with restricted system access.
- Namespace Isolation: Kubernetes namespace separates workloads from other applications.
Required Outbound Access
- *.docker.io – Container image downloads
- *.tracebloc.io – Platform communication
- pypi.org and similar – Dependency installation during training
Proxy Support
If your machine is behind a proxy, configure the following in values.yaml:
HTTP_PROXY_HOST: your-proxy.company.com
HTTP_PROXY_PORT: 8080
Quick Setup
Purpose
Spin up a K3s cluster and deploy the tracebloc client in one go.
What the script does:
- Starts K3s with recommended resources
- Creates hostPath storage directories (datasets, logs, MySQL)
- Deploys the tracebloc client with Helm
Run the setup script (example placeholder, adjust to your needs):
# Download and run the automated setup script
tbd
Cleanup:
Delete K3s cluster and remove created directories.
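A minimal cleanup sketch, assuming the default K3s installer location and the data directories created by this setup (adjust paths if your script differs):
# Remove K3s, its service, and cluster state (script installed by the K3s installer)
sudo /usr/local/bin/k3s-uninstall.sh
# Remove the hostPath storage directories
rm -rf data/shared-data data/logs data/mysql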
Tips:
- Network model: local Docker bridge networking.
- Resource limits: increase K3s memory/CPU for heavy workloads.
If you prefer more control and customization, follow the detailed step-by-step guide below.
Detailed Setup
This section walks through a step-by-step build with K3s and kubectl. It mirrors the Quick Setup but lets you choose your own resource settings (CPUs, memory, storage paths, namespace, Helm release name, etc.). Expect about 1 hour end-to-end.
What you’ll do (Steps 1–4):
- Local Cluster Setup — Start a single-node Kubernetes cluster with K3s. Configure CPU and memory to match your workload requirements; K3s ships with containerd as its container runtime, so Docker is not required for the cluster itself. This provides a functional control plane and worker node in one instance.
- Storage — Create hostPath directories on your machine for datasets, logs, and MySQL files. Mount them as persistent volumes in Kubernetes so data, logs, and model artifacts persist across pod restarts.
- Client Configuration — Add the tracebloc Helm repository and generate a values.yaml file. Configure authentication credentials, registry access, and storage paths so the client can run securely in your environment.
- Client Deployment — Install the Helm chart into your chosen namespace. Deploy and verify that the tracebloc client pods and persistent volumes are running and bound correctly.
Helm Usage: Helm is used to install and manage Kubernetes applications. In steps 3 and 4 you will deploy the tracebloc client via the tracebloc/bm (bare metal) chart from the tracebloc Helm repository.
1. Local Cluster Setup
Set up a Local Kubernetes Cluster
A local Kubernetes cluster is the foundation of this deployment. K3s packages the control plane and worker node components into a single lightweight binary, typically using containerd as the runtime. It runs directly on bare metal or inside a VM, giving you a fully functional Kubernetes environment without the overhead of a cloud provider.
Resource allocation in K3s
K3s does not impose fixed limits on CPU or memory the way Minikube does. Instead, the cluster adapts to the host: a small VM with 2 vCPUs and 4 GB RAM provides exactly that capacity to K3s, while a physical machine with 16 cores and 64 GB RAM makes the entire resource pool available. Make sure the host provides sufficient resources for your data and the model types you expect to train on your infrastructure.
Install K3s
To install and launch, run:
curl -sfL https://get.k3s.io | sh -
sudo systemctl enable k3s
sudo systemctl status k3s
It runs the K3s server process, which includes both control plane and worker components.
- enable ensures K3s starts automatically on reboot
- status confirms the service is up and running
Export kubeconfig:
K3s stores the cluster’s kubeconfig in /etc/rancher/k3s/k3s.yaml. By exporting it, you let kubectl communicate with your local cluster.
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
Without this step, kubectl will not know where your cluster is. Verify cluster status and that your node has successfully joined the cluster:
kubectl get nodes
You should see STATUS="Ready", which means the cluster is running.
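The export only lasts for the current shell session. If you want kubectl to find the K3s cluster in new shells as well, one option (assuming bash) is to append the export to your profile:
# persist the kubeconfig path for future shells
echo 'export KUBECONFIG=/etc/rancher/k3s/k3s.yaml' >> ~/.bashrc
source ~/.bashrc
# note: this file is root-readable by default; adjust permissions or run kubectl with sudo if needed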
Set namespace and node labels
Create a namespace for your workloads:
kubectl create namespace <NAMESPACE>
Namespaces let you isolate your workloads from system components and other running applications.
Add the mandatory node labels for workload management. Get the node name, then label the node:
kubectl get nodes
kubectl label node <NODENAME> type=system trainingset=<LABEL>
Set cpu or gpu as the label depending on your setup.
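For example, on a CPU-only machine whose node is named workstation-01 (a hypothetical node name), the commands would look like this:
kubectl label node workstation-01 type=system trainingset=cpu
# confirm the labels were applied
kubectl get nodes --show-labels | grep trainingset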
Recommended for Stability (Optional)
Disable Swap
Kubernetes schedules pods based on actual memory availability. If swap is enabled, the kernel may push memory pages to disk, making nodes appear healthier than they are. This leads to unpredictable performance, delayed OOM kills, and failed scheduling decisions. Disabling swap keeps resource usage accurate and stable:
sudo swapoff -a
swapoff -a disables swap memory on your machine, and commenting out the swap entry in /etc/fstab ensures it stays off after reboot.
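A small sketch for making the change permanent by commenting out the swap entry in /etc/fstab (back the file up first; the sed pattern assumes a standard whitespace-separated fstab):
sudo cp /etc/fstab /etc/fstab.bak
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab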
Manage Multiple Clusters by Merging kubeconfig Files
By default, kubectl only talks to one cluster at a time. Merging configs lets you switch seamlessly between multiple clusters without constantly exporting environment variables or overwriting files. It keeps all your cluster contexts in one place, making kubectl config use-context possible.
mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config_k3s
sudo chmod 644 ~/.kube/config_k3s
sudo KUBECONFIG=~/.kube/config:~/.kube/config_k3s kubectl config view --flatten > /tmp/merged_kubeconfig
mv /tmp/merged_kubeconfig ~/.kube/config
These commands copy the K3s kubeconfig, fix its permissions, then merge it with any existing kubeconfig (for example from Minikube or EKS) into a single ~/.kube/config file.
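With the merged file in place you can list and switch contexts directly. K3s normally registers its context as default, but verify the name on your machine first:
kubectl config get-contexts
kubectl config use-context default   # replace "default" with the K3s context name shown above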
Optional: GPU Enablement
This section enables GPU support for your Kubernetes cluster. You first make GPUs visible to containers on the host, then deploy NVIDIA’s GPU Operator so Kubernetes can schedule GPU workloads reliably.
Verify GPU Availability
Check if GPUs are detected:
nvidia-smi
Running nvidia-smi checks whether your NVIDIA GPU is visible to the system and shows the driver version, GPU utilization, and available memory.
Install NVIDIA Container Toolkit
The NVIDIA Container Toolkit lets container runtimes like Docker or containerd access your GPU and run CUDA workloads instead of being stuck on CPU. Kubernetes can only schedule GPU workloads if the runtime knows how to expose GPU devices to pods.
Add NVIDIA’s official package repository to your system:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
It adds NVIDIA’s signed repository so apt trusts it.
Update package index, then pull and install the toolkit:
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
nvidia-container-cli --version
Configure Docker runtime:
By default, Docker cannot pass GPUs into containers. Without this step, even if your host sees the GPU, your containers remain GPU-blind. Restarting Docker makes sure the runtime hook is actually loaded.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
nvidia-ctk runtime configure --runtime=docker tells Docker to register the NVIDIA runtime so it can launch GPU-enabled containers; systemctl restart docker applies this change.
Run GPU test container
Let’s verify GPU access inside a container and check that the NVIDIA drivers, container toolkit, and runtime are correctly configured:
docker run --rm --gpus all nvidia/cuda:12.6.2-cudnn-runtime-ubuntu22.04 nvidia-smi
Launches a temporary CUDA container with GPU access enabled and executes nvidia-smi inside it, confirming end-to-end GPU visibility from host to container.
Deploy NVIDIA GPU Operator
The NVIDIA GPU Operator automates GPU management in Kubernetes. It installs and manages the device plugin, runtime components, monitoring stack, and optionally drivers. Without it, you’d have to manually configure each node to make GPUs schedulable.
Add NVIDIA Helm repository
Add NVIDIA’s official Helm repository so the GPU Operator can be installed with up-to-date charts:
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
Install GPU Operator
Create a dedicated namespace to keep GPU operator resources isolated from the rest of your cluster and deploy the operator so it can configure GPU runtime components across nodes:
helm install --wait gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=false
It installs the GPU Operator via Helm into the gpu-operator namespace and waits for the pods to be ready. The flag driver.enabled=false skips the automatic driver install in case your nodes already have the correct NVIDIA drivers.
Verify Operator deployment
Check pod and node resources to ensure GPUs are now exposed to Kubernetes:
kubectl get pods -n gpu-operator
Expect them to be in Running or Completed state.
Verify node configuration:
kubectl describe nodes | grep nvidia
Expect to see labels and allocatable resources like nvidia.com/gpu: 1, confirming that Kubernetes can schedule GPU workloads on your nodes.
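As an optional end-to-end check, you can run a short-lived pod that requests one GPU and prints nvidia-smi. This is a hedged sketch: the pod name is arbitrary, and it assumes the GPU Operator has created the nvidia RuntimeClass in your cluster:
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: cuda
    image: nvidia/cuda:12.6.2-cudnn-runtime-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# once the pod has completed, inspect the output and clean up
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test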
2. Storage
Unlike cloud deployments that use managed storage services, local deployments use hostPath volumes, which are real paths on your local machine. Create persistent storage directories for:
- shared-data: training datasets (json, txt, jpeg, png, etc.), plus model checkpoints and weights during training
- logs: application and pod logs for tracebloc client operations
- mysql: metadata and training data labels
mkdir -p data/shared-data
mkdir -p data/logs
mkdir -p data/mysql
These directories will be mounted as Persistent Volumes in Kubernetes.
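Since hostPath volumes live on the host filesystem, it helps to note the absolute paths now (values.yaml in the next step expects absolute hostPath entries) and to confirm there is enough free disk space:
# print the absolute paths you will reference in values.yaml
readlink -f data/shared-data data/logs data/mysql
# check free space on the disk backing these directories
df -h data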
3. Client Configuration
You can now prepare the Helm chart that deploys the tracebloc client itself. The client is the component that runs your model evaluation and training jobs inside K3s. To install it, you use a Helm chart, which bundles all required Kubernetes manifests into a single, configurable package. A values.yaml file controls this deployment; in it you specify credentials (to authenticate with tracebloc), registry access (to pull container images), and storage settings. This configuration ensures the client can securely connect, schedule workloads, and store results.
Add Helm Repository
Install and update the tracebloc client using Helm instead of managing raw YAML. For details, refer to the public GitHub Repository.
helm repo add tracebloc https://tracebloc.github.io/client/
helm repo update
Adds the official tracebloc Helm repository to your local configuration so you can install the client with a single Helm command.
Configure your Deployment Settings
Export the chart’s default configuration into a local file that you can edit:
helm show values tracebloc/bm > values.yaml
Downloads the default configuration template for the tracebloc client for bm (bare metal). Open and update the following sections in values.yaml:
Deployment Namespace
Use the defined namespace:
namespace: <NAMESPACE>
Defines where the client will be deployed.
Tracebloc Authentication
Client ID
Provide your client ID from the tracebloc client view.
jobsManager:
  env:
    CLIENT_ID: "<YOUR_CLIENT_ID>"
Client Password
Set create: true to generate the secret during installation:
# Secrets configuration
secrets:
  # Whether to create the secret or use existing secret
  create: true
  # Client password
  clientPassword: "<YOUR_CLIENT_PASSWORD>"
Docker Registry Configuration
The tracebloc client images are stored in a private container registry. Kubernetes needs valid Docker Hub credentials to pull these images onto your nodes.
dockerRegistry:
  create: true
  secretName: regcred
  server: https://index.docker.io/v1/
  username: <DOCKER_USERNAME>
  password: <DOCKER_PASSWORD or TOKEN>
  email: <DOCKER_EMAIL>
- DOCKER_USERNAME: Docker Hub username
- DOCKER_PASSWORD: Password or access token (if 2FA enabled)
- DOCKER_TOKEN: Alternative token for automation or personal access
- DOCKER_EMAIL: Email linked to your Docker account
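If Docker is installed locally, an optional sanity check is to log in once with the same credentials before deploying, so that image-pull failures in the cluster can be ruled out early:
docker login -u <DOCKER_USERNAME>
# enter the password or access token when prompted; log out again afterwards if you prefer
docker logout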
Storage
Your workloads need persistent storage that survives pod restarts and can be shared across all nodes. K3s uses the local file system:
storageClass:
  # Set to true to create a new storage class, false to use an existing one. Be careful not to overwrite existing data files when setting it to true.
  create: true
  parameters:
    osType: <darwin|linux|windows>
sharedData:
  name: shared-data
  storage: 50Gi
  hostPath: "/absolute/path/to/data/shared-data"
logsPvc:
  name: logs-pvc
  storage: 10Gi
  hostPath: "/absolute/path/to/data/logs"
mysqlPvc:
  name: mysql-pvc
  storage: 2Gi
  hostPath: "/absolute/path/to/data/mysql"
Set Resource Limits at Pod/Container Level
Kubernetes schedules pods based on requests and enforces limits; this prevents one workload from starving the node. With single-node K3s this is critical, since all pods share the same machine's resources. If you do not set requests, the scheduler can overcommit; if you do not set limits, one training job can monopolize the node. Size requests to what the job needs, and keep limits slightly above that but within the actual host capacity.
Size pod-level requests and limits according to your actual training requirements while keeping them well within your system's capacity. Kubernetes then schedules pods only if the requested resources are available:
RESOURCE_REQUESTS: "cpu=2,memory=8Gi"
RESOURCE_LIMITS: "cpu=2,memory=8Gi"
GPU_REQUESTS: "" # for GPU support, set "nvidia.com/gpu=1"
GPU_LIMITS: "" # for GPU support, set "nvidia.com/gpu=1"
# Optional: for GPU support with k3s, uncomment below:
# RUNTIME_CLASS_NAME: "nvidia"
Optional for GPU setup:
- set GPU_REQUESTS and GPU_LIMITS to nvidia.com/gpu=1
- set RUNTIME_CLASS_NAME: "nvidia"
These values define pod-level resource allocations. Kubernetes then schedules pods onto the node if the requested resources are available, otherwise the pod remains in a "Pending" state. Resource limits are optional, but they cap the maximum compute a pod can use, preventing it from consuming more than the defined values and starving other workloads.
In short: set pod requests to what the training actually needs and keep limits well within your system's capacity.
Tip: Estimating VRAM requirements for LLMs can be tricky. Use the VRAM Calculator to approximate memory needs for different model sizes and batch configurations.
Node, Pod, Job relationship
- A single-node K3s install runs all pods on that one node (although K3s can scale out with additional agent nodes).
- One pod contains one or more containers
- One training run usually equals one Job, which by default creates one pod
Pods are lightweight and share the same node, but all of them must fit within the actual CPU and memory capacity of the host (or the VM, if K3s runs inside one).
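To size requests and limits sensibly, check what the node actually exposes before editing values.yaml; a quick sketch (replace <NODENAME> with the name from kubectl get nodes):
kubectl describe node <NODENAME> | grep -A 6 Allocatable
nproc      # CPU cores on the host
free -h    # memory on the host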
Proxy Settings (Optional)
These settings are only needed if your machine connects to the internet through a corporate or institutional proxy or firewall.
# proxy hostname.
HTTP_PROXY_HOST:
# proxy port.
HTTP_PROXY_PORT:
# username used for proxy authentication if needed.
HTTP_PROXY_USERNAME:
# password used for proxy authentication if needed.
HTTP_PROXY_PASSWORD:
If you are working on a private laptop or server with a direct internet connection, you can leave these fields empty.
4. Client Deployment
With the configuration ready, deploy the tracebloc client into your K3s cluster using Helm.
Deploy the client with Helm
Install the tracebloc Helm chart into the specified namespace using your customized values file:
helm install <RELEASE_NAME> tracebloc/bm \
--namespace <NAMESPACE> \
--values values.yaml
It creates:
- One MySQL pod: mysql-...
- One tracebloc client pod: tracebloc-jobs-manager-...
- Supporting resources such as a Service, ConfigMap, Secret, and PVC
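You can confirm that the release itself was registered with Helm:
helm list -n <NAMESPACE>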
Verification and Maintenance
After deployment, confirm the client is running correctly and learn how to maintain it.
Verify Deployment
Check that all pods are running in your namespace:
Check pod status:
kubectl get pods -n <NAMESPACE>
A few minutes after running the install command, it should show two pods with status "Running". If pods are not running, check the pod logs for error details.
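If a pod stays in Pending, ImagePullBackOff, or CrashLoopBackOff, the usual first steps are to read its events and logs (placeholders are your pod name and namespace):
kubectl describe pod <POD_NAME> -n <NAMESPACE>   # events show scheduling and image-pull problems
kubectl logs <POD_NAME> -n <NAMESPACE>           # application-level errors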
Check services:
kubectl get services -n <NAMESPACE>
It should show the database service "mysql-".
Check persistent volumes:
kubectl get pvc -n <NAMESPACE>
Verifies that all persistent volume claims are bound and storage is available.
Check registry secret:
kubectl get secret regcred -n <NAMESPACE>
Verifies that the Docker registry secret was created successfully for pulling container images.
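If image pulls fail even though the secret exists, you can decode it to confirm which registry and username it actually contains (the output includes credentials, so treat it as sensitive):
kubectl get secret regcred -n <NAMESPACE> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d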
Maintenance
Update your values:
helm show values tracebloc/bm > new-values.yaml
Edit new-values.yaml with your changes.
Upgrade the deployment:
helm upgrade <RELEASE_NAME> tracebloc/bm \
--namespace <NAMESPACE> \
--values new-values.yaml
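After upgrading, you can confirm that the release is healthy and review its revision history:
helm status <RELEASE_NAME> -n <NAMESPACE>
helm history <RELEASE_NAME> -n <NAMESPACE>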
Uninstall
Remove the Helm release:
helm uninstall <RELEASE_NAME> -n <NAMESPACE>
Clean up persistent resources (optional):
kubectl delete pvc --all -n <NAMESPACE>
kubectl delete namespace <NAMESPACE>
Troubleshooting
Fix NVIDIA Container Runtime with Containerd (if K3s errors) (Optional for GPU)
If K3s fails with:
failed to load TOML from /etc/containerd/config.toml: invalid disabled plugin URI "cri>
the containerd configuration file is corrupted. Regenerate a clean config and re-apply the NVIDIA runtime:
Backup the broken config
Preserve the existing but corrupted file in case you need to inspect or restore it later:
sudo mv /etc/containerd/config.toml /etc/containerd/config.toml.bak
Moves the broken config.toml out of the way and saves it as a backup.
Generate a clean default config
containerd requires a valid configuration file to start, generating a new one ensures a consistent baseline:
sudo containerd config default | sudo tee /etc/containerd/config.toml
Creates a fresh config.toml with default settings and writes it to the correct location.
Apply NVIDIA runtime:
Integrates the NVIDIA runtime with containerd so it can launch GPU-enabled workloads:
sudo nvidia-ctk runtime configure --runtime=containerd
Updates the new config file with the required NVIDIA runtime entries.
Reload and restart containerd
containerd must reload its configuration to apply changes, restarting ensures the service runs with the corrected settings:
sudo systemctl daemon-reload
sudo systemctl restart containerd
sudo systemctl status containerd
Reloads systemd, restarts containerd, and shows the service status. If successful, you will see Active: active (running).
Next Steps
Create a New Use Case
- Prepare your dataset - Upload and configure your training data
- Create an AI use case - Set up a new use case
Join an Existing Use Case
- Explore and join available use cases - Browse ongoing AI projects
- Start training models - Begin training on shared datasets
Need Help?
- Email: support@tracebloc.io
- Docs: tracebloc Documentation Portal