Setting Up Local Kubernetes Infrastructure on Linux Using K3s
Overview
Running machine learning workloads locally on Linux requires a lightweight but production-grade Kubernetes environment. This guide shows how to set up a single-node cluster with K3s, configure storage, and deploy the tracebloc client. K3s is a slim Kubernetes distribution designed for edge and local use, consuming fewer resources than Minikube or full kubeadm setups.
The result is a local platform for securely training and benchmarking AI models while keeping all data on your machine.
The entire setup can be completed in about 1 hour.
Note: GPU support on Linux with K3s requires extra configuration. GPU steps are marked as optional.
Prerequisites
Before proceeding with the local infrastructure setup, make sure the following requirements are met:
System Requirements
Hardware Specifications
- Minimum: 4 CPU cores, 8 GB RAM for basic workloads
- Recommended: 8+ CPU cores, 16+ GB RAM for heavy computer vision or NLP workloads
- Your machine must have sufficient disk space for container images, training data, and model artifacts
Required Tooling & Tracebloc Account
| Tool | Purpose | Installation Link |
|---|---|---|
| Docker | Build and run local containers; optional for K3s itself, useful for GPU test containers | Docker |
| K3s | Runs a local Kubernetes cluster | K3s |
| kubectl | Command-line tool to interact with Kubernetes clusters | kubectl |
| Helm 3.x | Package manager for Kubernetes, used to install and manage applications | Helm |
Note: K3s and Helm are installed in later steps of this guide. You do not need to pre-install them manually unless you prefer to manage versions yourself.
Tracebloc Account
- Client ID and password from the tracebloc client view
- Docker Hub registry credentials (username, token/password, email)
Install Helm
Install via script and verify the installation:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
Recommended for Monitoring: Use k9s
You can use k9s, a terminal-based Kubernetes dashboard, to monitor jobs, pods, and logs in real time. Run k9s -n <NAMESPACE>
to get a live view of resources, switch between them instantly, and inspect logs or events with a few keystrokes. Compared to kubectl, it is faster and more convenient.
Components
The local deployment consists of two main parts: Core Infrastructure and Client Deployment. Together, they provide a secure, scalable environment for running ML workloads with the tracebloc client.
Core Infrastructure
Local Kubernetes cluster: Single-node K3s cluster, lightweight but fully compatible with upstream Kubernetes.
Container runtime: containerd by default; GPU integration requires the NVIDIA runtime setup
Persistent Storage: hostPath-based volumes for datasets, logs, and MySQL metadata
Local Networking: cluster DNS for pod-to-pod and service communication.
Client Deployment
tracebloc Client Application: Deployed into the local K3s cluster using Helm, configured with credentials, registry access, and storage.
Monitoring and Verification: Kubernetes tools to confirm pods, services, and persistent volumes are running correctly.
Process Flow
- Infrastructure Setup – Create local cluster, configure storage and networking
- Client Deployment – Configure and deploy the tracebloc client application
- Verification – Validate that everything is running correctly
Security
Security is a core consideration even when running workloads on a local machine. This setup follows Kubernetes best practices to ensure your data remains protected and workloads are isolated, while still allowing external models to be evaluated safely.
Data Protection
- Data Locality: Training data stays on your local machine.
- Model Encryption: Model weights are encrypted and never leave your environment.
- Secure Communication: All communication is protected with TLS.
Tracebloc Backend Security
- Code Analysis: Submitted models are scanned with Bandit for vulnerabilities.
- Input Validation: Training scripts and code are validated before execution.
- Container Isolation: Training runs inside Docker containers with restricted system access.
- Namespace Isolation: Kubernetes namespace separates workloads from other applications.
Required Outbound Access
- *.docker.io – Container image downloads
- *.tracebloc.io – Platform communication
- pypi.org and similar – Dependency installation during training
Proxy Support
If your machine is behind a proxy, configure the following in values.yaml:
HTTP_PROXY_HOST: your-proxy.company.com
HTTP_PROXY_PORT: 8080
Quick Setup
Purpose
Spin up a K3s cluster and deploy the tracebloc client in one go.
What the script does:
- Starts K3s with recommended resources
- Creates hostPath storage directories (datasets, logs, MySQL)
- Deploys the tracebloc client with Helm
Run the setup script (example placeholder, adjust to your needs):
# Download and run the automated setup script
tbd
Cleanup:
Delete K3s cluster and remove created directories.
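A minimal cleanup sketch, assuming the default K3s installer location and the data directories created by this setup (adjust paths if your script differs):
# Remove K3s, its service, and cluster state (script installed by the K3s installer)
sudo /usr/local/bin/k3s-uninstall.sh
# Remove the hostPath storage directories
rm -rf data/shared-data data/logs data/mysql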
Tips:
- Network model: local Docker bridge networking.
- Resource limits: increase K3s memory/CPU for heavy workloads.
If you prefer more control and customization, follow the detailed step-by-step guide below.
Detailed Setup
This section walks through a step-by-step build with K3s and kubectl. It mirrors the Quick Setup but lets you choose your own resource settings (CPUs, memory, storage paths, namespace, Helm release name, etc.). Expect about 1 hour end-to-end.
What you’ll do (Steps 1–4):
- Local Cluster Setup — Start a single-node Kubernetes cluster with K3s. Configure CPU and memory to match your workload requirements; K3s ships with containerd as its container runtime, so Docker is not required for the cluster itself. This provides a functional control plane and worker node in one instance.
- Storage — Create hostPath directories on your machine for datasets, logs, and MySQL files. Mount them as persistent volumes in Kubernetes so data, logs, and model artifacts persist across pod restarts.
- Client Configuration — Add the tracebloc Helm repository and generate a values.yaml file. Configure authentication credentials, registry access, and storage paths so the client can run securely in your environment.
- Client Deployment — Install the Helm chart into your chosen namespace. Deploy and verify that the tracebloc client pods and persistent volumes are running and bound correctly.
Helm Usage: Helm is used to install and manage Kubernetes applications. In steps 3 and 4 you will deploy the tracebloc client via the tracebloc/bm (bare metal) chart from the tracebloc Helm repository.
1. Local Cluster Setup
Set up a Local Kubernetes Cluster
A local Kubernetes cluster is the foundation of this deployment. K3s packages the control plane and worker node components into a single lightweight binary, typically using containerd as the runtime. It runs directly on bare metal or inside a VM, giving you a fully functional Kubernetes environment without the overhead of a cloud provider.
Resource allocation in K3s
K3s does not impose fixed limits on CPU or memory the way Minikube does. Instead, the cluster adapts to the host: a small VM with 2 vCPUs and 4 GB RAM provides exactly that capacity to K3s, while a physical machine with 16 cores and 64 GB RAM makes the entire resource pool available. Make sure the host provides sufficient resources for your data and the model types you expect to train on your infrastructure.
Install K3s
To install and launch, run:
curl -sfL https://get.k3s.io | sh -
sudo systemctl enable k3s
sudo systemctl status k3s
It runs the K3s server process, which includes both control plane and worker components.
- enable ensures K3s starts automatically on reboot
- status confirms the service is up and running
Export kubeconfig:
K3s stores the cluster’s kubeconfig in /etc/rancher/k3s/k3s.yaml. By exporting it, you let kubectl communicate with your local cluster.
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
Without this step, kubectl will not know where your cluster is. Verify cluster status and that your node has successfully joined the cluster:
kubectl get nodes
You should see STATUS="Ready", which means the cluster is running.
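The export only lasts for the current shell session. If you want kubectl to find the K3s cluster in new shells as well, one option (assuming bash) is to append the export to your profile:
# persist the kubeconfig path for future shells
echo 'export KUBECONFIG=/etc/rancher/k3s/k3s.yaml' >> ~/.bashrc
source ~/.bashrc
# note: this file is root-readable by default; adjust permissions or run kubectl with sudo if needed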
Set namespace and node labels
Create a namespace for your workloads:
kubectl create namespace <NAMESPACE>
Namespaces let you isolate your workloads from system components and other running applications.
Add the mandatory node labels for workload management. Get the node name, then label the node:
kubectl get nodes
kubectl label node <NODENAME> type=system trainingset=<LABEL>
Set cpu or gpu as the label depending on your setup.
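For example, on a CPU-only machine whose node is named workstation-01 (a hypothetical node name), the commands would look like this:
kubectl label node workstation-01 type=system trainingset=cpu
# confirm the labels were applied
kubectl get nodes --show-labels | grep trainingset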
Recommended for Stability (Optional)
Disable Swap
Kubernetes schedules pods based on actual memory availability. If swap is enabled, the kernel may push memory pages to disk, making nodes appear healthier than they are. This leads to unpredictable performance, delayed OOM kills, and failed scheduling decisions. Disabling swap keeps resource usage accurate and stable:
sudo swapoff -a
swapoff -a disables swap memory on your machine, and commenting out the swap entry in /etc/fstab ensures it stays off after reboot.
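A small sketch for making the change permanent by commenting out the swap entry in /etc/fstab (back the file up first; the sed pattern assumes a standard whitespace-separated fstab):
sudo cp /etc/fstab /etc/fstab.bak
sudo sed -i '/\sswap\s/ s/^/#/' /etc/fstab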
Manage Multiple Clusters by Merging kubeconfig Files
By default, kubectl only talks to one cluster at a time. Merging configs lets you switch seamlessly between multiple clusters without constantly exporting environment variables or overwriting files. It keeps all your cluster contexts in one place, making kubectl config use-context possible.
mkdir -p ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config_k3s
sudo chmod 644 ~/.kube/config_k3s
sudo KUBECONFIG=~/.kube/config:~/.kube/config_k3s kubectl config view --flatten > /tmp/merged_kubeconfig
mv /tmp/merged_kubeconfig ~/.kube/config
These commands copy the K3s kubeconfig, fix its permissions, then merge it with any existing kubeconfig (for example from Minikube or EKS) into a single ~/.kube/config file.
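With the merged file in place you can list and switch contexts directly. K3s normally registers its context as default, but verify the name on your machine first:
kubectl config get-contexts
kubectl config use-context default   # replace "default" with the K3s context name shown above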
Optional: GPU Enablement
This section enables GPU support for your Kubernetes cluster. You first make GPUs visible to containers on the host, then deploy NVIDIA’s GPU Operator so Kubernetes can schedule GPU workloads reliably.
Verify GPU Availability
Check if GPUs are detected:
nvidia-smi
Running nvidia-smi checks whether your NVIDIA GPU is visible to the system and shows the driver version, GPU utilization, and available memory.
Install NVIDIA Container Toolkit
The NVIDIA Container Toolkit lets container runtimes like Docker or containerd access your GPU and run CUDA workloads instead of being stuck on CPU. Kubernetes can only schedule GPU workloads if the runtime knows how to expose GPU devices to pods.
Add NVIDIA’s official package repository to your system:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
It adds NVIDIA’s signed repository so apt trusts it.
Update package index, then pull and install the toolkit:
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
nvidia-container-cli --version
Configure Docker runtime:
By default, Docker cannot pass GPUs into containers. Without this step, even if your host sees the GPU, your containers remain GPU-blind. Restarting Docker makes sure the runtime hook is actually loaded.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
nvidia-ctk runtime configure --runtime=docker tells Docker to register the NVIDIA runtime so it can launch GPU-enabled containers; systemctl restart docker applies this change.
Run GPU test container
Let’s verify GPU access inside a container and check that the NVIDIA drivers, container toolkit, and runtime are correctly configured:
docker run --rm --gpus all nvidia/cuda:12.6.2-cudnn-runtime-ubuntu22.04 nvidia-smi
Launches a temporary CUDA container with GPU access enabled and executes nvidia-smi inside it, confirming end-to-end GPU visibility from host to container.
Deploy NVIDIA GPU Operator
The NVIDIA GPU Operator automates GPU management in Kubernetes. It installs and manages the device plugin, runtime components, monitoring stack, and optionally drivers. Without it, you’d have to manually configure each node to make GPUs schedulable.
Add NVIDIA Helm repository
Add NVIDIA’s official Helm repository so the GPU Operator can be installed with up-to-date charts:
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
Install GPU Operator
Create a dedicated namespace to keep GPU operator resources isolated from the rest of your cluster and deploy the operator so it can configure GPU runtime components across nodes:
helm install --wait gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=false
It installs the GPU Operator via Helm into the gpu-operator namespace and waits for the pods to be ready. The flag driver.enabled=false skips the automatic driver install in case your nodes already have the correct NVIDIA drivers.
Verify Operator deployment
Check pod and node resources to ensure GPUs are now exposed to Kubernetes:
kubectl get pods -n gpu-operator
Expect them to be in Running or Completed state.
Verify node configuration:
kubectl describe nodes | grep nvidia
Expect to see labels and allocatable resources like nvidia.com/gpu: 1, confirming that Kubernetes can schedule GPU workloads on your nodes.
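As an optional end-to-end check, you can run a short-lived pod that requests one GPU and prints nvidia-smi. This is a hedged sketch: the pod name is arbitrary, and it assumes the GPU Operator has created the nvidia RuntimeClass in your cluster:
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
  - name: cuda
    image: nvidia/cuda:12.6.2-cudnn-runtime-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# once the pod has completed, inspect the output and clean up
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test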
2. Storage
Unlike cloud deployments that use managed storage services, local deployments use hostPath volumes, which are real paths on your local machine. Create persistent storage directories for:
- shared-data: training datasets (json, txt, jpeg, png, etc.), plus model checkpoints and weights during training
- logs: application and pod logs for tracebloc client operations
- mysql: metadata and training data labels
mkdir -p data/shared-data
mkdir -p data/logs
mkdir -p data/mysql
These directories will be mounted as Persistent Volumes in Kubernetes.
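Since hostPath volumes live on the host filesystem, it helps to note the absolute paths now (values.yaml in the next step expects absolute hostPath entries) and to confirm there is enough free disk space:
# print the absolute paths you will reference in values.yaml
readlink -f data/shared-data data/logs data/mysql
# check free space on the disk backing these directories
df -h data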
3. Client Configuration
You can now prepare the Helm chart that deploys the tracebloc client itself. The client is the component that runs your model evaluation and training jobs inside K3s. To install it, you use a Helm chart, which bundles all required Kubernetes manifests into a single, configurable package. A values.yaml file controls this deployment; in it you specify credentials (to authenticate with tracebloc), registry access (to pull container images), and storage settings. This configuration ensures the client can securely connect, schedule workloads, and store results.
Add Helm Repository
Install and update the tracebloc client using Helm instead of managing raw YAML. For details, refer to the public GitHub Repository.
helm repo add tracebloc https://tracebloc.github.io/client/
helm repo update
Adds the official tracebloc Helm repository to your local configuration so you can install the client with a single Helm command.
Configure your Deployment Settings
Export the chart’s default configuration into a local file that you can edit:
helm show values tracebloc/bm > values.yaml
Downloads the default configuration template for the tracebloc client for bm (bare metal). Open and update the following sections in values.yaml:
Deployment Namespace
Use the defined namespace:
namespace: <NAMESPACE>
Defines where the client will be deployed.
Tracebloc Authentication
Client ID
Provide your client ID from the tracebloc client view.
jobsManager:
  env:
    CLIENT_ID: "<YOUR_CLIENT_ID>"
Client Password
Set create: true to generate the secret during installation:
# Secrets configuration
secrets:
  # Whether to create the secret or use existing secret
  create: true
  # Client password
  clientPassword: "<YOUR_CLIENT_PASSWORD>"
Docker Registry Configuration
The tracebloc client images are stored in a private container registry. Kubernetes needs valid Docker Hub credentials to pull these images onto your nodes.
dockerRegistry:
  create: true
  secretName: regcred
  server: https://index.docker.io/v1/
  username: <DOCKER_USERNAME>
  password: <DOCKER_PASSWORD or TOKEN>
  email: <DOCKER_EMAIL>
- DOCKER_USERNAME: Docker Hub username
- DOCKER_PASSWORD: Password or access token (if 2FA enabled)
- DOCKER_TOKEN: Alternative token for automation or personal access
- DOCKER_EMAIL: Email linked to your Docker account
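If Docker is installed locally, an optional sanity check is to log in once with the same credentials before deploying, so that image-pull failures in the cluster can be ruled out early:
docker login -u <DOCKER_USERNAME>
# enter the password or access token when prompted; log out again afterwards if you prefer
docker logout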
Storage
Your workloads need persistent storage that survives pod restarts and can be shared across all nodes. K3s uses the local file system:
storageClass:
  # Set to true to create a new storage class, false to use an existing one. Be careful not to overwrite existing data files when setting it to true.
  create: true
  parameters:
    osType: <darwin|linux|windows>
sharedData:
  name: shared-data
  storage: 50Gi
  hostPath: "/absolute/path/to/data/shared-data"
logsPvc:
  name: logs-pvc
  storage: 10Gi
  hostPath: "/absolute/path/to/data/logs"
mysqlPvc:
  name: mysql-pvc
  storage: 2Gi
  hostPath: "/absolute/path/to/data/mysql"
Set Resource Limits at Pod/Container Level
Kubernetes schedules pods based on requests and enforces limits; this prevents one workload from starving the node. With single-node K3s this is critical, since all pods share the same machine's resources. If you do not set requests, the scheduler can overcommit; if you do not set limits, one training job can monopolize the node. Size requests to what the job needs, and keep limits slightly above that but within the actual host capacity.
Size pod-level requests and limits according to your actual training requirements while keeping them well within your system's capacity. Kubernetes then schedules pods only if the requested resources are available:
RESOURCE_REQUESTS: "cpu=2,memory=8Gi"
RESOURCE_LIMITS: "cpu=2,memory=8Gi"
GPU_REQUESTS: "" # for GPU support, set "nvidia.com/gpu=1"
GPU_LIMITS: "" # for GPU support, set "nvidia.com/gpu=1"
# Optional: for GPU support with k3s, uncomment below:
# RUNTIME_CLASS_NAME: "nvidia"
Optional for GPU setup:
- set GPU_REQUESTS and GPU_LIMITS to nvidia.com/gpu=1
- set RUNTIME_CLASS_NAME: "nvidia"
These values define pod-level resource allocations. Kubernetes then schedules pods onto the node if the requested resources are available, otherwise the pod remains in a "Pending" state. Resource limits are optional, but they cap the maximum compute a pod can use, preventing it from consuming more than the defined values and starving other workloads.
In short: set pod requests to what the training actually needs and keep limits well within your system's capacity.
Tip: Estimating VRAM requirements for LLMs can be tricky. Use the VRAM Calculator to approximate memory needs for different model sizes and batch configurations.
Node, Pod, Job relationship
- A single-node K3s install runs all pods on that one node (although K3s can scale out with additional agent nodes).
- One pod contains one or more containers
- One training run usually equals one Job, which by default creates one pod
Pods are lightweight and share the same node, but all of them must fit within the actual CPU and memory capacity of the host (or the VM, if K3s runs inside one).
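To size requests and limits sensibly, check what the node actually exposes before editing values.yaml; a quick sketch (replace <NODENAME> with the name from kubectl get nodes):
kubectl describe node <NODENAME> | grep -A 6 Allocatable
nproc      # CPU cores on the host
free -h    # memory on the host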
Proxy Settings (Optional)
These settings are only needed if your machine connects to the internet through a corporate or institutional proxy or firewall.
# proxy hostname.
HTTP_PROXY_HOST:
# proxy port.
HTTP_PROXY_PORT:
# username used for proxy authentication if needed.
HTTP_PROXY_USERNAME:
# password used for proxy authentication if needed.
HTTP_PROXY_PASSWORD:
If you are working on a private laptop or server with a direct internet connection, you can leave these fields empty.
4. Client Deployment
With the configuration ready, deploy the tracebloc client into your K3s cluster using Helm.
Deploy the client with Helm
Install the tracebloc Helm chart into the specified namespace using your customized values file:
helm install <RELEASE_NAME> tracebloc/bm \
--namespace <NAMESPACE> \
--values values.yaml
It creates:
- One MySQL pod: mysql-...
- One tracebloc client pod: tracebloc-jobs-manager-...
- Supporting resources such as a Service, ConfigMap, Secret, and PVC
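You can confirm that the release itself was registered with Helm:
helm list -n <NAMESPACE>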
Verification and Maintenance
After deployment, confirm the client is running correctly and learn how to maintain it.
Verify Deployment
Check that all pods are running in your namespace:
Check pod status:
kubectl get pods -n <NAMESPACE>
A few minutes after running the install command, it should show two pods with status "Running". If pods are not running, check the pod logs for error details.
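If a pod stays in Pending, ImagePullBackOff, or CrashLoopBackOff, the usual first steps are to read its events and logs (placeholders are your pod name and namespace):
kubectl describe pod <POD_NAME> -n <NAMESPACE>   # events show scheduling and image-pull problems
kubectl logs <POD_NAME> -n <NAMESPACE>           # application-level errors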
Check services:
kubectl get services -n <NAMESPACE>
It should show the database service "mysql-".
Check persistent volumes:
kubectl get pvc -n <NAMESPACE>
Verifies that all persistent volume claims are bound and storage is available.
Check registry secret:
kubectl get secret regcred -n <NAMESPACE>
Verifies that the Docker registry secret was created successfully for pulling container images.
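If image pulls fail even though the secret exists, you can decode it to confirm which registry and username it actually contains (the output includes credentials, so treat it as sensitive):
kubectl get secret regcred -n <NAMESPACE> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d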
Maintenance
Update your values:
helm show values tracebloc/bm > new-values.yaml
Edit new-values.yaml with your changes.
Upgrade the deployment:
helm upgrade <RELEASE_NAME> tracebloc/bm \
--namespace <NAMESPACE> \
--values new-values.yaml
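After upgrading, you can confirm that the release is healthy and review its revision history:
helm status <RELEASE_NAME> -n <NAMESPACE>
helm history <RELEASE_NAME> -n <NAMESPACE>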
Uninstall
Remove the Helm release:
helm uninstall <RELEASE_NAME> -n <NAMESPACE>
Clean up persistent resources (optional):
kubectl delete pvc --all -n <NAMESPACE>
kubectl delete namespace <NAMESPACE>
Troubleshooting
Fix NVIDIA Container Runtime with Containerd (if K3s errors) (Optional for GPU)
If K3s fails with:
failed to load TOML from /etc/containerd/config.toml: invalid disabled plugin URI "cri>
the containerd configuration file is corrupted. Regenerate a clean config and re-apply the NVIDIA runtime:
Backup the broken config
Preserve the existing but corrupted file in case you need to inspect or restore it later:
sudo mv /etc/containerd/config.toml /etc/containerd/config.toml.bak
Moves the broken config.toml out of the way and saves it as a backup.
Generate a clean default config
containerd requires a valid configuration file to start, generating a new one ensures a consistent baseline:
sudo containerd config default | sudo tee /etc/containerd/config.toml
Creates a fresh config.toml with default settings and writes it to the correct location.
Apply NVIDIA runtime:
Integrates the NVIDIA runtime with containerd so it can launch GPU-enabled workloads:
sudo nvidia-ctk runtime configure --runtime=containerd
Updates the new config file with the required NVIDIA runtime entries.
Reload and restart containerd
containerd must reload its configuration to apply changes, restarting ensures the service runs with the corrected settings:
sudo systemctl daemon-reload
sudo systemctl restart containerd
sudo systemctl status containerd
Reloads systemd, restarts containerd, and shows the service status. If successful, you will see Active: active (running).
Next Steps
Create a New Use Case
- Prepare your dataset - Upload and configure your training data
- Create an AI use case - Set up a new use case
Join an Existing Use Case
- Explore and join available use cases - Browse ongoing AI projects
- Start training models - Begin training on shared datasets
Need Help?
- Email: support@tracebloc.io
- Docs: tracebloc Documentation Portal