# Amazon EKS

> Deploy a tracebloc workspace on Amazon EKS using the AWS CLI — networking, GPU support, storage, and security for a production cluster.

## Overview

<Note>
  **Use EKS for production** — multi-node, autoscaling, or shared GPU clusters on AWS. For a single machine (a laptop or one server), the [local installer](/environment-setup/setup-guide) is simpler and faster.
</Note>

Running machine learning workloads in the cloud often requires a reliable, secure, and scalable infrastructure—yet setting it up can be complex. This guide walks you through building a complete Amazon EKS (Elastic Kubernetes Service) environment from scratch using the AWS CLI. By following these steps, you'll create a production-ready foundation with networking, GPU-optional compute, storage, and security fully aligned with AWS and Kubernetes best practices.

Once the infrastructure is in place, you'll deploy and configure the tracebloc client to securely train and benchmark AI models. This setup ensures that your proprietary data stays within your environment, while still allowing external AI models to be tested and fine-tuned in a controlled, isolated way. The result: a scalable, secure platform for high-performance ML workloads that accelerates collaboration with external experts while maintaining full control over your data and IP.

The entire setup takes \~1–2 hours.

If the cluster is already up and you are just adding another client to it, skip the cluster-creation steps and go straight to ["Client Configuration"](#5-client-configuration).

## Prerequisites

Before proceeding with the EKS infrastructure setup, make sure the following requirements are met:

### AWS Setup

#### 1. Create or use an AWS Account

* If you don't already have one, [create an AWS account](https://signin.aws.amazon.com/signup?request_type=register).
* Your account must have permissions to create and manage EKS resources, VPCs, EC2 instances, and IAM roles.

#### 2. Install and configure AWS CLI

* [Install the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) on your local machine.
* Configure your credentials with:

```bash theme={null}
aws configure
```

This prompts you to enter your Access Key ID, Secret Access Key, default region (recommended: `eu-central-1`), and output format.

#### 3. Verify your AWS CLI configuration

Check if your credentials and region are set correctly:

```bash theme={null}
aws configure list
```

If needed, set the region explicitly:

```bash theme={null}
aws configure set region eu-central-1
```

#### Required Permissions

Requires permissions for:

* Amazon EKS cluster management
* VPC and networking resources
* EC2 instances and security groups
* IAM roles and policies

### Required Tooling & Tracebloc Account

* **Helm 3.x**: Install Helm on your local machine. [Installation Guide](https://helm.sh/docs/intro/install/)
* **kubectl**: Install kubectl to interact with your EKS cluster. [Installation Guide](https://kubernetes.io/docs/tasks/tools/)
* **Tracebloc Account**: You will need your Client ID and Client Password ([from the tracebloc client view](https://ai.tracebloc.io/clients)).
* **Docker Registry Credentials**: Docker Hub username, password/token, and email for pulling container images.
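
Before continuing, it can help to confirm that the tools listed above are actually on your `PATH`. The following is a minimal sketch; the function name `check_tools` is illustrative and not part of any tracebloc or AWS tooling.

```bash
# Report which of the given command-line tools are available on PATH.
# Returns 0 if every tool is found, 1 if at least one is missing.
check_tools() {
  local missing=0
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   check_tools aws kubectl helm
```

A non-zero exit status means at least one tool still needs to be installed.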

### Recommended for Monitoring: Use k9s

You can use [k9s](https://k9scli.io/), a terminal-based Kubernetes dashboard, to monitor jobs, pods, and logs in real time. Run `k9s -n <NAMESPACE>` to get a live view of resources, switch between them instantly, and inspect logs or events with a few keystrokes. For interactive monitoring, this is often quicker than issuing individual kubectl commands.

### Components

The EKS infrastructure setup consists of two main parts: Core Infrastructure and Client Deployment. Together, they provide a secure, scalable environment for running ML workloads with the tracebloc client.

#### Core Infrastructure

**VPC with isolated networking**
Creates a dedicated, secure network environment with subnets across multiple Availability Zones for high availability.

**EKS Cluster**
A managed Kubernetes control plane with auto-scaling nodegroups to run your ML workloads.

**IAM Roles and Policies**
Granular roles for the cluster and worker nodes following least-privilege principles.

**Security Groups**
Network-level access controls for pods, nodes, and storage.

**EFS Storage System**
Shared persistent storage for training datasets and model artifacts, accessible from all nodes.

#### Client Deployment

**tracebloc Client Application**
Deployed into the EKS cluster using Helm, configured with your account credentials and storage.

**Monitoring and Verification**
Tools for validating that your cluster, workloads, and client are running correctly.

By combining these components, you get a production-ready Kubernetes environment tailored for secure, high-performance machine learning workloads.

### Process Flow

1. **Infrastructure Setup** - Create VPC, networking, EKS cluster, and storage
2. **Client Deployment** - Configure and deploy the tracebloc application
3. **Verification** - Validate everything is running correctly

## Security

Security is a core consideration when setting up and running ML workloads on EKS. This setup follows AWS and Kubernetes best practices to ensure data protection, secure communication, and controlled access at every layer.

### Data Protection

* **Data Locality**: Training data always remains within your infrastructure. Only pre-defined metrics and logs are shared externally.
* **Model Encryption**: Model weights are encrypted and never leave your environment.
* **Secure Communication**: All platform communication is protected with TLS.

### tracebloc Backend Security

* **Code Analysis**: Submitted external models are scanned with the Bandit library to detect vulnerabilities or malicious code.
* **Input Validation**: Training scripts and model code undergo strict validation before execution.
* **Sandboxed Execution**: Training workloads run in isolated environments with restricted system access.
* **Namespace Isolation**: Kubernetes namespaces ensure logical separation from other applications.

### Identity & Access Management

**AWS IAM Roles:**

* **Cluster Role**: Minimal permissions (`AmazonEKSClusterPolicy`) for cluster management.
* **Nodegroup Role**: Least-privilege permissions for worker nodes, networking, container registry, and storage access.

**Kubernetes RBAC:**

* Cluster roles provide limited permissions for jobs, pods, and deployments.
* Service accounts are bound to roles within their namespace, preventing cross-namespace access.

### Network Security

* **VPC Isolation**: Dedicated VPC with subnets across three Availability Zones and an Internet Gateway for controlled communication.
* **Security Groups**: Restrict access to EFS mount targets and cluster traffic.
* **Outbound Access**: Limited to required destinations such as `*.amazonaws.com`, `*.docker.io`, and `*.tracebloc.io`.

Together, these measures ensure that external models can be deployed safely into your EKS environment without exposing proprietary data or infrastructure.

## Quick Setup

Quick Setup runs an automated script that builds the whole cluster in one go. Want step-by-step control (or to customize networking)? Use [Detailed Setup](#detailed-setup) instead.

### Purpose

Spin up a production-ready EKS baseline (VPC, subnets, internet gateway, EKS cluster, managed nodegroup, EFS + CSI driver) in one go. Includes basic validation, colored logging, and a cleanup mode.

### What this script does

* Creates networking (VPC + 3 public subnets + route to IGW)
* Provisions an EKS cluster and a system nodegroup (t3.medium, 2–5 nodes)
* Creates EFS and mount targets in all AZs, sets up the EFS CSI driver + storage class
* Updates your kubeconfig for kubectl access
* Prints a summary and next steps

### Prerequisites

* AWS CLI installed and configured (able to run `aws sts get-caller-identity`)
* kubectl installed and on PATH
* Permissions to create EKS, EC2/VPC, EFS, and IAM resources in the target account
* (Recommended) Helm installed for later app deployment

### How to run

1. Save the script below as `setup_eks.sh`
2. Make it executable: `chmod +x setup_eks.sh`
3. (Optional) Edit the configuration variables at the top (REGION, CLUSTER\_NAME, OWNER\_TAG, etc.)
4. Execute: `./setup_eks.sh`
5. When finished, follow the printed "Next steps" to create your Docker registry secret and deploy workloads

### Cleanup (teardown)

Run `./setup_eks.sh cleanup` to remove cluster, nodegroup, EFS, subnets, gateway, roles, and VPC (irreversible).

### Tips:

* **Costs**: This creates billable resources (EC2, EKS, EFS, data transfer). Remove when not needed.
* **Network model**: Subnets are configured to auto-assign public IPs for simplicity. Adjust to private subnets + NAT as needed.
* **Kubernetes version**: The script requests `--kubernetes-version 1.32`; update if your region/account supports a different current version.
* **Security hardening**: This is a production baseline; harden further for your environment (security groups, private subnets, IRSA, Pod Security/OPA).

If you prefer more control over your setup and want to customize the environment to your needs, follow the step-by-step guide below.

## Detailed Setup

This section walks through a step-by-step build with AWS CLI and kubectl. It mirrors the Quick Setup but lets you choose your own resource settings (# of CPUs, Memory, VPC CIDRs, instance types, namespace, Helm release name, StorageClass, etc.). Expect about 1–2 hours end-to-end.

### What you'll do (Steps 1–6):

1. **VPC & Network Configuration** — Create an isolated Virtual Private Cloud (VPC) with subnets distributed across three availability zones for high availability, an Internet Gateway for outbound connectivity, and routing tables with proper associations. This provides network isolation and fault tolerance for your cluster infrastructure.
2. **EKS Cluster Setup** — Create the cluster service role and provision the EKS control plane which manages the Kubernetes API server.
3. **EKS Nodegroup Setup** — Create a node service role and provision two managed nodegroups: a system nodegroup for Kubernetes system components and a training nodegroup for ML workloads. Each nodegroup can be customized with different instance types, scaling parameters, and capacity types depending on your workloads and data types.
4. **Storage** — Create an Amazon EFS file system for shared persistent storage, identify the security group that allows NFS traffic from the cluster nodes, and create mount targets in each availability zone. Then install the Amazon EFS CSI (Container Storage Interface) driver, which lets Kubernetes create and mount EFS volumes. This provides scalable, shared storage for training data, model weights, and logs.
5. **Client Configuration** — Add the tracebloc Helm repository and prepare your deployment values (authentication credentials, registry access, storage settings, resource limits).
6. **Client Deployment** — Install the chart into your chosen namespace with the configured values, then verify that all pods are running and persistent volume claims are properly bound.

**Helm Usage**: Helm is used to install and manage Kubernetes applications. In steps 5 and 6 you will deploy the tracebloc client via the unified `tracebloc/client` chart with EKS-specific values.

## 1. VPC and Network Configuration

Your AWS EKS cluster must run in a secure and isolated network with subnets across availability zones, DNS resolution, and internet access. This is provided by an AWS VPC (Virtual Private Cloud), which is a logically isolated section of AWS where you define your own IP address range (CIDR block), subnets, routing, and gateways.
This section sets up the foundation.

### VPC Creation

A VPC provides network isolation, so your workloads are not exposed to the public internet by default.

```bash theme={null}
aws ec2 create-vpc \
  --cidr-block 10.0.0.0/16 \
  --tag-specifications "ResourceType=vpc,Tags=[{Key=owner,Value=<OWNER_NAME>}]"
```

Creates an isolated network environment where your EKS cluster runs securely. Uses CIDR (Classless Inter-Domain Routing) block `10.0.0.0/16` to define the IP address range for all resources in the VPC and tags it with the specified owner name. Replace `<OWNER_NAME>` with your preferred identifier. Keep the output and note the `VPC_ID` for the next steps.
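
Instead of copying the ID out of the JSON output by hand, you can let the AWS CLI print only the ID by using its global `--query` and `--output text` options. The sketch below wraps the same `create-vpc` call; the function name `create_vpc` and the variable `VPC_ID` are illustrative.

```bash
# Create the VPC and print only its ID, so it can be stored in a variable.
# Arguments: $1 = owner tag value (same meaning as <OWNER_NAME> above).
create_vpc() {
  local owner="$1"
  aws ec2 create-vpc \
    --cidr-block 10.0.0.0/16 \
    --tag-specifications "ResourceType=vpc,Tags=[{Key=owner,Value=${owner}}]" \
    --query 'Vpc.VpcId' \
    --output text
}

# Example:
#   VPC_ID="$(create_vpc my-team)"
#   echo "Created VPC: ${VPC_ID}"
```

The same `--query ... --output text` pattern works for the other resource IDs you need to carry forward in this guide.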

### Enable DNS Hostnames

DNS (Domain Name System) support and hostnames let Kubernetes pods and services find each other by name. Kubernetes service discovery relies on this DNS resolution.

```bash theme={null}
aws ec2 modify-vpc-attribute --vpc-id <VPC_ID> --enable-dns-hostnames
```

Enables DNS resolution so nodes and services can resolve each other by name instead of IP addresses. Find the `VPC_ID` from the previous step or list VPCs using `aws ec2 describe-vpcs`.

### Create Subnets Across Availability Zones

Subnets (smaller ranges of IP addresses inside the VPC) are spread across multiple AZs (Availability Zones) to ensure high availability and fault tolerance. EKS requires subnets in at least two Availability Zones; using three provides better fault tolerance and load distribution.

```bash theme={null}
aws ec2 create-subnet \
  --vpc-id <VPC_ID> \
  --cidr-block 10.0.1.0/24 \
  --availability-zone eu-central-1a \
  --tag-specifications "ResourceType=subnet,Tags=[{Key=owner,Value=<OWNER_NAME>}]"

aws ec2 create-subnet \
  --vpc-id <VPC_ID> \
  --cidr-block 10.0.2.0/24 \
  --availability-zone eu-central-1b \
  --tag-specifications "ResourceType=subnet,Tags=[{Key=owner,Value=<OWNER_NAME>}]"

aws ec2 create-subnet \
  --vpc-id <VPC_ID> \
  --cidr-block 10.0.3.0/24 \
  --availability-zone eu-central-1c \
  --tag-specifications "ResourceType=subnet,Tags=[{Key=owner,Value=<OWNER_NAME>}]"
```

Creates three subnets with separate CIDR blocks (`10.0.1.0/24`, `10.0.2.0/24`, `10.0.3.0/24`) distributed across different availability zones. Use or create tags if needed. Register the generated `SUBNET_IDs` for the following steps.
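
The three calls above differ only in the Availability Zone suffix and the third octet of the CIDR block, so they can be expressed as a loop. This is a sketch under the same assumptions as the commands above (region `eu-central-1`, zones `a` to `c`, CIDRs `10.0.1.0/24` to `10.0.3.0/24`); the function names are illustrative.

```bash
REGION="eu-central-1"   # region used throughout this guide

# Print one "<availability-zone> <cidr-block>" pair per line,
# matching the three subnets created above.
subnet_plan() {
  local i=1
  local suffix
  for suffix in a b c; do
    printf '%s%s 10.0.%d.0/24\n' "$REGION" "$suffix" "$i"
    i=$((i + 1))
  done
}

# Create all planned subnets in the given VPC and print each new subnet ID.
# Arguments: $1 = VPC ID, $2 = owner tag value.
create_subnets() {
  local vpc_id="$1" owner="$2"
  local az cidr
  subnet_plan | while read -r az cidr; do
    aws ec2 create-subnet \
      --vpc-id "$vpc_id" \
      --cidr-block "$cidr" \
      --availability-zone "$az" \
      --tag-specifications "ResourceType=subnet,Tags=[{Key=owner,Value=${owner}}]" \
      --query 'Subnet.SubnetId' \
      --output text
  done
}

# Example:
#   subnet_plan                      # review the plan first
#   create_subnets "$VPC_ID" my-team
```

Running `subnet_plan` on its own lets you review the zone and CIDR pairs before any resources are created.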

### Enable Public IPs on Subnets

In this public-subnet layout, EKS worker nodes need public IPs to communicate with the EKS control plane, download container images from Docker Hub, and allow external traffic to reach applications.

```bash theme={null}
aws ec2 modify-subnet-attribute --subnet-id <SUBNET_ID_1A> --map-public-ip-on-launch
aws ec2 modify-subnet-attribute --subnet-id <SUBNET_ID_1B> --map-public-ip-on-launch
aws ec2 modify-subnet-attribute --subnet-id <SUBNET_ID_1C> --map-public-ip-on-launch
```

Configures each subnet to automatically assign public IP addresses to new EC2 instances.

### Create and Attach an Internet Gateway

An Internet Gateway (IGW) connects your VPC to the internet so nodes can pull container images and reach AWS APIs.

```bash theme={null}
aws ec2 create-internet-gateway \
  --tag-specifications "ResourceType=internet-gateway,Tags=[{Key=owner,Value=<OWNER_NAME>}]"
```

Note the generated `INTERNET_GATEWAY_ID` for the next step:

```bash theme={null}
aws ec2 attach-internet-gateway \
  --vpc-id <VPC_ID> \
  --internet-gateway-id <INTERNET_GATEWAY_ID>
```

### Create Default Route Table to Internet Gateway

A route table defines how network traffic is directed within the VPC and out through the IGW.

```bash theme={null}
aws ec2 describe-route-tables \
  --region eu-central-1 \
  --filters Name=vpc-id,Values=<VPC_ID> \
  --query "RouteTables[].RouteTableId" \
  --output table
```

This lists the route table IDs linked to your VPC. Pick the main route table ID you want to update, then add a default route:

```bash theme={null}
aws ec2 create-route \
  --route-table-id <ROUTE_TABLE_ID> \
  --destination-cidr-block 0.0.0.0/0 \
  --gateway-id <INTERNET_GATEWAY_ID>
```

This adds a rule that sends all internet-bound traffic (0.0.0.0/0) through the IGW, enabling external connectivity for your nodes.
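
If the VPC has more than one route table, the `association.main` filter of `describe-route-tables` can narrow the list to the main one, and a JMESPath query can confirm where the default route points afterwards. The following is a sketch; the function names are illustrative.

```bash
# Print the ID of the main route table of a VPC.
# Arguments: $1 = VPC ID.
main_route_table_id() {
  aws ec2 describe-route-tables \
    --filters "Name=vpc-id,Values=$1" "Name=association.main,Values=true" \
    --query 'RouteTables[0].RouteTableId' \
    --output text
}

# Print the gateway targeted by the default route (0.0.0.0/0) of a route table.
# An ID starting with "igw-" means the route points at an Internet Gateway.
# Arguments: $1 = route table ID.
default_route_target() {
  aws ec2 describe-route-tables \
    --route-table-ids "$1" \
    --query "RouteTables[0].Routes[?DestinationCidrBlock=='0.0.0.0/0'].GatewayId" \
    --output text
}

# Example:
#   ROUTE_TABLE_ID="$(main_route_table_id "$VPC_ID")"
#   default_route_target "$ROUTE_TABLE_ID"
```

If `default_route_target` prints nothing, the default route has not been created yet.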

## 2. EKS Cluster Setup

Amazon EKS provides the managed Kubernetes control plane. It needs permissions through AWS IAM (Identity and Access Management) to create and manage resources such as load balancers, security groups, and network interfaces.
The cluster itself is the logical container for the API server, linking your VPC networking with the control plane and IAM role. To interact with it you must update your local kubeconfig so kubectl can send commands. Setting this up gives you a working API server that can manage workloads once worker nodes are added in the next step.

### Create EKS Cluster Role

EKS needs an IAM role to manage your cluster's control plane and AWS resources:

```bash theme={null}
aws iam create-role \
  --role-name <ROLE_NAME> \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Action": "sts:AssumeRole",
      "Effect": "Allow",
      "Principal": {"Service": "eks.amazonaws.com"}
    }]
  }' \
  --tags Key=owner,Value=<OWNER_NAME>
```

Creates a role with the following parameters:

* `--role-name`: Name for your EKS service role
* `--assume-role-policy-document`: Trust policy allowing EKS to use this role
* `--tags`: Optional resource tags for organization

### Attach EKS Cluster Policy to Role

To give the EKS control plane the necessary permissions, you must attach the AmazonEKSClusterPolicy.

```bash theme={null}
aws iam attach-role-policy \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy \
  --role-name <ROLE_NAME>
```

Adds the AWS-managed EKS cluster policy to the role, giving the control plane its required permissions to manage networking, security groups, and load balancers.

### Create EKS Cluster

Create the EKS cluster to provision the managed Kubernetes control plane and connect it to your VPC, subnets, and IAM role.

```bash theme={null}
aws eks create-cluster \
  --name <CLUSTER_NAME> \
  --role-arn arn:aws:iam::<ACCOUNT_ID>:role/<ROLE_NAME> \
  --resources-vpc-config subnetIds=<SUBNET_ID_1A>,<SUBNET_ID_1B>,<SUBNET_ID_1C>,endpointPublicAccess=true,publicAccessCidrs=0.0.0.0/0 \
  --kubernetes-network-config serviceIpv4Cidr=172.20.0.0/16,ipFamily=ipv4 \
  --kubernetes-version 1.32 \
  --tags owner=<OWNER_NAME>
```

Creates the EKS cluster using the specified role, VPC subnets, and Kubernetes version. Set an appropriate `CLUSTER_NAME` and **expect this to take around 10 minutes**. If you do not have your account ID ready, run `aws sts get-caller-identity`.
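
Because cluster creation is asynchronous, it is useful to block until the control plane is ready before moving on. The sketch below looks up the account ID, builds the role ARN used in `--role-arn`, and waits with the AWS CLI waiter `aws eks wait cluster-active`; the function names are illustrative.

```bash
# Print the 12-digit account ID of the current AWS credentials.
account_id() {
  aws sts get-caller-identity --query 'Account' --output text
}

# Build the ARN of an IAM role.
# Arguments: $1 = account ID, $2 = role name.
role_arn() {
  printf 'arn:aws:iam::%s:role/%s\n' "$1" "$2"
}

# Block until the cluster reports ACTIVE, then print its status.
# Arguments: $1 = cluster name.
wait_for_cluster() {
  aws eks wait cluster-active --name "$1" &&
    aws eks describe-cluster --name "$1" --query 'cluster.status' --output text
}

# Example:
#   role_arn "$(account_id)" my-eks-cluster-role
#   wait_for_cluster my-cluster
```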

### Configure kubectl

Your local kubectl must know the cluster's API endpoint and credentials, otherwise you cannot interact with the cluster you just created:

```bash theme={null}
aws eks update-kubeconfig --region eu-central-1 --name <CLUSTER_NAME>
```

Updates your local kubeconfig for kubectl.

### Manage Multiple Clusters (Optional)

If you work with several clusters, make sure your context points to the right one. List all contexts:

```bash theme={null}
kubectl config get-contexts
```

Find your `CONTEXT_NAME`, then apply it:

```bash theme={null}
kubectl config use-context <CONTEXT_NAME>
```

This ensures subsequent kubectl commands apply to the intended cluster.

### Create Application Namespace

Namespaces let you isolate your workloads from system components and other applications running in the cluster. Define an appropriate namespace for your setup:

```bash theme={null}
kubectl create namespace <NAMESPACE>
```

Adds a dedicated namespace for your workloads, keeping them isolated from system components.

## 3. EKS Nodegroup Setup

Nodegroups are sets of EC2 instances that act as worker nodes, running your system and training workloads. Each nodegroup needs an IAM role so the nodes can join the cluster and interact with AWS services such as pulling images from registries, attaching EFS storage, and sending logs. It is best practice to create at least two groups: a small system nodegroup to host cluster services and a training nodegroup sized for your ML workloads. The training nodegroup can be CPU-based for general jobs or GPU-based for deep learning, ensuring that heavy jobs do not interfere with core Kubernetes functions.

### Create Nodegroup Role

Worker nodes need an IAM role to join the cluster and access AWS services.

```bash theme={null}
aws iam create-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": ["ec2.amazonaws.com"]
      }
    }]
  }' \
  --role-name <NODEGROUP_ROLE_NAME> \
  --tags Key=owner,Value=<OWNER_NAME>
```

Creates an IAM role that nodes can assume, allowing them to join the cluster and interact with AWS services. Create an appropriate `NODEGROUP_ROLE_NAME`.

### Attach Required Policies

Attach minimum required AWS-managed policies so the nodes have permissions:

```bash theme={null}
aws iam attach-role-policy \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy \
  --role-name <NODEGROUP_ROLE_NAME>

aws iam attach-role-policy \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly \
  --role-name <NODEGROUP_ROLE_NAME>

aws iam attach-role-policy \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy \
  --role-name <NODEGROUP_ROLE_NAME>
```

* **AmazonEKSWorkerNodePolicy:** lets nodes communicate with the cluster control plane
* **AmazonEC2ContainerRegistryReadOnly:** allows pulling images from ECR
* **AmazonEKS\_CNI\_Policy:** required for pod networking via the Container Network Interface (CNI) plugin

### Create Managed Nodegroups

Now let's create two nodegroups: a system nodegroup that runs critical Kubernetes components (CoreDNS, kube-proxy, Container Network Interface (CNI), Container Storage Interface (CSI) drivers, metrics-server, etc.) and a training nodegroup that runs your machine learning workloads. This separation ensures cluster stability even under heavy training load.

#### System Nodegroup

We recommend the [AWS EC2 instance type t3.medium](https://aws.amazon.com/ec2/instance-types/t3/) for system pods, as it is relatively cheap and sufficient. Spread nodes across three Availability Zones for resilience, use `ON_DEMAND` capacity for reliability, and label the nodes `type=system`:

```bash theme={null}
aws eks create-nodegroup \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <SYSTEM_NODEGROUP_NAME> \
  --scaling-config minSize=2,maxSize=5,desiredSize=2 \
  --subnets <SUBNET_ID_1A> <SUBNET_ID_1B> <SUBNET_ID_1C> \
  --node-role arn:aws:iam::<ACCOUNT_ID>:role/<NODEGROUP_ROLE_NAME> \
  --instance-types t3.medium \
  --ami-type AL2_x86_64 \
  --capacity-type ON_DEMAND \
  --update-config maxUnavailable=1 \
  --labels type=system \
  --tags owner=<OWNER_NAME>
```

Creates a nodegroup with `t3.medium` instances (2 vCPUs, 4 GiB memory) spread across three AZs. The group scales between 2 and 5 nodes (EC2 instances), and each node is labeled `type=system`. Set an appropriate `SYSTEM_NODEGROUP_NAME`. Always use `AL2_x86_64` for system nodes so that the data ingestor can run on these nodes as well.

#### Training Nodegroup

This group runs your ML training workloads — size it for your dataset, model type, number of parallel workloads, and whether you need GPUs.

Refer to the [EC2 instance types list](https://aws.amazon.com/ec2/instance-types) and [EKS managed nodegroups docs](https://docs.aws.amazon.com/eks/latest/userguide/create-managed-node-group.html) for guidance.

```bash theme={null}
aws eks create-nodegroup \
  --cluster-name <CLUSTER_NAME> \
  --nodegroup-name <TRAINING_NODEGROUP_NAME> \
  --scaling-config minSize=0,maxSize=5,desiredSize=2 \
  --subnets <SUBNET_ID_1A> <SUBNET_ID_1B> <SUBNET_ID_1C> \
  --node-role arn:aws:iam::<ACCOUNT_ID>:role/<NODEGROUP_ROLE_NAME> \
  --instance-types <INSTANCE_TYPE> \
  --ami-type <AMI_TYPE> \
  --capacity-type ON_DEMAND \
  --update-config maxUnavailable=1 \
  --labels trainingset=<LABEL> \
  --tags owner=<OWNER_NAME> \
  --disk-size 50
```

Set an appropriate `TRAINING_NODEGROUP_NAME`. Depending on the compute type (CPU vs. GPU), set variables as follows:

| Compute Type | INSTANCE\_TYPE | AMI\_TYPE               | LABEL |
| ------------ | -------------- | ----------------------- | ----- |
| CPU          | t3.xlarge      | AL2\_x86\_64            | cpu   |
| GPU          | g5.xlarge      | AL2023\_x86\_64\_NVIDIA | gpu   |

The CPU row provisions a training nodegroup with t3.xlarge instances (4 vCPUs, 16 GiB memory). It starts with 2 nodes, and the scaling bounds allow it to shrink to 0 when idle and grow to 5 under load.

You can limit the compute per participant or team: Refer to the [creating a use case](/create-use-case/define) section for details.
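
Once both nodegroups are requested, you can wait for them to become active and confirm that the nodes carry the labels set above (`type` for the system nodegroup, `trainingset` for the training nodegroup). This is a sketch; the function names are illustrative.

```bash
# Block until a managed nodegroup is ACTIVE.
# Arguments: $1 = cluster name, $2 = nodegroup name.
wait_for_nodegroup() {
  aws eks wait nodegroup-active --cluster-name "$1" --nodegroup-name "$2"
}

# List all nodes and show the labels used in this guide as extra columns.
show_node_labels() {
  kubectl get nodes -L type,trainingset
}

# Example:
#   wait_for_nodegroup my-cluster my-system-nodes
#   wait_for_nodegroup my-cluster my-training-nodes
#   show_node_labels
```

Each node should show a value in exactly one of the two label columns.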

#### Optional for GPU Training Nodegroup

Install the NVIDIA device plugin so Kubernetes can automatically detect and schedule GPUs on any new node:

```bash theme={null}
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.1/deployments/static/nvidia-device-plugin.yml
```
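
To check that the plugin has registered GPUs with the scheduler, you can print the allocatable `nvidia.com/gpu` resource per node. This is a sketch; the function name `gpu_capacity` is illustrative, and nodes without GPUs are expected to show `<none>` in the GPU column.

```bash
# Print each node together with its allocatable NVIDIA GPU count.
# The dot in "nvidia.com" is escaped so it is read as part of the key name.
gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Example:
#   gpu_capacity
```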

## 4. Storage

Your workloads need persistent storage across nodes to store:

* train and test datasets
* model checkpoints and weights during training
* client logs

Tracebloc leverages Amazon EFS (Elastic File System) to provide a scalable, managed file system that integrates directly with Kubernetes via the EFS CSI (Container Storage Interface) driver. It provides shared, persistent storage that all nodes and pods in the cluster can access at the same time.

### Create EFS File System

Let's create an EFS file system:

```bash theme={null}
aws efs create-file-system \
  --performance-mode generalPurpose \
  --throughput-mode elastic \
  --tags Key=owner,Value=<OWNER_NAME>
```

Creates an Amazon EFS file system with elastic throughput that scales automatically with workload demand. Save the generated `FILE_SYSTEM_ID` for later use.

### Identify Security Groups

List all security groups in the VPC so you can allow NFS traffic from your worker nodes to the EFS mount targets. NFS is the file system protocol that EFS uses to let your nodes read and write shared storage.

```bash theme={null}
aws ec2 describe-security-groups \
  --region eu-central-1 \
  --filters Name=vpc-id,Values=<VPC_ID> \
  --query "SecurityGroups[*].{ID:GroupId,Name:GroupName}" \
  --output table
```

From the output, note the ID of the security group your EKS cluster uses. You need it as `<SECURITY_GROUP_ID>` in the next step.

### Create EFS Mount Targets

Mount targets act as network endpoints in each Availability Zone, letting your worker nodes connect to the EFS file system using NFS.

```bash theme={null}
aws efs create-mount-target \
  --file-system-id <FILE_SYSTEM_ID> \
  --subnet-id <SUBNET_ID_1A> \
  --security-groups <SECURITY_GROUP_ID>

aws efs create-mount-target \
  --file-system-id <FILE_SYSTEM_ID> \
  --subnet-id <SUBNET_ID_1B> \
  --security-groups <SECURITY_GROUP_ID>

aws efs create-mount-target \
  --file-system-id <FILE_SYSTEM_ID> \
  --subnet-id <SUBNET_ID_1C> \
  --security-groups <SECURITY_GROUP_ID>
```

Creates EFS mount targets in each subnet so nodes across all Availability Zones can attach to the same shared file system.
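
Mount targets take a short while to become usable. The sketch below creates one mount target per subnet in a loop and offers a helper that succeeds only when every mount target reports the `available` lifecycle state; the function names are illustrative.

```bash
# Create one mount target per subnet.
# Arguments: $1 = file system ID, $2 = security group ID, $3... = subnet IDs.
create_mount_targets() {
  local fs_id="$1" sg_id="$2"
  shift 2
  local subnet
  for subnet in "$@"; do
    aws efs create-mount-target \
      --file-system-id "$fs_id" \
      --subnet-id "$subnet" \
      --security-groups "$sg_id"
  done
}

# Print the lifecycle state of each mount target of a file system.
# Arguments: $1 = file system ID.
mount_target_states() {
  aws efs describe-mount-targets \
    --file-system-id "$1" \
    --query 'MountTargets[].LifeCycleState' \
    --output text
}

# Succeed only if at least one state is given and all are "available".
all_available() {
  [ "$#" -gt 0 ] || return 1
  local state
  for state in "$@"; do
    [ "$state" = "available" ] || return 1
  done
}

# Example (word splitting of the states is intentional):
#   until all_available $(mount_target_states "$FILE_SYSTEM_ID"); do sleep 5; done
```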

### Attach EFS CSI Driver Policy to Node Role

The EFS CSI (Container Storage Interface) driver is the Kubernetes plugin that makes Amazon EFS usable inside your cluster. It translates Kubernetes PersistentVolumeClaims into actual EFS mounts. For the driver to do this, the worker nodes need IAM permissions to create, mount, and manage EFS volumes.

```bash theme={null}
aws iam attach-role-policy \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEFSCSIDriverPolicy \
  --role-name <NODEGROUP_ROLE_NAME>
```

Attaches the AWS-managed AmazonEFSCSIDriverPolicy to your nodegroup role. This lets the EFS CSI driver running on your nodes provision and mount EFS storage for pods, so workloads can share datasets, logs, and model checkpoints across the cluster.

### Add EFS CSI Driver Helm Repository

The EFS CSI driver is packaged as a Helm chart, which makes it easy to install and upgrade in your cluster.

```bash theme={null}
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver
```

This command registers the AWS EFS CSI driver Helm repository with your local Helm setup. Once added, you can install or update the EFS CSI driver in your cluster using Helm commands. The driver runs as pods in the kube-system namespace and allows Kubernetes to dynamically provision and mount EFS volumes for your workloads.

### Configure EFS CSI Driver Placement

The EFS CSI driver consists of a controller (manages provisioning of storage) and node pods (handle actual EFS mounts on each worker). By default, the controller could be scheduled onto any node, including your training nodes. Since training jobs consume heavy CPU and memory, this risks starving the controller and blocking storage operations.
Create a driver configuration file named `efs-csi-driver.yaml`:

```bash theme={null}
vi efs-csi-driver.yaml
```

Paste the following content into the file:

```yaml theme={null}
controller:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: type
            operator: In
            values:
            - system
```

This node affinity makes the scheduler prefer the system nodegroup (nodes labeled `type=system`, the label set when the system nodegroup was created) for the EFS CSI controller pods instead of training nodes. Because it is a preference rather than a hard requirement, the controller can still be scheduled elsewhere if no system node has capacity.

### Install EFS CSI Driver with Helm

Now, the EFS CSI driver needs to be installed into the cluster (namespace kube-system by convention) using the configuration file.

```bash theme={null}
helm install efs-csi-driver aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system \
  -f efs-csi-driver.yaml
```
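
After the install, you can check that the driver pods are running and see which nodes they landed on. This sketch assumes the chart labels its pods with `app.kubernetes.io/name=aws-efs-csi-driver`; if your chart version uses different labels, list all pods in `kube-system` instead. The function name is illustrative.

```bash
# List the EFS CSI driver pods, including the node each pod runs on.
efs_csi_pods() {
  kubectl get pods -n kube-system \
    -l app.kubernetes.io/name=aws-efs-csi-driver \
    -o wide
}

# Example:
#   efs_csi_pods
```

The `NODE` column of the wide output shows whether the controller pods were placed on system nodes.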

### Install Metrics Server

The Metrics Server is the cluster-wide aggregator for resource usage data. It collects CPU and memory metrics for autoscaling and live resource monitoring. Install it to monitor resource usage:

```bash theme={null}
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/high-availability-1.21+.yaml
```

Deploys the Metrics Server in high-availability mode for reliable metric collection.
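
A quick way to confirm that metrics are flowing is to wait for the `metrics-server` deployment to roll out and then query node usage. This is a sketch; the function name and the 120-second timeout are arbitrary choices.

```bash
# Wait for the Metrics Server rollout, then print CPU and memory usage per node.
metrics_ready() {
  kubectl -n kube-system rollout status deployment/metrics-server --timeout=120s &&
    kubectl top nodes
}

# Example:
#   metrics_ready
```

It can take a minute or two after the rollout before `kubectl top nodes` returns data.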

## 5. Client Configuration

You can now prepare the Helm chart that deploys the tracebloc client. The client is the component that runs your model evaluation and training jobs inside the cluster. To install it, you use a Helm chart, which bundles all required Kubernetes manifests into a single, configurable package.
A values.yaml file controls this deployment, where you specify credentials (to authenticate with tracebloc), registry access (to pull container images), and storage settings (to mount your EFS volumes). This configuration ensures the client can securely connect, schedule workloads, and store results.

### Add Helm Repository

The tracebloc client is delivered as a single unified Helm chart (`tracebloc/client`) that supports AKS, EKS, bare-metal, and OpenShift. Source: [tracebloc/client](https://github.com/tracebloc/client).

```bash theme={null}
helm repo add tracebloc https://tracebloc.github.io/client
helm repo update
```

### Configure your Deployment Settings

Export the chart's default configuration into a local file that you can edit:

```bash theme={null}
helm show values tracebloc/client > values.yaml
```

Open and update the following sections in `values.yaml`:

#### Tracebloc Authentication

Provide your Client ID and password from the [tracebloc client view](https://ai.tracebloc.io/clients):

```yaml theme={null}
clientId: "<YOUR_CLIENT_ID>"
clientPassword: "<YOUR_CLIENT_PASSWORD>"
```

#### Docker Registry Configuration

The tracebloc client images live on Docker Hub. Provide credentials so Kubernetes can pull them — the chart auto-creates a secret named `{{ .Release.Name }}-regcred`:

```yaml theme={null}
dockerRegistry:
  server: https://index.docker.io/v1/
  username: <DOCKER_USERNAME>
  password: <DOCKER_TOKEN or DOCKER_PASSWORD>
  email: <DOCKER_EMAIL>
```

* `DOCKER_USERNAME`: Docker Hub username
* `DOCKER_PASSWORD` / `DOCKER_TOKEN`: Password, or an access token (preferred — required if 2FA is enabled)
* `DOCKER_EMAIL`: Email linked to your Docker account

#### Storage (EKS / EFS)

The chart provisions an EFS-backed storage class via the EFS CSI driver. Use **EFS Access Points** (`provisioningMode: efs-ap`) for proper UID/GID enforcement on shared filesystems:

```yaml theme={null}
storageClass:
  create: true
  provisioner: efs.csi.aws.com
  volumeBindingMode: Immediate
  reclaimPolicy: Retain
  mountOptions:
    - actimeo=30
  parameters:
    directoryPerms: "700"
    uid: "999"
    gid: "999"
    fileSystemId: <FILE_SYSTEM_ID>
    provisioningMode: efs-ap

clusterScope: true

pvc:
  mysql: 2Gi
  logs: 10Gi
  data: 50Gi
```

Add the `FILE_SYSTEM_ID` from your EFS setup (step 4). Adjust PVC sizes as needed.
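If you no longer have the file system ID at hand, it can be looked up with the AWS CLI and written into the values file. The sketch below assumes that `values.yaml` still contains the literal `<FILE_SYSTEM_ID>` placeholder shown above; the function name `set_file_system_id` and the example ID are hypothetical.

```bash
# Hypothetical helper: replaces the <FILE_SYSTEM_ID> placeholder in a values file.
# Writes through a temporary file so it behaves the same with GNU and BSD sed.
set_file_system_id() {
  local values_file="$1" fs_id="$2"
  case "$fs_id" in
    fs-*) ;;  # EFS file system IDs start with "fs-"
    *) echo "expected an ID starting with fs-, got: $fs_id" >&2; return 1 ;;
  esac
  local tmp
  tmp="$(mktemp)" || return 1
  sed "s/<FILE_SYSTEM_ID>/${fs_id}/" "$values_file" > "$tmp" \
    && cat "$tmp" > "$values_file"
  local status=$?
  rm -f "$tmp"
  return "$status"
}

# List file systems to find the ID (requires AWS credentials):
#   aws efs describe-file-systems \
#     --query 'FileSystems[*].[FileSystemId,Name]' --output table
# Then write it into the values file (example ID is made up):
#   set_file_system_id values.yaml fs-0123456789abcdef0
```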

**StorageClass options:**

* `create: true` — chart creates a release-unique storage class (e.g. `<release>-storage-class`). Each release gets its own.
* `create: false` — reuse an existing class; `name` must match.

#### Set Resource Limits for Training Jobs

Each training Job inherits these resource requests/limits. Make sure they fit within your EC2 instance capacity. **IMPORTANT:** For GPU jobs, `requests` and `limits` must be equal — Kubernetes rejects configs where they differ.

```yaml theme={null}
env:
  RESOURCE_REQUESTS: "cpu=2,memory=8Gi"
  RESOURCE_LIMITS: "cpu=2,memory=8Gi"
  GPU_REQUESTS: ""        # for GPU support set "nvidia.com/gpu=1"
  GPU_LIMITS: ""          # for GPU support set "nvidia.com/gpu=1"
```

**Tip:** Estimating VRAM requirements for LLMs can be tricky. Use the [VRAM Calculator](https://apxml.com/tools/vram-calculator) to approximate memory needs for different model sizes and batch configurations.

**Node, Pod, Job relationship**

* One EC2 instance equals one node; one node runs many pods.
* One pod contains one or more containers.
* One training run equals one Job, which by default creates one pod.
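To check that the configured requests fit on a node, the arithmetic can be sketched as follows. The helper `jobs_per_node` is hypothetical and not part of the chart, and the node figures in the example are illustrative rather than taken from a specific instance type. The result is an upper bound, since system pods and DaemonSets also consume node resources.

```bash
# Hypothetical helper: estimates how many training pods fit on one node.
# Args: allocatable CPU (millicores), requested CPU (millicores),
#       allocatable memory (MiB), requested memory (MiB).
jobs_per_node() {
  local by_cpu=$(( $1 / $2 ))
  local by_mem=$(( $3 / $4 ))
  # The scarcer resource decides how many pods fit.
  if [ "$by_cpu" -lt "$by_mem" ]; then echo "$by_cpu"; else echo "$by_mem"; fi
}

# Illustrative node: 7910m CPU and 29000 MiB allocatable.
# Job requests from the example above: cpu=2 (2000m), memory=8Gi (8192 MiB).
jobs_per_node 7910 2000 29000 8192   # prints 3
```

The allocatable values for a real node are listed under `Allocatable` in the output of `kubectl describe node <NODE_NAME>`.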

#### Proxy Settings (Optional)

Required only if your EKS worker nodes reach the internet through a corporate proxy:

```yaml theme={null}
env:
  HTTP_PROXY_HOST:
  HTTP_PROXY_PORT:
  HTTP_PROXY_USERNAME:
  HTTP_PROXY_PASSWORD:
```

#### NetworkPolicy for Training Pods

The chart can apply a NetworkPolicy that denies ingress and restricts egress to DNS + external HTTPS only — blocking pod-to-pod, MySQL, and Kubernetes API access from training pods.

```yaml theme={null}
networkPolicy:
  training:
    enabled: false   # see note below
```

**EKS caveat:** the AWS VPC CNI does **not** enforce NetworkPolicy in its default configuration. The CI defaults ship with `enabled: false` to avoid a false sense of security. If your cluster enforces NetworkPolicy, for example through Calico or Cilium as an add-on, set `enabled: true`.
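If you are unsure whether Calico or Cilium is present, a rough check is to look for their DaemonSets. The sketch below is a heuristic only: the function name `detect_policy_engine` is hypothetical, it matches on the DaemonSet names `calico-node` and `cilium`, and a match does not prove that policies are enforced.

```bash
# Hypothetical helper: scans `kubectl get daemonsets --all-namespaces` output on stdin
# (columns: NAMESPACE NAME ...) for the DaemonSet names used by Calico and Cilium.
detect_policy_engine() {
  awk '$2 == "calico-node" { found = "calico" }
       $2 == "cilium"      { found = "cilium" }
       END { print (found ? found : "none") }'
}

# Usage (requires cluster access):
#   kubectl get daemonsets --all-namespaces | detect_policy_engine
```

A result of `none` suggests leaving `networkPolicy.training.enabled` at `false` unless you have enabled policy enforcement some other way.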

#### Resource Monitor (metrics-server)

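The chart's resource-monitor DaemonSet polls `/apis/metrics.k8s.io/v1beta1` and **requires `metrics-server`**. EKS does not install it by default. If you already installed it in [Install Metrics Server](#install-metrics-server), skip this command. Otherwise, install it once per cluster before deploying the chart (this is the standard, non-HA manifest):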

```bash theme={null}
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```

If you cannot install metrics-server, disable the DaemonSet:

```yaml theme={null}
resourceMonitor: false
```

## 6. Client Deployment

With the configuration ready, deploy the tracebloc client into your EKS cluster using Helm.
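Before installing, it can be worth confirming that no placeholder from the configuration snippets is still present in `values.yaml`. The following is a minimal sketch; the function name `check_placeholders` is hypothetical, and the pattern assumes placeholders are written in angle brackets and start with an uppercase letter, as in `<YOUR_CLIENT_ID>` or `<FILE_SYSTEM_ID>`.

```bash
# Hypothetical pre-flight check: prints every line of a values file that still
# contains an unreplaced <PLACEHOLDER> token and returns non-zero if any exist.
check_placeholders() {
  local values_file="$1"
  if grep -nE '<[A-Z][A-Z_ a-z]*>' "$values_file"; then
    echo "Unreplaced placeholders found in $values_file" >&2
    return 1
  fi
  echo "No placeholders left in $values_file"
}

# Usage:
#   check_placeholders values.yaml
```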

### Deploy the client with Helm

Install the chart into your namespace using your customized values file:

```bash theme={null}
helm install <RELEASE_NAME> tracebloc/client \
  --namespace <NAMESPACE> \
  --create-namespace \
  --values values.yaml
```

This creates:

* One MySQL pod: `mysql-...`
* One tracebloc client pod: `<release>-jobs-manager-...`
* A `tracebloc-resource-monitor` DaemonSet in the `tracebloc-node-agents` namespace
* A `<release>-auto-upgrade` CronJob for daily chart upgrades (see [Configuration → Auto-upgrade](/environment-setup/configuration#auto-upgrade-on-by-default))
* Supporting resources: Service, ConfigMap, Secrets, PVCs, PriorityClass, PDBs

Expect a few minutes for pods to pull images, bind PVCs, and reach `Running`.
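Rather than polling by hand, you can block until the pods are ready. This sketch uses `kubectl wait`; the function name `wait_for_client` and the 600-second default timeout are assumptions, and it presumes that no completed Job pods exist in the namespace yet, since those never report `Ready`.

```bash
# Hypothetical helper: blocks until every pod in the namespace reports Ready,
# or fails once the timeout expires.
wait_for_client() {
  local namespace="$1" timeout="${2:-600s}"
  kubectl wait pods --all --for=condition=Ready \
    --namespace "$namespace" --timeout="$timeout"
}

# Usage (requires cluster access):
#   wait_for_client <NAMESPACE>
```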

## Verification and Maintenance

After deployment, confirm the client is running correctly and learn how to maintain it.

### Verify Deployment

Run the following checks against your namespace.

#### Check pod status:

```bash theme={null}
kubectl get pods -n <NAMESPACE>
```

Within a few minutes, the two pods `mysql-...` and `<release>-jobs-manager-...` should be in the `Running` state. If they are not, inspect the pod events and logs to troubleshoot.
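To narrow down why a pod is not running, the commands below are a reasonable starting point. The function name `show_pod_problems` is hypothetical. Note that pods in `CrashLoopBackOff` still report the phase `Running`, so also check the `RESTARTS` column of `kubectl get pods`.

```bash
# Hypothetical helper: lists pods whose phase is not Running, then the
# namespace events in chronological order (most recent last).
show_pod_problems() {
  local namespace="$1"
  kubectl get pods --namespace "$namespace" \
    --field-selector='status.phase!=Running'
  kubectl get events --namespace "$namespace" --sort-by=.lastTimestamp
}

# For a specific pod (requires cluster access):
#   kubectl describe pod <POD_NAME> -n <NAMESPACE>
#   kubectl logs <POD_NAME> -n <NAMESPACE> --previous   # logs of the previous container instance
```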

#### Check services:

```bash theme={null}
kubectl get services -n <NAMESPACE>
```

The output should list the database service (for example, `mysql-service`).

#### Check persistent volumes:

```bash theme={null}
kubectl get pvc -n <NAMESPACE>
```

All PVCs should show the status `Bound`, which confirms that storage is available.
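With several claims, it can be easier to print only the ones that are not yet bound. This is a small sketch; the function name `unbound_pvcs` is hypothetical, and it relies on the default column order of `kubectl get pvc`, where the second column is `STATUS`.

```bash
# Hypothetical helper: reads `kubectl get pvc --no-headers` output on stdin and
# prints the name of every claim whose STATUS column is not "Bound".
unbound_pvcs() {
  awk '$2 != "Bound" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pvc -n <NAMESPACE> --no-headers | unbound_pvcs
```

No output means every claim is bound. For a claim that stays `Pending`, `kubectl describe pvc <PVC_NAME> -n <NAMESPACE>` shows the provisioning events.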

### Maintenance

The chart's auto-upgrade CronJob handles routine version bumps daily. To upgrade manually:

#### Update your values:

```bash theme={null}
helm show values tracebloc/client > new-values.yaml
# Edit new-values.yaml; values set in this file override the ones reused from the current release
```

#### Upgrade the deployment:

```bash theme={null}
helm upgrade <RELEASE_NAME> tracebloc/client \
  --namespace <NAMESPACE> \
  --reset-then-reuse-values \
  --values new-values.yaml
```
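The `--reset-then-reuse-values` flag is available in Helm 3.14 and later. The sketch below wraps the manual upgrade with a repository refresh beforehand and a revision listing afterwards; the function name `upgrade_client` is hypothetical.

```bash
# Hypothetical wrapper around the manual upgrade: refreshes the chart repository,
# upgrades the release, then prints its revision history.
upgrade_client() {
  local release="$1" namespace="$2" values_file="$3"
  helm repo update || return 1
  helm upgrade "$release" tracebloc/client \
    --namespace "$namespace" \
    --reset-then-reuse-values \
    --values "$values_file" || return 1
  helm history "$release" --namespace "$namespace"
}

# Usage:
#   upgrade_client <RELEASE_NAME> <NAMESPACE> new-values.yaml
# To return to an earlier revision listed by `helm history`:
#   helm rollback <RELEASE_NAME> <REVISION> -n <NAMESPACE>
```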

### Uninstall

#### Remove the Helm release:

```bash theme={null}
helm uninstall <RELEASE_NAME> -n <NAMESPACE>
```

#### Clean up persistent resources (optional):

```bash theme={null}
kubectl delete pvc --all -n <NAMESPACE>
kubectl delete namespace <NAMESPACE>
```
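Because the storage class above uses `reclaimPolicy: Retain`, deleting the PVCs leaves the PersistentVolumes, and the data on EFS, in place. The sketch below lists the leftover volumes; the function name `released_pvs` is hypothetical, and it relies on the default column order of `kubectl get pv`, where the fifth column is `STATUS`.

```bash
# Hypothetical helper: reads `kubectl get pv --no-headers` output on stdin and prints
# the names of volumes in the "Released" state (kept after their claim was deleted).
released_pvs() {
  awk '$5 == "Released" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get pv --no-headers | released_pvs
# Remove a leftover volume object once you no longer need its data:
#   kubectl delete pv <PV_NAME>
# List the EFS access points that remain on the file system (requires AWS credentials):
#   aws efs describe-access-points --file-system-id <FILE_SYSTEM_ID>
```

Review the list before deleting anything, since other releases in the cluster may use the same file system.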

## Next Steps

**Create a New Use Case**

* **[Prepare your dataset](/create-use-case/prepare-dataset)** - Upload and configure your training data
* **[Create an AI use case](/create-use-case/define)** - Set up a new use case

**Join an Existing Use Case**

* **[Explore and join available use cases](/join-use-case/explore-use-case)** - Browse ongoing AI projects
* **[Start training models](/join-use-case/start-training)** - Begin training on shared datasets

## Need Help?

* Email: [support@tracebloc.io](mailto:support@tracebloc.io)
* Docs: [tracebloc Documentation Portal](https://docs.tracebloc.io)