
This page documents the training and inference pipeline that the tracebloc client runs for every supported use case. The goal is full transparency: you can read what happens step-by-step, write an equivalent script on your own machine against the same dataset, and compare metrics number-for-number against what the platform reports. If something here does not match what you observe in your run, please open a support ticket — the source of truth is the open client code in tracebloc/tracebloc-client.

Shared lifecycle

Every use case runs through the same outer loop on the edge:
1. Resolve the experiment

The platform reads your experiment configuration — dataset, hyperparameters, framework choice, training-or-inference mode — and selects the right framework backend (PyTorch, TensorFlow, scikit-learn, lifelines, or scikit-survival).
2. Load your model

Your uploaded model file is fetched and instantiated. For continued cycles and inference, the latest weights from the experiment are loaded into it.
3. Build the data pipeline

The platform loads your raw data, runs the use-case-specific preprocessing, and produces training and validation batches (or a single test set in inference mode).
4. Configure optimizer and loss

Your hyperparameters are normalized, your loss function is constructed, and your optimizer (and learning-rate scheduler, if any) is built — all from the values you set in the notebook.
5. Run the training loop

For each epoch, every training batch goes through forward, loss, backward, and optimizer step. Validation batches run a forward pass only. Per-batch numbers feed into the metrics layer.
6. Compute and report metrics

After each epoch the metrics layer produces epoch-level numbers (loss curves, monitoring metric). After the full cycle ends it produces the cycle-level aggregate metrics shown in the platform UI. Weights are then saved and uploaded for the next cycle.

Experiment parameters (shared across all use cases)

Every use case below pulls its run-time configuration from the same set of experiment parameters. You set these values in your Jupyter notebook when you configure and submit the experiment with the tracebloc Python package; the platform deserializes them on the edge before training begins. The same parameter names work the same way across image classification, object detection, segmentation, keypoint detection, text, tabular, time series, and survival use cases — only the subset that applies to a given task is read.

The values that reach the platform are always whatever you set in the notebook. The SDK initializes every parameter to a default at construction time, so even an experiment where you change nothing arrives on the edge with concrete values for every field. When you call a setter (optimizer("adam"), batch_size(64), …), the SDK overwrites that field; on start() the assembled payload is what the platform receives.

The SDK’s starting defaults — sent verbatim if you don’t change them in the notebook:
| Setting | Default | Notebook setter |
| --- | --- | --- |
| Epochs | 10 (forced to 1 for survival frameworks) | epochs(...) |
| Cycles | 1 | cycles(...) |
| Optimizer | SGD | optimizer(...) |
| Learning rate | 0.001, constant | learning_rate(...) |
| Batch size | 16 | batch_size(...) |
| Validation split | computed per dataset as num_classes / min_images_per_edge, clamped to [0.1, 0.5] | validation_split(...) (only accepts values in [default, 0.5]) |
| Seed | off | seed(...) |
| Shuffle (training) | on | shuffle(...) |
| All augmentation flags | off | <flag>(...) |
| Pre-trained weights | off | model upload setting |
Class weighting is applied automatically by the platform for image classification and tabular classification only, when your loss function is cross-entropy, NLL, or binary BCE. The formula is described in those sections. Other use cases do not reweight the loss.

To replicate a run locally, read the actual values your experiment was launched with from the experiment view (or your notebook), then match the per-use-case preprocessing and metrics described below — those parts are baked into the platform pipeline and are not configurable from the notebook.

Per use case

Image classification

Frameworks: PyTorch, TensorFlow

Input
  • Image files (JPEG / PNG) supplied through the dataset metadata as data_id (the image filename) and label (the class name).
  • Class names are mapped to integer indices in the order defined by the dataset’s class list, so the same class always lines up with the same logit position across cycles and inference.
Preprocessing
  • Images are resized to a square at the size you set in the notebook (default 256). Aspect ratio is not preserved — the resize is a direct stretch.
  • Pixel values are normalized using ImageNet mean and standard deviation. You can override the mean and standard deviation in the notebook if your model was pre-trained against different statistics.
  • The augmentation flags you set in the notebook (rotation, shifts, brightness, etc.) drive an image augmentation pipeline that runs on the training split only — validation always sees the unaugmented preprocessing so metrics stay deterministic across epochs. For reference, the SDK only allows the geometric and color augmentation flags to be set on PyTorch experiments — horizontal and vertical flip flags are TensorFlow-only at the SDK level.
  • Train/validation split is stratified by class label (so class proportions are preserved on both sides) and uses a deterministic seed. If your chosen split would leave one side empty on a small dataset, it is silently retried with the ratio clamped into a safe range.
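For local replication, the resize and normalization above map onto standard torchvision transforms. A minimal sketch, assuming the default 256-pixel size and the stock ImageNet statistics (swap in your own values if you overrode them in the notebook); this is an illustration, not the client's exact transform stack:

```python
from torchvision import transforms

# Validation-style preprocessing: direct stretch to a 256 x 256 square
# (an (H, W) tuple disables aspect-ratio preservation), pixel scaling to [0, 1],
# then ImageNet mean/std normalization. Augmentation flags are excluded here.
val_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```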
Training step
  1. Forward pass through the model produces a logit per class.
  2. The loss function you configured in the notebook is used. Cross-entropy is the common choice; if you pick a regression-style loss (such as MSE) the labels are converted to one-hot floats automatically so the shapes line up.
  3. Class weighting is applied automatically: for cross-entropy / NLL, each class gets a weight inversely proportional to its training-split frequency (normalized so the weights average to 1, so balanced classes effectively pass through unchanged); for binary BCE, the positive class gets a weight equal to negative_count / positive_count. Regression losses like MSE and L1 are not reweighted. This means a verifier who computes loss locally without these weights will see different numbers, especially on imbalanced datasets.
  4. Backward pass and optimizer step.
  5. Per-batch monitoring metric: accuracy — the fraction of images whose predicted class matches the ground-truth class.
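The class-weight formula in step 3 is easy to check locally. A minimal sketch with illustrative label arrays (not the client's code):

```python
import numpy as np
import torch

labels = np.array([0, 0, 0, 0, 1, 1, 2])  # illustrative training-split labels

# Cross-entropy / NLL: weights inversely proportional to class frequency,
# normalized so they average to 1 (a balanced split yields all-ones weights).
counts = np.bincount(labels)
weights = 1.0 / counts
weights = weights / weights.mean()
ce_loss = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))

# Binary BCE: the positive class is weighted by negative_count / positive_count.
binary_labels = np.array([0, 0, 0, 1])    # illustrative binary training split
pos_weight = (binary_labels == 0).sum() / (binary_labels == 1).sum()
bce_loss = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([pos_weight], dtype=torch.float32))
```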
Validation step
  • Same forward pass without backward. The validation transform applies the same resize and normalization but no augmentation, so validation is deterministic across epochs.
Cycle metrics
  • Accuracy family: accuracy, top-3 accuracy, top-5 accuracy. For datasets with 3 or fewer (or 5 or fewer) classes, the corresponding top-k accuracy collapses to 1.0 — interpret it accordingly.
  • Probability-based: macro-averaged AUC-ROC, macro-averaged AUC-PR, log loss, Brier score (multiclass squared-error form, not the binary sklearn version), quadratic weighted kappa.
  • Confusion matrix is produced and surfaced in the run output.
Inference output
  • Per image: the predicted class index and the full softmax probability vector. The class-index ordering is the dataset’s class list — match this ordering when comparing locally.

Object detection

Frameworks: PyTorch (YOLO and R-CNN model families)

Input
  • Images plus per-image annotation files (Pascal VOC-style XML sidecars) listing each object’s class name and bounding-box coordinates.
  • Class names are matched case-insensitively against the dataset’s class list and mapped to integer indices in the order that list defines.
Choosing the model family

You select the model family in the notebook — either an R-CNN family model (Faster R-CNN, Mask R-CNN) or a YOLO family model. The platform branches its training and validation logic on this choice, so it has to be set correctly for your model. If left unset, the platform falls back to inspecting the model name and class for the word “yolo”; if neither matches, it defaults to R-CNN. Picking an unsupported value will fail the run early.

Preprocessing
  • Images are resized to a square at the size you set in the notebook (default 416 for R-CNN; for YOLO the platform pins the image size at 448 regardless of what you configure). Aspect ratio is not preserved — the resize is a direct stretch, and bounding-box coordinates are rescaled to the same stretched frame. Letterbox padding is not used today.
  • Pixel values are scaled to [0, 1]. ImageNet mean/std normalization is not applied in the object-detection pipeline by default — torchvision R-CNN models normalize internally as part of the model, and YOLO consumes the [0, 1] tensor directly.
  • Bounding boxes are validated before training: boxes that fall outside the image, are smaller than 2 pixels on a side, have an extreme aspect ratio, or cover a near-zero area are dropped (along with their labels) so the model never sees degenerate targets.
  • Class labels are zero-indexed — the first class in your dataset list is class 0. This differs from torchvision’s R-CNN convention where class 0 is reserved for background, so a torchvision pre-trained classifier head cannot be reused as-is.
  • The augmentation flags you set in the notebook drive a joint image-and-bounding-box augmentation pipeline that runs on the training split only. Geometric transforms are applied to the image and to its bounding-box coordinates together so labels stay aligned. Validation always sees the unaugmented preprocessing.
  • Train/validation split is random (non-stratified) and deduplicated by image filename, so all the boxes for a given image stay on the same side of the split. Default split is 85/15.
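Because the resize is a direct stretch, box coordinates scale independently per axis. A minimal sketch of that rescaling with an illustrative helper (not the client's code), assuming boxes in [x_min, y_min, x_max, y_max] pixel coordinates:

```python
import numpy as np

def stretch_resize_boxes(boxes, orig_w, orig_h, target_size):
    """Rescale [x_min, y_min, x_max, y_max] boxes into the stretched square frame."""
    boxes = np.array(boxes, dtype=np.float32)
    boxes[:, [0, 2]] *= target_size / orig_w   # independent horizontal scale
    boxes[:, [1, 3]] *= target_size / orig_h   # independent vertical scale
    return boxes

# Example: a 1280x720 image stretched to 416x416 (the R-CNN default; YOLO is pinned to 448).
print(stretch_resize_boxes([[100, 50, 300, 200]], orig_w=1280, orig_h=720, target_size=416))
```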
Training step
  • R-CNN family: the model is run in training mode and returns its internal loss dict (region-proposal, classification, box-regression, objectness). The platform sums these with equal weights and backpropagates. The loss function you set in the notebook is ignored for R-CNN — the model defines its own losses.
  • YOLO family: the model returns raw grid predictions. The loss is computed by an external loss module supplied alongside your model — the platform does not ship a built-in YOLO loss.
A backward pass and optimizer step follow.

Per-batch monitoring metrics (loss-curve only, not the cycle metric)
  • R-CNN: an “all boxes correct” rate — an image counts as correct only if every ground-truth box has a predicted box of the same class with IoU above 0.2. Strict criterion; expect low values early in training.
  • YOLO: the fraction of grid cells with objectness confidence above 0.5.
Validation step
  • For R-CNN, the model is run in evaluation mode and produces a list of per-image predictions (boxes, scores, class labels) directly. No additional non-maximum suppression or score filtering is applied by the platform — whatever thresholds the model was constructed with apply.
  • For YOLO, every grid cell with positive objectness is decoded into a box in pixel coordinates. Non-maximum suppression is not applied by the platform on the YOLO path. If you want NMS for a fair local comparison, apply it in your local script with the same thresholds.
Cycle metrics

Standard COCO-style detection metrics are computed only on validation pairs where both prediction and ground truth have at least one box. Pairs where either side is empty are skipped, so an all-empty-prediction model returns zero across the board cleanly.
  • Mean Average Precision: mAP averaged across IoU thresholds 0.5 to 0.95 in 0.05 steps (the COCO definition), plus mAP at fixed IoU thresholds of 0.5 and 0.75.
  • Mean Average Recall at 1 detection per image and at 10 detections per image.
  • IoU between matched prediction-target pairs (single-threshold, 0.2).
  • Generalized IoU between matched pairs.
The run also reports precision, recall, and F1, but these are currently placeholders and will read as zero — don’t compare against them.
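The mAP and MAR values above follow the standard COCO definitions, which torchmetrics also implements, so a local comparison can lean on it. A minimal sketch of a standard torchmetrics call (not necessarily the exact code path the client uses):

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(box_format="xyxy", iou_type="bbox")

preds = [{
    "boxes": torch.tensor([[50.0, 50.0, 150.0, 150.0]]),   # one predicted box
    "scores": torch.tensor([0.9]),
    "labels": torch.tensor([0]),
}]
targets = [{
    "boxes": torch.tensor([[55.0, 45.0, 145.0, 155.0]]),   # one ground-truth box
    "labels": torch.tensor([0]),
}]

metric.update(preds, targets)
result = metric.compute()
print(result["map"], result["map_50"], result["map_75"], result["mar_1"], result["mar_10"])
```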
Inference output
  • Per image: a list of predicted bounding boxes, their confidence scores, and their predicted class labels. Boxes are returned in the same coordinate frame the model was trained on.

Image segmentation

Frameworks: PyTorch

Input
  • An image and a corresponding mask file per row, supplied through the dataset metadata as data_id (image file) and mask_id (mask file).
  • Mask files are read from disk; class indices are derived as described under Mask handling below.
Preprocessing
  • The image and mask are resized to a square at the size you set in the notebook (default 256). Aspect ratio is not preserved — the resize is a direct stretch.
  • The image uses bilinear resampling; the mask uses nearest-neighbor so class indices stay integers. This is the most common reproduction mistake — bilinear on a mask invents non-existent classes.
  • Image pixel values are scaled to [0, 1]. ImageNet mean/std normalization is not applied by default in the segmentation pipeline.
  • The augmentation flags you set in the notebook drive a joint image-and-mask augmentation pipeline that runs on the training split only. The same geometric transform is applied to the image and to its mask so per-pixel labels stay aligned. Validation always sees the unaugmented preprocessing.
  • Train/validation split is random (non-stratified, deterministic seed). The platform uses your validation_split value, with two safety nets: a one-row dataset reuses the same data for train and val instead of crashing, and if your chosen split produces a degenerate partition the run silently retries with 80/20.
Mask handling

Mask files are first read as grayscale, even if they are RGB on disk (so a multi-color RGB-encoded mask is effectively flattened before the class-index lookup). The grayscale pixel values are then mapped to class indices:
  • Binary problems (2 classes): the mask is thresholded at the midpoint of the 8-bit range — pixel values above 127 become class 1, the rest become class 0.
  • Multi-class problems: the first num_classes sorted unique pixel values in the mask file are treated as the canonical encodings and mapped to 0..N-1 in sorted order. Extra unique values that come from JPEG noise or anti-aliased edges are snapped to the nearest canonical neighbor, so every pixel ends up in [0, num_classes).
A user who encodes their masks differently locally (one-hot, RGB-color → class table, etc.) will not get the same loss numbers. Match this exact mapping when reproducing.
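A minimal sketch of that grayscale-to-class-index mapping for local replication (illustrative code, not the client's; it assumes the mask has already been read as grayscale and resized with nearest-neighbor):

```python
import numpy as np

def mask_to_class_indices(gray_mask, num_classes):
    """Map 8-bit grayscale mask values to class indices as described above."""
    gray_mask = np.asarray(gray_mask)
    if num_classes == 2:
        # Binary: threshold at the midpoint of the 8-bit range.
        return (gray_mask > 127).astype(np.int64)
    # Multi-class: the first num_classes sorted unique values are canonical;
    # every pixel snaps to the index of its nearest canonical value.
    canonical = np.sort(np.unique(gray_mask))[:num_classes]
    dist = np.abs(gray_mask[..., None].astype(np.int32) - canonical.astype(np.int32))
    return dist.argmin(axis=-1).astype(np.int64)
```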
Training step
  1. Forward pass through the model. The pipeline accepts either a raw logits tensor or a dict-shaped output (the torchvision FCN / DeepLab family returns one), so torchvision-style models work without adaptation.
  2. The loss function you configured in the notebook is used (cross-entropy is the common choice for segmentation). If no loss is configured, cross-entropy is used as a fallback.
  3. If the model has an auxiliary classifier head (FCN / DeepLab with aux_loss=True), the total loss is main_loss + 0.4 × aux_loss, matching the torchvision reference recipe. The 0.4 weight is configurable.
  4. Backward pass and optimizer step.
  5. Per-batch monitoring metric: pixel accuracy — the fraction of pixels whose predicted class matches the ground-truth class.
Validation step
  • Same forward pass without backward. Predictions are taken as the argmax across the class dimension, producing a per-pixel class-index mask.
Cycle metrics
  • Pixel-level: pixel accuracy, mean pixel accuracy, IoU, mean IoU, frequency-weighted IoU, Dice
  • Boundary-aware: boundary IoU, boundary F1
  • Distance-based: Hausdorff distance, average surface distance
  • Classification-style (per-pixel, macro-averaged across classes): precision, recall, F1
  • Per-class IoU: one number per class that appeared in the cycle
A few definitions worth pinning down for local replication:
  • IoU here is the global Jaccard index across all pixels; mean IoU is the per-class IoU averaged across classes. They genuinely diverge on imbalanced data — pick the right one for your comparison.
  • Dice is macro-averaged across classes, computed on integer-class inputs (not one-hot).
  • Precision / recall / F1 are macro-averaged across classes.
Inference output
  • A per-image class-index mask at the configured image size.

Keypoint detection

Frameworks: PyTorch — three model families are supported: R-CNN-style keypoint detectors (KeypointRCNN), heatmap regressors, and direct coordinate regressors.

Input
  • Images and per-image keypoint annotations supplied through the dataset metadata. Each keypoint is an [x, y, visibility] triple; the visibility component is optional.
  • Keypoints with non-positive x or y are treated as missing or out-of-frame — they contribute an all-zero plane in the heatmap target instead of needing a separate mask channel.
Choosing the model family

You select the model family in the notebook. The platform’s loss, metrics, and per-batch monitoring all branch on this choice, so it has to match the model you uploaded.

Preprocessing
  • Two size knobs that do different things:
    • Image size for the model: the size you set in the notebook for what the model actually sees. Default 224. The image is resized to a square at this size, and keypoint coordinates are rescaled by the same factors so they stay aligned with the resized image. Aspect ratio is not preserved — the resize is a direct stretch.
    • PCK reference size: a separate size used only as the reference scale for the per-batch PCK threshold (the threshold is set to 20% of this size). Default 256. Changing it does not change what the model sees — only how strict the per-batch correctness threshold is.
  • Pixel values are scaled to [0, 1]. ImageNet mean/std normalization is not applied by default in the keypoint pipeline.
  • For the heatmap family, ground-truth heatmaps are generated as 2D Gaussian peaks centered on each keypoint, at the resized image size, with a fixed standard deviation of 2 pixels. They are generated after augmentation so the targets stay aligned with the augmented image.
  • The augmentation flags you set in the notebook drive a joint image-and-keypoint augmentation pipeline that runs on the training split only. The same geometric transform is applied to the image and to its keypoint coordinates so the labels stay consistent.
  • Train/validation split is random (non-stratified) and uses a deterministic seed. Default 85/15. If the split fails on a tiny dataset, the same data is reused for both train and val instead of crashing.
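For the heatmap family, the ground-truth target generation above can be sketched as follows (illustrative code, not the client's), with the fixed 2-pixel standard deviation and an all-zero plane for missing keypoints:

```python
import numpy as np

def make_heatmaps(keypoints, size, sigma=2.0):
    """keypoints: (K, 2) or (K, 3) array of [x, y(, visibility)] in resized-image pixels."""
    ys, xs = np.mgrid[0:size, 0:size]
    heatmaps = np.zeros((len(keypoints), size, size), dtype=np.float32)
    for k, kp in enumerate(keypoints):
        x, y = kp[0], kp[1]
        if x <= 0 or y <= 0:
            continue  # missing / out-of-frame keypoint keeps an all-zero plane
        heatmaps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps

hm = make_heatmaps(np.array([[120.0, 80.0, 1.0], [-1.0, -1.0, 0.0]]), size=224)
```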
Training step
  • R-CNN family: the model is run in training mode and returns its internal loss dict (region-proposal, classification, box-regression, keypoint losses). The platform sums these with equal weights and backpropagates. The loss function you set in the notebook is ignored for R-CNN — the model defines its own losses. A second forward pass in evaluation mode produces the per-image predictions used by the metrics layer.
  • Heatmap family: the model returns a heatmap tensor with one channel per keypoint. The loss function you configured in the notebook is applied between predicted and ground-truth heatmaps (mean squared error is the common choice). Per-batch keypoints are recovered by taking the argmax of each predicted heatmap channel.
  • Direct regression: the model returns keypoint coordinates directly. The loss function you configured in the notebook is applied between predicted and ground-truth coordinates.
A backward pass and optimizer step follow.

Per-batch monitoring metric

Percentage of Correct Keypoints (PCK): the fraction of predicted keypoints whose Euclidean distance to the ground truth is below 0.2 × PCK reference size (in pixels of the resized image).

Validation step
  • Same forward pass without backward. Predictions are stored for the cycle-level metrics layer to consume.
Cycle metrics
  • Detection-style: precision, recall, F1 — computed from a per-keypoint TP/FP/FN match against a configurable distance threshold.
  • Position error: Mean Per-Joint Position Error (MPJPE), the mean Euclidean distance between predicted and ground-truth keypoints in pixels of the resized image. Mean Absolute Error (MAE) is also reported.
  • COCO-style: Object Keypoint Similarity (OKS). Per-image scale is derived from the bounding box of the ground-truth keypoints, and per-keypoint sigma defaults to a uniform 0.05.
  • PCK at multiple thresholds (several fixed fractions of the reference scale).
  • Visibility accuracy: reported only when your dataset carries a visibility component on each keypoint.
Inference output
  • R-CNN: per-image predicted bounding boxes, confidence scores, class labels, and keypoints.
  • Heatmap: per-image heatmap stack; the predicted keypoint per channel is the argmax of that channel.
  • Direct regression: per-image (K, 2) keypoint coordinates.

Text classification

Frameworks: PyTorch — both HuggingFace Transformers and plain PyTorch models are supported. The platform detects at the start of training whether your model returns its own loss (HF style) or needs an external one (plain PyTorch) and routes accordingly.

Input
  • A text file per sample on disk (one .txt file per row), plus a metadata table that lists each text’s filename and its class label.
  • The platform looks up <dataset_path>/<filename>.txt for each row, so your filenames must match exactly.
Preprocessing
  • Each text is tokenized with your configured tokenizer. If you didn’t specify one, the platform falls back to your configured model ID, and finally to a default tokenizer.
  • Tokens are padded and truncated to your configured maximum sequence length (default 512). Padding happens at tokenization time, so all batches see fixed-shape inputs.
  • Label-to-index mapping is fixed in the first training cycle and persisted alongside your weights. Subsequent cycles and inference reuse the same mapping, so the same class always maps to the same logit position. When reproducing locally, use the saved mapping rather than your own ordering.
  • Train/validation split is stratified by label with a deterministic seed (default 80/20). If stratification fails because a class has too few examples, the run silently falls back to a non-stratified random split with the same seed.
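A minimal sketch of the tokenization step with a HuggingFace tokenizer; the tokenizer name here is illustrative, so use whatever your experiment was configured with:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative choice

# Pad and truncate to the configured maximum sequence length (default 512)
# so every batch has a fixed input shape.
encoded = tokenizer(
    ["an example document", "another one"],
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
# encoded["input_ids"] and encoded["attention_mask"] are what the model receives.
```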
Training step
  • HuggingFace-style models: the model is called with input_ids, attention_mask, and labels, and returns its own loss. Your notebook’s loss function is ignored on this path — the model defines it.
  • Plain PyTorch models: the model is called with input_ids only and returns logits. The platform applies the loss function you configured in the notebook to compute the training loss. Input dtype is automatically cast to match the model’s parameter dtype (float, half, or long).
A backward pass and optimizer step follow. Gradient clipping is applied during the backward pass.

Per-batch monitoring metric: accuracy — the fraction of texts whose argmax-predicted class matches the ground-truth class.

Validation step
  • Same forward pass without backward. Logits are retained on CPU so the cycle metrics layer can compute probability-based metrics from them.
Optional model adaptations

LoRA fine-tuning is available for compatible models. You configure the LoRA parameters (rank, alpha, dropout, and whether to use Q-LoRA for quantized fine-tuning) from the notebook; SDK starting defaults are rank 256, alpha 512, dropout 0.05, Q-LoRA off.

Cycle metrics
  • Classification basics (per-class, macro-averaged): precision, recall, F1.
  • F1 variants: F1 macro, F1 micro, F1 weighted.
  • Agreement metrics: Matthews correlation coefficient, Cohen’s kappa, quadratic weighted kappa.
  • Other classification: Hamming loss, Jaccard score (macro), F-beta at β = 0.5 and β = 2.0 (macro), specificity, negative predictive value (binary direct; multiclass macro-averaged), balanced accuracy.
  • Probability-based: AUC-ROC (binary on the positive class; multiclass one-vs-rest macro-averaged), AUC-PR (average precision; multiclass macro-averaged over one-hot encodings), Gini coefficient and normalized Gini, log loss, Brier score (multiclass squared-error form).
  • Top-k accuracy for problems with more than two classes: top-3 and top-5, reported only when k is strictly less than the number of classes.
  • Confusion matrix: produced with a fixed label order matching your dataset’s class list — pin to that order when comparing locally.
Each metric is computed independently; if one fails (for example, AUC-ROC on a single-class validation slice), it falls back to zero rather than failing the entire cycle.

Class weighting is not applied automatically for text classification. If your dataset is imbalanced, configure class weights in your loss function from the notebook.

Inference output
  • Per text: the predicted class index plus the softmax probability vector. The class-index ordering follows your dataset’s class list — match that ordering when comparing locally.

Tabular classification

Frameworks: PyTorch, TensorFlow, and any scikit-learn-compatible estimator (including XGBoost and LightGBM).

Input
  • A tabular file with feature columns plus a label column. The label column name is configurable; categorical feature values can be strings.
Preprocessing

The preprocessing pipeline runs in this order, and the full set of fitted statistics (which columns to use, imputation values, category mappings, label-to-index map, scaling means and standard deviations) is frozen in the first training cycle and reused in subsequent cycles and at inference. When reproducing a run locally, pull these statistics from the experiment artifacts rather than refitting on your own data slice.
  1. Column selection. You can configure which columns the model sees from the notebook — either an include list, an exclude list, or derived feature definitions. If you don’t configure anything, all columns from the dataset’s schema are used.
  2. Missing value imputation. Numeric columns are filled with the median of the training split. Categorical columns are filled with the literal string "Unknown". The label column is not imputed.
  3. Binary encoding. Columns with exactly two distinct values that look like booleans (Y/N, YES/NO, TRUE/FALSE, 1/0, plus literal Python booleans) are auto-encoded to 0/1 integers. The truthy and falsy strings are configurable.
  4. Categorical encoding. Two strategies, configurable from the notebook:
    • Label encoding (default): each distinct string in a categorical column maps to a small integer based on the order it first appears in the training split. Categories not seen during training map to -1 at inference.
    • One-hot encoding: each distinct string becomes its own 0/1 column.
  5. Label-to-index mapping. Class labels are mapped to integer indices in the order defined by your dataset’s class list, so a class always lines up with the same logit position. By default, encountering a label that isn’t in the class list fails the run; this strict check is configurable.
  6. Numeric feature scaling. Numeric feature columns are z-scored (mean 0, standard deviation 1) using training-split statistics; the label column is excluded. On by default and can be turned off.
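A minimal pandas sketch of steps 2, 4 (label-encoding strategy), and 6, fitted on the training split and then reused on new data (illustrative code with made-up column names, not the client's):

```python
import pandas as pd

train = pd.DataFrame({"age": [34, None, 51],
                      "city": ["Berlin", "Paris", None],
                      "label": ["a", "b", "a"]})

# 2. Imputation: numeric columns get the training-split median, categoricals "Unknown".
age_median = train["age"].median()
train["age"] = train["age"].fillna(age_median)
train["city"] = train["city"].fillna("Unknown")

# 4. Label encoding (default strategy): integers in first-appearance order on the
#    training split; categories unseen at training time map to -1 later.
city_map = {c: i for i, c in enumerate(train["city"].drop_duplicates())}
train["city"] = train["city"].map(city_map)

# 6. Z-score numeric feature columns with training-split statistics (label excluded).
age_mean, age_std = train["age"].mean(), train["age"].std()
train["age"] = (train["age"] - age_mean) / age_std

def transform(df):
    """Apply the frozen statistics to validation or inference data."""
    out = df.copy()
    out["age"] = (out["age"].fillna(age_median) - age_mean) / age_std
    out["city"] = out["city"].fillna("Unknown").map(city_map).fillna(-1).astype(int)
    return out
```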
Train/validation split

Stratified by class label with a deterministic seed. Default 85/15. If the configured split would leave one side empty, the run silently retries with the ratio clamped into a safe range.

Training step
  • PyTorch and TensorFlow: a forward pass produces a logit per class. The loss function you configured in the notebook is applied; cross-entropy is the typical choice. A backward pass and optimizer step follow.
  • scikit-learn: training is a single fit(X, y) call per batch using the estimator’s built-in objective. There is no separate forward / backward pass and no notebook-configured loss on this path.
Class weighting is applied automatically across all three frameworks. For cross-entropy and similar log-likelihood losses, each class is weighted in inverse proportion to its training-split frequency, normalized so the weights average to 1 — so balanced datasets effectively pass through unchanged, and imbalanced ones get the rarer classes upweighted.

Per-batch monitoring metric: accuracy — the fraction of rows whose predicted class matches the ground-truth class.

Validation step
  • Same forward pass without backward. Predictions and raw logits are retained so the cycle metrics layer can compute probability-based metrics from them.
Cycle metrics
  • Classification basics (per-class, macro-averaged): precision, recall, F1.
  • Other classification metrics: balanced accuracy, F-beta at β = 0.5 and β = 2.0 (macro), Matthews correlation coefficient, Cohen’s kappa, quadratic weighted kappa, Hamming loss, Jaccard score (macro), specificity, negative predictive value (binary direct; multiclass macro-averaged).
  • Probability-based (when raw logits are available): AUC-ROC (binary on the positive class; multiclass one-vs-rest macro), AUC-PR (average precision; multiclass macro over one-hot), Gini coefficient and normalized Gini, Brier score (multiclass squared-error form).
  • Confusion matrix: produced with a fixed label order matching your dataset’s class list — pin to that order when comparing locally.
Each metric is computed independently; if one fails, it falls back to zero rather than crashing the cycle.

Inference output
  • Per row: the predicted class index plus the predicted probability vector (softmax for multiclass, sigmoid for binary). The class-index ordering follows your dataset’s class list — match that ordering when comparing locally.

Tabular regression

Frameworks: PyTorch, TensorFlow, and any scikit-learn-compatible regressor (including XGBoost and LightGBM).

Input
  • A tabular file with feature columns plus a continuous target column. The target column name is configurable.
  • Rows whose target is missing or non-finite are dropped before training so they can’t poison gradients.
Preprocessing

The feature pipeline is the same as tabular classification — column selection (or full schema if you don’t configure one), median / "Unknown" imputation, binary encoding, categorical encoding (label or one-hot), and z-scoring of numeric feature columns. There are two regression-specific differences:
  • Label-to-index mapping is skipped. The target stays numeric.
  • Target scaling. By default, the target column is also z-scored using training-split statistics (mean and standard deviation). The platform stores the scaling parameters alongside your weights and inverse-transforms predictions and labels back to the original target scale before computing cycle metrics — so the reported error numbers are in your data’s original units, not in the z-scored space the loss is computed in. Target scaling can be turned off from the notebook; if you turn it off but leave feature scaling on, and your target has a much wider range than your scaled features, you’ll see a warning in the run log because the loss will dominate strangely.
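A minimal sketch of the target z-scoring and the inverse transform applied before cycle metrics (illustrative code; for a real comparison, take the stored scaling parameters from your experiment artifacts instead of refitting):

```python
import numpy as np

y_train = np.array([120.0, 340.0, 95.0, 410.0])   # illustrative raw targets

# Fit the target scaler on the training split and freeze it.
target_mean, target_std = y_train.mean(), y_train.std()
y_train_scaled = (y_train - target_mean) / target_std   # space the loss is computed in

# At cycle-metric time, predictions (and labels) go back to original units first.
preds_scaled = np.array([-0.8, 0.9])                    # illustrative model outputs
preds_original = preds_scaled * target_std + target_mean
```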
The fitted preprocessing state (column choices, imputation values, category mappings, feature scaling stats, target scaling stats) is frozen in the first training cycle and reused in subsequent cycles and at inference.

Train/validation split

Random (non-stratified — the target is continuous), with a deterministic seed. Default 85/15. If the configured split would leave one side empty, the run silently retries with the ratio clamped into a safe range.

Training step
  • PyTorch and TensorFlow: a forward pass produces a continuous prediction per row (or per timestep, for sequence-shaped outputs — the platform takes the last timestep). The loss function you configured in the notebook is applied; MSE is the typical default, with MAE, smooth L1, and Huber as common alternatives. A backward pass and optimizer step follow. Gradient clipping is applied only when the global gradient norm exceeds 10, then clipped to 10 — so well-behaved training runs see no clipping and unstable runs are kept from blowing up.
  • scikit-learn: training is a single fit(X, y) call using the estimator’s built-in objective. The platform fits the estimator once in the first training batch of the first cycle; subsequent cycles only run prediction. If you want a sklearn regressor that actually updates across federated cycles, choose one that supports incremental / warm-start fitting.
Per-batch monitoring metric: R² (coefficient of determination), accumulated from the running residual sum of squares and target variance.

Validation step
  • Same forward pass without backward. Predictions are stored, then inverse-transformed alongside ground-truth targets at cycle metric time so the reported metrics are in original units.
Cycle metrics

All values are reported in the original target scale (after the inverse transform if target scaling was on):
  • Standard error metrics: mean absolute error, mean squared error, root mean squared error, median absolute error, max absolute error.
  • Percentage error metrics: mean absolute percentage error (computed only over rows whose true value is non-zero — entirely-zero validation slices return NaN), symmetric mean absolute percentage error.
  • Goodness of fit: R², explained variance.
  • Bias: mean bias error (mean of prediction − target — positive means systematic over-prediction).
  • Log-scale error: root mean squared log error, computed only when both predictions and targets are non-negative; otherwise NaN.
Rows with NaN in either prediction or target are filtered out before metrics are computed (with a warning if any rows were dropped). If every row is NaN, all metrics return NaN rather than crashing.

Inference output
  • Per row: a single continuous prediction in the original target scale (inverse-transformed if target scaling was on during training).

Time series forecasting

Frameworks: PyTorch

Input
  • A time-indexed table with one timestamp column, one or more feature columns, and a continuous target column. Rows are expected to arrive in chronological order.
Preprocessing
  • Feature and target scaling. Both the feature columns and the target column are scaled using statistics fit on the training window only, then re-applied to the validation window and to inference data. The choice of scaler is configurable from the notebook (Min-Max scaling or standard z-scoring); Min-Max is the default. The fitted scaler instances are persisted alongside your weights and reused in subsequent cycles and at inference, so a federated run keeps a consistent scale across cycles.
  • Sliding-window construction. From the chronologically ordered, scaled rows the platform builds sliding-window samples: each input is a window of sequence-length consecutive steps (the lookback you set in the notebook, default 60), and each target covers the next forecast-horizon steps (default 1). With a single-step horizon the target is scalar; with a longer horizon it’s a vector of that length.
  • Auto-adjusted sequence length. If your training or validation window is too short to fit even one full lookback-plus-horizon sample, the platform shortens the sequence length to the largest feasible value rather than crashing, and logs a warning. Forecast horizon is never silently shrunk — that’s part of your experiment contract.
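A minimal numpy sketch of the sliding-window construction with the default lookback of 60 and horizon of 1 (illustrative code, not the client's):

```python
import numpy as np

def make_windows(features, target, seq_len=60, horizon=1):
    """features: (T, F) scaled rows in chronological order; target: (T,) scaled target."""
    X, y = [], []
    for start in range(len(features) - seq_len - horizon + 1):
        X.append(features[start:start + seq_len])                    # lookback window
        y.append(target[start + seq_len:start + seq_len + horizon])  # next horizon steps
    return np.stack(X), np.squeeze(np.stack(y))  # scalar targets when horizon == 1

features = np.random.rand(200, 4)   # 200 timesteps, 4 features
target = np.random.rand(200)
X, y = make_windows(features, target)
print(X.shape, y.shape)             # (140, 60, 4) (140,)
```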
Train/validation split

Temporal (no shuffling) — the validation window is strictly later in time than the training window so future leakage is impossible. Default 80/20.

Training step
  1. The forward pass takes a batch of sequences and returns predictions shaped to the forecast horizon. The platform automatically reshapes outputs to match the target shape (squeezing or broadcasting trailing singleton dimensions).
  2. The loss function you configured in the notebook is applied; mean squared error is the typical default, with mean absolute error, smooth L1, and Huber as common alternatives.
  3. A backward pass and optimizer step follow.
Per-batch monitoring metric: mean absolute error between predicted and target sequences (in the scaled space the model trains on).

Validation step
  • Same forward pass without backward. Predictions are stored for the cycle metrics layer.
Cycle metrics

All metrics are computed on the scaled target space the model trains in (no inverse transform is applied at metric time today — interpret error magnitudes accordingly):
  • Standard error metrics: mean absolute error, mean squared error, root mean squared error, max absolute error.
  • Goodness of fit: R² (returns NaN on degenerate slices, e.g. constant targets).
  • Percentage errors: mean absolute percentage error (skips rows whose true value is near zero), median absolute percentage error (robust to outliers), symmetric MAPE.
  • Direction accuracy: the percentage of consecutive timestep pairs where the predicted change has the same sign as the actual change — a “did the model get the trend right” metric.
  • Theil’s U: a normalized error statistic comparing predicted vs. actual change between consecutive steps. Lower is better.
Each metric is wrapped in error handling — degenerate inputs return NaN rather than failing the cycle.

Inference output
  • Per input window: a predicted sequence of length equal to the forecast horizon, in the model’s scaled space. Use the saved scaler from the experiment artifacts to invert the predictions back to the original target scale.

Time-to-event (survival)

Frameworks: PyTorch (a neural risk model trained with the Cox partial-likelihood loss), plus lifelines and scikit-survival (classical survival estimators that fit in a single pass and skip the neural training loop).

Input
  • A tabular file with feature columns plus two target columns: a duration column (time until the event, or until censoring) and an event indicator column (1 if the event was observed, 0 if the row was censored).
Preprocessing
  • A small set of internal bookkeeping columns is dropped from the feature set automatically (filename, IDs, timestamps, etc.) so all three frameworks see the same feature columns.
  • For the PyTorch path, the same tabular preprocessing pipeline used by tabular classification and regression is applied: median / "Unknown" imputation, binary and categorical encoding, and z-scoring of numeric features. The fitted state is frozen in cycle 1 and reused thereafter.
  • For the lifelines and scikit-survival paths, the data is passed through to the estimator unmodified — those libraries handle their own preprocessing internally.
Train/validation split

Random split with a deterministic seed. Default 85/15. The split is not stratified by the event indicator (matching the legacy behavior). The Cox loss sorts by duration internally at every step, so the order rows arrive in does not affect the loss.

Training step
  • PyTorch: the model takes the feature matrix and produces a single risk score per row (higher = worse prognosis). The loss is Cox partial log-likelihood — a survival-specific loss that ranks each observed event against everyone who was still at risk at that event time. The loss is hardcoded for this use case; the loss function you configured in the notebook is ignored on this path because Cox is the only canonical choice. A backward pass and optimizer step follow. Gradient clipping is applied only when the global gradient norm exceeds 10, then clipped to 10 — Cox loss can spike on small batches when one event dominates the risk set, but well-behaved batches see no clipping.
  • Lifelines / scikit-survival: a single fit call on the full training slice, using the estimator’s own optimization. There is no separate forward / backward pass.
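For the PyTorch path, the Cox partial log-likelihood can be sketched as below (Breslow-style handling of ties; treat this as a reference formula rather than the client's exact implementation):

```python
import torch

def cox_partial_log_likelihood_loss(risk, duration, event):
    """risk: (N,) scores, higher = worse prognosis; duration: (N,) times; event: (N,) 1/0."""
    order = torch.argsort(duration, descending=True)   # risk set becomes a running prefix
    risk, event = risk[order], event[order]
    log_risk_set = torch.logcumsumexp(risk, dim=0)     # log sum of exp(risk) over the risk set
    partial_ll = (risk - log_risk_set) * event         # only observed events contribute
    return -partial_ll.sum() / event.sum().clamp(min=1)

loss = cox_partial_log_likelihood_loss(
    risk=torch.randn(8),
    duration=torch.rand(8) * 100,
    event=torch.tensor([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]),
)
```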
Validation step
  • PyTorch: same forward pass without backward; risk scores are stored for the cycle metric.
  • Lifelines / scikit-survival: predictions are taken from the fitted estimator’s standard prediction interface.
Cycle metric
  • Concordance index (C-index) — the standard survival metric. It measures the fraction of comparable pairs of samples that the model ranks correctly: among pairs where you can tell from the data which subject experienced the event sooner, the C-index is the fraction the model also ranks in that order. A C-index of 1.0 means perfect ranking; 0.5 means random; and below 0.5 means the model is ranking inversely. Computed on the entire validation slice at once (it cannot be averaged per-batch). The platform negates risk scores when calling the underlying library to handle the convention difference — internally, higher = sooner-event for the model, but the library expects higher = longer-survival.
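For a local check, lifelines' concordance_index expects higher scores to mean longer survival, which is why the sign flip below mirrors the negation described above. A standard lifelines call with illustrative data (not necessarily the client's code path):

```python
import numpy as np
from lifelines.utils import concordance_index

durations = np.array([5.0, 12.0, 30.0, 42.0])
events = np.array([1, 1, 0, 1])                  # 1 = event observed, 0 = censored
risk_scores = np.array([2.1, 1.4, 0.2, -0.5])    # model output: higher = earlier predicted event

# Negate so that a higher value passed to the library means longer predicted survival.
c_index = concordance_index(durations, -risk_scores, events)
print(c_index)
```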
Inference output
  • Per row: a single risk score (higher means earlier predicted event). The C-index is also reported over the test set.

Reproducing a run locally

Expect small variation, even with everything matched. Two runs of the same script on the same data on the same machine can produce slightly different metric values — that’s a property of modern deep-learning stacks, not a tracebloc-specific quirk. Reproducing a tracebloc run on your own machine compounds the same effects, so plan to compare numbers within a tolerance band (typically the second or third decimal place for accuracy-style metrics, a few percent for percentage errors), not exactly.

Common reasons numbers move:
  • Hardware differences. GPU vs CPU, different GPU models, and different CUDA / cuDNN versions execute the same operations through different kernels. Sums of floating-point numbers are not associative, so reductions on different hardware can produce slightly different last-decimal-place values.
  • GPU non-determinism. Several common operations (some convolution backward passes, some scatter/gather kernels, atomic accumulation) are not deterministic by default — running the same forward/backward twice on the same GPU can produce different gradients.
  • Library versions. Different versions of PyTorch, TensorFlow, scikit-learn, and torchmetrics can change defaults, fix bugs, or alter numerical paths in ways that move the final numbers a little.
  • Data-loader worker timing. When the data loader uses multiple worker processes, the order batches actually arrive in can depend on process scheduling — different orderings produce slightly different gradient sequences and slightly different end-of-epoch state, even with the same shuffle seed.
  • Federated averaging. A tracebloc run trains across multiple federated cycles in which model weights are averaged across edges between cycles. A single-machine local run cannot reproduce that averaging step exactly — for multi-cycle and multi-edge experiments, the platform’s cycle-end weights and your local cycle-end weights will diverge after the first averaging round.
  • Stateful layers. Batch normalization’s running statistics, dropout masks, and any other stochastic layer state depend on batch order and initialization, both of which are sensitive to the points above.
  • Mixed precision. If your local run uses different mixed-precision settings than the platform did, you’ll see small differences from rounding alone.
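To shrink the GPU non-determinism and seeding effects above in your own script, the usual PyTorch switches help. A minimal sketch that narrows, but does not eliminate, run-to-run variation (and does not change what the platform itself did):

```python
import random
import numpy as np
import torch

def make_local_run_more_deterministic(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Prefer deterministic kernels where they exist; warn instead of raising
    # for ops that have no deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```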
If your local numbers land within a reasonable band of the platform’s, the run reproduced. If they diverge by a meaningful margin, please open a support ticket — that signals a real mismatch (likely preprocessing, data-slice, or configuration drift) worth investigating together.
To validate a result you saw on the platform:
1. Take the same data slice

Use the same dataset and the same train/validation split ratio you configured. Match the split strategy for your use case — stratified by label for image and tabular classification, deduplicated by image for object detection, temporal (no shuffle) for time series, and so on.
2. Apply the same preprocessing

Match the preprocessing described in the section for your use case — especially feature scaling, target scaling (for regression and time series), and categorical encoding, all of which materially shift loss values. For use cases where the preprocessing state is frozen in the first cycle (tabular, time series, time-to-event), pull the saved statistics from your experiment artifacts rather than refitting on your slice.
3. Use the same model file

Run the same model architecture, and the same pre-trained weights if you started from any.
4. Match the experiment configuration

Use the same loss, optimizer, learning rate, batch size, epochs, cycles, sequence length, augmentation flags, etc. — read them off the experiment view or your notebook. The shared parameter table above lists where the platform falls back when something is unset.
5. Compute the same metrics

Match the metric definitions called out per use case (averaging convention, threshold choices, scaled vs. original target space). Each use-case section above lists the exact set of metrics the platform reports for that use case.
6. Compare cycle-level numbers

The numbers shown in the platform UI are cycle-level aggregates, not per-batch. Run a full cycle locally before comparing.
The platform code is open source at tracebloc/tracebloc-client — if a number on your end doesn’t line up with what the platform reports, please open a support ticket so we can investigate together.