This guide walks you through the four key steps to create, publish, and manage an AI use case on the tracebloc platform. Before you start, make sure you have a tracebloc client running and your data is ingested. Then navigate to the use cases section, click the “+” in the top right corner, and follow along. Use this documentation for context, clarification, and examples when needed.

Step 1: Initialize and Set Privacy

Objective: Define the basics and visibility of your AI use case.
  • Title
  • Cover Image (optional): JPG or PNG (max. 25MB)
  • Task: Select the task, e.g. “Image Classification”, “Object Detection”, or “Tabular Classification”. See the full list of supported data types and tasks. If your use case is not yet supported, please reach out to us at [email protected].
  • Privacy Type: Choose Public (visible to all users) or Private (invite-only visibility).
Preview your use case tile on the right side of the interface.
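
The cover image limits listed above can be checked locally before uploading. The following is a minimal sketch, not part of the tracebloc client: the function name `cover_image_problems` is hypothetical, accepting `.jpeg` alongside `.jpg` is an assumption, and so is reading "25MB" as 25 × 10^6 bytes.

```python
from pathlib import Path

# Limits taken from the Cover Image field above.
# Assumptions: ".jpeg" is accepted like ".jpg", and 1 MB means 10**6 bytes.
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png"}
MAX_BYTES = 25 * 10**6


def cover_image_problems(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file passes both checks."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append("unsupported file type")
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 25 MB")
    return problems


print(cover_image_problems("tile.png", 3_000_000))   # passes both checks
print(cover_image_problems("tile.gif", 30_000_000))  # fails both checks
```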

Step 2: Data & Evaluation

Objective: Attach datasets and define the benchmarking logic.
  • Training and test metadataset: A metadataset is a reference to an ingested dataset that stores summary information such as the number of samples and columns. Select the training and test metadatasets that correspond to the datasets you ingested in the Prepare Data step.
  • Score: Define which metric to use for evaluation (e.g. Accuracy or F1). If your evaluation metric is not yet supported, please reach out to us at [email protected].
  • Upload EDA File (optional): Attach a .ipynb EDA file to help participants understand the data context. Explore template use cases for inspiration.

Supported Metrics per Data Type and Task

The tables below list all evaluation metrics grouped by task type. Each metric is ranked on the leaderboard in one of two sort orders: “Higher is better” or “Lower is better”.

Image Classification

| Metric | Description | Sort Order |
| --- | --- | --- |
| Accuracy | Proportion of predictions that exactly match the ground truth. Can be misleading on imbalanced datasets. | Higher is better |
| Precision | Measures the proportion of predicted positives that are actually positive. Important when false positives are costly. | Higher is better |
| Recall | Measures the proportion of actual positives correctly identified. Important when missing a positive instance is costly. | Higher is better |
| F1 Score | Balances precision and recall into a single metric. Especially useful for imbalanced datasets. | Higher is better |
| Loss | Quantifies the error between predicted outputs and actual values. A core metric used during training and optimization. | Lower is better |
| Log Loss | Measures how well a model predicts probability estimates for each class. Penalizes overconfident incorrect predictions. | Lower is better |
| AUC-ROC | Measures ability to distinguish between classes across all thresholds, independent of any single decision threshold. | Higher is better |
| AUC-PR | Measures precision-recall balance across different thresholds. Especially useful for highly imbalanced datasets. | Higher is better |
| Top-3 Accuracy | Measures how often the true class label appears among the model’s top three predictions. | Higher is better |
| Top-5 Accuracy | Measures how often the true class label appears among the model’s top five predictions. | Higher is better |
| Cohen’s Kappa | Measures agreement between predicted and ground truth labels while accounting for chance agreement. | Higher is better |
| Matthews Correlation Coefficient (MCC) | Classification quality using all parts of the confusion matrix. Balanced even with imbalanced classes. Ranges from -1 to 1. | Higher is better |
| Quadratic Weighted Kappa (QWK) | Measures agreement between predicted and ground truth labels, penalizing larger disagreements more heavily. | Higher is better |
| Brier Score | Mean squared difference between predicted probabilities and actual outcomes. | Lower is better |
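
To see why accuracy can mislead on imbalanced data while precision, recall and F1 do not, the following sketch computes the four metrics for binary 0/1 labels using their textbook definitions. The function name `binary_metrics` is illustrative, and the platform's own implementation (for example its averaging over multiple classes) may differ.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Define the ratios as 0.0 when their denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Imbalanced toy data: 2 positives out of 10 samples.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
always_negative = [0] * 10

baseline = binary_metrics(y_true, always_negative)
print(baseline)
```

A model that always predicts the negative class reaches 80% accuracy here, yet its recall and F1 are both zero because it never finds a positive sample.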

Text Classification

| Metric | Description | Sort Order |
| --- | --- | --- |
| Accuracy | Proportion of predictions that exactly match the ground truth. Can be misleading on imbalanced datasets. | Higher is better |
| Precision | Measures the proportion of predicted positives that are actually positive. Important when false positives are costly. | Higher is better |
| Recall | Measures the proportion of actual positives correctly identified. Important when missing a positive instance is costly. | Higher is better |
| F1 Score | Balances precision and recall into a single metric. Especially useful for imbalanced datasets. | Higher is better |
| F1 Weighted | Weights each class’s F1 Score by its support (number of true instances). Suitable for imbalanced multi-class classification. | Higher is better |
| Micro F1 | Aggregates true positives, false positives, and false negatives across all classes, treating every prediction equally. | Higher is better |
| Loss | Quantifies the error between predicted outputs and actual values. A core metric used during training and optimization. | Lower is better |
| Log Loss | Measures how well a model predicts probability estimates for each class. Penalizes overconfident incorrect predictions. | Lower is better |
| AUC-ROC | Measures ability to distinguish between classes across all thresholds, independent of any single decision threshold. | Higher is better |
| Hamming Loss | Measures the fraction of labels incorrectly predicted. Commonly used in multi-label classification tasks. | Lower is better |
| Jaccard Score | Measures similarity between predicted and ground truth labels by comparing their intersection to their union. | Higher is better |
| Cohen’s Kappa | Measures agreement between predicted and ground truth labels while accounting for chance agreement. | Higher is better |
| Matthews Correlation Coefficient (MCC) | Classification quality using all parts of the confusion matrix. Balanced even with imbalanced classes. Ranges from -1 to 1. | Higher is better |
| Quadratic Weighted Kappa (QWK) | Measures agreement between predicted and ground truth labels, penalizing larger disagreements more heavily. | Higher is better |
| Brier Score | Mean squared difference between predicted probabilities and actual outcomes. | Lower is better |
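
The difference between Micro F1 and F1 Weighted is easiest to see on a small multi-class example. The sketch below follows the standard definitions for single-label classification; the helper names are illustrative and the platform's exact computation is not documented here, so treat it as an approximation.

```python
from collections import Counter


def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    """Pool the counts of all classes, then compute a single F1."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    return f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))


def weighted_f1(y_true, y_pred):
    """Compute F1 per class, then average weighted by class support."""
    tp, fp, fn = per_class_counts(y_true, y_pred)
    support = Counter(y_true)
    return sum(n * f1_from_counts(tp[c], fp[c], fn[c]) for c, n in support.items()) / len(y_true)


y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "a"]
print(round(micro_f1(y_true, y_pred), 4), round(weighted_f1(y_true, y_pred), 4))
```

For single-label data, Micro F1 equals plain accuracy (4 of 6 correct here), while F1 Weighted is pulled down by class "c", which is never predicted correctly.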

Tabular Classification

| Metric | Description | Sort Order |
| --- | --- | --- |
| Accuracy | Proportion of predictions that exactly match the ground truth. Can be misleading on imbalanced datasets. | Higher is better |
| Precision | Measures the proportion of predicted positives that are actually positive. Important when false positives are costly. | Higher is better |
| Recall | Measures the proportion of actual positives correctly identified. Important when missing a positive instance is costly. | Higher is better |
| F1 Score | Balances precision and recall into a single metric. Especially useful for imbalanced datasets. | Higher is better |
| Loss | Quantifies the error between predicted outputs and actual values. A core metric used during training and optimization. | Lower is better |
| Log Loss | Measures how well a model predicts probability estimates for each class. Penalizes overconfident incorrect predictions. | Lower is better |
| AUC | Measures ability to distinguish between positive and negative classes across all classification thresholds. | Higher is better |
| AUC-ROC | Measures ability to distinguish between classes across all thresholds, independent of any single decision threshold. | Higher is better |
| AUC-PR | Measures precision-recall balance across different thresholds. Especially useful for highly imbalanced datasets. | Higher is better |
| Balanced Accuracy | Averages recall across all classes, ensuring each class contributes equally regardless of frequency. | Higher is better |
| Specificity (True Negative Rate) | Measures the proportion of actual negatives correctly identified. Important when false positives must be minimized. | Higher is better |
| NPV (Negative Predictive Value) | Measures the proportion of predicted negatives that are actually negative. Important when confirming absence matters. | Higher is better |
| F-beta Score (beta = 0.5) | Balances precision and recall with more emphasis on precision. Suitable when false positives are more costly. | Higher is better |
| F-beta Score (beta = 2) | Balances precision and recall with more emphasis on recall. Suitable when missing positive instances is more costly. | Higher is better |
| Hamming Loss | Measures the fraction of labels incorrectly predicted. Commonly used in multi-label classification tasks. | Lower is better |
| Jaccard Score | Measures similarity between predicted and ground truth labels by comparing their intersection to their union. | Higher is better |
| Cohen’s Kappa | Measures agreement between predicted and ground truth labels while accounting for chance agreement. | Higher is better |
| Matthews Correlation Coefficient (MCC) | Classification quality using all parts of the confusion matrix. Balanced even with imbalanced classes. Ranges from -1 to 1. | Higher is better |
| Quadratic Weighted Kappa (QWK) | Measures agreement between predicted and ground truth labels, penalizing larger disagreements more heavily. | Higher is better |
| Brier Score | Mean squared difference between predicted probabilities and actual outcomes. | Lower is better |
| Gini Coefficient | Measures discriminatory power between positive and negative classes. Closely related to AUC-ROC. | Higher is better |
| Normalized Gini | Scales the Gini Coefficient relative to a perfect model, enabling fair comparison across datasets. Ranges from -1 to 1. | Higher is better |
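
The relationship between AUC-ROC and the Gini Coefficient mentioned above is commonly written as Gini = 2 × AUC − 1. The sketch below computes AUC as the fraction of positive/negative pairs that the model ranks correctly and derives Gini from it. This is the usual textbook formulation, assumed here for illustration; the platform's implementation may differ in details such as tie handling.

```python
def auc_roc(y_true, scores):
    """AUC as the probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def gini(y_true, scores):
    """Gini coefficient derived from AUC: 0 for random ranking, 1 for perfect ranking."""
    return 2 * auc_roc(y_true, scores) - 1


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_roc(y_true, scores), gini(y_true, scores))
```

In this example three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75 and a Gini of 0.5.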

Object Detection

| Metric | Description | Sort Order |
| --- | --- | --- |
| Accuracy | Proportion of predictions that exactly match the ground truth. | Higher is better |
| Loss | Quantifies the error between predicted outputs and actual values. A core metric used during training and optimization. | Lower is better |
| Mean Average Precision (mAP) | Evaluates the quality of ranked predictions. Commonly used in object detection and ranking tasks. | Higher is better |
| Mean Average Precision @ IoU 0.50 | Evaluates object detection requiring at least 50% overlap between predicted and ground truth bounding boxes. | Higher is better |
| Mean Average Precision @ IoU 0.75 | Stricter variant requiring at least 75% overlap between predicted and ground truth bounding boxes. | Higher is better |
| mAP per Class | Reports Average Precision individually for each object class. Helps identify which classes the model struggles with. | Higher is better |
| Intersection over Union (IoU) | Measures overlap between predicted and ground truth regions. Used for localization accuracy. | Higher is better |
| GIoU (Generalized IoU) | Extends standard IoU by penalizing non-overlapping predictions using the smallest enclosing box. Ranges from -1 to 1. | Higher is better |
| Mean Average Recall @ 1 Detection | Measures how well the model retrieves ground truth objects when only the single top-scoring detection is allowed. | Higher is better |
| Mean Average Recall @ 10 Detections | Measures retrieval of ground truth objects when up to 10 detections per image are allowed. | Higher is better |
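
Several of the metrics above depend on the IoU between a predicted and a ground truth box. As a minimal sketch, assuming axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` (the box format is an assumption, not something this guide specifies), IoU can be computed as follows:

```python
def box_iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlapping rectangle, clamped at zero if the boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)  # shifted right by half the box width

iou = box_iou(ground_truth, prediction)
print(round(iou, 3))
```

The two boxes share half of their area each, but the IoU is only one third, so this prediction would not count as a match at the IoU 0.50 threshold.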

Semantic Segmentation

| Metric | Description | Sort Order |
| --- | --- | --- |
| Accuracy | Proportion of predictions that exactly match the ground truth. | Higher is better |
| Precision | Measures the proportion of predicted positives that are actually positive. | Higher is better |
| Recall | Measures the proportion of actual positives correctly identified. | Higher is better |
| F1 Score | Balances precision and recall into a single metric. | Higher is better |
| Loss | Quantifies the error between predicted outputs and actual values. | Lower is better |
| Intersection over Union (IoU) | Measures overlap between predicted and ground truth regions. | Higher is better |
| Mean Intersection over Union (mIoU) | Averages IoU across all classes, evaluating how well a model predicts each class region. | Higher is better |
| Frequency Weighted IoU | Weights each class’s IoU by its relative frequency in the ground truth, giving more importance to dominant classes. | Higher is better |
| Dice Coefficient | Measures similarity between predicted and ground truth segmentation regions. Especially sensitive to small structures. | Higher is better |
| Pixel Accuracy | Proportion of correctly classified pixels across the entire image. Can be dominated by frequent classes. | Higher is better |
| Mean Pixel Accuracy | Averages pixel accuracy per class, giving equal importance to all classes regardless of frequency. | Higher is better |
| Boundary IoU | Measures how well predicted segmentation boundaries align with ground truth boundaries. Focuses on edge accuracy. | Higher is better |
| Boundary F1 Score | Combines boundary precision and recall to evaluate how accurately predicted boundaries match ground truth edges. | Higher is better |
| Hausdorff Distance | Maximum distance between predicted and ground truth boundaries. Captures the worst-case boundary mismatch. | Lower is better |
| Average Surface Distance (ASD) | Average distance between predicted and ground truth boundary points. Stable measure of overall boundary alignment. | Lower is better |
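
The following sketch contrasts Pixel Accuracy, mIoU and the Dice Coefficient on a tiny flattened mask. It uses the standard set-based definitions and plain Python lists of class labels per pixel; the function names are illustrative and the platform's implementation may handle absent classes differently.

```python
def pixel_accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def class_iou(y_true, y_pred, cls):
    """IoU for one class; None if the class appears in neither mask."""
    inter = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    union = sum(1 for t, p in zip(y_true, y_pred) if t == cls or p == cls)
    return inter / union if union else None


def mean_iou(y_true, y_pred, classes):
    """Average the per-class IoU over all classes that occur in either mask."""
    ious = [class_iou(y_true, y_pred, c) for c in classes]
    ious = [v for v in ious if v is not None]
    return sum(ious) / len(ious)


def dice(y_true, y_pred, cls):
    """Dice = 2 * |A and B| / (|A| + |B|) for one class; None if the class is absent."""
    inter = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    size = sum(1 for t in y_true if t == cls) + sum(1 for p in y_pred if p == cls)
    return 2 * inter / size if size else None


# Flattened 8-pixel masks: class 0 is background, class 1 is a small foreground object.
y_true = [0, 0, 0, 0, 1, 1, 0, 0]
y_pred = [0, 0, 0, 1, 1, 0, 0, 0]

print(pixel_accuracy(y_true, y_pred))
print(round(mean_iou(y_true, y_pred, [0, 1]), 4))
print(dice(y_true, y_pred, 1))
```

Pixel Accuracy is 0.75 because the background dominates, while the foreground IoU is only 1/3 and pulls mIoU down to about 0.52.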

Instance Segmentation

| Metric | Description | Sort Order |
| --- | --- | --- |
| Accuracy | Proportion of predictions that exactly match the ground truth. | Higher is better |
| Precision | Measures the proportion of predicted positives that are actually positive. | Higher is better |
| Recall | Measures the proportion of actual positives correctly identified. | Higher is better |
| F1 Score | Balances precision and recall into a single metric. | Higher is better |
| Loss | Quantifies the error between predicted outputs and actual values. | Lower is better |

Keypoint Detection

| Metric | Description | Sort Order |
| --- | --- | --- |
| Precision | Measures the proportion of predicted positives that are actually positive. | Higher is better |
| Recall | Measures the proportion of actual positives correctly identified. | Higher is better |
| F1 Score | Balances precision and recall into a single metric. | Higher is better |
| Loss | Quantifies the error between predicted outputs and actual values. | Lower is better |
| Mean Absolute Error (MAE) | Average magnitude of errors between predicted and actual values, expressed in the same units as the target. | Lower is better |
| PCK (Percentage of Correct Keypoints) | Measures how accurately predicted keypoints fall within a specified distance of ground truth. | Higher is better |
| PCK@0.05 | Keypoints within a normalized distance threshold of 0.05 from ground truth. Strictest variant. | Higher is better |
| PCK@0.10 | Keypoints within a normalized distance threshold of 0.10 from ground truth. | Higher is better |
| PCK@0.20 | Keypoints within a normalized distance threshold of 0.20 from ground truth. | Higher is better |
| PCK@0.30 | Keypoints within a normalized distance threshold of 0.30 from ground truth. | Higher is better |
| PCK@0.50 | Keypoints within a normalized distance threshold of 0.50 from ground truth. Most lenient variant. | Higher is better |
| Object Keypoint Similarity (OKS) | Measures similarity between predicted and ground truth keypoints, accounting for scale and localization uncertainty. | Higher is better |
| Mean Per Joint Position Error (MPJPE) | Average Euclidean distance between predicted and ground truth joint positions. Standard for pose estimation. | Lower is better |
| Visibility Accuracy | Measures how correctly the model predicts the visibility status of keypoints, independent of spatial localization. | Higher is better |
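
The PCK variants differ only in the distance threshold. The sketch below assumes 2D keypoints and a single normalization length `norm_size` (for example the longer side of the object's bounding box). Which reference length the platform normalizes by is not stated in this guide, so that parameter is an assumption.

```python
import math


def pck(pred, truth, norm_size, threshold):
    """Fraction of keypoints whose distance to ground truth is within threshold * norm_size."""
    limit = threshold * norm_size
    correct = sum(
        1
        for (px, py), (tx, ty) in zip(pred, truth)
        if math.hypot(px - tx, py - ty) <= limit
    )
    return correct / len(truth)


truth = [(10, 10), (50, 50), (90, 20)]
pred = [(12, 10), (50, 58), (75, 20)]  # off by 2, 8 and 15 pixels respectively
norm_size = 100  # hypothetical reference length in pixels

for threshold in (0.05, 0.10, 0.20):
    print(threshold, round(pck(pred, truth, norm_size, threshold), 3))
```

With a reference length of 100 pixels, one keypoint is correct at threshold 0.05, two at 0.10 and all three at 0.20, which shows how the stricter variants separate models that look identical under a lenient threshold.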

Tabular Regression

| Metric | Description | Sort Order |
| --- | --- | --- |
| Accuracy | Proportion of predictions that exactly match the ground truth. | Higher is better |
| Loss | Quantifies the error between predicted outputs and actual values. | Lower is better |
| Mean Absolute Error (MAE) | Average magnitude of errors between predicted and actual values, expressed in the same units as the target. | Lower is better |
| Mean Squared Error (MSE) | Average squared difference between predicted and actual values. Penalizes larger errors more heavily. | Lower is better |
| Root Mean Squared Error (RMSE) | Square root of MSE, expressing prediction error in the same units as the target variable. | Lower is better |
| R² (Coefficient of Determination) | Measures how well a regression model explains variance in the target. A value of 1.0 indicates perfect prediction. | Higher is better |
| Root Mean Squared Logarithmic Error (RMSLE) | Measures error on a logarithmic scale. Useful when target values span several orders of magnitude. | Lower is better |
| Median Absolute Error (Median AE) | Uses the median of absolute errors instead of the mean. Highly robust to outliers. | Lower is better |
| Explained Variance | Measures how well the model captures the variance of the target, independent of systematic bias. Ranges up to 1.0. | Higher is better |
| Mean Bias Error (MBE) | Measures the average bias in predictions. Positive = overestimation, negative = underestimation. | Lower is better |
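
The core regression metrics above are closely related, which a short worked example makes visible. This sketch uses the standard formulas in plain Python; the function name `regression_metrics` is illustrative and not part of the platform.

```python
import math


def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, R² and Mean Bias Error using their standard definitions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]  # positive error = overestimation

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mbe = sum(errors) / n

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all target values are identical")
    r2 = 1 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2, "mbe": mbe}


y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Here MAE is 1.0 while MSE is 1.5, because the single error of 2 is squared and weighs more heavily. The Mean Bias Error of 0.5 indicates that the model overestimates on average.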

Time Series Forecasting

| Metric | Description | Sort Order |
| --- | --- | --- |
| Accuracy | Proportion of predictions that exactly match the ground truth. | Higher is better |
| Loss | Quantifies the error between predicted outputs and actual values. | Lower is better |
| Mean Absolute Error (MAE) | Average magnitude of errors between predicted and actual values, expressed in the same units as the target. | Lower is better |
| Mean Squared Error (MSE) | Average squared difference between predicted and actual values. Penalizes larger errors more heavily. | Lower is better |
| Root Mean Squared Error (RMSE) | Square root of MSE, expressing prediction error in the same units as the target variable. | Lower is better |
| R² (Coefficient of Determination) | Measures how well a regression model explains variance in the target. A value of 1.0 indicates perfect prediction. | Higher is better |
| Root Mean Squared Logarithmic Error (RMSLE) | Measures error on a logarithmic scale. Useful when target values span several orders of magnitude. | Lower is better |
| Mean Absolute Percentage Error (MAPE) | Average percentage difference between predicted and actual values. Easy to interpret across different scales. | Lower is better |
| Symmetric Mean Absolute Percentage Error (SMAPE) | Symmetric variant of MAPE that reduces issues when actual values are close to zero. | Lower is better |
| Median Absolute Percentage Error (MdAPE) | Uses the median of absolute percentage errors. More robust to outliers. | Lower is better |
| Theil’s U (U2 Statistic) | Measures forecasting accuracy relative to a naive benchmark. Values below 1.0 mean the model outperforms the baseline. | Lower is better |
| Max Error | Captures the single largest absolute difference between predicted and actual values. | Lower is better |
| Direction Accuracy | Measures how often the model correctly predicts the direction of change (up or down) between consecutive values. | Higher is better |
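
The percentage-based and direction-based forecasting metrics can be sketched as follows. The SMAPE formula shown is one common definition (bounded by 200%), and Direction Accuracy here compares the sign of each step in the forecast with the sign of the corresponding step in the actual series. Both choices are assumptions, since this guide does not state which variant the platform uses.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error in percent; actual values must be non-zero."""
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


def smape(actual, forecast):
    """Symmetric MAPE in percent, using the 2*|f - a| / (|a| + |f|) definition."""
    terms = [
        2 * abs(f - a) / (abs(a) + abs(f)) if (abs(a) + abs(f)) else 0.0
        for a, f in zip(actual, forecast)
    ]
    return 100 * sum(terms) / len(terms)


def sign(x):
    return (x > 0) - (x < 0)


def direction_accuracy(actual, forecast):
    """Share of steps where forecast and actual series move in the same direction."""
    steps = len(actual) - 1
    hits = sum(
        1
        for i in range(1, len(actual))
        if sign(actual[i] - actual[i - 1]) == sign(forecast[i] - forecast[i - 1])
    )
    return hits / steps


actual = [100, 110, 105, 120]
forecast = [100, 108, 109, 118]

print(round(mape(actual, forecast), 3))
print(round(smape(actual, forecast), 3))
print(round(direction_accuracy(actual, forecast), 3))
```

The forecast stays within a few percent of the actual values, yet it misses the dip from 110 to 105, so it gets only two of the three directions right.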

Time-to-Event Prediction

| Metric | Description | Sort Order |
| --- | --- | --- |
| F1 Score | Balances precision and recall into a single metric. | Higher is better |
| Concordance Index (C-Index) | Measures how well predicted risk scores agree with the observed ordering of event times. Standard for survival analysis. | Higher is better |
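
As an illustration of the Concordance Index, the sketch below implements a simplified version of Harrell's C-index: a pair of samples is comparable when the one with the shorter time actually experienced the event, and it is concordant when that sample also received the higher risk score. Pairs with tied times are ignored for brevity, and the platform's implementation may treat ties and censoring differently.

```python
def concordance_index(times, events, risks):
    """Simplified Harrell's C-index.

    times  -- observed time per sample
    events -- 1 if the event was observed, 0 if the sample was censored
    risks  -- predicted risk score (higher means the event is expected sooner)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Sample i must have had the event strictly before sample j's observed time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [2, 4, 6, 8]
events = [1, 1, 0, 1]  # the third sample is censored
risks = [0.9, 0.3, 0.5, 0.1]

c_index = concordance_index(times, events, risks)
print(c_index)
```

Five pairs are comparable in this example and four of them are ordered correctly, so the C-index is 0.8. A value of 0.5 corresponds to random ordering.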

Step 3: Describe Your Use Case

Objective: Describe your use case and its goal in detail.

Provide a clear description that helps participants understand the problem, the data context, and the goal. Cover what the data represents, what a good model should achieve, and any domain-specific considerations participants should be aware of. Browse published use cases in the Explore section for examples of well-written descriptions.

Step 4: Review & Submit

Objective: Set collaboration and resource constraints.

Add the email addresses of the vendors, colleagues, or researchers you want to invite. Invitations are sent once the use case is saved or published. For instructions on how data scientists can join your use case, see the join a use case guide.

Compute Assignment

Define the training budget in PFLOPs.

Example: 10 participants × 200 PFLOPs each = 2,000 PFLOPs
Cost calculation: 2,000 PFLOPs × €0.025 = €50.00

Always allocate more resources than the minimum requirement and monitor resource usage regularly. You can stop or adjust training at any time.
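
The budget arithmetic above can be reproduced with a few lines of Python, which is handy when comparing several participant counts. This is a sketch only: the rate of €0.025 per PFLOP is taken from the example above and may not reflect current pricing, and the function name `training_budget` is hypothetical.

```python
from decimal import Decimal

# Rate from the example above; treat it as an assumption, not as current pricing.
PRICE_PER_PFLOP_EUR = Decimal("0.025")


def training_budget(participants: int, pflops_each: int) -> tuple[int, Decimal]:
    """Return the total PFLOPs and the resulting cost in EUR."""
    total_pflops = participants * pflops_each
    return total_pflops, total_pflops * PRICE_PER_PFLOP_EUR


total, cost = training_budget(participants=10, pflops_each=200)
print(f"{total:,} PFLOPs cost EUR {cost:.2f}")
```

`Decimal` is used instead of floating point so that the currency amount is exact.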

Final Step: Publish or Save as Draft

Use “Publish” to go live or “Save as Draft” to continue editing later. You can now see your use case in the use cases section.

Next Steps

Once your use case is published, reach out to external vendors, colleagues, or data scientists to train models on your use case. In the use case view, you can monitor:
  • total resource consumption
  • daily submissions and user activity
  • the overall leaderboard and submissions
Once models have been submitted, you can compare them in the leaderboard section of the use case.
