Training Metrics

We use several metrics to evaluate the performance of your machine learning model. These metrics can be grouped into three categories: Performance-focused metrics, Sustainability-focused metrics, and Mixed metrics. The first two are used to select the optimal cycle for a single Experiment, while the last is used to compare and select the optimal Experiment in a series of Experiments.

1. Performance-focused metrics

To better explain these metrics, we will work through an example. To start, we have a ground truth and an example model prediction.

ground_truth = [1, 1, 1, 0, 0, 1, 0, 1, 0]
prediction = [1, 1, 0, 0, 1, 1, 0, 1, 0]

There are four different outcomes for the model's predictions:

def true_positive(ground_truth, prediction):
    tp = 0
    for gt, pred in zip(ground_truth, prediction):
        if gt == 1 and pred == 1:
            tp += 1
    return tp

def true_negative(ground_truth, prediction):
    tn = 0
    for gt, pred in zip(ground_truth, prediction):
        if gt == 0 and pred == 0:
            tn += 1
    return tn

def false_positive(ground_truth, prediction):
    fp = 0
    for gt, pred in zip(ground_truth, prediction):
        if gt == 0 and pred == 1:
            fp += 1
    return fp

def false_negative(ground_truth, prediction):
    fn = 0
    for gt, pred in zip(ground_truth, prediction):
        if gt == 1 and pred == 0:
            fn += 1
    return fn

For our example we have:

true_positive(ground_truth, prediction)
# out: 4
true_negative(ground_truth, prediction)
# out: 3
false_positive(ground_truth, prediction)
# out: 1
false_negative(ground_truth, prediction)
# out: 1

Accuracy

The accuracy is a measure of how well a model correctly predicts the output for a given input. It is calculated by dividing the number of correct predictions by the total number of predictions made.

def accuracy(ground_truth, prediction):
    tp = true_positive(ground_truth, prediction)
    fp = false_positive(ground_truth, prediction)
    tn = true_negative(ground_truth, prediction)
    fn = false_negative(ground_truth, prediction)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return accuracy

# Example:

accuracy(ground_truth, prediction)
# out: 0.7777777777777778 (i.e. 7/9)

Loss

Loss is a measure of how far a model's predictions are from the true values. A lower loss indicates that the model is performing well. The exact value depends on the loss function used in the model's evaluation. Click here to find our supported loss functions.
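
As an illustration only, the sketch below shows how one common loss function for binary classification, binary cross-entropy, is computed; the loss functions actually supported are listed on the linked page and may differ.

import math

def binary_cross_entropy(ground_truth, predicted_probs, eps=1e-12):
    # Average negative log-likelihood of the true labels under the
    # predicted probabilities; lower is better.
    losses = []
    for gt, p in zip(ground_truth, predicted_probs):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        losses.append(-(gt * math.log(p) + (1 - gt) * math.log(1 - p)))
    return sum(losses) / len(losses)

# Example (with predicted probabilities instead of hard labels)

binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.7])
# out: 0.2284 (approximately)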

Precision

Precision is a measure of the number of correct positive predictions made by the model, compared to the total number of positive predictions made. In other words, the precision of a model shows how accurate the positive predictions are.

def precision(ground_truth, prediction):
    tp = true_positive(ground_truth, prediction)
    fp = false_positive(ground_truth, prediction)
    prec = tp / (tp + fp)
    return prec

# Example

precision(ground_truth, prediction)
# out: 0.8

Recall

Recall is a measure of the number of correct positive predictions made by the model, compared to the total number of actual positive instances. In other words, the recall (also called "sensitivity") of the model shows the coverage of actual positive samples.

def recall(ground_truth, prediction):
    tp = true_positive(ground_truth, prediction)
    fn = false_negative(ground_truth, prediction)
    rec = tp / (tp + fn)
    return rec

# Example

recall(ground_truth, prediction)
# out: 0.8

F1-score

The F1-score is the harmonic mean of precision and recall. It is a good metric to use when you want to balance precision and recall.

def f1(ground_truth, prediction):
    p = precision(ground_truth, prediction)
    r = recall(ground_truth, prediction)
    f1_score = 2 * p * r / (p + r)
    return f1_score

# Example

f1(ground_truth, prediction)
# out: 0.8

Vgap

Vgap is a metric that measures the absolute difference between the validation loss and the training loss. A larger vgap indicates that the model may be overfitting the training data.

def vgap(training_loss, validation_loss):
    vgap = abs(training_loss - validation_loss)
    return vgap
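
For example, with hypothetical loss values:

# Example (hypothetical loss values)

vgap(training_loss=0.15, validation_loss=0.40)
# out: 0.25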

Flops

Flops, or floating point operations, are a measure of the computational cost of a model. A model with a higher flops value requires more computational resources to run.
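
As a rough illustration of how flops relate to model size, the sketch below counts the operations of a single fully connected layer: each output value needs one multiplication and one addition per input, commonly approximated as 2 · n_in · n_out operations per forward pass. This is a simplified, assumed example, not the exact counting method used by the platform.

def dense_layer_flops(n_in, n_out):
    # Approximate flops of one forward pass through a dense layer:
    # one multiply and one add per (input, output) pair.
    return 2 * n_in * n_out

# Example

dense_layer_flops(784, 128)
# out: 200704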

2. Sustainability-focused metrics

gCO2e

gCO2e stands for "grams carbon dioxide equivalent." It is a measure of the environmental impact of training a machine learning model, up to a certain point in the training process.

gCO2e compares emissions from different greenhouse gases based on their global warming potential (GWP). It converts the amount of other gases into the equivalent amount of CO2 in grams.

For example, methane has a GWP of 25, meaning that the emission of one gram of methane is equivalent to the emission of 25 grams of CO2. CO2 is used as the reference and has a GWP of 1.
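
For instance, converting a small methane emission into its CO2 equivalent is a simple multiplication by the GWP:

grams_methane = 2
gco2e = grams_methane * 25  # methane GWP of 25
# gco2e == 50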

Calculation

To calculate gCO2e, we estimate the energy consumption of the GPU by looking at the training time and utilization rate. We then use the carbon intensity (gCO2e/kWh) of the computation center to calculate the gCO2e.
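
The sketch below illustrates this estimate with assumed, hypothetical numbers for GPU power draw, utilization, and carbon intensity; it is not the exact formula used by the platform.

def estimate_gco2e(gpu_power_watts, utilization, hours, carbon_intensity):
    # Energy drawn by the GPU in kWh, scaled by its utilization rate.
    energy_kwh = gpu_power_watts * utilization * hours / 1000
    # Carbon intensity is given in gCO2e per kWh.
    return energy_kwh * carbon_intensity

# Example (hypothetical values)

estimate_gco2e(gpu_power_watts=300, utilization=0.8, hours=2, carbon_intensity=400)
# out: 192.0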

3. Mixed Metrics

The mixed metrics are benchmarks that take into account two or more of the previously discussed metrics in order to find the best experiment from a group of experiments.

acc-gCO2e

Weighted sum of training accuracy and normalized gCO2e emission. The accuracy to gCO2e weight ratio is 80/20. The gCO2e emission is normalized over all experiments on a dataset by min-max scaling the gCO2e value to the interval [0,1]. A score of 1 marks the best performance.

Equations

acc-gCO2e = 0.8 × accuracy + 0.2 × (1 − gCO2e_norm)

where gCO2e_norm is the min-max scaled gCO2e value, so that lower emissions yield a higher score.
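
A minimal sketch of how such a score can be computed, assuming the 80/20 weighting and min-max scaling described above (the platform's exact formula may differ in detail); acc-flops and acc-vgap below follow the same pattern with flops or vgap in place of gCO2e:

def acc_gco2e_score(accuracy, gco2e, gco2e_all_experiments):
    # Min-max scale the emission over all experiments on the dataset.
    lo, hi = min(gco2e_all_experiments), max(gco2e_all_experiments)
    gco2e_norm = (gco2e - lo) / (hi - lo) if hi > lo else 0.0
    # 80/20 weighted sum; lower emissions increase the score.
    return 0.8 * accuracy + 0.2 * (1 - gco2e_norm)

# Example (hypothetical experiments with gCO2e values 120, 200 and 450)

acc_gco2e_score(accuracy=0.91, gco2e=200, gco2e_all_experiments=[120, 200, 450])
# out: 0.8795 (approximately)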

acc-flops

Weighted sum of training accuracy and normalized flops. The accuracy to flops weight ratio is 80/20. The flops are normalized over all experiments on a dataset by min-max scaling the flops value to the interval [0,1]. A score of 1 marks the best performance.

Equations

acc-flops = 0.8 × accuracy + 0.2 × (1 − flops_norm)

where flops_norm is the min-max scaled flops value, so that fewer flops yield a higher score.

acc-vgap

Weighted sum of training accuracy and normalized vgap. The accuracy to vgap weight ratio is 80/20. The vgap is normalized over all experiments on a dataset by min-max scaling the vgap value to the interval [0,1]. A score of 1 marks the best performance.

Equations

acc-vgap = 0.8 × accuracy + 0.2 × (1 − vgap_norm)

where vgap_norm is the min-max scaled vgap value, so that a smaller vgap yields a higher score.