Training Metrics
We use several metrics to evaluate the performance of your machine learning model. These metrics fall into three categories: Performance-focused metrics, Sustainability-focused metrics, and Mixed metrics. The first two are used to select the optimal cycle within a single Experiment, while the last is used to compare a series of Experiments and select the optimal one.
1. Performance-focused metrics
To better explain these metrics, we will work through a running example. We start with a ground truth and an example model prediction:
ground_truth = [1, 1, 1, 0, 0, 1, 0, 1, 0]
prediction = [1, 1, 0, 0, 1, 1, 0, 1, 0]
There are four possible outcomes for each of the model's predictions:
def true_positive(ground_truth, prediction):
    # Count samples where both the ground truth and the prediction are positive.
    tp = 0
    for gt, pred in zip(ground_truth, prediction):
        if gt == 1 and pred == 1:
            tp += 1
    return tp

def true_negative(ground_truth, prediction):
    # Count samples where both the ground truth and the prediction are negative.
    tn = 0
    for gt, pred in zip(ground_truth, prediction):
        if gt == 0 and pred == 0:
            tn += 1
    return tn

def false_positive(ground_truth, prediction):
    # Count samples predicted positive although the ground truth is negative.
    fp = 0
    for gt, pred in zip(ground_truth, prediction):
        if gt == 0 and pred == 1:
            fp += 1
    return fp

def false_negative(ground_truth, prediction):
    # Count samples predicted negative although the ground truth is positive.
    fn = 0
    for gt, pred in zip(ground_truth, prediction):
        if gt == 1 and pred == 0:
            fn += 1
    return fn
For our example we have:
true_positive(ground_truth, prediction)
# out: 4
true_negative(ground_truth, prediction)
# out: 3
false_positive(ground_truth, prediction)
# out: 1
false_negative(ground_truth, prediction)
# out: 1
Accuracy
Accuracy measures how often the model predicts the correct output for a given input. It is calculated by dividing the number of correct predictions by the total number of predictions made.
def accuracy(ground_truth, prediction):
    tp = true_positive(ground_truth, prediction)
    fp = false_positive(ground_truth, prediction)
    tn = true_negative(ground_truth, prediction)
    fn = false_negative(ground_truth, prediction)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return accuracy
# Example:
accuracy(ground_truth, prediction)
# out: 0.7777777777777778  (i.e. 7/9)
Loss
Loss is a measure of how far a model's predictions are from the true values. A lower loss indicates that the model is performing better. The exact value depends on the loss function used in the model's evaluation. Click here to find our supported loss functions.
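As a concrete illustration (this is just one common loss function, not necessarily one of our supported loss functions), the mean squared error averages the squared differences between ground truth and prediction:
def mean_squared_error(ground_truth, prediction):
    # Average of the squared differences between ground truth and prediction.
    errors = [(gt - pred) ** 2 for gt, pred in zip(ground_truth, prediction)]
    return sum(errors) / len(errors)
# Example
mean_squared_error(ground_truth, prediction)
# out: 0.2222222222222222  (2 of the 9 predictions differ)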
Precision
Precision is a measure of the number of correct positive predictions made by the model, compared to the total number of positive predictions made. In other words, the precision of a model shows how accurate the positive predictions are.
def precision(ground_truth, prediction):
    tp = true_positive(ground_truth, prediction)
    fp = false_positive(ground_truth, prediction)
    prec = tp / (tp + fp)
    return prec
# Example
precision(ground_truth, prediction)
# out: 0.8
Recall
Recall is a measure of the number of correct positive predictions made by the model, compared to the total number of actual positive instances. In other words, the recall (also called "sensitivity") of the model shows the coverage of actual positive samples.
def recall(ground_truth, prediction):
    tp = true_positive(ground_truth, prediction)
    fn = false_negative(ground_truth, prediction)
    rec = tp / (tp + fn)
    return rec
# Example
recall(ground_truth, prediction)
# out: 0.8
F1-score
The F1-score is the harmonic mean of precision and recall. It is a good metric to use when you want to balance precision and recall.
def f1(ground_truth, prediction):
    p = precision(ground_truth, prediction)
    r = recall(ground_truth, prediction)
    f1_score = 2 * p * r / (p + r)
    return f1_score
# Example
f1(ground_truth, prediction)
# out: 0.8
Vgap
Vgap is a metric that measures the absolute difference between the validation loss and the training loss. A larger vgap indicates that the model may be overfitting the training data.
def vgap(training_loss, validation_loss):
    vgap = abs(training_loss - validation_loss)
    return vgap
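For instance, with a hypothetical training loss of 0.25 and validation loss of 0.5:
# Example (hypothetical loss values)
vgap(0.25, 0.5)
# out: 0.25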
Flops
Flops, or floating point operations, are a measure of the computational cost of a model. A model with a higher flops value requires more computational resources to run.
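As a rough sketch of how such a count comes about (a back-of-the-envelope assumption for illustration, not the exact counting used in our evaluation), a fully connected layer costs about one multiply and one add per weight for a single forward pass:
def dense_layer_flops(in_features, out_features):
    # Roughly one multiply and one add per weight for a single forward pass.
    return 2 * in_features * out_features
# Example: a small two-layer network, per input sample
dense_layer_flops(784, 128) + dense_layer_flops(128, 10)
# out: 203264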
2. Sustainability-focused metrics
gCO2e
gCO2e stands for "grams carbon dioxide equivalent." It is a measure of the environmental impact of training a machine learning model, up to a certain point in the training process.
gCO2e compares emissions from different greenhouse gases based on their global warming potential (GWP). It converts the amount of other gases into the equivalent amount of CO2 in grams.
For example, methane has a GWP of 25, meaning that the emission of one gram of methane is equivalent to the emission of 25 grams of CO2. CO2 is used as the reference gas and has a GWP of 1.
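As a small illustration of this conversion (the gas amounts below are made up):
# GWP factors: CO2 is the reference gas, methane counts 25 times as much.
gwp = {"co2": 1, "methane": 25}
emissions_grams = {"co2": 100, "methane": 2}  # hypothetical emissions
sum(emissions_grams[gas] * gwp[gas] for gas in emissions_grams)
# out: 150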
Calculation
To calculate gCO2e, we estimate the energy consumption of the GPU by looking at the training time and utilization rate. We then use the carbon intensity (gCO2e/kWh) of the computation center to calculate the gCO2e.
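A minimal sketch of this calculation, assuming a fixed GPU power draw and an average utilization rate; the numbers below are illustrative and not the values we use internally:
def gco2e(gpu_power_watts, utilization, training_hours, carbon_intensity):
    # Estimated energy consumption of the GPU in kWh.
    energy_kwh = gpu_power_watts * utilization * training_hours / 1000
    # Emissions in grams CO2-equivalent, given the carbon intensity
    # (gCO2e/kWh) of the computation center.
    return energy_kwh * carbon_intensity
# Example: 250 W GPU at 80% utilization for 4 hours, at 500 gCO2e/kWh
gco2e(250, 0.8, 4, 500)
# out: 400.0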
3. Mixed Metrics
The mixed metrics are benchmarks that take into account two or more of the previously discussed metrics in order to find the best experiment from a group of experiments.
acc-gCO2e
Weighted sum of training accuracy and normalized gCO2e emission. The accuracy to gCO2e weight ratio is 80/20. The gCO2e emission is normalized over all experiments on a dataset by min-max scaling the gCO2e value to the interval [0,1]. A score of 1 marks the best performance.
acc-flops
Weighted sum of training accuracy and normalized flops. The accuracy to flops weight ratio is 80/20. The flops are normalized over all experiments on a dataset by min-max scaling the flops value to the interval [0,1]. A score of 1 marks the best performance.
acc-vgap
Weighted sum of training accuracy and normalized vgap. The accuracy to vgap weight ratio is 80/20. The vgap is normalized over all experiments on a dataset by min-max scaling the vgap value to the interval [0,1]. A score of 1 marks the best performance.
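A minimal sketch of how such a mixed score can be computed, assuming the 80/20 weighting and min-max normalization described above, and assuming the normalized metric is subtracted from 1 so that a score of 1 marks the best possible performance; the exact implementation may differ:
def mixed_score(accuracies, costs, acc_weight=0.8):
    # `accuracies` and `costs` hold one value per experiment on the same dataset;
    # `costs` is the metric to minimize (gCO2e, flops or vgap).
    lo, hi = min(costs), max(costs)
    scores = []
    for acc, cost in zip(accuracies, costs):
        # Min-max scale the cost to [0, 1]; guard against identical costs.
        norm_cost = (cost - lo) / (hi - lo) if hi > lo else 0.0
        scores.append(acc_weight * acc + (1 - acc_weight) * (1 - norm_cost))
    return scores
# Example: three experiments with hypothetical accuracies and gCO2e values
mixed_score([0.90, 0.85, 0.95], [120.0, 60.0, 240.0])
# out (approximately): [0.853, 0.88, 0.76]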