Training Parameters

What is the difference between epochs and cycles?

In the context of our federated learning infrastructure, an 'epoch' refers to one complete forward and backward pass of all the training data, which is a standard term in machine learning training. A 'cycle,' on the other hand, refers to the completion of a specified number of epochs across multiple nodes participating in the training. After these epochs are completed, the model weights from different nodes are averaged.

The reason you are prompted to specify 'cycles' rather than 'epochs' during model submission is that our system performs weight averaging and calculates metrics at the end of each cycle. When you submit a model after a cycle, the averaged weights are used for evaluation. This makes 'cycles' a more meaningful unit for monitoring the federated learning process in our system.
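For intuition, here is a minimal, illustrative sketch of what "averaging the weights at the end of a cycle" means. This is not the platform's actual code: the node models, their Keras-style get_weights/set_weights methods, and the train_one_epoch function are placeholder assumptions:

import numpy as np

def run_cycle(node_models, train_one_epoch, epochs_per_cycle):
    # Each participating node trains locally for the configured number of epochs.
    for model in node_models:
        for _ in range(epochs_per_cycle):
            train_one_epoch(model)

    # At the end of the cycle the per-node weights are averaged (federated
    # averaging) and written back to every node; these averaged weights are
    # what gets evaluated when you submit after a cycle.
    all_weights = [model.get_weights() for model in node_models]
    averaged = [np.mean(layer_group, axis=0) for layer_group in zip(*all_weights)]
    for model in node_models:
        model.set_weights(averaged)
    return averaged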

Should I run more epochs or more cycles in my training?

As a best practice, run fewer epochs per cycle and more cycles, rather than more epochs and fewer cycles. For example, if you want to run 25 epochs in total, it is advisable to run 5 epochs over 5 cycles instead of 25 epochs in a single cycle.

How can I implement callbacks?

Callbacks can be implemented using the training plan. Simply have a look at the hyperparameters section.
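As a general illustration of what a callback does, the snippet below uses the standard Keras EarlyStopping callback with a toy model and random data; on our platform the equivalent behaviour is configured through the training plan rather than by writing this code yourself:

import numpy as np
import tensorflow as tf

# Toy data and model, purely for illustration.
x_train = np.random.rand(100, 10)
y_train = np.random.randint(0, 2, size=(100,))
model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Stop training early once the validation loss stops improving.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=3, restore_best_weights=True)

# In plain Keras the callback is passed to fit(); in the training plan you
# would set the corresponding hyperparameters instead.
model.fit(x_train, y_train, validation_split=0.2, epochs=20,
          callbacks=[early_stopping])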

Can I use a pre-trained model and specify which layers should not be retrained?

Yes, it's possible to do so using the layersFreeze parameter in the training plan. You can find it in our hyperparameters section.
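For context, "freezing" a layer simply means excluding its weights from gradient updates so that only the remaining layers are retrained. The snippet below shows what that looks like in plain Keras; it is purely illustrative, since on our platform you express the same intent through the layersFreeze parameter instead of writing this code:

import tensorflow as tf

# Load a pre-trained backbone (ImageNet weights) without its classifier head.
base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")

# Freeze the pre-trained layers so their weights are not updated during training.
for layer in base.layers:
    layer.trainable = False

# Only the newly added classification head is trained.
model = tf.keras.Sequential([base, tf.keras.layers.Dense(5, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")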

Can I access the source code used for my trainings?

We understand the importance of transparency in the training process. However, due to the proprietary nature of our federated learning infrastructure, we do not provide direct access to the training code or any open-source examples. That said, users do have the ability to control various hyperparameters like epochs, the learning rate, and data augmentations to customize the training process to their needs. Our documentation provides detailed explanations on how each hyperparameter affects the training, helping you better understand what's happening during this phase.

How can I set my training plan?

You can customize your model's training with different types of parameters according to your needs.

For more information on training parameters, have a look at the training plan section of our documentation.

Can I check the default values for each parameter?

Yes, you can check the default values of each parameter before setting any value by running:

trainingObject.getTrainingPlan()

You can use the same command to check the updated values for each parameter.

How can I reset all parameters to their default value?

The following command resets all the parameters in your training plan back to their default values:

trainingObject.resetTrainingPlan()
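For example, a typical reset-and-verify sequence (using only the two commands documented above) looks like this:

trainingObject.resetTrainingPlan()   # reset every parameter to its default
trainingObject.getTrainingPlan()     # confirm the defaults were restored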

Why do I get "AttributeError: 'NoneType' object has no attribute 'experimentName'" while submitting my training plan?

This error indicates that your Google Colab session has expired, and you should try starting the notebook again from the first step.

Why are my training epochs not showing up on the frontend?

Once you have started a training, it can take some time for the first epoch to appear, for example if you are training on the whole dataset. The time taken to complete one epoch also varies with the batch size you have selected. To speed up your training, you could select a smaller batch size or create and train on a subdataset using the TrainingClasses parameter. More information on this can be found in the dataset parameters section.

What happens if I pause an experiment?

Pausing an experiment stops the training. When resuming the experiment again, it will start training again at the first epoch of the cycle it was in when you paused it.

For example: You started an experiment with 10 epochs and one cycle. You wait a short while and pause the experiment while it is still at epoch 5. When you resume the experiment, the training will start again from the first epoch.

What does the "weights file truncated" error indicate?

This error indicates that the weights file you are trying to upload is corrupted. Try downloading your weights file again and then rename it based on your model name, e.g. modelname_weights.pkl for TensorFlow or modelname_weights.pth for PyTorch.
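If you are exporting the weights yourself, the sketch below shows one common way to produce files in these two formats. The exact serialization the platform expects (a pickled list of Keras weight arrays for .pkl, a PyTorch state dict for .pth) is an assumption based on the file extensions above, and the tiny models are placeholders:

import pickle
import torch
import tensorflow as tf

# TensorFlow/Keras: pickle the list of weight arrays (assumed format for .pkl).
tf_model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
tf_model.build(input_shape=(None, 10))
with open("modelname_weights.pkl", "wb") as f:
    pickle.dump(tf_model.get_weights(), f)

# PyTorch: save the model's state dict (assumed format for .pth).
torch_model = torch.nn.Linear(10, 1)
torch.save(torch_model.state_dict(), "modelname_weights.pth")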

Which values can I set for the validation and test dataset split?

The supported validation split range is (0, 0.5].

Which values can I set for the batch size of my training?

The supported batch size range is (4, 128].