As easy as modern-day programming has become, developers still have to jump through hoops to figure out what's causing bugs in their code, especially when working with bleeding-edge technologies like AI, ML or computer vision.
In this article, we're looking at the "CUDA error: device-side assert triggered" error that can appear when working with Python and PyTorch.
What causes this error?
The error mainly occurs for one of the following two reasons.
- The number of labels/classes doesn't match the number of output units.
- The input passed to the loss function is incorrect.
How to fix this?
You can try out the following fixes.
Match output units with the number of classes
You should first check whether the number of classes in your dataset matches the number of output units in your model. Class indices are zero-based, so if your model has 100 output units, valid labels run from 0 to 99; any label of 100 or above will trigger this error. This can be resolved by changing the number of output units in your classifier.
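As a minimal sketch of this check (the model and tensor shapes here are made up for illustration), you can verify that every label is strictly smaller than the number of output units before the loss is ever computed:

```python
import torch
import torch.nn as nn

# Hypothetical classifier with 10 output units: valid labels are 0-9.
model = nn.Linear(4, 10)

features = torch.randn(8, 4)
labels = torch.randint(0, 10, (8,))  # all labels in [0, 9] — OK

logits = model(features)

# A label of 10 or more would be out of range; on the GPU that
# surfaces as the device-side assert rather than a clear IndexError.
assert labels.max().item() < logits.shape[1], "label out of range"

loss = nn.CrossEntropyLoss()(logits, labels)
```

Running this check on the CPU gives you a readable assertion message instead of the opaque CUDA error.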
Fix the loss function input
Make sure that your output layer returns values in the range expected by your selected loss function (also known as a criterion). You may need an appropriate activation function (Sigmoid, Softmax or LogSoftmax) in your final output layer.
The quickest way to narrow this down is to experiment with these functions and see which one works with your criterion. Note that the same mistake often produces a clear error message on the CPU but only this cryptic assert on the GPU, so running the code on the CPU first can help you pinpoint the problem.
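One common pairing worth knowing: in PyTorch, CrossEntropyLoss expects raw logits (it applies LogSoftmax internally), while NLLLoss expects log-probabilities. This short sketch, with made-up tensors, shows the two equivalent formulations:

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 3)            # raw scores for 3 classes
targets = torch.tensor([0, 2, 1, 0])  # class indices starting at 0

# CrossEntropyLoss takes raw logits directly.
ce = nn.CrossEntropyLoss()(logits, targets)

# NLLLoss must be paired with LogSoftmax on the output layer.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

# The two formulations compute the same loss.
assert torch.allclose(ce, nll)
```

Feeding raw logits into NLLLoss, or already-softmaxed probabilities into CrossEntropyLoss, is exactly the kind of mismatched input that can surface as this assert.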
Check the ground label index
Make sure your ground-truth label indices are set correctly. If your ground-truth labels start at 1, subtract 1 from every label. This should fix the problem for you.
Keep this general rule in mind: since array indices start from zero, your class indices should also start from zero.
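The shift described above is a one-liner. As a sketch, assuming a hypothetical dataset whose labels run from 1 to 5:

```python
import torch

# Hypothetical raw labels that start at 1 (classes 1..5).
raw_labels = torch.tensor([1, 3, 5, 2])

# PyTorch loss functions expect zero-based class indices,
# so shift everything down by one (classes 0..4).
labels = raw_labels - 1
```

After the shift, `labels.min()` is 0 and `labels.max()` is one less than the number of classes, which is what the loss function expects.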
If the fixes mentioned above didn't solve the problem for you, try running your script again with the CUDA_LAUNCH_BLOCKING=1 environment variable set. This forces CUDA kernels to launch synchronously, so the stack trace points at the operation that actually failed instead of a later, unrelated line. Depending on the error you get, you might want to research further on what went wrong.
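You can set the variable on the command line (`CUDA_LAUNCH_BLOCKING=1 python train.py`) or from Python, as in this sketch; either way it must happen before CUDA is initialised, i.e. before the first GPU operation:

```python
import os

# Force synchronous CUDA kernel launches so the stack trace points
# at the Python line that triggered the assert. Must be set before
# the first CUDA call (safest: before importing torch at all).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# ... import torch and run the rest of the script as usual.
```

Remember to remove the flag once you're done debugging, since synchronous launches slow training down considerably.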