FOMO Training on GPU not faster than on CPU

Hi,
for my upcoming project, I plan to train a FOMO model on a large dataset of about 50k images at 96x96 resolution.
I ran a test on a small subset (about 500 images) in the web UI, once with the CPU and once with the GPU training processor. GPU training did not finish noticeably faster than CPU training.
Why is that, and is the training pipeline optimized for GPUs, parallelism and larger datasets in general?

This should be reproducible with any example project: set the image resolution to something similar (e.g. 96x96) and compare CPU and GPU training times.

I also exported the block and trained it on RunPod with an NVIDIA L4. GPU utilisation stayed under 2%, and training was impractically slow.
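
For anyone trying to reproduce the utilisation figure, GPU utilisation can be sampled during training with something like the sketch below. It assumes `nvidia-smi` is on the PATH of the training machine; the helper names (`parse_utilisation`, `sample_gpu_utilisation`, `mean_utilisation`) are illustrative only and not part of any Edge Impulse tooling.

```python
import subprocess
import time


def parse_utilisation(output: str) -> list[int]:
    """Parse nvidia-smi output that has one utilisation percentage per line (one line per GPU)."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def sample_gpu_utilisation() -> list[int]:
    """Return the current utilisation (%) of each visible GPU. Requires nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_utilisation(result.stdout)


def mean_utilisation(samples: list[list[int]]) -> float:
    """Average utilisation over all samples and all GPUs (0.0 if there are no samples)."""
    flat = [value for sample in samples for value in sample]
    return sum(flat) / len(flat) if flat else 0.0


if __name__ == "__main__":
    # Take one sample per second for ten seconds while training runs in another process.
    samples = []
    for _ in range(10):
        samples.append(sample_gpu_utilisation())
        time.sleep(1)
    print(f"mean GPU utilisation: {mean_utilisation(samples):.1f}%")
```

Running this in a second terminal while the training job is active gives a rough average; a value that stays near zero would suggest the GPU is mostly idle, for example waiting on data loading or not being used at all.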

Best regards,
Luis