Running into both time and memory problems when training, why?

PilotProjectUCL · May 11, 2022, 11:08am

Preface: This is my first time doing a computer vision project using a service like this. It will be used for my graduation project. The goal is a FOMO model that can detect multiple different object types.

I keep running into either OOM or time-limit problems when trying to train my model, and I have difficulty understanding the underlying cause. I understand the basics: I’m using too much memory, or training the model takes too much time. What I don’t understand is why.

I have a few different datasets available to me, of varying sizes, resolutions, objects etc. The biggest ones in the original datasets are 1920x1200. I am assuming that creating an impulse and setting it to 640x640 will require much more RAM compared to a 320x320, but will also increase the accuracy, so it seems to be an issue of trying to balance the resolution of the picture versus the size of the dataset (2000 images for example).

However, I’ve also had issues when using a different smaller dataset (which was set to 320x320, grayscale) where I would run into OOM problems after training around 50-60 batches, but just as training is about to begin, the job is evicted from the cluster, even though the RAM consumption should be much less.

So my questions are these:

1. What is the exact relationship between dataset sizes, their original resolutions, and the resolution set in the impulse? There must be something obvious I’m not grasping here.
2. Regarding the time issues, I would assume that because the impulse images are still somewhat large, it takes much longer to process them. Correct?
3. What could be the problem regarding the OOM eviction? It happened even when setting very loose training parameters for testing purposes (1 cycle, 0.002 learning rate).

Side note: Is there a way to get more time?

shawn_edgeimpulse · May 11, 2022, 1:11pm

Hi @PilotProjectUCL,

If you look at the “Image data” portion of your impulse under “Create impulse,” you can see settings for image width and height. Before going to the processing block, each image will be scaled to that resolution.

For example, here is a FOMO example where the input images (during training) are scaled to 96x96.

The public project is here if you would like to see how it is set up with what kind of data and processing/training blocks: Create impulse - FOMO Washers and Screws 96x96 - Edge Impulse

I recommend lowering that “Image data” resolution to see if you can avoid the OOM issue. 2000 images at 320x320 is likely too much data for the free version of Edge Impulse. If you provide me with the project ID, I would be happy to take a look at your project.
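As a rough illustration of why the impulse resolution and the image count drive memory use, while the original resolution does not, here is a minimal back-of-the-envelope sketch. It assumes every image is resized to the impulse resolution and held in memory as 32-bit floats; that is an assumption made for illustration, not a description of how Edge Impulse actually stores data, and the function names are hypothetical.

```python
# Rough estimate only. Assumes each image is resized to the impulse
# resolution and kept in memory as float32 values (4 bytes each).
# The real training pipeline may store or stream data differently.

def raw_input_bytes(num_images, width, height, channels=3, bytes_per_value=4):
    """Bytes needed to hold all resized images as one dense array."""
    return num_images * width * height * channels * bytes_per_value


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes."""
    return num_bytes / 1e9


if __name__ == "__main__":
    # (label, width, height, channels) for settings mentioned in this thread
    settings = [
        ("640x640 RGB", 640, 640, 3),
        ("320x320 RGB", 320, 320, 3),
        ("320x320 grayscale", 320, 320, 1),
        ("96x96 RGB", 96, 96, 3),
    ]
    for label, w, h, c in settings:
        size = raw_input_bytes(2000, w, h, channels=c)
        print(f"2000 images at {label}: {to_gb(size):.2f} GB")
```

Under these assumptions, 2000 RGB images come to roughly 9.8 GB at 640x640, about 2.5 GB at 320x320, and about 0.22 GB at 96x96. The original 1920x1200 size never enters the calculation, because the images are scaled before they reach the processing block.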

PilotProjectUCL · May 11, 2022, 3:56pm

Makes more sense then, I misunderstood the relationship between the blocks and their effects!

I would really appreciate it if you took a look, although the currently uploaded dataset only has a single class as output. This is because I had problems with one of the datasets and was trying to narrow down the issue, so I wrote a program that converts YOLO sets into whatever format is used here (COCO-ish?), which lets me adjust which bounding boxes are provided in the labels file.

ID: 102929

shawn_edgeimpulse · May 12, 2022, 1:59am

Hi @PilotProjectUCL,

I took a look at your project. 640x640 resolution is pretty high. I was running into the 20-minute compute time limit with just 192x192 resolution and 1600 images for FOMO. I’ve increased the compute time for your project to 60 minutes. If that does not work, I recommend seeing if you can get by with a lower resolution or fewer training images.
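To get a feel for how much more work a higher resolution implies, the sketch below compares total pixel counts for two configurations. It assumes training time grows roughly in proportion to the number of pixels processed per epoch, which is only a crude proxy for a convolutional model; the function name is hypothetical, and the 2000-image count is the example figure from earlier in this thread.

```python
# Crude comparison: treats per-epoch work as proportional to
# images * width * height. Real training time also depends on the
# model, batch size, and hardware, so read this as an order of magnitude.

def relative_pixel_work(num_images, width, height):
    """Total pixels processed in one pass over the dataset."""
    return num_images * width * height


# Configuration that already hit the 20-minute limit
baseline = relative_pixel_work(1600, 192, 192)

# Example configuration from the question: 2000 images at 640x640
candidate = relative_pixel_work(2000, 640, 640)

ratio = candidate / baseline
print(f"640x640 with 2000 images is about {ratio:.1f}x the pixel work")
```

Under this assumption the 640x640 setup processes roughly 14 times as many pixels per epoch as the 192x192 one, which is why lowering the resolution or trimming the dataset is the first thing to try.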