Issue
This content is from Stack Overflow; the question was asked by Noura Fayez.
I am training a model to predict image labels, using PyTorch 1.12.1 and the CUDA 11.6 driver on a single GPU, an NVIDIA GeForce RTX 3060 Laptop GPU, with the following specs:
Utilization 0%
Dedicated GPU memory 0.0/6.0 GB
Shared GPU memory 0.0/7.9 GB
GPU Memory 0.0/13.9 GB
My training data consists of 12,000 images of size 244×244.
The problem is that when I set the following configuration:
imgs_per_gpu=128,
workers_per_gpu=4,
I get the following error:
RuntimeError: CUDA out of memory. Tried to allocate 784.00 MiB (GPU 0; 6.00 GiB total capacity; 5.20 GiB already allocated; 0 bytes free; 5.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
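The message suggests setting max_split_size_mb through PYTORCH_CUDA_ALLOC_CONF when reserved memory is much larger than allocated memory. If I understand the documentation correctly, the option has to be set before CUDA is initialized, roughly like this (the value 128 below is only an illustrative guess, not a tuned setting):

# Set the caching-allocator option before any CUDA work is done.
# max_split_size_mb:128 is an example value; it caps the block size the
# allocator will split, which can reduce fragmentation.
import os
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'

import torch
print(torch.cuda.is_available())  # the first CUDA call happens after the variable is set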
I checked the GPU status using the following command:
nvidia-smi
The result, shown below, indicates that GPU utilization is zero, no processes are running, and the GPU memory is empty:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 516.94 Driver Version: 516.94 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 41C P0 23W / N/A | 0MiB / 6144MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I googled the issue and found two suggested solutions:
1. Reduce the batch size, so I tried setting the configuration to (a rough config sketch follows after this list):
imgs_per_gpu=2,
workers_per_gpu=2,
With this setting, only half of the training data was used, which reduced the accuracy of the model. Also, the more I increase the imgs_per_gpu value, the less training data is used.
2. Clear the torch cache, so I ran the following code, but it did not help:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()
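For reference, the imgs_per_gpu and workers_per_gpu settings follow the MMCV-style config convention; a rough sketch of the reduced-batch-size data config looks like this (the dataset type, paths, and pipeline here are placeholders, not my real config):

# Sketch of an MMCV-style data config with the reduced batch size.
# Dataset type, paths, and pipeline are placeholders, not the real project values.
data = dict(
    imgs_per_gpu=2,       # images per GPU in each iteration, i.e. the batch size
    workers_per_gpu=2,    # dataloader worker processes per GPU
    train=dict(
        type='CustomDataset',                        # placeholder dataset type
        data_prefix='data/train',                    # placeholder path to the training images
        pipeline=[dict(type='LoadImageFromFile')],   # placeholder pipeline
    ),
)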
I also tried reducing the dataset to 6,000 images and training on all of them, but it gives the same out-of-memory error, even though those same images had trained before as half of the 12,000-image set.
So, my question is: how can I fix this issue and train on all of my 12,000 images?
Also, one more question: can I train the same model twice, on two different datasets, and merge the training results at the end? I am considering this as a workaround.
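To make the workaround idea concrete, here is a rough sketch of what I mean: train the model on one dataset, save the weights, then reload them and continue training on the second dataset, so the final weights reflect both (the tiny model and random datasets below are placeholders for my real ones):

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Placeholders: a tiny classifier and two small random datasets stand in for my
# real model and the two halves of the 12,000-image dataset.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 244 * 244, 10))
dataset_a = TensorDataset(torch.randn(8, 3, 244, 244), torch.randint(0, 10, (8,)))
dataset_b = TensorDataset(torch.randn(8, 3, 244, 244), torch.randint(0, 10, (8,)))

def train_on(dataset, epochs=1):
    loader = DataLoader(dataset, batch_size=2, shuffle=True)
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

# Stage 1: train on the first dataset and save the weights.
train_on(dataset_a)
torch.save(model.state_dict(), 'stage1.pth')

# Stage 2: reload the stage-1 weights and keep training on the second dataset,
# so the result is sequential fine-tuning rather than a literal merge of two
# separately trained models.
model.load_state_dict(torch.load('stage1.pth'))
train_on(dataset_b)
torch.save(model.state_dict(), 'final.pth')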
Thanks in advance.
Solution
This question has not yet been answered; a confirmed answer will be published here as the solution.
This question and answer were collected from Stack Overflow and tested by the JTuto community, and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, or CC BY-SA 4.0.