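I set up a Paddle environment on our company's Ubuntu server. While training PaddleOCR I ran into the error "Out of memory error on GPU 0. Cannot allocate 28.125000MB memory on GPU 0, 11.598755GB memory has been allocated ...". The main fix suggested online is to reduce the batch size in the config file. I tried that and it still fails, even with the batch size set to 1. Does anyone know how to solve this?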
Neither of the two GPUs is being used by anything else.
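For anyone who wants to double-check the same thing on their own machine, here is a minimal sketch that reads per-GPU memory usage through `nvidia-smi`. It assumes `nvidia-smi` is on the PATH. The helper names `parse_gpu_memory` and `query_gpu_memory` are made up for this example and are not part of Paddle or PaddleOCR.

```python
import shutil
import subprocess

# nvidia-smi query that prints one CSV row per GPU: index, used MiB, total MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn the CSV text printed by the query above into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {"index": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU usage, or [] if it cannot be run."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_gpu_memory(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpu_memory():
        print(f"GPU {gpu['index']}: {gpu['used_mib']} / {gpu['total_mib']} MiB used")
```

Running it once before training and once while training is starting up should show whether the memory is taken by another process or by the training job itself.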
Traceback (most recent call last):
File "/home/sj/Project/PaddleOCR-main/tools/train.py", line 269, in <module>
main(config, device, logger, vdl_writer, seed)
File "/home/sj/Project/PaddleOCR-main/tools/train.py", line 222, in main
program.train(
File "/home/sj/Project/PaddleOCR-main/tools/program.py", line 345, in train
preds = model(images, data=batch[1:])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sj/anaconda3/envs/paddle_env/lib/python3.11/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
return self.forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sj/Project/PaddleOCR-main/ppocr/modeling/architectures/base_model.py", line 85, in forward
x = self.backbone(x)
^^^^^^^^^^^^^^^^
File "/home/sj/anaconda3/envs/paddle_env/lib/python3.11/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
return self.forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sj/Project/PaddleOCR-main/ppocr/modeling/backbones/rec_lcnetv3.py", line 544, in forward
x = self.blocks6(x)
^^^^^^^^^^^^^^^
File "/home/sj/anaconda3/envs/paddle_env/lib/python3.11/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
return self.forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sj/anaconda3/envs/paddle_env/lib/python3.11/site-packages/paddle/nn/layer/container.py", line 615, in forward
input = layer(input)
^^^^^^^^^^^^
File "/home/sj/anaconda3/envs/paddle_env/lib/python3.11/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
return self.forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sj/Project/PaddleOCR-main/ppocr/modeling/backbones/rec_lcnetv3.py", line 390, in forward
x = self.pw_conv(x)
^^^^^^^^^^^^^^^
File "/home/sj/anaconda3/envs/paddle_env/lib/python3.11/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
return self.forward(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/sj/Project/PaddleOCR-main/ppocr/modeling/backbones/rec_lcnetv3.py", line 223, in forward
out += self.identity(x)
MemoryError:
--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 paddle::pybind::CallScalarFuction(paddle::Tensor const&, double, std::string)
1 scale_ad_func(paddle::Tensor const&, paddle::experimental::ScalarBase<paddle::Tensor>, paddle::experimental::ScalarBase<paddle::Tensor>, bool)
2 paddle::experimental::scale(paddle::Tensor const&, paddle::experimental::ScalarBase<paddle::Tensor> const&, paddle::experimental::ScalarBase<paddle::Tensor> const&, bool)
3 void phi::ScaleKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, paddle::experimental::ScalarBase<phi::DenseTensor> const&, paddle::experimental::ScalarBase<phi::DenseTensor> const&, bool, phi::DenseTensor*)
4 float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
5 phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
6 phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
7 paddle::memory::allocation::Allocator::Allocate(unsigned long)
8 paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
9 paddle::memory::allocation::Allocator::Allocate(unsigned long)
10 paddle::memory::allocation::Allocator::Allocate(unsigned long)
11 paddle::memory::allocation::Allocator::Allocate(unsigned long)
12 paddle::memory::allocation::Allocator::Allocate(unsigned long)
13 paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
14 std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
15 common::enforce::GetCurrentTraceBackString[abi:cxx11](bool)
----------------------
Error Message Summary:
----------------------
ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 28.125000MB memory on GPU 0, 11.602661GB memory has been allocated and available memory is only 22.687500MB.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
(at ../paddle/fluid/memory/allocation/cuda_allocator.cc:86)
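To make the numbers in the summary easier to compare, here is a small sketch that pulls them out of the message. The regular expression is written against the exact wording shown above and may not match messages from other Paddle versions. `parse_paddle_oom` is a hypothetical helper name, not a Paddle API.

```python
import re

# Pattern written against the wording of the error summary quoted above.
OOM_PATTERN = re.compile(
    r"Cannot allocate (?P<req>[\d.]+)(?P<req_unit>[KMG]B) memory on GPU (?P<gpu>\d+), "
    r"(?P<alloc>[\d.]+)(?P<alloc_unit>[KMG]B) memory has been allocated "
    r"and available memory is only (?P<free>[\d.]+)(?P<free_unit>[KMG]B)"
)

# Conversion factors to MB (binary units, 1 GB = 1024 MB).
TO_MB = {"KB": 1.0 / 1024.0, "MB": 1.0, "GB": 1024.0}


def parse_paddle_oom(message):
    """Return the GPU index and the three memory figures in MB, or None."""
    match = OOM_PATTERN.search(message)
    if match is None:
        return None
    return {
        "gpu": int(match["gpu"]),
        "requested_mb": float(match["req"]) * TO_MB[match["req_unit"]],
        "allocated_mb": float(match["alloc"]) * TO_MB[match["alloc_unit"]],
        "available_mb": float(match["free"]) * TO_MB[match["free_unit"]],
    }


message = (
    "Out of memory error on GPU 0. Cannot allocate 28.125000MB memory on GPU 0, "
    "11.602661GB memory has been allocated and available memory is only 22.687500MB."
)
info = parse_paddle_oom(message)
print(info)
```

Read this way, the failing request is small (about 28 MB), but roughly 11.6 GB is already allocated on GPU 0 and under 23 MB is left, so the card is essentially full by the time the backbone's forward pass reaches this layer.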
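Below are the config files for the text detection model (det) and the text recognition model (rec). The one that fails is the rec model; the det model trains fine.
1. Text detection model det (runs fine)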
Global:
  debug: false
  use_gpu: true
  epoch_num: 800
  log_smooth_window: 20
  print_batch_step: 20
  save_model_dir: ./output/ch_PP-OCRv4
  save_epoch_step: 100
  eval_batch_step:
  - 0
  - 20000
  cal_metric_during_train: false
  checkpoints: null
  pretrained_model: null
  save_inference_dir: null
  use_visualdl: false
  infer_img: doc/imgs_en/img_10.jpg
  save_res_path: ./checkpoints/det_db/predicts_db.txt
  distributed: true
Architecture:
  name: DistillationModel
  algorithm: Distillation
  model_type: det
  Models:
    Student:
      model_type: det
      algorithm: DB
      Transform: null
      Backbone:
        name: PPLCNetV3
        scale: 0.75
        pretrained: false
        det: true
      Neck:
        name: RSEFPN
        out_channels: 96
        shortcut: true
      Head:
        name: DBHead
        k: 50
    Student2:
      pretrained: null
      model_type: det
      algorithm: DB
      Transform: null
      Backbone:
        name: PPLCNetV3
        scale: 0.75
        pretrained: true
        det: true
      Neck:
        name: RSEFPN
        out_channels: 96
        shortcut: true
      Head:
        name: DBHead
        k: 50
    Teacher:
      pretrained: https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_det_cml_teacher_pretrained/teacher.pdparams
      freeze_params: true
      return_all_feats: false
      model_type: det
      algorithm: DB
      Backbone:
        name: ResNet_vd
        in_channels: 3
        layers: 50
      Neck:
        name: LKPAN
        out_channels: 256
      Head:
        name: DBHead
        kernel_list:
        - 7
        - 2
        - 2
        k: 50
Loss:
  name: CombinedLoss
  loss_config_list:
  - DistillationDilaDBLoss:
      weight: 1.0
      model_name_pairs:
      - - Student
        - Teacher
      - - Student2
        - Teacher
      key: maps
      balance_loss: true
      main_loss_type: DiceLoss
      alpha: 5
      beta: 10
      ohem_ratio: 3
  - DistillationDMLLoss:
      model_name_pairs:
      - Student
      - Student2
      maps_name: thrink_maps
      weight: 1.0
      key: maps
  - DistillationDBLoss:
      weight: 1.0
      model_name_list:
      - Student
      - Student2
      balance_loss: true
      main_loss_type: DiceLoss
      alpha: 5
      beta: 10
      ohem_ratio: 3
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
    warmup_epoch: 2
  regularizer:
    name: L2
    factor: 5.0e-05
PostProcess:
  name: DistillationDBPostProcess
  model_name:
  - Student
  key: head_out
  thresh: 0.3
  box_thresh: 0.6
  max_candidates: 1000
  unclip_ratio: 1.5
Metric:
  name: DistillationMetric
  base_metric_name: DetMetric
  main_indicator: hmean
  key: Student
Train:
  dataset:
    name: SimpleDataSet
    data_dir: /home/sj/Project/PaddleOCR-main/Datasets/det/train/
    label_file_list:
    - /home/sj/Project/PaddleOCR-main/Datasets/det/train.txt
    ratio_list: [1.0]
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - DetLabelEncode: null
    - IaaAugment:
        augmenter_args:
        - type: Fliplr
          args:
            p: 0.5
        - type: Affine
          args:
            rotate:
            - -10
            - 10
        - type: Resize
          args:
            size:
            - 0.5
            - 3
    - EastRandomCropData:
        size:
        - 640
        - 640
        max_tries: 50
        keep_ratio: true
    - MakeBorderMap:
        shrink_ratio: 0.4
        thresh_min: 0.3
        thresh_max: 0.7
        total_epoch: 500
    - MakeShrinkMap:
        shrink_ratio: 0.4
        min_text_size: 8
        total_epoch: 500
    - NormalizeImage:
        scale: 1./255.
        mean:
        - 0.485
        - 0.456
        - 0.406
        std:
        - 0.229
        - 0.224
        - 0.225
        order: hwc
    - ToCHWImage: null
    - KeepKeys:
        keep_keys:
        - image
        - threshold_map
        - threshold_mask
        - shrink_map
        - shrink_mask
  loader:
    shuffle: true
    drop_last: false
    batch_size_per_card: 4
    num_workers: 12
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: /home/sj/Project/PaddleOCR-main/Datasets/det/val/
    label_file_list:
    - /home/sj/Project/PaddleOCR-main/Datasets/det/val.txt
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - DetLabelEncode: null
    - DetResizeForTest:
        limit_side_len: 960
        limit_type: max
    - NormalizeImage:
        scale: 1./255.
        mean:
        - 0.485
        - 0.456
        - 0.406
        std:
        - 0.229
        - 0.224
        - 0.225
        order: hwc
    - ToCHWImage: null
    - KeepKeys:
        keep_keys:
        - image
        - shape
        - polys
        - ignore_tags
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 1
    num_workers: 12
profiler_options: null
2. Text recognition model rec (the one with the problem)
Global:
  debug: false
  use_gpu: true
  epoch_num: 300
  log_smooth_window: 20
  print_batch_step: 100
  save_model_dir: ./output/rec_ppocr_v4
  save_epoch_step: 50
  eval_batch_step:
  - 0
  - 2000
  cal_metric_during_train: false
  # pretrained_model: pretrain_model/en_PP-OCRv4_rec_train/best_accuracy.pdparams
  pretrained_model: null
  checkpoints: null
  save_inference_dir: null
  use_visualdl: false
  infer_img: doc/imgs_words/ch/word_1.jpg
  character_dict_path: ppocr/utils/en_dict.txt
  max_text_length: 25
  infer_mode: false
  use_space_char: true
  distributed: true
  save_res_path: ./output/rec/predicts_ppocrv3.txt
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.0005
    warmup_epoch: 5
  regularizer:
    name: L2
    factor: 3.0e-05
Architecture:
  model_type: rec
  algorithm: SVTR_LCNet
  Transform: null
  Backbone:
    name: PPLCNetV3
    scale: 0.95
  Head:
    name: MultiHead
    head_list:
    - CTCHead:
        Neck:
          name: svtr
          dims: 120
          depth: 2
          hidden_dims: 120
          kernel_size:
          - 1
          - 3
          use_guide: true
        Head:
          fc_decay: 1.0e-05
    - NRTRHead:
        nrtr_dim: 384
        max_text_length: 25
Loss:
  name: MultiLoss
  loss_config_list:
  - CTCLoss: null
  - NRTRLoss: null
PostProcess:
  name: CTCLabelDecode
Metric:
  name: RecMetric
  main_indicator: acc
  ignore_space: false
Train:
  dataset:
    name: MultiScaleDataSet
    ds_width: false
    data_dir: Datasets/rec/train/
    ext_op_transform_idx: 1
    label_file_list:
    - Datasets/rec/train.txt
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - RecConAug:
        prob: 0.5
        ext_data_num: 2
        image_shape:
        - 48
        - 320
        - 3
        max_text_length: 25
    - RecAug: null
    - MultiLabelEncode:
        gtc_encode: NRTRLabelEncode
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_gtc
        - length
        - valid_ratio
  sampler:
    name: MultiScaleSampler
    scales:
    - - 320
      - 32
    - - 320
      - 48
    - - 320
      - 64
    first_bs: 96
    fix_bs: false
    divided_factor:
    - 8
    - 16
    is_training: true
  loader:
    shuffle: true
    batch_size_per_card: 64
    drop_last: true
    num_workers: 8
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: Datasets/rec/val/
    label_file_list:
    - Datasets/rec/val.txt
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - MultiLabelEncode:
        gtc_encode: NRTRLabelEncode
    - RecResizeImg:
        image_shape:
        - 3
        - 48
        - 320
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_gtc
        - length
        - valid_ratio
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 64
    num_workers: 8
profiler_options: null
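One thing that may be worth checking in the rec config: besides `batch_size_per_card`, the Train section uses `MultiScaleSampler` with `first_bs: 96` and `fix_bs: false`. The sketch below assumes, and this is not confirmed anywhere in this post, that the sampler derives a batch size per scale from `first_bs` by keeping width × height × batch size roughly constant. If so, the training batch size comes from `first_bs`, and lowering `batch_size_per_card` alone may not shrink the batches. The function name `multiscale_batch_sizes` is hypothetical; the actual behavior should be verified in `ppocr/data/multi_scale_sampler.py` of your checkout.

```python
def multiscale_batch_sizes(scales, first_bs, divided_factor=(8, 16), fix_bs=False):
    """Return (width, height, batch_size) for each scale.

    Assumption: the first scale together with first_bs fixes a pixel budget,
    and every other scale gets the largest batch that fits that budget.
    """
    base_w, base_h = scales[0]
    base_elements = base_w * base_h * first_bs
    pairs = []
    for w, h in scales:
        # Round width and height down to multiples of the divided factors.
        w = (w // divided_factor[0]) * divided_factor[0]
        h = (h // divided_factor[1]) * divided_factor[1]
        batch_size = first_bs if fix_bs else max(1, base_elements // (w * h))
        pairs.append((w, h, batch_size))
    return pairs


# Values copied from the sampler section of the rec config above.
scales = [(320, 32), (320, 48), (320, 64)]

for first_bs in (96, 8):
    print(f"first_bs={first_bs}")
    for w, h, bs in multiscale_batch_sizes(scales, first_bs):
        print(f"  {w}x{h}: batch size {bs}")
```

Under that assumption, `first_bs: 96` gives batches of 96, 64 and 48 images for the three scales regardless of `batch_size_per_card`, so trying a much smaller `first_bs` (for example 8) would be a cheap experiment.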