丙酸氟替卡松 2024-12-04 14:22 Acceptance rate: 0%
Views: 73
Closed

PaddleOCR: Out of memory error on GPU

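I set up a Paddle environment on my company's Ubuntu server. While training PaddleOCR I ran into "Out of memory error on GPU 0. Cannot allocate 28.125000MB memory on GPU 0, 11.598755GB memory has been allocated ...". The main fix suggested online is to reduce the batch size in the config file. I tried that and it still fails, even with the batch size set to 1. Does anyone know how to solve this?
Neither of the two GPUs is being used by anything else.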


Traceback (most recent call last):
  File "/home/sj/Project/PaddleOCR-main/tools/train.py", line 269, in <module>
    main(config, device, logger, vdl_writer, seed)
  File "/home/sj/Project/PaddleOCR-main/tools/train.py", line 222, in main
    program.train(
  File "/home/sj/Project/PaddleOCR-main/tools/program.py", line 345, in train
    preds = model(images, data=batch[1:])
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sj/anaconda3/envs/paddle_env/lib/python3.11/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sj/Project/PaddleOCR-main/ppocr/modeling/architectures/base_model.py", line 85, in forward
    x = self.backbone(x)
        ^^^^^^^^^^^^^^^^
  File "/home/sj/anaconda3/envs/paddle_env/lib/python3.11/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sj/Project/PaddleOCR-main/ppocr/modeling/backbones/rec_lcnetv3.py", line 544, in forward
    x = self.blocks6(x)
        ^^^^^^^^^^^^^^^
  File "/home/sj/anaconda3/envs/paddle_env/lib/python3.11/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sj/anaconda3/envs/paddle_env/lib/python3.11/site-packages/paddle/nn/layer/container.py", line 615, in forward
    input = layer(input)
            ^^^^^^^^^^^^
  File "/home/sj/anaconda3/envs/paddle_env/lib/python3.11/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sj/Project/PaddleOCR-main/ppocr/modeling/backbones/rec_lcnetv3.py", line 390, in forward
    x = self.pw_conv(x)
        ^^^^^^^^^^^^^^^
  File "/home/sj/anaconda3/envs/paddle_env/lib/python3.11/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sj/Project/PaddleOCR-main/ppocr/modeling/backbones/rec_lcnetv3.py", line 223, in forward
    out += self.identity(x)
MemoryError: 

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::CallScalarFuction(paddle::Tensor const&, double, std::string)
1   scale_ad_func(paddle::Tensor const&, paddle::experimental::ScalarBase<paddle::Tensor>, paddle::experimental::ScalarBase<paddle::Tensor>, bool)
2   paddle::experimental::scale(paddle::Tensor const&, paddle::experimental::ScalarBase<paddle::Tensor> const&, paddle::experimental::ScalarBase<paddle::Tensor> const&, bool)
3   void phi::ScaleKernel<float, phi::GPUContext>(phi::GPUContext const&, phi::DenseTensor const&, paddle::experimental::ScalarBase<phi::DenseTensor> const&, paddle::experimental::ScalarBase<phi::DenseTensor> const&, bool, phi::DenseTensor*)
4   float* phi::DeviceContext::Alloc<float>(phi::TensorBase*, unsigned long, bool) const
5   phi::DeviceContext::Impl::Alloc(phi::TensorBase*, phi::Place const&, phi::DataType, unsigned long, bool, bool) const
6   phi::DenseTensor::AllocateFrom(phi::Allocator*, phi::DataType, unsigned long, bool)
7   paddle::memory::allocation::Allocator::Allocate(unsigned long)
8   paddle::memory::allocation::StatAllocator::AllocateImpl(unsigned long)
9   paddle::memory::allocation::Allocator::Allocate(unsigned long)
10  paddle::memory::allocation::Allocator::Allocate(unsigned long)
11  paddle::memory::allocation::Allocator::Allocate(unsigned long)
12  paddle::memory::allocation::Allocator::Allocate(unsigned long)
13  paddle::memory::allocation::CUDAAllocator::AllocateImpl(unsigned long)
14  std::string phi::enforce::GetCompleteTraceBackString<std::string >(std::string&&, char const*, int)
15  common::enforce::GetCurrentTraceBackString[abi:cxx11](bool)

----------------------
Error Message Summary:
----------------------
ResourceExhaustedError: 

Out of memory error on GPU 0. Cannot allocate 28.125000MB memory on GPU 0, 11.602661GB memory has been allocated and available memory is only 22.687500MB.

Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model. 
 (at ../paddle/fluid/memory/allocation/cuda_allocator.cc:86)
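
The figures in the summary above can be read together: roughly 11.6 GB had already been allocated, and only about 22.7 MB remained when a further 28.1 MB was requested. As a minimal sketch, the numbers can be extracted and compared as below. The helper name `parse_paddle_oom` is hypothetical, the regular expression is written only against the message text shown above, and 1 GB is treated as 1024 MB for the conversion.

```python
import re

# Message text copied from the error summary above.
OOM_MESSAGE = (
    "Out of memory error on GPU 0. Cannot allocate 28.125000MB memory on GPU 0, "
    "11.602661GB memory has been allocated and available memory is only 22.687500MB."
)

# Conversion factors to MB (assumption: 1 GB = 1024 MB).
_UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}

_PATTERN = re.compile(
    r"Cannot allocate (?P<req>[\d.]+)(?P<req_u>KB|MB|GB) memory on GPU (?P<gpu>\d+), "
    r"(?P<alloc>[\d.]+)(?P<alloc_u>KB|MB|GB) memory has been allocated "
    r"and available memory is only (?P<free>[\d.]+)(?P<free_u>KB|MB|GB)"
)


def parse_paddle_oom(message: str) -> dict:
    """Hypothetical helper: extract the sizes (in MB) from the OOM message."""
    m = _PATTERN.search(message)
    if m is None:
        raise ValueError("message does not match the expected OOM format")
    return {
        "gpu": int(m["gpu"]),
        "requested_mb": float(m["req"]) * _UNIT_TO_MB[m["req_u"]],
        "allocated_mb": float(m["alloc"]) * _UNIT_TO_MB[m["alloc_u"]],
        "available_mb": float(m["free"]) * _UNIT_TO_MB[m["free_u"]],
    }


info = parse_paddle_oom(OOM_MESSAGE)
# How far the request exceeded what was still free.
shortfall_mb = info["requested_mb"] - info["available_mb"]
print(info)
print(f"shortfall: {shortfall_mb:.4f} MB")
```

A shortfall of only a few megabytes on top of about 11.6 GB already allocated would be consistent with the card having filled up gradually during the forward pass, rather than with one oversized allocation.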

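Attached are the config files for the text detection model (det) and the text recognition model (rec). The one that fails is the recognition model (rec); the detection model (det) trains fine.
1. Text detection model det (runs)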

Global:
  debug: false
  use_gpu: true
  epoch_num: 800
  log_smooth_window: 20
  print_batch_step: 20
  save_model_dir: ./output/ch_PP-OCRv4
  save_epoch_step: 100
  eval_batch_step:
  - 0
  - 20000
  cal_metric_during_train: false
  checkpoints: null
  pretrained_model: null
  save_inference_dir: null
  use_visualdl: false
  infer_img: doc/imgs_en/img_10.jpg
  save_res_path: ./checkpoints/det_db/predicts_db.txt
  distributed: true
Architecture:
  name: DistillationModel
  algorithm: Distillation
  model_type: det
  Models:
    Student:
      model_type: det
      algorithm: DB
      Transform: null
      Backbone:
        name: PPLCNetV3
        scale: 0.75
        pretrained: false
        det: true
      Neck:
        name: RSEFPN
        out_channels: 96
        shortcut: true
      Head:
        name: DBHead
        k: 50
    Student2:
      pretrained: null
      model_type: det
      algorithm: DB
      Transform: null
      Backbone:
        name: PPLCNetV3
        scale: 0.75
        pretrained: true
        det: true
      Neck:
        name: RSEFPN
        out_channels: 96
        shortcut: true
      Head:
        name: DBHead
        k: 50
    Teacher:
      pretrained: https://paddleocr.bj.bcebos.com/PP-OCRv4/chinese/ch_PP-OCRv4_det_cml_teacher_pretrained/teacher.pdparams
      freeze_params: true
      return_all_feats: false
      model_type: det
      algorithm: DB
      Backbone:
        name: ResNet_vd
        in_channels: 3
        layers: 50
      Neck:
        name: LKPAN
        out_channels: 256
      Head:
        name: DBHead
        kernel_list:
        - 7
        - 2
        - 2
        k: 50
Loss:
  name: CombinedLoss
  loss_config_list:
  - DistillationDilaDBLoss:
      weight: 1.0
      model_name_pairs:
      - - Student
        - Teacher
      - - Student2
        - Teacher
      key: maps
      balance_loss: true
      main_loss_type: DiceLoss
      alpha: 5
      beta: 10
      ohem_ratio: 3
  - DistillationDMLLoss:
      model_name_pairs:
      - Student
      - Student2
      maps_name: thrink_maps
      weight: 1.0
      key: maps
  - DistillationDBLoss:
      weight: 1.0
      model_name_list:
      - Student
      - Student2
      balance_loss: true
      main_loss_type: DiceLoss
      alpha: 5
      beta: 10
      ohem_ratio: 3
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.001
    warmup_epoch: 2
  regularizer:
    name: L2
    factor: 5.0e-05
PostProcess:
  name: DistillationDBPostProcess
  model_name:
  - Student
  key: head_out
  thresh: 0.3
  box_thresh: 0.6
  max_candidates: 1000
  unclip_ratio: 1.5
Metric:
  name: DistillationMetric
  base_metric_name: DetMetric
  main_indicator: hmean
  key: Student
Train:
  dataset:
    name: SimpleDataSet
    data_dir: /home/sj/Project/PaddleOCR-main/Datasets/det/train/
    label_file_list:
      - /home/sj/Project/PaddleOCR-main/Datasets/det/train.txt
    ratio_list: [1.0]
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - DetLabelEncode: null
    - IaaAugment:
        augmenter_args:
        - type: Fliplr
          args:
            p: 0.5
        - type: Affine
          args:
            rotate:
            - -10
            - 10
        - type: Resize
          args:
            size:
            - 0.5
            - 3
    - EastRandomCropData:
        size:
        - 640
        - 640
        max_tries: 50
        keep_ratio: true
    - MakeBorderMap:
        shrink_ratio: 0.4
        thresh_min: 0.3
        thresh_max: 0.7
        total_epoch: 500
    - MakeShrinkMap:
        shrink_ratio: 0.4
        min_text_size: 8
        total_epoch: 500
    - NormalizeImage:
        scale: 1./255.
        mean:
        - 0.485
        - 0.456
        - 0.406
        std:
        - 0.229
        - 0.224
        - 0.225
        order: hwc
    - ToCHWImage: null
    - KeepKeys:
        keep_keys:
        - image
        - threshold_map
        - threshold_mask
        - shrink_map
        - shrink_mask
  loader:
    shuffle: true
    drop_last: false
    batch_size_per_card: 4
    num_workers: 12
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: /home/sj/Project/PaddleOCR-main/Datasets/det/val/
    label_file_list:
      - /home/sj/Project/PaddleOCR-main/Datasets/det/val.txt
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - DetLabelEncode: null
    - DetResizeForTest: 
        limit_side_len: 960
        limit_type: max
    - NormalizeImage:
        scale: 1./255.
        mean:
        - 0.485
        - 0.456
        - 0.406
        std:
        - 0.229
        - 0.224
        - 0.225
        order: hwc
    - ToCHWImage: null
    - KeepKeys:
        keep_keys:
        - image
        - shape
        - polys
        - ignore_tags
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 1
    num_workers: 12
profiler_options: null

2. Text recognition model rec (the one that fails)

Global:
  debug: false
  use_gpu: true
  epoch_num: 300
  log_smooth_window: 20
  print_batch_step: 100
  save_model_dir: ./output/rec_ppocr_v4
  save_epoch_step: 50
  eval_batch_step:
  - 0
  - 2000
  cal_metric_during_train: false
#  pretrained_model: pretrain_model/en_PP-OCRv4_rec_train/best_accuracy.pdparams
  pretrained_model: null
  checkpoints: null
  save_inference_dir: null
  use_visualdl: false
  infer_img: doc/imgs_words/ch/word_1.jpg
  character_dict_path: ppocr/utils/en_dict.txt
  max_text_length: 25
  infer_mode: false
  use_space_char: true
  distributed: true
  save_res_path: ./output/rec/predicts_ppocrv3.txt
Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Cosine
    learning_rate: 0.0005
    warmup_epoch: 5
  regularizer:
    name: L2
    factor: 3.0e-05
Architecture:
  model_type: rec
  algorithm: SVTR_LCNet
  Transform: null
  Backbone:
    name: PPLCNetV3
    scale: 0.95
  Head:
    name: MultiHead
    head_list:
    - CTCHead:
        Neck:
          name: svtr
          dims: 120
          depth: 2
          hidden_dims: 120
          kernel_size:
          - 1
          - 3
          use_guide: true
        Head:
          fc_decay: 1.0e-05
    - NRTRHead:
        nrtr_dim: 384
        max_text_length: 25
Loss:
  name: MultiLoss
  loss_config_list:
  - CTCLoss: null
  - NRTRLoss: null
PostProcess:
  name: CTCLabelDecode
Metric:
  name: RecMetric
  main_indicator: acc
  ignore_space: false
Train:
  dataset:
    name: MultiScaleDataSet
    ds_width: false
    data_dir: Datasets/rec/train/
    ext_op_transform_idx: 1
    label_file_list:
    - Datasets/rec/train.txt
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - RecConAug:
        prob: 0.5
        ext_data_num: 2
        image_shape:
        - 48
        - 320
        - 3
        max_text_length: 25
    - RecAug: null
    - MultiLabelEncode:
        gtc_encode: NRTRLabelEncode
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_gtc
        - length
        - valid_ratio
  sampler:
    name: MultiScaleSampler
    scales:
    - - 320
      - 32
    - - 320
      - 48
    - - 320
      - 64
    first_bs: 96
    fix_bs: false
    divided_factor:
    - 8
    - 16
    is_training: true
  loader:
    shuffle: true
    batch_size_per_card: 64
    drop_last: true
    num_workers: 8
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: Datasets/rec/val/
    label_file_list:
    - Datasets/rec/val.txt
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - MultiLabelEncode:
        gtc_encode: NRTRLabelEncode
    - RecResizeImg:
        image_shape:
        - 3
        - 48
        - 320
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_gtc
        - length
        - valid_ratio
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 64
    num_workers: 8
profiler_options: null
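
One thing worth checking in the rec config is that its `Train` section defines both a `sampler` block (with `first_bs: 96`) and `loader.batch_size_per_card: 64`. The sketch below is an assumption about how a multi-scale sampler of this kind sizes its batches, not a copy of PaddleOCR's implementation: it assumes the sampler keeps a roughly constant pixel budget derived from `first_bs` and the first scale, and that `batch_size_per_card` is not consulted. The function name `multiscale_batch_sizes` is hypothetical.

```python
def multiscale_batch_sizes(scales, first_bs, fix_bs, divided_factor=(8, 16)):
    """Sketch of constant-pixel-budget batch sizing (assumed behaviour).

    scales: list of [width, height] pairs, as in the `scales` field above.
    Returns a list of (width, height, batch_size) tuples.
    """
    # Round widths and heights down to multiples of the divided_factor entries.
    widths = [(w // divided_factor[0]) * divided_factor[0] for w, _ in scales]
    heights = [(h // divided_factor[1]) * divided_factor[1] for _, h in scales]

    # The first scale together with first_bs fixes the per-batch pixel budget.
    base_w, base_h = scales[0]
    base_elements = base_w * base_h * first_bs

    pairs = []
    for w, h in zip(widths, heights):
        # With fix_bs the batch size is constant; otherwise it shrinks as
        # the image area grows so that the pixel budget stays the same.
        bs = first_bs if fix_bs else max(1, int(base_elements / (h * w)))
        pairs.append((w, h, bs))
    return pairs


# Values taken from the rec config above.
pairs = multiscale_batch_sizes(
    scales=[[320, 32], [320, 48], [320, 64]],
    first_bs=96,
    fix_bs=False,
)
for w, h, bs in pairs:
    print(f"width={w} height={h} batch_size={bs}")
```

If this assumption matches the sampler in your checkout, reducing `batch_size_per_card` alone would leave the training batches at 96, 64 and 48 images, which would be consistent with the error persisting at a batch size of 1. In that case `first_bs` would be the setting to lower. Verify this against the sampler source before relying on it.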


31 answers

  • 阿里嘎多学长 2024-12-04 14:22
    Received a ¥0.30 bounty for this question

    Compiled by 阿里嘎多学长 from AI-generated content (AIGC).

    PaddleOCR: Out of memory error on GPU

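    You ran into an "Out of memory error on GPU" while training PaddleOCR. This may be caused by insufficient GPU memory or by other factors.

    1. Check GPU memory: use the nvidia-smi command to see how much memory each GPU is currently using and whether another process is holding memory on GPU 0.
    2. Reduce the model size: if GPU memory is insufficient, try a smaller model, for example one with fewer layers or fewer channels.
    3. Increase swap space: swap extends host (CPU) memory only, so it can ease host-side memory pressure at some cost in performance, but it does not add GPU memory.
    4. Adjust hyperparameters: lower the ones that affect memory use, chiefly batch_size; learning_rate changes convergence but not memory consumption.
    5. Data augmentation: adding noise, flipping, rotating and similar operations increase the amount and variety of training data, but they do not reduce GPU memory use.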
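
The first suggestion above (checking GPU memory) can be sketched as follows. This is a minimal example, assuming `nvidia-smi` is on the PATH; the helper names are hypothetical and the sample numbers are made up for illustration, not taken from the original post. It parses per-GPU memory figures and selects the GPU with the most free memory through `CUDA_VISIBLE_DEVICES`.

```python
import os
import subprocess

# nvidia-smi query that prints one "index, used, total" row per GPU (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list:
    """Parse 'index, used, total' rows (MiB) into dicts with free memory."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        gpus.append({"index": index, "used_mib": used, "total_mib": total,
                     "free_mib": total - used})
    return gpus


def pick_freest_gpu(gpus: list) -> int:
    """Return the index of the GPU with the most free memory."""
    return max(gpus, key=lambda g: g["free_mib"])["index"]


def query_gpus() -> list:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)


# Illustrative sample output (made-up numbers, not from the original post).
sample = "0, 11900, 12288\n1, 3, 12288\n"
gpus = parse_gpu_memory(sample)
best = pick_freest_gpu(gpus)

# Environment for a training subprocess restricted to that GPU.
env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(best))
print(gpus, best)
```

On a real machine, call `query_gpus()` in place of the sample string and pass `env` to the process that launches training.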

    Specifically, you can try the following code:

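    import subprocess
    import sys

    # Training in PaddleOCR is launched through tools/train.py with a YAML
    # config; the paddleocr.PaddleOCR class is for inference and has no train().
    config_path = "path/to/your_rec_config.yml"  # placeholder, not from the post

    # Reduce the batch size with a command-line override (-o Key.path=value)
    batch_size = 2

    # Model size is set in the Architecture section of the same config file.
    cmd = [
        sys.executable, "tools/train.py",
        "-c", config_path,
        "-o", f"Train.loader.batch_size_per_card={batch_size}",
    ]

    # Run from the PaddleOCR repository root
    subprocess.run(cmd, check=True)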
    

    If none of the above resolves the problem, please provide more log output and configuration details so that the issue can be diagnosed further.

Question events

  • Closed by the system on December 12
  • Question edited on December 4
  • Question created on December 4
