别无所求_zjz 2024-03-26 13:19 Acceptance rate: 20%
Views: 22
Closed

Question about #yolov8 training#: how can this be solved?


(/home/cx-a100/zb/fjh/ARR) root@cx-a100:/home/cx-a100/zb/fjh/arrow# yolo segment train data=coco8-seg.yaml model=yolov8m-seg.pt epochs=300 imgsz=640 device=0,1,2,3,4,5,6,7
Ultralytics YOLOv8.1.34 🚀 Python-3.10.14 torch-1.13.0+cu117 CUDA:0 (NVIDIA A100-PCIE-40GB, 40396MiB)
                                                             CUDA:1 (NVIDIA A100-PCIE-40GB, 40396MiB)
                                                             CUDA:2 (NVIDIA A100-PCIE-40GB, 40396MiB)
                                                             CUDA:3 (NVIDIA A100-PCIE-40GB, 40396MiB)
                                                             CUDA:4 (NVIDIA A100-PCIE-40GB, 40396MiB)
                                                             CUDA:5 (NVIDIA A100-PCIE-40GB, 40396MiB)
                                                             CUDA:6 (NVIDIA A100-PCIE-40GB, 40396MiB)
                                                             CUDA:7 (NVIDIA A100-PCIE-40GB, 40396MiB)
WARNING ⚠️ Upgrade to torch>=2.0.0 for deterministic training.
engine/trainer: task=segment, mode=train, model=yolov8m-seg.pt, data=coco8-seg.yaml, epochs=300, time=None, patience=100, batch=16, imgsz=640, save=True, save_period=-1, cache=False, device=(0, 1, 2, 3, 4, 5, 6, 7), workers=8, project=None, name=train24, exist_ok=False, pretrained=True, optimizer=auto, verbose=True, seed=0, deterministic=True, single_cls=False, rect=False, cos_lr=False, close_mosaic=10, resume=False, amp=True, fraction=1.0, profile=False, freeze=None, multi_scale=False, overlap_mask=True, mask_ratio=4, dropout=0.0, val=True, split=val, save_json=False, save_hybrid=False, conf=None, iou=0.7, max_det=300, half=False, dnn=False, plots=True, source=None, vid_stride=1, stream_buffer=False, visualize=False, augment=False, agnostic_nms=False, classes=None, retina_masks=False, embed=None, show=False, save_frames=False, save_txt=False, save_conf=False, save_crop=False, show_labels=True, show_conf=True, show_boxes=True, line_width=None, format=torchscript, keras=False, optimize=False, int8=False, dynamic=False, simplify=False, opset=None, workspace=4, nms=False, lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, dfl=1.5, pose=12.0, kobj=1.0, label_smoothing=0.0, nbs=64, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.5, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, bgr=0.0, mosaic=1.0, mixup=0.0, copy_paste=0.0, auto_augment=randaugment, erasing=0.4, crop_fraction=1.0, cfg=None, tracker=botsort.yaml, save_dir=runs/segment/train24
Overriding model.yaml nc=80 with nc=2

                   from  n    params  module                                       arguments                     
  0                  -1  1      1392  ultralytics.nn.modules.conv.Conv             [3, 48, 3, 2]                 
  1                  -1  1     41664  ultralytics.nn.modules.conv.Conv             [48, 96, 3, 2]                
  2                  -1  2    111360  ultralytics.nn.modules.block.C2f             [96, 96, 2, True]             
  3                  -1  1    166272  ultralytics.nn.modules.conv.Conv             [96, 192, 3, 2]               
  4                  -1  4    813312  ultralytics.nn.modules.block.C2f             [192, 192, 4, True]           
  5                  -1  1    664320  ultralytics.nn.modules.conv.Conv             [192, 384, 3, 2]              
  6                  -1  4   3248640  ultralytics.nn.modules.block.C2f             [384, 384, 4, True]           
  7                  -1  1   1991808  ultralytics.nn.modules.conv.Conv             [384, 576, 3, 2]              
  8                  -1  2   3985920  ultralytics.nn.modules.block.C2f             [576, 576, 2, True]           
  9                  -1  1    831168  ultralytics.nn.modules.block.SPPF            [576, 576, 5]                 
 10                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 11             [-1, 6]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 12                  -1  2   1993728  ultralytics.nn.modules.block.C2f             [960, 384, 2]                 
 13                  -1  1         0  torch.nn.modules.upsampling.Upsample         [None, 2, 'nearest']          
 14             [-1, 4]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 15                  -1  2    517632  ultralytics.nn.modules.block.C2f             [576, 192, 2]                 
 16                  -1  1    332160  ultralytics.nn.modules.conv.Conv             [192, 192, 3, 2]              
 17            [-1, 12]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 18                  -1  2   1846272  ultralytics.nn.modules.block.C2f             [576, 384, 2]                 
 19                  -1  1   1327872  ultralytics.nn.modules.conv.Conv             [384, 384, 3, 2]              
 20             [-1, 9]  1         0  ultralytics.nn.modules.conv.Concat           [1]                           
 21                  -1  2   4207104  ultralytics.nn.modules.block.C2f             [960, 576, 2]                 
 22        [15, 18, 21]  1   5160182  ultralytics.nn.modules.head.Segment          [2, 32, 192, [192, 384, 576]] 
YOLOv8m-seg summary: 331 layers, 27240806 parameters, 27240790 gradients, 110.4 GFLOPs

Transferred 531/537 items from pretrained weights
DDP: debug command /home/cx-a100/zb/fjh/ARR/bin/python -m torch.distributed.run --nproc_per_node 8 --master_port 40671 /root/.config/Ultralytics/DDP/_temp_hk7ijrg_140128794374688.py
Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp-a34b3233.so.1 library.
    Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.
Traceback (most recent call last):
  File "/home/cx-a100/zb/fjh/ARR/bin/yolo", line 8, in <module>
    sys.exit(entrypoint())
  File "/home/cx-a100/zb/fjh/ARR/lib/python3.10/site-packages/ultralytics/cfg/__init__.py", line 582, in entrypoint
    getattr(model, mode)(**overrides)  # default args from model
  File "/home/cx-a100/zb/fjh/ARR/lib/python3.10/site-packages/ultralytics/engine/model.py", line 657, in train
    self.trainer.train()
  File "/home/cx-a100/zb/fjh/ARR/lib/python3.10/site-packages/ultralytics/engine/trainer.py", line 208, in train
    raise e
  File "/home/cx-a100/zb/fjh/ARR/lib/python3.10/site-packages/ultralytics/engine/trainer.py", line 206, in train
    subprocess.run(cmd, check=True)
  File "/home/cx-a100/zb/fjh/ARR/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/home/cx-a100/zb/fjh/ARR/bin/python', '-m', 'torch.distributed.run', '--nproc_per_node', '8', '--master_port', '40671', '/root/.config/Ultralytics/DDP/_temp_hk7ijrg_140128794374688.py']' returned non-zero exit status 1.

Why does training work on a single GPU but fail with multiple GPUs?

4 answers

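    If torch is imported before numpy, the child process gets the GNU threading layer (even when the parent process does not define the variable).

    If numpy is imported before torch, however, the child process gets the INTEL threading layer, which clashes with the GNU OpenMP runtime (libgomp) named in the error message.
    To work around this, add the following environment variable:

    MKL_SERVICE_FORCE_INTEL=1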
    

    On Linux:

    export MKL_SERVICE_FORCE_INTEL=1
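
The same variable can also be set from Python instead of the shell, because child processes started later inherit the parent's environment. This is a minimal sketch, assuming training is launched from a Python script rather than the `yolo` command. The helper name `force_intel_mkl` is hypothetical; the variable name and value are the ones given in this answer.

```python
import os


def force_intel_mkl() -> str:
    """Set MKL_SERVICE_FORCE_INTEL=1 in the current process (hypothetical helper)."""
    os.environ["MKL_SERVICE_FORCE_INTEL"] = "1"
    return os.environ["MKL_SERVICE_FORCE_INTEL"]


# Call this before importing numpy, torch or ultralytics, so that the
# setting is already in place when MKL is loaded and when the DDP
# child processes are spawned.
value = force_intel_mkl()
print("MKL_SERVICE_FORCE_INTEL =", value)

# Training would then follow, for example (not executed in this sketch):
#   from ultralytics import YOLO
#   model = YOLO("yolov8m-seg.pt")
#   model.train(data="coco8-seg.yaml", epochs=300, imgsz=640,
#               device=[0, 1, 2, 3, 4, 5, 6, 7])
```

Whether this alone removes the error depends on the environment, so treat it as one option to try, not a guaranteed fix.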
    

    If the error is still reported, also add:

    export MKL_THREADING_LAYER=GNU
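
Since the error comes from the DDP child process started through `torch.distributed.run`, it can be worth confirming that both variables really reach a child process. The sketch below only launches a trivial Python child, not the training itself. The helper names `child_env_with_mkl_settings` and `read_in_child` are hypothetical.

```python
import os
import subprocess
import sys


def child_env_with_mkl_settings() -> dict:
    """Copy the current environment and add the two variables from the answer."""
    env = os.environ.copy()
    env["MKL_SERVICE_FORCE_INTEL"] = "1"
    env["MKL_THREADING_LAYER"] = "GNU"
    return env


def read_in_child(name: str, env: dict) -> str:
    """Start a child Python process and return the value it sees for `name`."""
    code = "import os, sys; sys.stdout.write(os.environ.get(sys.argv[1], ''))"
    result = subprocess.run(
        [sys.executable, "-c", code, name],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


env = child_env_with_mkl_settings()
seen = {
    name: read_in_child(name, env)
    for name in ("MKL_SERVICE_FORCE_INTEL", "MKL_THREADING_LAYER")
}
print(seen)
```

If the child reports both values, a training command started from the same environment should inherit them as well.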
    
    The asker selected this answer as the best answer.

Question events

  • Closed by the system on April 3
  • Answer accepted on March 26
  • Question edited on March 26
  • Question created on March 26
