2301_80303186 · 2026-01-22 19:08

Qwen3-VL GRPO training: vLLM V1 engine fails to initialize on a single dedicated GPU in a multi-GPU distributed setup, hangs at "Waiting for core engine"

Background: I am implementing GRPO training for Qwen3-VL-8B-Thinking, using a "distributed training + dedicated inference GPU" scheme. I rented three RTX PRO 6000 96G cards on AutoDL: the first two are used for training, and a third card (CUDA:2) is reserved exclusively for vLLM to generate rollout samples. Currently the main process is responsible for loading vLLM on that last GPU.

Environment:

PyTorch: 2.8.0, CUDA: 12.8

vLLM: 0.11.0 (supports Qwen3-VL; V1 engine)

Model:

Qwen3-VL-8B-Thinking

Framework:

trl + deepspeed + vllm

Relevant code:

import os
from unittest.mock import patch

from vllm import LLM, SamplingParams

if self.accelerator.is_main_process:
    target_physical_gpu = "2"  # dedicate physical GPU 2 to vLLM
    os.environ["CUDA_VISIBLE_DEVICES"] = target_physical_gpu
    os.environ["VLLM_DISTRIBUTED_EXECUTOR_BACKEND"] = "uni"
    # Hide the trainer's torch.distributed environment from vLLM
    dist_keys = ["RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT", "LOCAL_RANK"]
    for key in dist_keys:
        if key in os.environ:
            del os.environ[key]

    # Make vLLM believe it is running in a fresh, non-distributed process
    world_size_patch = patch("torch.distributed.get_world_size", return_value=1)
    is_init_patch = patch("torch.distributed.is_initialized", return_value=False)
    rank_patch = patch("torch.distributed.get_rank", return_value=0)

    model_path = "/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking"
    with world_size_patch, is_init_patch, rank_patch:
        self.llm = LLM(
            model=model_path,
            dtype="bfloat16",
            max_model_len=3000,
            trust_remote_code=True,
            gpu_memory_utilization=0.6,
            enforce_eager=True,
        )
        self.sampling_params = SamplingParams(
            temperature=args.temperature,
            top_p=0.9,
            top_k=50,
            max_tokens=self.max_completion_length,
        )

self._last_loaded_step = 0  # tag to avoid useless reloading during grad accumulation
self.accelerator.wait_for_everyone()

Relevant logs:

DEBUG 01-22 11:42:05 [plugins/init.py:36] Available plugins for group vllm.general_plugins:
DEBUG 01-22 11:42:05 [plugins/init.py:38] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 01-22 11:42:05 [plugins/init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 01-22 11:42:05 [entrypoints/utils.py:233] non-default args: {'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 3000, 'gpu_memory_utilization': 0.6, 'disable_log_stats': True, 'enforce_eager': True, 'model': '/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking'}
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
DEBUG 01-22 11:42:05 [model_executor/models/registry.py:498] Loaded model info for class vllm.model_executor.models.qwen3_vl.Qwen3VLForConditionalGeneration from cache
DEBUG 01-22 11:42:05 [logging_utils/log_time.py:27] Registry inspect model class: Elapsed time 0.0005291 secs
INFO 01-22 11:42:05 [config/model.py:547] Resolved architecture: Qwen3VLForConditionalGeneration
torch_dtype is deprecated! Use dtype instead!
INFO 01-22 11:42:05 [config/model.py:1510] Using max model len 3000
DEBUG 01-22 11:42:05 [engine/arg_utils.py:1672] Setting max_num_batched_tokens to 16384 for LLM_CLASS usage context.
DEBUG 01-22 11:42:05 [engine/arg_utils.py:1681] Setting max_num_seqs to 1024 for LLM_CLASS usage context.
INFO 01-22 11:42:06 [config/scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 01-22 11:42:06 [config/init.py:381] Cudagraph is disabled under eager mode
DEBUG 01-22 11:42:06 [v1/engine/llm_engine.py:173] Enabling multiprocessing for LLMEngine.
⚙️ Running in WANDB offline mode
DEBUG 01-22 11:42:11 [plugins/init.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 01-22 11:42:11 [platforms/init.py:34] Checking if TPU platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:52] TPU platform is not available because: No module named 'libtpu'
DEBUG 01-22 11:42:11 [platforms/init.py:58] Checking if CUDA platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:78] Confirmed CUDA platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:106] Checking if ROCm platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:120] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 01-22 11:42:11 [platforms/init.py:127] Checking if XPU platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:146] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 01-22 11:42:11 [platforms/init.py:153] Checking if CPU platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:58] Checking if CUDA platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:78] Confirmed CUDA platform is available.
INFO 01-22 11:42:11 [platforms/init.py:216] Automatically detected platform cuda.
[2026-01-22 11:42:11,653] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
wandb: Tracking run with wandb version 0.23.1
wandb: W&B syncing is set to offline in this directory. Run wandb online or set WANDB_MODE=online to enable cloud syncing.
wandb: Run data is saved locally in /root/autodl-tmp/Visionary-R1/wandb/offline-run-20260122_114213-8oobf1wi
wandb: Detected [huggingface_hub.inference, openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with pip install weave then add import weave to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
(EngineCore_DP0 pid=2180) INFO 01-22 11:42:13 [v1/engine/core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [v1/engine/core.py:652] Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=['ipc:///tmp/7acf285c-099a-4866-8ee1-6d7639956f98'], outputs=['ipc:///tmp/b04f3893-d700-4811-9a2b-08f43b66487d'], coordinator_input=None, coordinator_output=None, frontend_stats_publish_address=None), parallel_config={'data_parallel_master_ip': '127.0.0.1', 'data_parallel_master_port': 0, '_data_parallel_master_port_list': [], 'data_parallel_size': 1})
DEBUG 01-22 11:42:13 [v1/engine/utils.py:859] HELLO from local core engine process 0.
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [v1/engine/core.py:487] Has DP Coordinator: False, stats publish address: None
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [plugins/init.py:36] Available plugins for group vllm.general_plugins:
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [plugins/init.py:38] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [plugins/init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
(EngineCore_DP0 pid=2180) INFO 01-22 11:42:13 [v1/engine/core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking', speculative_config=None, tokenizer='/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=3000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [compilation/decorators.py:155] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.deepseek_v2.DeepseekV2Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [compilation/decorators.py:155] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [compilation/decorators.py:155] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama_eagle3.LlamaModel'>: ['input_ids', 'positions', 'hidden_states']
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [utils/init.py:3188] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f0cb76f0110>
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [distributed/parallel_state.py:1029] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.6:54021 backend=nccl
DEBUG 01-22 11:42:23 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-22 11:42:33 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-22 11:42:43 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-22 11:42:53 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.

2 answers

  • 檀越@新空间 2026-01-22 19:15

    Good evening 🌙🌙🌙
    This answer is based on Tongyi Qianwen (Qwen).

    The problem you are hitting is that, in a multi-GPU distributed training environment, the vLLM V1 engine cannot finish single-GPU initialization and hangs at "Waiting for core engine". Below is a detailed analysis and a set of possible fixes.


    I. Root-cause analysis

    1. vLLM V1 engine concurrency model

    • vLLM V1 (0.11.0) supports multi-process execution by default, but a single engine instance does not use multiple GPUs concurrently in this configuration.
    • When the LLM is constructed in the main process, vLLM spawns a child process to run inference. If that child cannot be assigned the intended GPU (CUDA:2), the front-end keeps waiting and the program hangs.

    2. CUDA_VISIBLE_DEVICES set incorrectly or too late

    • The code sets os.environ["CUDA_VISIBLE_DEVICES"] = target_physical_gpu, but that value may not reach the engine subprocess as intended.
    • os.environ only affects the current process; a child process inherits the environment as it exists at the moment it is spawned. If CUDA_VISIBLE_DEVICES is changed after CUDA has already been initialized in the parent, or after the child has started, the GPU assignment can silently fail.
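A pure-stdlib sanity check of that inheritance timing (no CUDA needed; DEMO_VISIBLE_DEVICES is a made-up stand-in for CUDA_VISIBLE_DEVICES):

```python
import os
import subprocess
import sys

# A child spawned AFTER the parent sets a variable does inherit it.
# The pitfall is ordering: setting CUDA_VISIBLE_DEVICES after the parent
# has already created a CUDA context, or after the child has started,
# has no effect on device visibility.
os.environ["DEMO_VISIBLE_DEVICES"] = "2"
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('DEMO_VISIBLE_DEVICES'))"],
    capture_output=True, text=True,
)
print(child.stdout.strip())  # the child sees "2"
```

So the question is not whether the value propagates at all, but whether it is set early enough, before vLLM forks its engine-core process.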

    3. vLLM distributed configuration

    • You set VLLM_DISTRIBUTED_EXECUTOR_BACKEND = "uni", requesting the uniprocess executor backend, but vLLM V1 may not handle this setting the way you expect in this mixed trainer/engine setup.
    • In this configuration vLLM V1's distributed features are limited to threads and processes on a single device; it does not perform cross-GPU distributed execution here.

    4. PyTorch / vLLM version compatibility

    • PyTorch 2.8.0 and vLLM 0.11.0 may have compatibility issues, particularly around GPU resource management.

    II. Solutions

    ✅ 1. Make sure the vLLM process can access the intended GPU

    Modified code:

    import os
    from contextlib import contextmanager
    from unittest.mock import patch

    from vllm import LLM, SamplingParams

    @contextmanager
    def set_env_vars(**kwargs):
        # Temporarily override environment variables, restoring the
        # original environment on exit.
        original = os.environ.copy()
        try:
            os.environ.update(kwargs)
            yield
        finally:
            os.environ.clear()
            os.environ.update(original)

    if self.accelerator.is_main_process:
        target_physical_gpu = "2"  # dedicate physical GPU 2 to vLLM
        with set_env_vars(CUDA_VISIBLE_DEVICES=target_physical_gpu):
            # Hide the trainer's torch.distributed environment from vLLM
            dist_keys = ["RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT", "LOCAL_RANK"]
            for key in dist_keys:
                if key in os.environ:
                    del os.environ[key]

            # Make vLLM see a fresh, non-distributed process
            world_size_patch = patch(
                "torch.distributed.get_world_size", return_value=1)
            is_init_patch = patch("torch.distributed.is_initialized", return_value=False)
            rank_patch = patch("torch.distributed.get_rank", return_value=0)

            with world_size_patch, is_init_patch, rank_patch:
                model_path = "/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking"
                self.llm = LLM(
                    model=model_path,
                    dtype="bfloat16",
                    max_model_len=3000,
                    trust_remote_code=True,
                    gpu_memory_utilization=0.6,
                    enforce_eager=True,
                )
                self.sampling_params = SamplingParams(
                    temperature=args.temperature,
                    top_p=0.9,
                    top_k=50,
                    max_tokens=self.max_completion_length,
                )

    Key points:

    • The set_env_vars context manager makes sure CUDA_VISIBLE_DEVICES is in place before vLLM spawns its engine subprocess, so the child inherits the right value.
    • It also avoids permanently mutating os.environ: the original environment is restored once the block exits.
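A standalone check of that restore behavior (the same context manager is re-defined here so the snippet runs on its own; DEMO_VAR is an arbitrary example variable):

```python
import os
from contextlib import contextmanager

@contextmanager
def set_env_vars(**kwargs):
    # Snapshot the environment, apply the overrides, and restore the
    # snapshot on exit, even if the body raises.
    original = os.environ.copy()
    try:
        os.environ.update(kwargs)
        yield
    finally:
        os.environ.clear()
        os.environ.update(original)

os.environ.pop("DEMO_VAR", None)
with set_env_vars(DEMO_VAR="2"):
    inside = os.environ["DEMO_VAR"]   # "2" while the block is active
after = "DEMO_VAR" in os.environ      # False once the block exits
print(inside, after)
```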

    ✅ 2. Disable vLLM's distributed features

    Since vLLM V1 does not support cross-GPU distributed execution in this setup, disable its distributed configuration.

    Modified code:

    # Disable vLLM's distributed executor before constructing the LLM
    os.environ["VLLM_DISTRIBUTED_EXECUTOR_BACKEND"] = "none"


    Key point:

    • Setting VLLM_DISTRIBUTED_EXECUTOR_BACKEND to "none" keeps vLLM from launching extra worker processes, which can resolve the "Waiting for core engine" hang. (Verify the accepted values for this variable in your vLLM version; "none" may not be recognized by every release.)

    ✅ 3. Upgrade vLLM (recommended)

    Newer vLLM releases improve multi-GPU and distributed inference support and tend to be more compatible with GRPO-style training loops.

    Install a newer vLLM (first confirm the target version supports Qwen3-VL):

    pip install -U vllm


    Key point:

    • If you use Qwen3-VL, confirm that the newer release has adapted the model. Otherwise, stay on your current version and apply the fixes above.

    ✅ 4. Verify the model path and permissions

    Check the following:

    • The model path is correct (/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking).
    • You have permission to access that directory.
    • The model files are complete.

    ✅ 5. Keep the distributed logic simple: load vLLM only in the main process

    You already load vLLM in the main process only, but vLLM itself spawns subprocesses. Make sure no other rank constructs an engine:

    if self.accelerator.is_main_process:
        # load vLLM
        ...
    else:
        # other ranks do not load vLLM
        pass


    Key point:

    • Never load vLLM in several processes at once; that invites GPU resource contention and conflicts.
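To make the main-process-only pattern concrete, here is one possible wiring (a sketch, not the TRL implementation: `broadcast_object_list` is from `accelerate.utils`; `split_for_ranks`, `all_prompts`, and `total_batch_size` are hypothetical names). Rank 0 generates completions for the whole global batch, broadcasts them, and each rank keeps its own slice:

```python
def split_for_ranks(completions, world_size):
    """Split a flat list of completions into one contiguous chunk per rank.

    Assumes len(completions) is divisible by world_size, as is typical
    when every rank contributes the same number of prompts.
    """
    per_rank = len(completions) // world_size
    return [completions[i * per_rank:(i + 1) * per_rank]
            for i in range(world_size)]

# Pseudo-wiring inside the training step (needs a live accelerator/cluster):
# if self.accelerator.is_main_process:
#     outputs = self.llm.generate(all_prompts, self.sampling_params)
#     completions = [o.outputs[0].text for o in outputs]
# else:
#     completions = [None] * total_batch_size
# from accelerate.utils import broadcast_object_list
# completions = broadcast_object_list(completions, from_process=0)
# local = split_for_ranks(
#     completions, self.accelerator.num_processes
# )[self.accelerator.process_index]
```

The pure helper is runnable as-is; the commented section shows where it would sit in the loop once the accelerator is available.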

    III. Summary

    | Problem | Fix |
    |---------|-----|
    | vLLM V1 fails to initialize on a single GPU | Disable the distributed config; make sure the subprocess inherits CUDA_VISIBLE_DEVICES |
    | Hangs at "Waiting for core engine" | Upgrade to a newer vLLM release |
    | GPU conflicts in the multi-GPU setup | Give each process its own GPU; avoid sharing |
    | Model fails to load | Check model path, permissions, and file integrity |


    IV. Final recommendations

    Strongly recommended:

    • If you are using Qwen3-VL-8B-Thinking, first confirm that your vLLM build actually supports this model.
    • If it does not, consider Hugging Face Transformers + accelerate instead of vLLM for generation, for the sake of stability.

    If you have further questions about the model architecture or the GRPO implementation, feel free to ask!

