2301_80303186 · 2026-01-22 19:08

Qwen3-VL GRPO training: vLLM V1 engine fails to initialize on a single dedicated GPU in a multi-GPU distributed setup, hangs at "Waiting for core engine"

Background: I am implementing GRPO training for Qwen3-VL-8B-Thinking, using a "distributed training + dedicated inference GPU" scheme. I rented three RTX PRO 6000 96G cards on AutoDL: the first two are used for training, and a third card (CUDA:2) is reserved exclusively for vLLM to generate rollout samples. Currently the main process is responsible for loading vLLM on that last GPU.

Environment:

PyTorch: 2.8.0, CUDA: 12.8

vLLM: 0.11.0 (supports Qwen3-VL; V1 engine)

Model:

Qwen3-VL-8B-Thinking

Framework:

trl + deepspeed + vllm

Relevant code:

import os
from unittest.mock import patch

from vllm import LLM, SamplingParams

if self.accelerator.is_main_process:
    target_physical_gpu = "2"  # dedicate physical GPU 2 to vLLM
    os.environ["CUDA_VISIBLE_DEVICES"] = target_physical_gpu
    os.environ["VLLM_DISTRIBUTED_EXECUTOR_BACKEND"] = "uni"
    # Hide the trainer's torch.distributed environment from vLLM
    dist_keys = ["RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT", "LOCAL_RANK"]
    for key in dist_keys:
        if key in os.environ:
            del os.environ[key]

    # Make vLLM believe it is running in a fresh, non-distributed process
    world_size_patch = patch("torch.distributed.get_world_size", return_value=1)
    is_init_patch = patch("torch.distributed.is_initialized", return_value=False)
    rank_patch = patch("torch.distributed.get_rank", return_value=0)

    model_path = "/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking"
    with world_size_patch, is_init_patch, rank_patch:
        self.llm = LLM(
            model=model_path,
            dtype="bfloat16",
            max_model_len=3000,
            trust_remote_code=True,
            gpu_memory_utilization=0.6,
            enforce_eager=True,
        )
        self.sampling_params = SamplingParams(
            temperature=args.temperature,
            top_p=0.9,
            top_k=50,
            max_tokens=self.max_completion_length,
        )

self._last_loaded_step = 0  # tag to avoid useless reloading during grad accumulation
self.accelerator.wait_for_everyone()

Relevant logs:

DEBUG 01-22 11:42:05 [plugins/init.py:36] Available plugins for group vllm.general_plugins:
DEBUG 01-22 11:42:05 [plugins/init.py:38] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 01-22 11:42:05 [plugins/init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 01-22 11:42:05 [entrypoints/utils.py:233] non-default args: {'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 3000, 'gpu_memory_utilization': 0.6, 'disable_log_stats': True, 'enforce_eager': True, 'model': '/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking'}
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
DEBUG 01-22 11:42:05 [model_executor/models/registry.py:498] Loaded model info for class vllm.model_executor.models.qwen3_vl.Qwen3VLForConditionalGeneration from cache
DEBUG 01-22 11:42:05 [logging_utils/log_time.py:27] Registry inspect model class: Elapsed time 0.0005291 secs
INFO 01-22 11:42:05 [config/model.py:547] Resolved architecture: Qwen3VLForConditionalGeneration
torch_dtype is deprecated! Use dtype instead!
INFO 01-22 11:42:05 [config/model.py:1510] Using max model len 3000
DEBUG 01-22 11:42:05 [engine/arg_utils.py:1672] Setting max_num_batched_tokens to 16384 for LLM_CLASS usage context.
DEBUG 01-22 11:42:05 [engine/arg_utils.py:1681] Setting max_num_seqs to 1024 for LLM_CLASS usage context.
INFO 01-22 11:42:06 [config/scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 01-22 11:42:06 [config/init.py:381] Cudagraph is disabled under eager mode
DEBUG 01-22 11:42:06 [v1/engine/llm_engine.py:173] Enabling multiprocessing for LLMEngine.
⚙️ Running in WANDB offline mode
DEBUG 01-22 11:42:11 [plugins/init.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 01-22 11:42:11 [platforms/init.py:34] Checking if TPU platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:52] TPU platform is not available because: No module named 'libtpu'
DEBUG 01-22 11:42:11 [platforms/init.py:58] Checking if CUDA platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:78] Confirmed CUDA platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:106] Checking if ROCm platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:120] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 01-22 11:42:11 [platforms/init.py:127] Checking if XPU platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:146] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 01-22 11:42:11 [platforms/init.py:153] Checking if CPU platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:58] Checking if CUDA platform is available.
DEBUG 01-22 11:42:11 [platforms/init.py:78] Confirmed CUDA platform is available.
INFO 01-22 11:42:11 [platforms/init.py:216] Automatically detected platform cuda.
[2026-01-22 11:42:11,653] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
wandb: Tracking run with wandb version 0.23.1
wandb: W&B syncing is set to offline in this directory. Run wandb online or set WANDB_MODE=online to enable cloud syncing.
wandb: Run data is saved locally in /root/autodl-tmp/Visionary-R1/wandb/offline-run-20260122_114213-8oobf1wi
wandb: Detected [huggingface_hub.inference, openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with pip install weave then add import weave to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
(EngineCore_DP0 pid=2180) INFO 01-22 11:42:13 [v1/engine/core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [v1/engine/core.py:652] Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=['ipc:///tmp/7acf285c-099a-4866-8ee1-6d7639956f98'], outputs=['ipc:///tmp/b04f3893-d700-4811-9a2b-08f43b66487d'], coordinator_input=None, coordinator_output=None, frontend_stats_publish_address=None), parallel_config={'data_parallel_master_ip': '127.0.0.1', 'data_parallel_master_port': 0, '_data_parallel_master_port_list': [], 'data_parallel_size': 1})
DEBUG 01-22 11:42:13 [v1/engine/utils.py:859] HELLO from local core engine process 0.
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [v1/engine/core.py:487] Has DP Coordinator: False, stats publish address: None
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [plugins/init.py:36] Available plugins for group vllm.general_plugins:
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [plugins/init.py:38] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [plugins/init.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
(EngineCore_DP0 pid=2180) INFO 01-22 11:42:13 [v1/engine/core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking', speculative_config=None, tokenizer='/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=3000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [compilation/decorators.py:155] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.deepseek_v2.DeepseekV2Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [compilation/decorators.py:155] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [compilation/decorators.py:155] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama_eagle3.LlamaModel'>: ['input_ids', 'positions', 'hidden_states']
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [utils/init.py:3188] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f0cb76f0110>
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [distributed/parallel_state.py:1029] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.6:54021 backend=nccl
DEBUG 01-22 11:42:23 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-22 11:42:33 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-22 11:42:43 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-22 11:42:53 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.

2 answers

  • 檀越@新空间 2026-01-22 19:15

    Good evening 🌙🌙🌙
    This answer is based on Tongyi Qianwen (Qwen).

    The problem you are hitting is that, in a multi-GPU distributed training environment, the vLLM V1 engine cannot finish single-GPU initialization and hangs at "Waiting for core engine". Below is a detailed analysis and a set of possible fixes.


    I. Root-cause analysis

    1. vLLM V1 engine concurrency model

    • vLLM V1 (0.11.0) supports multi-process execution by default, but a single engine instance does not use multiple GPUs concurrently in this configuration.
    • When the LLM is constructed in the main process, vLLM spawns a child process to run inference. If that child cannot be assigned the intended GPU (CUDA:2), the front-end keeps waiting and the program hangs.

    2. CUDA_VISIBLE_DEVICES set incorrectly or too late

    • The code sets os.environ["CUDA_VISIBLE_DEVICES"] = target_physical_gpu, but that value may not reach the engine subprocess as intended.
    • os.environ only affects the current process; a child process inherits the environment as it exists at the moment it is spawned. If CUDA_VISIBLE_DEVICES is changed after CUDA has already been initialized in the parent, or after the child has started, the GPU assignment can silently fail.
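A pure-stdlib sanity check of that inheritance timing (no CUDA needed; DEMO_VISIBLE_DEVICES is a made-up stand-in for CUDA_VISIBLE_DEVICES):

```python
import os
import subprocess
import sys

# A child spawned AFTER the parent sets a variable does inherit it.
# The pitfall is ordering: setting CUDA_VISIBLE_DEVICES after the parent
# has already created a CUDA context, or after the child has started,
# has no effect on device visibility.
os.environ["DEMO_VISIBLE_DEVICES"] = "2"
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('DEMO_VISIBLE_DEVICES'))"],
    capture_output=True, text=True,
)
print(child.stdout.strip())  # the child sees "2"
```

So the question is not whether the value propagates at all, but whether it is set early enough, before vLLM forks its engine-core process.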

    3. vLLM distributed configuration

    • You set VLLM_DISTRIBUTED_EXECUTOR_BACKEND = "uni", requesting the uniprocess executor backend, but vLLM V1 may not handle this setting the way you expect in this mixed trainer/engine setup.
    • In this configuration vLLM V1's distributed features are limited to threads and processes on a single device; it does not perform cross-GPU distributed execution here.

    4. PyTorch / vLLM version compatibility

    • PyTorch 2.8.0 and vLLM 0.11.0 may have compatibility issues, particularly around GPU resource management.

    II. Solutions

    ✅ 1. Make sure the vLLM process can access the intended GPU

    Modified code:

    import os
    from contextlib import contextmanager
    from unittest.mock import patch

    from vllm import LLM, SamplingParams

    @contextmanager
    def set_env_vars(**kwargs):
        # Temporarily override environment variables, restoring the
        # original environment on exit.
        original = os.environ.copy()
        try:
            os.environ.update(kwargs)
            yield
        finally:
            os.environ.clear()
            os.environ.update(original)

    if self.accelerator.is_main_process:
        target_physical_gpu = "2"  # dedicate physical GPU 2 to vLLM
        with set_env_vars(CUDA_VISIBLE_DEVICES=target_physical_gpu):
            # Hide the trainer's torch.distributed environment from vLLM
            dist_keys = ["RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT", "LOCAL_RANK"]
            for key in dist_keys:
                if key in os.environ:
                    del os.environ[key]

            # Make vLLM see a fresh, non-distributed process
            world_size_patch = patch(
                "torch.distributed.get_world_size", return_value=1)
            is_init_patch = patch("torch.distributed.is_initialized", return_value=False)
            rank_patch = patch("torch.distributed.get_rank", return_value=0)

            with world_size_patch, is_init_patch, rank_patch:
                model_path = "/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking"
                self.llm = LLM(
                    model=model_path,
                    dtype="bfloat16",
                    max_model_len=3000,
                    trust_remote_code=True,
                    gpu_memory_utilization=0.6,
                    enforce_eager=True,
                )
                self.sampling_params = SamplingParams(
                    temperature=args.temperature,
                    top_p=0.9,
                    top_k=50,
                    max_tokens=self.max_completion_length,
                )

    Key points:

    • The set_env_vars context manager makes sure CUDA_VISIBLE_DEVICES is in place before vLLM spawns its engine subprocess, so the child inherits the right value.
    • It also avoids permanently mutating os.environ: the original environment is restored once the block exits.
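A standalone check of that restore behavior (the same context manager is re-defined here so the snippet runs on its own; DEMO_VAR is an arbitrary example variable):

```python
import os
from contextlib import contextmanager

@contextmanager
def set_env_vars(**kwargs):
    # Snapshot the environment, apply the overrides, and restore the
    # snapshot on exit, even if the body raises.
    original = os.environ.copy()
    try:
        os.environ.update(kwargs)
        yield
    finally:
        os.environ.clear()
        os.environ.update(original)

os.environ.pop("DEMO_VAR", None)
with set_env_vars(DEMO_VAR="2"):
    inside = os.environ["DEMO_VAR"]   # "2" while the block is active
after = "DEMO_VAR" in os.environ      # False once the block exits
print(inside, after)
```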

    ✅ 2. Disable vLLM's distributed features

    Since vLLM V1 does not support cross-GPU distributed execution in this setup, disable its distributed configuration.

    Modified code:

    # Disable vLLM's distributed executor before constructing the LLM
    os.environ["VLLM_DISTRIBUTED_EXECUTOR_BACKEND"] = "none"


    Key point:

    • Setting VLLM_DISTRIBUTED_EXECUTOR_BACKEND to "none" keeps vLLM from launching extra worker processes, which can resolve the "Waiting for core engine" hang. (Verify the accepted values for this variable in your vLLM version; "none" may not be recognized by every release.)

    ✅ 3. Upgrade vLLM (recommended)

    Newer vLLM releases improve multi-GPU and distributed inference support and tend to be more compatible with GRPO-style training loops.

    Install a newer vLLM (first confirm the target version supports Qwen3-VL):

    pip install -U vllm


    Key point:

    • If you use Qwen3-VL, confirm that the newer release has adapted the model. Otherwise, stay on your current version and apply the fixes above.

    ✅ 4. Verify the model path and permissions

    Check the following:

    • The model path is correct (/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking).
    • You have permission to access that directory.
    • The model files are complete.

    ✅ 5. Keep the distributed logic simple: load vLLM only in the main process

    You already load vLLM in the main process only, but vLLM itself spawns subprocesses. Make sure no other rank constructs an engine:

    if self.accelerator.is_main_process:
        # load vLLM
        ...
    else:
        # other ranks do not load vLLM
        pass


    Key point:

    • Never load vLLM in several processes at once; that invites GPU resource contention and conflicts.
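To make the main-process-only pattern concrete, here is one possible wiring (a sketch, not the TRL implementation: `broadcast_object_list` is from `accelerate.utils`; `split_for_ranks`, `all_prompts`, and `total_batch_size` are hypothetical names). Rank 0 generates completions for the whole global batch, broadcasts them, and each rank keeps its own slice:

```python
def split_for_ranks(completions, world_size):
    """Split a flat list of completions into one contiguous chunk per rank.

    Assumes len(completions) is divisible by world_size, as is typical
    when every rank contributes the same number of prompts.
    """
    per_rank = len(completions) // world_size
    return [completions[i * per_rank:(i + 1) * per_rank]
            for i in range(world_size)]

# Pseudo-wiring inside the training step (needs a live accelerator/cluster):
# if self.accelerator.is_main_process:
#     outputs = self.llm.generate(all_prompts, self.sampling_params)
#     completions = [o.outputs[0].text for o in outputs]
# else:
#     completions = [None] * total_batch_size
# from accelerate.utils import broadcast_object_list
# completions = broadcast_object_list(completions, from_process=0)
# local = split_for_ranks(
#     completions, self.accelerator.num_processes
# )[self.accelerator.process_index]
```

The pure helper is runnable as-is; the commented section shows where it would sit in the loop once the accelerator is available.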

    III. Summary

    | Problem | Fix |
    |---------|-----|
    | vLLM V1 fails to initialize on a single GPU | Disable the distributed config; make sure the subprocess inherits CUDA_VISIBLE_DEVICES |
    | Hangs at "Waiting for core engine" | Upgrade to a newer vLLM release |
    | GPU conflicts in the multi-GPU setup | Give each process its own GPU; avoid sharing |
    | Model fails to load | Check model path, permissions, and file integrity |


    IV. Final recommendations

    Strongly recommended:

    • If you are using Qwen3-VL-8B-Thinking, first confirm that your vLLM build actually supports this model.
    • If it does not, consider Hugging Face Transformers + accelerate instead of vLLM for generation, for the sake of stability.

    If you have further questions about the model architecture or the GRPO implementation, feel free to ask!

