Problem background: I am trying to implement the GRPO algorithm to train Qwen3-VL-8B-Thinking, using a "distributed training + dedicated inference GPU" setup. I rented three RTX PRO 6000 96G cards on AutoDL: the first two are used for training, and the third (cuda:2) is dedicated to running vLLM for sample generation. Currently the main process is responsible for loading vLLM on that last card.
Environment:
pytorch: 2.8.0, cuda: 12.8
vllm version: 0.11.0 (supports Qwen3-VL; V1 engine)
Model:
Qwen3-VL-8B-Thinking
Framework:
trl + deepspeed + vllm
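For context (my own understanding, not from the vLLM docs): CUDA_VISIBLE_DEVICES re-indexes the remaining GPUs, so once the main process restricts itself to physical card 2, vLLM addresses that card as cuda:0 internally. A tiny hypothetical helper (`logical_to_physical` is mine, just for illustration) showing the logical-to-physical mapping:

```python
def logical_to_physical(visible: str) -> dict[int, int]:
    """Map logical device indices (what torch/vLLM see after the env var
    takes effect) to physical GPU ids listed in CUDA_VISIBLE_DEVICES."""
    physical = [int(x) for x in visible.split(",") if x.strip()]
    return dict(enumerate(physical))

# With CUDA_VISIBLE_DEVICES="2", the one visible card becomes logical cuda:0.
print(logical_to_physical("2"))      # {0: 2}
print(logical_to_physical("0,1"))    # {0: 0, 1: 1} -- the two training cards
```

This is why the vLLM process never refers to "cuda:2" itself; the restriction happens entirely through the environment variable.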
The relevant code is as follows:
# Requires at module level:
#   import os
#   from unittest.mock import patch
#   from vllm import LLM, SamplingParams
if self.accelerator.is_main_process:
    target_physical_gpu = "2"  # give vLLM physical card 2 (cuda:2)
    os.environ["CUDA_VISIBLE_DEVICES"] = target_physical_gpu
    os.environ["VLLM_DISTRIBUTED_EXECUTOR_BACKEND"] = "uni"
    # Hide the trainer's torch.distributed context from vLLM so it
    # initializes as a standalone single-process engine.
    dist_keys = ["RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT", "LOCAL_RANK"]
    for key in dist_keys:
        if key in os.environ:
            del os.environ[key]
    world_size_patch = patch(
        "torch.distributed.get_world_size", return_value=1)
    is_init_patch = patch("torch.distributed.is_initialized", return_value=False)
    rank_patch = patch("torch.distributed.get_rank", return_value=0)
    model_path = "/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking"
    with world_size_patch, is_init_patch, rank_patch:
        self.llm = LLM(
            model=model_path,
            dtype="bfloat16",
            max_model_len=3000,
            trust_remote_code=True,
            gpu_memory_utilization=0.6,
            enforce_eager=True,
        )
    self.sampling_params = SamplingParams(
        temperature=args.temperature,
        top_p=0.9,
        top_k=50,
        max_tokens=self.max_completion_length)
    self._last_loaded_step = 0  # tag to avoid useless loading during grad accumulation
self.accelerator.wait_for_everyone()
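One detail worth noting about the code above (my reading of the logs, not something the vLLM docs guarantee in this form): the V1 engine runs EngineCore in a separate process (visible below as `EngineCore_DP0 pid=2180`), and a child process inherits the parent's environment as it exists at launch time. That is why the `CUDA_VISIBLE_DEVICES` mutation must happen before the `LLM(...)` call. A minimal stdlib-only sketch of that inheritance, with no vLLM involved:

```python
import os
import subprocess
import sys

# Set the variable BEFORE launching the child process, as in the trainer code.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

# The child inherits the parent's environment at launch time.
child = subprocess.run(
    [sys.executable, "-c",
     "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES', '<unset>'))"],
    capture_output=True, text=True,
)
visible = child.stdout.strip()
print(visible)  # -> 2
```

Had the variable been set after the engine process started, the child would still see the old value (or `<unset>`), which is a common way this kind of GPU-pinning setup silently fails.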
The relevant log output is as follows:
DEBUG 01-22 11:42:05 [plugins/__init__.py:36] Available plugins for group vllm.general_plugins:
DEBUG 01-22 11:42:05 [plugins/__init__.py:38] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
DEBUG 01-22 11:42:05 [plugins/__init__.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
INFO 01-22 11:42:05 [entrypoints/utils.py:233] non-default args: {'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 3000, 'gpu_memory_utilization': 0.6, 'disable_log_stats': True, 'enforce_eager': True, 'model': '/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking'}
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
DEBUG 01-22 11:42:05 [model_executor/models/registry.py:498] Loaded model info for class vllm.model_executor.models.qwen3_vl.Qwen3VLForConditionalGeneration from cache
DEBUG 01-22 11:42:05 [logging_utils/log_time.py:27] Registry inspect model class: Elapsed time 0.0005291 secs
INFO 01-22 11:42:05 [config/model.py:547] Resolved architecture: Qwen3VLForConditionalGeneration
torch_dtype is deprecated! Use dtype instead!
INFO 01-22 11:42:05 [config/model.py:1510] Using max model len 3000
DEBUG 01-22 11:42:05 [engine/arg_utils.py:1672] Setting max_num_batched_tokens to 16384 for LLM_CLASS usage context.
DEBUG 01-22 11:42:05 [engine/arg_utils.py:1681] Setting max_num_seqs to 1024 for LLM_CLASS usage context.
INFO 01-22 11:42:06 [config/scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 01-22 11:42:06 [config/__init__.py:381] Cudagraph is disabled under eager mode
DEBUG 01-22 11:42:06 [v1/engine/llm_engine.py:173] Enabling multiprocessing for LLMEngine.
⚙️ Running in WANDB offline mode
DEBUG 01-22 11:42:11 [plugins/__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 01-22 11:42:11 [platforms/__init__.py:34] Checking if TPU platform is available.
DEBUG 01-22 11:42:11 [platforms/__init__.py:52] TPU platform is not available because: No module named 'libtpu'
DEBUG 01-22 11:42:11 [platforms/__init__.py:58] Checking if CUDA platform is available.
DEBUG 01-22 11:42:11 [platforms/__init__.py:78] Confirmed CUDA platform is available.
DEBUG 01-22 11:42:11 [platforms/__init__.py:106] Checking if ROCm platform is available.
DEBUG 01-22 11:42:11 [platforms/__init__.py:120] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 01-22 11:42:11 [platforms/__init__.py:127] Checking if XPU platform is available.
DEBUG 01-22 11:42:11 [platforms/__init__.py:146] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 01-22 11:42:11 [platforms/__init__.py:153] Checking if CPU platform is available.
DEBUG 01-22 11:42:11 [platforms/__init__.py:58] Checking if CUDA platform is available.
DEBUG 01-22 11:42:11 [platforms/__init__.py:78] Confirmed CUDA platform is available.
INFO 01-22 11:42:11 [platforms/__init__.py:216] Automatically detected platform cuda.
[2026-01-22 11:42:11,653] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
wandb: Tracking run with wandb version 0.23.1
wandb: W&B syncing is set to offline in this directory. Run wandb online or set WANDB_MODE=online to enable cloud syncing.
wandb: Run data is saved locally in /root/autodl-tmp/Visionary-R1/wandb/offline-run-20260122_114213-8oobf1wi
wandb: Detected [huggingface_hub.inference, openai] in use.
wandb: Use W&B Weave for improved LLM call tracing. Install Weave with pip install weave then add import weave to the top of your script.
wandb: For more information, check out the docs at: https://weave-docs.wandb.ai/
(EngineCore_DP0 pid=2180) INFO 01-22 11:42:13 [v1/engine/core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [v1/engine/core.py:652] Received init message: EngineHandshakeMetadata(addresses=EngineZmqAddresses(inputs=['ipc:///tmp/7acf285c-099a-4866-8ee1-6d7639956f98'], outputs=['ipc:///tmp/b04f3893-d700-4811-9a2b-08f43b66487d'], coordinator_input=None, coordinator_output=None, frontend_stats_publish_address=None), parallel_config={'data_parallel_master_ip': '127.0.0.1', 'data_parallel_master_port': 0, '_data_parallel_master_port_list': [], 'data_parallel_size': 1})
DEBUG 01-22 11:42:13 [v1/engine/utils.py:859] HELLO from local core engine process 0.
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [v1/engine/core.py:487] Has DP Coordinator: False, stats publish address: None
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [plugins/__init__.py:36] Available plugins for group vllm.general_plugins:
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [plugins/__init__.py:38] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:13 [plugins/__init__.py:41] All plugins in this group will be loaded. Set VLLM_PLUGINS to control which plugins to load.
(EngineCore_DP0 pid=2180) INFO 01-22 11:42:13 [v1/engine/core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking', speculative_config=None, tokenizer='/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=3000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/root/autodl-tmp/Visionary-R1/llm_models/Qwen3-VL-8B-Thinking, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [compilation/decorators.py:155] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.deepseek_v2.DeepseekV2Model'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [compilation/decorators.py:155] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [compilation/decorators.py:155] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama_eagle3.LlamaModel'>: ['input_ids', 'positions', 'hidden_states']
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [utils/__init__.py:3188] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f0cb76f0110>
(EngineCore_DP0 pid=2180) DEBUG 01-22 11:42:14 [distributed/parallel_state.py:1029] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.6:54021 backend=nccl
DEBUG 01-22 11:42:23 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-22 11:42:33 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-22 11:42:43 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.
DEBUG 01-22 11:42:53 [v1/engine/utils.py:776] Waiting for 1 local, 0 remote core engine proc(s) to start.