Background
I am trying to implement the overlap between the forward pass of one model chunk and the backward pass of the other in dualpipe. As a first step, I put together a simple implementation using torch.cuda.stream:
import torch
from typing import Callable, List, Optional


def overlapped_forward_backward(
    module0: torch.nn.Module,
    inputs0: List[torch.Tensor],
    labels0: Optional[List[torch.Tensor]],
    loss_masks0: Optional[List[torch.Tensor]],
    loss1: Optional[torch.Tensor],
    outputs1: Optional[List[torch.Tensor]],
    output_grads1: Optional[List[torch.Tensor]],
    forward_step_func: Callable,
    is_last_stage0: bool,
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    """
    You should implement a custom forward-backward overlap strategy.
    The code below is just an example.
    """
    device = inputs0[0].device

    # Lazily create one dedicated backward stream per device and cache it
    # as an attribute on the function itself.
    if not hasattr(overlapped_forward_backward, 'backward_streams'):
        overlapped_forward_backward.backward_streams = {}
    if device not in overlapped_forward_backward.backward_streams:
        overlapped_forward_backward.backward_streams[device] = torch.cuda.Stream(device=device)
    backward_stream = overlapped_forward_backward.backward_streams[device]

    # Issue the backward pass of chunk 1 on the side stream, hoping it will
    # overlap with the forward pass of chunk 0 issued below.
    with torch.cuda.stream(backward_stream):
        if loss1 is not None:
            loss1.backward()
            loss1.detach_()
        else:
            # run_backward() is a helper defined elsewhere in the codebase that
            # backpropagates output_grads1 through outputs1.
            run_backward(outputs1, output_grads1)

    # Forward pass of chunk 0 on the current (default) stream. Megatron
    # pipeline stages receive the activation from the previous stage via
    # set_input_tensor.
    if len(inputs0) == 1:
        from megatron.core.utils import get_attr_wrapped_model
        set_input_tensor = get_attr_wrapped_model(module0, "set_input_tensor")
        set_input_tensor(inputs0)
    if is_last_stage0:
        inputs0_with_labels_loss_masks = list(inputs0)
        inputs0_with_labels_loss_masks.append(labels0)
        inputs0_with_labels_loss_masks.append(loss_masks0)
        outputs0, loss_func = forward_step_func(inputs0_with_labels_loss_masks, module0)
    else:
        outputs0, loss_func = forward_step_func(inputs0, module0)
    outputs0 = [outputs0] if isinstance(outputs0, torch.Tensor) else outputs0
    if is_last_stage0:
        loss0 = loss_func(outputs0[0])[0]
    else:
        loss0 = None

    # Make the default stream wait for the backward stream before returning.
    torch.cuda.current_stream().wait_stream(backward_stream)
    return outputs0, loss0
I found that the forward and backward passes do not actually overlap: the backward portion of the code takes just as long to run as it does without the extra stream.
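One way to check this beyond wall-clock timing is to look at the kernel timeline. Below is a minimal, self-contained sketch using torch.profiler, with plain matmuls standing in for the real forward/backward work; if the two streams overlap, the exported trace will show their kernels running side by side:

import torch
from torch.profiler import profile, ProfilerActivity

side_stream = torch.cuda.Stream()
a = torch.randn(4096, 4096, device='cuda')
b = torch.randn(4096, 4096, device='cuda')
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.cuda.stream(side_stream):
        for _ in range(10):
            a @ b              # stand-in for the backward work
    for _ in range(10):
        a @ b                  # stand-in for the forward work
    torch.cuda.synchronize()

# Open the trace in chrome://tracing or https://ui.perfetto.dev and check
# whether the two groups of kernels occupy the same time span.
prof.export_chrome_trace("overlap_trace.json")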
A small experiment
So I ran the following small experiment:
import torch
import time

# GPU warmup
a = torch.randn(10000, 10000, device='cuda')
b = torch.randn(10000, 10000, device='cuda')
c = torch.mm(a, b)
torch.cuda.synchronize()  # make sure the warmup work has actually finished

# Time the work on the default stream. The clock is read without a
# synchronize, so this is host-side time only.
calc_start = time.time()
a = torch.randn(10000, 10000, device='cuda')
b = torch.randn(10000, 10000, device='cuda')
for i in range(100):
    c = torch.mm(a, b)
calc_end = time.time()
print(f"calc time: {calc_end - calc_start}")

# Time the same work issued on a separate stream, again reading the clock
# before the final synchronize.
calc_stream = torch.cuda.Stream()
torch.cuda.synchronize()
stream_start = time.time()
with torch.cuda.stream(calc_stream):
    a = torch.randn(10000, 10000, device='cuda')
    b = torch.randn(10000, 10000, device='cuda')
    for i in range(100):
        c = torch.mm(a, b)
stream_end = time.time()
print(f"stream time: {stream_end - stream_start}")
torch.cuda.synchronize()
Comparing the runs with and without the stream, the measured times are about the same, and the stream version is actually slightly longer. If I increase the loop count, the time grows correspondingly in both cases.
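For what it's worth, host wall-clock time around asynchronous launches and actual GPU execution time are different quantities; CUDA events measure the latter. A minimal sketch of timing the same workload both ways:

import torch
import time

a = torch.randn(10000, 10000, device='cuda')
b = torch.randn(10000, 10000, device='cuda')
torch.cuda.synchronize()

start_evt = torch.cuda.Event(enable_timing=True)
end_evt = torch.cuda.Event(enable_timing=True)

t0 = time.time()
start_evt.record()
for _ in range(100):
    c = torch.mm(a, b)
end_evt.record()
t1 = time.time()              # host-side time spent issuing the launches
torch.cuda.synchronize()      # wait for the GPU to actually finish
print(f"host launch time: {t1 - t0:.3f} s")
print(f"GPU execution time: {start_evt.elapsed_time(end_evt) / 1000:.3f} s")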
Question
As I understand it, "with torch.cuda.stream()" should be asynchronous and non-blocking: the code block should only enqueue work, so its own execution time should be close to zero, which is what would allow it to run in parallel with subsequent computation or communication. But the experiment seems to show that it blocks, and execution only continues after all the computation has finished. Is there something wrong with my code, or is there another possible explanation? Any pointers would be appreciated!
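To state the expectation in code: with only a single kernel, exiting the "with" block should take near-zero time, and the real cost should only show up at an explicit synchronize. A minimal sketch of what I would expect:

import torch
import time

s = torch.cuda.Stream()
a = torch.randn(10000, 10000, device='cuda')
b = torch.randn(10000, 10000, device='cuda')
torch.cuda.synchronize()

t0 = time.time()
with torch.cuda.stream(s):
    c = torch.mm(a, b)        # a single asynchronous kernel launch
t1 = time.time()
torch.cuda.synchronize()
t2 = time.time()
print(f"exiting the with block took: {t1 - t0:.6f} s")  # expected: near zero
print(f"until synchronize:           {t2 - t0:.6f} s")  # includes the matmul itself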