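Please help, this is driving me crazy... I've been teaching myself CUDA streams recently and wrote a small program (Windows 10, CUDA 10.2, GTX 1660 Ti, compute capability 7.5). I expected the kernel execution and the data copies to overlap so that the copy time is hidden behind the kernels, but Nsight Systems shows that they do not overlap. The code and the profiling results are below.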
// src.Tens is the address of a host-side array; S_K_0, S_C_0 and S_C_1 are all streams; every pointer prefixed with dev_ points to device memory
// copy the third part (next)
checkCudaErrors(cudaMemcpyAsync(dev_src_0, src.Tens + (src.plane << 1), src.plane * __SPACE__, cudaMemcpyHostToDevice, S_K_0));
// copy back the first part
checkCudaErrors(cudaMemcpyAsync(dst.Tens, dev_dst_0, dst.plane * __SPACE__, cudaMemcpyDeviceToHost, S_C_0));
// start the second part
Gaussian_blur_hor<<<grid, threads, 0, S_C_1>>>(dev_src_1,
                                              kernel,
                                              dev_mid_1,
                                              ker_len,
                                              radius,
                                              dstDim,
                                              dim3(Wsrc, Hsrc),
                                              radius);
Gaussian_blur_ver<<<grid, threads, 0, S_C_1>>>(dev_mid_1,
                                              kernel,
                                              dev_dst_1,
                                              ker_len,
                                              radius,
                                              dstDim,
                                              dim3(Wsrc, Hsrc));
checkCudaErrors(cudaStreamSynchronize(S_C_1));
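The kernels take the longest to run. By the time I synchronize on S_C_1, the stream the kernels run in, both copies should already have finished. Instead, the copies always wait until the kernels are done before they start. Why is that? lol. I then changed the code and removed the cudaStreamSynchronize() call:
// copy the third part (next)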
checkCudaErrors(cudaMemcpyAsync(dev_src_0, src.Tens + (src.plane << 1), src.plane * __SPACE__, cudaMemcpyHostToDevice, S_K_0));
// copy back the first part
checkCudaErrors(cudaMemcpyAsync(dst.Tens, dev_dst_0, dst.plane * __SPACE__, cudaMemcpyDeviceToHost, S_C_0));
// start the second part
Gaussian_blur_hor<<<grid, threads, 0, S_C_1>>>(dev_src_1,
                                              kernel,
                                              dev_mid_1,
                                              ker_len,
                                              radius,
                                              dstDim,
                                              dim3(Wsrc, Hsrc),
                                              radius);
Gaussian_blur_ver<<<grid, threads, 0, S_C_1>>>(dev_mid_1,
                                              kernel,
                                              dev_dst_1,
                                              ker_len,
                                              radius,
                                              dstDim,
                                              dim3(Wsrc, Hsrc));
// removed the stream synchronization
//checkCudaErrors(cudaStreamSynchronize(S_C_1));
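After that, the copies and the kernels overlap perfectly. (The extra H-to-D copy at the start of the timeline comes from an earlier part of my code that I did not post.)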
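When comparing the two timelines, two preconditions are worth ruling out: cudaMemcpyAsync can generally only overlap with a kernel when the host buffer is page-locked (pinned) and the device has a copy engine available. The sketch below is not taken from the program above. `is_pinned_host` and `copy_engine_count` are hypothetical helper names, and it assumes CUDA 10.0 or newer (for the `type` field of `cudaPointerAttributes`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true only if `ptr` is page-locked host memory
// that the CUDA runtime knows about (cudaMallocHost / cudaHostAlloc /
// cudaHostRegister). For a plain malloc/new pointer the runtime either
// reports an error or a non-host memory type, depending on the CUDA version;
// both cases are treated as "not pinned" here.
static bool is_pinned_host(const void* ptr) {
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        cudaGetLastError();  // reset the runtime's last-error state
        return false;
    }
    return attr.type == cudaMemoryTypeHost;
}

// Hypothetical helper: number of asynchronous copy engines on `device`,
// or -1 if the properties cannot be queried. A value of 0 means copies
// cannot run concurrently with kernels at all.
static int copy_engine_count(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        cudaGetLastError();
        return -1;
    }
    return prop.asyncEngineCount;
}
```

Calling `is_pinned_host(src.Tens)` and `is_pinned_host(dst.Tens)` before the copies would show whether those buffers are pinned or pageable. With pageable memory the runtime stages the transfer, and it may not behave asynchronously.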
Please help lol, I've been digging into this for a long time and still can't figure out why.
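For anyone who wants to reproduce this, a stripped-down sketch of the same pattern might look like the following. Everything in it is assumed rather than taken from the real program: `busy_kernel` is a hypothetical stand-in for the two blur kernels, the buffer and stream names are made up, and the host buffers are pinned with cudaMallocHost. The optional cudaStreamQuery calls are there because, on Windows with the WDDM driver model, queued work is reportedly batched before submission and a stream query is a commonly suggested way to nudge the driver into submitting it. Whether that explains the behaviour above is not confirmed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 from the enclosing function on any CUDA error.
// (A sketch: resources are not released on the error path.)
#define CHECK(call)                                                      \
    do {                                                                 \
        cudaError_t e_ = (call);                                         \
        if (e_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s failed: %s\n", #call,               \
                         cudaGetErrorString(e_));                        \
            return 1;                                                    \
        }                                                                \
    } while (0)

// Hypothetical stand-in for the blur kernels: just keeps the GPU busy.
__global__ void busy_kernel(float* data, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = data[i];
    for (int k = 0; k < iters; ++k) v = v * 1.0001f + 0.5f;
    data[i] = v;
}

// Enqueues an H2D copy, a D2H copy and a kernel on three separate streams,
// then synchronizes on the kernel stream. Returns 0 on success.
int run_overlap_demo(bool flush_with_query) {
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);
    float *h_next = nullptr, *h_done = nullptr;
    float *d_next = nullptr, *d_done = nullptr, *d_work = nullptr;
    cudaStream_t s_h2d, s_d2h, s_kernel;

    // Pinned host buffers, needed for truly asynchronous copies.
    CHECK(cudaMallocHost(&h_next, bytes));
    CHECK(cudaMallocHost(&h_done, bytes));
    CHECK(cudaMalloc(&d_next, bytes));
    CHECK(cudaMalloc(&d_done, bytes));
    CHECK(cudaMalloc(&d_work, bytes));
    CHECK(cudaMemset(d_done, 0, bytes));
    CHECK(cudaMemset(d_work, 0, bytes));
    for (int i = 0; i < n; ++i) h_next[i] = 1.0f;
    CHECK(cudaStreamCreate(&s_h2d));
    CHECK(cudaStreamCreate(&s_d2h));
    CHECK(cudaStreamCreate(&s_kernel));

    // Same order as in the post: next input up, previous result down, kernel.
    CHECK(cudaMemcpyAsync(d_next, h_next, bytes, cudaMemcpyHostToDevice, s_h2d));
    CHECK(cudaMemcpyAsync(h_done, d_done, bytes, cudaMemcpyDeviceToHost, s_d2h));
    if (flush_with_query) {
        // cudaErrorNotReady is an expected result here, so it is ignored.
        cudaStreamQuery(s_h2d);
        cudaStreamQuery(s_d2h);
    }
    busy_kernel<<<(n + 255) / 256, 256, 0, s_kernel>>>(d_work, n, 2000);
    CHECK(cudaGetLastError());

    CHECK(cudaStreamSynchronize(s_kernel));
    CHECK(cudaDeviceSynchronize());  // wait for both copies before cleanup

    CHECK(cudaStreamDestroy(s_h2d));
    CHECK(cudaStreamDestroy(s_d2h));
    CHECK(cudaStreamDestroy(s_kernel));
    CHECK(cudaFree(d_next));
    CHECK(cudaFree(d_done));
    CHECK(cudaFree(d_work));
    CHECK(cudaFreeHost(h_next));
    CHECK(cudaFreeHost(h_done));
    return 0;
}
```

Profiling `run_overlap_demo(false)` and `run_overlap_demo(true)` in Nsight Systems and comparing the two timelines should show whether the early stream query changes when the copies are actually issued.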