Goal of the function: find the index of the minimum of a one-dimensional array of size 100, using a reduction.
Problem: the GPU always takes longer than the CPU (GPU: MX150, CPU: i5 8th gen).
import math
import numpy as np
from numba import cuda, float32, int32
from time import perf_counter

BLOCK_SIZE = 128                  # threads per block; assumed power of two (value not shown in the original code)
MAX = np.finfo(np.float32).max    # padding sentinel for out-of-range lanes; assumed definition

@cuda.jit
def arggetmin(Fitness, IN_index, OutResult, OutIndex, n):
    tid = cuda.threadIdx.x
    idx = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x

    # Per-block shared buffers holding candidate values and their indices.
    tmp = cuda.shared.array(shape=BLOCK_SIZE, dtype=float32)
    index = cuda.shared.array(shape=BLOCK_SIZE, dtype=int32)

    # Load one element per thread; out-of-range lanes get the sentinel
    # (the original else-branch read IN_index[idx] out of bounds, and the
    # early "if tid > n: return" could skip the syncthreads below).
    if idx < n:
        tmp[tid] = Fitness[idx]
        index[tid] = IN_index[idx]
    else:
        tmp[tid] = MAX
        index[tid] = -1
    cuda.syncthreads()

    # Tree reduction: halve the stride each step, keeping the smaller value.
    stride = cuda.blockDim.x // 2
    while stride > 0:
        if tid < stride and tmp[tid] > tmp[tid + stride]:
            tmp[tid] = tmp[tid + stride]
            index[tid] = index[tid + stride]
        cuda.syncthreads()   # executed by every thread, not only tid < stride
        stride //= 2

    # Thread 0 writes this block's partial minimum and its index.
    if tid == 0:
        OutResult[cuda.blockIdx.x] = tmp[0]
        OutIndex[cuda.blockIdx.x] = index[0]
def main():
    n = 100
    a = getRamdomlist(n)        # random fitness values (helper from the original post, not shown)
    index = getIndex(n)         # index array 0..n-1 (helper from the original post, not shown)
    a_device = cuda.to_device(a)
    index_device = cuda.to_device(index)

    threads_per_block = BLOCK_SIZE
    block_per_grid = math.ceil(n / threads_per_block)

    # One partial (value, index) result per block; dtypes match the shared arrays.
    gpu_result = cuda.device_array(shape=block_per_grid, dtype=np.float32)
    gpu_index = cuda.device_array(shape=block_per_grid, dtype=np.int32)

    time1 = perf_counter()
    arggetmin[block_per_grid, threads_per_block](a_device, index_device, gpu_result, gpu_index, n)
    cuda.synchronize()
    time2 = perf_counter()
    print("argmin GPU time :", time2 - time1)

    start = perf_counter()
    innn = np.argmin(a)
    end = perf_counter()
    print("argmin CPU time :", end - start)

if __name__ == "__main__":
    main()
The GPU takes about 0.5 s, the CPU about 0.0005 s.
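Note that time1/time2 bracket the very first launch of arggetmin, so the 0.5 s also includes Numba's JIT compilation of the kernel. A minimal sketch of timing the same call after a warm-up launch, assuming the same kernel, arrays and launch configuration as in main():

# Warm-up launch: the first call of a @cuda.jit kernel also compiles it,
# so it is excluded from the measured time.
arggetmin[block_per_grid, threads_per_block](a_device, index_device, gpu_result, gpu_index, n)
cuda.synchronize()

t0 = perf_counter()
arggetmin[block_per_grid, threads_per_block](a_device, index_device, gpu_result, gpu_index, n)
cuda.synchronize()
t1 = perf_counter()
print("argmin GPU time (after warm-up):", t1 - t0)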
I also tried "loop unrolling", but the result got even worse. (Likewise, when running NVIDIA's official matrix-multiplication sample, the version that uses shared memory runs slower than the version that does not.)
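For reference, the last steps of a reduction are often "unrolled" with warp shuffles instead of shared memory; a minimal sketch of an argmin over a single 32-element warp, assuming Numba's cuda.shfl_down_sync is available:

from numba import cuda

@cuda.jit
def warp_argmin(values, indices, out_val, out_idx):
    # Sketch: reduce 32 (value, index) pairs inside one warp with shuffles,
    # so no shared memory and no syncthreads() are needed.
    lane = cuda.threadIdx.x
    v = values[lane]
    i = indices[lane]
    offset = 16
    while offset > 0:
        other_v = cuda.shfl_down_sync(0xFFFFFFFF, v, offset)
        other_i = cuda.shfl_down_sync(0xFFFFFFFF, i, offset)
        if other_v < v:          # keep the smaller value and its index
            v = other_v
            i = other_i
        offset //= 2
    if lane == 0:                # lane 0 ends up holding the warp minimum
        out_val[0] = v
        out_idx[0] = i

This would be launched as warp_argmin[1, 32](vals_dev, idx_dev, out_v, out_i), where the arguments are placeholder device arrays (32 values, 32 indices, and two length-1 outputs).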
Why does this happen? I have never quite understood it.