Using CUDA shared memory

Should shared memory be declared inside the kernel function or outside it?

I have a kernel, shown below, that does something similar to an image histogram, with one thread per pixel.

Without shared memory it looks like this:

atomicAdd(&gpu_EO_0_stats_pix_count[catagory_row_id*LCU_total + LCU_id_ab], 1);
atomicAdd(&gpu_EO_0_stats_E[catagory_row_id*LCU_total + LCU_id_ab], e);

Now I want to use shared memory: do the atomic writes to shared memory first, then merge the final result into global memory:

__shared__ stats_count shared_count[CATA_NUM];
__shared__ stats_E shared_E[CATA_NUM];

__syncthreads();

atomicAdd(&shared_count[catagory_row_id], 1);
atomicAdd(&shared_E[catagory_row_id], e);

__syncthreads();

if (threadIdx.x == 0) // first thread of each thread block
{
    atomicAdd(&gpu_EO_0_stats_pix_count[catagory_row_id*LCU_total + LCU_id_ab], shared_count[catagory_row_id]);
    atomicAdd(&gpu_EO_0_stats_E[catagory_row_id*LCU_total + LCU_id_ab], shared_E[catagory_row_id]);
}

But this fails with error code 77.

What is causing the error? And how exactly should shared memory be declared and used?
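
For comparison, here is a minimal sketch of how a per-block shared-memory histogram is commonly written. It is not the original kernel: `block_stats`, `run_block_stats`, the `category` and `value` arrays, the `int` element type, and `CATA_NUM = 5` are all assumptions made for illustration, and the per-LCU offset (`LCU_total`, `LCU_id_ab`) is left out. The sketch differs from the snippet above in three ways that may be relevant. The shared arrays are zeroed before use, since shared memory is not initialized automatically. Every `__syncthreads()` sits outside any branch, so all threads in the block reach it. And the flush loops over every category instead of only the one category that thread 0 happened to compute. If error 77 here is `cudaErrorIllegalAddress`, as it is in older CUDA runtimes (an assumption about the toolkit version), an out-of-range `catagory_row_id` would be one plausible cause, so the sketch also bounds-checks the index.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for CATA_NUM in the post.
#define CATA_NUM 5

// One thread per pixel. category[i] plays the role of catagory_row_id and
// value[i] the role of e; both arrays are illustrative assumptions.
__global__ void block_stats(const int *category, const int *value, int n,
                            int *g_count, int *g_E)
{
    // Declared inside the kernel: each thread block gets its own copy.
    __shared__ int s_count[CATA_NUM];
    __shared__ int s_E[CATA_NUM];

    // Shared memory starts uninitialized, so the block zeroes it together.
    for (int c = threadIdx.x; c < CATA_NUM; c += blockDim.x) {
        s_count[c] = 0;
        s_E[c] = 0;
    }
    __syncthreads();  // reached by every thread in the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int c = category[i];
        if (c >= 0 && c < CATA_NUM) {  // skip out-of-range categories
            atomicAdd(&s_count[c], 1);
            atomicAdd(&s_E[c], value[i]);
        }
    }
    __syncthreads();  // outside the branch, so no thread skips it

    // Flush every category of this block into global memory.
    for (int c = threadIdx.x; c < CATA_NUM; c += blockDim.x) {
        atomicAdd(&g_count[c], s_count[c]);
        atomicAdd(&g_E[c], s_E[c]);
    }
}

// Host helper (hypothetical): returns CATA_NUM counts followed by CATA_NUM sums.
// Assumes value.size() == category.size(); error checking is omitted for brevity.
std::vector<int> run_block_stats(const std::vector<int> &category,
                                 const std::vector<int> &value)
{
    std::vector<int> result(2 * CATA_NUM, 0);
    int n = (int)category.size();
    if (n == 0) return result;  // nothing to launch

    int *d_cat, *d_val, *d_out;
    cudaMalloc(&d_cat, n * sizeof(int));
    cudaMalloc(&d_val, n * sizeof(int));
    cudaMalloc(&d_out, 2 * CATA_NUM * sizeof(int));
    cudaMemcpy(d_cat, category.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_val, value.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, 2 * CATA_NUM * sizeof(int));  // global totals start at 0

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    block_stats<<<blocks, threads>>>(d_cat, d_val, n, d_out, d_out + CATA_NUM);

    // cudaMemcpy waits for the kernel on the default stream before copying.
    cudaMemcpy(result.data(), d_out, 2 * CATA_NUM * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_cat);
    cudaFree(d_val);
    cudaFree(d_out);
    return result;
}
```

Calling `run_block_stats` with a short list of categories and values returns the per-category counts in the first `CATA_NUM` entries and the per-category sums in the rest. In real code each CUDA call's return value would be checked, which is also the easiest way to see which call first reports the failure.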