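Main problem
I am using multiple threads to process the convolution of a single image concurrently, in order to speed it up through parallelism. The target is a NanoPC T2 Android development board with an ARMv7 CPU.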
Functions used
- pthread_create
- pthread_join
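Approach
Create NUM threads with thread IDs 0 through NUM-1. Each thread starts its convolution at a different position in the image (thread 0 at item 0, thread 1 at item 1, thread 2 at item 2, and so on), so the work for one image is dealt out item by item across the NUM threads and the results are combined at the end.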
Main code
for (int i = 0; i < THREAD_NUM; i++) {
  param->thread_tid = i;  // 0, 1, 2, 3
  CHECK(!pthread_create(&thread_[i], &attr, im2col_cpu_pthread, (void*)param)) << "Pthread execution failed.";
  // Blocks until thread_sem is posted (the posting side is not shown here).
  sem_wait(&thread_sem);
}
for (int i = 0; i < THREAD_NUM; i++)
  CHECK(!pthread_join(thread_[i], NULL)) << "Pthread joining failed.";
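Here im2col_cpu_pthread is the thread entry function; it does the main work of the parallel convolution.
Also, each run of the program takes more than one input image, so the code above is called repeatedly in a loop, which means the threads are created and joined many times.
The convolution results are correct. I tested 28*28 single-channel images, 32*32 three-channel images, and 256*256 three-channel images. The latter two sped up by roughly 20% and 30% respectively, but the 28*28 single-channel case actually got slower. My guess is that thread creation is too expensive on the ARM CPU? I used gettimeofday to roughly time thread creation while processing the 28*28 images; part of the output is below: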
The thread 0's creation costs 75 time.
The thread 1's creation costs 937 time.
The thread 2's creation costs 63 time.
The thread 3's creation costs 62 time.
The thread 0's creation costs 75 time.
The thread 1's creation costs 933 time.
The thread 2's creation costs 62 time.
The thread 3's creation costs 62 time.
The thread 0's creation costs 80 time.
The thread 1's creation costs 896 time.
The thread 2's creation costs 64 time.
The thread 3's creation costs 63 time.
The thread 0's creation costs 75 time.
The thread 1's creation costs 937 time.
The thread 2's creation costs 65 time.
The thread 3's creation costs 63 time.
The thread 0's creation costs 75 time.
The thread 1's creation costs 933 time.
The thread 2's creation costs 64 time.
The thread 3's creation costs 64 time.
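For 28*28, the convolution in each thread takes only about 200 (same unit), roughly 900 in total for the four threads (NUM = 4). So for every image, the creation of one thread alone takes about as long as the entire processing. The thread creation times for 32*32*3 and 256*256*3 are the same as for 28*28*1, but they make up a much smaller share of the total.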
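Question
Why does one thread take so much longer to create than the others for each image? Is it a problem in my code? How can I fix it? Any pointers are appreciated. That is all I can think of to describe for now; I will add more if needed. Update: I noticed that two entries were missing from the thread creation timings.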