bug^君 2010-05-07 07:24 采纳率: 25%

已采纳

最快的固定长度6int 数组

Answering to another Stack Overflow question (this one) I stumbled upon an interesting sub-problem. What is the fastest way to sort an array of 6 ints?

As the question is very low level:

we can't assume libraries are available (and the call itself has its cost), only plain C
to avoid emptying instruction pipeline (that has a very high cost) we should probably minimize branches, jumps, and every other kind of control flow breaking (like those hidden behind sequence points in && or ||).
room is constrained and minimizing registers and memory use is an issue, ideally in place sort is probably best.

Really this question is a kind of Golf where the goal is not to minimize source length but execution time. I call it 'Zening' code as used in the title of the book Zen of Code optimization by Michael Abrash and its sequels.

As for why it is interesting, there is several layers:

the example is simple and easy to understand and measure, not much C skill involved
it shows effects of choice of a good algorithm for the problem, but also effects of the compiler and underlying hardware.

Here is my reference (naive, not optimized) implementation and my test set.

#include <stdio.h>

static __inline__ int sort6(int * d){

    char j, i, imin;
    int tmp;
    for (j = 0 ; j < 5 ; j++){
        imin = j;
        for (i = j + 1; i < 6 ; i++){
            if (d[i] < d[imin]){
                imin = i;
            }
        }
        tmp = d[j];
        d[j] = d[imin];
        d[imin] = tmp;
    }
}

static __inline__ unsigned long long rdtsc(void)
{
  unsigned long long int x;
     __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
     return x;
}

int main(int argc, char ** argv){
    int i;
    int d[6][5] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
        sort6(d[i]);
        /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
         *  d[i][0], d[i][6], d[i][7],
         *  d[i][8], d[i][9], d[i][10]);
        */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}

Raw results

As number of variants is becoming large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those showed above, thanks to Kevin Stock. You can compile and execute it in your own environment. I'm quite interested by behavior on different target architecture/compilers. (OK guys, put it in answers, I will +1 every contributor of a new resultset).

I gave the answer to Daniel Stutzbach (for golfing) one year ago as he was at the source of the fastest solution at that time (sorting networks).

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O2

Direct call to qsort library function : 689.38
Naive implementation (insertion sort) : 285.70
Insertion Sort (Daniel Stutzbach) : 142.12
Insertion Sort Unrolled : 125.47
Rank Order : 102.26
Rank Order with registers : 58.03
Sorting Networks (Daniel Stutzbach) : 111.68
Sorting Networks (Paul R) : 66.36
Sorting Networks 12 with Fast Swap : 58.86
Sorting Networks 12 reordered Swap : 53.74
Sorting Networks 12 reordered Simple Swap : 31.54
Reordered Sorting Network w/ fast swap : 31.54
Reordered Sorting Network w/ fast swap V2 : 33.63
Inlined Bubble Sort (Paolo Bonzini) : 48.85
Unrolled Insertion Sort (Paolo Bonzini) : 75.30

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O1

Direct call to qsort library function : 705.93
Naive implementation (insertion sort) : 135.60
Insertion Sort (Daniel Stutzbach) : 142.11
Insertion Sort Unrolled : 126.75
Rank Order : 46.42
Rank Order with registers : 43.58
Sorting Networks (Daniel Stutzbach) : 115.57
Sorting Networks (Paul R) : 64.44
Sorting Networks 12 with Fast Swap : 61.98
Sorting Networks 12 reordered Swap : 54.67
Sorting Networks 12 reordered Simple Swap : 31.54
Reordered Sorting Network w/ fast swap : 31.24
Reordered Sorting Network w/ fast swap V2 : 33.07
Inlined Bubble Sort (Paolo Bonzini) : 45.79
Unrolled Insertion Sort (Paolo Bonzini) : 80.15

I included both -O1 and -O2 results because surprisingly for several programs O2 is less efficient than O1. I wonder what specific optimization has this effect ?

Comments on proposed solutions

Insertion Sort (Daniel Stutzbach)

As expected minimizing branches is indeed a good idea.

Sorting Networks (Daniel Stutzbach)

Better than insertion sort. I wondered if the main effect was not get from avoiding the external loop. I gave it a try by unrolled insertion sort to check and indeed we get roughly the same figures (code is here).

Sorting Networks (Paul R)

The best so far. The actual code I used to test is here. Don't know yet why it is nearly two times as fast as the other sorting network implementation. Parameter passing ? Fast max ?

Sorting Networks 12 SWAP with Fast Swap

As suggested by Daniel Stutzbach, I combined his 12 swap sorting network with branchless fast swap (code is here). It is indeed faster, the best so far with a small margin (roughly 5%) as could be expected using 1 less swap.

It is also interesting to notice that the branchless swap seems to be much (4 times) less efficient than the simple one using if on PPC architecture.

Calling Library qsort

To give another reference point I also tried as suggested to just call library qsort (code is here). As expected it is much slower : 10 to 30 times slower... as it became obvious with the new test suite, the main problem seems to be the initial load of the library after the first call, and it compares not so poorly with other version. It is just between 3 and 20 times slower on my Linux. On some architecture used for tests by others it seems even to be faster (I'm really surprised by that one, as library qsort use a more complex API).

Rank order

Rex Kerr proposed another completely different method : for each item of the array compute directly its final position. This is efficient because computing rank order do not need branch. The drawback of this method is that it takes three times the amount of memory of the array (one copy of array and variables to store rank orders). The performance results are very surprising (and interesting). On my reference architecture with 32 bits OS and Intel Core2 Quad E8300, cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64 bits box (Intel Core2 Duo) it performed much better : it became the fastest so far. I finally found out the true reason. My 32bits box use gcc 4.4.1 and my 64bits box gcc 4.4.3 and the last one seems much better at optimising this particular code (there was very little difference for other proposals).

update:

As published figures above shows this effect was still enhanced by later versions of gcc and Rank Order became consistently twice as fast as any other alternative.

Sorting Networks 12 with reordered Swap

The amazing efficiency of the Rex Kerr proposal with gcc 4.4.3 made me wonder : how could a program with 3 times as much memory usage be faster than branchless sorting networks? My hypothesis was that it had less dependencies of the kind read after write, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder swaps to minimize read after write dependencies. More simply put: when you do SWAP(1, 2); SWAP(0, 2); you have to wait for the first swap to be finished before performing the second one because both access to a common memory cell. When you do SWAP(1, 2); SWAP(4, 5);the processor can execute both in parallel. I tried it and it works as expected, the sorting networks is running about 10% faster.

Sorting Networks 12 with Simple Swap

One year after the original post Steinar H. Gunderson suggested, that we should not try to outsmart the compiler and keep the swap code simple. It's indeed a good idea as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly code that can still spare some more cycles. The most surprising (it says volumes on programmer's psychology) is that one year ago none of used tried that version of swap. Code I used to test is here. Others suggested other ways to write a C fast swap, but it yields the same performances as the simple one with a decent compiler.

The "best" code is now as follow:

static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) (x<y?x:y)
#define max(x, y) (x<y?y:x) 
#define SWAP(x,y) { const int a = min(d[x], d[y]); \
                    const int b = max(d[x], d[y]); \
                    d[x] = a; d[y] = b; }
    SWAP(1, 2);
    SWAP(4, 5);
    SWAP(0, 2);
    SWAP(3, 5);
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(1, 4);
    SWAP(0, 3);
    SWAP(2, 5);
    SWAP(1, 3);
    SWAP(2, 4);
    SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}

If we believe our test set (and, yes it is quite poor, it's mere benefit is being short, simple and easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That put each swap at an average of 4 cycles. I call that amazingly fast. Any other improvements possible ?

转载于:https://stackoverflow.com/questions/2786899/fastest-sort-of-fixed-length-6-int-array

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

20条回答默认最新

关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
笑故挽风 2010-05-07 15:02
关注
For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.

Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted dat, so it will be the better choice if there's an above-average chance of almost-sorted data.

The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.

Here's an insertion sort implementation:

static __inline__ int sort6(int *d){ int i, j; for (i = 1; i < 6; i++) { int tmp = d[i]; for (j = i; j >= 1 && tmp < d[j-1]; j--) d[j] = d[j-1]; d[j] = tmp; } }

Here's how I'd build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:

static __inline__ int sort6(int * d){ #define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; } SWAP(1, 2); SWAP(0, 2); SWAP(0, 1); SWAP(4, 5); SWAP(3, 5); SWAP(3, 4); SWAP(0, 3); SWAP(1, 4); SWAP(2, 5); SWAP(2, 4); SWAP(1, 3); SWAP(2, 3); #undef SWAP }
本回答被题主选为最佳回答 , 对您是否有帮助呢?

解决无用
评论打赏
分享
举报

评论

按下Enter换行，Ctrl+Enter发表内容

查看更多回答(19条)

报告相同问题？

关注问题

最快的固定长度6int 数组
2010-05-07 07:24

回答 20 已采纳 For any optimization, it's always best to test, test, test. I would try at least sorting networks
固定长度6 int 数组的最快排序
2010-05-07 07:24

回答 20 已采纳 For any optimization, it's always best to test, test, test. I would try at least sorting networks
走一堆二维平面的点，如何实现最小路径？ c# 算法
2023-01-09 10:51

回答 2 已采纳这种必然要递归否则回溯会非常麻烦尤其有环路的时候容易造成死循环
java byte转成int数组_Java任意长度byte数组转换为int数组的方法
2021-02-25 18:54

高大卷的博客最近工程上遇到一个byte数组转换为int的问题，解决过程中遇到了几个坑，经过各种查资料终于还是解决了。撒花。Java的位运算以及byte数组与其他类型数据的转换比c/c++感觉麻烦一些。这里简单说明一下byte数组和int的...
(大一)Java报错NoSuchElementException java 开发语言
2022-07-15 17:18

回答 2 已采纳这个应该是Scanner读取输入，如果你关闭输入它就会报这个错误，重新分配是不起作用的这是nextInt方法抛出的异常，和你是不是重新分配没关系
力扣错误提示解惑，76题最小覆盖子串 c++ 深度优先算法
2023-01-13 00:03

回答 2 已采纳这篇文章讲的很详细，请看：力扣 76题最小覆盖子串（双指针 + 滑动窗口）
用单链表实现学生管理类 c语言有问必答
2022-12-27 09:40

回答 4 已采纳 int select; while(select) select 没有初始化，是一个随机值，如果产生的随机值是0，那么就不会进入循环了。这里应该要用死循环的，修改如下: while(select)
java数组遍历最快方式_java数组遍历的方法
2021-03-16 22:18

所以暂时将你眼睛的博客对于数组来说，基本上可以看做是一些数字。我们在使用字符串的时候，有过遍历的操作，那么对应的数组也能够进行遍历。这里为大家整理了三种遍历的方法、for循环、foreach、toString()，第二种可以看作是第一种的增强...
C语言读写文件如何解决 c语言
2023-01-17 22:53

回答 2 已采纳 1、这个长短的阴影可能是空格字符或者制表符，在读取每组第一行时，去除末尾的空格或制表符即可；2、后面这个小数点应该是printf("%c", buf1[strlen(buf1) - 23])这行代码，
问一个关于排序算法效率的问题。数据结构
2008-12-26 14:35

回答 18 已采纳稍微总结一下: JVM一定是有优化才造成冒泡的逆序反而快了这一点毫无疑问这个优化不是系统的而是JVM的因为在.net上结果是合理的交换的时候的位数对交换没有影响都是32位int类
mysql 存储int 类型数组吗_mysql中的数据类型
2021-02-07 17:04

袋熊宝宝的博客整数int 范围4个字节，使用int(m)例如：int(4) 存储的数是10时，在左边使用两个0凑足4位存储的数是100时，在左边使用1个0即可存储10000时，实际存储的位数超过指定的位数则不能存储，会报错bigint 范围...
C++输入未知长度的数组
2019-06-23 18:33

BuXianShan的博客静态数组int array[10],它的长度必须是个常数才可以定义。如果知道数组长度n，可以先cin>>n;然后用动态数组int* array = new int[n]。但是如果不知道数组长度，该怎么定义呢？经过查找资料，终于找到了一种...
Go基础——数组和切片（一个固定长度一个可变长度）
2021-09-27 20:26

小生听雨园的博客 6.数组 1.定义数组 var 数组变量名 [元素数量]T var aa [2]int 2.初始化数组 2.1使用初始化列表 var aa [2]int var aaa = [2]int{} var testArray = [3]int{1, 2, 3} var strArray = [3]string{"akamai", ...
创建一个长度为6的int型数组，要求数组元素的值都在1-30之间，且是随机赋值。同时，要求元素的值各不相同。
2022-07-06 12:05

阿白|的博客创建一个长度为6的int型数组，要求数组元素的值在1-30之间，且随机赋值。同时要求元素值各不相同。
没有解决我的问题, 去提问

悬赏问题

¥20 sub地址DHCP问题
¥15 delta降尺度计算的一些细节，有偿
¥15 Arduino红外遥控代码有问题
¥15 数值计算离散正交多项式
¥30 数值计算均差系数编程
¥15 redis-full-check比较两个集群的数据出错
¥15 Matlab编程问题
¥15 训练的多模态特征融合模型准确度很低怎么办
¥15 kylin启动报错log4j类冲突
¥15 超声波模块测距控制点灯，灯的闪烁很不稳定，经过调试发现测的距离偏大

最快的固定长度6int 数组

Raw results

Comments on proposed solutions

20条回答 默认 最新

悬赏问题

20条回答默认最新