While answering another Stack Overflow question (this one), I stumbled upon an interesting sub-problem: what is the fastest way to sort an array of 6 ints?

As the question is very low level:

• we can't assume libraries are available (and a call itself has its cost); only plain C
• to avoid emptying the instruction pipeline (which has a very high cost) we should probably minimize branches, jumps, and every other kind of control-flow break (like those hidden behind sequence points in `&&` or `||`)
• room is constrained, and minimizing registers and memory use is an issue; ideally an in-place sort is probably best

Really this question is a kind of golf where the goal is not to minimize source length but execution time. I call it 'Zening' code, after the title of the book Zen of Code Optimization by Michael Abrash and its sequels.

As for why it is interesting, there are several layers:

• the example is simple and easy to understand and measure, with not much C skill involved
• it shows the effects of choosing a good algorithm for the problem, but also the effects of the compiler and underlying hardware

Here is my reference (naive, not optimized) implementation and my test set.

``````
#include <stdio.h>

static __inline__ void sort6(int * d){
    char j, i, imin;
    int tmp;
    for (j = 0 ; j < 5 ; j++){
        imin = j;
        for (i = j + 1; i < 6 ; i++){
            if (d[i] < d[imin]){
                imin = i;
            }
        }
        tmp = d[j];
        d[j] = d[imin];
        d[imin] = tmp;
    }
}

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}

int main(int argc, char ** argv){
    int i;
    int d[6][6] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
        sort6(d[i]);
        /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
         *  d[i][0], d[i][1], d[i][2],
         *  d[i][3], d[i][4], d[i][5]);
         */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %u\n", (unsigned)cycles);
    return 0;
}
``````

## Raw results

As the number of variants was becoming large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those shown above, thanks to Kevin Stock. You can compile and execute it in your own environment. I'm quite interested in the behavior on different target architectures/compilers. (OK guys, put it in answers, I will +1 every contributor of a new result set.)

I gave the answer to Daniel Stutzbach (for golfing) one year ago, as he was at the source of the fastest solution at that time (sorting networks).

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O2

• Direct call to qsort library function : 689.38
• Naive implementation (insertion sort) : 285.70
• Insertion Sort (Daniel Stutzbach) : 142.12
• Insertion Sort Unrolled : 125.47
• Rank Order : 102.26
• Rank Order with registers : 58.03
• Sorting Networks (Daniel Stutzbach) : 111.68
• Sorting Networks (Paul R) : 66.36
• Sorting Networks 12 with Fast Swap : 58.86
• Sorting Networks 12 reordered Swap : 53.74
• Sorting Networks 12 reordered Simple Swap : 31.54
• Reordered Sorting Network w/ fast swap : 31.54
• Reordered Sorting Network w/ fast swap V2 : 33.63
• Inlined Bubble Sort (Paolo Bonzini) : 48.85
• Unrolled Insertion Sort (Paolo Bonzini) : 75.30

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O1

• Direct call to qsort library function : 705.93
• Naive implementation (insertion sort) : 135.60
• Insertion Sort (Daniel Stutzbach) : 142.11
• Insertion Sort Unrolled : 126.75
• Rank Order : 46.42
• Rank Order with registers : 43.58
• Sorting Networks (Daniel Stutzbach) : 115.57
• Sorting Networks (Paul R) : 64.44
• Sorting Networks 12 with Fast Swap : 61.98
• Sorting Networks 12 reordered Swap : 54.67
• Sorting Networks 12 reordered Simple Swap : 31.54
• Reordered Sorting Network w/ fast swap : 31.24
• Reordered Sorting Network w/ fast swap V2 : 33.07
• Inlined Bubble Sort (Paolo Bonzini) : 45.79
• Unrolled Insertion Sort (Paolo Bonzini) : 80.15

I included both -O1 and -O2 results because, surprisingly, for several programs -O2 is less efficient than -O1. I wonder which specific optimization has this effect?

### Insertion Sort (Daniel Stutzbach)

As expected, minimizing branches is indeed a good idea.

### Sorting Networks (Daniel Stutzbach)

Better than insertion sort. I wondered whether the main effect came from avoiding the outer loop. I checked by writing an unrolled insertion sort, and indeed we get roughly the same figures (code is here).

### Sorting Networks (Paul R)

The best so far. The actual code I used to test is here. I don't know yet why it is nearly twice as fast as the other sorting network implementation. Parameter passing? Fast max?

### Sorting Networks 12 with Fast Swap

As suggested by Daniel Stutzbach, I combined his 12-swap sorting network with a branchless fast swap (code is here). It is indeed faster, the best so far by a small margin (roughly 5%), as could be expected from using one less swap.

It is also interesting to notice that the branchless swap seems to be much less efficient (about 4 times) than the simple one using `if` on the PPC architecture.

### Calling library qsort

To give another reference point I also tried, as suggested, to just call the library qsort (code is here). As expected it is much slower: 10 to 30 times slower... As became obvious with the new test suite, the main problem seems to be the initial load of the library at the first call; after that it compares not so poorly with the other versions, between 3 and 20 times slower on my Linux. On some architectures used for tests by others it even seems to be faster (I'm really surprised by that, as library qsort uses a more complex API).

### Rank Order

Rex Kerr proposed a completely different method: for each item of the array, compute its final position directly. This is efficient because computing the rank order needs no branches. The drawback of this method is that it takes three times the amount of memory of the array (one copy of the array plus variables to store the rank orders). The performance results are very surprising (and interesting). On my reference architecture, with a 32-bit OS and an Intel Core 2 Quad Q8300, the cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64-bit box (Intel Core 2 Duo) it performed much better: it became the fastest so far. I finally found out the true reason: my 32-bit box uses gcc 4.4.1 and my 64-bit box gcc 4.4.3, and the latter seems much better at optimizing this particular code (there was very little difference for the other proposals).

Update:

As the figures published above show, this effect was further enhanced by later versions of gcc, and Rank Order became consistently twice as fast as any other alternative.

### Sorting Networks 12 with reordered Swap

The amazing efficiency of Rex Kerr's proposal with gcc 4.4.3 made me wonder: how could a program with three times the memory usage be faster than branchless sorting networks? My hypothesis was that it had fewer read-after-write dependencies, allowing better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder the swaps to minimize read-after-write dependencies. More simply put: when you do `SWAP(1, 2); SWAP(0, 2);` you have to wait for the first swap to finish before performing the second one, because both access a common memory cell. When you do `SWAP(1, 2); SWAP(4, 5);` the processor can execute both in parallel. I tried it and it works as expected: the sorting network runs about 10% faster.

### Sorting Networks 12 with Simple Swap

One year after the original post, Steinar H. Gunderson suggested that we should not try to outsmart the compiler and should keep the swap code simple. It's indeed a good idea, as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly that can still spare some more cycles. The most surprising part (it speaks volumes about programmer psychology) is that one year ago none of us tried that version of the swap. The code I used to test is here. Others suggested other ways to write a fast C swap, but it yields the same performance as the simple one with a decent compiler.

The "best" code is now as follow:

``````
static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) ((x) < (y) ? (x) : (y))
#define max(x, y) ((x) < (y) ? (y) : (x))
#define SWAP(x, y) { const int a = min(d[x], d[y]); \
                     const int b = max(d[x], d[y]); \
                     d[x] = a; d[y] = b; }
    SWAP(1, 2);
    SWAP(4, 5);
    SWAP(0, 2);
    SWAP(3, 5);
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(1, 4);
    SWAP(0, 3);
    SWAP(2, 5);
    SWAP(1, 3);
    SWAP(2, 4);
    SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}
``````

If we believe our test set (and, yes, it is quite poor; its mere benefit is being short, simple, and making it easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That puts each swap at an average of 4 cycles. I call that amazingly fast. Any other improvements possible?


Memor.の The links to copypastecode.com are broken; now it seems to be a dodgy advertising site.

℡Wang Yan "Zening", I like it. Not sure where to go with this word, but I will try to use it.

bug^君 I think what you said is true for the case of maximising the speed, which is completely understandable. What I don't understand is why you would minimise it. Why would you want to write a slow program?

csdnceshi62 Pentium 2 and newer have fast conditional moves, see Steinar Gunderson's assembly code.

bug^君 What is the point in minimising the execution speed?

YaoRaoLov I think you're saying "the CPU has to have different behavior based on whether x<y or not", which is true, but the CPU makes decisions all the time without altering control flow -- for example, it has to decide during addition whether a certain output bit is 1 or 0 based on the input bits. The word "branch" specifically means a jump to another set of instructions, so that the control flow is changed. The difference is important since less branching allows a CPU to anticipate future instructions and begin work on them before the last instruction completes, which is called pipelining.

local-host it's just a cmp instruction which sets a status bit.

YaoRaoLov there's no branch from computing x<y, it's just a binary operation just like + or *. If it were in an if statement, e.g., then it could cause branching, but it's not, which is the beauty of that swap implementation - it gives us conditional behavior with zero jumps.

YaoRaoLov Maybe the compiler is smart enough to do this for you, but a small speedup in the min/max SWAP appears to be available via reusing the xor'd value from min/max, like this: int s = (x^y) & -(x<y); int t = y^s; y = x^s; x = t; (Replace {x,y} with d[{x,y}]; I wanted to keep the code snippet readable.)

MAO-EYE Some advice - 6 loops is not enough for all platforms; also, you should use a power-of-two test count so the mod can be done with a bitwise AND, to repeat multiple times without polluting the results with potentially expensive mod operations. Here are some results for a platform I don't think I can name owing to NDA, but which uses the PPC architecture:

• Direct call to qsort library function : 101
• Naive implementation (insertion sort) : 299
• Insertion Sort (Daniel Stutzbach) : 108
• Insertion Sort Unrolled : 51
• Sorting Networks (Daniel Stutzbach) : 26
• Sorting Networks (Paul R) : 85
• Sorting Networks 12 with Fast Swap : 117
• Sorting Networks 12 reordered Swap : 116
• Rank Order : 56

csdnceshi62 Note that the correct implementation of rdtsc on 64-bit is __asm__ volatile (".byte 0x0f, 0x31; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx"); because rdtsc puts the answer in EDX:EAX while GCC expects it in a single 64-bit register. You can see the bug by compiling at -O3. Also see below my comment to Paul R about a faster SWAP.

hurriedly% Great analysis, and good catch on reordering the swaps. From the page I linked to that generates the SWAP macros, it looks like you can get an optimal order with the "View the network using text characters" option instead of the "Create a set of SWAP macros" option. I'm emailing the author to see if can get the SWAP macros to output in the optimum order.

python小菜 As a non-C guy, I'm quite fascinated by this post. I'm just wondering what -O1, -O2, etc. are. Are they compiler-optimization levels?

hurriedly% You should try combining my 12-swap sorting network with Paul's branchless swap function. His solution passes all of the parameters as separate elements on the stack instead of a single pointer to an array. That might also make a difference.

elliott.david Do you have some constraints on the ints ? For example, can we assume that for any 2 x,y x-y and x+y won't cause underflow or overflow ?

Didn"t forge ints on a GPU... can you use floats instead? Then you have min/max functions available. (At least GLSL does not support min/max for ints.) Also, it is probably faster to use two vec3 or similar types instead of an array, so you can use swizzling to sort.

?yb? It's not really golf unless you can objectively score the answers, so you need to specify a particular architecture (and whether it's going to be scored on average-case or worst-case).

20 Answers

For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.

Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted data, so it will be the better choice if there's an above-average chance of almost-sorted data.

The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.

Here's an insertion sort implementation:

``````
static __inline__ void sort6(int *d){
    int i, j;
    for (i = 1; i < 6; i++) {
        int tmp = d[i];
        for (j = i; j >= 1 && tmp < d[j-1]; j--)
            d[j] = d[j-1];
        d[j] = tmp;
    }
}
``````

Here's how I'd build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:

``````
static __inline__ void sort6(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
    SWAP(1, 2);
    SWAP(0, 2);
    SWAP(0, 1);
    SWAP(4, 5);
    SWAP(3, 5);
    SWAP(3, 4);
    SWAP(0, 3);
    SWAP(1, 4);
    SWAP(2, 5);
    SWAP(2, 4);
    SWAP(1, 3);
    SWAP(2, 3);
#undef SWAP
}
``````
Lotus＠ I found that replacing the if statement in the SWAP macro with a branch-free two-element sort (using min / max) doubles the speed of this code.

hurriedly% why is that?

from.. Would moving the int tmp out of the macro scope into function scope be of benefit?

ℙℕℤℝ I believe the cost you have to pay just to call the library function (instead of static inline) is so high it defeats the library optimizations. But you are right, I should provide figures for plain library call to give a reference point.

The test code is pretty bad; it overflows the initial array (don't people here read compiler warnings?), the printf is printing out the wrong elements, it uses .byte for rdtsc for no good reason, there's only one run (!), there's nothing checking that the end results are actually correct (so it's very easy to “optimize” into something subtly wrong), the included tests are very rudimentary (no negative numbers?) and there's nothing to stop the compiler from just discarding the entire function as dead code.

That being said, it's also pretty easy to improve on the bitonic network solution; simply change the min/max/SWAP stuff to

``````
#define SWAP(x,y) { int tmp; asm("mov %0, %2 ; cmp %1, %0 ; cmovg %1, %0 ; cmovg %2, %1" : "=r" (d[x]), "=r" (d[y]), "=r" (tmp) : "0" (d[x]), "1" (d[y]) : "cc"); }
``````

and it comes out about 65% faster for me (Debian gcc 4.4.5 with -O2, amd64, Core i7).

local-host â¦and finally, if your numbers are floats, and you don't have to worry about NaN etc., GCC can convert this to minss/maxss SSE instructions, which is yet ~25% faster. Morale: Drop the clever bitfiddling tricks and let the compiler do its job. :-)

local-host You don't even need assembler, actually; if you just drop all the clever tricks, GCC will recognize the sequence and insert the conditional moves for you: #define min(a, b) ((a < b) ? a : b) #define max(a, b) ((a < b) ? b : a) #define SWAP(x,y) { int a = min(d[x], d[y]); int b = max(d[x], d[y]); d[x] = a; d[y] = b; } It comes out maybe a few percent slower than the inline asm variant, but that's hard to say given the lack of proper benchmarking.

Didn"t forge Thanks for noticing the array overflow, I corrected it. Other people may not have noticed it because the clicked on the link to copy/paste code, where there is no overflow.

Didn"t forge OK, test code is poor. Feel free to improve it. And yes, you can use assembly code. Why not going all the way and fully code it using x86 assembler ? It may be a bit less portable but why bother ?

Since these are integers and compares are fast, why not compute the rank order of each directly:

``````
#include <string.h>  /* for memcpy */

inline void sort6(int *d) {
    int e[6];
    memcpy(e, d, 6*sizeof(int));
    int o0 = (d[0]>d[1])+(d[0]>d[2])+(d[0]>d[3])+(d[0]>d[4])+(d[0]>d[5]);
    int o1 = (d[1]>=d[0])+(d[1]>d[2])+(d[1]>d[3])+(d[1]>d[4])+(d[1]>d[5]);
    int o2 = (d[2]>=d[0])+(d[2]>=d[1])+(d[2]>d[3])+(d[2]>d[4])+(d[2]>d[5]);
    int o3 = (d[3]>=d[0])+(d[3]>=d[1])+(d[3]>=d[2])+(d[3]>d[4])+(d[3]>d[5]);
    int o4 = (d[4]>=d[0])+(d[4]>=d[1])+(d[4]>=d[2])+(d[4]>=d[3])+(d[4]>d[5]);
    int o5 = 15-(o0+o1+o2+o3+o4);
    d[o0]=e[0]; d[o1]=e[1]; d[o2]=e[2]; d[o3]=e[3]; d[o4]=e[4]; d[o5]=e[5];
}
``````

elliott.david What does the magic number 15 mean?

10.24 Aha. That is not completely surprising--there are a lot of variables floating around, and they have to be carefully ordered and cached in registers and so on.

YaoRaoLov I updated my answer. The true reason was the version of the compiler (gcc 4.4.1 vs gcc 4.4.3), not the target architecture. I haven't identified exactly which optimization is responsible. Your solution seems to push gcc hard. For example, gcc 4.4.3 yields much better results with -O1 than with -O2. I guess I will have to look at the assembly code if I really want to understand why.

10.24 I think a factor of 2 difference in cycles is quite large, especially since I was testing on a 2-core machine of the same vintage as the Q8300!

YaoRaoLov 3GHz with native Linux 64bits OS) and on it your program is the fastest (~370 cycles vs ~390). I should edit my question to provide results for both architectures (with your answer).

YaoRaoLov sorry, I missed the > vs >= pattern at first sight. It works in every case.

YaoRaoLov I also wonder whether your method really works on every dataset. I wonder whether there are cases where several values are mapped to the same place when the sorted data contain repeats.

YaoRaoLov stepping 0a (though with testing method frequency should be irrelevant).

10.24 It's faster than the sorting network for me with -O2. Is there some reason why -O2 isn't okay, or is it slower for you on -O2 also? Maybe it's a difference in machine architecture?

YaoRaoLov with gcc -O1 it's below 1000 cycles, quite fast, but slower than the sorting network. Any idea how to improve the code? Maybe if we could avoid the array copy...

Looking forward to trying my hand at this and learning from these examples, but first some timings from my 1.5 GHz PPC Powerbook G4 w/ 1 GB DDR RAM. (I borrowed a similar rdtsc-like timer for PPC from http://www.mcs.anl.gov/~kazutomo/rdtsc.html for the timings.) I ran the program a few times and the absolute results varied but the consistently fastest test was "Insertion Sort (Daniel Stutzbach)", with "Insertion Sort Unrolled" a close second.

Here's the last set of times:

``````
**Direct call to qsort library function** : 164
**Naive implementation (insertion sort)** : 138
**Insertion Sort (Daniel Stutzbach)**     : 85
**Insertion Sort Unrolled**               : 97
**Sorting Networks (Daniel Stutzbach)**   : 457
**Sorting Networks (Paul R)**             : 179
**Sorting Networks 12 with Fast Swap**    : 238
**Sorting Networks 12 reordered Swap**    : 236
**Rank Order**                            : 116
``````

Looks like I got to the party a year late, but here we go...

Looking at the assembly generated by gcc 4.5.2, I observed that loads and stores are being done for every swap, which really isn't needed. It would be better to load the 6 values into registers, sort those, and store them back into memory. I ordered the loads and stores to be as close as possible to where the registers are first needed and last used. I also used Steinar H. Gunderson's SWAP macro. Update: I switched to Paolo Bonzini's SWAP macro, which gcc converts into something similar to Gunderson's, but gcc is able to order the instructions better since they aren't given as explicit assembly.

I used the same swap order as the reordered swap network given as the best performing, although there may be a better ordering. If I find some more time I'll generate and test a bunch of permutations.

I changed the testing code to consider over 4000 arrays and show the average number of cycles needed to sort each one. On an i5-650 I'm getting ~34.1 cycles/sort (using -O3), compared to the original reordered sorting network getting ~65.3 cycles/sort (using -O1, beats -O2 and -O3).

``````
#include <stdio.h>

static inline void sort6_fast(int * d) {
#define SWAP(x,y) { int dx = x, dy = y, tmp; tmp = x = dx < dy ? dx : dy; y ^= dx ^ tmp; }
    register int x0,x1,x2,x3,x4,x5;
    x1 = d[1];
    x2 = d[2];
    SWAP(x1, x2);
    x4 = d[4];
    x5 = d[5];
    SWAP(x4, x5);
    x0 = d[0];
    SWAP(x0, x2);
    x3 = d[3];
    SWAP(x3, x5);
    SWAP(x0, x1);
    SWAP(x3, x4);
    SWAP(x1, x4);
    SWAP(x0, x3);
    d[0] = x0;
    SWAP(x2, x5);
    d[5] = x5;
    SWAP(x1, x3);
    d[1] = x1;
    SWAP(x2, x4);
    d[4] = x4;
    SWAP(x2, x3);
    d[2] = x2;
    d[3] = x3;
#undef SWAP
}

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile ("rdtsc; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx");
    return x;
}

void ran_fill(int n, int *a) {
    static int seed = 76521;
    while (n--) *a++ = (seed = seed * 1812433253 + 12345);
}

#define NTESTS 4096
int main() {
    int i;
    int d[6*NTESTS];
    ran_fill(6*NTESTS, d);

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6*NTESTS ; i += 6) {
        sort6_fast(d+i);
    }
    cycles = rdtsc() - cycles;
    printf("Time is %.2lf\n", (double)cycles/(double)NTESTS);

    for (i = 0; i < 6*NTESTS ; i += 6) {
        if (d[i+0] > d[i+1] || d[i+1] > d[i+2] || d[i+2] > d[i+3] || d[i+3] > d[i+4] || d[i+4] > d[i+5])
            printf("d%d : %d %d %d %d %d %d\n", i,
                   d[i+0], d[i+1], d[i+2],
                   d[i+3], d[i+4], d[i+5]);
    }
    return 0;
}
``````

I modified the test suite to also report clocks per sort and to run more tests (the cmp function was updated to handle integer overflow as well). Here are the results on some different architectures. I attempted testing on an AMD CPU, but rdtsc isn't reliable on the X6 1100T I have available.

``````
Clarkdale (i5-650)
==================
Direct call to qsort library function      635.14   575.65   581.61   577.76   521.12
Naive implementation (insertion sort)      538.30   135.36   134.89   240.62   101.23
Insertion Sort (Daniel Stutzbach)          424.48   159.85   160.76   152.01   151.92
Insertion Sort Unrolled                    339.16   125.16   125.81   129.93   123.16
Rank Order                                 184.34   106.58   54.74    93.24    94.09
Rank Order with registers                  127.45   104.65   53.79    98.05    97.95
Sorting Networks (Daniel Stutzbach)        269.77   130.56   128.15   126.70   127.30
Sorting Networks (Paul R)                  551.64   103.20   64.57    73.68    73.51
Sorting Networks 12 with Fast Swap         321.74   61.61    63.90    67.92    67.76
Sorting Networks 12 reordered Swap         318.75   60.69    65.90    70.25    70.06
Reordered Sorting Network w/ fast swap     145.91   34.17    32.66    32.22    32.18

========================
Direct call to qsort library function      870.01   736.39   723.39   725.48   721.85
Naive implementation (insertion sort)      503.67   174.09   182.13   284.41   191.10
Insertion Sort (Daniel Stutzbach)          345.32   152.84   157.67   151.23   150.96
Insertion Sort Unrolled                    316.20   133.03   129.86   118.96   105.06
Rank Order                                 164.37   138.32   46.29    99.87    99.81
Rank Order with registers                  115.44   116.02   44.04    116.04   116.03
Sorting Networks (Daniel Stutzbach)        230.35   114.31   119.15   110.51   111.45
Sorting Networks (Paul R)                  498.94   77.24    63.98    62.17    65.67
Sorting Networks 12 with Fast Swap         315.98   59.41    58.36    60.29    55.15
Sorting Networks 12 reordered Swap         307.67   55.78    51.48    51.67    50.74
Reordered Sorting Network w/ fast swap     149.68   31.46    30.91    31.54    31.58

Sandy Bridge (i7-2600k)
=======================
Direct call to qsort library function      559.97   451.88   464.84   491.35   458.11
Naive implementation (insertion sort)      341.15   160.26   160.45   154.40   106.54
Insertion Sort (Daniel Stutzbach)          284.17   136.74   132.69   123.85   121.77
Insertion Sort Unrolled                    239.40   110.49   114.81   110.79   117.30
Rank Order                                 114.24   76.42    45.31    36.96    36.73
Rank Order with registers                  105.09   32.31    48.54    32.51    33.29
Sorting Networks (Daniel Stutzbach)        210.56   115.68   116.69   107.05   124.08
Sorting Networks (Paul R)                  364.03   66.02    61.64    45.70    44.19
Sorting Networks 12 with Fast Swap         246.97   41.36    59.03    41.66    38.98
Sorting Networks 12 reordered Swap         235.39   38.84    47.36    38.61    37.29
Reordered Sorting Network w/ fast swap     115.58   27.23    27.75    27.25    26.54

Nehalem (Xeon E5640)
====================
Direct call to qsort library function      911.62   890.88   681.80   876.03   872.89
Naive implementation (insertion sort)      457.69   236.87   127.68   388.74   175.28
Insertion Sort (Daniel Stutzbach)          317.89   279.74   147.78   247.97   245.09
Insertion Sort Unrolled                    259.63   220.60   116.55   221.66   212.93
Rank Order                                 140.62   197.04   52.10    163.66   153.63
Rank Order with registers                  84.83    96.78    50.93    109.96   54.73
Sorting Networks (Daniel Stutzbach)        214.59   220.94   118.68   120.60   116.09
Sorting Networks (Paul R)                  459.17   163.76   56.40    61.83    58.69
Sorting Networks 12 with Fast Swap         284.58   95.01    50.66    53.19    55.47
Sorting Networks 12 reordered Swap         281.20   96.72    44.15    56.38    54.57
Reordered Sorting Network w/ fast swap     128.34   50.87    26.87    27.91    28.02
``````
from.. the "Rank Order" solution will tend to be mostly in regs anyway since it defers the stores until all the main calculation is done. Could be improved by using reads into 6 locals instead of the memcpy to 'e' though. I wouldn't be surprised if gcc can do that itself (it can treat a small local array as separate int vars; here it would also need to transform the memcpy into 6 separate ops e[0]=d[0]; e[1] = d[1] .. in order to make this optimization. But it would be better to make that explicit in the code).

YaoRaoLov It seems to me that the ordering of the loads is likely to cause stalls. Since all the loads happen directly before the first use, the swap has to wait for the variables to be loaded, causing the stall. The interleaved stores are OK; you should hopefully get some overlap between the data being stored and the next swap occurring. Anyway, this is probably all a moot point for a sufficiently good compiler; it'll reorder things so that work occurs while it's waiting for data to be fetched from RAM. It's quite possible that doing all the loads in one block would be quite beneficial on a GPU.

℡Wang Yan Your code still uses Gunderson's swap, mine would be #define SWAP(x,y) { int oldx = x; x = x < y ? x : y; y ^= oldx ^ x; }.

lrony* +1 nothing is faster than using register

While I really like the swap macro provided:

``````
#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }
``````

I see an improvement (which a good compiler might make):

``````
#define SWAP(x,y) { int tmp = ((x ^ y) & -(y < x)); y ^= tmp; x ^= tmp; }
``````

We take note of how min and max work and pull out the common sub-expression explicitly. This eliminates the min and max macros completely.

perhaps? made both fixes you pointed out. Thanks.

℡Wang Yan I noticed the same thing; I think for your implementation to be correct you want d[x] instead of x (same for y), and d[y] < d[x] for the inequality here (yep, different from the min/max code).

An XOR swap may be useful in your swapping functions.

``````
void xorSwap (int *x, int *y) {
    if (*x != *y) {
        *x ^= *y;
        *y ^= *x;
        *x ^= *y;
    }
}
``````

The if may cause too much divergence in your code, but if you have a guarantee that all your ints are unique this could be handy.

Didn"t forge Anyway when used with sorting networks we never call with both x and y pointing to the same location. There is still to find a way to avoid testing wich is greater to get the same effect as the branchless swap. I have an idea to achieve that.

This question is becoming quite old, but I actually had to solve the same problem these days: fast algorithms to sort small arrays. I thought it would be a good idea to share my knowledge. While I first started by using sorting networks, I finally managed to find other algorithms for which the total number of comparisons performed to sort every permutation of 6 values was smaller than with sorting networks, and smaller than with insertion sort. I didn't count the number of swaps; I would expect it to be roughly equivalent (maybe a bit higher sometimes).

The algorithm `sort6` uses the algorithm `sort4` which uses the algorithm `sort3`. Here is the implementation in some light C++ form (the original is template-heavy so that it can work with any random-access iterator and any suitable comparison function).

# Sorting 3 values

The following algorithm is an unrolled insertion sort. When two swaps (6 assignments) have to be performed, it uses 4 assignments instead:

``````
void sort3(int* array)
{
    if (array[1] < array[0]) {
        if (array[2] < array[0]) {
            if (array[2] < array[1]) {
                std::swap(array[0], array[2]);
            } else {
                int tmp = array[0];
                array[0] = array[1];
                array[1] = array[2];
                array[2] = tmp;
            }
        } else {
            std::swap(array[0], array[1]);
        }
    } else {
        if (array[2] < array[1]) {
            if (array[2] < array[0]) {
                int tmp = array[2];
                array[2] = array[1];
                array[1] = array[0];
                array[0] = tmp;
            } else {
                std::swap(array[1], array[2]);
            }
        }
    }
}
``````

It looks a bit complex because the sort has more or less one branch for every possible permutation of the array, using 2 to 3 comparisons and at most 4 assignments to sort the three values.

# Sorting 4 values

This one calls `sort3` then performs an unrolled insertion sort with the last element of the array:

``````
void sort4(int* array)
{
    // Sort the first 3 elements
    sort3(array);

    // Insert the 4th element with insertion sort
    if (array[3] < array[2]) {
        std::swap(array[2], array[3]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[1] < array[0]) {
                std::swap(array[0], array[1]);
            }
        }
    }
}
``````

This algorithm performs 3 to 6 comparisons and at most 5 swaps. It is easy to unroll an insertion sort, but we will be using another algorithm for the last sort...

# Sorting 6 values

This one uses an unrolled version of what I called a double insertion sort. The name isn't that great, but it's quite descriptive. Here is how it works:

• Sort everything but the first and the last elements of the array.
• Swap the first and last elements of the array if the first is greater than the last.
• Insert the first element into the sorted sequence from the front, then the last element from the back.

After the swap, the first element is always smaller than the last, which means that, when inserting them into the sorted sequence, there won't be more than N comparisons to insert the two elements in the worst case: for example, if the first element has been inserted into the 3rd position, then the last one can't be inserted lower than the 4th position.

``````
void sort6(int* array)
{
    // Sort everything but first and last elements
    sort4(array+1);

    // Switch first and last elements if needed
    if (array[5] < array[0]) {
        std::swap(array[0], array[5]);
    }

    // Insert first element from the front
    if (array[1] < array[0]) {
        std::swap(array[0], array[1]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[4] < array[3]) {
                    std::swap(array[3], array[4]);
                }
            }
        }
    }

    // Insert last element from the back
    if (array[5] < array[4]) {
        std::swap(array[4], array[5]);
        if (array[4] < array[3]) {
            std::swap(array[3], array[4]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[2] < array[1]) {
                    std::swap(array[1], array[2]);
                }
            }
        }
    }
}
``````

My tests on every permutation of 6 values show that this algorithm always performs between 6 and 13 comparisons. I didn't compute the number of swaps performed, but I don't expect it to be higher than 11 in the worst case.

I hope that this helps, even if this question may not represent an actual problem anymore :)

EDIT: after putting it in the provided benchmark, it is clearly slower than most of the interesting alternatives. It tends to perform a bit better than the unrolled insertion sort, but that's pretty much it. Basically, it isn't the best sort for integers, but it could be interesting for types with an expensive comparison operation.

YaoRaoLov The solution by RexKerr (Order Rank) has been the fastest on the X86 architecture since gcc 4.2.3 (and as of gcc 4.9 it became nearly two times faster than the second best). But it's heavily dependent on compiler optimisations and may not hold on other architectures.

YaoRaoLov You should have a look at the way the other answers are timed. The point is that with such a small dataset, counting comparisons, or even comparisons and swaps, doesn't really say how fast an algorithm is (basically, sorting 6 ints is always O(1) because O(6*6) is O(1)). The current fastest of the previously proposed solutions immediately finds the position of each value using a big comparison (by RexKerr).

YaoRaoLov These are nice. As the problem being solved is many decades old, probably as old as C programming itself, the fact that the question is now nearly 5 years old doesn't seem that relevant.

Never optimize min/max without benchmarking and looking at the actual compiler-generated assembly. If I let GCC optimize min with conditional move instructions I get a 33% speedup:

``````
#define SWAP(x,y) { int dx = d[x], dy = d[y], tmp; tmp = d[x] = dx < dy ? dx : dy; d[y] ^= dx ^ tmp; }
``````

(280 vs. 420 cycles in the test code). Doing max with ?: is more or less the same, almost lost in the noise, but the above is a little bit faster. This SWAP is faster with both GCC and Clang.

Compilers are also doing an exceptional job at register allocation and alias analysis, effectively moving d[x] into local variables upfront, and only copying back to memory at the end. In fact, they do so even better than if you worked entirely with local variables (like `d0 = d[0], d1 = d[1], d2 = d[2], d3 = d[3], d4 = d[4], d5 = d[5]`). I'm writing this because you are assuming strong optimization and yet trying to outsmart the compiler on min/max. :)

By the way, I tried Clang and GCC. They do the same optimization, but due to scheduling differences the two show some variation in the results, so I can't really say which is faster or slower. GCC is faster on the sorting networks, Clang on the quadratic sorts.

Just for completeness, unrolled bubble sort and insertion sorts are possible too. Here is the bubble sort:

``````
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4); SWAP(4,5);
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4);
SWAP(0,1); SWAP(1,2); SWAP(2,3);
SWAP(0,1); SWAP(1,2);
SWAP(0,1);
``````

and here is the insertion sort:

``````
//#define ITER(x) { if (t < d[x]) { d[x+1] = d[x]; d[x] = t; } }
// Faster on x86, probably slower on ARM or similar:
#define ITER(x) { d[x+1] ^= t < d[x] ? d[x] ^ d[x+1] : 0; d[x] = t < d[x] ? t : d[x]; }
static inline void sort6_insertion_sort_unrolled_v2(int * d){
    int t;
    t = d[1]; ITER(0);
    t = d[2]; ITER(1); ITER(0);
    t = d[3]; ITER(2); ITER(1); ITER(0);
    t = d[4]; ITER(3); ITER(2); ITER(1); ITER(0);
    t = d[5]; ITER(4); ITER(3); ITER(2); ITER(1); ITER(0);
}
``````

This insertion sort is faster than Daniel Stutzbach's, and is especially good on a GPU or a computer with predication because ITER can be done with only 3 instructions (vs. 4 for SWAP). For example, here is the `t = d[2]; ITER(1); ITER(0);` line in ARM assembly:

``````
    MOV    r6, r2
    CMP    r6, r1
    MOVLT  r2, r1
    MOVLT  r1, r6
    CMP    r6, r0
    MOVLT  r1, r0
    MOVLT  r0, r6
``````

For six elements the insertion sort is competitive with the sorting network (12 swaps vs. 15 iterations balances 4 instructions/swap vs. 3 instructions/iteration); bubble sort of course is slower. But it's not going to be true when the size grows, since insertion sort is O(n^2) while sorting networks are O(n log n).

ℙℕℤℝ More or less related: I submitted a report to GCC so that it could implement the optimization directly in the compiler. Not sure that it will be done, but at least you can follow how it evolves.

I know this is an old question.

But I just wrote a different kind of solution I want to share: one using nothing but nested MIN and MAX.

It's not fast, as it uses 114 of each; that could be reduced to 75 pretty simply, like so -> pastebin, but then it's not purely min/max anymore.

What might work is doing min/max on multiple integers at once with AVX.

``````
#include <stdio.h>

static __inline__ int MIN(int a, int b){
    int result = a;
    __asm__ ("pminsw %1, %0" : "+x" (result) : "x" (b));
    return result;
}
static __inline__ int MAX(int a, int b){
    int result = a;
    __asm__ ("pmaxsw %1, %0" : "+x" (result) : "x" (b));
    return result;
}
static __inline__ unsigned long long rdtsc(void){
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}

#define MIN3(a, b, c) (MIN(MIN(a,b),c))
#define MIN4(a, b, c, d) (MIN(MIN(a,b),MIN(c,d)))

static __inline__ void sort6(int * in) {
    const int A=in[0], B=in[1], C=in[2], D=in[3], E=in[4], F=in[5];

    in[0] = MIN( MIN4(A,B,C,D), MIN(E,F) );

    const int
        AB = MAX(A, B),
        AC = MAX(A, C),
        AD = MAX(A, D),
        AE = MAX(A, E),
        AF = MAX(A, F),
        BC = MAX(B, C),
        BD = MAX(B, D),
        BE = MAX(B, E),
        BF = MAX(B, F),
        CD = MAX(C, D),
        CE = MAX(C, E),
        CF = MAX(C, F),
        DE = MAX(D, E),
        DF = MAX(D, F),
        EF = MAX(E, F);

    in[1] = MIN4 (
        MIN4( AB, AC, AD, AE ),
        MIN4( AF, BC, BD, BE ),
        MIN4( BF, CD, CE, CF ),
        MIN3( DE, DF, EF )
    );

    const int
        ABC = MAX(AB,C),
        ABD = MAX(AB,D),
        ABE = MAX(AB,E),
        ABF = MAX(AB,F),
        ACD = MAX(AC,D),
        ACE = MAX(AC,E),
        ACF = MAX(AC,F),
        ADE = MAX(AD,E),
        ADF = MAX(AD,F),
        AEF = MAX(AE,F),
        BCD = MAX(BC,D),
        BCE = MAX(BC,E),
        BCF = MAX(BC,F),
        BDE = MAX(BD,E),
        BDF = MAX(BD,F),
        BEF = MAX(BE,F),
        CDE = MAX(CD,E),
        CDF = MAX(CD,F),
        CEF = MAX(CE,F),
        DEF = MAX(DE,F);

    in[2] = MIN( MIN4 (
        MIN4( ABC, ABD, ABE, ABF ),
        MIN4( ACD, ACE, ACF, ADE ),
        MIN4( ADF, AEF, BCD, BCE ),
        MIN4( BCF, BDE, BDF, BEF )),
        MIN4( CDE, CDF, CEF, DEF )
    );

    const int
        ABCD = MAX(ABC,D),
        ABCE = MAX(ABC,E),
        ABCF = MAX(ABC,F),
        ABDE = MAX(ABD,E),
        ABDF = MAX(ABD,F),
        ABEF = MAX(ABE,F),
        ACDE = MAX(ACD,E),
        ACDF = MAX(ACD,F),
        ACEF = MAX(ACE,F),
        ADEF = MAX(ADE,F),
        BCDE = MAX(BCD,E),
        BCDF = MAX(BCD,F),
        BCEF = MAX(BCE,F),
        BDEF = MAX(BDE,F),
        CDEF = MAX(CDE,F);

    in[3] = MIN4 (
        MIN4( ABCD, ABCE, ABCF, ABDE ),
        MIN4( ABDF, ABEF, ACDE, ACDF ),
        MIN4( ACEF, ADEF, BCDE, BCDF ),
        MIN3( BCEF, BDEF, CDEF )
    );

    const int
        ABCDE = MAX(ABCD,E),
        ABCDF = MAX(ABCD,F),
        ABCEF = MAX(ABCE,F),
        ABDEF = MAX(ABDE,F),
        ACDEF = MAX(ACDE,F),
        BCDEF = MAX(BCDE,F);

    in[4] = MIN (
        MIN4( ABCDE, ABCDF, ABCEF, ABDEF ),
        MIN ( ACDEF, BCDEF )
    );

    in[5] = MAX(ABCDE,F);
}

int main(int argc, char ** argv) {
    int d[6][6] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };

    unsigned long long cycles = rdtsc();
    for (int i = 0; i < 6; i++) {
        sort6(d[i]);
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);

    for (int i = 0; i < 6; i++) {
        printf("d%d : %d %d %d %d %d %d\n", i,
               d[i][0], d[i][1], d[i][2],
               d[i][3], d[i][4], d[i][5]);
    }
}
``````

EDIT:
Rank order solution inspired by Rex Kerr's, much faster than the mess above:

``````
static void sort6(int *o) {
    const int
        A=o[0], B=o[1], C=o[2], D=o[3], E=o[4], F=o[5];
    const unsigned char
        AB = A>B, AC = A>C, AD = A>D, AE = A>E,
        BC = B>C, BD = B>D, BE = B>E,
        CD = C>D, CE = C>E,
        DE = D>E,
        a =          AB + AC + AD + AE + (A>F),
        b = 1 - AB      + BC + BD + BE + (B>F),
        c = 2 - AC - BC      + CD + CE + (C>F),
        d = 3 - AD - BD - CD      + DE + (D>F),
        e = 4 - AE - BE - CE - DE      + (E>F);
    o[a]=A; o[b]=B; o[c]=C; o[d]=D; o[e]=E;
    o[15-a-b-c-d-e]=F;
}
``````
Memor.の Yes, the number of MIN and MAX calls could possibly be reduced; for example, MIN(AB, CD) repeats a few times. But reducing them a lot will be hard, I think. I added your test cases.

csdnceshi62 Always nice to see new solutions. It looks like some easy optimisations are possible. In the end it may not prove so different from sorting networks.