Fastest sort of fixed length 6 int array

Answering another Stack Overflow question (this one), I stumbled upon an interesting subproblem: what is the fastest way to sort an array of 6 ints?
As the question is very low level:
 we can't assume libraries are available (and the call itself has its cost), only plain C
 to avoid emptying the instruction pipeline (which has a very high cost) we should probably minimize branches, jumps, and every other kind of control-flow break (like those hidden behind sequence points in && or ||)
 room is constrained and minimizing register and memory use is an issue; ideally an in-place sort is probably best
Really this question is a kind of golf where the goal is not to minimize source length but execution time. I call it 'Zening' code, as used in the title of the book Zen of Code Optimization by Michael Abrash and its sequels.
As for why it is interesting, there are several layers:
 the example is simple, easy to understand and measure, and involves little C skill
 it shows the effects of choosing a good algorithm for the problem, but also the effects of the compiler and underlying hardware.
Here is my reference (naive, not optimized) implementation and my test set.
#include <stdio.h>

static __inline__ void sort6(int *d)
{
    char j, i, imin;
    int tmp;
    for (j = 0; j < 5; j++) {
        imin = j;
        for (i = j + 1; i < 6; i++) {
            if (d[i] < d[imin]) {
                imin = i;
            }
        }
        tmp = d[j];
        d[j] = d[imin];
        d[imin] = tmp;
    }
}

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}

int main(int argc, char **argv)
{
    int i;
    int d[6][6] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };
    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6; i++) {
        sort6(d[i]);
        /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
         *        d[i][0], d[i][1], d[i][2],
         *        d[i][3], d[i][4], d[i][5]);
         */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}
Raw results
As the number of variants became large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those shown above, thanks to Kevin Stock. You can compile and execute it in your own environment. I'm quite interested in the behavior on different target architectures/compilers. (OK guys, put it in answers, I will +1 every contributor of a new result set.)
I gave the answer to Daniel Stutzbach (for golfing) one year ago, as he was at the source of the fastest solution at that time (sorting networks).
Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O2
 Direct call to qsort library function : 689.38
 Naive implementation (insertion sort) : 285.70
 Insertion Sort (Daniel Stutzbach) : 142.12
 Insertion Sort Unrolled : 125.47
 Rank Order : 102.26
 Rank Order with registers : 58.03
 Sorting Networks (Daniel Stutzbach) : 111.68
 Sorting Networks (Paul R) : 66.36
 Sorting Networks 12 with Fast Swap : 58.86
 Sorting Networks 12 reordered Swap : 53.74
 Sorting Networks 12 reordered Simple Swap : 31.54
 Reordered Sorting Network w/ fast swap : 31.54
 Reordered Sorting Network w/ fast swap V2 : 33.63
 Inlined Bubble Sort (Paolo Bonzini) : 48.85
 Unrolled Insertion Sort (Paolo Bonzini) : 75.30
Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O1
 Direct call to qsort library function : 705.93
 Naive implementation (insertion sort) : 135.60
 Insertion Sort (Daniel Stutzbach) : 142.11
 Insertion Sort Unrolled : 126.75
 Rank Order : 46.42
 Rank Order with registers : 43.58
 Sorting Networks (Daniel Stutzbach) : 115.57
 Sorting Networks (Paul R) : 64.44
 Sorting Networks 12 with Fast Swap : 61.98
 Sorting Networks 12 reordered Swap : 54.67
 Sorting Networks 12 reordered Simple Swap : 31.54
 Reordered Sorting Network w/ fast swap : 31.24
 Reordered Sorting Network w/ fast swap V2 : 33.07
 Inlined Bubble Sort (Paolo Bonzini) : 45.79
 Unrolled Insertion Sort (Paolo Bonzini) : 80.15
I included both -O1 and -O2 results because, surprisingly, for several programs -O2 is less efficient than -O1. I wonder which specific optimization has this effect?
Comments on proposed solutions
Insertion Sort (Daniel Stutzbach)
As expected, minimizing branches is indeed a good idea.
Sorting Networks (Daniel Stutzbach)
Better than insertion sort. I wondered whether the main effect came from avoiding the outer loop. I tried an unrolled insertion sort to check, and indeed we get roughly the same figures (code is here).
Sorting Networks (Paul R)
The best so far. The actual code I used to test is here. I don't know yet why it is nearly twice as fast as the other sorting network implementation. Parameter passing? Fast max?
Sorting Networks 12 with Fast Swap
As suggested by Daniel Stutzbach, I combined his 12-swap sorting network with the branchless fast swap (code is here). It is indeed faster, the best so far by a small margin (roughly 5%), as could be expected from using one less swap.
It is also interesting to notice that the branchless swap seems to be much (4 times) less efficient than the simple one using if on PPC architecture.
Calling Library qsort
To give another reference point I also tried, as suggested, to just call the library qsort (code is here). As expected it is much slower: 10 to 30 times slower... As became obvious with the new test suite, the main problem is the initial load of the library on the first call; after that it compares not so poorly with the other versions, just 3 to 20 times slower on my Linux. On some architectures used for tests by others it even seems to be faster (I'm really surprised by that one, as library qsort uses a more complex API).
Rank order
Rex Kerr proposed a completely different method: for each item of the array, compute its final position directly. This is efficient because computing the rank order needs no branches. The drawback of this method is that it takes three times the amount of memory of the array (one copy of the array plus variables to store rank orders). The performance results are very surprising (and interesting). On my reference architecture with a 32-bit OS and Intel Core2 Quad E8300, the cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64-bit box (Intel Core2 Duo) it performed much better: it became the fastest so far. I finally found out the true reason: my 32-bit box uses gcc 4.4.1 and my 64-bit box gcc 4.4.3, and the latter seems much better at optimising this particular code (there was very little difference for the other proposals).
Update: as the published figures above show, this effect was further enhanced by later versions of gcc, and Rank Order became consistently twice as fast as any other alternative.
Sorting Networks 12 with reordered Swap
The amazing efficiency of Rex Kerr's proposal with gcc 4.4.3 made me wonder: how could a program with three times the memory usage be faster than branchless sorting networks? My hypothesis was that it had fewer read-after-write dependencies, allowing better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder the swaps to minimize read-after-write dependencies. More simply put: when you do
SWAP(1, 2); SWAP(0, 2);
you have to wait for the first swap to be finished before performing the second one, because both access a common memory cell. When you do
SWAP(1, 2); SWAP(4, 5);
the processor can execute both in parallel. I tried it and it works as expected: the sorting network runs about 10% faster.
Sorting Networks 12 with Simple Swap
One year after the original post, Steinar H. Gunderson suggested that we should not try to outsmart the compiler and should keep the swap code simple. It's indeed a good idea, as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly code that can still spare some more cycles. The most surprising part (it speaks volumes about programmer psychology) is that one year ago none of us tried that version of the swap. The code I used to test is here. Others suggested other ways to write a fast C swap, but it yields the same performance as the simple one with a decent compiler.
The "best" code is now as follow:
static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) (x<y?x:y)
#define max(x, y) (x<y?y:x)
#define SWAP(x,y) { const int a = min(d[x], d[y]); \
                    const int b = max(d[x], d[y]); \
                    d[x] = a; d[y] = b; }
    SWAP(1, 2);
    SWAP(4, 5);
    SWAP(0, 2);
    SWAP(3, 5);
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(1, 4);
    SWAP(0, 3);
    SWAP(2, 5);
    SWAP(1, 3);
    SWAP(2, 4);
    SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}
If we believe our test set (and yes, it is quite poor; its main benefit is being short, simple, and easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That puts each swap at an average of 4 cycles. I call that amazingly fast. Are any other improvements possible?
Reposted from: https://stackoverflow.com/questions/2786899/fastest-sort-of-fixed-length-6-int-array
 撒拉嘿哟木头 To run C code on a GPU you can use CUDA or OpenCL. It raises some restrictions, but it's still C code and benefits from the GPU. By the way, if you have a GPU, just sorting 6 numbers would probably be a waste of power.
 more than 3 years ago
 撒拉嘿哟木头 Monov: still interested, but I believe that the code would be quite different on a GPU. Hence it probably should be another question. As I'm lazy I didn't open it. So if you have a good answer working on a GPU feel free to give it.
 more than 3 years ago
 胖鸭 Also, I see that you removed the "will be run on a GPU" statement. Yet the gpgpu tag remains. Do you no longer intend the question to be about GPU code?
 more than 3 years ago
 胖鸭 Hm... you specify that the language must be C, yet you say the code will be run on a GPU. How do you run C code on a GPU?
 more than 3 years ago
 斗士狗 For C++, I've recently written a templated class to generate Bose-Nelson networks at compile time. With optimizations on, the performance is on par with the fastest hand-coded answers here.
 more than 4 years ago
 撒拉嘿哟木头 11: some posters used asm for answers; I used it to access hardware timers on x86 for accurate measures, but this is marginal. There is no more assembly here than in most typical libc implementations. Would you argue using libc makes it not plain C? Most answers are pure C. But I disagree with you: knowing how the language is compiled under the hood (and what can be efficiently compiled on one target or another) is a very important part of C knowledge. Of course nobody is forced to read that question or its answers. Don't care if you don't master that level of C.
 more than 4 years ago
 游.程 This question actually has little to do with C, and more to do with assembly, as you are inspecting the machine code that is produced. Additionally, you've used __asm__ which really isn't plain C.
 more than 4 years ago
 撒拉嘿哟木头 indeed copypastecode.com went down two years ago (!) As soon as I have time available I will put back the content on some working website (probably github). Thanks for the warning.
 nearly 5 years ago
 Memor.の The links to copypastecode.com are broken; now it seems to be a dodgy advertising site.
 nearly 5 years ago
 ℡Wang Yan "Zening", I like it. Not sure where to go with this word, but I will try to use it.
 more than 6 years ago
 撒拉嘿哟木头 bad word, my fault, I want to minimize execution time, not speed ;). Thanks for spotting the typo.
 nearly 9 years ago
 bug^君 I think what you said is true for the case of maximising the speed, which is completely understandable. What I don't understand is that why would you minimise it? Why you want to write a slow program?
 nearly 9 years ago
 csdnceshi62 Pentium 2 and newer have fast conditional moves, see Steinar Gunderson's assembly code.
 nearly 9 years ago
 撒拉嘿哟木头 as pointless as minimizing source length; as I stated, it's a game, a kind of golf. But I believe the process has its points. It teaches us things about what type of code is efficient or not at several levels: algorithmic (usually the greater benefit), but also the compiler and hardware levels. It also gives some reference point as to the kind of performance that can be expected from a C program (if necessary).
 nearly 9 years ago
 三生石@ CMP EAX, EBX; SBB EAX, EAX will put either 0 or 0xFFFFFFFF in EAX depending on whether EAX is larger or smaller than EBX, respectively. SBB is "subtract with borrow", the counterpart of ADC ("add with carry"); the status bit you refer to is the carry bit. Then again, I remember that ADC and SBB had terrible latency and throughput on the Pentium 4 vs. ADD and SUB, and were still twice as slow on Core CPUs. Since the 80386 there are also SETcc conditional-store and CMOVcc conditional-move instructions, but they're also slow.
 nearly 9 years ago
 谁还没个明天 I guess my assembly is too rusty, as I don't recall any way to get that status bit into a register without jumping through a lot of hoops (push the flags, pop AX, mask and then shift: 4 cycles minimum {it's been a LONG time since I looked up cycles used, I think these are all one-cycle operations these days} and one memory stall). The only bit I recall being able to use directly is doing an add with carry.
 nearly 9 years ago
 YaoRaoLov I think you're saying "the CPU has to have different behavior based on whether x<y or not", which is true, but the CPU makes decisions all the time without altering control flow; for example, it has to decide during addition whether a certain output bit is 1 or 0 based on the input bits. The word "branch" specifically means a jump to another set of instructions, so that the control flow is changed. The difference is important since less branching allows a CPU to anticipate future instructions and begin work on them before the last instruction completes, called pipelining.
 nearly 9 years ago
 狐狸.fox You're doing Selection Sort, not Insertion Sort. Otherwise, great analysis, with convincing proof. The swap-order optimization is most interesting.
 nearly 9 years ago
 localhost it's just a cmp instruction which sets a status bit.
 nearly 9 years ago
 YaoRaoLov there's no branch from computing x<y, it's just a binary operation like + or *. If it were in an if statement, e.g., then it could cause branching, but it's not, which is the beauty of that swap implementation; it gives us conditional behavior with zero jumps.
 nearly 9 years ago
 YaoRaoLov Maybe the compiler is smart enough to do this for you, but a small speedup in the min/max SWAP appears to be available via reusing the xor'd value from min/max, like this: int s = (x^y) & -(x<y); int t = y^s; y = x^s; x = t; (Replace {x,y} with d[{x,y}]; I wanted to keep the code snippet readable.)
 nearly 9 years ago
 谁还没个明天 I've got a beef with the Sorting networks 12 with fast swap code: Look at the min & max functions: There's a common element that contains a branch.
 nearly 9 years ago
 MAOEYE
 Direct call to qsort library function : 101
 Naive implementation (insertion sort) : 299
 Insertion Sort (Daniel Stutzbach) : 108
 Insertion Sort Unrolled : 51
 Sorting Networks (Daniel Stutzbach) : 26
 Sorting Networks (Paul R) : 85
 Sorting Networks 12 with Fast Swap : 117
 Sorting Networks 12 reordered Swap : 116
 Rank Order : 56
 nearly 9 years ago
 MAOEYE Some advice: 6 loops is not enough for all platforms; also, you should use a power-of-two test count to allow mod by a power of two with AND, and repeat multiple times without polluting the results with potentially expensive mod operations. Here are some results for a platform I don't think I can name owing to NDA, but which uses the PPC architecture:
 nearly 9 years ago
 csdnceshi62 Note that the correct implementation of rdtsc on 64-bit is __asm__ volatile (".byte 0x0f, 0x31; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx"); because rdtsc puts the answer in EDX:EAX while GCC expects it in a single 64-bit register. You can see the bug by compiling at -O3. Also see below my comment to Paul R about a faster SWAP.
 nearly 9 years ago
 hurriedly% Great analysis, and good catch on reordering the swaps. From the page I linked to that generates the SWAP macros, it looks like you can get an optimal order with the "View the network using text characters" option instead of the "Create a set of SWAP macros" option. I'm emailing the author to see if I can get the SWAP macros to output in the optimum order.
 about 10 years ago
 撒拉嘿哟木头 yes, those are the predefined optimization sets of gcc (you can also enable optimization features individually). You can see at gcc.gnu.org/onlinedocs/gcc/OptimizeOptions.html what those optimizations are. Other compilers have similar optimization flags.
 about 10 years ago
 python小菜 : As a non-C guy, I'm quite fascinated by this post. I'm just wondering what -O1, -O2, etc. are. Are they compiler-optimization levels?
 about 10 years ago
 hurriedly% You should try combining my 12-swap sorting network with Paul's branchless swap function. His solution passes all of the parameters as separate elements on the stack instead of a single pointer to an array. That might also make a difference.
 about 10 years ago
 撒拉嘿哟木头 in the context I have no information on the ints (you can imagine they are arbitrary pixels coding colors), so no, sorry, you can't spare a bit for the swap trick ;)
 about 10 years ago
 elliott.david Do you have some constraints on the ints? For example, can we assume that for any two x, y, x-y and x+y won't cause underflow or overflow?
 about 10 years ago
 撒拉嘿哟木头 I will use some reference timer that reads the cycle register, like this one: fit.vutbr.cz/~peringer/SIMLIB/doc/html/rdtsc_8hsource.html . In the GPU context I know it's cheating, but it keeps the rules easy for golfing purposes.
 about 10 years ago
 撒拉嘿哟木头 yes, you are right, it would be more logical to use an array of float. Really the type of the vector content is not relevant for the problem, beyond it being some kind of 'built-in' type held in a register.
 about 10 years ago
 Didn"t forge ints on a GPU... can you use floats instead? Then you have min/max functions available. (At least GLSL does not support min/max for ints.) Also, it is probably faster to use two vec3 or similar types instead of an array, so you can use swizzling to sort.
 about 10 years ago
For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.
Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted data, so it will be the better choice if there's an above-average chance of almost-sorted data.
The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.
Here's an insertion sort implementation:
static __inline__ void sort6(int *d){
    int i, j;
    for (i = 1; i < 6; i++) {
        int tmp = d[i];
        for (j = i; j >= 1 && tmp < d[j-1]; j--)
            d[j] = d[j-1];
        d[j] = tmp;
    }
}
Here's how I'd build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:
static __inline__ void sort6(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
SWAP(1, 2);
SWAP(0, 2);
SWAP(0, 1);
SWAP(4, 5);
SWAP(3, 5);
SWAP(3, 4);
SWAP(0, 3);
SWAP(1, 4);
SWAP(2, 5);
SWAP(2, 4);
SWAP(1, 3);
SWAP(2, 3);
#undef SWAP
}
 Lotus＠ I found that replacing the if statement in the SWAP macro with a branch-free two-element sort (using min / max) doubles the speed of this code.
 about 3 years ago
 妄徒之命 There are a number of reasons  see e.g. this question and answer as a starting point.
 about 7 years ago
 hurriedly% why is that?
 about 7 years ago
 七度＆光 You can get rid of tmp entirely by using an XOR swap: d[x]^=d[y]; d[y]^=d[x]; d[x]^=d[y];
 nearly 9 years ago
 from.. Would moving the int tmp out of the macro scope into function scope be of benefit?
 nearly 9 years ago
 笑故挽风 Well, a C library sort function requires that you specify the comparison operation via a function pointer. The overhead of calling a function for every comparison is huge. Usually, that's still the cleanest way to go, because this is rarely a critical path in the program. However, if it is the critical path, we really can sort much faster if we know we're sorting integers and exactly 6 of them. :)
 about 10 years ago
 ℙℕℤℝ I believe the cost you have to pay just to call the library function (instead of static inline) is so high it defeats the library optimizations. But you are right, I should provide figures for plain library call to give a reference point.
 about 10 years ago
 北城已荒凉 good point; I should have thought of that. Doesn't that imply the correct answer to the question then is to just use the library sort?
 about 10 years ago
 笑故挽风 A good library sort function will already have a fast path for small arrays. Many modern libraries will use a recursive QuickSort or MergeSort that switches to InsertionSort after recursing down to n < SMALL_CONSTANT.
 about 10 years ago
The test code is pretty bad; it overflows the initial array (don't people here read compiler warnings?), the printf is printing out the wrong elements, it uses .byte for rdtsc for no good reason, there's only one run (!), there's nothing checking that the end results are actually correct (so it's very easy to “optimize” into something subtly wrong), the included tests are very rudimentary (no negative numbers?) and there's nothing to stop the compiler from just discarding the entire function as dead code.
That being said, it's also pretty easy to improve on the bitonic network solution; simply change the min/max/SWAP stuff to
#define SWAP(x,y) { int tmp; asm("mov %0, %2 ; cmp %1, %0 ; cmovg %1, %0 ; cmovg %2, %1" : "=r" (d[x]), "=r" (d[y]), "=r" (tmp) : "0" (d[x]), "1" (d[y]) : "cc"); }
and it comes out about 65% faster for me (Debian gcc 4.4.5 with -O2, amd64, Core i7).
 localhost …and finally, if your numbers are floats, and you don't have to worry about NaN etc., GCC can convert this to minss/maxss SSE instructions, which is yet ~25% faster. Moral: drop the clever bit-fiddling tricks and let the compiler do its job. :)
 nearly 9 years ago
 localhost You don't even need assembler, actually; if you just drop all the clever tricks, GCC will recognize the sequence and insert the conditional moves for you: #define min(a, b) ((a < b) ? a : b) #define max(a, b) ((a < b) ? b : a) #define SWAP(x,y) { int a = min(d[x], d[y]); int b = max(d[x], d[y]); d[x] = a; d[y] = b; } It comes out maybe a few percent slower than the inline asm variant, but that's hard to say given the lack of proper benchmarking.
 nearly 9 years ago
 Didn"t forge Thanks for noticing the array overflow, I corrected it. Other people may not have noticed it because they clicked on the link to copy/paste code, where there is no overflow.
 nearly 9 years ago
 Didn"t forge OK, the test code is poor. Feel free to improve it. And yes, you can use assembly code. Why not go all the way and fully code it in x86 assembler? It may be a bit less portable, but why bother?
 nearly 9 years ago
Since these are integers and compares are fast, why not compute the rank order of each directly:
inline void sort6(int *d) {
int e[6];
memcpy(e,d,6*sizeof(int));
int o0 = (d[0]>d[1])+(d[0]>d[2])+(d[0]>d[3])+(d[0]>d[4])+(d[0]>d[5]);
int o1 = (d[1]>=d[0])+(d[1]>d[2])+(d[1]>d[3])+(d[1]>d[4])+(d[1]>d[5]);
int o2 = (d[2]>=d[0])+(d[2]>=d[1])+(d[2]>d[3])+(d[2]>d[4])+(d[2]>d[5]);
int o3 = (d[3]>=d[0])+(d[3]>=d[1])+(d[3]>=d[2])+(d[3]>d[4])+(d[3]>d[5]);
int o4 = (d[4]>=d[0])+(d[4]>=d[1])+(d[4]>=d[2])+(d[4]>=d[3])+(d[4]>d[5]);
int o5 = 15 - (o0+o1+o2+o3+o4);
d[o0]=e[0]; d[o1]=e[1]; d[o2]=e[2]; d[o3]=e[3]; d[o4]=e[4]; d[o5]=e[5];
}
 零零乙 0+1+2+3+4+5=15. Since one of them is missing, 15 minus the sum of the rest yields the missing one.
 about 4 years ago
 elliott.david what does the magic number 15 mean?
 nearly 5 years ago
 撒拉嘿哟木头 A big 'plus' here is that there are no 'stores' in the main part, so the optimizer will be able to easily work in registers, and only read each d[i] once. If the compiler can determine that, e.g. e[0] has already been read as d[0] (and you have enough registers), it would be even better. That could be made explicit by using 6 local variables d0..d5 instead of the memcpy to e.
 nearly 6 years ago
 撒拉嘿哟木头 The efficiency of this approach depends a lot on how easy it is to do (int)(a>=b) on the particular machine. Pentiums have the 'setcc' instruction allowing this to be done in two instructions, but on some processors it could require more complex code. Although on almost anything modern it should be possible without a branch, compilers may not always oblige. If you can assume no overflow in the subtract, ((a-b)>>31) gives 0 or -1, and that could be used instead.
 nearly 6 years ago
 10.24 Aha. That is not completely surprising; there are a lot of variables floating around, and they have to be carefully ordered and cached in registers and so on.
 about 10 years ago
 YaoRaoLov I updated my answer. The true reason was the version of the compiler (gcc 4.4.1 vs gcc 4.4.3), not the target architecture. I didn't identify exactly which optimization. Your solution seems to push gcc hard; for example, gcc 4.4.3 yields much better results with -O1 than with -O2. I guess I will have to look at the assembly code if I really want to understand why.
 about 10 years ago
 10.24 I think a factor of 2 difference in cycles is quite large, especially since I was testing on a 2core machine of the same vintage as the Q8300!
 about 10 years ago
 YaoRaoLov 3GHz with native Linux 64bits OS) and on it your program is the fastest (~370 cycles vs ~390). I should edit my question to provide results for both architectures (with your answer).
 about 10 years ago
 YaoRaoLov I also wonder if your method is really working on every dataset. I wonder if you do not have cases where several values are mapped to the same place when sorted data are repeated.
 about 10 years ago
Looking forward to trying my hand at this and learning from these examples, but first some timings from my 1.5 GHz PPC Powerbook G4 w/ 1 GB DDR RAM. (I borrowed a similar rdtsc-like timer for PPC from http://www.mcs.anl.gov/~kazutomo/rdtsc.html for the timings.) I ran the program a few times and the absolute results varied, but the consistently fastest test was "Insertion Sort (Daniel Stutzbach)", with "Insertion Sort Unrolled" a close second.
Here's the last set of times:
**Direct call to qsort library function** : 164
**Naive implementation (insertion sort)** : 138
**Insertion Sort (Daniel Stutzbach)** : 85
**Insertion Sort Unrolled** : 97
**Sorting Networks (Daniel Stutzbach)** : 457
**Sorting Networks (Paul R)** : 179
**Sorting Networks 12 with Fast Swap** : 238
**Sorting Networks 12 reordered Swap** : 236
**Rank Order** : 116
Looks like I got to the party a year late, but here we go...
Looking at the assembly generated by gcc 4.5.2 I observed that loads and stores are being done for every swap, which really isn't needed. It would be better to load the 6 values into registers, sort those, and store them back into memory. I ordered the loads and stores to be as close as possible to where the registers are first needed and last used. I also used Steinar H. Gunderson's SWAP macro. Update: I switched to Paolo Bonzini's SWAP macro, which gcc converts into something similar to Gunderson's, but gcc is able to better order the instructions since they aren't given as explicit assembly.
I used the same swap order as the reordered swap network given as the best performing, although there may be a better ordering. If I find some more time I'll generate and test a bunch of permutations.
I changed the testing code to consider over 4000 arrays and show the average number of cycles needed to sort each one. On an i5-650 I'm getting ~34.1 cycles/sort (using -O3), compared to the original reordered sorting network getting ~65.3 cycles/sort (using -O1, which beats -O2 and -O3).
#include <stdio.h>
static inline void sort6_fast(int * d) {
#define SWAP(x,y) { int dx = x, dy = y, tmp; tmp = x = dx < dy ? dx : dy; y ^= dx ^ tmp; }
register int x0,x1,x2,x3,x4,x5;
x1 = d[1];
x2 = d[2];
SWAP(x1, x2);
x4 = d[4];
x5 = d[5];
SWAP(x4, x5);
x0 = d[0];
SWAP(x0, x2);
x3 = d[3];
SWAP(x3, x5);
SWAP(x0, x1);
SWAP(x3, x4);
SWAP(x1, x4);
SWAP(x0, x3);
d[0] = x0;
SWAP(x2, x5);
d[5] = x5;
SWAP(x1, x3);
d[1] = x1;
SWAP(x2, x4);
d[4] = x4;
SWAP(x2, x3);
d[2] = x2;
d[3] = x3;
#undef SWAP
#undef min
#undef max
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile ("rdtsc; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx");
return x;
}
void ran_fill(int n, int *a) {
static int seed = 76521;
while (n--) *a++ = (seed = seed *1812433253 + 12345);
}
#define NTESTS 4096
int main() {
int i;
int d[6*NTESTS];
ran_fill(6*NTESTS, d);
unsigned long long cycles = rdtsc();
for (i = 0; i < 6*NTESTS ; i+=6) {
sort6_fast(d+i);
}
cycles = rdtsc() - cycles;
printf("Time is %.2lf\n", (double)cycles/(double)NTESTS);
for (i = 0; i < 6*NTESTS ; i+=6) {
if (d[i+0] > d[i+1] || d[i+1] > d[i+2] || d[i+2] > d[i+3] || d[i+3] > d[i+4] || d[i+4] > d[i+5])
printf("d%d : %d %d %d %d %d %d\n", i,
d[i+0], d[i+1], d[i+2],
d[i+3], d[i+4], d[i+5]);
}
return 0;
}
I modified the test suite to also report clocks per sort and to run more tests (the cmp function was updated to handle integer overflow as well); here are the results on some different architectures. I attempted testing on an AMD CPU, but rdtsc isn't reliable on the X6 1100T I have available.
Clarkdale (i5650)
==================
Direct call to qsort library function 635.14 575.65 581.61 577.76 521.12
Naive implementation (insertion sort) 538.30 135.36 134.89 240.62 101.23
Insertion Sort (Daniel Stutzbach) 424.48 159.85 160.76 152.01 151.92
Insertion Sort Unrolled 339.16 125.16 125.81 129.93 123.16
Rank Order 184.34 106.58 54.74 93.24 94.09
Rank Order with registers 127.45 104.65 53.79 98.05 97.95
Sorting Networks (Daniel Stutzbach) 269.77 130.56 128.15 126.70 127.30
Sorting Networks (Paul R) 551.64 103.20 64.57 73.68 73.51
Sorting Networks 12 with Fast Swap 321.74 61.61 63.90 67.92 67.76
Sorting Networks 12 reordered Swap 318.75 60.69 65.90 70.25 70.06
Reordered Sorting Network w/ fast swap 145.91 34.17 32.66 32.22 32.18
Kentsfield (Core 2 Quad)
========================
Direct call to qsort library function 870.01 736.39 723.39 725.48 721.85
Naive implementation (insertion sort) 503.67 174.09 182.13 284.41 191.10
Insertion Sort (Daniel Stutzbach) 345.32 152.84 157.67 151.23 150.96
Insertion Sort Unrolled 316.20 133.03 129.86 118.96 105.06
Rank Order 164.37 138.32 46.29 99.87 99.81
Rank Order with registers 115.44 116.02 44.04 116.04 116.03
Sorting Networks (Daniel Stutzbach) 230.35 114.31 119.15 110.51 111.45
Sorting Networks (Paul R) 498.94 77.24 63.98 62.17 65.67
Sorting Networks 12 with Fast Swap 315.98 59.41 58.36 60.29 55.15
Sorting Networks 12 reordered Swap 307.67 55.78 51.48 51.67 50.74
Reordered Sorting Network w/ fast swap 149.68 31.46 30.91 31.54 31.58
Sandy Bridge (i72600k)
=======================
Direct call to qsort library function 559.97 451.88 464.84 491.35 458.11
Naive implementation (insertion sort) 341.15 160.26 160.45 154.40 106.54
Insertion Sort (Daniel Stutzbach) 284.17 136.74 132.69 123.85 121.77
Insertion Sort Unrolled 239.40 110.49 114.81 110.79 117.30
Rank Order 114.24 76.42 45.31 36.96 36.73
Rank Order with registers 105.09 32.31 48.54 32.51 33.29
Sorting Networks (Daniel Stutzbach) 210.56 115.68 116.69 107.05 124.08
Sorting Networks (Paul R) 364.03 66.02 61.64 45.70 44.19
Sorting Networks 12 with Fast Swap 246.97 41.36 59.03 41.66 38.98
Sorting Networks 12 reordered Swap 235.39 38.84 47.36 38.61 37.29
Reordered Sorting Network w/ fast swap 115.58 27.23 27.75 27.25 26.54
Nehalem (Xeon E5640)
====================
Direct call to qsort library function 911.62 890.88 681.80 876.03 872.89
Naive implementation (insertion sort) 457.69 236.87 127.68 388.74 175.28
Insertion Sort (Daniel Stutzbach) 317.89 279.74 147.78 247.97 245.09
Insertion Sort Unrolled 259.63 220.60 116.55 221.66 212.93
Rank Order 140.62 197.04 52.10 163.66 153.63
Rank Order with registers 84.83 96.78 50.93 109.96 54.73
Sorting Networks (Daniel Stutzbach) 214.59 220.94 118.68 120.60 116.09
Sorting Networks (Paul R) 459.17 163.76 56.40 61.83 58.69
Sorting Networks 12 with Fast Swap 284.58 95.01 50.66 53.19 55.47
Sorting Networks 12 reordered Swap 281.20 96.72 44.15 56.38 54.57
Reordered Sorting Network w/ fast swap 128.34 50.87 26.87 27.91 28.02
 from.. the "Rank Order" solution will tend to be mostly in regs anyway since it defers the stores until all the main calculation is done. Could be improved by using reads into 6 locals instead of the memcpy to 'e' though. I wouldn't be surprised if gcc can do that itself (it can treat a small local array as separate int vars; here it would also need to transform the memcpy into 6 separate ops e[0]=d[0]; e[1] = d[1] .. in order to make this optimization. But it would be better to make that explicit in the code).
 about 6 years ago
 斗士狗 There are exactly 720 different orderings of 6 elements. The 'best' sort in one sense would be one that had the best worst-case performance among those 720 orderings.
 about 6 years ago
 叼花硬汉 You're right. In one of my earlier variants I found that interleaving the loads was slightly beneficial, and so it stuck around. However, the compiler now moves most or all of the loads to the beginning.
 about 9 years ago
 YaoRaoLov It seems to me that the ordering of the loads is likely to cause stalls. Since all the loads happen directly before the first use, the swap has to wait for the variables to be loaded, causing the stall. The interleaved stores are OK; you should hopefully get some overlap between the data being stored and the next swap occurring. Anyway, this is probably all a moot point for a sufficiently good compiler; it'll reorder things so that work occurs while it's waiting for data to be fetched from RAM. It's quite possible that doing all the loads in one block would be quite beneficial on a GPU.
 about 9 years ago
 叼花硬汉 Bonzini: I just forgot to update the code on this page (the pastebin link has the correct code). The results are using your macro. Fixed.
 about 9 years ago
 程序go Bonzini: also it seems that using two temporary variables instead of one in swap has a positive effect, but I'm not really sure of this one.
 about 9 years ago
 程序go Bonzini: Yes, I intend to add a test case with yours, I just haven't had time yet. But I will avoid inline assembly.
 about 9 years ago
 ℡Wang Yan Your code still uses Gunderson's swap, mine would be #define SWAP(x,y) { int oldx = x; x = x < y ? x : y; y ^= oldx ^ x; }.
 about 9 years ago
 程序go Would you mind adding the simple swap version to your test suite? I guess it could be interesting to compare it with the assembly fast swap optimized by hand.
 about 9 years ago
 叼花硬汉 I just tested it, I'm not seeing improvement (except a few cycles at -O0 and -Os). Looking at the asm, it appears gcc already managed to figure out how to use registers and eliminate the call to memcpy.
 about 9 years ago
 必承其重  欲带皇冠 Your idea of register variables should be applied to Rex Kerr's "Rank Order" solution. That should be fastest, and perhaps then the -O3 optimization will not be counterproductive.
 about 9 years ago
While I really like the swap macro provided:
#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }
I see an improvement (which a good compiler might make):
#define SWAP(x,y) { int tmp = ((x ^ y) & -(y < x)); y ^= tmp; x ^= tmp; }
We take note of how min and max work and pull out the common subexpression explicitly. This eliminates the min and max macros completely.
 衫裤跑路 I tried with your swap, but local optimization has negative effects at a larger level (I guess it introduces dependencies), and the result is slower than the other swap. But as you can see with the new solutions proposed, there was indeed much performance to be gained by optimizing swap.
 about 9 years ago
 ℡Wang Yan I noticed the same thing; I think for your implementation to be correct you want d[x] instead of x (same for y), and d[y] < d[x] for the inequality here (yep, different from the min/max code).
 about 9 years ago
 必承其重  欲带皇冠 That gets them backwards, notice that d[y] gets the max, which is x^(common subexpression).
 about 9 years ago
An XOR swap may be useful in your swapping functions.
void xorSwap (int *x, int *y) {
    if (*x != *y) {
        *x ^= *y;
        *y ^= *x;
        *x ^= *y;
    }
}
The if may cause too much divergence in your code, but if you have a guarantee that all your ints are unique this could be handy.
 Didn"t forge Anyway when used with sorting networks we never call with both x and y pointing to the same location. There is still to find a way to avoid testing wich is greater to get the same effect as the branchless swap. I have an idea to achieve that.
 about 9 years ago
This question is becoming quite old, but I actually had to solve the same problem these days: fast algorithms to sort small arrays. I thought it would be a good idea to share my knowledge. While I first started by using sorting networks, I finally managed to find other algorithms for which the total number of comparisons performed to sort every permutation of 6 values was smaller than with sorting networks, and smaller than with insertion sort. I didn't count the number of swaps; I would expect it to be roughly equivalent (maybe a bit higher sometimes).
The algorithm sort6 uses the algorithm sort4, which uses the algorithm sort3. Here is the implementation in some light C++ form (the original is template-heavy so that it can work with any random-access iterator and any suitable comparison function).
Sorting 3 values
The following algorithm is an unrolled insertion sort. When two swaps (6 assignments) have to be performed, it uses 4 assignments instead:
void sort3(int* array)
{
    if (array[1] < array[0]) {
        if (array[2] < array[0]) {
            if (array[2] < array[1]) {
                std::swap(array[0], array[2]);
            } else {
                int tmp = array[0];
                array[0] = array[1];
                array[1] = array[2];
                array[2] = tmp;
            }
        } else {
            std::swap(array[0], array[1]);
        }
    } else {
        if (array[2] < array[1]) {
            if (array[2] < array[0]) {
                int tmp = array[2];
                array[2] = array[1];
                array[1] = array[0];
                array[0] = tmp;
            } else {
                std::swap(array[1], array[2]);
            }
        }
    }
}
It looks a bit complex because the sort has more or less one branch for every possible permutation of the array, using 2~3 comparisons and at most 4 assignments to sort the three values.
Sorting 4 values
This one calls sort3
then performs an unrolled insertion sort with the last element of the array:
void sort4(int* array)
{
    // Sort the first 3 elements
    sort3(array);

    // Insert the 4th element with insertion sort
    if (array[3] < array[2]) {
        std::swap(array[2], array[3]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[1] < array[0]) {
                std::swap(array[0], array[1]);
            }
        }
    }
}
This algorithm performs 3 to 6 comparisons and at most 5 swaps. It is easy to unroll an insertion sort, but we will be using another algorithm for the last sort...
Sorting 6 values
This one uses an unrolled version of what I called a double insertion sort. The name isn't that great, but it's quite descriptive, here is how it works:
 Sort everything but the first and the last elements of the array.
 Swap the first and the last elements of the array if the first is greater than the last.
 Insert the first element into the sorted sequence from the front, then the last element from the back.
After the swap, the first element is always smaller than the last, which means that, when inserting them into the sorted sequence, there won't be more than N comparisons to insert the two elements in the worst case: for example, if the first element has been inserted in the 3rd position, then the last one can't be inserted lower than the 4th position.
void sort6(int* array)
{
    // Sort everything but first and last elements
    sort4(array+1);

    // Switch first and last elements if needed
    if (array[5] < array[0]) {
        std::swap(array[0], array[5]);
    }

    // Insert first element from the front
    if (array[1] < array[0]) {
        std::swap(array[0], array[1]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[4] < array[3]) {
                    std::swap(array[3], array[4]);
                }
            }
        }
    }

    // Insert last element from the back
    if (array[5] < array[4]) {
        std::swap(array[4], array[5]);
        if (array[4] < array[3]) {
            std::swap(array[3], array[4]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[2] < array[1]) {
                    std::swap(array[1], array[2]);
                }
            }
        }
    }
}
My tests on every permutation of 6 values show that this algorithm always performs between 6 and 13 comparisons. I didn't compute the number of swaps performed, but I don't expect it to be higher than 11 in the worst case.
I hope that this helps, even if this question may not represent an actual problem anymore :)
EDIT: after putting it in the provided benchmark, it is clearly slower than most of the interesting alternatives. It tends to perform a bit better than the unrolled insertion sort, but that's pretty much it. Basically, it isn't the best sort for integers, but could be interesting for types with an expensive comparison operation.
 胖鸭 That's interesting to know. And I could indeed see more differences again with -O3. I guess I will adopt another strategy for my sorting library then: providing three kinds of algorithms to have either a low number of comparisons, a low number of swaps, or potentially the best performance. At least, what happens will be transparent for the reader. Thanks for your insights :)
 about 5 years ago
 YaoRaoLov The solution by Rex Kerr (Rank Order) became the fastest on the x86 architecture since gcc 4.2.3 (and as of gcc 4.9 became nearly two times faster than the second best). But it's heavily dependent on compiler optimisations and may not be true on other architectures.
 about 5 years ago
 YaoRaoLov You should have a look at the way the other answers are timed. The point is that with such a small dataset, counting comparisons or even comparisons and swaps doesn't really say how fast an algorithm is (basically sorting 6 ints is always O(1) because O(6*6) is O(1)). The current fastest of the previously proposed solutions immediately finds the position of each value using a big comparison (by RexKerr).
 about 5 years ago
Never optimize min/max without benchmarking and looking at actual compiler generated assembly. If I let GCC optimize min with conditional move instructions I get a 33% speedup:
#define SWAP(x,y) { int dx = d[x], dy = d[y], tmp; tmp = d[x] = dx < dy ? dx : dy; d[y] ^= dx ^ tmp; }
(280 vs. 420 cycles in the test code). Doing max with ?: is more or less the same, almost lost in the noise, but the above is a little bit faster. This SWAP is faster with both GCC and Clang.
Compilers are also doing an exceptional job at register allocation and alias analysis, effectively moving d[x] into local variables upfront, and only copying back to memory at the end. In fact, they do so even better than if you worked entirely with local variables (like d0 = d[0], d1 = d[1], d2 = d[2], d3 = d[3], d4 = d[4], d5 = d[5]). I'm writing this because you are assuming strong optimization and yet trying to outsmart the compiler on min/max. :)
By the way, I tried Clang and GCC. They do the same optimization, but due to scheduling differences the two have some variation in the results, can't say really which is faster or slower. GCC is faster on the sorting networks, Clang on the quadratic sorts.
Just for completeness, unrolled bubble sort and insertion sorts are possible too. Here is the bubble sort:
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4); SWAP(4,5);
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4);
SWAP(0,1); SWAP(1,2); SWAP(2,3);
SWAP(0,1); SWAP(1,2);
SWAP(0,1);
and here is the insertion sort:
//#define ITER(x) { if (t < d[x]) { d[x+1] = d[x]; d[x] = t; } }
//Faster on x86, probably slower on ARM or similar:
#define ITER(x) { d[x+1] ^= t < d[x] ? d[x] ^ d[x+1] : 0; d[x] = t < d[x] ? t : d[x]; }
static inline void sort6_insertion_sort_unrolled_v2(int * d){
    int t;
    t = d[1]; ITER(0);
    t = d[2]; ITER(1); ITER(0);
    t = d[3]; ITER(2); ITER(1); ITER(0);
    t = d[4]; ITER(3); ITER(2); ITER(1); ITER(0);
    t = d[5]; ITER(4); ITER(3); ITER(2); ITER(1); ITER(0);
}
This insertion sort is faster than Daniel Stutzbach's, and is especially good on a GPU or a computer with predication, because ITER can be done with only 3 instructions (vs. 4 for SWAP). For example, here is the t = d[2]; ITER(1); ITER(0); line in ARM assembly:
MOV r6, r2
CMP r6, r1
MOVLT r2, r1
MOVLT r1, r6
CMP r6, r0
MOVLT r1, r0
MOVLT r0, r6
For six elements the insertion sort is competitive with the sorting network (12 swaps vs. 15 iterations balances 4 instructions/swap vs. 3 instructions/iteration); bubble sort of course is slower. But this will not remain true as the size grows, since insertion sort is O(n^2) while sorting networks are O(n log n).
I know this is an old question.
But I just wrote a different kind of solution I want to share.
Using nothing but nested MIN/MAX.
It's not fast, as it uses 114 of each; it could be reduced to 75 pretty simply, like so > pastebin.
But then it's not purely min/max anymore.
What might work is doing min/max on multiple integers at once with AVX.
#include <stdio.h>
static __inline__ int MIN(int a, int b){
int result =a;
__asm__ ("pminsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ int MAX(int a, int b){
int result = a;
__asm__ ("pmaxsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ unsigned long long rdtsc(void){
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" :
"=A" (x));
return x;
}
#define MIN3(a, b, c) (MIN(MIN(a,b),c))
#define MIN4(a, b, c, d) (MIN(MIN(a,b),MIN(c,d)))
static __inline__ void sort6(int * in) {
const int A=in[0], B=in[1], C=in[2], D=in[3], E=in[4], F=in[5];
in[0] = MIN( MIN4(A,B,C,D),MIN(E,F) );
const int
AB = MAX(A, B),
AC = MAX(A, C),
AD = MAX(A, D),
AE = MAX(A, E),
AF = MAX(A, F),
BC = MAX(B, C),
BD = MAX(B, D),
BE = MAX(B, E),
BF = MAX(B, F),
CD = MAX(C, D),
CE = MAX(C, E),
CF = MAX(C, F),
DE = MAX(D, E),
DF = MAX(D, F),
EF = MAX(E, F);
in[1] = MIN4 (
MIN4( AB, AC, AD, AE ),
MIN4( AF, BC, BD, BE ),
MIN4( BF, CD, CE, CF ),
MIN3( DE, DF, EF)
);
const int
ABC = MAX(AB,C),
ABD = MAX(AB,D),
ABE = MAX(AB,E),
ABF = MAX(AB,F),
ACD = MAX(AC,D),
ACE = MAX(AC,E),
ACF = MAX(AC,F),
ADE = MAX(AD,E),
ADF = MAX(AD,F),
AEF = MAX(AE,F),
BCD = MAX(BC,D),
BCE = MAX(BC,E),
BCF = MAX(BC,F),
BDE = MAX(BD,E),
BDF = MAX(BD,F),
BEF = MAX(BE,F),
CDE = MAX(CD,E),
CDF = MAX(CD,F),
CEF = MAX(CE,F),
DEF = MAX(DE,F);
in[2] = MIN( MIN4 (
MIN4( ABC, ABD, ABE, ABF ),
MIN4( ACD, ACE, ACF, ADE ),
MIN4( ADF, AEF, BCD, BCE ),
MIN4( BCF, BDE, BDF, BEF )),
MIN4( CDE, CDF, CEF, DEF )
);
const int
ABCD = MAX(ABC,D),
ABCE = MAX(ABC,E),
ABCF = MAX(ABC,F),
ABDE = MAX(ABD,E),
ABDF = MAX(ABD,F),
ABEF = MAX(ABE,F),
ACDE = MAX(ACD,E),
ACDF = MAX(ACD,F),
ACEF = MAX(ACE,F),
ADEF = MAX(ADE,F),
BCDE = MAX(BCD,E),
BCDF = MAX(BCD,F),
BCEF = MAX(BCE,F),
BDEF = MAX(BDE,F),
CDEF = MAX(CDE,F);
in[3] = MIN4 (
MIN4( ABCD, ABCE, ABCF, ABDE ),
MIN4( ABDF, ABEF, ACDE, ACDF ),
MIN4( ACEF, ADEF, BCDE, BCDF ),
MIN3( BCEF, BDEF, CDEF )
);
const int
ABCDE= MAX(ABCD,E),
ABCDF= MAX(ABCD,F),
ABCEF= MAX(ABCE,F),
ABDEF= MAX(ABDE,F),
ACDEF= MAX(ACDE,F),
BCDEF= MAX(BCDE,F);
in[4]= MIN (
MIN4( ABCDE, ABCDF, ABCEF, ABDEF ),
MIN ( ACDEF, BCDEF )
);
in[5] = MAX(ABCDE,F);
}
int main(int argc, char ** argv) {
int d[6][6] = {
{1, 2, 3, 4, 5, 6},
{6, 5, 4, 3, 2, 1},
{100, 2, 300, 4, 500, 6},
{100, 2, 3, 4, 500, 6},
{1, 200, 3, 4, 5, 600},
{1, 1, 2, 1, 2, 1}
};
unsigned long long cycles = rdtsc();
for (int i = 0; i < 6; i++) {
sort6(d[i]);
}
cycles = rdtsc() - cycles;
printf("Time is %d\n", (unsigned)cycles);
for (int i = 0; i < 6; i++) {
printf("d%d : %d %d %d %d %d %d\n", i,
d[i][0], d[i][1], d[i][2],
d[i][3], d[i][4], d[i][5]);
}
}
EDIT:
Rank order solution inspired by Rex Kerr's, much faster than the mess above:
static void sort6(int *o) {
    const int
        A=o[0], B=o[1], C=o[2], D=o[3], E=o[4], F=o[5];
    const unsigned char
        AB = A>B, AC = A>C, AD = A>D, AE = A>E,
        BC = B>C, BD = B>D, BE = B>E,
        CD = C>D, CE = C>E,
        DE = D>E,
        a = AB + AC + AD + AE + (A>F),
        b = 1 - AB + BC + BD + BE + (B>F),
        c = 2 - AC - BC + CD + CE + (C>F),
        d = 3 - AD - BD - CD + DE + (D>F),
        e = 4 - AE - BE - CE - DE + (E>F);
    o[a]=A; o[b]=B; o[c]=C; o[d]=D; o[e]=E;
    o[15-a-b-c-d-e]=F;
}
 Memor.の Yes, the number of MIN and MAX operations could possibly be reduced; for example MIN(AB, CD) repeats a few times. But reducing them a lot will be hard, I think. I added your test cases.
 about 3 years ago
 csdnceshi62 Always nice to see new solutions. It looks like some easy optimisations are possible. In the end it may not prove so different from Sorting Networks.
 about 3 years ago
class="qllong24357476"> 课程8：《推荐系统技术入门到实战》 </p> <p> <br> </p> <p class="qllong24357476"> 【面向人群】 </p> <p class="qllong24357476"> 1、在校计算机专业或者对软件编程感兴趣的学生； </p> <p class="qllong24357476"> 2、想要使用数据分析、网络爬虫提升职场竞争力升职加薪的各行各业的企业白领； </p> <p class="qllong24357476"> 3、想要转方向成为数据分析师、大数据开发、机器学习算法、推荐系统工程师的职场码农； </p> <p class="qllong24357476"> 4、准备从事人工智能、Python开发的程序员。 </p> </span> <p> <br> </p> <p class="qllong24357476"> <br> </p> <p> <br> </p> <p class="qllong24357476"> 【课程知识体系图】 </p> </span></span></span> <p> <img src="https://imgbss.csdnimg.cn/202006100818561687.png" alt=""> </p>
 23.15MB
Xilinx《Zynq设计方法指南》中文版
20190507本书详细讲解了Xilinx Zynq的特色、应用、方法，是官方出版的非常好的一本学习Zynq的权威资料。
21天通关Python（仅视频课）
20190521本页面购买不发书！！！仅为视频课购买！！！ 请务必到https://edu.csdn.net/bundled/detail/49下单购买课+书。 本页面，仅为观看视频页面，如需一并购买图书，请务必到https://edu.csdn.net/bundled/detail/49下单购买课程+图书！！！ 疯狂Python精讲课程覆盖《疯狂Python讲义》全书的主体内容。 内容包括Python基本数据类型、Python列表、元组和字典、流程控制、函数式编程、面向对象编程、文件读写、异常控制、数据库编程、并发编程与网络编程、数据可视化分析、Python爬虫等。 全套课程从Python基础开始介绍，逐步步入当前就业热点。将会带着大家从Python基础语法开始学习，为每个知识点都提供对应的代码实操、代码练习，逐步过渡到文件IO、数据库编程、并发编程、网络编程、数据分 析和网络爬虫等内容，本课程会从小案例起，至爬虫、数据分析案例终、以Python知识体系作为内在逻辑，以Python案例作为学习方式，最终达到“知行合一”。
 3KB
谷粒商城2020.5月最新升级全套版基础+高级+高可用集群
20200716谷粒商城2020.5月最新升级全套版基础+高级+高可用集群，已完结！
 38.68MB
2019数学建模历年题目及优秀论文
201904022019数学建模历年题目及优秀论文 ，好资源与大家分享！！
 36.19MB
老王加速器 2.2.16.rar
20200630老王加速器最新版本，旨在提高游戏网络速度，可用于网络优化相关的服务。 本应用仅供科研、学习、保护隐私之用，切勿用于其它任何用途。
Java学习指南（Java入门与进阶）
20170809这是Java学习指南系列课程的第1篇，介绍Java语言的入门语法，引领希望学习Java语言编程的初学者进入Java大门。 本课程不需要其他语言作为基础，可以直接学习。 课程从Java开发平台的下载和安装开始，从浅到深、从易到难，循序渐进地进行语法讲解。 为了让学员更好的掌握Java语言，本课程配套在线的Java题库及答案解析。 相比于其他语言，Java语言更科学、更容易掌握，快来和大家一起学习Java吧。
 2.48MB
土豆SDK(Java版)非官方
20120714由于土豆SDK一直建设中，最近几天抽空写了一套java的SDK。包含了现有的所有请求协议。本套SDK中仅提供了oAuth的方式(引用oAuth.net的java版示例)，并没有在框架中实现，涉及到登录
 博客 关于 Hibernate StackOverflowError 异常的一个解决方法
 博客 PreparedStatement与Statement的区别及SQL注入
 下载 外夹式超声波流量计的优缺点具体表现哪里
 博客 cephceph wrongly marked me down
 下载 基于数字化变电站的光纤纵差保护研究
 下载 线性温度传感器使用指南
 博客 Ubuntu18.04+CUDA10.1+cudnn7.6.5+tensorflow1.14+pytorch1.4安装全过程
 博客 java字符串相关操作（一）
 博客 Spring中Bean的实例化过程
 博客 k8s部署prometheus + grafana
 学院 手把手教你蓝牙协议栈入门
 学院 大白话系列算法
 下载 datagrip2020.2.exe
 博客 div动态水平居中的解决
 下载 bitcoinjcore0.16SNAPSHOT.jar
 博客 如何学好Web前端开发，就业待遇如何？
 博客 JavaScript 原型与原型链
 下载 S7200 SMART 系统手册 V2.2
 学院 Java全栈工程师Maven高级
 博客 Java数据结构与算法如何用O(1)的时间复杂度求栈中最小元素
 学院 Python自动化爬虫实战与高级架构技巧
 下载 TIA942《数据中心电信基础设施标准》（中文版）.pdf
 学院 SoapUI接口测试之高阶敏捷实战
 学院 Spark+AI深度学习动手实践
 下载 基于PLL和TDA7010T的无线收发系统设计
 学院 5分钟带你安装Docker
 学院 Kubernetes中间件容器化
 博客 oracle实验报告四
 下载 Oracle Autonomous Data Warehous(ADW)操作手册.docx
 学院 程序员的MySQL学习宝典