When is assembly faster than C?

One of the stated reasons for knowing assembler is that, on occasion, it can be employed to write code that will be more performant than writing that code in a higher-level language, C in particular. However, I've also heard it stated many times that although that's not entirely false, the cases where assembler can actually be used to generate more performant code are both extremely rare and require expert knowledge of and experience with assembly.

This question doesn't even get into the fact that assembler instructions will be machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.

Can anyone provide some specific examples of cases where assembly will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric these cases are, since it seems to be a point of some contention.

Reposted from: https://stackoverflow.com/questions/577554/when-is-assembly-faster-than-c

csdnceshi65
larry*wei Earlier in my career, I was writing a lot of C and mainframe assembler at a software company. One of my peers was what I'd call an "assembler purist" (everything had to be assembler), so I bet him I could write a given routine that ran faster in C than what he could write in assembler. I won. But to top it off, after I won, I told him I wanted a second bet - that I could write something faster in assembler than the C program that beat him on the prior wager. I won that too, proving that most of it comes down to the skill and ability of the programmer more than anything else.
about 3 years ago
weixin_41568134
MAO-EYE For an esoteric example, do a web search for pclmulqdq crc. pclmulqdq is a special assembly instruction. The optimized examples use about 500 lines of assembly code. Some x86 CPUs also have a crc32c instruction for a specific case of crc32. Benchmark results to generate crc32 over a 256MB (256*1024*1024) byte array: C code using a table => 0.516749 sec, assembly using pclmulqdq => 0.0783919 sec, C code using the crc32 intrinsic => 0.0541801 sec.
over 3 years ago
csdnceshi78
程序go It's not even always the case that you need to rewrite something in assembly to reap the benefits of knowing assembly. Simply recompiling your C algorithm in various forms and observing the assembly that the compiler generates will allow you to write more efficient code in C.
over 3 years ago
csdnceshi56
lrony* I strongly disagree that answers to this question need to be "opinion based" - they can be quite objective - it is not something like trying to compare performance of favorite pet languages, for which each will have strong points and drawbacks. This is a matter of understanding how far compilers can take us, and from which point it is better to take over.
about 5 years ago
weixin_41568174
from.. All the greatest questions have 256 votes.
over 5 years ago
weixin_41568127
?yb? Actually, the short answer is: assembler is always faster than or equal to the speed of C. The reason is that you can have assembly without C, but you can't have C without assembly (in the binary form, which we in the old days called "machine code"). That said, the long answer is: C compilers are pretty good at optimizing and "thinking" about things you don't usually think of, so it really depends on your skills, but normally you can always beat the C compiler; it's still only software that can't think and get ideas. You can also write portable assembler if you use macros and you're patient.
over 5 years ago
csdnceshi69
YaoRaoLov Actually it is quite trivial to improve upon compiled code. Anyone with a solid knowledge of assembly language and C can see this by examining the code generated. An easy one is the first performance cliff you fall off of when you run out of disposable registers in the compiled version. On average the compiler will do far better than a human for a large project, but it is not hard in a decent-sized project to find performance issues in the compiled code.
nearly 6 years ago
csdnceshi58
Didn"t forge Unfortunately questions like this would be closed for being too broad, or opinionated in 5 seconds these days.
6 年多之前 回复
csdnceshi50
三生石@ As someone just beginning to learn asm, I find the responses to this question very useful.
nearly 7 years ago
weixin_41568184
叼花硬汉 One of the greatest questions I've seen. Thank you Adam!
nearly 8 years ago
weixin_41568196
撒拉嘿哟木头 And now another question would be appropriate: when does the fact that assembler is faster than C actually matter?
about 11 years ago

30 answers

Here is a real world example: Fixed point multiplies on old compilers.

These are not only handy on devices without floating point; they also shine when it comes to precision, as they give you 32 bits of precision with a predictable error (float only has 23 bits and it's harder to predict precision loss), i.e. uniform absolute precision over the entire range instead of close-to-uniform relative precision (float).


Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see the later parts of this answer (popcount and SIMD).


C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
  long long a_long = a; // cast to 64 bit.

  long long product = a_long * b; // perform multiplication

  return (int) (product >> 16);  // shift by the fixed point bias
}

The problem with this code is that we do something that can't be directly expressed in the C language. We want to multiply two 32-bit numbers and get a 64-bit result, of which we return the middle 32 bits. However, in C this multiply does not exist. All you can do is promote the integers to 64 bit and do a 64*64 = 64 multiply.

x86 (and ARM, MIPS and others) can however do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine, although the x86 can do such shifts itself.

So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, registers must be preserved across the function calls and it does not help inlining and code-unrolling either.

If you rewrite the same code in (inline) assembler you can gain a significant speed boost.

In addition to this: using ASM is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET2008 compiler for example exposes the 32*32=64 bit mul as __emul and the 64 bit shift as __ll_rshift.

Using intrinsics you can rewrite the function in a way that gives the C compiler a chance to understand what's going on. This allows the code to be inlined and register-allocated, and common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.

For reference: The end-result for the fixed-point mul for the VS.NET compiler is:

int inline FixedPointMul (int a, int b)
{
    return (int) __ll_rshift(__emul(a,b),16);
}

The performance difference of fixed point divides is even bigger. I had improvements up to factor 10 for division heavy fixed point code by writing a couple of asm-lines.


Using Visual C++ 2013 gives the same assembly code for both ways.

gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)

See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)
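For completeness, the same cast-and-multiply idiom scales up to 64-bit operands on compilers that provide a 128-bit integer type. A minimal sketch, assuming GCC or Clang on a 64-bit target (the FixedPointMul64 name is just for illustration):

long long inline FixedPointMul64 (long long a, long long b)
{
  __int128 product = (__int128)a * b;  // full 128-bit product, recognized as one widening mul
  return (long long)(product >> 16);   // shift by the fixed point bias
}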


Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).

Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32 from <immintrin.h>.

Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
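For illustration, a small sketch of the two portable options just mentioned (the builtin is GNU C specific; the std::bitset version is standard C++ and lets the library pick an optimized implementation for the target):

#include <bitset>

unsigned popcount_builtin (unsigned x)
{
    return __builtin_popcount(x);   // GCC/Clang builtin; may compile to a single popcnt
}

unsigned popcount_bitset (unsigned x)
{
    return (unsigned) std::bitset<32>(x).count();
}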

Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.
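As a small sketch of that idea, here is a portable shift-and-mask byte swap that recent GCC and Clang usually compile to a single bswap on x86, next to the GCC/Clang builtin that requests it explicitly (whether the pattern is recognized depends on the compiler and version):

unsigned swap32_portable (unsigned x)   // assumes a 32-bit unsigned
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

unsigned swap32_builtin (unsigned x)
{
    return __builtin_bswap32(x);        // GCC/Clang builtin, bswap on x86
}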


Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like "How to implement atoi using SIMD?" generated automatically by the compiler from scalar code.
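As a minimal sketch of what manual vectorization with intrinsics looks like, here is the simple loop above done two elements at a time with SSE2 (assuming x86, and for brevity that n is a multiple of two; a real version would also need a scalar tail loop):

#include <emmintrin.h>   // SSE2

void scale_add (double *dst, const double *src, int n)
{
    __m128d factor = _mm_set1_pd(10.0);            // broadcast 10.0 into both lanes
    for (int i = 0; i < n; i += 2)
    {
        __m128d s = _mm_loadu_pd(src + i);         // load two doubles
        __m128d d = _mm_loadu_pd(dst + i);
        d = _mm_add_pd(d, _mm_mul_pd(s, factor));  // d += s * 10.0
        _mm_storeu_pd(dst + i, d);                 // store two doubles back
    }
}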

csdnceshi54
hurriedly% To be fair, this example is a poor one, at least today. C compilers have long been able to do a 32x32 -> 64 multiply even if the language doesn't offer it directly: they recognize that when you cast 32-bit arguments to 64-bit and then multiply them, they don't need to do a full 64-bit multiply, but that a 32x32 -> 64 will do just fine. I checked and all of clang, gcc and MSVC in their current versions get this right. This isn't new - I remember looking at compiler output and noticing this a decade ago.
about 2 years ago
csdnceshi60
℡Wang Yan Amazing answer. I had to look up several (3) in-depth things discussed here that I didn't know, in order to understand it. This is probably because I don't know much about how compilers work yet. But I will soon. :)
about 3 years ago
csdnceshi78
程序go the point of the answer is to show that writing optimized assembly code by hand is not even always the best answer, because the compiler does not know the intention of your code. The intrinsic lets the compiler know what you intend to do, and it allows it to even further optimize it through various features. Usually such intrinsics translate to relatively simple assembler code, but they carry extra information that the compiler can use in its optimization phase. Plus, if the target platform does not support it, the compiler can provide a compatible alternative.
nearly 5 years ago
csdnceshi50
三生石@ Note that at least with regard to register allocation flexibility, one should use "extended inline assembly" rather than simple asm() calls. This way, the compiler is able to allocate registers at build time.
about 5 years ago
csdnceshi52
妄徒之命 Originally you were showing us a case where asm is more efficient than C. But what did you finish with? __ll_rshift is a C construct! Compiler-specific, admittedly, but not asm.
nearly 6 years ago
csdnceshi77
狐狸.fox Actually, the code here is quite readable: the inline code does one unique operation, which is immediately understandable from reading the method signature. The code loses only a little readability when an obscure instruction is used. What matters here is that we have a method which does only one clearly identifiable operation, and that's really the best way to produce readable code for these atomic functions. By the way, this is not so obscure that a small comment like /* (a * b) >> 16 */ can't immediately explain it.
over 6 years ago
csdnceshi76
斗士狗 The "right" solution would be for the language to provide a way of requesting what one wants to do. Writing code whose naive interpretation would be horribly inefficient in the hope that the compiler will perform a particular optimization is a good way to get burned when switching compiler versions. This can be especially true in cases where one knows things about the sizes of operands that the compiler does not (for example, what compilers could optimize uint_quotient=ulong_dividend/uint_divisor in the case where ulong_dividend is known to be less than 2^32 times uint_divisor?)
over 6 years ago
weixin_41568196
撒拉嘿哟木头 Also, for these one-liners it doesn't hurt to use an #if/#else preprocessor statement.
over 6 years ago
weixin_41568208
北城已荒凉 Hi Slacker, I think you've never had to work on time-critical code before... inline assembly can make a *huge* difference. Also, for the compiler an intrinsic is the same as normal arithmetic in C. That's the point of intrinsics. They let you use an architecture feature without having to deal with the drawbacks.
nearly 10 years ago
csdnceshi70
笑故挽风 Actually, a good compiler would produce the optimal code from the first function. Obscuring the source code with intrinsics or inline assembly with absolutely no benefit is not the best thing to do.
nearly 10 years ago
csdnceshi74
7*4 How about things like {x=c%d; y=c/d;}, are compilers clever enough to make that a single div or idiv?
about 10 years ago
csdnceshi57
perhaps? I've long known about this one.
about 11 years ago

Many years ago I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies and divides etc.

I showed him how to recast the problem using bit shifts, and the time to process came down to about 30 seconds on the non-optimizing compiler he had.

I had just got an optimizing compiler and the same code rotated the graphic in < 5 seconds. I looked at the assembly code that the compiler was generating, and from what I saw decided there and then that my days of writing assembler were over.
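The original routine isn't shown, but as a hypothetical sketch of the kind of rewrite described above: addressing a 1-bit-per-pixel bitmap with shifts and masks instead of multiplies and divides (the names and the power-of-two row pitch are assumptions):

enum { ROW_SHIFT = 5 };   // assumed 256-pixel-wide bitmap: 32 bytes per row

int get_pixel (const unsigned char *bitmap, int x, int y)
{
    int index = (y << ROW_SHIFT) + (x >> 3);   // y*32 + x/8 without a mul or div
    int bit   = 7 - (x & 7);                   // x%8 without a divide
    return (bitmap[index] >> bit) & 1;
}

void set_pixel (unsigned char *bitmap, int x, int y)
{
    bitmap[(y << ROW_SHIFT) + (x >> 3)] |= (unsigned char)(0x80 >> (x & 7));
}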

csdnceshi76
斗士狗 I really can't think of any platform where a compiler would be likely to get within a factor of two of optimal code for an 8x8 rotate.
over 2 years ago
csdnceshi76
斗士狗 On what processor? On 8086, I'd expect that optimal code for an 8x8 rotate would load DI with 16 bits of data using SI, repeat add di,di / adc al,al / add di,di / adc ah,ah etc. for all eight 8-bit registers, then do all 8 registers again, and then repeat the whole procedure three more times, and finally save four words in ax/bx/cx/dx. No way a compiler is going to come close to that.
over 2 years ago
csdnceshi69
YaoRaoLov Did the optimizing compiler compile the original program or your version?
over 5 years ago
csdnceshi77
狐狸.fox He may have seen code he couldn't write :/
nearly 6 years ago
csdnceshi52
妄徒之命 Yes it was a one bit monochrome system, specifically it was the monochrome image blocks on an Atari ST.
over 11 years ago
csdnceshi78
程序go Just wondering: Was the graphic in 1 bit per pixel format?
over 11 years ago

A use case which might not apply anymore but for your nerd pleasure: On the Amiga, the CPU and the graphics/audio chips would fight for accessing a certain area of RAM (the first 2MB of RAM to be specific). So when you had only 2MB RAM (or less), displaying complex graphics plus playing sound would kill the performance of the CPU.

In assembler, you could interleave your code in such a clever way that the CPU would only try to access the RAM when the graphics/audio chips were busy internally (i.e. when the bus was free). So by reordering your instructions, with clever use of the CPU cache and the bus timing, you could achieve some effects which were simply not possible using any higher-level language, because you had to time every command, even insert NOPs here and there, to keep the various chips out of each other's radar.

Which is another reason why the NOP (No Operation - do nothing) instruction of the CPU can actually make your whole application run faster.

[EDIT] Of course, the technique depends on a specific hardware setup. Which was the main reason why many Amiga games couldn't cope with faster CPUs: The timing of the instructions was off.

csdnceshi76
斗士狗 This sounds like the sort of optimization that one could program a C compiler to be very good at.
nearly 4 years ago
weixin_41568131
10.24 My mistake. The 68k CPU had only 24 address lines, that's why I had the 16MB in my head.
over 11 years ago
csdnceshi71
Memor.の Digulla: Wikipedia has more info about the distinctions between chip/fast/slow RAM: en.wikipedia.org/wiki/Amiga_Chip_RAM
over 11 years ago
weixin_41568131
10.24 I stand corrected. My memory may fail me but wasn't chip RAM restricted to the first 24bit address space (i.e. 16MB)? And Fast was mapped above that?
over 11 years ago
csdnceshi58
Didn"t forge - Amiga produced a large range of different models of computers, the Amiga 500 shipped with 512K ram extended to 1Meg in my case. amigahistory.co.uk/amiedevsys.html is an amiga with 128Meg Ram
11 年多之前 回复
csdnceshi71
Memor.の The Amiga didn't have 16 MB of chip RAM, more like 512 kB to 2 MB depending on chipset. Also, a lot of Amiga games didn't work with faster CPUs due to techniques like you describe.
over 11 years ago

http://cr.yp.to/qhasm.html has many examples.

Longpoke, there is just one limitation: time. When you don't have the resources to optimize every single change to the code and spend your time allocating registers, optimizing a few spills away and whatnot, the compiler will win every single time. You make your modification to the code, recompile and measure. Repeat if necessary.

Also, you can do a lot on the high-level side. And inspecting the resulting assembly may give the IMPRESSION that the code is crap, but in practice it will run faster than the version you would have thought quicker. Example:

int y = data[i];
// do some stuff here..
call_function(y, ...);

The compiler will read the data, push it to the stack (spill), and later read it back from the stack and pass it as an argument. Sounds shite? It might actually be very effective latency compensation and result in a faster runtime.

// optimized version
call_function(data[i], ...); // not so optimized after all..

The idea with the optimized version was that we reduce register pressure and avoid spilling. But in truth, the "shitty" version was faster!

Looking at the assembly code, just counting the instructions and concluding that more instructions means slower, would be a misjudgment.

The thing to pay attention to here is that many assembly experts think they know a lot, but know very little. The rules also change from one architecture to the next. There is no silver-bullet x86 code, for example, that is always the fastest. These days it's better to go by rules of thumb:

  • memory is slow
  • cache is fast
  • try to use the cache better
  • how often are you going to miss? do you have a latency-compensation strategy?
  • you can execute 10-100 ALU/FPU/SSE instructions for one single cache miss
  • application architecture is important..
  • .. but it doesn't help when the problem isn't in the architecture

Also, trusting the compiler to magically transform poorly-thought-out C/C++ code into "theoretically optimal" code is wishful thinking. You have to know the compiler and toolchain you use if you care about performance at this low level.

Compilers in C/C++ are generally not very good at re-ordering sub-expressions because the functions have side effects, for starters. Functional languages don't suffer from this caveat but don't fit the current ecosystem that well. There are compiler options to allow relaxed precision rules which allow order of operations to be changed by the compiler/linker/code generator.

This topic is a bit of a dead-end; for most it's not relevant, and the rest, they know what they are doing already anyway.

It all boils down to this: "understanding what you are doing" is a bit different from knowing what you are doing.

Pretty much any time the compiler sees floating-point code, a hand-written version will be quicker. The primary reason is that the compiler can't perform any robust optimisations. See this article from MSDN for a discussion on the subject. Here's an example where the assembly version is twice the speed of the C version (compiled with VS2K5):

#include "stdafx.h"
#include <windows.h>

float KahanSum
(
  const float *data,
  int n
)
{
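   // Kahan compensated summation: C carries the rounding error lost in each
   // addition so it can be fed back in on the next iteration.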
   float
     sum = 0.0f,
     C = 0.0f,
     Y,
     T;

   for (int i = 0 ; i < n ; ++i)
   {
      Y = *data++ - C;
      T = sum + Y;
      C = T - sum - Y;
      sum = T;
   }

   return sum;
}

float AsmSum
(
  const float *data,
  int n
)
{
  float
    result = 0.0f;

  _asm
  {
    mov esi,data
    mov ecx,n
    fldz
    fldz
l1:
    fsubr [esi]
    add esi,4
    fld st(0)
    fadd st(0),st(2)
    fld st(0)
    fsub st(0),st(3)
    fsub st(0),st(2)
    fstp st(2)
    fstp st(2)
    loop l1
    fstp result
    fstp result
  }

  return result;
}

int main (int, char **)
{
  int
    count = 1000000;

  float
    *source = new float [count];

  for (int i = 0 ; i < count ; ++i)
  {
    source [i] = static_cast <float> (rand ()) / static_cast <float> (RAND_MAX);
  }

  LARGE_INTEGER
    start,
    mid,
    end;

  float
    sum1 = 0.0f,
    sum2 = 0.0f;

  QueryPerformanceCounter (&start);

  sum1 = KahanSum (source, count);

  QueryPerformanceCounter (&mid);

  sum2 = AsmSum (source, count);

  QueryPerformanceCounter (&end);

  cout << "  C code: " << sum1 << " in " << (mid.QuadPart - start.QuadPart) << endl;
  cout << "asm code: " << sum2 << " in " << (end.QuadPart - mid.QuadPart) << endl;

  return 0;
}

And some numbers from my PC running a default release build*:

  C code: 500137 in 103884668
asm code: 500137 in 52129147

Out of interest, I swapped the loop instruction for a dec/jnz pair and it made no difference to the timings - sometimes quicker, sometimes slower. I guess the memory-limited aspect dwarfs other optimisations.

Whoops, I was running a slightly different version of the code and it output the numbers the wrong way round (i.e. C looked faster!). Fixed and updated the results.

csdnceshi65
larry*wei FP add is commutative (a+b == b+a), but not associative (reordering of operations, so rounding of intermediates is different). re: this code: I don't think uncommented x87 and a loop instruction are a very awesome demonstration of fast asm. loop is apparently not actually a bottleneck because of FP latency. I'm not sure if he's pipelining FP operations or not; x87 is hard for humans to read. The two fstp result instructions at the end are clearly not optimal. Popping the extra result from the stack would be better done with a non-store, like fstp st(0) IIRC.
over 4 years ago
csdnceshi74
7*4 are there?
over 4 years ago
csdnceshi64
游.程 You mean associative? Or are there really cases when operations on reals are commutative but floating points are not?
over 4 years ago
csdnceshi73
喵-见缝插针 Did you try SSE math? Performance was one of the reasons MS abandoned x87 completely in x86_64 and 80-bit long double in x86
over 6 years ago
weixin_41568126
乱世@小熊 Yeah, floats are not commutative, the compiler must do EXACTLY what you wrote, basically what @DavidStone said.
over 6 years ago
weixin_41568174
from.. Or in GCC, you can untie the compiler's hands on floating point optimization (as long as you promise not to do anything with infinities or NaNs) by using the flag -ffast-math. They have an optimization level, -Ofast that is currently equivalent to -O3 -ffast-math, but in the future may include more optimizations that can lead to incorrect code generation in corner cases (such as code that relies on IEEE NaNs).
nearly 8 years ago
csdnceshi56
lrony* Nice one; using the VS compiler I got a similar result (asm is faster). When using /fp:fast as mentioned in the MSDN article, the C version goes faster.
about 10 years ago
csdnceshi70
笑故挽风 I used to do a bit of FPU assembly back in the day, but currently on x86, if you need to do hand-optimised FPU assembly you should be doing it with the extended instruction sets like SSE etc., as you won't gain much real-world performance using the FPU.
over 11 years ago
csdnceshi68
local-host True - if you're memory bound not much can be done.
over 11 years ago
csdnceshi68
local-host FYI: The code could be even faster if you replace the loop with sub ecx,1 / jnz l1. loop is a lot slower than it could be (for a reason, but that's another topic).
over 11 years ago
csdnceshi78
程序go +1 for actually doing the profiling, but it'd be nice for you to include the output in your answer.
over 11 years ago

Point one, which is not the answer:
Even if you never program in it, I find it useful to know at least one assembler instruction set. This is part of the programmer's never-ending quest to know more and therefore be better. It is also useful when stepping into frameworks you don't have the source code to, so you have at least a rough idea of what is going on. It also helps you understand Java bytecode and .NET IL, as they are both similar to assembler.

To answer the question: when you have a small amount of code or a large amount of time. It is most useful in embedded chips, where low chip complexity and poor competition among compilers targeting these chips can tip the balance in favour of humans. Also, for restricted devices you are often trading off code size/memory size/performance in a way that would be hard to instruct a compiler to do. E.g. "I know this user action is not called often, so I will have small code size and poor performance, but this other function that looks similar is used every second, so I will have a larger code size and faster performance." That is the sort of trade-off a skilled assembly programmer can use.

I would also like to add that there is a lot of middle ground, where you can code in C, compile and examine the assembly produced, then either change your C code or tweak the assembly and maintain it as assembly.

My friend works on microcontrollers, currently chips for controlling small electric motors. He works in a combination of low-level C and assembly. He once told me of a good day at work where he reduced the main loop from 48 instructions to 43. He is also faced with choices like: the code has grown to fill the 256K chip and the business wants a new feature; do you

  1. Remove an existing feature
  2. Reduce the size of some or all of the existing features maybe at the cost of performance.
  3. Advocate moving to a larger chip with a higher cost, higher power consumption and larger form factor.

I would like to add that, as a commercial developer with quite a portfolio of languages, platforms and types of applications, I have never once felt the need to dive into writing assembly. I have, however, always appreciated the knowledge I gained about it. And I have sometimes debugged into it.

I know I have really answered the question "why should I learn assembler?" more than the one asked, but I feel it is a more important question than when it is faster.

So let's try once more: you should be thinking about assembly when

  • working on low-level operating system functions
  • working on a compiler
  • working on an extremely limited chip, embedded system, etc.

Remember to compare your assembly to what the compiler generates, to see which is faster/smaller/better.

David.

csdnceshi54
hurriedly% Time-critical embedded applications are a great example! There are often weird instructions (even really simple ones like AVR's sbi and cbi) that compilers used to (and sometimes still do) not take full advantage of, due to their limited knowledge of the hardware.
over 2 years ago
csdnceshi52
妄徒之命 +1 for considering embedded applications on tiny chips. Too many software engineers here either don't consider embedded or think that means a smart phone (32 bit, MB RAM, MB flash).
over 10 years ago

One of the possibilities in the CP/M-86 version of PolyPascal (a sibling of Turbo Pascal) was to replace the "use the BIOS to output characters to the screen" facility with a machine-language routine which in essence was given the x and y coordinates and the string to put there.

This allowed the screen to be updated much, much faster than before!

There was room in the binary to embed machine code (a few hundred bytes) and there was other stuff there too, so it was essential to squeeze in as much as possible.

It turns out that since the screen was 80x25, each coordinate could fit in a byte, so both could fit in a two-byte word. This allowed the needed calculations to be done in fewer bytes, since a single add could manipulate both values simultaneously.

To my knowledge there are no C compilers which can merge multiple values into a register, do SIMD-style instructions on them and split them out again later (and I don't think the machine instructions would be shorter anyway).
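As a rough C sketch of the packing trick described above (the names are hypothetical; the original was hand-written machine code): both coordinates live in one 16-bit word, so a single add can step x and y together, as long as neither byte overflows into the other:

typedef unsigned short coord_pair;        // low byte = x, high byte = y

coord_pair pack_xy (unsigned char x, unsigned char y)
{
    return (coord_pair)(x | (y << 8));
}

coord_pair move_down_right (coord_pair p)
{
    return (coord_pair)(p + 0x0101);      // adds 1 to x and 1 to y in a single add
}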

Without giving any specific example or profiler evidence, you can write better assembler than the compiler when you know more than the compiler.

In the general case, a modern C compiler knows much more about how to optimize the code in question: it knows how the processor pipeline works, it can try to reorder instructions quicker than a human can, and so on - it's basically the same as a computer being as good as or better than the best human player for boardgames, etc. simply because it can make searches within the problem space faster than most humans. Although you theoretically can perform as well as the computer in a specific case, you certainly can't do it at the same speed, making it infeasible for more than a few cases (i.e. the compiler will most certainly outperform you if you try to write more than a few routines in assembler).

On the other hand, there are cases where the compiler does not have as much information - I'd say primarily when working with different forms of external hardware, of which the compiler has no knowledge. The primary example probably being device drivers, where assembler combined with a human's intimate knowledge of the hardware in question can yield better results than a C compiler could do.

Others have mentioned special-purpose instructions, which is what I'm talking about in the paragraph above - instructions of which the compiler might have limited or no knowledge at all, making it possible for a human to write faster code.

weixin_41568183
零零乙 Modern compilers do a lot, and it would take way too long to do by hand, but they're nowhere near perfect. Search gcc or llvm's bug trackers for "missed-optimization" bugs. There are many. Also, when writing in asm, you can more easily take advantage of preconditions like "this input can't be negative" that would be hard for a compiler to prove.
over 4 years ago
csdnceshi52
妄徒之命 "it can try to reorder instructions quicker than a human can". OCaml is known for being fast and, surprisingly, its native-code compiler ocamlopt skips instruction scheduling on x86 and, instead, leaves it up to the CPU because it can reorder more effectively at run-time.
over 8 years ago
weixin_41568131
10.24 Generally, this statement is true. The compiler does its best to DWIW, but in some edge cases hand-coding assembler gets the job done when realtime performance is a must.
over 11 years ago

I have a bit-transposition operation that needs to be done on 192 or 256 bits every interrupt, which happens every 50 microseconds.

It happens according to a fixed map (hardware constraints). Using C, it took around 10 microseconds. When I translated this to assembler, taking into account the specific features of this map, specific register caching, and bit-oriented operations, it took less than 3.5 microseconds to perform.
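The actual map is hardware-specific and not given, so purely as a hypothetical sketch of the contrast: a naive C loop that moves one bit per iteration, versus moving whole groups of bits at once when the fixed map happens to allow it:

typedef unsigned int u32;                 // assumed 32-bit

u32 permute_naive (u32 x, const unsigned char map[32])
{
    u32 r = 0;
    for (int i = 0; i < 32; ++i)          // one bit per iteration: simple but slow
        r |= ((x >> map[i]) & 1u) << i;
    return r;
}

u32 permute_grouped (u32 x)               // made-up map: full byte reversal
{
    x = (x << 16) | (x >> 16);                                 // swap 16-bit halves
    x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);   // swap bytes within halves
    return x;
}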
