When is assembly faster than C?

One of the stated reasons for knowing assembler is that, on occasion, it can be employed to write code that will be more performant than writing that code in a higher-level language, C in particular. However, I've also heard it stated many times that although that's not entirely false, the cases where assembler can actually be used to generate more performant code are both extremely rare and require expert knowledge of and experience with assembly.

This question doesn't even get into the fact that assembler instructions will be machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.

Can anyone provide some specific examples of cases where assembly will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric these cases are, since it seems to be a point of some contention.

Reposted from: https://stackoverflow.com/questions/577554/when-is-assembly-faster-than-c

Answered by 必承其重 | 欲带皇冠, 2009-07-03 17:06

Here is a real-world example: fixed-point multiplies on old compilers.

These not only come in handy on devices without floating point, they also shine when it comes to precision: they give you 32 bits of precision with a predictable error (float has only a 23-bit mantissa, and its precision loss is harder to predict). That is, you get uniform absolute precision over the entire range, instead of close-to-uniform relative precision (float).

Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see:

Getting the high part of 64 bit integer multiplication: A portable version using uint64_t for 32x32 => 64-bit multiplies fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems.

_umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.
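As a rough sketch of the `__int128` approach mentioned above: the helper below returns the high 64 bits of a 64x64 => 128-bit product. The function name `mulhi_u64` is made up for illustration, and the code assumes a compiler that provides the `unsigned __int128` extension (GCC or Clang on a 64-bit target); it is not standard C.

```c
#include <stdint.h>

/* High 64 bits of the full 128-bit product of two unsigned 64-bit values.
   Assumption: the compiler supports the unsigned __int128 extension
   (GCC/Clang on 64-bit targets). mulhi_u64 is a hypothetical name. */
static inline uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
    unsigned __int128 product = (unsigned __int128)a * b;  /* widen, then multiply */
    return (uint64_t)(product >> 64);                      /* keep the upper half */
}
```

Written this way, the compiler can see that only a widening multiply is needed, so it has a chance to use the hardware's full-multiply instruction instead of a generic 128-bit multiply.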

C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

    // On a 32-bit machine, int can hold 32-bit fixed-point integers.
    int inline FixedPointMul (int a, int b)
    {
      long long a_long = a;           // cast to 64 bit.
      long long product = a_long * b; // perform multiplication
      return (int) (product >> 16);   // shift by the fixed point bias
    }

The problem with this code is that we do something that can't be directly expressed in the C language. We want to multiply two 32-bit numbers and get a 64-bit result, of which we return the middle 32 bits. However, in C this multiply does not exist. All you can do is promote the integers to 64 bits and do a 64*64 = 64 multiply.

x86 (and ARM, MIPS and others) can, however, do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 was also often done by a library routine (even though x86 can do such shifts itself).

So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, registers must be preserved across the function calls, and the calls get in the way of inlining and loop unrolling.
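To make the 16.16 convention concrete, here is a small sketch with conversion helpers around the same widen, multiply, shift pattern. The names `q16_16`, `Q16_ONE`, `q16_from_double`, `q16_to_double` and `q16_mul` are invented for this example. Note that right-shifting a negative signed value is implementation-defined in C; the sketch assumes the usual arithmetic shift.

```c
#include <stdint.h>

typedef int32_t q16_16;   /* hypothetical name: 16 integer bits, 16 fraction bits */

#define Q16_ONE 65536     /* the value 1.0 in 16.16 format */

static inline q16_16 q16_from_double(double x) { return (q16_16)(x * Q16_ONE); }
static inline double q16_to_double(q16_16 x)   { return (double)x / Q16_ONE; }

/* Widen to 64 bits, multiply, then drop the 16 extra fraction bits.
   Assumes >> on a negative int64_t is an arithmetic shift. */
static inline q16_16 q16_mul(q16_16 a, q16_16 b)
{
    int64_t product = (int64_t)a * b;
    return (q16_16)(product >> 16);
}
```

For example, multiplying the fixed-point encodings of 1.5 and 2.25 yields the encoding of 3.375.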

If you rewrite the same code in (inline) assembler you can gain a significant speed boost.

That said, using asm is not the best way to solve the problem. Most compilers let you use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET 2008 compiler, for example, exposes the 32*32=64 bit mul as __emul and the 64-bit shift as __ll_rshift.

Using intrinsics, you can rewrite the function in a way that gives the C compiler a chance to understand what's going on. This allows the code to be inlined and register-allocated, and common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.

For reference: The end-result for the fixed-point mul for the VS.NET compiler is:

    int inline FixedPointMul (int a, int b)
    {
      return (int) __ll_rshift(__emul(a,b),16);
    }

The performance difference for fixed-point divides is even bigger. I saw improvements of up to a factor of 10 for division-heavy fixed-point code by writing a couple of lines of asm.
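The answer doesn't show the divide itself, so here is one plausible pure-C formulation for 16.16 values, as a sketch only. The name `fixed_point_div` is hypothetical. On a 32-bit target this kind of code typically turns into a call to a 64-bit division library routine, whereas x86's `idiv` can divide a 64-bit dividend in edx:eax by a 32-bit divisor directly, which is where a few lines of asm could pay off.

```c
#include <stdint.h>

/* 16.16 fixed-point divide (hypothetical helper, not from the original answer).
   The dividend is scaled by 2^16 first so the quotient keeps its fraction bits.
   The caller must ensure b != 0 and that the quotient fits in 32 bits. */
static inline int32_t fixed_point_div(int32_t a, int32_t b)
{
    int64_t dividend = (int64_t)a * 65536;  /* same as a << 16, but defined for negative a */
    return (int32_t)(dividend / b);
}
```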

Using Visual C++ 2013 gives the same assembly code for both ways.

GCC 4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of GCC installed, but presumably even older GCC versions could do this without intrinsics.)

See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)

Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has an ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).

Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32 from <immintrin.h>.
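For bit-scan, GNU C offers `__builtin_ctz` (count trailing zeros) and `__builtin_clz` (count leading zeros). Both are undefined for an input of 0, so the sketch below guards that case. The wrapper names `lowest_set_bit` and `highest_set_bit` are invented for this example, and the code assumes GCC or Clang.

```c
#include <limits.h>

/* Index of the lowest set bit (0-based), or -1 if x == 0.
   Assumes GNU C builtins; __builtin_ctz(0) is undefined, hence the guard. */
static inline int lowest_set_bit(unsigned x)
{
    return x ? __builtin_ctz(x) : -1;
}

/* Index of the highest set bit (0-based), or -1 if x == 0. */
static inline int highest_set_bit(unsigned x)
{
    const int bits = (int)(sizeof x * CHAR_BIT);
    return x ? bits - 1 - __builtin_clz(x) : -1;
}
```

By contrast, POSIX `ffs()` returns a 1-based index and 0 for a zero input, so it doesn't map one-to-one onto either wrapper.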

Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
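For comparison, here is a sketch of both styles in C: a portable loop that a compiler may or may not turn into a single instruction, and the GNU C builtin. The function names are made up for illustration; the builtin version assumes GCC or Clang and a 32-bit `unsigned int`.

```c
#include <stdint.h>

/* Portable loop: each iteration clears the lowest set bit,
   so it runs once per set bit. */
static inline int popcount_loop(uint32_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1;
        count++;
    }
    return count;
}

/* GNU C builtin: compiles to popcnt when the target supports it
   (e.g. with -mpopcnt), otherwise to a fallback sequence. */
static inline int popcount_builtin(uint32_t x)
{
    return __builtin_popcount(x);
}
```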

Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.
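A byte swap can be sketched in portable C with shifts and masks, or via the GNU C builtin `__builtin_bswap32`. The function names below are hypothetical. Compilers often recognize the shift-and-mask pattern and emit a single byte-swap instruction, but the builtin states the intent directly.

```c
#include <stdint.h>

/* Portable byte reversal with shifts and masks. */
static inline uint32_t bswap32_portable(uint32_t x)
{
    return (x >> 24)
         | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u)
         | (x << 24);
}

/* GNU C builtin (assumes GCC or Clang). */
static inline uint32_t bswap32_builtin(uint32_t x)
{
    return __builtin_bswap32(x);
}
```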

Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like the code in "How to implement atoi using SIMD?" generated automatically by the compiler from scalar code.
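To show what manual vectorization looks like for the simple loop above, here is a sketch using SSE2 intrinsics, processing two doubles per iteration. The function name `scale_accumulate` is made up, and the SSE2 path is only compiled when the compiler defines `__SSE2__`; otherwise the plain scalar loop handles everything. For a loop this simple a compiler would likely auto-vectorize on its own, so treat this purely as an illustration of the style.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* dst[i] += src[i] * 10.0 for i in [0, n). Hypothetical example function. */
void scale_accumulate(double *dst, const double *src, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128d ten = _mm_set1_pd(10.0);     /* {10.0, 10.0} */
    for (; i + 2 <= n; i += 2) {
        __m128d s = _mm_loadu_pd(src + i);     /* unaligned load of 2 doubles */
        __m128d d = _mm_loadu_pd(dst + i);
        d = _mm_add_pd(d, _mm_mul_pd(s, ten)); /* d += s * 10.0 */
        _mm_storeu_pd(dst + i, d);
    }
#endif
    for (; i < n; i++)                         /* scalar tail (or whole loop without SSE2) */
        dst[i] += src[i] * 10.0;
}
```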