When is assembly faster than C?

One of the stated reasons for knowing assembler is that, on occasion, it can be employed to write code that will be more performant than writing that code in a higher-level language, C in particular. However, I've also heard it stated many times that although that's not entirely false, the cases where assembler can actually be used to generate more performant code are both extremely rare and require expert knowledge of and experience with assembly.

This question doesn't even get into the fact that assembler instructions will be machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.

Can anyone provide some specific examples of cases where assembly will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric these cases are, since it seems to be a point of some contention.

Reposted from: https://stackoverflow.com/questions/577554/when-is-assembly-faster-than-c

Didn"t forge 2009-07-03 17:06
Here is a real world example: Fixed point multiplies on old compilers.

These don't only come in handy on devices without floating point; they also shine when it comes to precision, as they give you 32 bits of precision with a predictable error (float has only a 23-bit mantissa, and its precision loss is harder to predict). That is, you get uniform absolute precision over the entire range, instead of close-to-uniform relative precision (float).

Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see

Getting the high part of 64 bit integer multiplication: A portable version using uint64_t for 32x32 => 64-bit multiplies fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems.

_umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.
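As a rough illustration of the first item, here is a minimal sketch of getting the high half of a 64x64-bit multiply. The helper names `mulhi_u64` and `mulhi_u64_portable` are hypothetical and not taken from the linked answers. The sketch assumes a GNU C compatible compiler for the `unsigned __int128` path and falls back to the portable version elsewhere.

```c
#include <stdint.h>

/* Portable fallback: build the high 64 bits from four 32x32 => 64-bit
   partial products. Correct, but compilers may not collapse it into a
   single widening multiply instruction. */
static uint64_t mulhi_u64_portable(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    /* Sum the middle terms; this cannot overflow 64 bits. */
    uint64_t cross = (lo_lo >> 32) + (uint32_t)hi_lo + lo_hi;

    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* High 64 bits of an unsigned 64x64 => 128-bit product. */
static uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
#ifdef __SIZEOF_INT128__
    /* GNU C extension, available on 64-bit targets. */
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
#else
    return mulhi_u64_portable(a, b);
#endif
}
```

Comparing the generated assembly for the two functions in a compiler explorer is a quick way to see whether a given compiler needs the wider type or an intrinsic here.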

C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:

    // On a 32-bit machine, int can hold 32-bit fixed-point integers.
    int inline FixedPointMul (int a, int b)
    {
      long long a_long = a;            // cast to 64 bit.

      long long product = a_long * b;  // perform multiplication

      return (int) (product >> 16);    // shift by the fixed point bias
    }

The problem with this code is that we do something that can't be directly expressed in the C language. We want to multiply two 32-bit numbers and get a 64-bit result, of which we return the middle 32 bits. However, this multiply does not exist in C. All you can do is promote the integers to 64 bits and do a 64x64 => 64-bit multiply.
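To make the "middle 32 bits" concrete, here is a small worked sketch using fixed-width types and a 16.16 format. The names `fix_from_double`, `fix_to_double`, and `fix_mul` are hypothetical helpers for illustration. The right shift of a negative value is implementation-defined in C; the sketch assumes the usual arithmetic shift.

```c
#include <stdint.h>

#define FIX_SHIFT 16                 /* 16 fractional bits (16.16 format) */
#define FIX_ONE   (1 << FIX_SHIFT)   /* fixed-point representation of 1.0 */

/* Conversion helpers, for illustration only. */
static int32_t fix_from_double(double x) { return (int32_t)(x * FIX_ONE); }
static double  fix_to_double(int32_t f)  { return (double)f / FIX_ONE; }

/* 32x32 => 64-bit multiply, then keep bits 16..47 of the product. */
static int32_t fix_mul(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * b;  /* widen first so no bits are lost */
    return (int32_t)(product >> FIX_SHIFT);
}
```

For example, 1.5 is `0x18000` and 2.0 is `0x20000`; their 64-bit product is `0x300000000`, and shifting right by 16 gives `0x30000`, which is 3.0.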

x86 (and ARM, MIPS and others) can, however, do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine (even though x86 can do such shifts as well).

So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, but registers must also be preserved across the function calls, and the calls get in the way of inlining and loop unrolling.

If you rewrite the same code in (inline) assembler you can gain a significant speed boost.

In addition to this: using ASM is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET2008 compiler for example exposes the 32*32=64 bit mul as __emul and the 64 bit shift as __ll_rshift.

Using intrinsics you can rewrite the function in a way that gives the C compiler a chance to understand what's going on. This allows the code to be inlined and register-allocated, and common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.

For reference: The end-result for the fixed-point mul for the VS.NET compiler is:

    int inline FixedPointMul (int a, int b)
    {
        return (int) __ll_rshift(__emul(a,b),16);
    }

The performance difference for fixed-point divides is even bigger. I had improvements of up to a factor of 10 for division-heavy fixed-point code by writing a couple of lines of asm.
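The answer does not show the division code, so the following is only an assumed sketch of what a 16.16 fixed-point divide typically looks like in plain C. The name `fix_div` is hypothetical. On a 32-bit target, the 64-bit by 32-bit division here is the kind of operation that may turn into a library call rather than a single divide instruction.

```c
#include <stdint.h>

/* 16.16 fixed-point divide: pre-scale the dividend by 2^16 in 64 bits,
   then divide. The caller must ensure b != 0 and that the quotient
   fits in 32 bits. */
static int32_t fix_div(int32_t a, int32_t b)
{
    /* Multiply instead of shifting left, which would be undefined
       for negative values. */
    int64_t scaled = (int64_t)a * 65536;
    return (int32_t)(scaled / b);
}
```

For example, 3.0 / 2.0 is `fix_div(0x30000, 0x20000)`, which yields `0x18000` (1.5).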

Using Visual C++ 2013 gives the same assembly code for both ways.

gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)

See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)

Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).
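A minimal sketch of the difference, assuming GNU C builtins (`__builtin_ctz`, `__builtin_clz`) and a POSIX system for `ffs()`. The wrapper names are hypothetical. Note that `ffs()` is 1-based and returns 0 for an input of 0, while the builtins are 0-based and undefined for 0, much like `bsf` / `bsr`.

```c
#include <limits.h>
#include <strings.h>   /* POSIX ffs() */

/* 0-based index of the lowest set bit; x must be nonzero. */
static int lowest_set_bit(unsigned x)
{
    return __builtin_ctz(x);
}

/* 0-based index of the highest set bit; x must be nonzero. */
static int highest_set_bit(unsigned x)
{
    return (int)(sizeof x * CHAR_BIT - 1) - __builtin_clz(x);
}
```

With these, `lowest_set_bit(8)` is 3, whereas `ffs(8)` is 4.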

Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or, on x86 if you're only targeting hardware with SSE4.2, _mm_popcnt_u32 from <immintrin.h>.

Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
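The two C approaches to popcount mentioned above can be sketched side by side. The function names are hypothetical, and `__builtin_popcount` assumes a GNU C compatible compiler. Whether the loop version is recognized and turned into a single `popcnt` depends on the compiler and target options.

```c
#include <stdint.h>

/* Portable loop: clear the lowest set bit until none remain. */
static int popcount_loop(uint32_t x)
{
    int count = 0;
    while (x) {
        x &= x - 1;   /* drops the lowest set bit */
        ++count;
    }
    return count;
}

/* GNU C builtin: compiles to popcnt when the target supports it,
   and to a correct fallback sequence otherwise. */
static int popcount_builtin(uint32_t x)
{
    return __builtin_popcount(x);
}
```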

Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.
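For comparison, here is a sketch of an unconditional 32-bit byte swap written two ways: a portable shift-and-mask version and the GNU C builtin `__builtin_bswap32`. The wrapper names are hypothetical. Unlike `ntohl`, both always swap, regardless of the host's endianness.

```c
#include <stdint.h>

/* Portable byte swap using shifts and masks. */
static uint32_t bswap32_portable(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) <<  8) |
           ((x & 0x00FF0000u) >>  8) |
           ((x & 0xFF000000u) >> 24);
}

/* GNU C builtin; on x86 this can become a single bswap instruction. */
static uint32_t bswap32_builtin(uint32_t x)
{
    return __builtin_bswap32(x);
}
```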

Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like the code in "How to implement atoi using SIMD?" generated automatically by the compiler from scalar code.
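To show what manual vectorization looks like for the simple loop above, here is a minimal sketch using SSE2 intrinsics. It assumes `double` arrays and an x86 target with SSE2, and it falls back to the scalar loop elsewhere. The function name `scale_add` is hypothetical. For a loop this simple, a compiler would likely produce similar code on its own; the point is only to illustrate the style.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* dst[i] += src[i] * 10.0 for i in [0, n). */
static void scale_add(double *dst, const double *src, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128d ten = _mm_set1_pd(10.0);
    /* Process two doubles per iteration with unaligned loads and stores. */
    for (; i + 2 <= n; i += 2) {
        __m128d s = _mm_loadu_pd(src + i);
        __m128d d = _mm_loadu_pd(dst + i);
        d = _mm_add_pd(d, _mm_mul_pd(s, ten));
        _mm_storeu_pd(dst + i, d);
    }
#endif
    /* Scalar tail, or the whole loop when SSE2 is unavailable. */
    for (; i < n; ++i)
        dst[i] += src[i] * 10.0;
}
```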