Why does changing 0.1f to 0 slow down performance by 10x?

Why does this bit of code,

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0.1f; // <--
        y[i] = y[i] - 0.1f; // <--
    }
}

run more than 10 times faster than the following bit (identical except where noted)?

const float x[16] = {  1.1,   1.2,   1.3,     1.4,   1.5,   1.6,   1.7,   1.8,
                       1.9,   2.0,   2.1,     2.2,   2.3,   2.4,   2.5,   2.6};
const float z[16] = {1.123, 1.234, 1.345, 156.467, 1.578, 1.689, 1.790, 1.812,
                     1.923, 2.034, 2.145,   2.256, 2.367, 2.478, 2.589, 2.690};
float y[16];
for (int i = 0; i < 16; i++)
{
    y[i] = x[i];
}

for (int j = 0; j < 9000000; j++)
{
    for (int i = 0; i < 16; i++)
    {
        y[i] *= x[i];
        y[i] /= z[i];
        y[i] = y[i] + 0; // <--
        y[i] = y[i] - 0; // <--
    }
}

when compiling with Visual Studio 2010 SP1. (I haven't tested with other compilers.)

Reposted from: https://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x


5 answers

狐狸.fox 2012-02-16 16:20


Welcome to the world of denormalized floating-point numbers! They can wreak havoc on performance!

Denormal (or subnormal) numbers are kind of a hack to get some extra values very close to zero out of the floating-point representation. Operations on denormalized floating-point values can be tens to hundreds of times slower than on normalized ones. This is because many processors can't handle them directly and must trap and resolve them using microcode.

If you print out the numbers after 10,000 iterations, you will see that they have converged to different values depending on whether 0 or 0.1f is used.

Here's the test code compiled on x64:

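#include <cstdlib>   // system
#include <iostream>  // cout, endl
#include <omp.h>     // omp_get_wtime (needs OpenMP support)

using namespace std;

int main() {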

    double start = omp_get_wtime();

    const float x[16]={1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.0,2.1,2.2,2.3,2.4,2.5,2.6};
    const float z[16]={1.123,1.234,1.345,156.467,1.578,1.689,1.790,1.812,1.923,2.034,2.145,2.256,2.367,2.478,2.589,2.690};
    float y[16];
    for(int i=0;i<16;i++)
    {
        y[i]=x[i];
    }
    for(int j=0;j<9000000;j++)
    {
        for(int i=0;i<16;i++)
        {
            y[i]*=x[i];
            y[i]/=z[i];
#ifdef FLOATING
            y[i]=y[i]+0.1f;
            y[i]=y[i]-0.1f;
#else
            y[i]=y[i]+0;
            y[i]=y[i]-0;
#endif

            if (j > 10000)
                cout << y[i] << "  ";
        }
        if (j > 10000)
            cout << endl;
    }

    double end = omp_get_wtime();
    cout << end - start << endl;

    system("pause");
    return 0;
}

Output:

#define FLOATING
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007
1.78814e-007  1.3411e-007  1.04308e-007  0  7.45058e-008  6.70552e-008  6.70552e-008  5.58794e-007  3.05474e-007  2.16067e-007  1.71363e-007  1.49012e-007  1.2666e-007  1.11759e-007  1.04308e-007  1.04308e-007

//#define FLOATING
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.46842e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044
6.30584e-044  3.92364e-044  3.08286e-044  0  1.82169e-044  1.54143e-044  2.10195e-044  2.45208e-029  7.56701e-044  4.06377e-044  3.92364e-044  3.22299e-044  3.08286e-044  2.66247e-044  2.66247e-044  2.24208e-044

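Note how in the second run the numbers are very close to zero: most are around 1e-44, well below the smallest normalized float (about 1.18e-38), so they are denormal.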

Denormalized numbers are generally rare and thus most processors don't try to handle them efficiently.

To demonstrate that this has everything to do with denormalized numbers, if we flush denormals to zero by adding this to the start of the code:

_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

Then the version with 0 is no longer 10x slower and actually becomes faster. (This requires that the code be compiled with SSE enabled; the macro is declared in <xmmintrin.h>.)

This means that rather than using these weird lower-precision almost-zero values, we just round to zero instead.

Timings (in seconds) on a Core i7 920 @ 3.5 GHz:

//  Don't flush denormals to zero.
0.1f: 0.564067
0   : 26.7669

//  Flush denormals to zero.
0.1f: 0.587117
0   : 0.341406

In the end, this really has nothing to do with whether it's an integer or floating-point. The 0 or 0.1f is converted/stored into a register outside of both loops. So that has no effect on performance.

This answer was selected by the asker as the best answer.
