狐狸.fox 2011-12-05 17:54 采纳率: 0%

已采纳

如何达到理论上的每周期4次浮点运算的最大值？

How can the theoretical peak performance of 4 floating point operations (double precision) per cycle be achieved on a modern x86-64 Intel CPU?

As far as I understand it take three cycles for an SSE add and five cycles for a mul to complete on most of the modern Intel CPUs (see for example Agner Fog's 'Instruction Tables' ). Due to pipelining one can get a throughput of one add per cycle if the algorithm has at least three independent summations. Since that is true for packed addpd as well as the scalar addsd versions and SSE registers can contain two double's the throughput can be as much as two flops per cycle.

Furthermore, it seems (although I've not seen any proper documentation on this) add's and mul's can be executed in parallel giving a theoretical max throughput of four flops per cycle.

However, I've not been able to replicate that performance with a simple C/C++ programme. My best attempt resulted in about 2.7 flops/cycle. If anyone can contribute a simple C/C++ or assembler programme which demonstrates peak performance that'd be greatly appreciated.

My attempt:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>

double stoptime(void) {
   struct timeval t;
   gettimeofday(&t,NULL);
   return (double) t.tv_sec + t.tv_usec/1000000.0;
}

double addmul(double add, double mul, int ops){
   // Need to initialise differently otherwise compiler might optimise away
   double sum1=0.1, sum2=-0.1, sum3=0.2, sum4=-0.2, sum5=0.0;
   double mul1=1.0, mul2= 1.1, mul3=1.2, mul4= 1.3, mul5=1.4;
   int loops=ops/10;          // We have 10 floating point operations inside the loop
   double expected = 5.0*add*loops + (sum1+sum2+sum3+sum4+sum5)
               + pow(mul,loops)*(mul1+mul2+mul3+mul4+mul5);

   for (int i=0; i<loops; i++) {
      mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
      sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
   }
   return  sum1+sum2+sum3+sum4+sum5+mul1+mul2+mul3+mul4+mul5 - expected;
}

int main(int argc, char** argv) {
   if (argc != 2) {
      printf("usage: %s <num>\n", argv[0]);
      printf("number of operations: <num> millions\n");
      exit(EXIT_FAILURE);
   }
   int n = atoi(argv[1]) * 1000000;
   if (n<=0)
       n=1000;

   double x = M_PI;
   double y = 1.0 + 1e-8;
   double t = stoptime();
   x = addmul(x, y, n);
   t = stoptime() - t;
   printf("addmul:\t %.3f s, %.3f Gflops, res=%f\n", t, (double)n/t/1e9, x);
   return EXIT_SUCCESS;
}

Compiled with

g++ -O2 -march=native addmul.cpp ; ./a.out 1000

produces the following output on an Intel Core i5-750, 2.66 GHz.

addmul:  0.270 s, 3.707 Gflops, res=1.326463

That is, just about 1.4 flops per cycle. Looking at the assembler code with g++ -S -O2 -march=native -masm=intel addmul.cpp the main loop seems kind of optimal to me:

.L4:
inc    eax
mulsd    xmm8, xmm3
mulsd    xmm7, xmm3
mulsd    xmm6, xmm3
mulsd    xmm5, xmm3
mulsd    xmm1, xmm3
addsd    xmm13, xmm2
addsd    xmm12, xmm2
addsd    xmm11, xmm2
addsd    xmm10, xmm2
addsd    xmm9, xmm2
cmp    eax, ebx
jne    .L4

Changing the scalar versions with packed versions (addpd and mulpd) would double the flop count without changing the execution time and so I'd get just short of 2.8 flops per cycle. Is there a simple example which achieves four flops per cycle?

Nice little programme by Mysticial; here are my results (run just for a few seconds though):

gcc -O2 -march=nocona: 5.6 Gflops out of 10.66 Gflops (2.1 flops/cycle)
cl /O2, openmp removed: 10.1 Gflops out of 10.66 Gflops (3.8 flops/cycle)

It all seems a bit complex, but my conclusions so far:

gcc -O2 changes the order of independent floating point operations with the aim of alternating addpd and mulpd's if possible. Same applies to gcc-4.6.2 -O2 -march=core2.
gcc -O2 -march=nocona seems to keep the order of floating point operations as defined in the C++ source.
cl /O2, the 64-bit compiler from the SDK for Windows 7 does loop-unrolling automatically and seems to try and arrange operations so that groups of three addpd's alternate with three mulpd's (well, at least on my system and for my simple programme).
My Core i5 750 (Nahelem architecture) doesn't like alternating add's and mul's and seems unable to run both operations in parallel. However, if grouped in 3's it suddenly works like magic.
Other architectures (possibly Sandy Bridge and others) appear to be able to execute add/mul in parallel without problems if they alternate in the assembly code.
Although difficult to admit, but on my system cl /O2 does a much better job at low-level optimising operations for my system and achieves close to peak performance for the little C++ example above. I measured between 1.85-2.01 flops/cycle (have used clock() in Windows which is not that precise. I guess, need to use a better timer - thanks Mackie Messer).
The best I managed with gcc was to manually loop unroll and arrange additions and multiplications in groups of three. With g++ -O2 -march=nocona addmul_unroll.cpp I get at best 0.207s, 4.825 Gflops which corresponds to 1.8 flops/cycle which I'm quite happy with now.

In the C++ code I've replaced the for loop with

   for (int i=0; i<loops/3; i++) {
       mul1*=mul; mul2*=mul; mul3*=mul;
       sum1+=add; sum2+=add; sum3+=add;
       mul4*=mul; mul5*=mul; mul1*=mul;
       sum4+=add; sum5+=add; sum1+=add;

       mul2*=mul; mul3*=mul; mul4*=mul;
       sum2+=add; sum3+=add; sum4+=add;
       mul5*=mul; mul1*=mul; mul2*=mul;
       sum5+=add; sum1+=add; sum2+=add;

       mul3*=mul; mul4*=mul; mul5*=mul;
       sum3+=add; sum4+=add; sum5+=add;
   }

And the assembly now looks like

.L4:
mulsd    xmm8, xmm3
mulsd    xmm7, xmm3
mulsd    xmm6, xmm3
addsd    xmm13, xmm2
addsd    xmm12, xmm2
addsd    xmm11, xmm2
mulsd    xmm5, xmm3
mulsd    xmm1, xmm3
mulsd    xmm8, xmm3
addsd    xmm10, xmm2
addsd    xmm9, xmm2
addsd    xmm13, xmm2
...

转载于:https://stackoverflow.com/questions/8389648/how-do-i-achieve-the-theoretical-maximum-of-4-flops-per-cycle

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

4条回答默认最新

叼花硬汉 2011-12-05 20:43

关注

I've done this exact task before. But it was mainly to measure power consumption and CPU temperatures. The following code (which is fairly long) achieves close to optimal on my Core i7 2600K.

The key thing to note here is the massive amount of manual loop-unrolling as well as interleaving of multiplies and adds...

The full project can be found on my GitHub: https://github.com/Mysticial/Flops

Warning:

If you decide to compile and run this, pay attention to your CPU temperatures!!!
Make sure you don't overheat it. And make sure CPU-throttling doesn't affect your results!

Furthermore, I take no responsibility for whatever damage that may result from running this code.

Notes:

This code is optimized for x64. x86 doesn't have enough registers for this to compile well.
This code has been tested to work well on Visual Studio 2010/2012 and GCC 4.6.
ICC 11 (Intel Compiler 11) surprisingly has trouble compiling it well.
These are for pre-FMA processors. In order to achieve peak FLOPS on Intel Haswell and AMD Bulldozer processors (and later), FMA (Fused Multiply Add) instructions will be needed. These are beyond the scope of this benchmark.

#include <emmintrin.h>
#include <omp.h>
#include <iostream>
using namespace std;

typedef unsigned long long uint64;

double test_dp_mac_SSE(double x,double y,uint64 iterations){
    register __m128d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;

    //  Generate starting data.
    r0 = _mm_set1_pd(x);
    r1 = _mm_set1_pd(y);

    r8 = _mm_set1_pd(-0.0);

    r2 = _mm_xor_pd(r0,r8);
    r3 = _mm_or_pd(r0,r8);
    r4 = _mm_andnot_pd(r8,r0);
    r5 = _mm_mul_pd(r1,_mm_set1_pd(0.37796447300922722721));
    r6 = _mm_mul_pd(r1,_mm_set1_pd(0.24253562503633297352));
    r7 = _mm_mul_pd(r1,_mm_set1_pd(4.1231056256176605498));
    r8 = _mm_add_pd(r0,_mm_set1_pd(0.37796447300922722721));
    r9 = _mm_add_pd(r1,_mm_set1_pd(0.24253562503633297352));
    rA = _mm_sub_pd(r0,_mm_set1_pd(4.1231056256176605498));
    rB = _mm_sub_pd(r1,_mm_set1_pd(4.1231056256176605498));

    rC = _mm_set1_pd(1.4142135623730950488);
    rD = _mm_set1_pd(1.7320508075688772935);
    rE = _mm_set1_pd(0.57735026918962576451);
    rF = _mm_set1_pd(0.70710678118654752440);

    uint64 iMASK = 0x800fffffffffffffull;
    __m128d MASK = _mm_set1_pd(*(double*)&iMASK);
    __m128d vONE = _mm_set1_pd(1.0);

    uint64 c = 0;
    while (c < iterations){
        size_t i = 0;
        while (i < 1000){
            //  Here's the meat - the part that really matters.

            r0 = _mm_mul_pd(r0,rC);
            r1 = _mm_add_pd(r1,rD);
            r2 = _mm_mul_pd(r2,rE);
            r3 = _mm_sub_pd(r3,rF);
            r4 = _mm_mul_pd(r4,rC);
            r5 = _mm_add_pd(r5,rD);
            r6 = _mm_mul_pd(r6,rE);
            r7 = _mm_sub_pd(r7,rF);
            r8 = _mm_mul_pd(r8,rC);
            r9 = _mm_add_pd(r9,rD);
            rA = _mm_mul_pd(rA,rE);
            rB = _mm_sub_pd(rB,rF);

            r0 = _mm_add_pd(r0,rF);
            r1 = _mm_mul_pd(r1,rE);
            r2 = _mm_sub_pd(r2,rD);
            r3 = _mm_mul_pd(r3,rC);
            r4 = _mm_add_pd(r4,rF);
            r5 = _mm_mul_pd(r5,rE);
            r6 = _mm_sub_pd(r6,rD);
            r7 = _mm_mul_pd(r7,rC);
            r8 = _mm_add_pd(r8,rF);
            r9 = _mm_mul_pd(r9,rE);
            rA = _mm_sub_pd(rA,rD);
            rB = _mm_mul_pd(rB,rC);

            r0 = _mm_mul_pd(r0,rC);
            r1 = _mm_add_pd(r1,rD);
            r2 = _mm_mul_pd(r2,rE);
            r3 = _mm_sub_pd(r3,rF);
            r4 = _mm_mul_pd(r4,rC);
            r5 = _mm_add_pd(r5,rD);
            r6 = _mm_mul_pd(r6,rE);
            r7 = _mm_sub_pd(r7,rF);
            r8 = _mm_mul_pd(r8,rC);
            r9 = _mm_add_pd(r9,rD);
            rA = _mm_mul_pd(rA,rE);
            rB = _mm_sub_pd(rB,rF);

            r0 = _mm_add_pd(r0,rF);
            r1 = _mm_mul_pd(r1,rE);
            r2 = _mm_sub_pd(r2,rD);
            r3 = _mm_mul_pd(r3,rC);
            r4 = _mm_add_pd(r4,rF);
            r5 = _mm_mul_pd(r5,rE);
            r6 = _mm_sub_pd(r6,rD);
            r7 = _mm_mul_pd(r7,rC);
            r8 = _mm_add_pd(r8,rF);
            r9 = _mm_mul_pd(r9,rE);
            rA = _mm_sub_pd(rA,rD);
            rB = _mm_mul_pd(rB,rC);

            i++;
        }

        //  Need to renormalize to prevent denormal/overflow.
        r0 = _mm_and_pd(r0,MASK);
        r1 = _mm_and_pd(r1,MASK);
        r2 = _mm_and_pd(r2,MASK);
        r3 = _mm_and_pd(r3,MASK);
        r4 = _mm_and_pd(r4,MASK);
        r5 = _mm_and_pd(r5,MASK);
        r6 = _mm_and_pd(r6,MASK);
        r7 = _mm_and_pd(r7,MASK);
        r8 = _mm_and_pd(r8,MASK);
        r9 = _mm_and_pd(r9,MASK);
        rA = _mm_and_pd(rA,MASK);
        rB = _mm_and_pd(rB,MASK);
        r0 = _mm_or_pd(r0,vONE);
        r1 = _mm_or_pd(r1,vONE);
        r2 = _mm_or_pd(r2,vONE);
        r3 = _mm_or_pd(r3,vONE);
        r4 = _mm_or_pd(r4,vONE);
        r5 = _mm_or_pd(r5,vONE);
        r6 = _mm_or_pd(r6,vONE);
        r7 = _mm_or_pd(r7,vONE);
        r8 = _mm_or_pd(r8,vONE);
        r9 = _mm_or_pd(r9,vONE);
        rA = _mm_or_pd(rA,vONE);
        rB = _mm_or_pd(rB,vONE);

        c++;
    }

    r0 = _mm_add_pd(r0,r1);
    r2 = _mm_add_pd(r2,r3);
    r4 = _mm_add_pd(r4,r5);
    r6 = _mm_add_pd(r6,r7);
    r8 = _mm_add_pd(r8,r9);
    rA = _mm_add_pd(rA,rB);

    r0 = _mm_add_pd(r0,r2);
    r4 = _mm_add_pd(r4,r6);
    r8 = _mm_add_pd(r8,rA);

    r0 = _mm_add_pd(r0,r4);
    r0 = _mm_add_pd(r0,r8);


    //  Prevent Dead Code Elimination
    double out = 0;
    __m128d temp = r0;
    out += ((double*)&temp)[0];
    out += ((double*)&temp)[1];

    return out;
}

void test_dp_mac_SSE(int tds,uint64 iterations){

    double *sum = (double*)malloc(tds * sizeof(double));
    double start = omp_get_wtime();

#pragma omp parallel num_threads(tds)
    {
        double ret = test_dp_mac_SSE(1.1,2.1,iterations);
        sum[omp_get_thread_num()] = ret;
    }

    double secs = omp_get_wtime() - start;
    uint64 ops = 48 * 1000 * iterations * tds * 2;
    cout << "Seconds = " << secs << endl;
    cout << "FP Ops  = " << ops << endl;
    cout << "FLOPs   = " << ops / secs << endl;

    double out = 0;
    int c = 0;
    while (c < tds){
        out += sum[c++];
    }

    cout << "sum = " << out << endl;
    cout << endl;

    free(sum);
}

int main(){
    //  (threads, iterations)
    test_dp_mac_SSE(8,10000000);

    system("pause");
}

Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

Seconds = 55.5104
FP Ops  = 960000000000
FLOPs   = 1.7294e+010
sum = 2.22652

The machine is a Core i7 2600K @ 4.4 GHz. Theoretical SSE peak is 4 flops * 4.4 GHz = 17.6 GFlops. This code achieves 17.3 GFlops - not bad.

Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

Seconds = 117.202
FP Ops  = 7680000000000
FLOPs   = 6.55279e+010
sum = 17.8122

Theoretical SSE peak is 4 flops * 4 cores * 4.4 GHz = 70.4 GFlops. Actual is 65.5 GFlops.

Let's take this one step further. AVX...

#include <immintrin.h>
#include <omp.h>
#include <iostream>
using namespace std;

typedef unsigned long long uint64;

double test_dp_mac_AVX(double x,double y,uint64 iterations){
    register __m256d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;

    //  Generate starting data.
    r0 = _mm256_set1_pd(x);
    r1 = _mm256_set1_pd(y);

    r8 = _mm256_set1_pd(-0.0);

    r2 = _mm256_xor_pd(r0,r8);
    r3 = _mm256_or_pd(r0,r8);
    r4 = _mm256_andnot_pd(r8,r0);
    r5 = _mm256_mul_pd(r1,_mm256_set1_pd(0.37796447300922722721));
    r6 = _mm256_mul_pd(r1,_mm256_set1_pd(0.24253562503633297352));
    r7 = _mm256_mul_pd(r1,_mm256_set1_pd(4.1231056256176605498));
    r8 = _mm256_add_pd(r0,_mm256_set1_pd(0.37796447300922722721));
    r9 = _mm256_add_pd(r1,_mm256_set1_pd(0.24253562503633297352));
    rA = _mm256_sub_pd(r0,_mm256_set1_pd(4.1231056256176605498));
    rB = _mm256_sub_pd(r1,_mm256_set1_pd(4.1231056256176605498));

    rC = _mm256_set1_pd(1.4142135623730950488);
    rD = _mm256_set1_pd(1.7320508075688772935);
    rE = _mm256_set1_pd(0.57735026918962576451);
    rF = _mm256_set1_pd(0.70710678118654752440);

    uint64 iMASK = 0x800fffffffffffffull;
    __m256d MASK = _mm256_set1_pd(*(double*)&iMASK);
    __m256d vONE = _mm256_set1_pd(1.0);

    uint64 c = 0;
    while (c < iterations){
        size_t i = 0;
        while (i < 1000){
            //  Here's the meat - the part that really matters.

            r0 = _mm256_mul_pd(r0,rC);
            r1 = _mm256_add_pd(r1,rD);
            r2 = _mm256_mul_pd(r2,rE);
            r3 = _mm256_sub_pd(r3,rF);
            r4 = _mm256_mul_pd(r4,rC);
            r5 = _mm256_add_pd(r5,rD);
            r6 = _mm256_mul_pd(r6,rE);
            r7 = _mm256_sub_pd(r7,rF);
            r8 = _mm256_mul_pd(r8,rC);
            r9 = _mm256_add_pd(r9,rD);
            rA = _mm256_mul_pd(rA,rE);
            rB = _mm256_sub_pd(rB,rF);

            r0 = _mm256_add_pd(r0,rF);
            r1 = _mm256_mul_pd(r1,rE);
            r2 = _mm256_sub_pd(r2,rD);
            r3 = _mm256_mul_pd(r3,rC);
            r4 = _mm256_add_pd(r4,rF);
            r5 = _mm256_mul_pd(r5,rE);
            r6 = _mm256_sub_pd(r6,rD);
            r7 = _mm256_mul_pd(r7,rC);
            r8 = _mm256_add_pd(r8,rF);
            r9 = _mm256_mul_pd(r9,rE);
            rA = _mm256_sub_pd(rA,rD);
            rB = _mm256_mul_pd(rB,rC);

            r0 = _mm256_mul_pd(r0,rC);
            r1 = _mm256_add_pd(r1,rD);
            r2 = _mm256_mul_pd(r2,rE);
            r3 = _mm256_sub_pd(r3,rF);
            r4 = _mm256_mul_pd(r4,rC);
            r5 = _mm256_add_pd(r5,rD);
            r6 = _mm256_mul_pd(r6,rE);
            r7 = _mm256_sub_pd(r7,rF);
            r8 = _mm256_mul_pd(r8,rC);
            r9 = _mm256_add_pd(r9,rD);
            rA = _mm256_mul_pd(rA,rE);
            rB = _mm256_sub_pd(rB,rF);

            r0 = _mm256_add_pd(r0,rF);
            r1 = _mm256_mul_pd(r1,rE);
            r2 = _mm256_sub_pd(r2,rD);
            r3 = _mm256_mul_pd(r3,rC);
            r4 = _mm256_add_pd(r4,rF);
            r5 = _mm256_mul_pd(r5,rE);
            r6 = _mm256_sub_pd(r6,rD);
            r7 = _mm256_mul_pd(r7,rC);
            r8 = _mm256_add_pd(r8,rF);
            r9 = _mm256_mul_pd(r9,rE);
            rA = _mm256_sub_pd(rA,rD);
            rB = _mm256_mul_pd(rB,rC);

            i++;
        }

        //  Need to renormalize to prevent denormal/overflow.
        r0 = _mm256_and_pd(r0,MASK);
        r1 = _mm256_and_pd(r1,MASK);
        r2 = _mm256_and_pd(r2,MASK);
        r3 = _mm256_and_pd(r3,MASK);
        r4 = _mm256_and_pd(r4,MASK);
        r5 = _mm256_and_pd(r5,MASK);
        r6 = _mm256_and_pd(r6,MASK);
        r7 = _mm256_and_pd(r7,MASK);
        r8 = _mm256_and_pd(r8,MASK);
        r9 = _mm256_and_pd(r9,MASK);
        rA = _mm256_and_pd(rA,MASK);
        rB = _mm256_and_pd(rB,MASK);
        r0 = _mm256_or_pd(r0,vONE);
        r1 = _mm256_or_pd(r1,vONE);
        r2 = _mm256_or_pd(r2,vONE);
        r3 = _mm256_or_pd(r3,vONE);
        r4 = _mm256_or_pd(r4,vONE);
        r5 = _mm256_or_pd(r5,vONE);
        r6 = _mm256_or_pd(r6,vONE);
        r7 = _mm256_or_pd(r7,vONE);
        r8 = _mm256_or_pd(r8,vONE);
        r9 = _mm256_or_pd(r9,vONE);
        rA = _mm256_or_pd(rA,vONE);
        rB = _mm256_or_pd(rB,vONE);

        c++;
    }

    r0 = _mm256_add_pd(r0,r1);
    r2 = _mm256_add_pd(r2,r3);
    r4 = _mm256_add_pd(r4,r5);
    r6 = _mm256_add_pd(r6,r7);
    r8 = _mm256_add_pd(r8,r9);
    rA = _mm256_add_pd(rA,rB);

    r0 = _mm256_add_pd(r0,r2);
    r4 = _mm256_add_pd(r4,r6);
    r8 = _mm256_add_pd(r8,rA);

    r0 = _mm256_add_pd(r0,r4);
    r0 = _mm256_add_pd(r0,r8);

    //  Prevent Dead Code Elimination
    double out = 0;
    __m256d temp = r0;
    out += ((double*)&temp)[0];
    out += ((double*)&temp)[1];
    out += ((double*)&temp)[2];
    out += ((double*)&temp)[3];

    return out;
}

void test_dp_mac_AVX(int tds,uint64 iterations){

    double *sum = (double*)malloc(tds * sizeof(double));
    double start = omp_get_wtime();

#pragma omp parallel num_threads(tds)
    {
        double ret = test_dp_mac_AVX(1.1,2.1,iterations);
        sum[omp_get_thread_num()] = ret;
    }

    double secs = omp_get_wtime() - start;
    uint64 ops = 48 * 1000 * iterations * tds * 4;
    cout << "Seconds = " << secs << endl;
    cout << "FP Ops  = " << ops << endl;
    cout << "FLOPs   = " << ops / secs << endl;

    double out = 0;
    int c = 0;
    while (c < tds){
        out += sum[c++];
    }

    cout << "sum = " << out << endl;
    cout << endl;

    free(sum);
}

int main(){
    //  (threads, iterations)
    test_dp_mac_AVX(8,10000000);

    system("pause");
}

Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

Seconds = 57.4679
FP Ops  = 1920000000000
FLOPs   = 3.34099e+010
sum = 4.45305

Theoretical AVX peak is 8 flops * 4.4 GHz = 35.2 GFlops. Actual is 33.4 GFlops.

Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

Seconds = 111.119
FP Ops  = 15360000000000
FLOPs   = 1.3823e+011
sum = 35.6244

Theoretical AVX peak is 8 flops * 4 cores * 4.4 GHz = 140.8 GFlops. Actual is 138.2 GFlops.

Now for some explanations:

The performance critical part is obviously the 48 instructions inside the inner loop. You'll notice that it's broken into 4 blocks of 12 instructions each. Each of these 12 instructions blocks are completely independent from each other - and take on average 6 cycles to execute.

So there's 12 instructions and 6 cycles between issue-to-use. The latency of multiplication is 5 cycles, so it's just enough to avoid latency stalls.

The normalization step is needed to keep the data from over/underflowing. This is needed since the do-nothing code will slowly increase/decrease the magnitude of the data.

So it's actually possible to do better than this if you just use all zeros and get rid of the normalization step. However, since I wrote the benchmark to measure power consumption and temperature, I had to make sure the flops were on "real" data, rather than zeros - as the execution units may very well have special case-handling for zeros that use less power and produce less heat.

More Results:

Intel Core i7 920 @ 3.5 GHz
Windows 7 Ultimate x64
Visual Studio 2010 SP1 - x64 Release

Threads: 1

Seconds = 72.1116
FP Ops  = 960000000000
FLOPs   = 1.33127e+010
sum = 2.22652

Theoretical SSE Peak: 4 flops * 3.5 GHz = 14.0 GFlops. Actual is 13.3 GFlops.

Threads: 8

Seconds = 149.576
FP Ops  = 7680000000000
FLOPs   = 5.13452e+010
sum = 17.8122

Theoretical SSE Peak: 4 flops * 4 cores * 3.5 GHz = 56.0 GFlops. Actual is 51.3 GFlops.

My processor temps hit 76C on the multi-threaded run! If you runs these, be sure the results aren't affected by CPU throttling.

2 x Intel Xeon X5482 Harpertown @ 3.2 GHz
Ubuntu Linux 10 x64
GCC 4.5.2 x64 - (-O2 -msse3 -fopenmp)

Threads: 1

Seconds = 78.3357
FP Ops  = 960000000000
FLOPs   = 1.22549e+10
sum = 2.22652

Theoretical SSE Peak: 4 flops * 3.2 GHz = 12.8 GFlops. Actual is 12.3 GFlops.

Threads: 8

Seconds = 78.4733
FP Ops  = 7680000000000
FLOPs   = 9.78676e+10
sum = 17.8122

Theoretical SSE Peak: 4 flops * 8 cores * 3.2 GHz = 102.4 GFlops. Actual is 97.9 GFlops.

本回答被题主选为最佳回答 , 对您是否有帮助呢?

查看更多回答(3条)

报告相同问题？

关注问题

如何达到理论上的每周期4次浮点运算的最大值？ c++
2011-12-05 17:54

回答 3 已采纳 I've done this exact task before. But it was mainly to measure power consumption and CPU temperatu
计算单摆摆动幅度达到最大值时，对应的时间，单摆的运动周期 matlab python
2021-09-19 15:29

回答 1 已采纳是第一问还是第二问啊？第一问没有符号解，只能数值解 g = 10; L = 1; theta0 = 30*pi/180;%30° odefun = @(t,theta) [theta(2); -g/L
尝试安装gonum时如何处理“不允许导入周期”？
2019-04-17 14:42

回答 1 已采纳 As Adrian told in the comments, the issue was that the go version I was running was too old for go
Vue 的生命周期钩子函数是异步的吗？ vue.js
2022-08-23 18:01

回答 2 已采纳钩子函数不是异步，每个钩子函数都有它执行的顺序，例如vue2中有 created钩子，它是在vue实例化之后运行的，mounted是在dom挂载后运行的，在两个函数中同时输出一个log，不管你的mou
SQL语言（SQL SERVER），计算同一数据出现的周期开发语言
2021-02-28 17:43

回答 3 已采纳你这个需求确实挺变态的，花了我好长时间，不过最后还是搞出来了。首先你需要一个临时表，这个表比你的主表多出来一个row_num的字段，这个字段是用来存放根据blue分组之后组内序号的。先执行insert
关于#时钟周期#的问题，如何解决？ fpga开发 stm32 其他单片机
2022-02-11 21:14

回答 2 已采纳周期等于频率的倒数，即1/5GHz=0.2*10^9s
《汇编语言编程基础基于 LoongArch 》读书与实践笔记
2023-02-10 23:35

loongsoner的博客好记性不如烂笔头，在此记录与分享一下《汇编语言编程基础基于 LoongArch 》读书与实践笔记。如文中出现错误，欢迎在评论区留言讨论，我会尽快修改更新 :-)
计算机超速进化，你准备好了吗？
2021-01-08 11:24

SECISLAND安全官的博客 1时间概念2数学知识：浮点数3CPU的速度4缓存5内存6硬盘7网络做信息安全的都免不了和计算机打交道，计算机是信息安全的主要防护对象之一，信息安全的工具也是在计算机上进行操作和使用，所以对计算机基础的了解有助...
C语言基础知识入门大全
2021-05-11 20:37

程序员张小妍的博客 C语言也是其它众多高级语言的鼻祖语言，所以说学习C语言是进入编程世界的必修课！简单来说，一个C程序就是由若干和组成。就是一条预处理命令, 它的作用是通知C语言编译系统在对C程序进行正式编译之前需做一些预处理...
2024最新软件测试【测试理论+ python 编程】面试题（内附答案）
2024-09-03 16:08

sszmvb1234的博客 bug 就会用 tapd(或者禅道)记录并且提交 bug，也会进行 bug 复测，以及回归测试，每一轮测试结束之后我们都会写一个测试报告，一般情况下，测试 4-5 轮之后会达到上线要求，当达到上线的标准后，测试报告会认为...
CUDA2.4-原理之性能优化及浮点运算
2015-07-09 19:59

仙守的博客本部分来自于《大规模并行处理器编程实战》第六章、第七章。打算不再看这本书了，准备看《programming massively parallel processors 2nd》，即它的第二版，第一版是09年的，第二版是13年的，虽说第二版可是里面...
gpu浮点计算能力floaps_聊聊 GPU 的计算能力上限
2021-01-04 03:01

饿鱼小孩儿的博客 </span><span>每个 SM 除了 INT32、FP32、FP64 计算单元之外，还有额外 4 个身宽体胖的 Tensor Core，这是加速 Deep Learning 计算的重磅武器，已发展到第三代，每个时钟周期可做 1024 次 FP16 乘加运算，与 Volta ...
没有解决我的问题, 去提问

悬赏问题

¥15 PointNet++的onnx模型只能使用一次
¥20 西南科技大学数字信号处理
¥15 有两个非常“自以为是”烦人的问题急期待大家解决！
¥30 STM32 INMP441无法读取数据
¥15 R语言绘制密度图，一个密度曲线内fill不同颜色如何实现
¥100 求汇川机器人IRCB300控制器和示教器同版本升级固件文件升级包
¥15 用visualstudio2022创建vue项目后无法启动
¥15 x趋于0时tanx-sinx极限可以拆开算吗
¥500 把面具戴到人脸上，请大家贡献智慧，别用大模型回答，大模型的答案没啥用
¥15 任意一个散点图自己下载其js脚本文件并做成独立的案例页面，不要作在线的，要离线状态。

码龄粉丝数原力等级 --

如何达到理论上的每周期4次浮点运算的最大值？

4条回答默认最新

码龄粉丝数原力等级 --

Warning:

Let's take this one step further. AVX...

More Results:

悬赏问题

如何达到理论上的每周期4次浮点运算的最大值？

4条回答 默认 最新

Warning:

Let's take this one step further. AVX...

More Results:

悬赏问题

4条回答默认最新