Does Go not guarantee 16-byte stack alignment, so that you could use a memory source operand for minpd?
Also, I'm not familiar with Go; is its float really IEEE binary64, which most languages (including x86 asm) call double? It's weird to see float in the source and pd (packed double) instructions used in the asm.
For a single pair, the overhead of calling a standalone hand-written asm function is going to be higher than letting a compiler do it with scalar minsd. Especially with Go's crappy calling convention, which passes args in memory and stores the return value to memory.
An optimizing Go compiler with an LLVM or gcc back-end should get the job done with inline code, at lower latency and with fewer uops of throughput cost than calling this function, even with the optimization given below. Or if you're lucky, the compiler will use minpd for you.
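As a rough sketch of the "let the compiler do it inline" option, here is a plain Go version. It assumes the operation wanted is min(a0, b0) + min(a1, b1), which is what minpd followed by a horizontal sum computes. The names minSD and sumOfMins are made up for illustration, and whether a particular Go compiler actually emits minsd or minpd for this is not guaranteed.

```go
package main

import "fmt"

// minSD follows the x86 minsd rule: return a only when a < b is true,
// otherwise return b (so b is returned if either input is NaN, or for
// +0 vs -0).
func minSD(a, b float64) float64 {
	if a < b {
		return a
	}
	return b
}

// sumOfMins (hypothetical name) computes min(a0, b0) + min(a1, b1).
// It is small enough to be a candidate for inlining at the call site.
func sumOfMins(a, b [2]float64) float64 {
	return minSD(a[0], b[0]) + minSD(a[1], b[1])
}

func main() {
	// min(1, 3) + min(5, 2) = 1 + 2
	fmt.Println(sumOfMins([2]float64{1, 5}, [2]float64{3, 2}))
}
```

Writing the comparison as a < b instead of calling math.Min is deliberate: math.Min has different NaN and signed-zero handling from the SSE2 min instructions, so it cannot map to a single minsd.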
But for the actual problem, after minpd x0, x1, what you need is a horizontal sum of xmm1. See "Fastest way to do horizontal float vector sum on x86".
You should use movaps to copy xmm registers, even if you only care about the low 64 bits. movsd x1, x2 merges into the low 64 bits of xmm2, creating a false dependency on the old value and costing a shuffle uop.
    minpd   x0, x1
    movhlps x1, x0    // high 64 bits of xmm1 => low 64 of xmm0
    addsd   x1, x0    // low 64 of xmm0 = min(low pair) + min(high pair)
You could movaps x1, x2 and unpckhpd x2, x2, but that would cost an extra movapd or movaps, which you can avoid by using movhlps (the register-to-register form; movhps only exists with a memory operand).
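To make the data movement concrete, here is a small Go model of the three-instruction sequence above, not real asm. It assumes each xmm register can be treated as two float64 lanes, uses the same source-then-destination operand order as the asm, and the type and function names are hypothetical.

```go
package main

import "fmt"

// xmm models one 128-bit register as two float64 lanes:
// index 0 is the low 64 bits, index 1 is the high 64 bits.
type xmm [2]float64

// minLane follows the SSE2 min rule: dst only when dst < src, else src.
func minLane(dst, src float64) float64 {
	if dst < src {
		return dst
	}
	return src
}

// hsumOfMins (hypothetical name) walks through the sequence step by step
// and returns the low lane of x0, where the scalar result ends up.
func hsumOfMins(x0, x1 xmm) float64 {
	// minpd x0, x1: per-lane minimum, written to x1
	x1 = xmm{minLane(x1[0], x0[0]), minLane(x1[1], x0[1])}
	// shuffle: high lane of x1 goes to the low lane of x0
	// (the high lane of x0 is left as it was)
	x0[0] = x1[1]
	// addsd x1, x0: add the low lanes, result in the low lane of x0
	x0[0] += x1[0]
	return x0[0]
}

func main() {
	// min(1, 3) + min(5, 2) = 1 + 2
	fmt.Println(hsumOfMins(xmm{1, 5}, xmm{3, 2}))
}
```

The shuffle writes only the low lane of x0, so no separate register copy is needed before the add; that is the saving over the copy-plus-unpckhpd alternative.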
(movaps / movups is one byte shorter than movapd / movupd, so it gives smaller code-size, and is otherwise exactly equivalent on all CPUs for loads, stores, and reg-reg copies.)