2018-10-03 19:47


I'm writing an assembly function in Go. To simplify, let's suppose that I want to implement the following function:

func sseSumOfMinimums (d1, d2 [2]float64) float64

It computes the minimum of d1[0] and d2[0] and the minimum of d1[1] and d2[1], then returns the sum of the two minimums.

In assembly I do:

TEXT ·sseSumOfMinimums(SB), $0-40
MOVUPD d1+0(FP), X0 // load d1 into X0
MOVUPD d2+16(FP), X1 // load d2 into X1
MINPD X0, X1 // compute pairwise minimums, result in X1
MOVSD X1, X2 // move first minimum to X2
// How do I move the second float of X1 to X3?
MOVSD X3, ret+32(FP)

The part I'm missing is how to extract the second scalar from X1 into X3.


1 answer

  • dqf60304 2018-10-03 20:02

    Does Go not guarantee stack alignment so you could use a memory source operand for minpd?

    Also, I'm not familiar with Go; is its float really IEEE binary64, which most languages (including x86 asm) call double? It's weird to see float in the source and pd (packed double) instructions used in the asm.

    The overhead of calling a standalone hand-written-asm function for this is going to be higher than letting a compiler do it with scalar minsd, for a single pair. Especially with Go's crappy calling convention, passing args in memory and storing the return value to memory.

    An optimizing Go compiler with an LLVM or gcc back-end should get the job done with inline code with lower latency and fewer uops of throughput cost than calling this function, even with the optimization given below. Or if you're lucky, the compiler will use minpd for you.
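For comparison, here is a minimal pure-Go sketch of the same computation that a compiler could inline. The names `minLane` and `sumOfMinimums` are hypothetical and not from the question. It uses a plain `<` comparison instead of `math.Min`, on the assumption that NaN and signed-zero handling do not matter for the inputs.

```go
package main

import "fmt"

// minLane keeps a when a < b and otherwise takes b.
// Assumption: NaN and signed-zero inputs are not a concern here.
func minLane(a, b float64) float64 {
	if a < b {
		return a
	}
	return b
}

// sumOfMinimums is a hypothetical pure-Go counterpart of sseSumOfMinimums:
// it adds the minimum of the first elements to the minimum of the second elements.
func sumOfMinimums(d1, d2 [2]float64) float64 {
	return minLane(d1[0], d2[0]) + minLane(d1[1], d2[1])
}

func main() {
	// Computes min(1, 3) + min(5, 2).
	fmt.Println(sumOfMinimums([2]float64{1, 5}, [2]float64{3, 2}))
}
```

Benchmarking a version like this against the assembly function is probably the simplest way to check whether the call overhead mentioned above matters for a given workload.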

    But for the actual problem, after minpd x0, x1, what you need is a horizontal sum of xmm1. See the Stack Overflow question "Fastest way to do horizontal float vector sum on x86".

    You should use movaps to copy xmm registers, even if you only care about the low 64 bits. movsd x1, x2 merges into the low 64 bits of xmm2, creating a false dependency on the old value and costing a shuffle uop.

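    minpd   x0, x1        // xmm1 = pairwise minimums
    movhlps x1, x0        // high 64 bits of xmm1  => low 64 of xmm0
    addsd   x1, x0        // low 64 of xmm0 = sum of the two minimums

    You could movaps x1, x2 and unpckhpd x2,x2, but that would cost an extra movapd or movaps which you can avoid by using movhlps (the register-to-register instruction; movhps only has load and store forms).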

    (movaps / movups is shorter than movapd, smaller code-size, and otherwise exactly equivalent to movapd / movupd on all CPUs for loads, stores, and reg-reg copies.)
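To make the data flow of the horizontal sum easier to follow, here is a rough Go model of the lanes. It is only an illustration: the `xmm` type and the helper functions are hypothetical stand-ins that mimic MINPD, the high-to-low register move (MOVHLPS), and ADDSD with Go's source-then-destination operand order. It ignores NaN and signed-zero details and says nothing about performance.

```go
package main

import "fmt"

// xmm models a 128-bit register as two float64 lanes: [0] is low, [1] is high.
type xmm [2]float64

// minpd models MINPD src, dst: each lane of dst becomes the smaller of dst and src.
func minpd(src xmm, dst *xmm) {
	for i := range dst {
		if !(dst[i] < src[i]) {
			dst[i] = src[i]
		}
	}
}

// movhlps models MOVHLPS src, dst: the high lane of src is copied to the
// low lane of dst; the high lane of dst is left unchanged.
func movhlps(src xmm, dst *xmm) { dst[0] = src[1] }

// addsd models ADDSD src, dst: only the low lanes are added.
func addsd(src xmm, dst *xmm) { dst[0] += src[0] }

// sumOfMinimumsModel (hypothetical name) replays the three-instruction sequence.
func sumOfMinimumsModel(d1, d2 [2]float64) float64 {
	x0, x1 := xmm(d1), xmm(d2)
	minpd(x0, &x1)   // x1 = {min(d1[0], d2[0]), min(d1[1], d2[1])}
	movhlps(x1, &x0) // x0 low lane = second minimum
	addsd(x1, &x0)   // x0 low lane = second minimum + first minimum
	return x0[0]     // the value that would be stored to ret+32(FP)
}

func main() {
	fmt.Println(sumOfMinimumsModel([2]float64{1, 5}, [2]float64{3, 2}))
}
```

Because X0 is free to overwrite once MINPD has consumed it, the sequence needs no extra register copy, which is the point of choosing the high-to-low move over a copy plus unpckhpd.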
