duanfei8897 2018-10-03 19:47

SSE2: extracting a float from packed data in Golang

I'm writing an assembly function in Golang. To simplify, let's suppose that I want to implement the following function:

func sseSumOfMinimums (d1, d2 [2]float64) float64

It computes the minimum of d1[0] and d2[0] and the minimum of d1[1] and d2[1], then returns the sum of the two minimums.

In assembly I do:

TEXT ·sseSumOfMinimums(SB), $0-40
MOVUPD d1+0(FP), X0 // load d1 into X0
MOVUPD d2+16(FP), X1 // load d2 into X1
MINPD X0, X1 // compute pairwise minimums and store them in X1
MOVSD X1, X2 // move first min to X2
// How do I move second float of X1 to X3?
ADDSD X2, X3
MOVSD X3, ret+32(FP)
RET

The part that I'm missing is how to extract the second scalar from X1 into X3.

1 answer

  • dqf60304 2018-10-03 20:02

    Does Go not guarantee stack alignment so you could use a memory source operand for minpd?

    Also, I'm not familiar with Go; is its float really IEEE binary64, which most languages (including x86 asm) call double? It's weird to see float in the source and pd (packed double) instructions used in the asm.


    The overhead of calling a standalone hand-written-asm function for this is going to be higher than letting a compiler do it with scalar minsd, for a single pair. Especially with Go's crappy calling convention, passing args in memory and storing the return value to memory.

    An optimizing Go compiler with an LLVM or gcc back-end should get the job done with inline code with lower latency and fewer uops of throughput cost than calling this function, even with the optimization given below. Or if you're lucky, the compiler will use minpd for you.


    But for the actual problem, after minpd x0, x1, what you need is a horizontal sum of xmm1; see "Fastest way to do horizontal float vector sum on x86".

    You should use movaps to copy xmm registers, even if you only care about the low 64 bits. movsd x1, x2 merges into the low 64 bits of xmm2, creating a false dependency on the old value and costing a shuffle uop.

    minpd   x0, x1
    movhlps x1, x0        // high 64 bits of xmm1  => low 64 of xmm0
    addsd   x1, x0        // low 64 of xmm0 now holds the sum of both minimums
    

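    You could movaps x1, x2 and unpckhpd x2,x2, but that would cost an extra movapd or movaps, which you can avoid by using movhlps (the register-to-register move of the high half into the low half; movhps only has load and store forms).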

    (movaps / movups is shorter than movapd, smaller code-size, and otherwise exactly equivalent to movapd / movupd on all CPUs for loads, stores, and reg-reg copies.)

    This answer was selected by the asker as the best answer.
