具有goroutine的矩阵乘法会降低性能

I am optimizing matrix multiplication via goroutines in Go.

My benchmark shows, introducing concurrency per row or per element largely drops performance:

goos: darwin
goarch: amd64
BenchmarkMatrixDotNaive/A.MultNaive-8                            2000000               869 ns/op               0 B/op          0 allocs/op
BenchmarkMatrixDotNaive/A.ParalMultNaivePerRow-8                  100000             14467 ns/op              80 B/op          9 allocs/op
BenchmarkMatrixDotNaive/A.ParalMultNaivePerElem-8                  20000             77299 ns/op             528 B/op         65 allocs/op

I know some basic prior knowledge of cache locality, it make sense that per element concurrency drops performance. However, why per row still drops the performance even in naive version?

In fact, I also wrote a block/tiling optimization, its vanilla version (without goroutine concurrency) even worse than naive version (not present here, let's focus on naive first).

What did I do wrong here? Why? How to optimize here?

Multiplication:

package naive

import (
    "errors"
    "sync"
)

// Errors
var (
    ErrNumElements = errors.New("Error number of elements")
    ErrMatrixSize  = errors.New("Error size of matrix")
)

// Matrix is a 2d array
type Matrix struct {
    N    int
    data [][]float64
}

// New a size by size matrix
func New(size int) func(...float64) (*Matrix, error) {
    wg := sync.WaitGroup{}
    d := make([][]float64, size)
    for i := range d {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            d[i] = make([]float64, size)
        }(i)
    }
    wg.Wait()
    m := &Matrix{N: size, data: d}
    return func(es ...float64) (*Matrix, error) {
        if len(es) != size*size {
            return nil, ErrNumElements
        }
        for i := range es {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                m.data[i/size][i%size] = es[i]
            }(i)
        }
        wg.Wait()
        return m, nil
    }
}

// At access element (i, j)
func (A *Matrix) At(i, j int) float64 {
    return A.data[i][j]
}

// Set set element (i, j) with val
func (A *Matrix) Set(i, j int, val float64) {
    A.data[i][j] = val
}

// MultNaive matrix multiplication O(n^3)
func (A *Matrix) MultNaive(B, C *Matrix) (err error) {
    var (
        i, j, k int
        sum     float64
        N       = A.N
    )

    if N != B.N || N != C.N {
        return ErrMatrixSize
    }

    for i = 0; i < N; i++ {
        for j = 0; j < N; j++ {
            sum = 0.0
            for k = 0; k < N; k++ {
                sum += A.At(i, k) * B.At(k, j)
            }
            C.Set(i, j, sum)
        }
    }
    return
}

// ParalMultNaivePerRow matrix multiplication O(n^3) in concurrency per row
func (A *Matrix) ParalMultNaivePerRow(B, C *Matrix) (err error) {
    var N = A.N

    if N != B.N || N != C.N {
        return ErrMatrixSize
    }

    wg := sync.WaitGroup{}
    for i := 0; i < N; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            for j := 0; j < N; j++ {
                sum := 0.0
                for k := 0; k < N; k++ {
                    sum += A.At(i, k) * B.At(k, j)
                }
                C.Set(i, j, sum)
            }
        }(i)
    }
    wg.Wait()
    return
}

// ParalMultNaivePerElem matrix multiplication O(n^3) in concurrency per element
func (A *Matrix) ParalMultNaivePerElem(B, C *Matrix) (err error) {
    var N = A.N

    if N != B.N || N != C.N {
        return ErrMatrixSize
    }

    wg := sync.WaitGroup{}
    for i := 0; i < N; i++ {
        for j := 0; j < N; j++ {
            wg.Add(1)
            go func(i, j int) {
                defer wg.Done()
                sum := 0.0
                for k := 0; k < N; k++ {
                    sum += A.At(i, k) * B.At(k, j)
                }
                C.Set(i, j, sum)
            }(i, j)
        }
    }
    wg.Wait()
    return
}

Benchmark:

package naive

import (
    "os"
    "runtime/trace"
    "testing"
)

type Dot func(B, C *Matrix) error

var (
    A = &Matrix{
        N: 8,
        data: [][]float64{
            []float64{1, 2, 3, 4, 5, 6, 7, 8},
            []float64{9, 1, 2, 3, 4, 5, 6, 7},
            []float64{8, 9, 1, 2, 3, 4, 5, 6},
            []float64{7, 8, 9, 1, 2, 3, 4, 5},
            []float64{6, 7, 8, 9, 1, 2, 3, 4},
            []float64{5, 6, 7, 8, 9, 1, 2, 3},
            []float64{4, 5, 6, 7, 8, 9, 1, 2},
            []float64{3, 4, 5, 6, 7, 8, 9, 0},
        },
    }
    B = &Matrix{
        N: 8,
        data: [][]float64{
            []float64{9, 8, 7, 6, 5, 4, 3, 2},
            []float64{1, 9, 8, 7, 6, 5, 4, 3},
            []float64{2, 1, 9, 8, 7, 6, 5, 4},
            []float64{3, 2, 1, 9, 8, 7, 6, 5},
            []float64{4, 3, 2, 1, 9, 8, 7, 6},
            []float64{5, 4, 3, 2, 1, 9, 8, 7},
            []float64{6, 5, 4, 3, 2, 1, 9, 8},
            []float64{7, 6, 5, 4, 3, 2, 1, 0},
        },
    }
    C = &Matrix{
        N: 8,
        data: [][]float64{
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
        },
    }
)

func BenchmarkMatrixDotNaive(b *testing.B) {
    f, _ := os.Create("bench.trace")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()

    tests := []struct {
        name string
        f    Dot
    }{
        {
            name: "A.MultNaive",
            f:    A.MultNaive,
        },
        {
            name: "A.ParalMultNaivePerRow",
            f:    A.ParalMultNaivePerRow,
        },
        {
            name: "A.ParalMultNaivePerElem",
            f:    A.ParalMultNaivePerElem,
        },
    }
    for _, tt := range tests {
        b.Run(tt.name, func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                tt.f(B, C)
            }
        })
    }
}

写回答
好问题 0 提建议
追加酬金
关注问题
分享
邀请回答
编辑收藏删除结题
收藏举报

1条回答默认最新

关注

码龄粉丝数原力等级 --

被采纳

被点赞

采纳率
doucan957495 2018-09-19 10:08
关注
Performing 8x8 matrix multipliciation is relatively small work.

Goroutines (although may be lightweight) do have overhead. If the work they do is "small", the overhead of launching, synchronizing and throwing them away may outweight the performance gain of utilizing multiple cores / threads, and overall you might not gain performance by executing such small tasks concurrently (hell, you may even do worse than without using goroutines). Measure.

If we increase the matrix size to 80x80, running the benchmark we already see some performance gain in case of ParalMultNaivePerRow:

BenchmarkMatrixDotNaive/A.MultNaive-4 2000 1054775 ns/op BenchmarkMatrixDotNaive/A.ParalMultNaivePerRow-4 2000 709367 ns/op BenchmarkMatrixDotNaive/A.ParalMultNaivePerElem-4 100 10224927 ns/op

(As you see in the results, I have 4 CPU cores, running it on your 8-core machine might show more performance gain.)

When rows are small, you are using goroutines to do minimal work, you may improve performance by not "throwing" away goroutines once they're done with their "tiny" work, but you may "reuse" them. See related question: Is this an idiomatic worker thread pool in Go?

Also see related / possible duplicate: Vectorise a function taking advantage of concurrency
本回答被题主选为最佳回答 , 对您是否有帮助呢?

解决无用
评论打赏
分享
举报

评论

按下Enter换行，Ctrl+Enter发表内容

报告相同问题？

关注问题

具有goroutine的矩阵乘法会降低性能
2018-09-19 09:58

回答 1 已采纳 Performing 8x8 matrix multipliciation is relatively small work. Goroutines (although may be light
Go例程矩阵乘法错误
2018-05-06 04:34

回答 1 已采纳 I see two problems: You have a classic closure problem in multiply with i and j. There's no guar
为什么此goroutine会阻止？
2017-02-24 07:19

回答 3 已采纳 This is according to the spec: The function value and parameters are evaluated as usual in the
pond:Go语言编写的简约高性能goroutine工作池
2021-04-14 01:28

Go语言编写的简约高性能goroutine工作池动机该库旨在提供一种在有限的资源或服务上执行某些功能时限制并发性的简单方法。一些常见的方案包括：对编号为有限的数据库执行查询。连接数将HTTP请求发送到速率/...
具有多个通道的多个goroutine的死锁
2018-11-05 05:11

回答 1 已采纳 We can iterate through values sent over a channel. To break such iteration channel needs to be clo
这段具有非缓冲通道的代码是否会在Go中导致goroutine泄漏？
2019-05-28 19:42

回答 1 已采纳 The problem with your code is twofold. First, there is, theoretically, a goroutine leak since any
突破具有无限循环的第三方goroutine
2017-10-10 16:05

回答 2 已采纳 Short answer: You can't. There's no way to kill a goroutine (short of killing the entire program)
parallelwriter:io.MultiWriter但具有更多goroutine
2021-05-21 14:25

并行写入器 ParallelWriter与相似，但同时执行写入操作。
当goroutine争用超过3400时，为什么sync.Mutex会大大降低性能？
2019-08-19 19:07

回答 2 已采纳 sync.Mutex 's implementation is based on runtime semaphore. The reason why it encounters massive p
从goroutine返回值会发生什么
2015-01-09 20:06

回答 5 已采纳 A quick bit of looking at the assembly output shows $ go build -gcflags -S z.go The getNumber()
Goroutine超时
2018-07-07 12:07

回答 2 已采纳 You control cancelation of http requests with a context.Context. // create a timeout or cancelati
gls:快速goroutine本地存储
2021-05-18 11:22

gls 快速goroutine本地存储警告关于为什么在Go中实现和使用线程本地存储（实际上是goroutine本地存储）是一个坏主意，有大量的文档和讨论。例如，请参阅和的，这是鼓励您解决需要goroutine本地存储的问题的方式。 ...
在golang中优先使用goroutine
2018-12-21 20:09

回答 2 已采纳 I have created threadpools on golang. This should allow easily one to prioritize certain goroutine
gls:Goroutine本地存储
2021-03-27 21:39

Goroutine本地存储该库是golang中线程本地存储模式的实现。作为一种语言，Go掩藏了与goroutine相关的复杂性，从内部通信（通过通道）到阻止/解除阻止。与Java不同，golang对goroutine的控制很少。在这种情况下，...
GOLANG使用Context管理关联goroutine的方法
2020-09-19 17:50

主要介绍了GOLANG使用Context管理关联goroutine的方法，文中通过示例代码介绍的非常详细，对大家的学习或者工作具有一定的参考学习价值，需要的朋友们下面随着小编来一起学习学习吧
没有解决我的问题, 去提问

悬赏问题

¥15 oracle集群安装出bug
¥15 关于#python#的问题：自动化测试
¥20 问题请教！vue项目关于Nginx配置nonce安全策略的问题
¥15 教务系统账号被盗号如何追溯设备
¥20 delta降尺度方法，未来数据怎么降尺度
¥15 c# 使用NPOI快速将datatable数据导入excel中指定sheet，要求快速高效
¥15 再不同版本的系统上，TCP传输速度不一致
¥15 高德地图2.0 版本点聚合中Marker的位置无法实时更新，如何解决呢？
¥15 DIFY API Endpoint 问题。
¥20 sub地址DHCP问题

具有goroutine的矩阵乘法会降低性能

1条回答 默认 最新

悬赏问题

1条回答默认最新