dongweishi2028 2018-09-19 09:58
浏览 68
已采纳

具有goroutine的矩阵乘法会降低性能

I am optimizing matrix multiplication via goroutines in Go.

My benchmark shows, introducing concurrency per row or per element largely drops performance:

goos: darwin
goarch: amd64
BenchmarkMatrixDotNaive/A.MultNaive-8                            2000000               869 ns/op               0 B/op          0 allocs/op
BenchmarkMatrixDotNaive/A.ParalMultNaivePerRow-8                  100000             14467 ns/op              80 B/op          9 allocs/op
BenchmarkMatrixDotNaive/A.ParalMultNaivePerElem-8                  20000             77299 ns/op             528 B/op         65 allocs/op

I know some basic prior knowledge of cache locality, it make sense that per element concurrency drops performance. However, why per row still drops the performance even in naive version?

In fact, I also wrote a block/tiling optimization, its vanilla version (without goroutine concurrency) even worse than naive version (not present here, let's focus on naive first).

What did I do wrong here? Why? How to optimize here?

Multiplication:

package naive

import (
    "errors"
    "sync"
)

// Errors
var (
    ErrNumElements = errors.New("Error number of elements")
    ErrMatrixSize  = errors.New("Error size of matrix")
)

// Matrix is a 2d array
type Matrix struct {
    N    int
    data [][]float64
}

// New a size by size matrix
func New(size int) func(...float64) (*Matrix, error) {
    wg := sync.WaitGroup{}
    d := make([][]float64, size)
    for i := range d {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            d[i] = make([]float64, size)
        }(i)
    }
    wg.Wait()
    m := &Matrix{N: size, data: d}
    return func(es ...float64) (*Matrix, error) {
        if len(es) != size*size {
            return nil, ErrNumElements
        }
        for i := range es {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                m.data[i/size][i%size] = es[i]
            }(i)
        }
        wg.Wait()
        return m, nil
    }
}

// At access element (i, j)
func (A *Matrix) At(i, j int) float64 {
    return A.data[i][j]
}

// Set set element (i, j) with val
func (A *Matrix) Set(i, j int, val float64) {
    A.data[i][j] = val
}

// MultNaive matrix multiplication O(n^3)
func (A *Matrix) MultNaive(B, C *Matrix) (err error) {
    var (
        i, j, k int
        sum     float64
        N       = A.N
    )

    if N != B.N || N != C.N {
        return ErrMatrixSize
    }

    for i = 0; i < N; i++ {
        for j = 0; j < N; j++ {
            sum = 0.0
            for k = 0; k < N; k++ {
                sum += A.At(i, k) * B.At(k, j)
            }
            C.Set(i, j, sum)
        }
    }
    return
}

// ParalMultNaivePerRow matrix multiplication O(n^3) in concurrency per row
func (A *Matrix) ParalMultNaivePerRow(B, C *Matrix) (err error) {
    var N = A.N

    if N != B.N || N != C.N {
        return ErrMatrixSize
    }

    wg := sync.WaitGroup{}
    for i := 0; i < N; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            for j := 0; j < N; j++ {
                sum := 0.0
                for k := 0; k < N; k++ {
                    sum += A.At(i, k) * B.At(k, j)
                }
                C.Set(i, j, sum)
            }
        }(i)
    }
    wg.Wait()
    return
}

// ParalMultNaivePerElem matrix multiplication O(n^3) in concurrency per element
func (A *Matrix) ParalMultNaivePerElem(B, C *Matrix) (err error) {
    var N = A.N

    if N != B.N || N != C.N {
        return ErrMatrixSize
    }

    wg := sync.WaitGroup{}
    for i := 0; i < N; i++ {
        for j := 0; j < N; j++ {
            wg.Add(1)
            go func(i, j int) {
                defer wg.Done()
                sum := 0.0
                for k := 0; k < N; k++ {
                    sum += A.At(i, k) * B.At(k, j)
                }
                C.Set(i, j, sum)
            }(i, j)
        }
    }
    wg.Wait()
    return
}

Benchmark:

package naive

import (
    "os"
    "runtime/trace"
    "testing"
)

type Dot func(B, C *Matrix) error

var (
    A = &Matrix{
        N: 8,
        data: [][]float64{
            []float64{1, 2, 3, 4, 5, 6, 7, 8},
            []float64{9, 1, 2, 3, 4, 5, 6, 7},
            []float64{8, 9, 1, 2, 3, 4, 5, 6},
            []float64{7, 8, 9, 1, 2, 3, 4, 5},
            []float64{6, 7, 8, 9, 1, 2, 3, 4},
            []float64{5, 6, 7, 8, 9, 1, 2, 3},
            []float64{4, 5, 6, 7, 8, 9, 1, 2},
            []float64{3, 4, 5, 6, 7, 8, 9, 0},
        },
    }
    B = &Matrix{
        N: 8,
        data: [][]float64{
            []float64{9, 8, 7, 6, 5, 4, 3, 2},
            []float64{1, 9, 8, 7, 6, 5, 4, 3},
            []float64{2, 1, 9, 8, 7, 6, 5, 4},
            []float64{3, 2, 1, 9, 8, 7, 6, 5},
            []float64{4, 3, 2, 1, 9, 8, 7, 6},
            []float64{5, 4, 3, 2, 1, 9, 8, 7},
            []float64{6, 5, 4, 3, 2, 1, 9, 8},
            []float64{7, 6, 5, 4, 3, 2, 1, 0},
        },
    }
    C = &Matrix{
        N: 8,
        data: [][]float64{
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
        },
    }
)

func BenchmarkMatrixDotNaive(b *testing.B) {
    f, _ := os.Create("bench.trace")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()

    tests := []struct {
        name string
        f    Dot
    }{
        {
            name: "A.MultNaive",
            f:    A.MultNaive,
        },
        {
            name: "A.ParalMultNaivePerRow",
            f:    A.ParalMultNaivePerRow,
        },
        {
            name: "A.ParalMultNaivePerElem",
            f:    A.ParalMultNaivePerElem,
        },
    }
    for _, tt := range tests {
        b.Run(tt.name, func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                tt.f(B, C)
            }
        })
    }
}
  • 写回答

1条回答 默认 最新

  • doucan957495 2018-09-19 10:08
    关注

    Performing 8x8 matrix multipliciation is relatively small work.

    Goroutines (although may be lightweight) do have overhead. If the work they do is "small", the overhead of launching, synchronizing and throwing them away may outweight the performance gain of utilizing multiple cores / threads, and overall you might not gain performance by executing such small tasks concurrently (hell, you may even do worse than without using goroutines). Measure.

    If we increase the matrix size to 80x80, running the benchmark we already see some performance gain in case of ParalMultNaivePerRow:

    BenchmarkMatrixDotNaive/A.MultNaive-4               2000     1054775 ns/op
    BenchmarkMatrixDotNaive/A.ParalMultNaivePerRow-4    2000      709367 ns/op
    BenchmarkMatrixDotNaive/A.ParalMultNaivePerElem-4    100    10224927 ns/op
    

    (As you see in the results, I have 4 CPU cores, running it on your 8-core machine might show more performance gain.)

    When rows are small, you are using goroutines to do minimal work, you may improve performance by not "throwing" away goroutines once they're done with their "tiny" work, but you may "reuse" them. See related question: Is this an idiomatic worker thread pool in Go?

    Also see related / possible duplicate: Vectorise a function taking advantage of concurrency

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 oracle集群安装出bug
  • ¥15 关于#python#的问题:自动化测试
  • ¥20 问题请教!vue项目关于Nginx配置nonce安全策略的问题
  • ¥15 教务系统账号被盗号如何追溯设备
  • ¥20 delta降尺度方法,未来数据怎么降尺度
  • ¥15 c# 使用NPOI快速将datatable数据导入excel中指定sheet,要求快速高效
  • ¥15 再不同版本的系统上,TCP传输速度不一致
  • ¥15 高德地图2.0 版本点聚合中Marker的位置无法实时更新,如何解决呢?
  • ¥15 DIFY API Endpoint 问题。
  • ¥20 sub地址DHCP问题