dongweishi2028 2018-09-19 09:58
浏览 68
已采纳

具有goroutine的矩阵乘法会降低性能

I am optimizing matrix multiplication via goroutines in Go.

My benchmark shows, introducing concurrency per row or per element largely drops performance:

goos: darwin
goarch: amd64
BenchmarkMatrixDotNaive/A.MultNaive-8                            2000000               869 ns/op               0 B/op          0 allocs/op
BenchmarkMatrixDotNaive/A.ParalMultNaivePerRow-8                  100000             14467 ns/op              80 B/op          9 allocs/op
BenchmarkMatrixDotNaive/A.ParalMultNaivePerElem-8                  20000             77299 ns/op             528 B/op         65 allocs/op

I know some basic prior knowledge of cache locality, it make sense that per element concurrency drops performance. However, why per row still drops the performance even in naive version?

In fact, I also wrote a block/tiling optimization, its vanilla version (without goroutine concurrency) even worse than naive version (not present here, let's focus on naive first).

What did I do wrong here? Why? How to optimize here?

Multiplication:

package naive

import (
    "errors"
    "sync"
)

// Errors
var (
    ErrNumElements = errors.New("Error number of elements")
    ErrMatrixSize  = errors.New("Error size of matrix")
)

// Matrix is a 2d array
type Matrix struct {
    N    int
    data [][]float64
}

// New a size by size matrix
func New(size int) func(...float64) (*Matrix, error) {
    wg := sync.WaitGroup{}
    d := make([][]float64, size)
    for i := range d {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            d[i] = make([]float64, size)
        }(i)
    }
    wg.Wait()
    m := &Matrix{N: size, data: d}
    return func(es ...float64) (*Matrix, error) {
        if len(es) != size*size {
            return nil, ErrNumElements
        }
        for i := range es {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                m.data[i/size][i%size] = es[i]
            }(i)
        }
        wg.Wait()
        return m, nil
    }
}

// At access element (i, j)
func (A *Matrix) At(i, j int) float64 {
    return A.data[i][j]
}

// Set set element (i, j) with val
func (A *Matrix) Set(i, j int, val float64) {
    A.data[i][j] = val
}

// MultNaive matrix multiplication O(n^3)
func (A *Matrix) MultNaive(B, C *Matrix) (err error) {
    var (
        i, j, k int
        sum     float64
        N       = A.N
    )

    if N != B.N || N != C.N {
        return ErrMatrixSize
    }

    for i = 0; i < N; i++ {
        for j = 0; j < N; j++ {
            sum = 0.0
            for k = 0; k < N; k++ {
                sum += A.At(i, k) * B.At(k, j)
            }
            C.Set(i, j, sum)
        }
    }
    return
}

// ParalMultNaivePerRow matrix multiplication O(n^3) in concurrency per row
func (A *Matrix) ParalMultNaivePerRow(B, C *Matrix) (err error) {
    var N = A.N

    if N != B.N || N != C.N {
        return ErrMatrixSize
    }

    wg := sync.WaitGroup{}
    for i := 0; i < N; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            for j := 0; j < N; j++ {
                sum := 0.0
                for k := 0; k < N; k++ {
                    sum += A.At(i, k) * B.At(k, j)
                }
                C.Set(i, j, sum)
            }
        }(i)
    }
    wg.Wait()
    return
}

// ParalMultNaivePerElem matrix multiplication O(n^3) in concurrency per element
func (A *Matrix) ParalMultNaivePerElem(B, C *Matrix) (err error) {
    var N = A.N

    if N != B.N || N != C.N {
        return ErrMatrixSize
    }

    wg := sync.WaitGroup{}
    for i := 0; i < N; i++ {
        for j := 0; j < N; j++ {
            wg.Add(1)
            go func(i, j int) {
                defer wg.Done()
                sum := 0.0
                for k := 0; k < N; k++ {
                    sum += A.At(i, k) * B.At(k, j)
                }
                C.Set(i, j, sum)
            }(i, j)
        }
    }
    wg.Wait()
    return
}

Benchmark:

package naive

import (
    "os"
    "runtime/trace"
    "testing"
)

type Dot func(B, C *Matrix) error

var (
    A = &Matrix{
        N: 8,
        data: [][]float64{
            []float64{1, 2, 3, 4, 5, 6, 7, 8},
            []float64{9, 1, 2, 3, 4, 5, 6, 7},
            []float64{8, 9, 1, 2, 3, 4, 5, 6},
            []float64{7, 8, 9, 1, 2, 3, 4, 5},
            []float64{6, 7, 8, 9, 1, 2, 3, 4},
            []float64{5, 6, 7, 8, 9, 1, 2, 3},
            []float64{4, 5, 6, 7, 8, 9, 1, 2},
            []float64{3, 4, 5, 6, 7, 8, 9, 0},
        },
    }
    B = &Matrix{
        N: 8,
        data: [][]float64{
            []float64{9, 8, 7, 6, 5, 4, 3, 2},
            []float64{1, 9, 8, 7, 6, 5, 4, 3},
            []float64{2, 1, 9, 8, 7, 6, 5, 4},
            []float64{3, 2, 1, 9, 8, 7, 6, 5},
            []float64{4, 3, 2, 1, 9, 8, 7, 6},
            []float64{5, 4, 3, 2, 1, 9, 8, 7},
            []float64{6, 5, 4, 3, 2, 1, 9, 8},
            []float64{7, 6, 5, 4, 3, 2, 1, 0},
        },
    }
    C = &Matrix{
        N: 8,
        data: [][]float64{
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
            []float64{0, 0, 0, 0, 0, 0, 0, 0},
        },
    }
)

func BenchmarkMatrixDotNaive(b *testing.B) {
    f, _ := os.Create("bench.trace")
    defer f.Close()
    trace.Start(f)
    defer trace.Stop()

    tests := []struct {
        name string
        f    Dot
    }{
        {
            name: "A.MultNaive",
            f:    A.MultNaive,
        },
        {
            name: "A.ParalMultNaivePerRow",
            f:    A.ParalMultNaivePerRow,
        },
        {
            name: "A.ParalMultNaivePerElem",
            f:    A.ParalMultNaivePerElem,
        },
    }
    for _, tt := range tests {
        b.Run(tt.name, func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                tt.f(B, C)
            }
        })
    }
}
  • 写回答

1条回答 默认 最新

  • doucan957495 2018-09-19 10:08
    关注

    Performing 8x8 matrix multipliciation is relatively small work.

    Goroutines (although may be lightweight) do have overhead. If the work they do is "small", the overhead of launching, synchronizing and throwing them away may outweight the performance gain of utilizing multiple cores / threads, and overall you might not gain performance by executing such small tasks concurrently (hell, you may even do worse than without using goroutines). Measure.

    If we increase the matrix size to 80x80, running the benchmark we already see some performance gain in case of ParalMultNaivePerRow:

    BenchmarkMatrixDotNaive/A.MultNaive-4               2000     1054775 ns/op
    BenchmarkMatrixDotNaive/A.ParalMultNaivePerRow-4    2000      709367 ns/op
    BenchmarkMatrixDotNaive/A.ParalMultNaivePerElem-4    100    10224927 ns/op
    

    (As you see in the results, I have 4 CPU cores, running it on your 8-core machine might show more performance gain.)

    When rows are small, you are using goroutines to do minimal work, you may improve performance by not "throwing" away goroutines once they're done with their "tiny" work, but you may "reuse" them. See related question: Is this an idiomatic worker thread pool in Go?

    Also see related / possible duplicate: Vectorise a function taking advantage of concurrency

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 c语言怎么用printf(“\b \b”)与getch()实现黑框里写入与删除?
  • ¥20 怎么用dlib库的算法识别小麦病虫害
  • ¥15 华为ensp模拟器中S5700交换机在配置过程中老是反复重启
  • ¥15 java写代码遇到问题,求帮助
  • ¥15 uniapp uview http 如何实现统一的请求异常信息提示?
  • ¥15 有了解d3和topogram.js库的吗?有偿请教
  • ¥100 任意维数的K均值聚类
  • ¥15 stamps做sbas-insar,时序沉降图怎么画
  • ¥15 买了个传感器,根据商家发的代码和步骤使用但是代码报错了不会改,有没有人可以看看
  • ¥15 关于#Java#的问题,如何解决?