Why does sync.Mutex performance drop sharply when goroutine contention exceeds 3400?

I am comparing the performance of sync.Mutex and Go channels. Here is my benchmark:

// go playground: https://play.golang.org/p/f_u9jHBq_Jc
package main

import (
    "fmt"
    "runtime"
    "sync"
    "testing"
)

const (
    start = 300 // actual = start * goprocs
    end   = 600 // actual = end   * goprocs
    step  = 10
)

var goprocs = runtime.GOMAXPROCS(0) // 8

// https://perf.golang.org/search?q=upload:20190819.3
func BenchmarkChanWrite(b *testing.B) {
    var v int64
    ch := make(chan int, 1)
    ch <- 1
    for i := start; i < end; i += step {
        b.Run(fmt.Sprintf("goroutines-%d", i*goprocs), func(b *testing.B) {
            b.SetParallelism(i)
            b.RunParallel(func(pb *testing.PB) {
                for pb.Next() {
                    <-ch
                    v += 1
                    ch <- 1
                }
            })
        })
    }
}

// https://perf.golang.org/search?q=upload:20190819.2
func BenchmarkMutexWrite(b *testing.B) {
    var v int64
    mu := sync.Mutex{}
    for i := start; i < end; i += step {
        b.Run(fmt.Sprintf("goroutines-%d", i*goprocs), func(b *testing.B) {
            b.SetParallelism(i)
            b.RunParallel(func(pb *testing.PB) {
                for pb.Next() {
                    mu.Lock()
                    v += 1
                    mu.Unlock()
                }
            })
        })
    }
}

The performance comparison visualization is as follows:

What are the reasons that:

1. sync.Mutex encounters a large performance drop when the number of goroutines goes higher than roughly 3400?
2. Go channels are pretty stable, but slower than sync.Mutex below that point?

Raw benchmark data from benchstat (go test -bench=. -count=5), go version go1.12.4 linux/amd64:

MutexWrite/goroutines-2400-8  48.6ns ± 1%
MutexWrite/goroutines-2480-8  49.1ns ± 0%
MutexWrite/goroutines-2560-8  49.7ns ± 1%
MutexWrite/goroutines-2640-8  50.5ns ± 3%
MutexWrite/goroutines-2720-8  50.9ns ± 2%
MutexWrite/goroutines-2800-8  51.8ns ± 3%
MutexWrite/goroutines-2880-8  52.5ns ± 2%
MutexWrite/goroutines-2960-8  54.1ns ± 4%
MutexWrite/goroutines-3040-8  54.5ns ± 2%
MutexWrite/goroutines-3120-8  56.1ns ± 3%
MutexWrite/goroutines-3200-8  63.2ns ± 5%
MutexWrite/goroutines-3280-8  77.5ns ± 6%
MutexWrite/goroutines-3360-8   141ns ± 6%
MutexWrite/goroutines-3440-8   239ns ± 8%
MutexWrite/goroutines-3520-8   248ns ± 3%
MutexWrite/goroutines-3600-8   254ns ± 2%
MutexWrite/goroutines-3680-8   256ns ± 1%
MutexWrite/goroutines-3760-8   261ns ± 2%
MutexWrite/goroutines-3840-8   266ns ± 3%
MutexWrite/goroutines-3920-8   276ns ± 3%
MutexWrite/goroutines-4000-8   278ns ± 3%
MutexWrite/goroutines-4080-8   286ns ± 5%
MutexWrite/goroutines-4160-8   293ns ± 4%
MutexWrite/goroutines-4240-8   295ns ± 2%
MutexWrite/goroutines-4320-8   280ns ± 8%
MutexWrite/goroutines-4400-8   294ns ± 9%
MutexWrite/goroutines-4480-8   285ns ±10%
MutexWrite/goroutines-4560-8   290ns ± 8%
MutexWrite/goroutines-4640-8   271ns ± 3%
MutexWrite/goroutines-4720-8   271ns ± 4%

ChanWrite/goroutines-2400-8  158ns ± 3%
ChanWrite/goroutines-2480-8  159ns ± 2%
ChanWrite/goroutines-2560-8  161ns ± 2%
ChanWrite/goroutines-2640-8  161ns ± 1%
ChanWrite/goroutines-2720-8  163ns ± 1%
ChanWrite/goroutines-2800-8  166ns ± 3%
ChanWrite/goroutines-2880-8  168ns ± 1%
ChanWrite/goroutines-2960-8  176ns ± 4%
ChanWrite/goroutines-3040-8  176ns ± 2%
ChanWrite/goroutines-3120-8  180ns ± 1%
ChanWrite/goroutines-3200-8  180ns ± 1%
ChanWrite/goroutines-3280-8  181ns ± 2%
ChanWrite/goroutines-3360-8  183ns ± 2%
ChanWrite/goroutines-3440-8  188ns ± 3%
ChanWrite/goroutines-3520-8  190ns ± 2%
ChanWrite/goroutines-3600-8  193ns ± 2%
ChanWrite/goroutines-3680-8  196ns ± 3%
ChanWrite/goroutines-3760-8  199ns ± 2%
ChanWrite/goroutines-3840-8  206ns ± 2%
ChanWrite/goroutines-3920-8  209ns ± 2%
ChanWrite/goroutines-4000-8  206ns ± 2%
ChanWrite/goroutines-4080-8  209ns ± 2%
ChanWrite/goroutines-4160-8  208ns ± 2%
ChanWrite/goroutines-4240-8  209ns ± 3%
ChanWrite/goroutines-4320-8  213ns ± 2%
ChanWrite/goroutines-4400-8  209ns ± 2%
ChanWrite/goroutines-4480-8  211ns ± 1%
ChanWrite/goroutines-4560-8  213ns ± 2%
ChanWrite/goroutines-4640-8  215ns ± 1%
ChanWrite/goroutines-4720-8  218ns ± 3%

Go 1.12.4. Hardware:

CPU:       Quad core Intel Core i7-7700 (-MT-MCP-) cache: 8192 KB
           clock speeds: max: 4200 MHz 1: 1109 MHz 2: 3641 MHz 3: 3472 MHz 4: 3514 MHz 5: 3873 MHz 6: 3537 MHz
           7: 3410 MHz 8: 3016 MHz
           CPU Flags: 3dnowprefetch abm acpi adx aes aperfmperf apic arat arch_perfmon art avx avx2 bmi1 bmi2
           bts clflush clflushopt cmov constant_tsc cpuid cpuid_fault cx16 cx8 de ds_cpl dtes64 dtherm dts epb
           ept erms est f16c flexpriority flush_l1d fma fpu fsgsbase fxsr hle ht hwp hwp_act_window hwp_epp
           hwp_notify ibpb ibrs ida intel_pt invpcid invpcid_single lahf_lm lm mca mce md_clear mmx monitor
           movbe mpx msr mtrr nonstop_tsc nopl nx pae pat pbe pcid pclmulqdq pdcm pdpe1gb pebs pge pln pni
           popcnt pse pse36 pti pts rdrand rdseed rdtscp rep_good rtm sdbg sep smap smep smx ss ssbd sse sse2
           sse4_1 sse4_2 ssse3 stibp syscall tm tm2 tpr_shadow tsc tsc_adjust tsc_deadline_timer tsc_known_freq
           vme vmx vnmi vpid x2apic xgetbv1 xsave xsavec xsaveopt xsaves xtopology xtpr

Update: I tested on different hardware. It seems the problem still exists:

bench: https://play.golang.org/p/HnQ44--E4UQ

Update:

My full benchmark, which covers 8 to 15000 goroutines and compares chan, sync.Mutex, and atomic:

dsfykqq3403 2019-08-27 07:56
sync.Mutex's implementation is based on a runtime semaphore. The massive performance drop comes from the implementation of runtime.semacquire1.

Now, let's sample two representative points. We run go tool pprof with the number of goroutines equal to 2400 and to 4800:

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-2400-8    50000000    46.5 ns/op
PASS
ok    2.508s
BenchmarkMutexWrite/goroutines-4800-8    50000000    317 ns/op
PASS
ok    16.020s

2400:

4800:

As we can see, when the number of goroutines increases to 4800, the overhead of runtime.gopark becomes dominant. Let's dig into the runtime source code and see what exactly calls runtime.gopark. In runtime.semacquire1:

func semacquire1(addr *uint32, lifo bool, profile semaProfileFlags, skipframes int) {
    // fast path
    if cansemacquire(addr) {
        return
    }

    s := acquireSudog()
    root := semroot(addr)
    ...
    for {
        lock(&root.lock)
        atomic.Xadd(&root.nwait, 1)
        if cansemacquire(addr) {
            atomic.Xadd(&root.nwait, -1)
            unlock(&root.lock)
            break
        }

        // slow path
        root.queue(addr, s, lifo)
        goparkunlock(&root.lock, waitReasonSemacquire, traceEvGoBlockSync, 4+skipframes)
        if s.ticket != 0 || cansemacquire(addr) {
            break
        }
    }
    ...
}
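To make the fast-path / slow-path split easier to see, here is a toy user-space semaphore with the same shape. It is only an illustration under stated assumptions and not the runtime's code: toySema, tryAcquire, slowHits, and run are hypothetical names, and a sync.Cond stands in for the runtime's wait queue and goparkunlock.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// toySema is a hypothetical semaphore used only to illustrate the
// fast-path / slow-path structure. It is NOT the runtime implementation.
type toySema struct {
	permits  int32 // available permits, accessed atomically
	mu       sync.Mutex
	cond     *sync.Cond
	slowHits int64 // how many acquires entered the slow path
}

func newToySema(permits int32) *toySema {
	s := &toySema{permits: permits}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// tryAcquire plays the role of cansemacquire: take a permit with a
// compare-and-swap if one is available, without blocking.
func (s *toySema) tryAcquire() bool {
	for {
		v := atomic.LoadInt32(&s.permits)
		if v == 0 {
			return false
		}
		if atomic.CompareAndSwapInt32(&s.permits, v, v-1) {
			return true
		}
	}
}

func (s *toySema) acquire() {
	// fast path: no lock, no blocking
	if s.tryAcquire() {
		return
	}
	// slow path: re-check under the lock, then block until signalled
	atomic.AddInt64(&s.slowHits, 1)
	s.mu.Lock()
	for !s.tryAcquire() {
		s.cond.Wait() // stand-in for parking the goroutine
	}
	s.mu.Unlock()
}

func (s *toySema) release() {
	atomic.AddInt32(&s.permits, 1)
	s.mu.Lock()
	s.cond.Signal() // wake one waiter, if any
	s.mu.Unlock()
}

// run lets n goroutines acquire a one-permit semaphore iters times each.
// It returns the number of acquisitions and of slow-path entries.
func run(n, iters int) (total, slow int64) {
	s := newToySema(1)
	var wg sync.WaitGroup
	wg.Add(n)
	for g := 0; g < n; g++ {
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				s.acquire()
				total++ // protected by the single permit
				s.release()
			}
		}()
	}
	wg.Wait()
	return total, atomic.LoadInt64(&s.slowHits)
}

func main() {
	total, slow := run(64, 1000)
	fmt.Printf("acquisitions=%d slow-path entries=%d\n", total, slow)
}
```

The re-check under the lock before waiting mirrors the second cansemacquire call in the loop above: it closes the window in which a permit is released between the failed fast path and the moment the waiter actually blocks.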

Based on the pprof graph we presented above, we can conclude that:

Observation: with 2400 goroutines, runtime.gopark is rarely called, while runtime.mutex is called heavily. We infer that most acquisitions complete before reaching the slow path.

Observation: with 4800 goroutines, runtime.gopark is called heavily. We infer that most acquisitions enter the slow path, and once runtime.gopark is involved, the context-switching cost of the runtime scheduler must be taken into account.
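One rough way to probe these two observations from user code, as a sketch only: count how often a lock attempt finds the mutex already held. This assumes Go 1.18 or newer for sync.Mutex.TryLock (the measurements above used Go 1.12.4), and countContended is a hypothetical helper. A failed TryLock is not the same thing as entering the semaphore slow path, since Lock may still spin before parking, so the counter is only an upper-bound style indicator.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countContended runs n goroutines that each take the mutex iters times.
// It returns the total number of acquisitions and how many of them found
// the mutex already held on the first, non-blocking attempt.
func countContended(n, iters int) (total, contended int64) {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	wg.Add(n)
	for g := 0; g < n; g++ {
		go func() {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				if !mu.TryLock() { // requires Go 1.18 or newer
					atomic.AddInt64(&contended, 1)
					mu.Lock() // may spin, then park on the semaphore
				}
				total++ // protected by mu
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return total, atomic.LoadInt64(&contended)
}

func main() {
	for _, n := range []int{2400, 4800} {
		total, contended := countContended(n, 1000)
		fmt.Printf("goroutines=%d acquisitions=%d contended=%d\n", n, total, contended)
	}
}
```

The numbers depend heavily on the machine, GOMAXPROCS, and the Go version, so they are best read alongside a CPU profile and not as a replacement for one.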

Channels in Go, by contrast, guard their internal state with a runtime lock built on OS synchronization primitives (e.g. futex on Linux), so their performance degrades roughly linearly as the problem size grows.

The above explains the reason why we see a massive performance decrease in sync.Mutex.
This answer was selected by the asker as the best answer.
