Why does sync.Mutex performance drop sharply when more than about 3400 goroutines contend for it?

I am comparing the performance of sync.Mutex and Go channels. Here is my benchmark:

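// go playground: https://play.golang.org/p/f_u9jHBq_Jc
package bench // package name is arbitrary; save the file as *_test.go

import (
    "fmt"
    "runtime"
    "sync"
    "testing"
)

const (
    start = 300 // actual = start * goprocs
    end   = 600 // actual = end   * goprocs
    step  = 10
)

var goprocs = runtime.GOMAXPROCS(0) // 8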

// https://perf.golang.org/search?q=upload:20190819.3
func BenchmarkChanWrite(b *testing.B) {
    var v int64
    ch := make(chan int, 1)
    ch <- 1
    for i := start; i < end; i += step {
        b.Run(fmt.Sprintf("goroutines-%d", i*goprocs), func(b *testing.B) {
            b.SetParallelism(i)
            b.RunParallel(func(pb *testing.PB) {
                for pb.Next() {
                    <-ch
                    v += 1
                    ch <- 1
                }
            })
        })
    }
}

// https://perf.golang.org/search?q=upload:20190819.2
func BenchmarkMutexWrite(b *testing.B) {
    var v int64
    mu := sync.Mutex{}
    for i := start; i < end; i += step {
        b.Run(fmt.Sprintf("goroutines-%d", i*goprocs), func(b *testing.B) {
            b.SetParallelism(i)
            b.RunParallel(func(pb *testing.PB) {
                for pb.Next() {
                    mu.Lock()
                    v += 1
                    mu.Unlock()
                }
            })
        })
    }
}

The performance comparison visualization is as follows:

What are the reasons that:

1. sync.Mutex suffers a large performance drop once the number of goroutines exceeds roughly 3400?
2. Go channels are fairly stable, but slower than sync.Mutex below that point?

Raw benchmark data from benchstat (go test -bench=. -count=5), go version go1.12.4 linux/amd64:

MutexWrite/goroutines-2400-8  48.6ns ± 1%
MutexWrite/goroutines-2480-8  49.1ns ± 0%
MutexWrite/goroutines-2560-8  49.7ns ± 1%
MutexWrite/goroutines-2640-8  50.5ns ± 3%
MutexWrite/goroutines-2720-8  50.9ns ± 2%
MutexWrite/goroutines-2800-8  51.8ns ± 3%
MutexWrite/goroutines-2880-8  52.5ns ± 2%
MutexWrite/goroutines-2960-8  54.1ns ± 4%
MutexWrite/goroutines-3040-8  54.5ns ± 2%
MutexWrite/goroutines-3120-8  56.1ns ± 3%
MutexWrite/goroutines-3200-8  63.2ns ± 5%
MutexWrite/goroutines-3280-8  77.5ns ± 6%
MutexWrite/goroutines-3360-8   141ns ± 6%
MutexWrite/goroutines-3440-8   239ns ± 8%
MutexWrite/goroutines-3520-8   248ns ± 3%
MutexWrite/goroutines-3600-8   254ns ± 2%
MutexWrite/goroutines-3680-8   256ns ± 1%
MutexWrite/goroutines-3760-8   261ns ± 2%
MutexWrite/goroutines-3840-8   266ns ± 3%
MutexWrite/goroutines-3920-8   276ns ± 3%
MutexWrite/goroutines-4000-8   278ns ± 3%
MutexWrite/goroutines-4080-8   286ns ± 5%
MutexWrite/goroutines-4160-8   293ns ± 4%
MutexWrite/goroutines-4240-8   295ns ± 2%
MutexWrite/goroutines-4320-8   280ns ± 8%
MutexWrite/goroutines-4400-8   294ns ± 9%
MutexWrite/goroutines-4480-8   285ns ±10%
MutexWrite/goroutines-4560-8   290ns ± 8%
MutexWrite/goroutines-4640-8   271ns ± 3%
MutexWrite/goroutines-4720-8   271ns ± 4%

ChanWrite/goroutines-2400-8  158ns ± 3%
ChanWrite/goroutines-2480-8  159ns ± 2%
ChanWrite/goroutines-2560-8  161ns ± 2%
ChanWrite/goroutines-2640-8  161ns ± 1%
ChanWrite/goroutines-2720-8  163ns ± 1%
ChanWrite/goroutines-2800-8  166ns ± 3%
ChanWrite/goroutines-2880-8  168ns ± 1%
ChanWrite/goroutines-2960-8  176ns ± 4%
ChanWrite/goroutines-3040-8  176ns ± 2%
ChanWrite/goroutines-3120-8  180ns ± 1%
ChanWrite/goroutines-3200-8  180ns ± 1%
ChanWrite/goroutines-3280-8  181ns ± 2%
ChanWrite/goroutines-3360-8  183ns ± 2%
ChanWrite/goroutines-3440-8  188ns ± 3%
ChanWrite/goroutines-3520-8  190ns ± 2%
ChanWrite/goroutines-3600-8  193ns ± 2%
ChanWrite/goroutines-3680-8  196ns ± 3%
ChanWrite/goroutines-3760-8  199ns ± 2%
ChanWrite/goroutines-3840-8  206ns ± 2%
ChanWrite/goroutines-3920-8  209ns ± 2%
ChanWrite/goroutines-4000-8  206ns ± 2%
ChanWrite/goroutines-4080-8  209ns ± 2%
ChanWrite/goroutines-4160-8  208ns ± 2%
ChanWrite/goroutines-4240-8  209ns ± 3%
ChanWrite/goroutines-4320-8  213ns ± 2%
ChanWrite/goroutines-4400-8  209ns ± 2%
ChanWrite/goroutines-4480-8  211ns ± 1%
ChanWrite/goroutines-4560-8  213ns ± 2%
ChanWrite/goroutines-4640-8  215ns ± 1%
ChanWrite/goroutines-4720-8  218ns ± 3%

Go 1.12.4. Hardware:

CPU:       Quad core Intel Core i7-7700 (-MT-MCP-) cache: 8192 KB
           clock speeds: max: 4200 MHz 1: 1109 MHz 2: 3641 MHz 3: 3472 MHz 4: 3514 MHz 5: 3873 MHz 6: 3537 MHz
           7: 3410 MHz 8: 3016 MHz
           CPU Flags: 3dnowprefetch abm acpi adx aes aperfmperf apic arat arch_perfmon art avx avx2 bmi1 bmi2
           bts clflush clflushopt cmov constant_tsc cpuid cpuid_fault cx16 cx8 de ds_cpl dtes64 dtherm dts epb
           ept erms est f16c flexpriority flush_l1d fma fpu fsgsbase fxsr hle ht hwp hwp_act_window hwp_epp
           hwp_notify ibpb ibrs ida intel_pt invpcid invpcid_single lahf_lm lm mca mce md_clear mmx monitor
           movbe mpx msr mtrr nonstop_tsc nopl nx pae pat pbe pcid pclmulqdq pdcm pdpe1gb pebs pge pln pni
           popcnt pse pse36 pti pts rdrand rdseed rdtscp rep_good rtm sdbg sep smap smep smx ss ssbd sse sse2
           sse4_1 sse4_2 ssse3 stibp syscall tm tm2 tpr_shadow tsc tsc_adjust tsc_deadline_timer tsc_known_freq
           vme vmx vnmi vpid x2apic xgetbv1 xsave xsavec xsaveopt xsaves xtopology xtpr

Update: I tested on different hardware. It seems the problem still exists:

bench: https://play.golang.org/p/HnQ44--E4UQ

Update:

My full benchmark tests from 8 to 15000 goroutines and includes a comparison of chan, sync.Mutex, and atomic.

2 answers

dsfykqq3403 2019-08-27 07:56
sync.Mutex is implemented on top of the runtime semaphore. The reason it suffers such a large performance drop lies in the implementation of runtime.semacquire1.

Let's sample two representative points and run go tool pprof with the number of goroutines equal to 2400 and 4800:

goos: linux
goarch: amd64
BenchmarkMutexWrite/goroutines-2400-8    50000000    46.5 ns/op
PASS
ok    2.508s
BenchmarkMutexWrite/goroutines-4800-8    50000000    317 ns/op
PASS
ok    16.020s

2400:

4800:

As we can see, once the number of goroutines reaches 4800, the overhead of runtime.gopark becomes dominant. Let's dig into the runtime source code and see what exactly calls runtime.gopark. In runtime.semacquire1:

func semacquire1(addr *uint32, lifo bool, profile semaProfileFlags, skipframes int) {
    // fast path
    if cansemacquire(addr) {
        return
    }

    s := acquireSudog()
    root := semroot(addr)
    ...
    for {
        lock(&root.lock)
        atomic.Xadd(&root.nwait, 1)
        if cansemacquire(addr) {
            atomic.Xadd(&root.nwait, -1)
            unlock(&root.lock)
            break
        }

        // slow path
        root.queue(addr, s, lifo)
        goparkunlock(&root.lock, waitReasonSemacquire, traceEvGoBlockSync, 4+skipframes)
        if s.ticket != 0 || cansemacquire(addr) {
            break
        }
    }
    ...
}

Based on the pprof graph we presented above, we can conclude that:

Observation: with 2400 goroutines, runtime.gopark is rarely called, while runtime.mutex is called heavily. We infer that most acquisitions complete before reaching the slow path.

Observation: with 4800 goroutines, runtime.gopark is called heavily. We infer that most acquisitions enter the slow path, and once runtime.gopark is involved, the context-switching cost of the runtime scheduler has to be taken into account.

Channels in Go, by contrast, are implemented on top of OS synchronization primitives (e.g. futex on Linux) without involving the runtime scheduler. Their performance therefore degrades roughly linearly as the problem size grows.

The above explains the reason why we see a massive performance decrease in sync.Mutex.
This answer was selected by the asker as the best answer.
