dongliqin6939 2016-10-02 09:44
126 views
Accepted

Why is locking in Go so much slower than in Java? Mutex.Lock() and Mutex.Unlock() consume a huge amount of time

I've written a small Go library (go-patan) that collects a running min/max/avg/stddev of certain variables. I compared it to an equivalent Java implementation (patan), and to my surprise the Java implementation is much faster. I would like to understand why.

The library basically consists of a simple data store with a lock that serializes reads and writes. This is a snippet of the code:

type Store struct {
    durations map[string]*Distribution
    counters  map[string]int64
    samples   map[string]*Distribution

    lock *sync.Mutex
}

func (store *Store) addSample(key string, value int64) {
    store.addToStore(store.samples, key, value)
}

func (store *Store) addDuration(key string, value int64) {
    store.addToStore(store.durations, key, value)
}

func (store *Store) addToCounter(key string, value int64) {
    store.lock.Lock()
    defer store.lock.Unlock()
    store.counters[key] = store.counters[key] + value
}

func (store *Store) addToStore(destination map[string]*Distribution, key string, value int64) {
    store.lock.Lock()
    defer store.lock.Unlock()
    distribution, exists := destination[key]
    if !exists {
        distribution = NewDistribution()
        destination[key] = distribution
    }
    distribution.addSample(value)
}
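One detail worth noting: every call above pays for a `defer` on top of the lock itself, and the profile further down shows `runtime.deferreturn`, `runtime.newdefer`, and `runtime.freedefer` among the top entries. A hedged sketch of the same pattern with an explicit unlock (hypothetical `Counter` type, not go-patan's API; it assumes the critical section cannot panic, since a panic between Lock and Unlock would leave the mutex held):

```go
package main

import (
	"fmt"
	"sync"
)

// Counter is an illustrative stand-in for the counters map in Store.
type Counter struct {
	mu     sync.Mutex
	counts map[string]int64
}

// add unlocks explicitly instead of via defer, avoiding per-call
// defer bookkeeping at the cost of panic safety.
func (c *Counter) add(key string, value int64) {
	c.mu.Lock()
	c.counts[key] += value
	c.mu.Unlock() // explicit unlock, no defer
}

func main() {
	c := &Counter{counts: make(map[string]int64)}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				c.add("hits", 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println(c.counts["hits"]) // 10000
}
```

Whether this helps measurably depends on the Go version; it only trims per-call overhead and does nothing about the contention itself.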

I've benchmarked the Go and Java implementations (go-benchmark-gist, java-benchmark-gist) and Java wins by far, but I don't understand why:

Go Results:
10 threads with 20000 items took 133 millis
100 threads with 20000 items took 1809 millis
1000 threads with 20000 items took 17576 millis
10 threads with 200000 items took 1228 millis
100 threads with 200000 items took 17900 millis

Java Results:
10 threads with 20000 items takes 89 millis
100 threads with 20000 items takes 265 millis
1000 threads with 20000 items takes 2888 millis  
10 threads with 200000 items takes 311 millis
100 threads with 200000 items takes 3067 millis

I've profiled the program with Go's pprof and generated a call graph (call-graph). It shows that the program spends essentially all of its time in sync.(*Mutex).Lock() and sync.(*Mutex).Unlock().

The top 20 calls according to the profiler:

(pprof) top20
59110ms of 73890ms total (80.00%)
Dropped 22 nodes (cum <= 369.45ms)
Showing top 20 nodes out of 65 (cum >= 50220ms)
      flat  flat%   sum%        cum   cum%
    8900ms 12.04% 12.04%     8900ms 12.04%  runtime.futex
    7270ms  9.84% 21.88%     7270ms  9.84%  runtime/internal/atomic.Xchg
    7020ms  9.50% 31.38%     7020ms  9.50%  runtime.procyield
    4560ms  6.17% 37.56%     4560ms  6.17%  sync/atomic.CompareAndSwapUint32
    4400ms  5.95% 43.51%     4400ms  5.95%  runtime/internal/atomic.Xadd
    4210ms  5.70% 49.21%    22040ms 29.83%  runtime.lock
    3650ms  4.94% 54.15%     3650ms  4.94%  runtime/internal/atomic.Cas
    3260ms  4.41% 58.56%     3260ms  4.41%  runtime/internal/atomic.Load
    2220ms  3.00% 61.56%    22810ms 30.87%  sync.(*Mutex).Lock
    1870ms  2.53% 64.10%     1870ms  2.53%  runtime.osyield
    1540ms  2.08% 66.18%    16740ms 22.66%  runtime.findrunnable
    1430ms  1.94% 68.11%     1430ms  1.94%  runtime.freedefer
    1400ms  1.89% 70.01%     1400ms  1.89%  sync/atomic.AddUint32
    1250ms  1.69% 71.70%     1250ms  1.69%  github.com/toefel18/go-patan/statistics/lockbased.(*Distribution).addSample
    1240ms  1.68% 73.38%     3140ms  4.25%  runtime.deferreturn
    1070ms  1.45% 74.83%     6520ms  8.82%  runtime.systemstack
    1010ms  1.37% 76.19%     1010ms  1.37%  runtime.newdefer
    1000ms  1.35% 77.55%     1000ms  1.35%  runtime.mapaccess1_faststr
     950ms  1.29% 78.83%    15660ms 21.19%  runtime.semacquire
     860ms  1.16% 80.00%    50220ms 67.97%  main.Benchmrk.func1

Can someone explain why locking in Go seems to be so much slower than in Java? What am I doing wrong? I've also written a channel-based implementation in Go, but that is even slower.
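Since the profile points at a single heavily contended mutex, one standard mitigation (a sketch of the general technique, not go-patan's actual API) is to shard the store so that concurrent writers usually take different locks:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const nShards = 16

// shardedCounters spreads keys over several independently locked maps,
// so goroutines writing different keys rarely contend on the same mutex.
// Illustrative only; names and layout are assumptions, not go-patan code.
type shardedCounters struct {
	shards [nShards]struct {
		mu     sync.Mutex
		counts map[string]int64
	}
}

func newShardedCounters() *shardedCounters {
	s := &shardedCounters{}
	for i := range s.shards {
		s.shards[i].counts = make(map[string]int64)
	}
	return s
}

// shardFor hashes the key to pick a shard deterministically.
func (s *shardedCounters) shardFor(key string) *struct {
	mu     sync.Mutex
	counts map[string]int64
} {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &s.shards[h.Sum32()%nShards]
}

func (s *shardedCounters) add(key string, value int64) {
	shard := s.shardFor(key)
	shard.mu.Lock()
	shard.counts[key] += value
	shard.mu.Unlock()
}

func (s *shardedCounters) get(key string) int64 {
	shard := s.shardFor(key)
	shard.mu.Lock()
	defer shard.mu.Unlock()
	return shard.counts[key]
}

func main() {
	s := newShardedCounters()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			key := fmt.Sprintf("k%d", id)
			for j := 0; j < 1000; j++ {
				s.add(key, 1)
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(s.get("k0")) // 1000
}
```

Sharding only pays off when the workload touches many distinct keys; if every goroutine hammers the same key, all traffic still lands on one shard.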

2 answers

  • donglanzhan7151 2016-10-03 20:41

    I've also posted this question on the golang-nuts group. The reply from Jesper Louis Andersen explains quite well that Java uses synchronization optimization techniques such as lock escape analysis/lock elision and lock coarsening.

The Java JIT might be taking the lock once and allowing multiple updates within that single lock acquisition to increase performance. I ran the Java benchmark with -Djava.compiler=NONE, which caused a dramatic performance drop, but disabling the JIT is not a fair comparison either.

    I assume that many of these optimization techniques have less impact in a production environment.
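The lock coarsening the JVM applies automatically can be imitated by hand in Go: accumulate updates in a goroutine-local buffer and take the mutex once per batch instead of once per sample. A hedged sketch (hypothetical `store`/`flushBatch` names, not part of either library):

```go
package main

import (
	"fmt"
	"sync"
)

// store is an illustrative shared counter map guarded by one mutex.
type store struct {
	mu     sync.Mutex
	counts map[string]int64
}

// flushBatch applies many updates under a single Lock/Unlock pair,
// the manual equivalent of the JIT's lock coarsening.
func (s *store) flushBatch(batch map[string]int64) {
	s.mu.Lock()
	for k, v := range batch {
		s.counts[k] += v
	}
	s.mu.Unlock()
}

func main() {
	s := &store{counts: make(map[string]int64)}
	var wg sync.WaitGroup
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			batch := make(map[string]int64)
			for i := 0; i < 10000; i++ {
				batch["hits"]++ // local, lock-free accumulation
				if i%1000 == 999 {
					s.flushBatch(batch) // one lock per 1000 updates
					batch = make(map[string]int64)
				}
			}
			s.flushBatch(batch) // flush any remainder
		}()
	}
	wg.Wait()
	fmt.Println(s.counts["hits"]) // 40000
}
```

The trade-off is staleness: readers only see a batch once it is flushed, which may or may not be acceptable for a live statistics store.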

    This answer was accepted by the asker.
