dpka7974 2016-03-12 03:24
Viewed 89 times · Accepted

What is the difference between the distribute-work-synchronize and fork-join parallel programming approaches?

In this article on Wikipedia: https://en.wikipedia.org/wiki/Go_(programming_language)#Suitability_for_parallel_programming

It is claimed that the Go expert used a distribute-work-synchronize pattern to organize his parallel programs, while the non-expert used fork-join: https://upload.wikimedia.org/wikipedia/commons/thumb/f/f1/Fork_join.svg/2000px-Fork_join.svg.png

I am familiar with fork-join from school, but I was wondering what the distribute-work-synchronize pattern is, and what the differences are between it and the classical fork-join model that I'm familiar with.

When I do fork-join I typically run as many threads as I have cores. In the paper they say the Go expert did this as well, and they briefly mention the overhead of creating new threads as one of the things the Go expert optimized, but they don't seem to go into detail.


1 Answer

  • drmq16019 2016-03-12 10:56

    I would be very careful about taking the statement you've quoted from https://en.wikipedia.org/wiki/Go_(programming_language)#Suitability_for_parallel_programming as a general truth about Go programming.

    I assume that what's described in the study as distribute-work-synchronize is the approach of dividing a problem into subtasks that are determined mainly by the parallelism achievable in hardware, and less by the natural way the problem decomposes into smaller tasks.
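
    A minimal sketch of that shape, as I understand it (my own illustration with a trivial summing task and a made-up name `sumParallel`, not code from the study): the input is split into one chunk per core up front, each chunk is handled by exactly one goroutine, and everything meets at a single synchronization point.

    ```go
    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    // sumParallel splits data into one contiguous chunk per core,
    // processes each chunk in its own goroutine, and synchronizes
    // once at the end: distribute, work, synchronize.
    // (On pre-1.5 Go you would also call runtime.GOMAXPROCS(workers),
    // since the default was a single OS thread back then.)
    func sumParallel(data []int) int {
        workers := runtime.NumCPU()
        partial := make([]int, workers) // one slot per worker, no shared writes
        chunk := (len(data) + workers - 1) / workers

        var wg sync.WaitGroup
        for w := 0; w < workers; w++ {
            lo := w * chunk
            if lo >= len(data) {
                break
            }
            hi := lo + chunk
            if hi > len(data) {
                hi = len(data)
            }
            wg.Add(1)
            go func(w, lo, hi int) {
                defer wg.Done()
                for _, v := range data[lo:hi] {
                    partial[w] += v
                }
            }(w, lo, hi)
        }
        wg.Wait() // the single synchronization point

        total := 0
        for _, p := range partial {
            total += p
        }
        return total
    }

    func main() {
        data := make([]int, 1000000)
        for i := range data {
            data[i] = i % 10
        }
        fmt.Println(sumParallel(data))
    }
    ```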

    This approach may give you some performance benefits depending on your specific problem and your expertise, but it may not be trivial to apply even for embarrassingly-parallel problems. In addition, this approach is more dependent on the specific hardware you're using (e.g. 2-32 vs 64-1024 cores, regular CAS vs LL/SC), on the specific problem size (e.g. fits in L1 vs barely fits in RAM), and most importantly, on your expertise with that approach and with the tool that you're using to solve your problem.

    The above is standard "premature optimization is the root of all evil" / "simple, adequate and correct trumps complex, super-fast and with an insidious bug" advice, but the actual experiment cited in the paper also gives some examples of why you should use your own judgment about which approach to use:

    1. The study used Go 1.0.3. The significant improvements made since then on the scheduling, garbage-collection, and goroutine/channel performance fronts may have effectively rendered the results obsolete.

      1.1. Ironically, the paper mentions that for one of the Chapel solutions, the expert version was ~68% slower when Chapel 1.6 (instead of 1.5) was used.

    2. The study does not claim to provide statistically significant results - for each of the 4 platforms, a single non-expert solved 6 synthetic problems that fit a specific blueprint, then rewrote his solutions according to the advice of a single expert.

    3. The expert should not be held responsible for his advice being applied outside of its specific context: distribute-work-synchronize was the better approach for these specific problems, if you were a Staff Software Engineer on the Golang team (Luuk van Dijk) and your alternative was plain divide-and-conquer with Go 1.0.3.

    "When I do fork-join I typically run as many threads as I have cores. In the paper they say the Go expert did this as well, and they briefly mention the overhead of creating new threads as one of the things the Go expert optimized, but they don't seem to go into detail."

    I'm assuming the overhead of creating new threads is related to the proliferation of tasks when approaching the bottom of the recursion tree.

    I would think that algorithms that fall into Cases 2 and 3 of the Master Theorem will be particularly affected, but even algorithms that fall into Case 1 (where the work done at the leaf level of the tree is most significant, i.e. where the overhead of the spawned threads is diluted the most) will be affected. For example, making a new logical task for the comparison of each pair of elements in a top-down merge sort is probably redundant.
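
    To make the merge-sort example concrete (a hypothetical sketch of fork-join with a sequential cutoff, using my own name `mergeSortFJ`, not code from the study): spawning a goroutine per recursive call would flood the scheduler near the leaves, so below some threshold the sort falls back to sequential code.

    ```go
    package main

    import (
        "fmt"
        "sort"
    )

    // mergeSortFJ is a fork-join merge sort: fork a goroutine for the
    // left half, recurse on the right half in the current goroutine,
    // then join and merge. The sequential cutoff keeps us from spawning
    // a task per handful of elements near the bottom of the recursion
    // tree, where task-creation overhead would dominate the real work.
    func mergeSortFJ(a []int) {
        const cutoff = 4096 // tuning knob; below this, sort sequentially
        if len(a) <= cutoff {
            sort.Ints(a)
            return
        }
        mid := len(a) / 2
        done := make(chan struct{})
        go func() { // fork: the left half runs concurrently
            mergeSortFJ(a[:mid])
            close(done)
        }()
        mergeSortFJ(a[mid:]) // the current goroutine takes the right half
        <-done               // join

        // Merge the two sorted halves through a scratch buffer.
        buf := make([]int, 0, len(a))
        i, j := 0, mid
        for i < mid && j < len(a) {
            if a[i] <= a[j] {
                buf = append(buf, a[i])
                i++
            } else {
                buf = append(buf, a[j])
                j++
            }
        }
        buf = append(buf, a[i:mid]...)
        buf = append(buf, a[j:]...)
        copy(a, buf)
    }

    func main() {
        a := []int{5, 3, 8, 1, 9, 2, 7, 4, 6, 0}
        mergeSortFJ(a)
        fmt.Println(a)
    }
    ```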

    IMHO, running as many threads as you have cores while still dividing the work logically, in the way natural to the problem or at each recursion level, is a great compromise between my understanding of distribute-work-synchronize and plain divide-and-conquer.
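
    Something like this (my sketch of that compromise, with a made-up helper `runPool` and an arbitrary task function): the problem is still decomposed into N logical tasks in whatever way is natural, but only K = NumCPU goroutines actually execute them.

    ```go
    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    // runPool executes n logical tasks on exactly k worker goroutines:
    // the decomposition stays natural to the problem, while the degree
    // of real parallelism is pinned to the hardware.
    func runPool(n int, task func(i int)) {
        k := runtime.NumCPU()
        tasks := make(chan int)

        var wg sync.WaitGroup
        wg.Add(k)
        for w := 0; w < k; w++ {
            go func() {
                defer wg.Done()
                for i := range tasks { // each worker pulls work until the channel closes
                    task(i)
                }
            }()
        }
        for i := 0; i < n; i++ {
            tasks <- i
        }
        close(tasks)
        wg.Wait()
    }

    func main() {
        squares := make([]int, 100)
        runPool(len(squares), func(i int) { squares[i] = i * i })
        fmt.Println(squares[:10])
    }
    ```

    The unbuffered channel here is doing the task scheduling; it is also exactly where you pay the price discussed next.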

    You are still paying some price in complexity (and probably, but not necessarily, in running time) for the scheduling of your N tasks on the K worker threads. It is possible for this price to cost you in missed parallelization opportunities at runtime, for example because of cache thrashing, because of sub-optimal scheduling, or because of thread or core affinity in play. However, this may be abstracted away from your program at the language-platform level, at the Fork-Join framework level (in case you are using a framework), or maybe at the OS level. In this case, your solution is not fully responsible for adapting to changes in your language platform, in your problem size, or in your machine hardware, and you should be able to benefit from optimizations in the underlying layers without touching your solution.

    My bet is that the increased complexity and the reduced portability of a custom-tailored distribute-work-synchronize solution are worth it only if you can prove that your natural divide-and-conquer solution is not enough and you are aware of the consequences.

    This answer was accepted by the asker.
