weixin_39940154
2021-01-01 05:56

Compilation locks

Running on the cluster I get a lot of fighting over locks. Maybe it would be good to have functionality that allows the first worker to finish compiling (and fill the cache) before the others start?

You basically just use it like this:

```python
worker.start_compilation()
f = theano.function([inputs], [outputs])
# etc.
worker.end_compilation()
```

The first worker to get there goes right on; the others wait until the first worker reaches end_compilation, after which the rest can go.
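A minimal sketch of how such a start/end pair could be implemented (start_compilation/end_compilation are the proposed API, not something platoon currently provides; this version uses a blocking fcntl.flock, so later workers actually take the lock in turn, but by then the cache is warm and each turn is fast):

```python
import fcntl
import os

class CompilationGate:
    """First caller of start() proceeds immediately; later callers block in
    start() until the current holder calls end()."""

    def __init__(self, path="/tmp/compilation.lock"):
        # The lock file must live on a filesystem whose locks are visible
        # to every worker (e.g. a shared NFS/GPFS mount on a cluster).
        self._fd = os.open(path, os.O_CREAT | os.O_RDWR)

    def start(self):
        fcntl.flock(self._fd, fcntl.LOCK_EX)  # blocks until the lock is free

    def end(self):
        fcntl.flock(self._fd, fcntl.LOCK_UN)

gate = CompilationGate()
gate.start()
# f = theano.function([inputs], [outputs])  # compile while holding the gate
gate.end()
```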

This question comes from the open-source project: mila-iqia/platoon


14 replies

  • weixin_39940154 · 4 months ago

    I meant to ask, which update are you referring to? I couldn't find any commits in Theano that seemed to affect the compilation lock, so I wasn't sure what you meant.

  • weixin_39560245 · 4 months ago

    I didn't do a PR; it is in a branch in my fork on GitHub:

    https://github.com/nouiz/Theano/tree/lock

  • weixin_39560245 · 4 months ago

    This isn't always good. If the optimization takes time and the cache is already full, this will slow things down...

    The few good fixes I know:
    - The first time, start one worker to fill the cache, then restart.
    - Make Theano compile faster when the cache is empty.
    - Update the lock? (I'm not sure this is possible.)

    I'm not against this PR if it is optional, but we should not make it mandatory or tell people to always use this.

  • weixin_39940154 · 4 months ago

    > The first time, start one worker to fill the cache, then restart.

    That's kind of what I was trying to do, but it's true that this approach is not very efficient when the cache is already full. Is there a way to check whether the cache is full or not?

    > On which cluster did you have problems with this?

    Helios; my compilation directory was in $RAP.

    > This is mostly what the Theano cache does, but via the file system... So mostly, the gain is bypassing the filesystem? Do you see something else?

    I was just trying to avoid Theano's locking system, because it seemed really inefficient. The waiting period is between 5 and 10 seconds by default, which seems long. The lock itself is two calls to isdir and mkdir, which technically speaking is not atomic and can result in race conditions (I had one process crash with a strange error which I think happened because the cache was corrupt, maybe because of that?).

    What is the motivation for Theano not using fcntl.flock or fcntl.lockf? In the lab our system seems to be NFS4 with local_lock=none, so it should support locking files using fcntl. Hades and a bunch of other clusters use GPFS, which also supports fcntl locks. The only problem seems to be Helios, which uses Lustre but with localflock enabled (so locks are node-local, apparently). Do you think they would be willing to enable global locks? If so, Theano could switch to using blocking fcntl calls, leaving things up to the file system, which is likely to be far more efficient.
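    On those filesystems, a blocking fcntl.lockf wrapper is only a few lines. This is a sketch of the idea rather than Theano's actual code, and the lock-file path is just an illustrative location:

```python
import fcntl
import os
from contextlib import contextmanager

@contextmanager
def fs_lock(path):
    # Blocking fcntl.lockf: the kernel (or the NFSv4/GPFS lock manager)
    # queues waiters, so there is no poll-and-sleep loop on our side.
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        yield
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)

with fs_lock("/tmp/theano_compile.lock"):
    pass  # compile modules here
```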

  • weixin_39746229 · 4 months ago

    I think that folder creation is atomic, and that's the reason why it's used there. As for the global lock, I asked them before to enable it and they said that it would have a huge impact on performance. I then did my research, and that's totally false, so I have plans to ask again with more data in the coming weeks.

    I also agree with Fred; I don't think this PR is the right solution. I think that just launching a dummy job on the test queue to fill the cache first is the best thing to do at the moment.

  • weixin_39940154 · 4 months ago

    Folder creation is, yes, but you first need to test whether it exists already. Between this test and the creation, another process might have created it, and your folder creation will fail; so the locking operation as a whole isn't atomic. It's generally a bad idea to try to reimplement locking when there are system calls that do it for you, but since you asked them about enabling global locking on Lustre, I guess you are thinking the same thing.

    When I looked into this just now, I reached the same conclusion. One paper actually explicitly said:

    > While the Lustre documentation states that the locking mechanism can be disabled for higher performance, we have never observed such improvement by doing so.

  • weixin_39746229 · 4 months ago

    Yes, I totally agree that reimplementing locking is a terrible idea in general; that is why I asked them to fix it around a year ago :P

    Thanks for the paper, I'll add it to the evidence I'll send them.

    As for locking with folder creation, I don't know how it's implemented in Theano, but how about just creating the folder and catching the "already exists" exception, as we do here: https://github.com/SMART-Lab/smartdispatch/pull/100/files
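    The create-and-catch pattern is only a few lines; this is a sketch of what it could look like in place of the isdir-then-mkdir sequence (function names are illustrative, not Theano's):

```python
import os

def try_take_lock(lock_dir):
    """Take the directory lock in one atomic step: just mkdir, and treat
    'already exists' as 'someone else holds it', instead of the racy
    isdir-then-mkdir sequence."""
    try:
        os.mkdir(lock_dir)
        return True           # we created it, so we hold the lock
    except FileExistsError:   # on Python 2: OSError with errno EEXIST
        return False          # another process holds it

def release_lock(lock_dir):
    os.rmdir(lock_dir)
```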

  • weixin_39560245 · 4 months ago

    The directory creation is atomic. In fact, it is the only POSIX operation guaranteed to be atomic on NFS. This is why we use it. We needed that in the past, as we were using NFS3, which didn't have a working global lock.

    Maybe the isdir can be removed to lower the load on the FS. I don't think it will really help, but it should be quick to implement. Do you want to do it?

    The 5-10s shouldn't be a problem. During this time, one process is compiling. When a process gets the lock, it will pick up what the others have done. In fact, to lower the overhead on the OS, and to help with that, we could raise this on Helios to 30-60s.

    , and others here, what do you think of raising the wait time? Do you also think it can help? Bart, do you have the time to clear the cache and try it with a higher wait time? There is a Theano flag for this: compile.wait=30 would do it.
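    For the experiment, the compile.wait flag mentioned above can also be set through the THEANO_FLAGS environment variable before theano is imported, since flags are read at import time (note this simple form overwrites any flags already set; combine multiple flags with commas):

```python
import os

# compile.wait is the number of seconds to wait between attempts to take
# the compilation lock. It must be set before theano is imported.
os.environ["THEANO_FLAGS"] = "compile.wait=30"
# import theano  # would now wait 30s between lock attempts
```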

  • weixin_39978863 · 4 months ago

    The current locking code for theano works very well in all sorts of hostile environments. This is not the case for fcntl() and flock() which both have silent failure cases in some configurations.

    The isdir call is only there as an optimization to avoid doing the mkdir, which is the real "lock" here. As Mathieu said, we use mkdir because it is atomic on NFS (in fact, that is the only guaranteed atomic operation on all NFS versions).

    It might be a tiny bit faster to use an fcntl() lock in environments that support it, but that time would be dwarfed by the time of the compilation itself.

    What might improve performance of the cache is a better index than looping through the directory. This would also reduce the load on the filesystem.

    I am not sure if increasing the wait time is actually going to win anything here.
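    As an illustration of what such an index could look like (a hypothetical sketch, not Theano's actual cache layout): a single JSON file mapping module keys to their subdirectories turns a full directory scan into one file read. The read-modify-write of the index would have to happen under the compilation lock:

```python
import json
import os

INDEX_FILE = "index.json"  # hypothetical index kept inside the cache directory

def _load(cache_dir):
    path = os.path.join(cache_dir, INDEX_FILE)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def lookup(cache_dir, key):
    # One file read instead of os.listdir() plus probing every entry.
    return _load(cache_dir).get(key)

def register(cache_dir, key, module_dir):
    # Read-modify-write of the index; must run while holding the
    # compilation lock so concurrent workers don't clobber each other.
    index = _load(cache_dir)
    index[key] = module_dir
    with open(os.path.join(cache_dir, INDEX_FILE), "w") as f:
        json.dump(index, f)
```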

  • weixin_39560245 · 4 months ago

    The real problem is this:

    Suppose you have an empty cache, launch 1 job, and it takes 10 minutes to fill the cache.

    Then it happens frequently that if you have an empty cache and launch many jobs at the same time, it takes more than 10 minutes for the first job to finish. It can take 30 minutes, 1 hour, or even more.

    I don't understand why this happens. If the problem is fighting for the lock with a non-efficient lock via the FS, fcntl can help, and raising the waiting time could also help, as we would try to take the lock less often, so the same process would keep it longer.

    , what tells you that the isdir() is an optimization? Did you time it? Maybe it is as costly as mkdir. If that is the case, it is not efficient to use it.

    Fred

  • weixin_39978863 · 4 months ago

    I didn't time it. I'm talking about what's there. Maybe the isdir is superfluous and we could go straight for mkdir.

    Also, we should try to see why it takes more time with many processes. Is it because they are rescanning the cache? Is it because they are checking the lock too often? Is it because, combined, they just cause too much filesystem access? Having answers to these questions would be more helpful than just replacing things blindly.

    From timings I did, I know that we spend a large amount of time looking things up by listing the directory in the single-process case. Does that extend to the multi-process case? Is it made worse? That I don't know.

  • weixin_39560245 · 4 months ago

    We rescan each time we take the lock. This is needed to avoid compiling the same module multiple times. I don't think a listdir is slower than putting that information into a file and reading it.

    Making the wait time longer could help by causing less scanning?

    In the past, we took the lock at the start and kept it for the whole compilation process. To avoid taking it when we don't compile C code, I postponed it to only when we need it. But I forget whether we keep it until the end of the C file compilation or not.

  • weixin_39940154 · 4 months ago

    So I just created a little benchmark compiling our machine translation model for CPU on my desktop at home. If I start 4 workers in parallel with an empty cache and Theano's directory locking:

    
    ```
    2016-02-19 11:00:54,316:Worker 1: Finished, took 282.9769949913025
    2016-02-19 11:01:03,866:Worker 0: Finished, took 292.5302128791809
    2016-02-19 11:01:04,291:Worker 3: Finished, took 292.93801403045654
    2016-02-19 11:01:14,747:Worker 2: Finished, took 303.4081165790558
    2016-02-19 11:01:14,819:Completed parallel processing, took 303.4868779182434
    ```
    

    If I start 1 first, and then the other 3:

    
    ```
    2016-02-19 11:05:00,069:Worker 0: Finished, took 225.24447321891785
    ---
    2016-02-19 11:06:40,361:Worker 3: Finished, took 100.25727009773254
    2016-02-19 11:06:42,032:Worker 2: Finished, took 101.92840576171875
    2016-02-19 11:06:50,461:Worker 1: Finished, took 110.35797786712646
    2016-02-19 11:06:50,518:Completed sequential processing, took 335.69908452033997
    ```
    

    If I use fcntl.lockf (see https://github.com/bartvm/Theano/commit/a341aa8f7e7f1fdcd071588b5c804fec1c469e87):

    
    ```
    2016-02-19 11:26:02,181:Worker 0: Finished, took 307.9920198917389
    2016-02-19 11:26:02,206:Worker 3: Finished, took 308.0243444442749
    2016-02-19 11:26:02,271:Worker 1: Finished, took 308.06945419311523
    2016-02-19 11:26:02,283:Worker 2: Finished, took 308.09377932548523
    2016-02-19 11:26:02,348:Completed parallel processing, took 308.1741032600403
    ```

    ```
    2016-02-19 11:29:40,563:Worker 0: Finished, took 218.2124490737915
    ---
    2016-02-19 11:31:19,632:Worker 1: Finished, took 99.03914141654968
    2016-02-19 11:31:20,215:Worker 2: Finished, took 99.62268471717834
    2016-02-19 11:31:20,495:Worker 3: Finished, took 99.9016981124878
    2016-02-19 11:31:20,552:Completed sequential processing, took 318.2032618522644
    ```
    

    In short, the directory locking definitely slows things down (from ~220 to ~300 seconds), but the locking mechanism itself seems to make little difference. I'll try running it on the cluster, see if it gives the same results.
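    The benchmark script itself isn't shown; a minimal harness along these lines might look like the sketch below (threads stand in for the separate worker processes, and build() is a placeholder for the actual Theano compilation):

```python
import threading
import time

def build():
    # Stand-in for the real work, e.g. f = theano.function([inputs], [outputs])
    time.sleep(0.01)

def run_worker(idx, results):
    start = time.time()
    build()
    results[idx] = time.time() - start

def benchmark(n_workers):
    # Launch all workers at once and record each one's wall-clock time,
    # like the parallel runs in the logs above.
    results = [None] * n_workers
    threads = [threading.Thread(target=run_worker, args=(i, results))
               for i in range(n_workers)]
    t0 = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total = time.time() - t0
    return results, total

per_worker, total = benchmark(4)
```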

  • weixin_39560245 · 4 months ago

    Thanks for the timing. So we now know that the lock itself, in the best case (locally), doesn't cause slowdowns.

    Making the same timing on the cluster is a good idea.

    I just checked the code, and we take and release the lock for each thunk we compile. In the past I started the implementation, merged in master, that takes it the first time we need it and releases it only after all the thunks have been compiled. I'll see if I can quickly hack a working version to see if this helps.

    I think that if we can understand what causes the 220s to become 300s and fix it, we would "fix" the problem of it being slower on a cluster.

