weixin_39633917
2020-12-02 15:10

lock rework: try to acquire, let go if any locks fail; retry

There's a potential race condition when working with multiple locks, as we do when locking multiple folders at once. If one process has acquired only part of its set of locks while another process holds a lock that the first still needs, and each waits on the other, we can deadlock.

This PR implements a scheme where we try to acquire locks one-by-one, and if any fail (quickly), let go of all currently held locks in that group. The overall timeout still applies, but it applies to the whole group, not each individual lock.
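The scheme described above can be sketched roughly as follows. This is a minimal illustration using in-process `threading.Lock` objects, not the PR's actual code: the names `acquire_all` and `release_all`, the `retry_delay` parameter, and the choice of a non-blocking attempt for the "quick" failure are all assumptions made for the example.

```python
import threading
import time


def acquire_all(locks, timeout=5.0, retry_delay=0.01):
    """Acquire every lock in `locks`, or none of them.

    Hypothetical helper: returns True once all locks are held, or
    False if `timeout` (which covers the whole group) elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        held = []
        for lock in locks:
            # Non-blocking attempt, so a busy lock fails quickly.
            if lock.acquire(blocking=False):
                held.append(lock)
            else:
                break
        if len(held) == len(locks):
            return True
        # One lock was busy: let go of everything held so far, so that
        # another group waiting on those locks can make progress.
        for lock in reversed(held):
            lock.release()
        if time.monotonic() >= deadline:
            return False
        time.sleep(retry_delay)


def release_all(locks):
    """Release a group previously acquired with acquire_all."""
    for lock in reversed(locks):
        lock.release()
```

Because a group never waits while holding a partial set of locks, two groups cannot block each other indefinitely; the cost is repeated retries under contention.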

Derived from http://stackoverflow.com/questions/9814008/multiple-mutex-locking-strategies-and-why-libraries-dont-use-address-comparison

This also refactors build.py to use the utils module, rather than its small pieces.

This question originates from the open-source project: conda/conda-build

13 answers

  • weixin_39633917 5 months ago

    Sorry, this PR has been a lot of dead ends, and things are still unreliable. I can't tell why conda is removing files out from under me when I have the pkgs folder locked at https://github.com/conda/conda-build/pull/1540/files#diff-ac55d744935b1f37a3369258b88e8f1fR663

    It seems like conda might be generating removal instructions if the md5 of the package in the cache doesn't match. I have no idea how that condition might arise, though, since new packages (for zlib, for example, which seems to crop up often) should not be getting downloaded. I can only trigger this behavior sporadically by running my test suite with a large number of parallel jobs. With 16 jobs, I'm almost guaranteed to see an error. Of course, that makes digging through debug logs interesting.

    I'll try to keep digging as time allows.

  • weixin_39828198 5 months ago

    Thank you for your support! I am also out of ideas; the only thing I can think of now is to try commenting out all the locking.

  • weixin_39633917 5 months ago

    With the latest commit, I'm no longer seeing lock issues locally when running my test suite with 16 parallel jobs. It's not a perfect test, but it might finally be working. Care to give it a shot? There's also the --no-locking CLI option now. It disables most locks. There is still minimal locking done when copying files. I can rework that if necessary, but it means refactoring some parts that are currently independent of configuration.

    I'm very hesitant to just comment all locking out - that means doing more than one build job at a time is dangerous. We do need locks, and we need to get them right.

  • weixin_39828198 5 months ago

    Sure!!! I will give it a try the first thing after I get up in the morning.

  • weixin_39828198 5 months ago

    The first thing I have noticed is that it doesn't play nicely with a so-called multi-user conda setup, in which users don't have permission to write to the conda installation folder by default:

    
    Copying /mnt to /home/jenkins/conda-bld/libssengine_1482435813947/work
    Traceback (most recent call last):
      File "/opt/conda/bin/conda-build", line 6, in <module>
        sys.exit(conda_build.cli.main_build.main())
      File "/opt/conda/lib/python3.5/site-packages/conda_build/cli/main_build.py", line 312, in main
        execute(sys.argv[1:])
      File "/opt/conda/lib/python3.5/site-packages/conda_build/cli/main_build.py", line 304, in execute
        already_built=None, config=config, noverify=args.no_verify)
      File "/opt/conda/lib/python3.5/site-packages/conda_build/api.py", line 86, in build
        need_source_download=need_source_download, config=config)
      File "/opt/conda/lib/python3.5/site-packages/conda_build/build.py", line 1408, in build_tree
        config=config)
      File "/opt/conda/lib/python3.5/site-packages/conda_build/render.py", line 157, in render_recipe
        config=config)
      File "/opt/conda/lib/python3.5/site-packages/conda_build/render.py", line 87, in parse_or_try_download
        source.provide(metadata.path, metadata.get_section('source'), config=config)
      File "/opt/conda/lib/python3.5/site-packages/conda_build/source.py", line 498, in provide
        copy_into(path, config.work_dir, config.timeout, locking=config.locking)
      File "/opt/conda/lib/python3.5/site-packages/conda_build/utils.py", line 118, in copy_into
        merge_tree(src, dst, symlinks, timeout=timeout, lock=lock, locking=locking)
      File "/opt/conda/lib/python3.5/site-packages/conda_build/utils.py", line 232, in merge_tree
        lock = get_lock(src, timeout=timeout)
      File "/opt/conda/lib/python3.5/site-packages/conda_build/utils.py", line 259, in get_lock
        os.makedirs(locks_dir)
      File "/opt/conda/lib/python3.5/os.py", line 241, in makedirs
        mkdir(name, mode)
    PermissionError: [Errno 13] Permission denied: '/opt/conda/locks'

    I will go ahead and make this folder writable to everyone and see if that will do the trick.
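An alternative to making the installation's locks folder world-writable would be to fall back to another location when the preferred one cannot be created or written. The sketch below is purely illustrative and is not what conda-build does: `writable_locks_dir` and the fallback folder name are hypothetical.

```python
import os
import tempfile


def writable_locks_dir(preferred, fallback=None):
    """Return the first candidate directory that exists (or can be
    created) and is writable by the current user.

    Hypothetical helper; the default fallback name is made up.
    """
    if fallback is None:
        fallback = os.path.join(tempfile.gettempdir(),
                                "conda-build-locks-example")
    for candidate in (preferred, fallback):
        try:
            os.makedirs(candidate, exist_ok=True)
        except OSError:
            # e.g. PermissionError when the parent is read-only.
            continue
        if os.access(candidate, os.W_OK):
            return candidate
    raise OSError("no writable lock directory found")
```

Note that locks only protect against processes that use the same directory, so every user sharing the installation would have to resolve to the same fallback for this to be safe.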

  • weixin_39828198 5 months ago

    It seems to work fine now with this minor workaround for the multi-user setup I mentioned above. Thank you!

  • weixin_39633917 5 months ago

    Thanks. I have created a new issue for reviewing multi-user usage. I think there are more issues with it than just this one, and I need to do a full audit by creating that kind of setup locally.

    Issue at https://github.com/conda/conda-build/issues/1601

  • weixin_39633917 5 months ago

    This PR does not seem to improve things. I'm leaving it here until someone has time to implement something better. I'm envisioning some kind of centralized lock server in conda that both conda and conda-build can be clients to.

  • weixin_39633917 5 months ago

    If you have any known means of reproducing a lock failure, please let me know. It will help me troubleshoot this.

  • weixin_39633917 5 months ago

    Also, for the record, the errors that I see and am concerned about as race conditions are things like:

    
    Traceback (most recent call last):
      File "/Users/msarahan/code/conda-build/tests/test_cli.py", line 51, in test_build_add_channel
        main_build.execute(args)
      File "/Users/msarahan/code/conda-build/conda_build/cli/main_build.py", line 240, in execute
        already_built=None, config=config, noverify=args.no_verify)
      File "/Users/msarahan/code/conda-build/conda_build/api.py", line 77, in build
        need_source_download=need_source_download, config=config)
      File "/Users/msarahan/code/conda-build/conda_build/build.py", line 1234, in build_tree
        config=recipe_config)
      File "/Users/msarahan/code/conda-build/conda_build/build.py", line 753, in build
        create_env(config.build_prefix, specs, config=config)
      File "/Users/msarahan/code/conda-build/conda_build/build.py", line 638, in create_env
        plan.execute_actions(actions, index, verbose=config.debug)
      File "/Users/msarahan/miniconda2/lib/python2.7/site-packages/conda/plan.py", line 643, in execute_actions
        inst.execute_instructions(plan, index, verbose)
      File "/Users/msarahan/miniconda2/lib/python2.7/site-packages/conda/instructions.py", line 134, in execute_instructions
        cmd(state, arg)
      File "/Users/msarahan/miniconda2/lib/python2.7/site-packages/conda/instructions.py", line 78, in LINK_CMD
        link(state['prefix'], dist, lt, index=state['index'])
      File "/Users/msarahan/miniconda2/lib/python2.7/site-packages/conda/install.py", line 1008, in link
        create_meta(prefix, dist, info_dir, meta_dict)
      File "/Users/msarahan/miniconda2/lib/python2.7/site-packages/conda/install.py", line 421, in create_meta
        with open(join(info_dir, 'index.json')) as fi:
    IOError: [Errno 2] No such file or directory: u'/Users/msarahan/miniconda2/pkgs/zlib-1.2.8-3/info/index.json'
    

    I think I am locking the pkgs folder, but I haven't been able to make these errors go away.

  • weixin_39828198 5 months ago

    Is there a specific reason to use filelock.SoftFileLock instead of the platform-specific "stronger" implementation, which is aliased to filelock.FileLock? I am not sure, but the "softness" might be the reason the locking doesn't work reliably enough.

  • weixin_39633917 5 months ago

    IIRC, it was because FileLock operates on files that already exist, while SoftFileLock uses the file itself, creating it as necessary. It was a more natural mapping for locking folders. I'll try to create files in some other folder - one per lock location obtained - and try to use filelock.FileLock.
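The "one file per lock location, kept in some other folder" idea could look something like the sketch below. It only derives a stable lock-file path for a folder; `lock_file_for` and the hashing scheme are assumptions for illustration, not code from the PR.

```python
import hashlib
import os


def lock_file_for(folder, locks_dir):
    """Return a lock-file path inside `locks_dir` unique to `folder`.

    Hypothetical helper: the folder's normalized absolute path is
    hashed, so the same folder always maps to the same lock file.
    """
    normalized = os.path.normcase(os.path.abspath(folder))
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    os.makedirs(locks_dir, exist_ok=True)
    return os.path.join(locks_dir, digest + ".lock")
```

The returned path could then be handed to `filelock.FileLock(path, timeout=...)`, so that the lock lives outside the folder being locked.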

  • weixin_39633917 5 months ago

    PS: Also, because that's how conda did it, and I was trying to make conda-build's locks work with conda. Conda 4.3 has removed locks altogether, though, so I guess we're free to pursue other options here.
