weixin_39638014
weixin_39638014
2021-01-11 08:53

Add executor locks to unified executor

This change adds the executor locks to prevent concurrent non-commutative changes to replicas, and avoid deadlocks between multi-shard DML to the unified executor.

该提问来源于开源项目:citusdata/citus

  • 点赞
  • 写回答
  • 关注问题
  • 收藏
  • 复制链接分享
  • 邀请回答

10条回答

  • weixin_39595430 weixin_39595430 4月前

    I've tried to look deeper into the isolation test failures on top of this branch and all the PRs that are open against unified_executor. I think we're in a pretty good spot with this simple change:

    | Failing Test | Fix difficulty | Notes | | ------------- | ------------- | ------------- | | isolation_concurrent_dml and isolation_insert_vs_all |Simple | Failing due to multi row inserts, but note that in #2735 we do StartExecution once so it this PR does not directly solve it | | isolation_distributed_transaction_id | Trivial | Just update the transaction number in the tests | |isolation_insert_select_conflict | Trivial| Just an RETURNING order issue, wait until #2754 merged | | isolation_citus_dist_activity and isolation_get_distributed_wait_queries | Trivial |Two things: (a) Real-time worker query should be updated from COPY to SELECT (b) Router SELECT with BEGIN opens transaction block now| |isolation_dump_global_wait_edges and isolation_get_distributed_wait_queries | Medium | In the master branch AssignDistributedTransactionId() precedes AcquireExecutorShardLock() where as it is different on this branch . I simply tried moving the lock acquisition right after BeginOrContinueCoordinatedTransaction and it worked. But need to think more |

    Apart from the above, I still see the following warnings, which I'll look a bit more details this time. WARNING: problem in alloc set MessageContext: detected write past chunk end in block 0x7facfb858200, chunk 0x7facfb859000 WARNING: problem in alloc set MessageContext: detected write past chunk end in block 0x7facfb858200, chunk 0x7facfb859000

    点赞 评论 复制链接分享
  • weixin_39595430 weixin_39595430 4月前

    Apart from the above, I still see the following warnings, which I'll look a bit more details this time. WARNING: problem in alloc set MessageContext: detected write past chunk end in block 0x7facfb858200, chunk 0x7facfb859000 WARNING: problem in alloc set MessageContext: detected write past chunk end in block 0x7facfb858200, chunk 0x7facfb859000

    That might be related to skipping make-clean but still suspicious and still looking to reproduce it...

    点赞 评论 复制链接分享
  • weixin_39595430 weixin_39595430 4月前

    In the master branch AssignDistributedTransactionId() precedes AcquireExecutorShardLock() where as it is different on this branch . I simply tried moving the lock acquisition right after BeginOrContinueCoordinatedTransaction and it worked. But need to think more

    It looks like we've discussed/fixed a similar thing https://github.com/citusdata/citus/pull/2452/files while back. I think isolation_get_distributed_wait_queries test is also failing due to the same thing.

    点赞 评论 复制链接分享
  • weixin_39945679 weixin_39945679 4月前

    Codecov Report

    Merging #2752 into unified_executor will decrease coverage by 41.63%. The diff coverage is 100%.

    diff
    @@                  Coverage Diff                  @@
    ##           unified_executor    #2752       +/-   ##
    =====================================================
    - Coverage              77.1%   35.46%   -41.64%     
    =====================================================
      Files                   128      128               
      Lines                 31393    31281      -112     
    =====================================================
    - Hits                  24205    11094    -13111     
    - Misses                 7188    20187    +12999
    
    点赞 评论 复制链接分享
  • weixin_39945679 weixin_39945679 4月前

    Codecov Report

    Merging #2752 into unified_executor will increase coverage by 4.79%. The diff coverage is 87.16%.

    diff
    @@                 Coverage Diff                  @@
    ##           unified_executor    #2752      +/-   ##
    ====================================================
    + Coverage             77.12%   81.92%   +4.79%     
    ====================================================
      Files                   128      128              
      Lines                 31393    31406      +13     
    ====================================================
    + Hits                  24212    25729    +1517     
    + Misses                 7181     5677    -1504
    
    点赞 评论 复制链接分享
  • weixin_39595430 weixin_39595430 4月前

    Two more greens, isolation test pass as of now!

    点赞 评论 复制链接分享
  • weixin_39844942 weixin_39844942 4月前

    some question: how distributed lock is avoided below in ShouldRunTasksSequentially case. we need sort the task by shardid? i will do some test. but there is some complicated about AcquireExecutorShardLock, for example relationRowLockList,RequiresConsistentSnapshot ...,is there some case that we can not prevent distributed lock while not in mx,just cn.

    AcquireExecutorShardLocks { foreach(taskCell, taskList) { AcquireExecutorShardLock(task, operation); } }

    i am some pullzed by the design goal about the lock of citus. can we get the guarantee from citus as below: A: run transcations through just only one cn node.(not mx case) B: run transcations throught just originally postgres if transcations will not cause dead lock in case B, it will also not case distributed dead lock in case A.

    点赞 评论 复制链接分享
  • weixin_39844942 weixin_39844942 4月前

    below code is some strange. is there need change the order of return ?

    ExecutionOrderForTask { if (operation == CMD_INSERT && !task->upsertQuery) { return EXECUTION_ORDER_SEQUENTIAL; //return EXECUTION_ORDER_PARALLEL; } else { return EXECUTION_ORDER_PARALLEL; //return EXECUTION_ORDER_SEQUENTIAL; } }

    点赞 评论 复制链接分享
  • weixin_39595430 weixin_39595430 4月前

    some question: how distributed lock is avoided below in ShouldRunTasksSequentially case. we need sort the task by shardid?

    As far as I can see, we already sort the tasks.

    ShouldRunTasksSequentially() returns true only for multi-row INSERTs. Those tasks are sorted via RouterInsertTaskList() -> BuildRoutesForInsert() -> GroupInsertValuesByShardId() -> SortList() -> CompareInsertValuesByShardId(). So, the taskList that the executor gets is already sorted.

    点赞 评论 复制链接分享
  • weixin_39638014 weixin_39638014 4月前

    Note that this currently causes multi-row INSERTs to block each other, but this will be resolved in #2735 .

    点赞 评论 复制链接分享