weixin_39789979
2020-12-29 04:51

RFC: Recycle value stacks to avoid allocation costs

We hope to use wasmi to implement a policy engine for low-volume packet inspection in an embedded setting. Initial CPU profiles using toy policies showed almost 40% of CPU time spent in `Vec::extend_with`, allocating and initializing a new value stack for each function call.

This MR tries to address this by adding an API that allows applications to reuse stack allocations (for both the value and call stacks).

As a side effect, it also provides an API for applications to configure the stack size limits (which is quite useful for us, as we only need a few kB of value stack in our particular application).

A micro benchmark executing a minimal, size-optimized policy for a single packet showed the following changes:

* Originally, an iteration took 205 µs.
* With this MR, but without doing any recycling, an iteration took 295 µs.
* With this MR, and using the recycling, an iteration took 165 µs. (Also, variance dropped from 1.7% to 0.6%.)
* With this MR, using the recycling as well as keeping the stack "dirty", an iteration took 75 µs.
* With this MR, using the recycling as well as a "clean" but much smaller stack of 32 kB, an iteration took 77 µs.

At least the following questions remain open from my point of view:

* [ ] The performance dropped significantly without recycling, and I am not certain where this comes from. I currently suspect that making the limits configurable prevents some micro-optimizations that add up to the observed slowdown.
* [ ] Zeroing out the stack before reuse does not seem to be necessary to pass the test suite, and not doing it significantly improves performance. However, I am unsure whether this means that programs could observe "dirty" values on the stack, or whether the spec requires the stack to be zeroed before invoking an exported function.

This question originates from the open source project: paritytech/wasmi


9 replies

  • weixin_39789979 · 4 months ago

    It looks like -rhebo hasn't signed our Contributor License Agreement, yet.

    The purpose of a CLA is to ensure that the guardian of a project's outputs has the necessary ownership or grants of rights over all contributions to allow them to distribute under the chosen licence. (Wikipedia)

    You can read and sign our full Contributor License Agreement at the following URL: https://cla.parity.io

    Once you've signed, please reply to this thread with [clabot:check] to prove it.

    Many thanks,

    Parity Technologies CLA Bot

    [clabot:check]

  • weixin_39838028 · 4 months ago

    It looks like -rhebo signed our Contributor License Agreement. :+1:

    Many thanks,

    Parity Technologies CLA Bot

  • weixin_39693662 · 4 months ago

    Hello, -rhebo! First, thank you for the PR.

    The code looks good, however my local benchmarks report massive slowdowns on the latest commit:

    
     name               master-2960f1b.bench ns/iter  stack-recycling-2.bench ns/iter  diff ns/iter  diff %  speedup
     bench_regex_redux  2,708,982                     2,768,588                              59,606   2.20%   x 0.98
     bench_rev_comp     6,187,121                     6,209,233                              22,112   0.36%   x 1.00
     bench_tiny_keccak  3,296,202                     3,289,373                              -6,829  -0.21%   x 1.00
     fac_opt            21,475                        42,373                                 20,898  97.31%   x 0.51
     fac_recursive      23,583                        45,191                                 21,608  91.63%   x 0.52
     recursive_ok       996,758                       1,130,878                             134,120  13.46%   x 0.88
     recursive_trap     110,301                       157,495                                47,194  42.79%   x 0.70
    

    Looking at these results, it is clear that the shorter the running time, the larger the slowdown. Can we fix it?

    Should I adjust the default size back to 512 kB or leave it as is?

    Hm, it was 1 MB before, wasn't it? My logic is as follows:

    1. pub const DEFAULT_VALUE_STACK_LIMIT: usize = (1024 * 1024) / ::core::mem::size_of::<RuntimeValue>();
    2. we used to resize the value stack to have DEFAULT_VALUE_STACK_LIMIT items.
    3. assuming that each item has size size_of::<RuntimeValue[Internal]>() and the items are laid out densely, the whole value stack should occupy ((1024 * 1024) / size_of::<RuntimeValue>()) * size_of::<RuntimeValue>() bytes, or simply 1 MB.
  • weixin_39789979 · 4 months ago

    The code looks good, however my local benchmarks report massive slowdowns on the latest commit:

    I think the slowdown is due to doubling the effective default stack size, at least that is what my local measurements suggest.

    Hm, it was 1 MB before, wasn't it? My logic is as follows:

    The stack is made up of values of type RuntimeValueInternal, which is smaller than RuntimeValue as it is missing the enum tag. Hence the actual size at the moment is

    
    ((1024 * 1024) / size_of::<RuntimeValue>()) * size_of::<RuntimeValueInternal>() ==
    ((1024 * 1024) / 16) * 8 ==
    512 * 1024
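    This arithmetic can be checked with plain Rust. The two types below are simplified stand-ins for wasmi's actual types, assuming a tagged enum with an 8-byte payload (16 bytes with padding) versus a bare 8-byte payload:

    ```rust
    use core::mem::size_of;

    // Illustrative stand-ins, not wasmi's real definitions:
    // RuntimeValue is a tagged enum (discriminant + 8-byte payload,
    // padded to 16 bytes); RuntimeValueInternal is the untagged payload.
    #[allow(dead_code)]
    enum RuntimeValue { I32(i32), I64(i64), F32(f32), F64(f64) }
    #[allow(dead_code)]
    struct RuntimeValueInternal(u64);

    fn main() {
        // The limit counts slots, computed from the *tagged* type ...
        let limit = (1024 * 1024) / size_of::<RuntimeValue>();
        // ... but each slot actually stores the *untagged* type:
        let bytes = limit * size_of::<RuntimeValueInternal>();
        println!("{} slots, {} bytes", limit, bytes);
    }
    ```

    So the stack ends up at 512 kB, half of what the constant's name suggests.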
  • weixin_39789979 · 4 months ago

    Looking at these results, it is clear that the shorter the running time, the larger the slowdown.

    I think this is because the stack initialization dominates the execution time for these small examples, and hence doubling the stack size almost doubles the execution time.

  • weixin_39693662 · 4 months ago

    Ah, yeah, you are right! I am ok with leaving the stack size as is now.

  • weixin_39789979 · 4 months ago

    As an additional data point, these are the results of the included benchmarks on my machine before

    
    test bench_regex_redux ... bench:   3,700,627 ns/iter (+/- 17,403)
    test bench_rev_comp    ... bench:   8,217,647 ns/iter (+/- 11,277)
    test bench_tiny_keccak ... bench:   4,467,537 ns/iter (+/- 153,282)
    test fac_opt           ... bench:      15,351 ns/iter (+/- 1,622)
    test fac_recursive     ... bench:      16,975 ns/iter (+/- 6,130)
    test recursive_ok      ... bench:   1,099,639 ns/iter (+/- 22,430)
    test recursive_trap    ... bench:     126,832 ns/iter (+/- 5,552)
    

    and after these changes

    
    test bench_regex_redux ... bench:   3,846,837 ns/iter (+/- 66,684)
    test bench_rev_comp    ... bench:   8,497,919 ns/iter (+/- 28,065)
    test bench_tiny_keccak ... bench:   4,388,131 ns/iter (+/- 39,272)
    test fac_opt           ... bench:      30,699 ns/iter (+/- 11,836)
    test fac_recursive     ... bench:      32,653 ns/iter (+/- 454)
    test recursive_ok      ... bench:   1,192,462 ns/iter (+/- 4,996)
    test recursive_trap    ... bench:     148,877 ns/iter (+/- 1,696)
    
  • weixin_39789979 · 4 months ago

    Ah, it seems changing the default size of the value stack from (1024 * 1024) / ::core::mem::size_of::<RuntimeValue>() to (1024 * 1024) / ::core::mem::size_of::<RuntimeValueInternal>() is responsible for the performance regressions, as it effectively increased the default stack size from 512 kB to 1 MB; when reverting it, performance returns to the previous values when recycling is not used.

    Should I adjust the default size back to 512 kB or leave it as is?

