weixin_39854369
2021-01-11 11:43

[TieredCompilation] Cold methods with hot loops may run slower with tiering

c#
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

internal static class Program
{
    private const int HistoryCount = 8;
    private const int InnerIterationCount = 256;
    private static readonly TimeSpan s_ts500ms = TimeSpan.FromMilliseconds(500);

    private static void Main()
    {
        var sw = new Stopwatch();
        var history = new Queue<double>(HistoryCount);
        var list = new List<int>(InnerIterationCount);
        for (int outerIteration = -1; outerIteration < HistoryCount; ++outerIteration)
        {
            var duration = s_ts500ms;
            int iterations = 0;
            TimeSpan elapsed;
            sw.Restart();
            do
            {
                // ---
                list.Clear();
                for (int innerIteration = 0; innerIteration < InnerIterationCount; ++innerIteration)
                    list.Add(innerIteration);
                // ---
                ++iterations;
            } while ((iterations & 0xf) != 0 || (elapsed = sw.Elapsed) < duration);

            if (outerIteration < 0)
                continue;

            var iterationsPerMs = iterations / elapsed.TotalMilliseconds;
            if (history.Count >= HistoryCount)
                history.Dequeue();
            history.Enqueue(iterationsPerMs);
            Console.WriteLine($"{iterationsPerMs,10:0.00} {history.Average(),10:0.00}");
        }
    }
}

Average iterations per ms:

  • Tiering disabled: 2775.05
  • Tiering enabled: 2045.84

A comparison of PerfView profiles shows that some inlining is not happening with tiering enabled. List.Add appears as a separate frame under Main:


Name                                                                                Inc %        Inc    Exc %      Exc
 test!Program.Main()                                                                 97.7      4,485     29.9    1,375
+ system.private.corelib!System.Collections.Generic.List`1[System.Int32].Add(Int32)  66.9      3,072     66.8    3,069

The JITStats summary shows that the only JIT trigger for Main is FG (foreground). With tiering enabled, a foreground JIT produces tier 0 (minopts) code, which does not do inlining. There is no TC trigger to indicate tier 1 for Main, because Main is called only once and so is never promoted.

A workaround is to move the iteration code into a separate method:

c#
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

internal static class Program
{
    private const int HistoryCount = 8;
    private const int InnerIterationCount = 256;
    private static readonly TimeSpan s_ts500ms = TimeSpan.FromMilliseconds(500);

    private static void Main()
    {
        var sw = new Stopwatch();
        var history = new Queue<double>(HistoryCount);
        var list = new List<int>(InnerIterationCount);
        for (int outerIteration = -1; outerIteration < HistoryCount; ++outerIteration)
        {
            var duration = s_ts500ms;
            int iterations = 0;
            TimeSpan elapsed;
            sw.Restart();
            do
            {
                // ---
                RunIteration(list);
                // ---
                ++iterations;
            } while ((iterations & 0xf) != 0 || (elapsed = sw.Elapsed) < duration);

            if (outerIteration < 0)
                continue;

            var iterationsPerMs = iterations / elapsed.TotalMilliseconds;
            if (history.Count >= HistoryCount)
                history.Dequeue();
            history.Enqueue(iterationsPerMs);
            Console.WriteLine($"{iterationsPerMs,10:0.00} {history.Average(),10:0.00}");
        }
    }

    private static void RunIteration(List<int> list)
    {
        list.Clear();
        for (int innerIteration = 0; innerIteration < InnerIterationCount; ++innerIteration)
            list.Add(innerIteration);
    }
}

Average iterations per ms:

  • Tiering disabled: 2775.55
  • Tiering enabled: 2728.70

The PerfView profile now shows most of the time spent is exclusively in RunIteration as expected:


Name                                                                                    Inc %        Inc    Exc %      Exc      First         Last
 test!Program.Main()                                                                     98.0      4,490      0.5       22  1,567.832    6,074.464
+ test!Program.RunIteration(class System.Collections.Generic.List`1)                     97.0      4,443     93.5    4,283  1,568.696    6,074.464
|+ system.private.corelib!System.Collections.Generic.List`1[System.Int32].Add(Int32)      3.5        159      3.5      159  1,569.678    1,792.351

List.Add is still showing up, and that must be when RunIteration was at tier 0, as the JITStats summary shows:

Start (msec) | JitTime msec | IL Size | Native Size | Method Name | Trigger
-- | -- | -- | -- | -- | --
1,568.151 | 0.1 | 30 | 74 | Program.RunIteration(class System.Collections.Generic.List`1) | FG
1,791.821 | 0.6 | 30 | 78 | Program.RunIteration(class System.Collections.Generic.List`1) | TC

The last sample in List.Add in the profile was at 1,792.351. The tier 1 JIT for RunIteration was initiated at 1,791.821 and would have completed at around 1,792.421.

Other workarounds:

  • For benchmarks where each iteration of the benchmark is very short (a few milliseconds or less), use something like BenchmarkDotNet, where tiering would occur during the piloting or warmup phases and would not affect the measured phase. If each iteration of the benchmark takes longer, the number of warmup iterations may be increased to allow enough time for tiering to occur before measurement begins.
  • Disable tier 0 JIT (in environment COMPlus_TieredCompilation_DisableTier0Jit=1 or in project file <DisableTier0Jit>true</DisableTier0Jit>). In this mode, methods that don't have pregenerated code would be optimized initially. It may be useful as a global workaround for a suite of benchmarks where there may be several instances of cold methods with hot loops. For apps, it would avoid the worst-case situations where a cold method jitted at tier 0 contains a hot loop that runs for a long time. It would still be possible to be running a long-running hot loop in a cold method that has not yet been jitted at tier 1, but it would be running optimized pregenerated code, so the perf may be reasonable and the issue may not be as severe.
  • Attribute methods expected to contain hot code with MethodImplOptions.AggressiveOptimization. In the first example above, that would be:

c#
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    private static void Main()
    {
        ...
    }
  
  • Turn off tiered compilation (in environment COMPlus_TieredCompilation=0 or in project file <TieredCompilation>false</TieredCompilation>) for such types of benchmarks.
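
For the warmup-based workaround, a hand-rolled version might look like the sketch below. It is only an illustration, not what BenchmarkDotNet does internally: the class name WarmupSketch, the RunFor helper, and the 500 ms warmup duration are assumptions made for this example. The warmup length is picked to be comfortably longer than the roughly 224 ms that passed between the FG and TC JIT events for RunIteration in the JITStats summary above. RunFor itself is a cold method with a hot loop, but as in the RunIteration workaround, the loop only calls a small method that is expected to be promoted to tier 1.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

internal static class WarmupSketch
{
    private const int InnerIterationCount = 256;

    // The measured work lives in its own small method, so it is called many
    // times and can be rejitted at tier 1 (same idea as RunIteration above).
    internal static int RunIteration(List<int> list)
    {
        list.Clear();
        for (int i = 0; i < InnerIterationCount; ++i)
            list.Add(i);
        return list.Count;
    }

    // Calls RunIteration repeatedly for at least the given duration and
    // returns the observed iterations per millisecond.
    internal static double RunFor(List<int> list, TimeSpan duration)
    {
        var sw = Stopwatch.StartNew();
        long iterations = 0;
        do
        {
            RunIteration(list);
            ++iterations;
        } while (sw.Elapsed < duration);
        return iterations / sw.Elapsed.TotalMilliseconds;
    }

    private static void Main()
    {
        var list = new List<int>(InnerIterationCount);

        // Warmup phase: the result is discarded. The duration is an assumption;
        // it only needs to be long enough for tiering to have happened.
        RunFor(list, TimeSpan.FromMilliseconds(500));

        // Measured phase: by now RunIteration is expected to be at tier 1.
        double iterationsPerMs = RunFor(list, TimeSpan.FromMilliseconds(500));
        Console.WriteLine($"{iterationsPerMs,10:0.00}");
    }
}
```

If a single call to the measured method takes longer than the warmup phase, this sketch runs into the same issue as before, and the warmup duration would need to be increased.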

Considerations:

  • Consider optimizing loops at tier 0, or methods containing loops. Data needs to be collected on how this would affect startup performance.
  • Longer-term: A proper fix would probably involve at least some portions of what OSR involves.

This question comes from the open-source project: dotnet/runtime


4 answers

  • weixin_40007548 4 months ago

    OSR looks like the best solution in general if the engineering cost is acceptable 😄

  • weixin_39604350 4 months ago

    First, I would like to highlight that the mitigation mentioned above of using an attribute has been implemented. In particular, add the AggressiveInlining flag to the 'RunIteration' method like so:

    
        using System.Runtime.CompilerServices;
    
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static void RunIteration(List<int> list)

    The original repro will be fixed. Thus any particular instance of the issue can be mitigated in a straightforward way. You just have to realize that you need to do it.

    Second, I should mention that this situation (where a perf-critical method is only run once) is pretty rare in real code. Much more typically, perf-critical methods are called repeatedly. The sad exception to this is microbenchmarks: they are typically designed to be a hot loop, and if the important code is in the loop (rather than in what is called from the loop), you will get this issue. Having benchmarks add the AggressiveInlining switch avoids it.

    It has been suggested above that On Stack Replacement (OSR) is the solution to this bug, since it would allow methods that have not returned to be updated. This of course would solve the problem, but figuring out how to map local state is non-trivial and likely error prone, which is why the cost-benefit of actually fixing this bug that way is low.

    However, I would like to suggest three other possible mitigations. They are not as comprehensive as OSR, but they are much easier, and likely to get us most of the way.

    1. Heuristically detect benchmark methods and go directly to fully optimized code for those methods. For example, in BenchmarkDotNet you attribute benchmarks with the [Benchmark] attribute, so you can look for that attribute. There are not that many benchmark systems out there, and detecting them all is not hard. This would probably be sufficient.

    2. The most likely benchmarks that would be 'bad' are simple cases (one loop that calls one or very few APIs). We could detect code that looks like this and again skip tiering for it.

    3. While on-stack replacement is a hard problem in general, simple cases (e.g. when the 'dumb' code has only a single local variable) are likely to be straightforward, and are likely to be the most common things in benchmarks. Just implement OSR for this case.
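
The attribute lookup in (1) can be pictured with a small reflection sketch. This is only an illustration of the check itself: the BenchmarkAttribute class below is a hypothetical stand-in defined locally (not BenchmarkDotNet's own type), the Samples and AttributeScanSketch names are made up for the example, and a runtime would presumably do this lookup on method metadata when jitting rather than through reflection.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical stand-in for a benchmark framework's marker attribute.
[AttributeUsage(AttributeTargets.Method)]
internal sealed class BenchmarkAttribute : Attribute { }

internal static class Samples
{
    [Benchmark]
    public static void HotLoop() { }

    public static void Helper() { }
}

internal static class AttributeScanSketch
{
    // Returns the names of the methods declared on the type that carry
    // the marker attribute, i.e. the candidates for skipping tier 0.
    internal static string[] FindBenchmarkMethods(Type type) =>
        type.GetMethods(BindingFlags.Public | BindingFlags.NonPublic |
                        BindingFlags.Static | BindingFlags.Instance |
                        BindingFlags.DeclaredOnly)
            .Where(m => m.IsDefined(typeof(BenchmarkAttribute), inherit: false))
            .Select(m => m.Name)
            .ToArray();

    private static void Main()
    {
        foreach (var name in FindBenchmarkMethods(typeof(Samples)))
            Console.WriteLine(name);
    }
}
```

Only HotLoop is reported here. Helper is not, which shows the limit of the heuristic: unattributed methods that a benchmark calls would not be detected.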

    My main point here is that we have a way that users can fix things for sure, so the only issue is that they have to discover that, and that may not happen. We DON'T have to do a perfect job trying to detect the rest, and it is OK to be very heuristic and it is OK if the heuristics look 'ugly' because they are a 'best effort' kind of a thing.

    I personally think we should do (1) for V3.0.

  • weixin_39854369 4 months ago

    Updated workaround info above to include the attribute. I think you meant AggressiveOptimization instead of AggressiveInlining.

    RE (1):

      • Microbenchmarks that use BenchmarkDotNet typically don't run into the issue because tiering typically happens during the pilot or warmup phases. If a single iteration takes long enough that it moves to the measurement phase before the benchmark is tiered, then it would run into the same issue.
      • Looking for one of several specific attributes on every method that is called for the first time might be a bit too much work to do at run-time. Maybe the compiler could inject the AggressiveOptimization based on other attributes.
      • Maybe it could be done by assembly. If the compiler sees a method in an assembly using one of the Benchmark-like attributes, it could mark all methods in that assembly with AggressiveOptimization, in order to detect dependency methods that would otherwise be missed.
      • The perf of a method attributed with AggressiveOptimization may eventually diverge from an identical method that is not attributed. If the intention is to measure the perf of a library method that is unattributed and typically called by unattributed methods, the perf result may not be representative.

  • weixin_39854369 4 months ago

    RE (3), OSR-like strategies would also be useful for profile-based speculative optimizations along with solving this issue. So it could be considered that fixing this issue with such a strategy is just a side-effect.
