concurrent small allocation defeats large allocation
If metaslab_group_alloc_normal()'s call to metaslab_activate() fails, it skips this metaslab: it does not try to allocate from it, and it does not set ms_selected_txg. The result is reduced write performance, most noticeable on fragmented pools with many concurrent synchronous (ZIL) writes, or with any other workload that mixes large and small allocations.
metaslab_activate will fail if, while we are in metaslab_load() with the ms_lock dropped (which can be around 1 second on fragmented metaslabs, even if the spacemap is cached in the ARC), another zio (with the same allocator) succeeds in allocating from this metaslab and thus setting mg_primaries. Then when this allocation gets to metaslab_activate_allocator(), it will see that another metaslab has already been activated (i.e. there is already a primary metaslab selected for this vdev and allocator), and fail.
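The race can be illustrated with a small toy model. This is only a sketch under stated assumptions: race_ms_t, race_group_t, race_load() and race_activate() are hypothetical stand-ins invented for illustration, not the real ZFS structures or functions, and locking is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for illustration; NOT the real ZFS structures. */
typedef struct race_ms {
	bool rm_loaded;		/* range tree has been built */
	bool rm_primary;	/* this metaslab is the active primary */
} race_ms_t;

typedef struct race_group {
	race_ms_t *rg_primary;	/* models one mg_primaries[] slot */
} race_group_t;

/* Models metaslab_load(): slow, and run with the lock dropped. */
static void
race_load(race_ms_t *ms)
{
	ms->rm_loaded = true;
}

/*
 * Models the activation check: fail if a different metaslab already
 * holds the primary slot for this allocator.
 */
static bool
race_activate(race_group_t *rg, race_ms_t *ms)
{
	if (rg->rg_primary != NULL && rg->rg_primary != ms)
		return (false);
	rg->rg_primary = ms;
	ms->rm_primary = true;
	return (true);
}
```

If the victim calls race_load() and, before it reaches race_activate(), a concurrent allocation activates a different metaslab, the victim's race_activate() returns false even though it has already paid the full cost of loading.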
This can typically happen if the first (victim) allocation is large (e.g. for a ZIL block) and the concurrent allocation is small. The smaller allocation may be able to find enough contiguous space in a better (higher weight) metaslab, thus activating it.
The impact of not setting ms_selected_txg is that this metaslab may be unloaded at the end of the txg. So we may enter a cycle where every txg, we load (via metaslab_activate()) and then unload (via metaslab_sync_done()) some metaslabs, wasting a lot of time. Loading can take ~1 second to construct the range_tree, even when the spacemap is cached in the ARC and no I/O is required.
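The load/unload cycle can be sketched with a toy simulation. Everything here is an assumption made for illustration: the cyc_* names are hypothetical, the unload rule (unload once the last selected txg plus a delay is older than the current txg) is a simplification, and unload_delay is a made-up parameter rather than a real tunable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy metaslab; NOT the real metaslab_t. */
typedef struct cyc_ms {
	bool		cm_loaded;
	uint64_t	cm_selected_txg;	/* models ms_selected_txg */
	unsigned	cm_loads;		/* times the load cost was paid */
} cyc_ms_t;

/* Models the expensive load; only counts when actually (re)loading. */
static void
cyc_load(cyc_ms_t *ms)
{
	if (!ms->cm_loaded) {
		ms->cm_loaded = true;
		ms->cm_loads++;
	}
}

/* Models the end-of-txg decision to unload a metaslab nobody selected. */
static void
cyc_sync_done(cyc_ms_t *ms, uint64_t txg, uint64_t unload_delay)
{
	if (ms->cm_loaded && ms->cm_selected_txg + unload_delay < txg)
		ms->cm_loaded = false;
}

/* Runs ntxgs txgs, loading the metaslab once per txg; returns load count. */
static unsigned
cyc_count_loads(uint64_t ntxgs, uint64_t unload_delay, bool set_selected)
{
	cyc_ms_t ms = { false, 0, 0 };

	for (uint64_t txg = 1; txg <= ntxgs; txg++) {
		cyc_load(&ms);
		if (set_selected)
			ms.cm_selected_txg = txg;
		cyc_sync_done(&ms, txg, unload_delay);
	}
	return (ms.cm_loads);
}
```

In this model, with unload_delay of 0 over ten txgs, the metaslab is loaded ten times when the selected txg is never recorded, and only once when it is recorded every txg.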
The impact of skipping this metaslab is that we will go on to a worse metaslab (goodness measured by the metaslab weight). We may hit this race again and skip the next metaslab, ending up loading (~1 second each) and then skipping many metaslabs before we are able to complete the allocation. This can cause us to load (costing time) and keep loaded (costing memory) many more metaslabs than are actually required to satisfy the allocations. (Alternatively, if no other zios set ms_selected_txg, these metaslabs will be unloaded, in which case we may reload them the next txg, costing even more time.)
This problem was introduced by 9112 Improve allocation performance on high-end systems.
The fix is that if metaslab_activate() fails, metaslab_group_alloc_normal() needs to still try to allocate from this metaslab, and set ms_selected_txg.
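The difference between the old and new behavior can be sketched as follows. This is a toy model, not the actual patch: fix_ms_t and fix_alloc() are hypothetical names, the fm_activate_fails flag is an assumed stand-in for "this metaslab lost the activation race", and the apply_fix switch exists only so both behaviors can be compared side by side.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy metaslab; NOT the real metaslab_t. */
typedef struct fix_ms {
	uint64_t fm_free;		/* free space available */
	uint64_t fm_selected_txg;	/* models ms_selected_txg */
	bool	 fm_activate_fails;	/* assumed: lost the activation race */
} fix_ms_t;

/*
 * Walks metaslabs in the order given (best first) and returns the index
 * that satisfied the allocation, or -1 if none did.
 */
static int
fix_alloc(fix_ms_t *mss, int n, uint64_t size, uint64_t txg, bool apply_fix)
{
	for (int i = 0; i < n; i++) {
		fix_ms_t *ms = &mss[i];

		/* Old behavior: skip, leaving the selected txg untouched. */
		if (ms->fm_activate_fails && !apply_fix)
			continue;

		/* New behavior: record the txg and still try to allocate. */
		ms->fm_selected_txg = txg;
		if (ms->fm_free >= size) {
			ms->fm_free -= size;
			return (i);
		}
	}
	return (-1);
}
```

With three metaslabs that each have enough space, where activation fails on the first two, the old behavior in this model allocates from the third and leaves the first two unselected; with apply_fix set, the allocation is satisfied by the first metaslab and its selected txg is recorded.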
External-issue: DLPX-61848 DLPX-61314