weixin_39847945
2021-01-11 04:34

Fuzzy row filter optimization for the first N consecutive tags

Hi, we're trying to build a monitoring system for our 50K+ servers using OpenTSDB.

In doing so, we have created a host dashboard page that shows a few dozen charts for a single host over the specified time span. The page has to issue a query like the following for each metric it displays.


opentsdb:4242/api/query?start=24h-ago&m=max:sys.cpu.idle%7Bhost=HOSTNAME%7D

However, we noticed a significant slowdown in response time as we put more data into HBase and the cardinality of tags increases. The exact problem is described in the link below.

http://opentsdb.net/docs/build/html/user_guide/writing.html#time-series-cardinality

We currently have around 8,000 distinct values for the host tag, and when we open the aforementioned dashboard page, we see that the regionservers spend all their CPU filtering out irrelevant rows with the regex filter while the page is rendering. The cardinality of the host tag is expected to grow beyond 50K, at which point the page will simply be unusable (it is already partly so, unfortunately).

Luckily, in our case the tagk UID for "host" was 001, and I realized we could append the filtering condition to the start key and apply a fuzzy row filter to skip-scan the rows, since the 001 tag can only appear at the front of the list of tagk/tagv pairs in the row key (i.e. the row key looks like: metric, timestamp, 001, foo, ...).
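
For anyone curious how such a filter is set up, here is a minimal sketch using the plain HBase client API (not the asynchbase API OpenTSDB actually uses, so this is not the attached patch), assuming 3-byte UIDs and a 4-byte base timestamp; the class and variable names are hypothetical:

```java
import java.util.Collections;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Pair;

public class HostSkipScan {

  // Assumed row key layout: [metric uid (3)][base timestamp (4)][tagk (3)][tagv (3)]...
  static Scan buildScan(byte[] metricUid, byte[] hostTagk, byte[] hostValue) {
    final int keyLen = 3 + 4 + 3 + 3;          // metric + timestamp + first tagk/tagv pair
    final byte[] fuzzyKey = new byte[keyLen];
    final byte[] fuzzyMask = new byte[keyLen]; // 0 = byte must match, 1 = byte is a wildcard

    System.arraycopy(metricUid, 0, fuzzyKey, 0, 3);   // metric UID: fixed
    for (int i = 3; i < 7; i++) {
      fuzzyMask[i] = 1;                               // base timestamp bytes: wildcard
    }
    System.arraycopy(hostTagk, 0, fuzzyKey, 7, 3);    // tagk 001 ("host"): fixed
    System.arraycopy(hostValue, 0, fuzzyKey, 10, 3);  // tagv of the queried host: fixed

    final Scan scan = new Scan();
    scan.setFilter(new FuzzyRowFilter(
        Collections.singletonList(new Pair<>(fuzzyKey, fuzzyMask))));
    return scan;
  }
}
```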

With the patch we are observing a more than 30x improvement in response time, and we expect the margin of improvement to grow further as the cardinality of the tag increases.

The attached patch is a generalization of this idea that sets up a fuzzy row filter for the first N consecutive tags. So even if the filter is not on 001 but on 002, it still applies if we change the query to 001=*,002=foo. However, we can't expect a performance improvement if the preceding tag has high cardinality, because in that case the fuzzy row filter will not be able to skip many rows. For your information, a quick test showed that the performance overhead of an ineffective fuzzy row filter is negligible.
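
To make the 001=*,002=foo case concrete, here is a sketch of the fuzzy mask such a query would translate to, under the same assumed byte widths as the sketch above; the value bytes of the preceding tag become wildcards, so the filter can still skip ahead on the fixed 002 bytes:

```java
// Mask for a query rewritten as 001=*,002=foo (assumed 3-byte UIDs, 4-byte base timestamp);
// 0 = the byte must match the fuzzy key, 1 = the byte is a wildcard.
byte[] fuzzyMask = new byte[] {
    0, 0, 0,       // metric UID: fixed
    1, 1, 1, 1,    // base timestamp: wildcard
    0, 0, 0,       // tagk 001: fixed
    1, 1, 1,       // tagv of 001: wildcard ("*")
    0, 0, 0,       // tagk 002: fixed
    0, 0, 0,       // tagv "foo": fixed
};
```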

It's understandable if you don't want to accept the patch, since the benefit is limited to specific use cases and it may not be obvious to users that they should intentionally assign the first UIDs to their most-queried tags, with their cardinalities in mind (e.g. curl http://opentsdb:4242/api/uid/assign?tagk=host). Nevertheless, I really hope it gets merged so we don't have to maintain our own fork.

So what do you think? Please let me know if there's anything wrong with the patch or with our approach to building the system. Test cases are currently missing; if you want, I can try to add some.

This question comes from the open-source project: OpenTSDB/opentsdb

6 replies

  • weixin_39847945 4 months ago

    "The fuzzy filter requires the entire row key to be specified and it will only match those that are the same length as the filter."

    This is actually not true. If the prefix of the row key matches the filter, the row is included. None of our records have only a single tag, and as noted above, we've been running the patched version of OpenTSDB in production for months.

  • weixin_39843151 4 months ago

    Hi, this sounds really cool actually and can definitely be useful. I'd like to see if we can generalize it a little more, maybe using it in conjunction with the "exact_tagks" flag to improve queries (dunno if I've upstreamed that yet... I'll need to :) In that case, when the user has specified all of the filters, we can craft the fuzzy filter. In your case do you only have the host tag per metric?

  • weixin_39847945 4 months ago

    Currently we have 9 tags: arch, cpu, disk-device, filesystem, host, iface, os, service, ver.

    As you can tell from the names, different metrics use different tags, but host and service, the two primary tags of interest to us, are mandatory. We have queries on both, so we first considered assigning 001 to service and 002 to host, so that the fuzzy row filter could cover both use cases (001=xxx or 001=*,002=yyy). However, we decided to keep 001 for host and focus on host queries only (which are much more frequent than service queries anyway), as the cardinality of the service tag is quite high and would make skip-scans on host much less effective. To address performance issues with service queries, we're also considering deploying another (duplicate) instance of OpenTSDB where 001 is assigned to the service tag.

    We've been running the fork in production for a few weeks and haven't had any issues, though we only use a subset of OpenTSDB's features. I don't mind whether you merge this PR or write your own version of the idea, but I can say this is definitely a nice improvement and low-hanging fruit.

  • weixin_39843151 4 months ago

    Added a flexible fuzzy filter for use with the new explicit tags flag. Thanks for the work on this! 3449a53933b753e7525889a089f809e5eb834f86

  • weixin_39847945 4 months ago

    Thanks. So if I understood correctly, it works only when the exact set of tags is specified, so we cannot take advantage of the fuzzy row filter when we query just by the host tag as above, am I right?

  • weixin_39843151 4 months ago

    In your case it should still work, as you must have only had a host tag for your metrics in order to craft the filter. The fuzzy filter requires the entire row key to be specified and it will only match those that are the same length as the filter. If you asked for something like "metric{host=myhost}" but the series also had a "disk=sda" or "if=eth0" tag then it wouldn't work because the key would be too short. Did you have a mix of metrics with single tags and multiple tags?

