duanlun4411 2017-10-22 11:47
浏览 86
已采纳

通过间隔有效索引对象的结构

I'm currently playing with some ideas wrt CRF-ish work and I have an idea that I need help with.

Minimal Problem

I've got a bunch of function objects (think something expensive like neural nets). They are applied onto a linear buffer (think an array of floats or bytes) but at varying intervals. So they look like that (think of Start and End as "apply Object to buf[Start:End]":

| Object | Start | End |
|--------|-------|-----|
| A      | 0     | 4   |
| B      | 4     | 10  |
| C      | 13    | 15  |

Interval Characteristics

  • There may be some skips (for example, see the start of C vs the end of B)
  • There will definitely be changes to the intervals, both positive or negative (for example, B may change from [4:10] to [4:12].
  • When this happens, the object(s) associated with the intervals may have to be reapplied.
  • If the interval changes overlaps with another interval, both objects will have to be reapplied. For example, if B changes from [4:10] to [3:12], A would have to be applied to the range [0:3] and B would have to be applied to the range [3:12]
  • Depending on operation, downstream intervals will have to be updated as well, but the objects will not necessarily have to be reapplied. For example, if it were an insertion that changed the interval range for B, then the interval ranges for C will also increment by 2, but will not trigger a reapplication of C.

Program Characteristics

  • The intervals change a lot (it's a machine learning training loop).
  • Supported forms of interval updates are: insert, delete, shiftleft, shiftright. The latter two are the same as insert/delete but applied at the ends of the intervals.
  • Changes to the interval typically comes as a tuple (index, and size) or as a single index.
  • Application of function is fairly expensive operation and is CPU bound.
  • However, being that I am using Go, a couple of mutexes + goroutine solves a majority of the problem (there are some finer points but by large swarths it can be ignored).
  • One epoch can have anywhere from 5-60ish interval-object pairs.
  • Buffer is linear, but not necessarily contiguous.

Task

The tasks can be summarized as follows:

  1. Query by index: returns the interval and the object associated with the interval
  2. Update interval: must also update downstream if necessary (which is the majority case)
  3. Insertion of new intervals: must also update downstream

What I've Tried

  • Map with intervals as a key. This was a bad idea because I had to know if a given index that changed was within a interval or not
  • Linear structure to keep track of Starts. Discovered a bug immediately when I realized there may be skips.
  • Linear structures with "holes" to keep track of Starts. This turns out to be similar to a rope.
  • Ropes and Skip lists. Ended up refactoring what I had into the skiprope package that works for strings. More yak shaving. Yay.
  • Interval/Segment trees. Implementation is a bitch. I also tried a concrete variant of gods/augmentedtree but couldn't actually get the call-backing to work properly to evaluate it.

The Question

Is there any good data structure that I'm missing out on that would make these tasks easier?

Am I missing out on something blindingly obvious?

A friend suggested I look up incremental compilation methods because it's similar. An analogy used would be that Roslyn would parse/reparse fragments of text in a ranged fashion. That would be quite similar to my problem - just replace linear buffer of floats with linear buffer of tokens.

The problem is I couldn't find any solid useful information about how Roslyn does it.

  • 写回答

1条回答 默认 最新

  • douzhongjiu2263 2017-10-22 16:18
    关注

    This solution isn't particularly memory-efficient, but if I understand you correctly, it should allow for a relatively simple implementation of the functionality you want.

    1. Keep an array or slice funcs of all your function objects, so that they each have a canonical integer index, and can be looked up by that index.

    2. Keep a slice of ints s that is always the same size as your buffer of floats; it maps a particular index in your buffer to a "function index" in the slice of functions. You can use -1 to represent a number that is not part of any interval.

    3. Keep a slice of (int, int) pairs intervals such that intervals[i] contains the start-end indices for the function stored at funcs[i].

    I believe this enables you to implement your desired functionality without too much hassle. For example, to query by index i, look up s[i], then return funcs[s[i]] and intervals[s[i]]. When changes occur to the buffer, change s as well, cross-referencing between s and the intervals slice to figure out if neighboring intervals are affected. I'm happy to explain this part in more detail, but I don't totally understand the requirements for interval updates. (When you do an interval insert, does it correspond to an insert in the underlying buffer? Or are you just changing which buffer elements are associated with which functions? In which case, does an insert cause a deletion at the beginning of the next interval? Most schemes should work, but it changes the procedure.)

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 制裁名单20240508芯片厂商
  • ¥20 易康econgnition精度验证
  • ¥15 线程问题判断多次进入
  • ¥15 msix packaging tool打包问题
  • ¥28 微信小程序开发页面布局没问题,真机调试的时候页面布局就乱了
  • ¥15 python的qt5界面
  • ¥15 无线电能传输系统MATLAB仿真问题
  • ¥50 如何用脚本实现输入法的热键设置
  • ¥20 我想使用一些网络协议或者部分协议也行,主要想实现类似于traceroute的一定步长内的路由拓扑功能
  • ¥30 深度学习,前后端连接