duanlun4411 2017-10-22 11:47
浏览 86
已采纳

通过间隔有效索引对象的结构

I'm currently playing with some ideas wrt CRF-ish work and I have an idea that I need help with.

Minimal Problem

I've got a bunch of function objects (think something expensive like neural nets). They are applied onto a linear buffer (think an array of floats or bytes) but at varying intervals. So they look like that (think of Start and End as "apply Object to buf[Start:End]":

| Object | Start | End |
|--------|-------|-----|
| A      | 0     | 4   |
| B      | 4     | 10  |
| C      | 13    | 15  |

Interval Characteristics

  • There may be some skips (for example, see the start of C vs the end of B)
  • There will definitely be changes to the intervals, both positive or negative (for example, B may change from [4:10] to [4:12].
  • When this happens, the object(s) associated with the intervals may have to be reapplied.
  • If the interval changes overlaps with another interval, both objects will have to be reapplied. For example, if B changes from [4:10] to [3:12], A would have to be applied to the range [0:3] and B would have to be applied to the range [3:12]
  • Depending on operation, downstream intervals will have to be updated as well, but the objects will not necessarily have to be reapplied. For example, if it were an insertion that changed the interval range for B, then the interval ranges for C will also increment by 2, but will not trigger a reapplication of C.

Program Characteristics

  • The intervals change a lot (it's a machine learning training loop).
  • Supported forms of interval updates are: insert, delete, shiftleft, shiftright. The latter two are the same as insert/delete but applied at the ends of the intervals.
  • Changes to the interval typically comes as a tuple (index, and size) or as a single index.
  • Application of function is fairly expensive operation and is CPU bound.
  • However, being that I am using Go, a couple of mutexes + goroutine solves a majority of the problem (there are some finer points but by large swarths it can be ignored).
  • One epoch can have anywhere from 5-60ish interval-object pairs.
  • Buffer is linear, but not necessarily contiguous.

Task

The tasks can be summarized as follows:

  1. Query by index: returns the interval and the object associated with the interval
  2. Update interval: must also update downstream if necessary (which is the majority case)
  3. Insertion of new intervals: must also update downstream

What I've Tried

  • Map with intervals as a key. This was a bad idea because I had to know if a given index that changed was within a interval or not
  • Linear structure to keep track of Starts. Discovered a bug immediately when I realized there may be skips.
  • Linear structures with "holes" to keep track of Starts. This turns out to be similar to a rope.
  • Ropes and Skip lists. Ended up refactoring what I had into the skiprope package that works for strings. More yak shaving. Yay.
  • Interval/Segment trees. Implementation is a bitch. I also tried a concrete variant of gods/augmentedtree but couldn't actually get the call-backing to work properly to evaluate it.

The Question

Is there any good data structure that I'm missing out on that would make these tasks easier?

Am I missing out on something blindingly obvious?

A friend suggested I look up incremental compilation methods because it's similar. An analogy used would be that Roslyn would parse/reparse fragments of text in a ranged fashion. That would be quite similar to my problem - just replace linear buffer of floats with linear buffer of tokens.

The problem is I couldn't find any solid useful information about how Roslyn does it.

  • 写回答

1条回答 默认 最新

  • douzhongjiu2263 2017-10-22 16:18
    关注

    This solution isn't particularly memory-efficient, but if I understand you correctly, it should allow for a relatively simple implementation of the functionality you want.

    1. Keep an array or slice funcs of all your function objects, so that they each have a canonical integer index, and can be looked up by that index.

    2. Keep a slice of ints s that is always the same size as your buffer of floats; it maps a particular index in your buffer to a "function index" in the slice of functions. You can use -1 to represent a number that is not part of any interval.

    3. Keep a slice of (int, int) pairs intervals such that intervals[i] contains the start-end indices for the function stored at funcs[i].

    I believe this enables you to implement your desired functionality without too much hassle. For example, to query by index i, look up s[i], then return funcs[s[i]] and intervals[s[i]]. When changes occur to the buffer, change s as well, cross-referencing between s and the intervals slice to figure out if neighboring intervals are affected. I'm happy to explain this part in more detail, but I don't totally understand the requirements for interval updates. (When you do an interval insert, does it correspond to an insert in the underlying buffer? Or are you just changing which buffer elements are associated with which functions? In which case, does an insert cause a deletion at the beginning of the next interval? Most schemes should work, but it changes the procedure.)

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥15 apm2.8飞控罗盘bad health,加速度计校准失败
  • ¥15 求解O-S方程的特征值问题给出边界层布拉休斯平行流的中性曲线
  • ¥15 谁有desed数据集呀
  • ¥20 手写数字识别运行c仿真时,程序报错错误代码sim211-100
  • ¥15 关于#hadoop#的问题
  • ¥15 (标签-Python|关键词-socket)
  • ¥15 keil里为什么main.c定义的函数在it.c调用不了
  • ¥50 切换TabTip键盘的输入法
  • ¥15 可否在不同线程中调用封装数据库操作的类
  • ¥15 微带串馈天线阵列每个阵元宽度计算