I'm currently playing with some ideas wrt CRF-ish work and I have an idea that I need help with.
Minimal Problem
I've got a bunch of function objects (think something expensive like neural nets). They are applied onto a linear buffer (think an array of float
s or byte
s) but at varying intervals. So they look like that (think of Start and End as "apply Object to buf[Start:End]
":
| Object | Start | End |
|--------|-------|-----|
| A | 0 | 4 |
| B | 4 | 10 |
| C | 13 | 15 |
Interval Characteristics
- There may be some skips (for example, see the start of C vs the end of B)
- There will definitely be changes to the intervals, both positive or negative (for example, B may change from
[4:10]
to[4:12]
. - When this happens, the object(s) associated with the intervals may have to be reapplied.
- If the interval changes overlaps with another interval, both objects will have to be reapplied. For example, if B changes from
[4:10]
to[3:12]
, A would have to be applied to the range[0:3]
and B would have to be applied to the range[3:12]
- Depending on operation, downstream intervals will have to be updated as well, but the objects will not necessarily have to be reapplied. For example, if it were an insertion that changed the interval range for B, then the interval ranges for C will also increment by 2, but will not trigger a reapplication of C.
Program Characteristics
- The intervals change a lot (it's a machine learning training loop).
- Supported forms of interval updates are: insert, delete, shiftleft, shiftright. The latter two are the same as insert/delete but applied at the ends of the intervals.
- Changes to the interval typically comes as a tuple (index, and size) or as a single index.
- Application of function is fairly expensive operation and is CPU bound.
- However, being that I am using Go, a couple of mutexes + goroutine solves a majority of the problem (there are some finer points but by large swarths it can be ignored).
- One epoch can have anywhere from 5-60ish interval-object pairs.
- Buffer is linear, but not necessarily contiguous.
Task
The tasks can be summarized as follows:
- Query by index: returns the interval and the object associated with the interval
- Update interval: must also update downstream if necessary (which is the majority case)
- Insertion of new intervals: must also update downstream
What I've Tried
- Map with intervals as a key. This was a bad idea because I had to know if a given index that changed was within a interval or not
- Linear structure to keep track of Starts. Discovered a bug immediately when I realized there may be skips.
- Linear structures with "holes" to keep track of Starts. This turns out to be similar to a rope.
- Ropes and Skip lists. Ended up refactoring what I had into the skiprope package that works for strings. More yak shaving. Yay.
- Interval/Segment trees. Implementation is a bitch. I also tried a concrete variant of gods/augmentedtree but couldn't actually get the call-backing to work properly to evaluate it.
The Question
Is there any good data structure that I'm missing out on that would make these tasks easier?
Am I missing out on something blindingly obvious?
A friend suggested I look up incremental compilation methods because it's similar. An analogy used would be that Roslyn would parse/reparse fragments of text in a ranged fashion. That would be quite similar to my problem - just replace linear buffer of floats with linear buffer of tokens.
The problem is I couldn't find any solid useful information about how Roslyn does it.