weixin_39714383 2020-11-30 08:56

Trace compaction

Materialize maintains changes to collections in "traces", each of which initially looks like a log of updates to the collection. This is fine for the first few moments of a demo, but with enough churn going on we will have two problems:

  • Physical compaction: The update batches will not be merged, and each "random access" to a trace will require an amount of work that increases linearly with the number of update batches absorbed into the trace.

    This issue can be addressed with the trace handle's distinguish_since method, which unblocks physical merging. We want to take some care with this, because operators like join need the ability to start from an arrangement that is not ahead of the times they need (they need to be able to put a bookmark down at the times of their other input).

    It is probably safe to casually distinguish_since up to the lower envelope of timestamps we believe we will see on other inputs. (Both compaction knobs are exercised in the code sketch following this list.)

  • Logical compaction: Independently, the logical times at which the updates occur are left at their original values, and a full history of changes is preserved. Even with physical merging, this means that a highly dynamic record, e.g. the total for an accumulation query like TPC-H query 01 or 06, will have a full update history, and further updates to it will repeatedly pay the cost of re-accumulating that history.

    This issue can be addressed with the trace handle's advance_through method, which indicates that users of the handle do not intend to distinguish between logical times not in advance of the argument supplied to advance_through. Differential is then able to consolidate equivalence classes of times, and to maintain a footprint proportional to the current size of the collection plus a trailing window of edits proportional to the slop in advance_through.

    This issue is more complicated, as once we advance a trace handle, we can't go back. Whatever action precipitates this advancement seals off the potential to load up other Kafka sources and join them with the maintained traces at times other than those in advance of whatever we have advanced to. This is possibly something we want the user to explicitly opt into.
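
To make the two knobs concrete, here is a minimal, hedged sketch against differential-dataflow's trace-handle API of roughly this era, in which advance_by performs the logical compaction described above (what is called advance_through here) and distinguish_since unblocks physical merging. Method names and signatures have shifted across releases (later ones use set_logical_compaction / set_physical_compaction), so treat this as illustrative rather than as Materialize code.

```rust
// Illustrative sketch, not Materialize code. Assumes a pre-0.11
// differential-dataflow TraceReader API in which `advance_by` performs
// logical compaction and `distinguish_since` performs physical compaction.
use differential_dataflow::input::Input;
use differential_dataflow::operators::arrange::ArrangeByKey;
use differential_dataflow::trace::TraceReader;

fn main() {
    timely::execute_directly(move |worker| {
        // Arrange a keyed collection; keep the input handle and a handle
        // to the arrangement's trace.
        let (mut input, mut trace) = worker.dataflow::<u64, _, _>(|scope| {
            let (input, collection) = scope.new_collection::<(u32, u32), isize>();
            (input, collection.arrange_by_key().trace)
        });

        for round in 0..100u64 {
            input.insert((round as u32 % 10, round as u32));
            input.advance_to(round + 1);
            input.flush();
            worker.step();

            // Logical compaction: we promise not to distinguish logical
            // times earlier than `round`, so differential may consolidate
            // each key's history down to its current accumulation (plus a
            // trailing window bounded by how far this frontier lags).
            trace.advance_by(&[round]);

            // Physical compaction: we promise not to position a new reader
            // before `round`, which unblocks batch merging and keeps random
            // access from degrading linearly in the number of batches. A
            // join would hold this frontier back at the lower envelope of
            // times it still needs from its other input.
            trace.distinguish_since(&[round]);
        }
    });
}
```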

The first issue is fundamentally about physical representation, and we can probably do whatever we want here as long as things don't crash and we don't run out of memory. Ideally I can fix some things in differential so that this is never something a human needs to deal with.

The second issue affects the logical output of the computation, and we want to think very seriously about how to stitch it into the consistency guarantees. As a start, I would propose a command that advances the lower bound of all collections in a (database, timespace, time domain), explicitly integrated into the command stream, so that on replay each use of each collection has a well-specified meaning (I hope).
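
To make the proposal concrete, here is a hypothetical sketch of how such a compaction command might sit in a durable command stream; the type and field names are invented for illustration and are not Materialize's actual command types.

```rust
// Hypothetical sketch (names invented, not Materialize's actual types):
// record compaction as an explicit command in the durable command stream,
// so that replaying the stream assigns the same meaning to every use of
// every collection in the named (database, time domain).
#[derive(Debug)]
enum Command {
    /// A stand-in for ordinary catalog / query commands.
    CreateView { name: String, sql: String },
    /// Advance the logical lower bound of all collections in a time domain.
    /// After this command, reads at times not in advance of `frontier` are
    /// no longer well-defined.
    AdvanceDomain { database: String, time_domain: String, frontier: Vec<u64> },
}

fn main() {
    // Because the advancement is a command like any other, replaying the log
    // reproduces exactly which reads were defined at which points in history.
    let log = vec![
        Command::CreateView { name: "totals".into(), sql: "SELECT ...".into() },
        Command::AdvanceDomain {
            database: "materialize".into(),
            time_domain: "kafka-ingest".into(),
            frontier: vec![12_345],
        },
    ];
    for command in &log {
        println!("replay: {:?}", command);
    }
}
```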

This question originates from the open source project: MaterializeInc/materialize


5 replies

  • weixin_39714383 2020-11-30 08:56

    Related to trace compaction: when we "import" a compacted trace, it is likely very important to pick a time at which we intend for it to "start", and to advance the timestamps in the trace to that time. For example, if we have a collection that has been evolving for a while and the current time is "time", a new query using the associated trace should probably advance each of the times in the trace to "time", so that the stream of changes references only current times rather than a big splat of historical times.

    This is operationally not too hard, but it probably involves some consultation. If anyone else ends up wanting to pick it off, check in with me about trace wrappers that do this. (A rough sketch of the import pattern follows.)
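
    Under the same pre-0.11 API assumptions as the earlier sketch, the pattern might look roughly like the following: advance the handle's logical frontier to the chosen start time before importing, so the batches replayed into the new dataflow can be advanced toward that time instead of replaying a splat of history. The per-update advancement itself is what the trace wrappers would do; this only shows where they would hook in.

    ```rust
    // Hedged sketch of "import at a chosen start time", assuming the
    // pre-0.11 differential-dataflow API (`advance_by`, `import`). The
    // trace wrappers mentioned above would advance each replayed update's
    // time on read.
    use differential_dataflow::input::Input;
    use differential_dataflow::operators::arrange::ArrangeByKey;
    use differential_dataflow::operators::JoinCore;
    use differential_dataflow::trace::TraceReader;

    fn main() {
        timely::execute_directly(move |worker| {
            // Dataflow 1: maintain an arrangement that evolves for a while.
            let (mut input, mut trace) = worker.dataflow::<u64, _, _>(|scope| {
                let (input, collection) = scope.new_collection::<(u32, u32), isize>();
                (input, collection.arrange_by_key().trace)
            });
            for t in 0..10u64 {
                input.insert((t as u32 % 3, t as u32));
                input.advance_to(t + 1);
                input.flush();
                worker.step();
            }

            // Pick the time the new query should "start" at, and advance the
            // handle's logical frontier to it before importing.
            let start = 10u64;
            trace.advance_by(&[start]);

            // Dataflow 2: import the compacted trace and attach a new
            // operator. Its change stream should reference times advanced
            // to `start` rather than the full historical record.
            worker.dataflow::<u64, _, _>(|scope| {
                let arranged = trace.import(scope);
                arranged
                    .join_core(&arranged, |key, v1, v2| Some((*key, *v1, *v2)))
                    .inspect(|x| println!("joined: {:?}", x));
            });

            drop(input);
            while worker.step() {}
        });
    }
    ```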

