Materialize maintains changes to collections in "traces", each of which initially looks like a log of updates to the collection. This is fine for the first few moments of a demo, but with enough churn going on we will have two problems:
- Physical compaction: The update batches will not be merged, and each "random access" to a trace will require an amount of work that increases linearly with the number of update batches absorbed into the trace.

  This issue can be addressed with the trace handle's `distinguish_since` method, which unblocks physical merging. We want to take some care here, because operators like `join` need the ability to start from an arrangement that is not ahead of the times they need (they need to be able to put a bookmark down at the times of their other input). It is probably the case that we can casually `distinguish_since` the lower envelope of timestamps we believe we will see in other inputs.

- Logical compaction: Independently, the logical times at which the updates occur will be left at their original values, and a full history of changes is preserved. Even with physical merging, this means that a highly dynamic record, e.g. the total for an accumulation query like TPC-H query 01 or 06, will have a full update history, and further updates to it will do the work of repeatedly re-accumulating that history.

  This issue can be addressed with the trace handle's `advance_through` method, which indicates that users of the handle do not intend to distinguish between logical times not in advance of the argument supplied to `advance_through`. Differential is then able to consolidate equivalence classes of times and maintain a footprint proportional to the current size of the collection, plus a trailing window of edits proportional to the slop of `advance_through`.

  This issue is more complicated, as once we advance a trace handle, we can't go back. Whatever action precipitates this advancement seals off the potential to load up other Kafka sources and join them with the maintained traces at times other than those in advance of whatever we have advanced to. This is possibly something we want the user to explicitly opt in to.
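The effect of logical compaction can be sketched in plain Rust. This is an illustration of the semantics only, with hypothetical types; it is not Materialize's or differential dataflow's actual implementation. Advancing every update time to a frontier makes historical times indistinguishable, so a long history of changes to one record consolidates down to its net value:

```rust
// Sketch of logical compaction: updates are (key, time, diff) triples
// with u64 times and i64 diffs. Hypothetical representation.
use std::collections::HashMap;

/// Advance every update time to be at least `frontier`, then consolidate:
/// updates whose times collapse to the same value merge by summing diffs,
/// so a full history shrinks toward one net update per key.
fn advance_and_consolidate(updates: &mut Vec<(String, u64, i64)>, frontier: u64) {
    for (_key, time, _diff) in updates.iter_mut() {
        *time = (*time).max(frontier);
    }
    let mut merged: HashMap<(String, u64), i64> = HashMap::new();
    for (key, time, diff) in updates.drain(..) {
        *merged.entry((key, time)).or_insert(0) += diff;
    }
    updates.extend(
        merged
            .into_iter()
            .filter(|&(_, diff)| diff != 0)
            .map(|((key, time), diff)| (key, time, diff)),
    );
    updates.sort();
}

fn main() {
    // A highly dynamic record: many updates to the same total.
    let mut updates = vec![
        ("total".to_string(), 1, 5),
        ("total".to_string(), 2, -5),
        ("total".to_string(), 2, 7),
        ("total".to_string(), 3, -7),
        ("total".to_string(), 3, 9),
    ];
    advance_and_consolidate(&mut updates, 3);
    // All five updates collapse to the single net value at time 3.
    assert_eq!(updates, vec![("total".to_string(), 3, 9)]);
}
```

This also shows why the operation is one-way: once the times have been advanced and the diffs summed, the original per-time history cannot be recovered, which is exactly the "can't go back" concern above.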
The first issue is fundamentally about physical representation, and we can probably do whatever we want here as long as things don't crash and we don't run out of memory. Ideally I can fix some things in differential so that this is never something a human needs to deal with.
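To make the linear-cost claim concrete, here is a small sketch of why physical merging matters, assuming a trace is a list of sorted batches of (key, diff) pairs. The types are hypothetical, not Materialize's actual trace structures:

```rust
/// Without merging, a point lookup must binary-search every batch,
/// so its cost grows linearly with the number of batches absorbed.
fn lookup(batches: &[Vec<(u64, i64)>], key: u64) -> i64 {
    batches
        .iter()
        .map(|b| match b.binary_search_by_key(&key, |&(k, _)| k) {
            Ok(i) => b[i].1,
            Err(_) => 0,
        })
        .sum()
}

/// Physical compaction: merge all batches into one sorted batch,
/// summing diffs for equal keys; afterwards lookups touch one batch.
fn merge(batches: Vec<Vec<(u64, i64)>>) -> Vec<(u64, i64)> {
    let mut all: Vec<(u64, i64)> = batches.into_iter().flatten().collect();
    all.sort_by_key(|&(k, _)| k);
    let mut merged: Vec<(u64, i64)> = Vec::new();
    for (k, d) in all {
        match merged.last_mut() {
            Some(last) if last.0 == k => last.1 += d,
            _ => merged.push((k, d)),
        }
    }
    merged.retain(|&(_, d)| d != 0);
    merged
}

fn main() {
    let batches = vec![vec![(1, 1), (2, 1)], vec![(2, 1), (3, 1)], vec![(2, -2)]];
    // Three batches must be consulted for key 2; its net diff is zero.
    assert_eq!(lookup(&batches, 2), 0);
    let single = merge(batches);
    // After merging, the cancelled key disappears entirely.
    assert_eq!(single, vec![(1, 1), (3, 1)]);
    assert_eq!(lookup(&[single], 2), 0);
}
```

Note that this merging changes only the physical layout: the answer to every lookup is identical before and after, which is why the first issue is purely about representation.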
The second issue affects the logical output of the computation, and we want to think very seriously about how to stitch this into the consistency guarantees. As a start, I would propose that we have a command that advances the lower bound of all collections in a (database, timespace, time domain), explicitly integrated into the command stream, so that on replay each use of each collection has a well-specified meaning (I hope).
This question originates from the open-source project MaterializeInc/materialize.