weixin_39547392 2020-11-29 14:24

Thrashing while merging

Merging performance is still very slow when the geometry+index exceeds the size of memory because of all the paging.

For 2 billion points, there are about 49 GB of geometry and 66 GB of index (variable-size geometry records, apparently averaging 24 bytes; a fixed 32 bytes/record for the index). At 8 sec/GB for I/O, that's roughly 15 minutes for just one linear pass over the ~115 GB.

Linearizing the merge: there were 1252 chunks during the sort. If each pass merged chunks pairwise instead of merging from all 1252 at once, it would take log₂(1252) ≈ 10.3 passes to produce the complete file, so about 2.5 hours of I/O from 10.3 linear runs through the data. Does this actually come out any better than letting it thrash?

Larger sort chunks would reduce the number of passes. (Do Macs still crash when doing random access writes to a 2GB map?) Use the ratio of the geometry size to the index size to estimate how many geometry records would fit in memory.

The index could maybe be a little smaller by using start+len instead of start…end and by packing start and sequence into smaller integers.
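For example (a hypothetical layout, not tippecanoe's actual index struct): storing a 40-bit start plus a 24-bit length instead of two full 64-bit offsets, and shrinking the sequence number to 32 bits, would cut a record from 32 to 24 bytes:

```cpp
#include <cstdint>

// Hypothetical packed index record -- NOT the actual tippecanoe layout.
struct PackedIndex {
    uint64_t ix;          // 64-bit location index (the sort key)
    uint64_t start : 40;  // geometry file offset, good to 1 TB
    uint64_t len : 24;    // record length, good to 16 MB
    uint32_t seq;         // original input sequence, good to 4G features
};

// 8 (ix) + 8 (packed start+len) + 4 (seq) + 4 padding on common ABIs.
static_assert(sizeof(PackedIndex) == 24, "24 bytes vs. 32 with start+end");
```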

This question comes from the open-source project: mapbox/tippecanoe


13 replies

  • weixin_39547392 2020-11-29 14:24

    Idea: radix sort. Instead of concatenating the whole index and geometry together before sorting, split it out into 500 (or however many open files we get) files by index prefix, sort/merge each of them, and then concatenate them back together at the end.
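    A sketch of the bucketing step, assuming the sort key is a 64-bit index whose high bits are the spatial prefix (the function name is made up):

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <cstdio>

    // Route a record to one of 2^bits bucket files by its index prefix.
    // Because the bucket number is the *most significant* bits of the sort
    // key, sorted buckets concatenate back into one fully sorted file.
    size_t bucket_for(uint64_t ix, unsigned bits) {
        return (size_t)(ix >> (64 - bits));
    }

    int main() {
        unsigned bits = 9;  // 512 buckets, close to the ~500 open files above
        assert(bucket_for(0x0000000000000000ULL, bits) == 0);
        assert(bucket_for(0xFFFFFFFFFFFFFFFFULL, bits) == 511);
        // Keys in a lower bucket always sort before keys in a higher one.
        assert(bucket_for(0x1000000000000000ULL, bits) <
               bucket_for(0x2000000000000000ULL, bits));
        printf("ok\n");
        return 0;
    }
    ```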

    In the simplest case, the radices are z4 or z5 tiles. If the data isn't worldwide, we can do better just by keeping track of the highest and lowest index seen. Better yet would be some tracking of the statistical distribution of prefixes, but that would be hard.
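    For scale: zoom level z has 4^z tiles, so z4 gives 256 radices and z5 gives 1024, bracketing the ~500 open files mentioned above:

    ```cpp
    #include <cstdio>

    int main() {
        // Tiles at zoom z: 4^z = 2^(2z).
        for (int z = 4; z <= 5; z++) {
            printf("z%d: %lld tiles\n", z, 1LL << (2 * z));  // 256, 1024
        }
        return 0;
    }
    ```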

    It might also be possible to use more radices than there are open files: give each radix its own offset range in a giant memory-mapped file with holes, if that doesn't have bad performance.
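    A sketch of the holes idea with POSIX calls (the 1 GB stride is an arbitrary example): on most filesystems, ranges that are never written consume no disk blocks, so each radix can be written at its own fixed offset in one sparse file.

    ```cpp
    #include <cassert>
    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        // One fixed-size region per radix inside a single sparse file.
        const off_t stride = 1LL << 30;  // 1 GB reserved per radix (example)
        int fd = open("/tmp/radix.sparse", O_RDWR | O_CREAT | O_TRUNC, 0644);
        assert(fd >= 0);

        const char rec[] = "record";
        // Write radix 0 and radix 700 at their own offsets; everything in
        // between remains a hole and takes no space on disk.
        assert(pwrite(fd, rec, sizeof(rec), 0 * stride) ==
               (ssize_t)sizeof(rec));
        assert(pwrite(fd, rec, sizeof(rec), 700 * stride) ==
               (ssize_t)sizeof(rec));

        close(fd);
        unlink("/tmp/radix.sparse");
        return 0;
    }
    ```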

