duandi4238 2017-07-04 01:39
浏览 45
已采纳

简洁的二叉树之和,数组构造和寻址

Using 'sum' as a short hand for some arbitrary computation. I have a process that computes a single sum from a list of values by recursively summing pairs of values. Un-paired values are promoted up the tree unaltered until they can be paired.

Given this computation, I'm in search of the best way to balance computation (i.e. number of operation required to access array elements/nodes), and the most succinct encoding of all nodes in a 1 dimensional array (i.e. no gaps, nil values, or repeated values), and preferably without an additional index array that cannot be easily derived from the succinct encoding so that it would have to be saved along with the array.

Although the following are simple examples, in reality the number of values in the initial list can be extraordinarily large (2^47 or more).

For example, given the list [1, 2, 3, 4], the array is trivial: [10, 3, 7, 1, 2, 3, 4], and split nicely into rows that are easy to address by node, or as a reference to the entire row.

But for a 5 item list the tree looks like this:

Tree 1

         15
        /  \
       /    \
      /      \
     /        \
    10          5
  /   \       /   \
 3     7     5     -
/ \   / \   / \   / \
1  2  3  4 5   - -   -

The standard mapping left = i*2+1, right = i*2+2 gives us this array:

Array 1

[ 15, 10,  5,  3,   7,   5,  nil,   1,   2,   3,   4,   5,   nil,   nil, nil]

This array has 4 nil values, and the last element in the list '5' is repeated 2 times.

To improve this we can imply the repetition of the 5, and remove the nil values:

Array 2

[15, 10, 3, 7, 1, 2, 3, 4, 5]

Which is much more compact. This tree is the same, but conceptually looks a bit like:

Tree 2

       15
      / \
     /   \
    10    \
  /   \    \
 3     7    \
/ \   / \    \
1  2  3  4    5

In the Array 2 encoding I have 4 rows:

1. [1, 2, 3, 4]
2. [3, 7]
3. [10, 5]
4. [15]

Rows 1, 2 and 4 can simply be references into Array 2 allowing me to compute results 'in-place' with no allocations or copies. Very fast. Row 3 however, contains values in two non-contiguous cells. I have to break the simple logic used for the other rows, and possibly add copy, indexing or storage for a map.

I can construct complete/balanced sub trees (such as indexes 1-7, the tree for 1, 2, 3, 4), but it seems like they will not always be so nicely aligned when the odd number of items appears at different rows depending on input length. For example consider a tree with an initial list of 6 elements.

  • 写回答

2条回答 默认 最新

  • dongzhang3482 2017-07-04 20:26
    关注

    Let's assume your tree has N nodes on the final (most numerous) row.

    If you do store the nodes that are only propagated upwards, your tree has between 2*N-1 and 2*N-1+log2(N) nodes, total. The exact total number of nodes is given by OEIS A120511. Of these, at most floor(2 + log2(N-1)) are copied/propagated nodes.

    The tree has floor(2 + log2(N-1)) rows. The number of rows as a function of N (the number of elements on the final row) is OEIS A070941.

    The number of rows in such trees is quite low. For example, if you have 240 ≈ 1,000,000,000,000 nodes in the final row, you only have 42 rows in the tree. For 264 nodes, you have just 66. Therefore, if you need some operation per row, it is not a high overhead.

    A simple logarithmic-time function can compute the number of rows and the total number of nodes, given the number of nodes in the final row N:

    # Account for the root node
    rows = 1
    total = 1
    
    curr_left = N
    While (curr_left > 1):
        rows = rows + 1
        total = total + curr_left
        curr_left = (curr_left + 1) / 2
    End While
    

    where / denotes integer division, i.e. any fractional part is discarded/truncated/rounded towards zero. Again, for 264 nodes in the final row, the above loops only 65 times.

    When we know the total number of nodes in the tree, and the number of rows, we can use another logarithmic-time loop to compute the offset of the first element on each row of the tree, and the number of nodes on that row:

    first_offset = []
    nodes = []
    
    curr_row = rows - 1
    curr_offset = total - N
    curr_left = N
    
    While (curr_left > 1):
        nodes[curr_row] = curr_left
        first_offset[curr_row] = curr_offset
        curr_row = curr_row - 1
        curr_left = (curr_left + 1) / 2
        curr_offset = curr_offset - curr_left
    }
    
    first_offset[0] = 0
    nodes[0] = 1
    

    As before, for 264 nodes in the final row, the above loops only 65 times.

    All elements on a row are consecutive in memory, and if we use zero-based indexing, and N is the number of nodes on the final row, and we apply the above, then

    • rows is the number of rows in the tree

    • total is the total number of nodes in the tree

    • There are nodes[r] nodes on row r, if r >= 0 and r < rows

    • Array index for node on row r, column c is first_offset[r] + c

    • Node on row r, column c, with r > 0, has a parent on row r-1, column c/2, at array index first_offset[r-1] + c/2

    • Node on row r, column c, with r < rows - 1, has a left child on row r+1, column 2*c, at array index first_offset[r+1] + 2*c

    • Node on row r, column c, with r < rows - 1 and c < nodes[r] - 1, has a right child on row r+1, column 2*c+1, at array index first_offset[r+1] + 2*c + 1

    • Node on row r, column c, with r < rows - 1 and c < nodes[r] - 1, has both a left and a right child

    This array is compact, and other than the nodes that get propagated upwards (so, maybe a few dozen nodes for a terabyte-sized dataset), wastes no storage.

    If the number of nodes in the final row is stored with the array (for example, as an extra uint64_t following the array data), all readers can recover total, rows, first_offset[], and nodes[], and easily navigate the tree. (However, note that instead of just the array index, you use the "column" and "row" instead, and derive the array index using those.)

    Because first_offset[] and nodes[] arrays have at most a few dozen entries, they should stay hot in caches, and using them should not harm performance.

    Note that not all tree sizes are valid for the rules stated in the second paragraph above. For example, a tree with two nodes makes no sense: why would you duplicate the root node?

    If you do know that the tree size (total) is valid, you can find N based on total in O(log2(total)*log2log2(total)) time complexity using a binary search, or in O((log2(total))²) if you use a simple loop. Remember, total is between 2*N-1 and 2*N-1+log2(N). Conversely, N cannot be greater than (N + 1)/2, or smaller than (N + 1)/2 - log2(total), because total is greater than N, and therefore log2(N) is less than log2(total). So, a binary search could be implemented as

    Function Find_N(total):
        Nmax = (total + 1) / 2
        Nmin = Nmax - log2(total)
    
        t = Total(Nmin)
        If t == total:
            Return Nmin
        Else if t < total:
            Return "Bug!"
        End if
    
        t = Total(Nmax)
        if t == total:
            Return Nmax
        Else if t > total:
            Return "Bug!"
        End if
    
        Loop:
    
            N = (Nmin + Nmax) / 2
            If N == Nmin:
                Return "Invalid tree size!"
            End If
    
            t = Total(N)
            If t > total:
                Nmax = N
            Else if t < total:
                Nmin = N
            Else:
                return N
            End If            
        End Loop
    End Function
    

    Keep in mind that even with 264 nodes in the tree, the above function makes at most 1 + log2(64) = 6 calls to Total, a function implementing the first pseudocode snippet in this answer. Since you typically need this only once per program invocation, the overhead is truly irrelevant.

    You can calculate log2(x) using log(x)/log(2), using the log2() function from <math.h> since C99 (but since double has less precision than uint64_t, I would add +1 to the result, or round it towards positive infinity using ceil(), just to be sure), or even using a simple loop:

    Function ulog2(value):
        result = 0
        While (value > 0):
            result = result + 1
            value = value / 2
        End While
        Return result
    End Function
    

    where once again, / denotes integer division.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论
查看更多回答(1条)

报告相同问题?

悬赏问题

  • ¥30 STM32 INMP441无法读取数据
  • ¥100 求汇川机器人IRCB300控制器和示教器同版本升级固件文件升级包
  • ¥15 用visualstudio2022创建vue项目后无法启动
  • ¥15 x趋于0时tanx-sinx极限可以拆开算吗
  • ¥500 把面具戴到人脸上,请大家贡献智慧
  • ¥15 任意一个散点图自己下载其js脚本文件并做成独立的案例页面,不要作在线的,要离线状态。
  • ¥15 各位 帮我看看如何写代码,打出来的图形要和如下图呈现的一样,急
  • ¥30 c#打开word开启修订并实时显示批注
  • ¥15 如何解决ldsc的这条报错/index error
  • ¥15 VS2022+WDK驱动开发环境