duangang2825 2018-04-29 07:06
浏览 135
已采纳

为什么Golang的MD5分布似乎不一致?

I fully expect I have a bug somewhere or am misunderstanding something, but why does the following code not appear to exhibit uniform distribution?

func TestMD5(t *testing.T) {
    n := 50000
    counts := map[uint32]int{} // # of hashes per 1/nth shard

    for i := 0; i < n; i++ {
        hash := md5.Sum(newUUID())
        result := binary.BigEndian.Uint32(hash[:4])
        counts[result/uint32(n)]++
    }

    dupeShards := 0
    dupeEntries := 0
    for _, count := range counts {
        if count > 1 {
            dupeShards++
            dupeEntries += count - 1
        }
    }
    t.Logf("%d inputs hashed to the same %d shards as other inputs.", dupeEntries, dupeShards)

    if len(counts) < n*95/100 {
        t.Fatalf("%d populated shards not within 5%% of expected %d uniform distribution!", len(counts), n)
    }
}

https://play.golang.org/p/05mA0Dl9GBG

Explanation of code:

  • MD5 50k random UUIDs.
  • For each MD5 sum, take the first 4 bytes and convert to a uint32.
  • Divide the result by 50k (using truncated/floor division) to distribute the hashes into 50k evenly spaced shards.

==> I'd expect the 50k MD5 sums to be ~evenly distributed across the 50k shards, but I consistently see only ~38k shards populated, with clumping in ~10k of the shards:

main.go:29: 12075 inputs hashed to the same 9921 shards as other inputs.
main.go:32: 37925 populated shards not within 5% of expected 50000 uniform distribution!

I can repro this with other hashes too (e.g. FNV), so I'm guessing I'm misunderstanding something. Thank you for the help!

  • 写回答

1条回答 默认 最新

  • doucigua0278 2018-04-29 07:32
    关注

    This is absolutely normal behavior, and doesn't show any bias or incorrectness of the MD5 implementation.

    What you are doing is (very close to) taking 50,000 random numbers between 0 and 49,999. When you do this, it's almost certain that many of the numbers will be repeated, and therefore that some numbers won't appear. It would in fact be very unlikely that the 50,000 numbers should all be different with absolutely no repetitions.

    You can test this with a six-sided dice - if you throw it 6 times, you're very unlikely to get all six numbers, and much more likely to see around 3, 4 or 5 of them, with one, two or three repetitions. It's also related to the so-called birthday paradox.

    Another example of this phenomenon is the 'Panini sticker question'. A Panini sticker album is a book with space for around 600 football stickers which commemorate the World Cup of soccer. Each one is numbered and different, and they feature randomly in packets. You have to get one of each number to complete the album. Suppose that you bought exactly the right number of stickers to fill the album. It would be extremely lucky if you were able to fill the album perfectly, without having any doubles or missing stickers. In fact you have to buy on average a large multiple of the number of stickers in order to get at least one of each (if you don't swap duplicates with other collectors).

    The number of different values 0-49,999 which appear and the number which show 'clumping' can be calculated mathematically. I'm not sure exactly how you measure clumping. But the value of 38K populated values will be quite stable from one trial to the next, even though the actual values you see will change.

    In fact, the expected number of populated values is (1 - 1/e)n, where n is the number of possible values, and e is the mathematical constant 2.718281828... The answer for n=50000 is 31606. You won't always get this value of course, but all results should be within a few hundred or so (spitballing here). You made a slight mistake in your program so I haven't been able to decipher the relevant calculation that gives you ~37000.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥35 平滑拟合曲线该如何生成
  • ¥100 c语言,请帮蒟蒻写一个题的范例作参考
  • ¥15 名为“Product”的列已属于此 DataTable
  • ¥15 安卓adb backup备份应用数据失败
  • ¥15 eclipse运行项目时遇到的问题
  • ¥15 关于#c##的问题:最近需要用CAT工具Trados进行一些开发
  • ¥15 南大pa1 小游戏没有界面,并且报了如下错误,尝试过换显卡驱动,但是好像不行
  • ¥15 没有证书,nginx怎么反向代理到只能接受https的公网网站
  • ¥50 成都蓉城足球俱乐部小程序抢票
  • ¥15 yolov7训练自己的数据集