duangang2825 2018-04-29 07:06
浏览 135
已采纳

为什么Golang的MD5分布似乎不一致?

I fully expect I have a bug somewhere or am misunderstanding something, but why does the following code not appear to exhibit uniform distribution?

func TestMD5(t *testing.T) {
    n := 50000
    counts := map[uint32]int{} // # of hashes per 1/nth shard

    for i := 0; i < n; i++ {
        hash := md5.Sum(newUUID())
        result := binary.BigEndian.Uint32(hash[:4])
        counts[result/uint32(n)]++
    }

    dupeShards := 0
    dupeEntries := 0
    for _, count := range counts {
        if count > 1 {
            dupeShards++
            dupeEntries += count - 1
        }
    }
    t.Logf("%d inputs hashed to the same %d shards as other inputs.", dupeEntries, dupeShards)

    if len(counts) < n*95/100 {
        t.Fatalf("%d populated shards not within 5%% of expected %d uniform distribution!", len(counts), n)
    }
}

https://play.golang.org/p/05mA0Dl9GBG

Explanation of code:

  • MD5 50k random UUIDs.
  • For each MD5 sum, take the first 4 bytes and convert to a uint32.
  • Divide the result by 50k (using truncated/floor division) to distribute the hashes into 50k evenly spaced shards.

==> I'd expect the 50k MD5 sums to be ~evenly distributed across the 50k shards, but I consistently see only ~38k shards populated, with clumping in ~10k of the shards:

main.go:29: 12075 inputs hashed to the same 9921 shards as other inputs.
main.go:32: 37925 populated shards not within 5% of expected 50000 uniform distribution!

I can repro this with other hashes too (e.g. FNV), so I'm guessing I'm misunderstanding something. Thank you for the help!

  • 写回答

1条回答 默认 最新

  • doucigua0278 2018-04-29 07:32
    关注

    This is absolutely normal behavior, and doesn't show any bias or incorrectness of the MD5 implementation.

    What you are doing is (very close to) taking 50,000 random numbers between 0 and 49,999. When you do this, it's almost certain that many of the numbers will be repeated, and therefore that some numbers won't appear. It would in fact be very unlikely that the 50,000 numbers should all be different with absolutely no repetitions.

    You can test this with a six-sided dice - if you throw it 6 times, you're very unlikely to get all six numbers, and much more likely to see around 3, 4 or 5 of them, with one, two or three repetitions. It's also related to the so-called birthday paradox.

    Another example of this phenomenon is the 'Panini sticker question'. A Panini sticker album is a book with space for around 600 football stickers which commemorate the World Cup of soccer. Each one is numbered and different, and they feature randomly in packets. You have to get one of each number to complete the album. Suppose that you bought exactly the right number of stickers to fill the album. It would be extremely lucky if you were able to fill the album perfectly, without having any doubles or missing stickers. In fact you have to buy on average a large multiple of the number of stickers in order to get at least one of each (if you don't swap duplicates with other collectors).

    The number of different values 0-49,999 which appear and the number which show 'clumping' can be calculated mathematically. I'm not sure exactly how you measure clumping. But the value of 38K populated values will be quite stable from one trial to the next, even though the actual values you see will change.

    In fact, the expected number of populated values is (1 - 1/e)n, where n is the number of possible values, and e is the mathematical constant 2.718281828... The answer for n=50000 is 31606. You won't always get this value of course, but all results should be within a few hundred or so (spitballing here). You made a slight mistake in your program so I haven't been able to decipher the relevant calculation that gives you ~37000.

    本回答被题主选为最佳回答 , 对您是否有帮助呢?
    评论

报告相同问题?

悬赏问题

  • ¥20 机器学习能否像多层线性模型一样处理嵌套数据
  • ¥20 西门子S7-Graph,S7-300,梯形图
  • ¥50 用易语言http 访问不了网页
  • ¥50 safari浏览器fetch提交数据后数据丢失问题
  • ¥15 matlab不知道怎么改,求解答!!
  • ¥15 永磁直线电机的电流环pi调不出来
  • ¥15 用stata实现聚类的代码
  • ¥15 请问paddlehub能支持移动端开发吗?在Android studio上该如何部署?
  • ¥20 docker里部署springboot项目,访问不到扬声器
  • ¥15 netty整合springboot之后自动重连失效