Why does Golang's MD5 distribution appear to be non-uniform?

I fully expect I have a bug somewhere or am misunderstanding something, but why does the following code not appear to exhibit uniform distribution?

func TestMD5(t *testing.T) {
    n := 50000
    counts := map[uint32]int{} // # of hashes per 1/nth shard

    for i := 0; i < n; i++ {
        hash := md5.Sum(newUUID())
        result := binary.BigEndian.Uint32(hash[:4])
        counts[result/uint32(n)]++
    }

    dupeShards := 0
    dupeEntries := 0
    for _, count := range counts {
        if count > 1 {
            dupeShards++
            dupeEntries += count - 1
        }
    }
    t.Logf("%d inputs hashed to the same %d shards as other inputs.", dupeEntries, dupeShards)

    if len(counts) < n*95/100 {
        t.Fatalf("%d populated shards not within 5%% of expected %d uniform distribution!", len(counts), n)
    }
}

https://play.golang.org/p/05mA0Dl9GBG

—

Explanation of code:

MD5 50k random UUIDs.
For each MD5 sum, take the first 4 bytes and convert to a uint32.
Divide the result by 50k (using truncated/floor division) to distribute the hashes into 50k evenly spaced shards.

==> I'd expect the 50k MD5 sums to be ~evenly distributed across the 50k shards, but I consistently see only ~38k shards populated, with clumping in ~10k of the shards:

main.go:29: 12075 inputs hashed to the same 9921 shards as other inputs.
main.go:32: 37925 populated shards not within 5% of expected 50000 uniform distribution!

I can repro this with other hashes too (e.g. FNV), so I'm guessing I'm misunderstanding something. Thank you for the help!

Answer by doucigua0278, 2018-04-29 07:32:
This is absolutely normal behavior, and doesn't show any bias or incorrectness of the MD5 implementation.

What you are doing is (very close to) taking 50,000 random numbers between 0 and 49,999. When you do this, it's almost certain that many of the numbers will be repeated, and therefore that some numbers won't appear. It would in fact be very unlikely that the 50,000 numbers should all be different with absolutely no repetitions.

You can test this with a six-sided die: if you throw it 6 times, you're very unlikely to get all six numbers, and much more likely to see around 3, 4 or 5 of them, with one, two or three repetitions. It's also related to the so-called birthday paradox.
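
To put numbers on the die example, here is a minimal sketch that computes the exact values instead of simulating throws. The function names `expectedDistinct` and `probAllDistinct` are made up for this illustration, and the only assumption is a fair die with independent throws.

```go
package main

import (
	"fmt"
	"math"
)

// expectedDistinct returns the expected number of distinct values seen when
// drawing `draws` times, uniformly and independently, from `values` outcomes.
// Each outcome is missed with probability (1 - 1/values)^draws.
func expectedDistinct(values, draws int) float64 {
	v := float64(values)
	return v * (1 - math.Pow(1-1/v, float64(draws)))
}

// probAllDistinct returns the probability that `draws` uniform draws from
// `values` outcomes contain no repetition at all.
func probAllDistinct(values, draws int) float64 {
	p := 1.0
	for i := 0; i < draws; i++ {
		p *= float64(values-i) / float64(values)
	}
	return p
}

func main() {
	fmt.Printf("expected distinct faces in 6 throws: %.2f\n", expectedDistinct(6, 6))
	fmt.Printf("chance of seeing all 6 faces:        %.4f\n", probAllDistinct(6, 6))
}
```

On average about 4 of the 6 faces show up, and seeing all six in six throws happens only around 1.5% of the time.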

Another example of this phenomenon is the 'Panini sticker question'. A Panini sticker album is a book with space for around 600 football stickers which commemorate the World Cup of soccer. Each one is numbered and different, and they feature randomly in packets. You have to get one of each number to complete the album. Suppose that you bought exactly the right number of stickers to fill the album. It would be extremely lucky if you were able to fill the album perfectly, without having any doubles or missing stickers. In fact you have to buy on average a large multiple of the number of stickers in order to get at least one of each (if you don't swap duplicates with other collectors).
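
The "large multiple" can be estimated with the classic coupon collector result: with n equally likely stickers, the expected number of purchases needed to see every one is n times the n-th harmonic number. The sketch below assumes stickers are bought one at a time and are uniformly random, which real packets only approximate; `expectedPurchases` is a name invented for this example.

```go
package main

import "fmt"

// expectedPurchases returns n * H_n, the expected number of uniformly random
// stickers that must be bought to see all n distinct stickers at least once.
func expectedPurchases(n int) float64 {
	h := 0.0 // harmonic number H_n = 1 + 1/2 + ... + 1/n
	for k := 1; k <= n; k++ {
		h += 1 / float64(k)
	}
	return float64(n) * h
}

func main() {
	n := 600 // album size, taken from "around 600" in the text above
	e := expectedPurchases(n)
	fmt.Printf("expected stickers to fill a %d-slot album: %.0f (about %.1fx the album size)\n",
		n, e, e/float64(n))
}
```

For an album of 600 this comes to roughly seven times the album size.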

The number of different values 0-49,999 which appear and the number which show 'clumping' can be calculated mathematically. I'm not sure exactly how you measure clumping. But the value of 38K populated values will be quite stable from one trial to the next, even though the actual values you see will change.

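In fact, when n inputs are hashed uniformly into n possible values, the expected number of populated values is n(1 - (1 - 1/n)^n), which is approximately (1 - 1/e)n, where e is the mathematical constant 2.718281828... For n=50000 that is about 31606. You won't always get exactly this value of course, but all results should be within a few hundred or so (spitballing here). The ~38k you observed is higher because of a slight mistake in your program: it divides each hash by n rather than by the shard width 2^32/n, so the hashes actually land in about 85,900 shards instead of 50,000. With 50,000 inputs spread over that many shards, the same reasoning predicts roughly 37,900 populated shards, which matches your output.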
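
As a rough check of these figures, the sketch below evaluates the expected number of populated shards for two cases: n inputs in n shards, and n inputs in the larger number of shards that `result/uint32(n)` from the question actually produces. It also shows one possible way to map a 32-bit hash onto exactly n equal-width shards. The helper names `expectedPopulated` and `shardOf` are assumptions made for this illustration and do not appear in the original program.

```go
package main

import (
	"fmt"
	"math"
)

// expectedPopulated returns the expected number of non-empty shards when
// `inputs` uniformly distributed hashes fall into `shards` equally likely shards.
func expectedPopulated(shards, inputs int) float64 {
	s := float64(shards)
	return s * (1 - math.Pow(1-1/s, float64(inputs)))
}

// shardOf maps a 32-bit hash onto one of n equal-width shards in [0, n)
// by scaling the hash from the range [0, 2^32) down to [0, n).
func shardOf(hash uint32, n int) uint32 {
	return uint32((uint64(hash) * uint64(n)) >> 32)
}

func main() {
	n := 50000

	// hash/n yields indexes 0 .. MaxUint32/n, so the question's code really
	// uses about 85,900 shards (the last one slightly narrower), not n.
	actualShards := int(uint64(math.MaxUint32)/uint64(n)) + 1

	fmt.Printf("n inputs into n shards:          ~%.0f populated\n", expectedPopulated(n, n))
	fmt.Printf("n inputs into %d shards (hash/n): ~%.0f populated\n",
		actualShards, expectedPopulated(actualShards, n))
	fmt.Printf("largest hash maps to shard %d of %d\n", shardOf(math.MaxUint32, n), n)
}
```

Note that switching to something like `shardOf` only changes which number you should expect: with exactly 50,000 shards about 31.6k of them will be populated, so the 95% check in the test would still fail even though the hash is perfectly uniform.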

This answer was selected by the asker as the best answer.