I fully expect I have a bug somewhere or am misunderstanding something, but why does the following code not appear to exhibit uniform distribution?
func TestMD5(t *testing.T) {
n := 50000
counts := map[uint32]int{} // # of hashes per 1/nth shard
for i := 0; i < n; i++ {
hash := md5.Sum(newUUID())
result := binary.BigEndian.Uint32(hash[:4])
counts[result/uint32(n)]++
}
dupeShards := 0
dupeEntries := 0
for _, count := range counts {
if count > 1 {
dupeShards++
dupeEntries += count - 1
}
}
t.Logf("%d inputs hashed to the same %d shards as other inputs.", dupeEntries, dupeShards)
if len(counts) < n*95/100 {
t.Fatalf("%d populated shards not within 5%% of expected %d uniform distribution!", len(counts), n)
}
}
https://play.golang.org/p/05mA0Dl9GBG
—
Explanation of code:
- MD5 50k random UUIDs.
- For each MD5 sum, take the first 4 bytes and convert to a uint32.
- Divide the result by 50k (using truncated/floor division) to distribute the hashes into 50k evenly spaced shards.
==> I'd expect the 50k MD5 sums to be ~evenly distributed across the 50k shards, but I consistently see only ~38k shards populated, with clumping in ~10k of the shards:
main.go:29: 12075 inputs hashed to the same 9921 shards as other inputs.
main.go:32: 37925 populated shards not within 5% of expected 50000 uniform distribution!
I can repro this with other hashes too (e.g. FNV), so I'm guessing I'm misunderstanding something. Thank you for the help!