Some birthday paradox examples¶
The common example and introduction to the birthday paradox is stated as something like
How large does a group of people need to be for the chances that two members of the group will have the same birthday will be above 50%?
The answer is 23, which is much smaller than one might originally think. It is called a paradox only because the answer runs against expections.
We will perform a few computations to illustrate this, and how it may apply to the numbers we worry about in security.
import math
from typing import NewType, TypeGuard, Any
import random
import matplotlib.pyplot as plt
import numpy as np
import inflect
A simulation¶
Let's create a simple simulation of the problem as typically introduced.
What we will to is grow a group of people, adding one person after another until the one added has the same birthday as someone already in the group. Each individual is given a birthday selected at random. This n-th person we add is the "first collision" of birthdays.
def first_collision_sim(d: int = 365) -> int:
if not (isinstance(d, int) and d > 0):
raise ValueError(f'd ({d}) must be a positive integer')
days: set[int] = set()
while True:
bd = random.randrange(1, d+1)
if bd not in days:
days.add(bd)
else:
break
return len(days)
Let's run the simulation a few times
for trial in range(10):
print(f'Trial {trial + 1}: First collision at person {first_collision_sim()}')
Trial 1: First collision at person 21 Trial 2: First collision at person 20 Trial 3: First collision at person 13 Trial 4: First collision at person 9 Trial 5: First collision at person 28 Trial 6: First collision at person 22 Trial 7: First collision at person 32 Trial 8: First collision at person 35 Trial 9: First collision at person 23 Trial 10: First collision at person 22
Let's run the simulation 100000 times and collect the results into firsts
firsts: list[int] = [first_collision_sim() for _ in range(100000)]
And now we plot a histogram of those.
bins = np.arange(min(firsts), max(firsts)+3, 3)
plt.hist(firsts, bins = bins)
plt.ylabel("occurences")
plt.xlabel("first collision")
plt.show()
It peaks near 23.
Computing probabilities¶
A probability is a number between 0 and 1. I'm picky, so I will create a type for that.
NewType
kinda-sorta creates a subclass of a type. At least with respect to type hinting. And TypeGuard
allows us to have something like isinstance
for our custom types to force type narrowing.
Prob = NewType('Prob', float)
def is_prob(val: Any) -> TypeGuard[Prob]:
"""true if val is a float, s.t. 0.0 <= va <= 1.0"""
if not isinstance(val, float):
return False
return val >= 0.0 and val <= 1.0
There are good vidoes and tutorials on where this first formula for computing the probability of a collision given a number of people (n) and number of possible birthdays (d). I will not go over that here.
$$ p(n, d) = 1.0 - \frac{V_{dn}}{V_t} $$
$$ V_{dn} = \frac{d!}{(d-n)!} $$
$$ V_t = d^n $$
def pbirthday_exact(n: int, d: int = 365) -> Prob:
"""probability that at least one pair of n "people" will the same birthday from d "days"."""
if not (isinstance(d, int) and d > 0):
raise ValueError(f'd ({d}) must be a positive integer')
if not (isinstance(n, int) and n > 0):
raise ValueError(f'n ({n}) must be a positive integer')
if n >= d:
return Prob(1.0)
v_dn = math.perm(d, n)
v_t = pow(d, n)
p = 1.0 - float(v_dn / v_t)
if not is_prob(p):
raise Exception("this should not happen")
return p
pbirthday_exact(22)
0.4756953076625501
The exact formula is useful for small d (a few hundred at most). After that, it is better to use an approximation. The "exact" formula is computationally expensive with anything other than small numbers.[^1] I've taken these from Wikipedia.
[^1]: I haven't actually tested how python's math.perm
handles larger numbers. Perhaps it does some cleverness with the math.gammaln
functions to do this efficiently. Still, the approximations are good approximations for larger inputs.
$$ p(n, d) \approx 1 - e^{\frac{-n^2}{2d}} $$
def pbirthday_approx(n: int, d:int = 365) -> Prob:
if not (isinstance(d, int) and d > 0):
raise ValueError(f'd ({d}) must be a positive integer')
if not (isinstance(n, int) and n > 0):
raise ValueError(f'n ({n}) must be a positive integer')
if n >= d:
return Prob(1.0)
p = 1.0 - math.exp(-(n * n)/(2*d))
if not is_prob(p):
raise Exception("this should not happen")
return p
pbirthday_approx(22)
0.48470395490313967
def pbirthday(n: int, d:int = 365, mode="auto") -> Prob:
if mode not in ["auto", "exact", "approximate"]:
raise ValueError('mode must be one of "auto", "exact", "approximate"')
if mode == "auto":
mode = "exact" if n < 1000 else "approximate"
return pbirthday_exact(n,d) if mode == "exact" else pbirthday_approx(n,d)
Suppose that instead of computing the probability of a collision given n things from a pool of d possibilities we want to know what is the minimum number we need to draw to have a p probability chance of a collision. For reasons that have to do with how people working in statistics call things, what we are asking for is the "quantile" for a probability distribution, so I'm naming this qbirthday
.
Our inputs are our number of "days" (really the number of possible distinctions) and the probability of a collision we are looking for.
$$ n(p,d) \approx \sqrt{2d \cdot \ln\left(\frac{1}{1-p}\right)} $$
def qbirthday_approx(p:float = 0.5, d: int = 365) -> int:
if not is_prob(p) or p == 0.0:
raise ValueError(f'p ({p}) must be a positive probability')
n = math.sqrt(2 * d * math.log(1.0/(1.0 - p)))
return math.ceil(n)
qbirthday_approx()
23
It might be fun to write the exact form for qbirthday, but I won't try that now.
qbirthday = qbirthday_approx
Now we test with some bigger numbers¶
1Password's UUIDs are not really UUIDs. They are just random 16-byte (128-bit) numbers. The the number of distinct possibilites is $d = 2^{128}$.
d = 2 ** 128
Let's consider how many of these we would need to have a 1 in a billion ($10^9$) chance of a collision
p = 1e-9
q = qbirthday(p, d)
print(f'{q:.2e}')
8.25e+14
def digit_count(x: float) -> int:
"""returns the nunmber of digits in the integer part of x"""
x = abs(x)
result = math.floor(math.log10(x) + 1)
return int(result)
# I'm surprised python doesn't have something like R's signif().
def round_to_nsf(number, nsf=6) -> int | float:
"""Rounds number to nsf sigificant figures"""
return round(number, nsf - digit_count(number))
inflector = inflect.engine()
print(inflector.number_to_words(round_to_nsf(q, nsf=3)))
eight hundred and twenty-five trillion
And so, we'd need to have more than 800 trillion UUIDs before we'd even have a one in a billion chance of a collision.
We can ask what would be the average number of items for each of these 10 billion users before we hit a one in one billion chance of collision. q is total number of items we'd need to have before reaching that chance of a collision.
q / 10_000_000_000
82496.350816977
So it would take these 10 billion users to have an average of more than 80 thousand items for use to hit a one in a billion chance of an item UUID collision.
I feel comfortable with our 16 byte (128 bit) UUIDs for a long time to come. Perhaps when we reach one billion users with many of them having 10,000 items, we should start planning for greater collision avoidance.
Note that at that point (1 billion users with 10,000 items each) the chance of collision is ...
p = pbirthday(n =1_000_000_000, d=2**128)
0.0
Well, I guess the probability may smaller than the smallest positive number that can be repressented by a float
. I could rewrite our calculation functions to deal with arbitrary precision rational numbers, but I won't. Instead I will say that that is a number smaller than $1/10^{44}$. It might be larger than that and other rounding errors led to the result of 0 earlier, but this isn't a class on numerican methods (and I have never taken such a class.)