Tuesday, August 4, 2009

The birthday problem

Everyone knows somebody who has the same birthday as they do. And if not, you can go to wikipedia, like I did, to find that both Anton van Leeuwenhoek and Kevin Kline were also born on October 24. In this post, I want to explore this famous problem. If we ignore Feb 29 and assume that births are evenly distributed over the days of the year (which may not be true), then the probability that two individuals chosen at random share the same birthday is 1/365.

So let's think about this famous gathering, and ask the question: if the individual chance that Albert and Max share the same birthday is 1/365, what is the probability that there are at least two people in this group that do share the same special day?



The key, as you probably guess, is that this is a combinations problem. Consider a group of 5 individuals (A-E), if F walks up to the group and introduces herself, there are 5 introductions to make.



If we think about building up a group in steps, then for n individuals the number of introductions that have been made is:

Σ 1 + 2 ... + n-1


This is the problem supposedly solved by Gauss as a young boy.

Another way to think about it is to consider that if each person in the group of n people shakes hands with every other person, there are n * (n-1) hands involved in handshakes, but then we must divide by two to get the unique interactions, since we've counted one hand for both of the interacting partners.

In general, we have the formula for combinations: C(n,k) = n! / (n-k)! k!, where k = 2.

We can solve the birthday problem in a couple of ways. We may say that we have the probability for each pair that they do not share a birthday, which is 364/365. The probability that all the independent combinations do not share a birthday is (364/365)**C(n,2). The probability of the complementary event, that at least one pair does share a birthday, is 1 - (364/365)**C(n,2).

The second approach is to consider the group with 2 people and P = 364/365 that they do not share a birthday. If a new person walks up to the group, there are 363 birthdays which would preserve the "no shared birthday" criterion. The probability of the desired event is then 1 - 364/365 * 363/365..., extended for n-1 steps.

This is easy to program, and I won't bore you with the details, but I will show this pretty plot I made using R:



Now, to the point of the post. I thought about finding a group near the critical size and testing it for the birthday criterion. I'm not such a big sports fan anymore, unless you consider politics a sport. How about the Presidents of the United States? Barack Obama is #44. You can get their vitals from wikipedia, but I found a text version on the web here.

The data needs just a bit of cleanup. One date lacks the comma, one date is listed as April 28th. And one entry has two tabs separating the name and the birthday. And why, exactly, are we presented with Obama's middle name? ("I got my middle name from someone who obviously, never thought I'd be running for President")-video.

Here are the results:

James K. Polk             November 2
Warren G. Harding November 2

Millard Fillmore January 7
Richard M. Nixon January 9
William McKinley January 29
Franklin D. Roosevelt January 30
Ronald Reagan February 6
William Henry Harrison February 9
Abraham Lincoln February 12
George Washington February 22
Andrew Jackson March 15
James Madison March 16
Grover Cleveland March 18
John Tyler March 29
Thomas Jefferson April 13
James Buchanan April 23
Ulysses S. Grant April 27
James Monroe April 28
Harry S Truman May 8
John Kennedy May 29
George H. W. Bush June 12
Calvin Coolidge July 4
George W. Bush July 6
John Quincy Adams July 11
Gerald R. Ford July 14
Barack Hussein Obama August 4
Herbert Hoover August 10
William J. Clinton August 19
Benjamin Harrison August 20
Lyndon B. Johnson August 27
William Howard Taft September 15
Jimmy Carter October 1
Rutherford B. Hayes October 4
Chester A. Arthur October 5
Dwight D. Eisenhower October 14
Theodore Roosevelt October 27
John Adams October 30
Warren G. Harding November 2
James K. Polk November 2
James A. Garfield November 19
Franklin Pierce November 23
Zachary Taylor November 24
Martin Van Buren December 5
Woodrow Wilson December 28
Andrew Johnson December 29


And here is the code:


fn = 'presidents.txt'
FH = open(fn,'r')
data = FH.read().strip()
FH.close()
L = data.split('\n')

def parseDate(s):
s = s.strip()
y = int(s.split()[-1])
s = s.split(',')[0]
m,d = s.split()
return {'year':y,'month':m,
'day':int(d) }

D = dict()
for e in L:
t = e.split('\t')
name,date = t[0],t[-1] # extra tabs in some
D[name] = parseDate(date)

for i,k1 in enumerate(D.keys()):
for j in range(i):
k2 = D.keys()[j]
if D[k1]['month'] == D[k2]['month']:
if D[k1]['day'] == D[k2]['day']:
print k1.ljust(25),
print D[k1]['month'],D[k1]['day']
print k2.ljust(25),
print D[k2]['month'],D[k2]['day']

print
mL = ['January','February','March','April',
'May','June','July','August','September',
'October','November','December']
def f(k):
return mL.index(D[k]['month']),D[k]['day']
for k in sorted(D.keys(),key=f):
print k.ljust(25),
print D[k]['month'],D[k]['day']