The Classic Coupon Collector's Problem
0. Introduction
At the moment, a large German grocery store is handing out 35 collector cards showing the faces of the German national football team, including staff.
For every 10 Euro worth of groceries you receive two separately packaged cards. The cards you receive are random, so there is no way to specifically pick cards that are still missing from the collection.
All collected cards are then stored in a corresponding booklet.
A classic question in probability theory asks how many cards you need to receive (or, in other words, how much money you need to spend) on average in order to obtain all cards and hence complete the collection. In the following, I will discuss that question in detail, drawing mainly on [1] and [2].
1. Preliminaries
1.1 Assumptions and Notation
In order to make the model mathematically tractable, we need to introduce assumptions on the problem structure. Assumptions (a) and (b) are essential, as without them the mathematics becomes considerably more complicated.
(a). We assume that there is no trading of duplicates possible.
(b). The probability of being drawn is equally likely for all cards. Also, all drawings are independent of one another.
(c). $N$ denotes the total number of cards available. $C_j$ denotes the (random) card (or player) that is drawn at step $j$, i.e. $C_j$ is a random variable.
(d). $S_m(i)$ denotes the number of cards of player $i$ after $m$ draws.
(e). Define $A = \inf\;\{a \in \mathbb{N} : S_a(i) > 0\; \forall\, i \in \{1, \ldots, N\}\}$. $A$ is thus the smallest number of draws after which every distinct card has been drawn at least once.
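To make the notation concrete, here is a minimal simulation sketch of one run of the model (the function name and seed are my own, not part of the formal setup):

```python
import random

random.seed(1)

def draws_until_complete(n=35):
    """Simulate one run: draw uniform cards C_1, C_2, ... and return A,
    the smallest number of draws m with S_m(i) > 0 for all i."""
    counts = [0] * n              # counts[i - 1] stores S_m(i) after m draws
    m = 0
    while min(counts) == 0:       # at least one card still missing
        c = random.randint(1, n)  # C_j: a uniformly drawn card
        counts[c - 1] += 1
        m += 1
    return m

print(draws_until_complete())     # one realization of A, always >= 35
```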
1.2 An auxiliary lemma
The general approach to solving the problem follows [2]:
We want to express the event that at least one card has not been drawn yet through events in which a prescribed set of $k$ cards is missing, for some positive integer $k$. To accomplish this, we use the inclusion-exclusion principle.
Proposition A: Inclusion-Exclusion Principle. For a probability space $(\Omega, \mathcal{A}, P)$ and sets $A_i \in \mathcal{A}$ for $i \in I = \{ 1, \ldots, n\}$, it holds that
\begin{equation} \boxed{P\Big(\bigcup\limits_{i = 1}^{n} A_i\Big) = \sum\limits_{i = 1}^{n} (-1)^{i + 1}\sum\limits_{J \subset \{1, \ldots, n\}, \vert J \vert = i} P\Big(\bigcap\limits_{j \in J} A_j\Big)} \end{equation}
Proof.
The following proof is the one I wrote up while solving the chapter one exercises of Georgii [1]; it treats the more general claim
\begin{equation} P(S_J) = \sum\limits_{K: J\subset K \subset I} (-1)^{\vert K \setminus J \vert}P\Big(\bigcap_{k \in K} A_k\Big) \end{equation}
for $S_J = \bigcap\limits_{j \in J} A_j \cap \bigcap\limits_{j \in I\setminus J} A_j^c$ and some set $J \subset I$. For $J = \emptyset$ the set $S_{\emptyset} = \bigcap_{i \in I} A_i^c$ is the complement of $\bigcup_{i \in I} A_i$, so the original claim of the proposition is recovered by taking complements.
Following [1, p. 26] we first show that for all $K \subset I$ it holds that
\begin{equation} P\Big(\bigcap_{k \in K} A_k\Big) = \sum\limits_{J: K \subset J \subset I} P(S_J) \end{equation}
This identity follows at once as soon as we can establish the identity $\bigcap\limits_{k \in K} A_k = \dot{\bigcup}_{K \subset J \subset I} S_J$. There are at least two ways to show the disjointness of the sets $S_J$ appearing in the union on the right-hand side. A more abstract approach, for which I thank the math.SE network, is as follows.
For each element $\omega \in \Omega$ define the set $J_{\omega} = \{j \in I: \omega \in A_j\}$. Then obviously $\omega \in S_{J_{\omega}}$. Next, we show that there is no other set $L \neq J_{\omega}$ with $\omega \in S_{L}$; consequently, every $\omega$ belongs to exactly one of the sets $S_J$, so these sets are pairwise disjoint. Let $\omega \in S_L$ and suppose that there is an index $j \in L\setminus J_{\omega}$. Then $\omega \in A_j$ by the definition of $S_L$. On the other hand, $\omega$ would then be contained in $A_j^c$ since $j \not\in J_{\omega}$. Contradiction, hence there is no such index $j \in L\setminus J_{\omega}$. Conversely, suppose there is an index $j \in J_{\omega}\setminus L$. Then $\omega \in A_j$ because of the definition of $J_{\omega}$. But it must also hold that $\omega \in \Omega \setminus A_j$ due to $\omega \in S_L$. Contradiction again. In sum, it holds that $L = J_{\omega}$.
A more straightforward argument shows the disjointness directly. Let $J_1, J_2 \subset I$ with $J_1 \neq J_2$. Then either $J_1 \setminus J_2 \neq \emptyset$ or $J_2 \setminus J_1 \neq \emptyset$. In any case, there is an index $j^+ \in (I\setminus J_1) \cup (I \setminus J_2)$ such that $j^+ \in J_1$ or $j^+ \in J_2$. As a consequence, we arrive at
\begin{align}
S_{J_1} \cap S_{J_2} &= \Big(\bigcap_{j \in J_1} A_j \cap \bigcap_{j \in I\setminus J_1} A_j^{c}\Big) \cap \Big(\bigcap_{j \in J_2} A_j \cap \bigcap_{j \in I\setminus J_2} A_j^{c}\Big) \\
&= \bigcap_{j \in J_1 \cup J_2} A_j \cap \bigcap_{j \in (I\setminus J_1) \cup (I\setminus J_2)} A_j^{c} \\
&= \bigcap_{j \in J_1 \cup J_2} A_j \cap \bigcap_{j \in ((I\setminus J_1) \cup (I\setminus J_2))\setminus \{j^{+}\}} A_j^{c} \cap A_{j^{+}}^{c} \\
&\subseteq A_{j^{+}} \cap A_{j^{+}}^{c} \\
&= \emptyset
\end{align}
Next, we show the equality of the left- and right-hand side.
Let $\omega \in \bigcap_{k \in K} A_k$ with $K\subset I$. Then $K \subseteq J_{\omega}$, hence $\omega \in S_{J_{\omega}}$, and since $K \subset J_{\omega} \subset I$ we get $\omega \in \bigcup\limits_{J: K \subset J \subset I} S_J$. As a result, $\bigcap\limits_{k \in K} A_k \subset \bigcup\limits_{J: K \subset J \subset I} S_J$. Now let $\omega \in \bigcup\limits_{J: K \subset J \subset I} S_J$, say $\omega \in S_J$ for some $J$ with $K \subset J \subset I$. Then $\omega \in A_j$ for every $j \in J$, in particular for every $k \in K$, whence $\omega \in \bigcap\limits_{k \in K} A_k$.
Finally, we prove the general claim by inserting the identity that we just proved, replacing $P(\bigcap_{k \in K} A_k)$ with $\sum\limits_{L: K \subset L \subset I} P(S_L)$ and interchanging the order of summation over $J \subset K \subset L \subset I$:
\begin{equation} \sum\limits_{K: J \subset K \subset I} (-1)^{\vert K \setminus J \vert} P\Big(\bigcap_{k \in K} A_k\Big) = \sum\limits_{L: J \subset L \subset I} P(S_L) \sum\limits_{K: J \subset K \subset L} (-1)^{\vert K \setminus J \vert} \end{equation}
The inner sum collapses because
\begin{equation} \sum\limits_{K: J \subset K \subset L} (-1)^{\vert K \setminus J \vert} = \sum\limits_{k = 0}^{\vert L \setminus J \vert} \binom{\vert L \setminus J \vert}{k} (-1)^{k} = (1 + (-1))^{\vert L \setminus J \vert} = \begin{cases} 0 & J \neq L \\ 1 & J = L \end{cases} \end{equation}
so only the term $P(S_J)$ survives, which proves the claim. $\square$
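Proposition A can also be sanity-checked by brute force on a small uniform probability space; the following sketch (the example events are my own choice) compares both sides of the inclusion-exclusion formula:

```python
from itertools import combinations

# Uniform probability space Omega = {0, ..., 9} and three example events.
omega = set(range(10))
events = [{0, 1, 2, 3}, {2, 3, 4, 5}, {5, 6, 7}]

def p(s):
    return len(s) / len(omega)

# Left-hand side: probability of the union of all events.
lhs = p(set().union(*events))

# Right-hand side: alternating sum over all non-empty index subsets J.
rhs = sum(
    (-1) ** (i + 1)
    * sum(
        p(set.intersection(*(events[j] for j in J)))
        for J in combinations(range(len(events)), i)
    )
    for i in range(1, len(events) + 1)
)
print(lhs, rhs)  # both sides agree
```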
3. Solving the problem theoretically
3.1 The exact probability
Consider the event $\{A > a\}$ for some $a \in \mathbb{N}$: after $a$ draws, at least one of the $N$ card types has not been drawn yet. We may rewrite the event as $\{A > a\} = \{\exists i \in \{1, \ldots, N\} : S_a(i) = 0\}$. The probability of the latter event equals the probability of the union of the events $\{S_a(j) = 0\}$ for $j$ ranging from 1 to $N$. Intuitively this is clear: having at least one card missing means that at least one of the events "card $j$ has not been drawn" occurs. Hence, formally, we arrive at
\begin{equation} P(\{A > a\}) = P\Big(\{\exists i \in \{1, \ldots, N\} : S_a(i) = 0\}\Big) = P\Big(\bigcup\limits_{j = 1}^{N}\{S_a(j) = 0\}\Big) \end{equation}
Next, we use Proposition A (the inclusion-exclusion principle) to transform the union of events into sums of intersections of events.
\begin{equation} P(\{A > a \}) = P\Big(\bigcup\limits_{j = 1}^{N}\{S_a(j) = 0\}\Big) = \sum\limits_{i = 1}^{N} (-1)^{i + 1}\sum\limits_{J \subset \{1, \ldots, N\}, \vert J\vert = i} P\Big(\bigcap\limits_{j \in J} \{S_a(j) = 0\}\Big) \qquad \qquad (3) \end{equation}
Now observe that
\begin{equation} \bigcap\limits_{j \in J} \{S_a(j) = 0\} = \bigcap\limits_{l = 1}^{a}\{C_l \not\in J\} \end{equation}
For any element $j \in J$, it holds by the uniformity assumption that $P(\{C_l = j\}) = \frac{1}{N}$. Moreover, for $J \subset \{1, \ldots, N\}$ with $\vert J \vert = k$ it holds that $P(\{C_l \in J\}) = \frac{k}{N}$. Therefore,
\begin{equation} P(\{C_l \not\in J\}) = P(\{C_l \in J\}^c) = 1 - P(\{C_l \in J\}) = 1 - \frac{k}{N} \end{equation}
Because of the assumed independence of the draws, the formula further simplifies to
\begin{equation} P\Big(\{C_1 \not\in J\} \cap \ldots \cap \{C_a \not\in J\}\Big) = P\Big(\{C_1 \not\in J\}\Big) \cdot \ldots \cdot P\Big(\{C_a \not\in J\}\Big) = \Big(1 - \frac{k}{N}\Big)^a \end{equation}
Reinserting this in formula (3) gives
\begin{equation} P(\{A > a \}) = P\Big(\bigcup\limits_{j = 1}^{N}\{S_a(j) = 0\}\Big) = \sum\limits_{i = 1}^{N} (-1)^{i + 1}\sum\limits_{J \subset \{1, \ldots, N\}, \vert J\vert = i} \Big(1 - \frac{i}{N}\Big)^a \end{equation}
Since there are $\binom{N}{i}$ subsets of the numbers $1, \ldots, N$ with $i$ elements, we simplify further to
\begin{equation} P(\{A > a \}) = P\Big(\bigcup\limits_{j = 1}^{N}\{S_a(j) = 0\}\Big) = \sum\limits_{i = 1}^{N} (-1)^{i + 1} \binom{N}{i} \Big(1 - \frac{i}{N}\Big)^a \end{equation}
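The tail formula is easy to evaluate numerically; the sketch below (function name my own) includes a useful sanity check: with fewer than $N$ draws the collection cannot be complete, so $P(\{A > a\}) = 1$ whenever $a < N$.

```python
from math import comb

def p_tail(N, a):
    """P(A > a): after a draws, at least one of the N cards is missing."""
    return sum(
        (-1) ** (i + 1) * comb(N, i) * (1 - i / N) ** a
        for i in range(1, N + 1)
    )

print(p_tail(35, 34))   # 1.0 up to floating-point error, since a < N
print(p_tail(35, 300))  # small: 300 draws almost surely complete the album
```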
To obtain the final formula, first note that
\begin{equation} \Big(1 - \frac{i}{N}\Big)^{a - 1} - \Big(1 - \frac{i}{N}\Big)^{a} = \Big(1 - \frac{i}{N}\Big)^{a - 1}\Big(1 - (1 - \frac{i}{N})\Big) = \Big(1 - \frac{i}{N}\Big)^{a - 1}\frac{i}{N} \end{equation}
and
\begin{align}
P(\{A = a\}) &= P(\{A > a - 1\}) - P(\{A > a\}) \\
&= \sum\limits_{i = 1}^{N} (-1)^{i + 1}\frac{i}{N}\binom{N}{i}\Big(1 - \frac{i}{N}\Big)^{a - 1} \\
&= \sum\limits_{i = 1}^{N} (-1)^{i + 1}\binom{N - 1}{i - 1}\Big(1 - \frac{i}{N}\Big)^{a - 1}
\end{align}
where the last line uses the identity $\frac{i}{N}\binom{N}{i} = \binom{N - 1}{i - 1}$.
Hence, the final result is
\begin{equation} \boxed{P(\{A = a\}) = \sum\limits_{i = 1}^{N} (-1)^{i + 1}\binom{N - 1}{i - 1}\Big(1 - \frac{i}{N}\Big)^{a - 1}} \end{equation}
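As a consistency check, the probabilities given by the boxed formula should sum to one over all feasible $a \geq N$. A numerical sketch (the cutoff is my own choice, justified by the geometric decay of the tail):

```python
from math import comb

def p_exact(N, a):
    """P(A = a) according to the boxed formula."""
    return sum(
        (-1) ** (i + 1) * comb(N - 1, i - 1) * (1 - i / N) ** (a - 1)
        for i in range(1, N + 1)
    )

# A = a is only possible for a >= N; truncating at a = 3000 is harmless
# because the tail probabilities decay like (1 - 1/N)^a.
total = sum(p_exact(35, a) for a in range(35, 3000))
print(round(total, 6))  # close to 1.0
```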
3.2 The expected value of draws
From basic calculus it is known that for $x \in (-1, 1)$ the expression $\frac{1}{1 - x}$ has the convergent power series representation
\begin{equation} \frac{1}{1 - x} = \sum\limits_{k = 0}^{\infty} x^k \end{equation}
Differentiating both sides with respect to $x$ then yields
\begin{equation} \frac{1}{(1 - x)^2} = \sum\limits_{k = 1}^{\infty} kx^{k - 1} \end{equation}
This fact, together with the formula for $P(\{A = a\})$ derived above, allows us to derive the expected value as follows.
\begin{align}
\mathbb{E}[A] &= \sum\limits_{a = 1}^{\infty} aP(A = a) \\
&= \sum\limits_{a = 1}^{\infty} a \sum\limits_{i = 1}^{N} (-1)^{i + 1}\binom{N - 1}{i - 1}\Big(1 - \frac{i}{N}\Big)^{a - 1} \\
&= \sum\limits_{i = 1}^{N} (-1)^{i + 1}\binom{N - 1}{i - 1}\sum\limits_{a = 1}^{\infty} a\Big(1 - \frac{i}{N}\Big)^{a - 1} \\
&= \sum\limits_{i = 1}^{N} (-1)^{i + 1}\binom{N - 1}{i - 1}\frac{1}{(1 - (1 - \frac{i}{N}))^2} \\
&= N\sum\limits_{i = 1}^{N} (-1)^{i + 1}\binom{N}{i}\frac{1}{i} \\
&= N\sum\limits_{i = 1}^{N} \frac{1}{i}
\end{align}
The last step follows from the identity $\binom{N}{i} = \sum\limits_{j = i}^{N} \binom{j - 1}{i - 1}$, the relation $\frac{1}{i}\binom{j - 1}{i - 1} = \frac{1}{j}\binom{j}{i}$, an interchange of the summation order and the binomial theorem, which yields $\sum_{i = 1}^{j}(-1)^{i + 1}\binom{j}{i} = 1$ for every $j$. Consequently, the expected value is
\begin{equation} \boxed{\mathbb{E}[A] = N\sum\limits_{i = 1}^{N} \frac{1}{i}} \end{equation}
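The boxed formula can be evaluated exactly with rational arithmetic; a small sketch (function name my own):

```python
from fractions import Fraction

def expected_draws(N):
    """E[A] = N * H_N, computed exactly as a rational number."""
    harmonic = sum(Fraction(1, i) for i in range(1, N + 1))
    return N * harmonic

print(float(expected_draws(35)))  # approx. 145.14
```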
Finally, we can answer the initial question. For $N = 35$ cards, it would take on average
\begin{equation}
\mathbb{E}[A] = 35\sum\limits_{i = 1}^{35} \frac{1}{i} = 35\cdot \frac{54,437,269,998,109}{13,127,595,717,600} \approx 145
\end{equation}
cards to complete the album. This amounts to $\frac{145}{2}\cdot 10 = 725$ Euro worth of groceries. A family spending 50 Euro per week on groceries would therefore need about 14.5 weeks, i.e. a bit over three months, on average to collect the profiles of all German football players.
4. Solving the problem computationally in Python
The following Python code simulates repeated drawing of cards and hence delivers an empirical solution of the problem.
import random

import numpy

random.seed(a=12345, version=2)

all_lengths = []
for _ in range(10000):                               # number of simulated collections
    all_numbers = []                                 # cards drawn in the current run
    while len(numpy.unique(all_numbers)) < 35:       # not all 35 cards collected yet
        all_numbers.append(random.randint(1, 35))    # draw one uniform card
    all_lengths.append(len(all_numbers))             # number of draws for this run

print(sum(all_lengths) / len(all_lengths))           # empirical mean of A

Running the above code on my machine gives a value of 144.9484 in a couple of seconds, which is impressively close to the theoretical value.
5. Epilogue
There are other approaches to the coupon collector problem. For instance,
(i). the resulting distribution can be shown to be a convolution of geometric distributions (cf. [2]).
(ii). the derivation based on Stirling numbers of the second kind $S(n, k)$ has a more combinatorial flavor. These numbers can be derived similarly with the help of the inclusion-exclusion principle and count the number of onto (surjective) functions from an $n$-element set to a $k$-element set. Here, $n$ can be interpreted as the number of draws and $k$ as the number of distinct cards (or coupons).
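Approach (i) also yields the expectation directly: once $k$ distinct cards are collected, the waiting time for the next new card is geometric with success probability $\frac{N - k}{N}$ and mean $\frac{N}{N - k}$; summing over $k = 0, \ldots, N - 1$ recovers $N \sum_{i = 1}^{N} \frac{1}{i}$ by linearity of expectation. A quick numerical cross-check of the two expressions:

```python
N = 35

# Sum of expected waiting times of the geometric phases, where k denotes
# the number of distinct cards already collected.
e_geometric = sum(N / (N - k) for k in range(N))

# The closed form N * H_N derived in the main text.
e_harmonic = N * sum(1 / i for i in range(1, N + 1))

print(e_geometric, e_harmonic)  # both approx. 145.14
```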
Finally, the mathematics gets a lot more difficult when giving up assumption (b), that is, the uniformity assumption. If some player cards are rarer than others (e.g. those of the most popular players), then no closed-form solution built up of elementary functions is known.
References
[1] H.-O. Georgii (2014): Stochastik - Eine Einführung. Spektrum.
[2] Peter Ruckdeschel (2011): Das Coupon-Collector Problem und warum und in welcher Hinsicht das letzte fehlende Panini-Bild am “teuersten” ist.