The Classic Coupon Collector's Problem

0. Introduction

At the moment, a large German grocery store is handing out 35 collector cards featuring the faces of the German national football team, including staff.
For every 10 Euro worth of groceries you receive two separately packaged cards. The cards you receive are random, so there is no way to specifically pick cards that are still missing from the collection. All collected cards are then stored in a corresponding booklet.

A classic question in probability theory asks how many cards you need to receive (or, in other words, how much money you need to spend) on average in order to obtain all cards and hence complete the collection. In the following, I will discuss that question in detail, drawing mainly on [1] and [2].

1. Preliminaries

1.1 Assumptions and Notation

In order to make the model mathematically tractable, we need to introduce assumptions on the problem structure. Assumptions (a) and (b) are necessary, as otherwise the mathematics becomes considerably more complicated.

(a). We assume that no trading of duplicates is possible.

(b). Every card is equally likely to be drawn. Also, all draws are independent of one another.

(c). $N$ denotes the total number of cards available. $C_j$ denotes the (random) card (or player) that is drawn at step $j$, i.e. $C_j$ is a random variable.

(d). $S_m(i)$ denotes the number of cards of player $i$ after $m$ draws.

(e). Define $A = \inf\{a \in \mathbb{N} : S_a(i) > 0 \;\; \forall\, i \in \{1, \ldots, N\}\}$. $A$ is then the smallest number of draws after which all distinct cards have been drawn at least once (a small simulation sketch of this stopping time follows below).
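
To make the notation concrete, here is a minimal Python sketch (the helper name completion_time and the parameter choices are mine, not part of the original problem statement) that computes the realization of $A$ for one simulated sequence of uniform draws, with counts[i] playing the role of $S_a(i)$.

import random

def completion_time(draw_sequence, N):
    # A = smallest a such that S_a(i) > 0 for all i in {1, ..., N};
    # counts[i] corresponds to S_a(i) after a draws.
    counts = {i: 0 for i in range(1, N + 1)}
    missing = N
    for a, card in enumerate(draw_sequence, start=1):
        if counts[card] == 0:
            missing -= 1
        counts[card] += 1
        if missing == 0:
            return a
    return None  # the (finite) sequence never covered all N cards

# One simulated realization of A for N = 35 uniform, independent draws:
random.seed(1)
draws = (random.randint(1, 35) for _ in range(10**6))
print(completion_time(draws, 35))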

1.2 An auxiliary lemma

The general approach to solving the problem follows [2]:

We want to compute the probability that there is at least one card not drawn yet by converting it into an alternating sum over events in which a fixed set of $k$ cards, with $k > 0$ some positive integer, has not been drawn. To accomplish this, we use the inclusion-exclusion principle.

Proposition A (Inclusion-Exclusion Principle). For a probability space $(\Omega, \mathcal{A}, P)$ and sets $A_i \in \mathcal{A}$ for $i \in I = \{1, \ldots, n\}$, it holds that

$$P\Big(\bigcup_{i=1}^{n} A_i\Big) = \sum_{i=1}^{n} (-1)^{i+1} \sum_{J \subseteq \{1, \ldots, n\},\, |J| = i} P\Big(\bigcap_{j \in J} A_j\Big)$$
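
Before turning to the proof, a quick brute-force check of the formula may be helpful. The toy probability space and the three events below are arbitrary choices of mine; on a finite uniform space both sides of the proposition can be computed directly.

from itertools import combinations

# Toy probability space: Omega = {0, ..., 9} with the uniform measure,
# and three arbitrary events A_1, A_2, A_3.
Omega = set(range(10))
events = [{0, 1, 2, 3}, {2, 3, 4, 5, 6}, {5, 6, 7}]
P = lambda E: len(E) / len(Omega)

n = len(events)
lhs = P(set().union(*events))                       # P(A_1 u A_2 u A_3)
rhs = sum(
    (-1) ** (i + 1)
    * sum(P(set.intersection(*(events[j] for j in J)))
          for J in combinations(range(n), i))
    for i in range(1, n + 1)
)
print(lhs, rhs)   # both evaluate to 0.8 for this choice of events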

Proof.

The following proof is the one I wrote up in the course of solving Georgii’s chapter one exercises and treats the more general claim

$$P(S_J) = \sum_{K:\, J \subseteq K \subseteq I} (-1)^{|K \setminus J|}\, P\Big(\bigcap_{k \in K} A_k\Big)$$

for $S_J = \bigcap_{j \in J} A_j \cap \bigcap_{j \in I \setminus J} A_j^c$ and some set $J \subseteq I$. For $J = \emptyset$ we recover the original claim of the proposition.

Following [1, p. 26], we first show that for all $K \subseteq I$ it holds that

$$P\Big(\bigcap_{k \in K} A_k\Big) = \sum_{J:\, K \subseteq J \subseteq I} P(S_J)$$

This identity follows at once as soon as we can establish the identity $\bigcap_{k \in K} A_k = \dot{\bigcup}_{K \subseteq J \subseteq I} S_J$. There are at least two ways to show the disjointness of the union on the right-hand side. A more abstract approach, for which I thank the math.SE community, is as follows.

For each element $\omega \in \Omega$ define the set $J_{\omega} = \{j \in I : \omega \in A_j\}$. Then obviously $\omega \in S_{J_{\omega}}$. Next, we show that there is no other set $L \neq J_{\omega}$ such that simultaneously $\omega \in S_L$ and $\omega \in S_{J_{\omega}}$, which would imply $S_L \cap S_{J_{\omega}} \neq \emptyset$. Therefore, every $\omega$ belongs to exactly one set $S_{J_{\omega}}$, and the sets $S_J$ are disjoint for different $J$. Let $\omega \in S_L$ and $\omega \in S_{J_{\omega}}$, and suppose that there is an index $j \in L \setminus J_{\omega}$. Then $S_L \subseteq A_j$, since $A_j$ is one of the sets intersected in the definition of $S_L$. On the other hand, $\omega$ would then be contained in $A_j^c$ by the definition of $S_{J_{\omega}}$, because $j \notin J_{\omega}$. Contradiction, hence there is no such index $j \in L \setminus J_{\omega}$. Conversely, suppose there is an index $j \in J_{\omega} \setminus L$. Then $\omega \in A_j$ because of the definition of $J_{\omega}$. But then it must also hold that $\omega \in \Omega \setminus A_j$ due to $\omega \in S_L$. Contradiction again. In sum, it holds that $L = J_{\omega}$.

A more straightforward argument shows the disjointness directly. Let $J_1, J_2 \subseteq I$ with $J_1 \neq J_2$. Then either $J_1 \setminus J_2 \neq \emptyset$ or the other way around. In any case, there is an index $j^+$ that belongs to one of the two sets but not to the other; without loss of generality, say $j^+ \in J_1$ and $j^+ \in I \setminus J_2$. As a consequence, we arrive at

$$\begin{aligned}
S_{J_1} \cap S_{J_2} &= \Big(\bigcap_{j \in J_1} A_j \cap \bigcap_{j \in I \setminus J_1} A_j^c\Big) \cap \Big(\bigcap_{j \in J_2} A_j \cap \bigcap_{j \in I \setminus J_2} A_j^c\Big) \\
&= \bigcap_{j \in J_1 \cup J_2} A_j \cap \bigcap_{j \in (I \setminus J_1) \cup (I \setminus J_2)} A_j^c \\
&\subseteq A_{j^+} \cap A_{j^+}^c = \emptyset
\end{aligned}$$

Next, we show the equality of the left- and right-hand sides.

Let $\omega \in \bigcap_{k \in K} A_k$ with $K \subseteq I$. Then $K \subseteq J_{\omega}$, hence $\omega \in S_{J_{\omega}}$, and since $K \subseteq J_{\omega} \subseteq I$, the set $J_{\omega}$ is one of the index sets in the union, so $\omega \in \bigcup_{J:\, K \subseteq J \subseteq I} S_J$. As a result, we have shown that $\bigcap_{k \in K} A_k \subseteq \bigcup_{J:\, K \subseteq J \subseteq I} S_J$. Now let $\omega \in \bigcup_{J:\, K \subseteq J \subseteq I} S_J$. Then every $k \in K$ must be contained in $J_{\omega}$: otherwise $\omega \notin S_J$ for all $J$ with $K \subseteq J$, and then $\omega \notin \bigcup_{J:\, K \subseteq J \subseteq I} S_J$. Consequently, $K \subseteq J_{\omega}$. But then $\omega \in A_k$ for all $k \in K$, whence $\omega \in \bigcap_{k \in K} A_k$.

Finally, we prove the general claim by inserting the identity that we just proved, i.e. by replacing $P(\bigcap_{k \in K} A_k)$ with $\sum_{L:\, K \subseteq L \subseteq I} P(S_L)$, and then interchanging the order of summation:

$$\sum_{K:\, J \subseteq K \subseteq I} (-1)^{|K \setminus J|}\, P\Big(\bigcap_{k \in K} A_k\Big) = \sum_{L:\, J \subseteq L \subseteq I} P(S_L) \sum_{K:\, J \subseteq K \subseteq L} (-1)^{|K \setminus J|} = P(S_J)$$

The last equality is due to the following calculation:

$$\sum_{K:\, J \subseteq K \subseteq L} (-1)^{|K \setminus J|} = \sum_{k=0}^{|L \setminus J|} \binom{|L \setminus J|}{k} (-1)^{k} = (1 + (-1))^{|L \setminus J|} = \begin{cases} 0 & J \subsetneq L \\ 1 & J = L \end{cases}$$
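
The partition identity $\bigcap_{k \in K} A_k = \dot{\bigcup}_{K \subseteq J \subseteq I} S_J$ at the heart of this proof can also be checked mechanically on a small example. The following sketch (the toy space and events are again arbitrary choices of mine) enumerates all sets $S_J$ and verifies both their pairwise disjointness and the union identity for every $K$.

from itertools import combinations

Omega = set(range(10))
A = {1: {0, 1, 2, 3}, 2: {2, 3, 4, 5, 6}, 3: {5, 6, 7}}   # arbitrary toy events
I = set(A)

def S(J):
    # S_J = intersection of the A_j for j in J and of the complements for j in I \ J
    out = set(Omega)
    for j in I:
        out &= A[j] if j in J else (Omega - A[j])
    return out

subsets = [frozenset(c) for r in range(len(I) + 1) for c in combinations(I, r)]

# Pairwise disjointness of the S_J:
assert all(not (S(J1) & S(J2)) for J1 in subsets for J2 in subsets if J1 != J2)

# For every K, the intersection of the A_k equals the union of the S_J with K subset of J:
for K in subsets:
    inter = set(Omega)
    for k in K:
        inter &= A[k]
    union = set().union(*(S(J) for J in subsets if K <= J))
    assert inter == union

print("partition identity verified on the toy example")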

3. Solving the problem theoretically

3.1 The exact probability

Consider the event $\{A > a\}$ for some $a \in \mathbb{N}$. In other words, after $a$ draws there is at least one type of card among the $N$ cards that has not been drawn yet. We may rewrite the event as $\{A > a\} = \{\exists\, i \in \{1, \ldots, N\} : S_a(i) = 0\}$. The probability of the latter event, however, is equal to the probability of the union of the atomic events $\{S_a(j) = 0\}$ for $j$ ranging from 1 to $N$. Intuitively, this is clear: having at least one, that is one or more, of the $N$ cards not drawn means that at least one of the events "card $j$ has not been drawn" occurs. Hence, formally, we arrive at

$$P(\{A > a\}) = P(\{\exists\, i \in \{1, \ldots, N\} : S_a(i) = 0\}) = P\Big(\bigcup_{j=1}^{N} \{S_a(j) = 0\}\Big)$$

Next, we use Proposition A (the inclusion-exclusion principle) to transform the union of events into a sum over intersections of events.

$$P(\{A > a\}) = P\Big(\bigcup_{j=1}^{N} \{S_a(j) = 0\}\Big) = \sum_{i=1}^{N} (-1)^{i+1} \sum_{J \subseteq \{1, \ldots, N\},\, |J| = i} P\Big(\bigcap_{j \in J} \{S_a(j) = 0\}\Big) \qquad (3)$$

Now observe that

$$\bigcap_{j \in J} \{S_a(j) = 0\} = \bigcap_{l=1}^{a} \{C_l \notin J\}$$

For any element $j \in J$, it holds due to the uniformity assumption that $P(\{C_l = j\}) = \frac{1}{N}$. Moreover, for $J \subseteq \{1, \ldots, N\}$ with $|J| = k$ it holds that $P(\{C_l \in J\}) = \frac{k}{N}$. Therefore,

$$P(\{C_l \notin J\}) = P(\{C_l \in J\}^c) = 1 - P(\{C_l \in J\}) = 1 - \frac{k}{N}$$

Because of the assumed independence of the draws, the formula further simplifies to

$$P(\{C_1 \notin J\} \cap \ldots \cap \{C_a \notin J\}) = P(\{C_1 \notin J\}) \cdots P(\{C_a \notin J\}) = \Big(1 - \frac{k}{N}\Big)^{a}$$
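
As a quick empirical sanity check of this product formula (the parameter values and the choice of $J$ below are arbitrary), one can estimate the probability that $a$ independent uniform draws all avoid a fixed set $J$ of size $k$ and compare the estimate with $(1 - k/N)^a$.

import random

N, k, a, trials = 35, 5, 20, 200_000
J = set(range(1, k + 1))          # any fixed set of k cards
random.seed(0)

hits = sum(
    all(random.randint(1, N) not in J for _ in range(a))
    for _ in range(trials)
)
print(hits / trials, (1 - k / N) ** a)   # the two numbers should be close (about 0.046)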

Reinserting $\big(1 - \frac{k}{N}\big)^a$ into formula (3) and noting that $k = |J| = i$ gives

$$P(\{A > a\}) = \sum_{i=1}^{N} (-1)^{i+1} \sum_{J \subseteq \{1, \ldots, N\},\, |J| = i} \Big(1 - \frac{i}{N}\Big)^{a}$$

Since there are $\binom{N}{i}$ subsets of $\{1, \ldots, N\}$ with $i$ elements, we simplify further to

$$P(\{A > a\}) = \sum_{i=1}^{N} (-1)^{i+1} \binom{N}{i} \Big(1 - \frac{i}{N}\Big)^{a}$$

To obtain the final formula, first note that

$$\Big(1 - \frac{i}{N}\Big)^{a-1} - \Big(1 - \frac{i}{N}\Big)^{a} = \Big(1 - \frac{i}{N}\Big)^{a-1} \Big(1 - \Big(1 - \frac{i}{N}\Big)\Big) = \Big(1 - \frac{i}{N}\Big)^{a-1} \frac{i}{N}$$

and

$$P(\{A = a\}) = P(\{A > a - 1\}) - P(\{A > a\}) = \sum_{i=1}^{N} (-1)^{i+1} \frac{i}{N} \binom{N}{i} \Big(1 - \frac{i}{N}\Big)^{a-1} = \sum_{i=1}^{N} (-1)^{i+1} \binom{N-1}{i-1} \Big(1 - \frac{i}{N}\Big)^{a-1}$$

Hence, the final result is

$$P(\{A = a\}) = \sum_{i=1}^{N} (-1)^{i+1} \binom{N-1}{i-1} \Big(1 - \frac{i}{N}\Big)^{a-1}$$
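
The final formula is straightforward to evaluate numerically. A minimal sketch (the function name and the truncation point are my choices) that implements $P(\{A = a\})$ for $N = 35$ and checks that the probabilities sum to approximately one and that the truncated mean is roughly 145:

from math import comb

def p_A_equals(a, N=35):
    # P(A = a) = sum_{i=1}^{N} (-1)^(i+1) * C(N-1, i-1) * (1 - i/N)^(a-1)
    return sum((-1) ** (i + 1) * comb(N - 1, i - 1) * (1 - i / N) ** (a - 1)
               for i in range(1, N + 1))

N = 35
support = range(N, 3000)              # P(A = a) = 0 for a < N; the tail beyond 3000 is negligible
probs = [p_A_equals(a, N) for a in support]
print(sum(probs))                                    # close to 1
print(sum(a * p for a, p in zip(support, probs)))    # close to the expected value derived below (about 145)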

3.2 The expected value of draws

From basic calculus it is known that for $x \in (-1, 1)$ the expression $\frac{1}{1-x}$ has the convergent power series representation

$$\frac{1}{1-x} = \sum_{k=0}^{\infty} x^{k}$$

Differentiating both sides with respect to $x$ then yields

$$\frac{1}{(1-x)^{2}} = \sum_{k=0}^{\infty} k\, x^{k-1}$$
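
A quick numerical illustration of this identity (the evaluation point $x = 1 - \frac{5}{35}$ and the truncation are arbitrary choices of mine), since exactly these values $x = 1 - \frac{i}{N}$ appear below:

x = 1 - 5 / 35                                     # e.g. 1 - i/N for i = 5, N = 35
partial = sum(k * x ** (k - 1) for k in range(1, 5000))
print(partial, 1 / (1 - x) ** 2)                   # both approximately 49.0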

This power series fact, together with the exact distribution of $A$ derived in Section 3.1, then allows us to derive the expected value as follows.

$$\begin{aligned}
E[A] &= \sum_{a=1}^{\infty} a\, P(A = a) = \sum_{a=1}^{\infty} a \sum_{i=1}^{N} (-1)^{i+1} \binom{N-1}{i-1} \Big(1 - \frac{i}{N}\Big)^{a-1} \\
&= \sum_{i=1}^{N} (-1)^{i+1} \binom{N-1}{i-1} \sum_{a=1}^{\infty} a \Big(1 - \frac{i}{N}\Big)^{a-1} \\
&= \sum_{i=1}^{N} (-1)^{i+1} \binom{N-1}{i-1} \frac{1}{\big(1 - (1 - \frac{i}{N})\big)^{2}} \\
&= N \sum_{i=1}^{N} (-1)^{i+1} \binom{N}{i} \frac{1}{i} = N \sum_{i=1}^{N} \frac{1}{i}
\end{aligned}$$

The last step follows from the identity $\binom{N}{i} = \sum_{j=i}^{N} \binom{j-1}{i-1}$, a change of the order of summation, and an application of the binomial theorem. Consequently, the expected value is

$$E[A] = N \sum_{i=1}^{N} \frac{1}{i}$$

Finally, we can answer the initial question. For $N = 35$ cards, it would take on average

$$E[A] = 35 \sum_{i=1}^{35} \frac{1}{i} = 35 \cdot \frac{54{,}437{,}269{,}998{,}109}{13{,}127{,}595{,}717{,}600} \approx 145$$

cards to complete the album. Since you receive two cards for every 10 Euro of groceries, this corresponds to $\frac{145}{2} \cdot 10 \approx 725$ Euro worth of groceries. A family spending 50 Euro per week on groceries would therefore need about 14.5 weeks, i.e. a little over three months on average, to collect the profiles of all German football players.
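
The harmonic sum can also be evaluated with exact rational arithmetic; a minimal sketch using Python's fractions module:

from fractions import Fraction

N = 35
H_N = sum(Fraction(1, i) for i in range(1, N + 1))   # the harmonic number H_35 as an exact fraction
print(N * H_N)                                       # exact value of E[A]
print(float(N * H_N))                                # approximately 145.1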

4. Solving the problem computationally in Python

The following Python code simulates repeated drawing of cards and hence delivers an empirical solution of the problem.

import numpy
import random

i = 0                     # number of completed simulation runs
All_lengths = []          # completion times (realizations of A), one per run
random.seed(a=12345, version=2)

while i < 10000:
    All_numbers = []      # all cards drawn in the current run
    LEN = len(numpy.unique(All_numbers))
    while LEN < 35:       # keep drawing until all 35 distinct cards have appeared
        A = random.randint(1, 35)          # draw one card uniformly at random
        All_numbers.append(A)
        LEN = len(numpy.unique(All_numbers))
    All_lengths.append(len(All_numbers))   # record how many draws this run needed
    i = i + 1

print(sum(All_lengths)/len(All_lengths))   # empirical mean of A over 10000 runs

Running the above code on my machine gives a value of 144.9484 in a couple of seconds, which is impressively close to the theoretical value.
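
For larger simulations, a variant that keeps the cards seen so far in a set avoids recomputing numpy.unique after every single draw. The following sketch is my own rewrite, not the original code; since it consumes the random numbers in the same order, it should produce the same estimate for the same seed.

import random

random.seed(a=12345, version=2)
lengths = []                          # completion times, one per simulated album
for _ in range(10000):
    seen = set()                      # distinct cards collected so far
    draws = 0
    while len(seen) < 35:
        seen.add(random.randint(1, 35))
        draws += 1
    lengths.append(draws)
print(sum(lengths) / len(lengths))    # empirical mean of A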

5. Epilogue

There are other approaches to the coupon collector problem. For instance,

(i). The resulting distribution can be shown to be a convolution of geometric distributions (cf. [2]); see the short derivation below.

(ii). The derivation based on Stirling numbers of the second kind $S(n, k)$ gives a more combinatorial flavor. These numbers can be derived similarly with the help of the inclusion-exclusion principle and give the number of onto functions from an $n$-element set to a $k$-element set. Here, $n$ can be interpreted as the number of draws and $k$ as the number of distinct cards (or coupons).
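
For completeness, approach (i) also yields the expected value almost immediately: once $k$ distinct cards have been collected, the waiting time until a new card appears is geometrically distributed with success probability $\frac{N-k}{N}$, hence

$$E[A] = \sum_{k=0}^{N-1} \frac{N}{N-k} = N \sum_{i=1}^{N} \frac{1}{i}$$

which agrees with the result of Section 3.2.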

Finally, the mathematics gets a lot more difficult when giving up assumption (b), that is, the uniformity assumption. If some player cards are rarer than others (e.g. those of the most popular players), then there is no known closed-form solution built up from elementary functions.
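
Even without a closed form, the non-uniform case is easy to explore by simulation. The following sketch (the weight vector is a purely illustrative choice of mine, not data about the actual cards) estimates the expected completion time when a few cards are rarer than the rest, using random.choices with weights:

import random

random.seed(0)
N = 35
weights = [0.2 if i < 5 else 1.0 for i in range(N)]   # pretend the first 5 cards are rare
cards = list(range(N))

def weighted_completion_time():
    seen = set()
    draws = 0
    while len(seen) < N:
        seen.add(random.choices(cards, weights=weights, k=1)[0])
        draws += 1
    return draws

runs = 5000
print(sum(weighted_completion_time() for _ in range(runs)) / runs)   # noticeably larger than the ~145 of the uniform case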

References

[1] H.-O. Georgii (2014): Stochastik - Eine Einführung. Spektrum.

[2] P. Ruckdeschel (2011): Das Coupon-Collector Problem und warum und in welcher Hinsicht das letzte fehlende Panini-Bild am "teuersten" ist.