How would you estimate the number of seagulls that live in Stockholm in the summer? Because of the difficulty of conducting a formal seagull census, techniques such as mark-recapture experiments can be used to estimate the population by marking random samples.

Estimating the number of seagulls in Stockholm using a set of mark-recapture experiments could work like this: scientists start by “capturing” a random sample of seagulls, “marking” the seagulls with metal bird-anklets, and releasing the seagulls. Later, the scientists capture another random sample of seagulls and count the number of “recaptured” seagulls that already have bird-anklets. From the information gathered from that process alone, scientists can estimate the seagull population!

In this post, I’ll go through the math behind one method to estimate the population from mark-capture experiments. The math required is a nice reminder of statistics, with appearances by Bayes’ theorem and the Binomial distribution. This post will focus on the math, and I’ll gloss over many important and interesting questions, such as how to take a representative sample of things that move around, lose their mark, migrate, reproduce, and die.

Capture, Mark, Release, repeat!

Before looking at the math, I’ll illustrate the example above. The process is:

  • Capture a random sample
    • Note how many were captured and how many were recaptured (i.e. have a mark)
    • Update the population estimate
  • Mark the sample
  • Release the sample
  • Repeat!

Below is an illustration. Click on the box below to go to the next step. The large circles are captured things. Light beige circles represent those we have not seen before. Dark blue are those that are recaptured.

Math

One method to do population estimation using mark-recapture experiments is explained in the 1986 paper “Population Estimation from Mark-Recapture Experiments using a Sequential Bayes Algorithm” by W. J. Gazey and M. J. Stanley.

The method produces a discrete probability distribution over a fixed set of possible population sizes \(N\). What this means is that I need to provide a guess of the range of the population and choose the granularity of estimates. For example, if I think the population is between 20 and 1000 and I need the granularity of 20, this method can give a distribution for the population \(N = [20, 40, …, 980, 1000]\). It will give me values such as \( P(N_i = 100) = .32 \), which says that there is a 32% probability of the population being is 100 as opposed to the other possible values.

In addition, this method incorporates the results of multiple mark-recapture experiments. I’ll annotate the experiments with the superscript \((t)\), as in \(C^{(t)}\).

To update \(P(N)\), the method starts with \(P(R^{(t)} \mid N_i)\), or the probability that we would recapture in experiments \(t\) a total of \(R^{(t)}\) things given the true population \(N_i\). \(P(R^{(t)} \mid N_i)\) is a function of how many things are already marked \(M_t\) (which we know), how many we captured \(C_t\) (which we know), and the population \(N_i\) (which I’ll get back to). Remembering combinatorics, we get

\( P(R^{(t)} \mid N_i) = \binom{C^{(t)}}{R^{(t)}} \left( \frac{M^{(t)}}{N_i} \right)^{R^{(t)}} \left( 1 - \frac{M^{(t)}}{N_i} \right)^{C^{(t)} - R^{(t)}} \)

Equation 1c from the Gazey and Stanley paper.

However, what we really want is \(P(N | \mathcal{D} )\), or the distribution over the population given what we know about the values of \(M\), \(C\), and \(R\) from each experiment. To do that, we can apply Bayes’ Theorem:

\( P(N_i^{(t)}|\mathcal{D}^{(t)}) = \frac{ P(N_i^{(t - 1)}) P(R^{(t)} \mid N_i) }{ \sum_{i} P(N_i^{(t - 1)}) P(R^{(t)} \mid N_i) } \)

Each experiments \(t\), we’ll use the current estimation \(P(N^{t-1})\) from the previous experiment. The equation requires a prior over the population \(P(N^{(t=0)})\), for which we’ll use a uniform distribution over the population range. To turn this into an estimated population range, I try to find the 95% highest posterior density interval of \(P(N)\).

There’s also a problem where if \(N_i < M\) we get undefined values. To hack around this, I zero out those probabilities before computing the updated distribution.

How the distribution changes by experiment

And that’s it! Below is a simulation of the distribution \(P(N)\) updating with each new capture-mark-release study.

The animation has three components: the numbers being updated, the current distribution \(P(N)\) given the data we know, and what it would look like if we could see the entire population. After slowly walking through a few steps, the animation will speed up and go through around 20 mark-recapture experiments.

 

In code

The simulation for the animation was generated using this code. When I don’t need to illustrate the values, I can try more realistic populations. For example:

true_N = 8000
fs = FieldStudy(true_N, 1000, 100000, 100)

for _ in range(21):
    fs.sample(difficulty_of_catch=0.005)

print("Mode", fs.N[np.argmax(fs.P_k)])

In this example, the true population is \( 8000 \). I start with a guess that the population is between \( 1000 \) and \( 100000 \). After 20 experiments (excluding the initial marking experiment), this gives me a mode of \( 8000 \) and a range of \(7000 - 12000\). Not bad!

difficulty_of_catch is the probability of capturing each thing. I needed that for the demo and probably wouldn’t know that value if I did a real experiment.

  • Gazey, W. J., and M. J. Staley. “Population estimation from mark‐recapture experiments using a sequential Bayes algorithm.” Ecology 67.4 (1986): 941-951.
  • Code that produces simulations