Common Distributions
There are several specific probability distributions which arise commonly in real-world applications. Learning the properties of these distributions, and how those properties lead to their frequent appearance, can help us build random models and reason about properties of random systems that involve these distributions. In this section, we will explore several such distributions.
Bernoulli distribution
Suppose we conduct an experiment with exactly two outcomes, which we will encode as 0 and 1. For example, consider the following scenarios:
- You flip a coin and it comes up heads (1) or tails (0)
- Someone's position on a political issue is either positive (1) or negative (0)
- Someone can either be healthy (1) or sick (0)
- In an online survey, a user answers either true (1) or false (0)
The distribution of the result of such an experiment is governed by a single parameter $p \in [0,1]$, which is the probability of the outcome encoded as 1. The probability of the other outcome is $1-p$, since one of the two outcomes must occur. It is customary to think of the outcomes 1 and 0 as success and failure, respectively, in which case $p$ may be referred to as the success probability. A sequence of independent Bernoulli random variables with the same success probability $p$ is referred to as a sequence of Bernoulli trials.
We write $X \sim \operatorname{Bernoulli}(p)$ to mean that $X$ is Bernoulli distributed with success probability $p$. The expected value of a $\operatorname{Bernoulli}(p)$ random variable $X$ is
$$E[X] = (1)(p) + (0)(1-p) = p,$$
and its variance is
$$E[X^2] - E[X]^2 = p - p^2 = p(1-p).$$
Exercise
Consider a sum $S$ of 10 independent Bernoulli random variables with success probability $p = 0.36$.
- Find the mean and variance of $S$.
- Find the value of $k \in \{0, 1, \ldots, 10\}$ which maximizes $P(S = k)$. Hint: write down an expression for $P(S = k)$ and then use Julia to find its maximum value.
Solution.
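- For $i \in \{1, 2, \ldots, 10\}$ let $X_i \sim \operatorname{Bernoulli}(0.36)$, so that $S = X_1 + X_2 + \cdots + X_{10}$. Therefore, by linearity of expectation,
$$E[S] = \sum_{i=1}^{10} E[X_i] = 10 \cdot 0.36 = 3.6,$$
and, since the $X_i$ are independent,
$$\operatorname{Var}(S) = \sum_{i=1}^{10} \operatorname{Var}(X_i) = 10 \cdot 0.36 \cdot (1 - 0.36) = 2.304.$$
- Observe that for $k \in \{0, 1, \ldots, 10\}$ we have $S = k$ if and only if there are exactly $k$ successes. Now, there are $\binom{10}{k}$ ways in which we can have $k$ successes, and each of them occurs with probability $(0.36)^k (1 - 0.36)^{10-k}$, by independence. Therefore
$$P(S = k) = \binom{10}{k} (0.36)^k (1 - 0.36)^{10 - k}.$$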
We can use Julia's built-in binomial function and an array comprehension as follows
maximum([binomial(10, k)*0.36^k*(1 - 0.36)^(10 - k) for k in 0:10])
to find that the value of $k$ that maximizes $P(S = k)$ is $3$ and the maximum is approximately $0.246$.
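Note that `maximum` returns only the largest probability, not the value of $k$ at which it occurs. A minimal sketch that recovers both, assuming the same success probability 0.36, uses Julia's `findmax`. The names `masses`, `maxmass`, `idx`, and `kmax` are illustrative and not part of the original solution.

```julia
# Probability that S = k for each k in 0:10, with success probability 0.36
masses = [binomial(10, k) * 0.36^k * (1 - 0.36)^(10 - k) for k in 0:10]

# findmax returns the largest entry together with its (1-based) index
maxmass, idx = findmax(masses)

# The array starts at k = 0, so shift the index down by one
kmax = idx - 1

(kmax, maxmass)
```

The index shift is needed because Julia arrays are 1-based while $k$ starts at 0.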
The binomial distribution
Example
What is the probability of rolling exactly 18 sixes in 100 independent rolls of a fair die?
Solution. There are many ways to roll 18 sixes. We could roll 18 sixes followed by 82 non-sixes, and that happens with probability
$$\left(\frac{1}{6}\right)^{18}\left(\frac{5}{6}\right)^{82}$$
by independence. Similarly, the sequence of 2 non-sixes, then 9 sixes, then 14 non-sixes, then 9 more sixes, and finally 66 non-sixes also has probability $(1/6)^{18}(5/6)^{82}$. In fact, for every choice of 18 positions in which the sixes may fall, there is an outcome with exactly 18 sixes whose probability is $(1/6)^{18}(5/6)^{82}$. Since there are $\binom{100}{18}$ of these outcomes, the probability that one of them occurs is
$$\binom{100}{18}\left(\frac{1}{6}\right)^{18}\left(\frac{5}{6}\right)^{82}.$$
Generally, $n$ independent trials with success probability $p$ will lead to $k$ total successes with probability
$$\binom{n}{k} p^k (1-p)^{n-k}.$$
This distribution is called the binomial distribution and is denoted $\operatorname{Binomial}(n, p)$.
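The die-rolling probability can be evaluated numerically. The following is a small sketch rather than part of the original example; it uses arbitrary-precision arithmetic because $\binom{100}{18}$ is too large for a 64-bit integer. The variable names are illustrative.

```julia
# binomial(100, 18) overflows Int64, so work with BigInt and BigFloat
n, k = big(100), 18
p = big(1) / 6

# Number of placements of the sixes times the probability of each placement
prob = binomial(n, k) * p^k * (1 - p)^(100 - k)

Float64(prob)
```

Converting to `Float64` at the end is only for readability; the product itself is computed in extended precision.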
Exercise
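Stirling's approximation allows us to more easily manipulate factorial expressions algebraically. It says that
$$\lim_{n\to\infty} \frac{n!}{(n/e)^n \sqrt{2\pi n}} = 1.$$
Suppose that $n$ is even and that $p = \frac{1}{2}$. Use Stirling's approximation to show that $\sqrt{n}$ times the probability mass assigned to $n/2$ by the distribution $\operatorname{Binomial}(n, p)$ converges to a finite, positive constant as $n \to \infty$. Find the value of this constant.
Solution. Let $p_n = \binom{n}{n/2} 2^{-n}$ be the probability mass at $n/2$. Substituting Stirling's approximation for the factorial expressions in $\binom{n}{n/2} = \frac{n!}{((n/2)!)^2}$ tells us that
$$\frac{\binom{n}{n/2}\, 2^{-n}}{\dfrac{(n/e)^n \sqrt{2\pi n}}{\left[(n/(2e))^{n/2} \sqrt{\pi n}\right]^2}\, 2^{-n}} \to 1$$
as $n \to \infty$. Simplifying the big mess of an expression on the left-hand side tells us that $p_n \big/ \sqrt{2/(\pi n)} \to 1$ as $n \to \infty$. Therefore, $\sqrt{n}\, p_n \to \sqrt{2/\pi}$ as $n \to \infty$.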
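As a numerical sanity check, here is a hedged sketch that evaluates $\sqrt{n}$ times the $\operatorname{Binomial}(n, 1/2)$ mass at the midpoint $n/2$ for a few even values of $n$ and compares it with $\sqrt{2/\pi} \approx 0.798$. The helper name `scaled_central_mass` is hypothetical.

```julia
# sqrt(n) times the Binomial(n, 1/2) mass at n/2, using BigInt to avoid overflow
scaled_central_mass(n) = Float64(sqrt(big(n)) * binomial(big(n), n ÷ 2) / big(2)^n)

# Values for increasing even n, next to the limiting constant sqrt(2/π)
[scaled_central_mass(n) for n in (10, 100, 1_000, 10_000)], sqrt(2 / π)
```

The values should creep up toward the limiting constant as $n$ grows.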
Geometric distribution
The geometric distribution with parameter $p \in (0, 1]$ is the distribution of the index of the first success in a sequence of independent $\operatorname{Bernoulli}(p)$ trials.
The probability that the first success occurs on trial $k$ is equal to the probability that the first $k - 1$ trials fail and the $k$th trial succeeds. The probability of this event is $p(1-p)^{k-1}$. Therefore, the probability mass function of the geometric distribution is
$$m(k) = p(1-p)^{k-1}, \qquad k = 1, 2, 3, \ldots$$
Exercise
Use Monte Carlo to find the mean and variance of the geometric distribution with parameter $p = 1/3$.
Hint: you can sample from the geometric distribution using the definition: count the number of times you have to run rand(Uniform(0, 1)) until you get a result less than $p$.
Solution. Here's an example solution:
using Statistics, Distributions
function sample_geometric(p)
    k = 1
    while true
        if rand(Uniform(0, 1)) < p
            return k
        else
            k += 1
        end
    end
end
samples = [sample_geometric(1/3) for i=1:1_000_000]
m = mean(samples)
σ² = mean(x^2 for x in samples) - m^2
(m, σ²)
The pair returned by this block is very close to $(3, 6)$, leading us to conjecture that the mean and variance are 3 and 6, respectively.
Note: the superscript of 2 is part of the variable name. You can get this symbol at the Julia prompt using \^2«tab»
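We can use Taylor series to work out exact expressions for the mean and variance. The mean is equal to
$$p + 2p(1-p) + 3p(1-p)^2 + 4p(1-p)^3 + \cdots,$$
and we recognize all the terms except the first as $-p$ times the derivative (with respect to $p$) of
$$(1-p)^2 + (1-p)^3 + (1-p)^4 + \cdots$$
By the formula for the sum of a geometric series, this expression is equal to
$$\frac{(1-p)^2}{1 - (1-p)} = \frac{(1-p)^2}{p},$$
and so the mean of the geometric distribution is
$$p - p\,\frac{d}{dp}\left(\frac{(1-p)^2}{p}\right) = p + \frac{1 - p^2}{p} = \frac{1}{p}.$$
The variance can be worked out in a similar but more tedious way, and the result is
$$\operatorname{Var} X = \frac{1-p}{p^2}.$$
These expressions do indeed evaluate to 3 and 6, respectively, when $p = 1/3$ is substituted.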
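The closed-form mean and variance can also be checked without simulation by summing the series directly. The sketch below truncates the sums at $k = 500$, an arbitrary cutoff chosen so that the neglected tail is negligible for $p = 1/3$. The names `mass`, `m`, and `v` are illustrative.

```julia
p = 1/3

# Truncate the series at k = 500; the neglected tail is negligible for p = 1/3
ks = 1:500
mass(k) = p * (1 - p)^(k - 1)

m = sum(k * mass(k) for k in ks)             # partial sum for the mean
v = sum(k^2 * mass(k) for k in ks) - m^2     # partial sum for the variance

# Compare with the closed forms 1/p and (1 - p)/p^2
(m, v, 1/p, (1 - p)/p^2)
```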
Exercise
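Suppose that $X$ is geometric with success probability $\frac{1}{2}$, and consider the random variable $Y = 2^X$. What is the expected value of $Y$?
Solution. The random variable $Y$ is equal to $2^n$ with probability $2^{-n}$, for all positive integers $n$. Therefore, the expected value is
$$2 \cdot \frac{1}{2} + 4 \cdot \frac{1}{4} + 8 \cdot \frac{1}{8} + \cdots = 1 + 1 + 1 + \cdots = \infty.$$
So $Y$ has infinite mean.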
Exercise
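Explain why ceil(log(rand())/log(1-p)) returns a random variable whose distribution is geometric with success probability $p$.
Solution. Let $\lceil x \rceil = \min\{k \in \mathbb{Z} : k \geq x\}$ define the ceiling function on $\mathbb{R}$. The question is asking to show that if $U$ is uniformly distributed in $[0, 1]$, then
$$\left\lceil \frac{\ln U}{\ln(1-p)} \right\rceil$$
is geometrically distributed with success probability $p$.
This is true because of the inverse cdf trick from a previous exercise. To show that this is indeed the case, it suffices to show that if $F$ is the cdf of a geometrically distributed random variable with success probability $p$, then the generalized inverse of $F$ is
$$F^{-1}(u) = \left\lceil \frac{\ln(1-u)}{\ln(1-p)} \right\rceil$$
for all $u \in (0, 1)$. Now, let $F$ be the cdf of a geometric random variable with success probability $p$ and let $\lfloor x \rfloor$ denote the floor function on $\mathbb{R}$. Then
$$F(x) = \begin{cases} 0 & \text{if } x < 1 \\ \sum_{k=1}^{\lfloor x \rfloor} p(1-p)^{k-1} & \text{if } x \geq 1 \end{cases} \;=\; \begin{cases} 0 & \text{if } x < 1 \\ 1 - (1-p)^{\lfloor x \rfloor} & \text{if } x \geq 1, \end{cases}$$
where the last line follows from evaluating the geometric sum. The jumps in $F$ clearly occur at positive integer values. Therefore, if we let $\mathbb{Z}_+ = \{1, 2, 3, \ldots\}$, we find that the generalized inverse of $F$ is given by
$$F^{-1}(u) = \min\{k \in \mathbb{Z}_+ : 1 - (1-p)^k \geq u\} = \min\left\{k \in \mathbb{Z}_+ : k \geq \frac{\ln(1-u)}{\ln(1-p)}\right\}$$
for all $u \in [0, 1)$. But if $u \in (0, 1)$, then $\frac{\ln(1-u)}{\ln(1-p)} > 0$ because $\ln(1-u)$ and $\ln(1-p)$ are both negative. Therefore, $F^{-1}(u) = \left\lceil \frac{\ln(1-u)}{\ln(1-p)} \right\rceil$ for all $u \in (0, 1)$.
Now if $U$ is uniformly distributed in $[0, 1]$, then $1 - U$ is also uniformly distributed in $[0, 1]$, so $\left\lceil \frac{\ln U}{\ln(1-p)} \right\rceil$ is indeed geometrically distributed with success probability $p$.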
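The claim can also be checked empirically. The sketch below assumes $p = 1/3$ (chosen only to match the earlier Monte Carlo exercise) and compares the sample mean and the fraction of samples equal to 1 with the geometric values $1/p = 3$ and $p = 1/3$. The seed and the name `geometric_sample` are arbitrary choices for illustration.

```julia
using Random

Random.seed!(1234)   # fixed seed so the run is reproducible (arbitrary choice)

p = 1/3
geometric_sample(p) = ceil(log(rand()) / log(1 - p))

samples = [geometric_sample(p) for _ in 1:1_000_000]

# Empirical mean and P(X = 1), to compare with 1/p = 3 and p = 1/3
m = sum(samples) / length(samples)
frac_one = count(==(1), samples) / length(samples)

(m, frac_one)
```

This agreement is evidence for the claim, not a proof; the argument above is what establishes it.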
Exercise
Every time you visit your favorite restaurant, you choose a meal uniformly at random from the 10 available meals. How many visits will it take on average before you've tried all 10 meals?
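Hint: try letting $X_k$ be the number of visits from the time you try the $k$th unique meal to the time when you try the $(k+1)$st unique meal.
Solution. For $k \in \{0, 1, \ldots, 9\}$ let $X_k$ be the number of visits it takes to try the $(k+1)$st unique meal after trying the $k$th unique meal. Then the number of visits it takes to try all the meals is $X_0 + X_1 + \cdots + X_9$. Now, for any positive integer $n$, $X_k = n$ if all the previous $n - 1$ visits yielded the $k$ meals that have already been tried and the $n$th visit yields a new meal. Because the meals are chosen independently and uniformly at random, we find that
$$P(X_k = n) = \left(\frac{k}{10}\right)^{n-1}\left(1 - \frac{k}{10}\right)$$
for all $k \in \{0, 1, \ldots, 9\}$ and any positive integer $n$. For notational simplicity, let $p = \frac{k}{10}$. Then
$$E[X_k] = \sum_{n=1}^{\infty} n\, p^{n-1}(1-p) = (1-p)\sum_{n=1}^{\infty} n\, p^{n-1}$$
for all $k \in \{0, 1, \ldots, 9\}$. Now, as we recall from elementary calculus, the term-by-term differentiation theorem gives
$$\sum_{n=1}^{\infty} n\, p^{n-1} = \frac{d}{dp}\sum_{n=1}^{\infty} p^n = \frac{d}{dp}\left(\frac{p}{1-p}\right) = \frac{1}{(1-p)^2}.$$
Therefore,
$$E[X_k] = \frac{1}{1-p} = \frac{10}{10-k}$$
for all $k \in \{0, 1, \ldots, 9\}$, and thus
$$E[X_0 + X_1 + \cdots + X_9] = \sum_{k=0}^{9} \frac{10}{10-k} = 10\left(1 + \frac{1}{2} + \cdots + \frac{1}{10}\right) \approx 29.29.$$
We find that, on average, about $29.29$ visits are needed to try all the different meals.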
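The answer can be cross-checked by simulating the restaurant visits directly. The following sketch evaluates the exact sum $\sum_{k=0}^{9} \frac{10}{10-k}$ and compares it with a Monte Carlo estimate. The function name `visits_until_all_tried`, the seed, and the number of repetitions are illustrative choices.

```julia
using Random

Random.seed!(42)   # arbitrary seed for reproducibility

# Exact expected number of visits: sum of 10/(10 - k) for k = 0, ..., 9
exact = sum(10 / (10 - k) for k in 0:9)

# Simulate visits until every one of the m meals has been tried at least once
function visits_until_all_tried(m)
    tried = falses(m)
    ntried = 0
    visits = 0
    while ntried < m
        meal = rand(1:m)
        visits += 1
        if !tried[meal]
            tried[meal] = true
            ntried += 1
        end
    end
    return visits
end

estimate = sum(visits_until_all_tried(10) for _ in 1:100_000) / 100_000

(exact, estimate)
```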
Poisson distribution
The Poisson distribution arises as the number of 1's observed in a large number of low-probability Bernoulli random variables. This situation models a surprising variety of real-world scenarios:
- The number of calls received at a call center in a given hour. Each potential caller has a low probability of calling during that particular hour, and there are many potential callers who are acting independently.
- The number of meteorites which strike earth in a given year. There are many meteorites which might hit earth, and each one does so with low probability.
- The number of mutations on a strand of DNA. Each mutation occurs with low probability, but there are many potential sites for a mutation.
- The number of claims filed with an insurance company in a given month. There are many customers, and they file claims independently and with low probability each month.
Exercise
- Find the expected value of $S$, where $S$ is a sum of 1000 independent Bernoulli random variables with success probability $p = \frac{3}{1000}$.
- Find the probability mass function of $S$. Hint: find an expression representing the probability mass at each $k$ from 0 to 1000, and then use Julia to evaluate it. You will need to define n = big(1000) and p = big(3)/1000 because arbitrary precision arithmetic is required to avoid overflow issues.
- Compare your results to the probability mass function $m(k) = \frac{3^k}{k!} e^{-3}$ defined on $\{0, 1, 2, \ldots\}$.
Solution. (i) The expected value of each Bernoulli random variable is $\frac{3}{1000}$, so by linearity of expectation the expected value of $S$ is $1000 \cdot \frac{3}{1000} = 3$.
(ii) Consider all possible length-1000 strings of 0's or 1's. Of these, there are $\binom{1000}{k}$ with $k$ ones and $1000 - k$ zeros, and each of those strings has a probability of $p^k (1-p)^{1000-k}$ of being the result of an independent sequence of $\operatorname{Bernoulli}(p)$ random variables (where $p = \frac{3}{1000}$). Therefore, the probability of the event $\{S = k\}$ is $\binom{1000}{k} p^k (1-p)^{1000-k}$. We can obtain a vector of these probabilities as follows:
n = big(1000)
p = big(3)/1000
massfunction = [binomial(n,k)*p^k*(1-p)^(n-k) for k=0:1000]
(iii) We can run [3^big(k)/factorial(big(k))*exp(-3) for k=0:1000] to get the first 1001 values of the given probability mass function. We see that the values are quite similar. The first ten pairs of values are
(0.0495631, 0.0497871)
(0.149137, 0.149361)
(0.224154, 0.224042)
(0.224379, 0.224042)
(0.168284, 0.168031)
(0.100869, 0.100819)
(0.0503334, 0.0504094)
(0.0215065, 0.021604)
(0.00803259, 0.00810151)
(0.0026641, 0.0027005)
Inspired by this exercise, we make the following definition:
Definition (Poisson distribution)
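The Poisson distribution with mean $\lambda \in [0, \infty)$ is the distribution whose probability mass function is
$$m_\lambda(k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots$$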
[Figure: the probability mass function of a Poisson distribution.]
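The expression $\frac{\lambda^k e^{-\lambda}}{k!}$ in the definition of the Poisson distribution arises as a limit of the expression
$$\binom{n}{k}\left(\frac{\lambda}{n}\right)^{k}\left(1 - \frac{\lambda}{n}\right)^{n-k}.$$
In other words, we use a success probability of $\frac{\lambda}{n}$ so that the expected number of successes remains constant as $n \to \infty$.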
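To see why the limit holds, one standard way to organize the computation (a sketch for fixed $k$, not necessarily the derivation the authors had in mind) is to split the binomial expression into factors whose limits are each easy to identify:

```latex
\begin{align*}
\binom{n}{k}\left(\frac{\lambda}{n}\right)^{k}\left(1-\frac{\lambda}{n}\right)^{n-k}
&= \frac{\lambda^k}{k!}\cdot\frac{n(n-1)\cdots(n-k+1)}{n^k}
   \cdot\left(1-\frac{\lambda}{n}\right)^{n}\cdot\left(1-\frac{\lambda}{n}\right)^{-k} \\
&\longrightarrow \frac{\lambda^k}{k!}\cdot 1\cdot e^{-\lambda}\cdot 1
 = \frac{\lambda^k e^{-\lambda}}{k!}
 \qquad\text{as } n\to\infty.
\end{align*}
```

The second factor tends to 1 because it is a product of $k$ terms each tending to 1, and the third tends to $e^{-\lambda}$ by the usual limit $(1 + x/n)^n \to e^x$.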
The connection between the Poisson and Bernoulli random variables may be used to obtain the mean and variance of the Poisson distribution. The average number of successes in $n$ $\operatorname{Bernoulli}(\lambda/n)$ trials is $n \cdot \frac{\lambda}{n} = \lambda$, by linearity of expectation. Therefore, we expect that the mean of a Poisson random variable with parameter $\lambda$ is equal to $\lambda$. Similarly, the variance of the number of successes in $n$ $\operatorname{Bernoulli}(\lambda/n)$ trials is equal to $n \cdot \frac{\lambda}{n}\left(1 - \frac{\lambda}{n}\right) = \lambda\left(1 - \frac{\lambda}{n}\right)$. Taking $n \to \infty$, we predict that the variance of a Poisson random variable with parameter $\lambda$ is also equal to $\lambda$. Both of these predictions are accurate:
Theorem
The mean and variance of a Poisson random variable with parameter $\lambda$ are $\lambda$ and $\lambda$, respectively.
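One way to verify the theorem directly from the probability mass function $\frac{\lambda^k e^{-\lambda}}{k!}$ is sketched below, writing $X$ for a Poisson random variable with parameter $\lambda$. It uses the factorial moment $E[X(X-1)]$, a standard shortcut that is an addition here rather than something taken from the text.

```latex
\begin{align*}
E[X] &= \sum_{k=0}^{\infty} k\,\frac{\lambda^k e^{-\lambda}}{k!}
      = \lambda e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^{k-1}}{(k-1)!}
      = \lambda e^{-\lambda} e^{\lambda} = \lambda, \\
E[X(X-1)] &= \sum_{k=0}^{\infty} k(k-1)\,\frac{\lambda^k e^{-\lambda}}{k!}
      = \lambda^2 e^{-\lambda}\sum_{k=2}^{\infty}\frac{\lambda^{k-2}}{(k-2)!}
      = \lambda^2, \\
\operatorname{Var} X &= E[X(X-1)] + E[X] - E[X]^2
      = \lambda^2 + \lambda - \lambda^2 = \lambda.
\end{align*}
```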
Exercise
Suppose that the number of typos on a page is a Poisson random variable with mean $\lambda$.
- Provide an explanation for why the Poisson distribution might be a good approximation for the distribution of typos on a page.
- Find the probability that a particular page is typo-free.
Solution. (i) Typo opportunities on a page convert to actual typos with a small but roughly constant probability, there are quite a few of them, and different typos are (roughly) independent of one another. Thus the number of typos is a sum of many independent Bernoulli random variables, each with a small success probability. (ii) The probability that a Poisson random variable with parameter $\lambda$ is equal to 0 is $\frac{\lambda^0 e^{-\lambda}}{0!} = e^{-\lambda}$.
Exponential distribution
The exponential distribution also emerges as a limit involving Bernoulli random variables: imagine placing light bulbs activated by independent $\operatorname{Bernoulli}(\lambda/n)$ random variables at every multiple of $1/n$ on the positive real number line. Consider the position $X$ of the leftmost lit bulb. The probability that it occurs to the right of a point $t > 0$ is equal to the probability that all of the $\lfloor nt \rfloor$ bulbs to the left remain unlit:
$$P(X > t) = \left(1 - \frac{\lambda}{n}\right)^{\lfloor nt \rfloor}.$$
This probability converges to $e^{-\lambda t}$ as $n \to \infty$.
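The convergence can be observed numerically. The sketch below assumes bulbs spaced $1/n$ apart, each lit with probability $\lambda/n$, and evaluates the probability that all $\lfloor nt \rfloor$ bulbs to the left of $t$ are unlit. The values $\lambda = 2$ and $t = 0.75$ and the helper name `unlit` are arbitrary illustrative choices.

```julia
λ, t = 2.0, 0.75   # arbitrary illustrative values

# Probability that all floor(n*t) bulbs to the left of t are unlit
unlit(n) = (1 - λ / n)^floor(Int, n * t)

# Values for increasing n, next to the limit exp(-λ t)
[unlit(n) for n in (10, 100, 10_000, 1_000_000)], exp(-λ * t)
```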
Definition (Exponential distribution)
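Let $\lambda > 0$. The exponential distribution with parameter $\lambda$ is the probability measure on $\mathbb{R}$ which assigns mass $e^{-\lambda t}$ to the interval $(t, \infty)$, for all $t \geq 0$.
Equivalently, the exponential distribution with parameter $\lambda$ is the probability measure whose density is
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \geq 0 \\ 0 & \text{if } x < 0. \end{cases}$$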
Exercise
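Find the mean of the exponential distribution with parameter $\lambda$.
Solution. We calculate
$$\int_0^\infty x\, \lambda e^{-\lambda x}\, dx = \left.\left(-x e^{-\lambda x} - \frac{1}{\lambda} e^{-\lambda x}\right)\right|_0^\infty = \frac{1}{\lambda}.$$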
Exercise
Suppose that 
Solution. Observing that 
as required.
Cauchy distribution
The Cauchy distribution spreads probability mass way out on the real number line.
Definition (Cauchy distribution)
The Cauchy distribution is the probability measure on $\mathbb{R}$ whose density function is
$$f(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}.$$
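The amount of probability mass assigned by the Cauchy distribution to the interval $(x, \infty)$ is
$$\int_x^\infty \frac{1}{\pi}\,\frac{1}{1 + t^2}\, dt = \frac{1}{2} - \frac{1}{\pi}\arctan(x) \approx \frac{1}{\pi x} \quad \text{as } x \to \infty.$$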
This mass goes to 0 so slowly that the Cauchy distribution doesn't even have a well-defined mean, let alone a variance. We say that the Cauchy distribution is heavy-tailed, and we will use it as an example when we want to study the effects of heavy tails on results like the law of large numbers or the central limit theorem.
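The effect of heavy tails on sample averages can be seen in a small experiment. The sketch below assumes the standard Cauchy density $\frac{1}{\pi(1+x^2)}$ and samples from it by applying the inverse cdf $u \mapsto \tan(\pi(u - 1/2))$ to uniform random numbers. The seed and the names `cauchy_sample`, `running_means`, and `frac_central` are illustrative.

```julia
using Random

Random.seed!(7)   # arbitrary seed for reproducibility

# Inverse-cdf sampling for the standard Cauchy distribution
cauchy_sample() = tan(π * (rand() - 1/2))

samples = [cauchy_sample() for _ in 1:1_000_000]

# Sample means of the first 10^2, 10^4, and 10^6 samples
running_means = [sum(samples[1:n]) / n for n in (100, 10_000, 1_000_000)]

# Sanity check on the sampler: half of the mass lies in [-1, 1]
frac_central = count(x -> abs(x) <= 1, samples) / length(samples)

(running_means, frac_central)
```

Rerunning with different seeds typically shows the sample means jumping around rather than settling down, in contrast with distributions that have a finite mean.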
Exercise
Show that the mean of the Cauchy distribution is not well-defined.
Solution. Let 
Therefore 
Exercise
Choose 
Solution. Let 
 Then for all 
 Since 
 Now, by the fundamental theorem of calculus, we know that if 
for all 
and 
Normal distribution
Because of the central limit theorem, which we will discuss in the next section, the normal distribution plays a central role in probability and statistics.
Definition (Normal distribution)
For 
The standard normal distribution is 
Exercise
Show that if 
Solution. Let 
 for all 
But 
and thus 
Example
In terms of the cumulative distribution function 
Solution. Let's denote by 
Therefore, the desired probability is
We can compute this probability in Julia as follows:
using Distributions
Φ(x) = cdf(Normal(0, 1), x)
Φ(3/sqrt(3)) - Φ(1/sqrt(3))
We find that the probability is approximately $0.24$.
If we sum two independent random variables with means 
Theorem
If 
Exercise
Suppose that 
Solution. We know that 
Continuing in this way, we find that 
The multivariate normal distribution
If 
is said to have multivariate normal distribution.
If 
The figure below shows a graph of this density as well as 1000 samples from the distribution of 
A graph of a multivariable normal density
Exercise
Show that the covariance matrix of a multivariate normal random vector 
Note: you may use the following properties: 
Solution. Before showing that 
Second, if 
A similar argument also shows that 
Now, from the first observation, we deduce that 
Now, 