Last time we learned about continuous distributions and how they are described by the frequency function. I also promised to save us from the infinitesimal probabilities represented in the frequency function. The answer is the **Cumulative Distribution Function (CDF)**, which is merely the area under the frequency function curve to the left of a given point. (You may remember with horror from your schooldays that the way to get this area is to integrate the curve, but we need not worry about the mechanics of this here.) What matters is that the area to the left of any value x represents not the probability of the variable being *exactly* x (which is infinitesimal) but the probability that it is *less than* x. And this is a nice finite number, rising of course from 0 to 1 as x increases from some suitably low value (minus infinity if you like) to a high one (plus infinity).

Furthermore, this cumulative probability is what we are generally interested in. (We rarely care whether the project will finish *on* a particular date, but rather whether it will finish *by* a particular date.) So now we are cooking with gas! The cumulative curve is generally in the shape of an S (since its gradient is at its highest at a point coinciding with the peak of the frequency function) and is sometimes called an S-curve. Here is the CDF for the near-normal distribution we got last time by adding together 10 throws of a die:
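For readers who would like to rebuild this S-curve for themselves, here is a minimal Python sketch. It works out the exact frequency function for the total of 10 fair dice and then accumulates it into a CDF. The function names (`dice_sum_pmf`, `cdf_from_pmf`) are my own inventions for illustration. Note that because dice totals are discrete, the running sum gives the probability of a total *at most* x rather than strictly less than x.

```python
from fractions import Fraction


def dice_sum_pmf(n_dice, sides=6):
    """Exact frequency function for the total of n_dice fair dice."""
    pmf = {0: Fraction(1)}
    for _ in range(n_dice):
        nxt = {}
        for total, p in pmf.items():
            for face in range(1, sides + 1):
                nxt[total + face] = nxt.get(total + face, Fraction(0)) + p / sides
        pmf = nxt
    return pmf


def cdf_from_pmf(pmf):
    """Running total of the frequency function: P(total <= value)."""
    running = Fraction(0)
    cdf = {}
    for value in sorted(pmf):
        running += pmf[value]
        cdf[value] = running
    return cdf


pmf = dice_sum_pmf(10)
cdf = cdf_from_pmf(pmf)

# Smallest total whose cumulative probability reaches 80%.
p80 = min(v for v, c in cdf.items() if c >= Fraction(4, 5))

for value in (20, 30, 35, 40, 50):
    print(f"P(total <= {value}) = {float(cdf[value]):.4f}")
print("P80 point:", p80)
```

Plotting `cdf` against its keys should give the S shape described above, climbing from almost 0 at a total of 10 to exactly 1 at a total of 60.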

The values read from this curve are often called percentiles. We say that the “80th percentile is $40”, meaning that there is an 80% chance that the project cost will not exceed $40. This is also sometimes called the “P80” point.

Now, suppose we are interested in the P80 point, and we have a project comprising two subprojects with the same P80 cost, $40. What is the P80 point for the total project? From what we have learned so far, it should be clear that it is NOT $80.

Why not? Because we have learned that means and variances are additive. Since the variances add, any value characterized by a non-zero linear offset from the mean cannot be additive. The fact is that we do not have enough information to answer the question, but we can say that the answer will be *less than* $80. This should not surprise us. If there is a 20% chance of each project costing more than $40, the chance of them BOTH doing so is only 4%, so we might expect that $80 is more like the P96 point. (This is not rigorous because there are an infinite number of other ways in which the total cost could exceed $80, but it should give you a feel for why you cannot add percentiles. And for the same reason in reverse, you cannot “pro-rate” percentiles either.)

To answer the question we have to know something about the two individual distributions, so let’s suppose that they are both the same normal distribution we illustrated above. We know that the distribution for the total project will be normal with a mean of $70 (twice the $35 mean of each subproject) and a standard deviation of about $7.64 (5.4 times the square root of 2). Now, for the normal distribution each percentile is a given number of standard deviations away from the mean. The relationship is not easy to calculate but can be looked up in tables. The P80 point is about 0.84 standard deviations above the mean, so we can deduce that the P80 point for the total is actually $70 + 0.84 × 7.64, or about $76.4.
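If you would rather not reach for the tables, the arithmetic can be sketched in a few lines of Python using the standard library's `statistics.NormalDist`. This is only an illustration: it assumes each subproject is normal with a mean of $35 and a standard deviation of $5.4, and that the two are independent (which is what lets the variances add). The variable names are mine.

```python
from math import sqrt
from statistics import NormalDist

# Assumed subproject distribution: normal, mean $35, standard deviation $5.4.
sub_mean, sub_sd = 35.0, 5.4

# Means add; variances add (assuming independence), so the sd grows by sqrt(2).
total_mean = 2 * sub_mean
total_sd = sqrt(2 * sub_sd ** 2)

# How many standard deviations above the mean the P80 point sits.
z80 = NormalDist().inv_cdf(0.80)

# P80 of the total project.
p80_total = NormalDist(total_mean, total_sd).inv_cdf(0.80)

print(f"total sd  = {total_sd:.2f}")
print(f"z for P80 = {z80:.2f}")
print(f"P80 total = {p80_total:.1f}")
```

`inv_cdf` is doing the job of the lookup table here: you give it a probability and it gives back the value below which that much of the distribution lies.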

The naïve $80 answer is actually about 1.31 standard deviations above the mean. Looking up the tables backward, we can also deduce that this is actually roughly the P90 point.
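Looking up the tables backward is just as easy to sketch in Python. The snippet below takes the total-project distribution as normal with a mean of $70 and a standard deviation of $7.64, and asks what percentile the naïve $80 figure corresponds to. Again this is an illustration under those stated numbers, using the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

# Total project, taken as normal with mean $70 and standard deviation $7.64.
total = NormalDist(mu=70.0, sigma=7.64)

naive_sum = 80.0  # the two subproject P80 points simply added together

# Distance of the naive figure from the mean, in standard deviations.
z = (naive_sum - total.mean) / total.stdev

# Probability that the total cost does not exceed the naive figure.
percentile = total.cdf(naive_sum)

print(f"z = {z:.2f}")
print(f"P(total <= $80) = {percentile:.3f}")
```

The result comes out at a little over 0.90, which is where the "roughly P90" figure comes from.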

We have nearly finished our 3-part introduction to probability and I want to end by talking about **measures of central tendency**. As their name implies, they are measures of what we might imprecisely call the centre of a distribution. The normal distribution we have dealt with so far is symmetrical, so there is not much doubt about where its centre is, but if the distribution is not symmetrical we have a more nuanced situation. Take this skewed beta distribution:

The mean value is $4889. We can also identify the peak of the frequency function, the most likely value, also called the mode, which is about $4800. Finally there is the median, which is the P50 point on the S-curve, at $4867. Which of these measures is the best depends upon what one is using it for. They are measuring different things. Suppose we have a group of 30 people in a room. 29 are regular guys but the 30th is Warren Buffett with a net worth of say $30 billion. The mean net worth will be pretty close to $1 billion, but this does not tell us much about the typical guy in the group. The mode or median would give us a better idea. If Buffett leaves and George Soros replaces him, the mean net worth will change but the median and the mode will not. So in cases where we have very skewed distributions, the mode or median will be more representative. And the median is more reliable than the mode because there is not really a mode in this example. (No two people are likely to have exactly the same net worth, so we have to aggregate them into ranges to even get the concept of most likely.) This is why the median is the most widely quoted value when describing social phenomena like income and wealth, which tend to be very unevenly distributed.
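The room full of people is easy to try out in Python. In the sketch below the 29 ordinary net worths are made-up numbers (spread evenly between $50,000 and $190,000), the $30 billion is the figure used above, and the $20 billion for the replacement billionaire is purely a hypothetical value chosen for illustration.

```python
from statistics import mean, median

# 29 hypothetical "regular" net worths, $50,000 up to $190,000 (made-up data).
regular = [50_000 + 5_000 * i for i in range(29)]

# Add one billionaire at $30 billion, then swap in a hypothetical $20 billion.
with_first = regular + [30_000_000_000]
with_second = regular + [20_000_000_000]

print(f"mean with first billionaire    = {mean(with_first):,.0f}")
print(f"mean with second billionaire   = {mean(with_second):,.0f}")
print(f"median with first billionaire  = {median(with_first):,.0f}")
print(f"median with second billionaire = {median(with_second):,.0f}")
```

The mean swings by hundreds of millions when the billionaire is swapped, while the median stays put among the ordinary values, which is the behaviour described above.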

But medians are just percentiles, *which means we cannot add them together.* Remember that the rule about adding up means applies regardless of the distribution, so although it is not always the most representative value it is very convenient for computation. Let’s see what happens when we add two subprojects with this skewed cost distribution:

You will see that the mean is indeed twice the old mean, to within a rounding error of $1, at $9779. But the mode is about $9700, and the median $9760. You may also notice that the three values have become closer together in proportional terms. The reason is that the distribution has become less skewed. And the reason for that is our old friend the Central Limit Theorem. Even highly skewed distributions obey the theorem and tend towards the (symmetrical) normal distribution.
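The same effect can be checked without any simulation if we pick a skewed distribution whose sums are known exactly. The sketch below does not use the beta distribution from the chart (I have not given its parameters here); instead it swaps in an exponential distribution with a mean of 1, which is much more heavily skewed. The total of two independent exponentials has a known CDF, and the median is found by a simple bisection search. The helper names are my own.

```python
from math import exp, log


def two_exponentials_cdf(x, mean=1.0):
    """CDF of the total of two independent exponentials, each with this mean."""
    t = x / mean
    return 1.0 - exp(-t) * (1.0 + t)


def quantile_by_bisection(cdf, p, lo, hi, tol=1e-9):
    """Find x with cdf(x) = p, assuming cdf is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


single_mean = 1.0
single_median = single_mean * log(2)  # exact median of one exponential

total_mean = 2 * single_mean  # means always add
total_median = quantile_by_bisection(two_exponentials_cdf, 0.5, 0.0, 20.0)

print(f"single: mean {single_mean:.3f}, median {single_median:.3f}")
print(f"total:  mean {total_mean:.3f}, median {total_median:.3f}")
print(f"twice the single median = {2 * single_median:.3f}")
print(f"median/mean: single {single_median / single_mean:.3f}, "
      f"total {total_median / total_mean:.3f}")
```

The mean of the total is exactly twice the single mean, but the median of the total is noticeably more than twice the single median, and the ratio of median to mean moves closer to 1, just as the less-skewed chart suggests.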

One final point about the CLT and about the fact that variances are additive, which quantifies the “swings and roundabouts” effect we all experience. Both are real. They are not aberrations. I have heard lecturers, and others who should know better, talk about the CLT (when they actually mean the additive nature of variances) as if it is some mischievous gremlin which needs to be overcome or counteracted in some way. (This is generally because they do not get the answers they want. In this they make two mistakes: firstly, misdiagnosing the problem altogether, and secondly, blaming it on the CLT rather than the additive nature of variances.) But it reflects real life; if one neutralizes it in one's model of real life then one's model will be wrong.

Finally, a book recommendation. *The Flaw of Averages* by Sam Savage warns of the danger of basing decisions on averages (or indeed any single-point estimate of a random variable). It is written for the general reader and a lot of fun, and it refers to project schedules to exemplify a particularly pernicious strain of the “flaw”.

