Twitter is awash with data graphics purporting to show some kind of injustice, anomaly or discrepancy. Before retweeting one of these you may want to consider two things:
- Is the data trustworthy?
- Does the data actually support the claim in the tweet?
The latter is the more general concern, and it extends well beyond the Twitter universe. In particular, random variation can lead to real-world differences that you might not expect. One should assess the probability of interesting numbers arising by chance, without the need for conspiracy, collusion or inherent bias. And one of the key things to consider in this respect is sample size.
The graphic below – produced by BBC3’s Free Speech programme – has been doing the rounds. I’m not aware of anything inherently wrong with the data (though I haven’t taken much time to check it), but it provides a nice illustration of the importance of understanding sample-size effects.
The text accompanying this graphic was:
“33% of MPs went to private school compared to 7% of the general population. Does this matter to you? #RegisterToVote”
Sample sizes are given here, but the pie charts that visually show the proportion of privately educated MPs for the parties do not encode them in any way. Moreover, the charts go in descending order according to the proportions, seemingly implying the “worst offenders” are the Green Party and UKIP on the left (of the image). These two parties have a grand total of three MPs between them. It’s also worth noting that the charts do not cover all parties. For example, SNP MPs are missing.
So how can we take into account the effects of sample size? One simple option is to answer the question: “What is the probability I would observe at least as many privately educated individuals as seen in party X if I took a random sample of the UK adult population that is the same size as the total number of MPs in party X?” We’ll call this the random sample probability or RSP. Strictly speaking this involves sampling without replacement (no two MPs can be the same person), but given that there are millions of state and privately educated adults, the difference between sampling with and without replacement is negligible. That means we can use the binomial distribution to calculate these probabilities.
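As a sketch of the calculation (in Python rather than R, and assuming only the 7% population figure from the graphic), the RSP is just the upper tail of a binomial distribution:

```python
from math import comb

P_PRIVATE = 0.07  # proportion of UK adults privately educated (from the graphic)

def rsp(n_mps: int, n_private: int, p: float = P_PRIVATE) -> float:
    """Probability of seeing at least n_private privately educated people
    in a random sample of n_mps UK adults (binomial upper-tail sum)."""
    return sum(comb(n_mps, k) * p**k * (1 - p)**(n_mps - k)
               for k in range(n_private, n_mps + 1))

# The two smallest parties in the graphic:
print(rsp(1, 1))  # Green Party (1 MP, privately educated): 0.07
print(rsp(2, 2))  # UKIP (2 MPs, both privately educated): 0.07 * 0.07 = 0.0049
```

In R the same tail probability is available directly via `pbinom`; the explicit sum above just makes the definition transparent.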
In fact, for the Green Party and UKIP it’s even simpler. For the former, with 7% of adults having gone to private school, there is by definition a 7% (or 0.07 in decimal) chance that one person selected at random from the UK adult population is privately educated. For the latter, select two individuals at random from the population and the chance that both went to private school is 0.07 × 0.07 = 0.0049, or about 0.5%.
For the other parties we do need the binomial distribution. For the Labour party the RSP depends on whether the 10% stated corresponds to 25, 26 or 27 MPs (this is a rounding issue). Correspondingly the RSP is 6.1%, 3.9% or 2.4%. Each possibility is less than the RSP of the Green Party but more than that of UKIP.
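These Labour figures can be checked with the same binomial tail sum. One caveat: the graphic’s party totals aren’t reproduced in the text, so the Labour seat count of 258 below (the party’s 2010 general-election result) is my assumption; it is consistent with 25–27 MPs rounding to 10%.

```python
from math import comb

def rsp(n, k, p=0.07):
    """P(at least k privately educated people in a random sample of n adults)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# N_LABOUR = 258 is an assumed total (2010 intake); the graphic's exact
# figure isn't quoted in the text, but 25-27 out of 258 all round to 10%.
N_LABOUR = 258
for k in (25, 26, 27):
    print(f"{k} privately educated MPs: RSP = {rsp(N_LABOUR, k):.3f}")
```

If the assumed total is right, the three RSPs should land close to the 6.1%, 3.9% and 2.4% quoted above.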
For the Liberal Democrats the RSP is about one in a trillion! For the Conservatives, the data from the graphic is ambiguous as to whether there are 157, 158 or 159 privately educated MPs. It doesn’t really matter: any which way the probability is essentially 0% – too small to calculate in R, and R can cope with some pretty small numbers.
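Probabilities this extreme are easiest to handle on a log scale, where the binomial tail can still be pinned down even when the raw number is vanishingly small. A sketch using log-space summation – with the caveat that the Conservative totals below (157 privately educated MPs out of an assumed 305) are illustrative assumptions, since the text only notes the graphic is ambiguous:

```python
from math import lgamma, log, exp

def log10_rsp(n, k, p=0.07):
    """log10 of P(X >= k) for X ~ Binomial(n, p), summed in log space
    via log-sum-exp so tiny tails don't vanish to zero."""
    logs = [lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
            + j * log(p) + (n - j) * log(1 - p)
            for j in range(k, n + 1)]
    m = max(logs)
    return (m + log(sum(exp(x - m) for x in logs))) / log(10)

# Hypothetical Conservative figures (assumed, not from the text):
print(log10_rsp(305, 157))  # a log10 probability somewhere around -95
```

So under these assumed counts the Conservative RSP would be on the order of one in 10^95 – far beyond “one in a trillion”.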
What do we conclude from this? Despite the fact that 100% of their MPs received a private school education, the evidence that the Green Party favours privately educated individuals is the least conclusive (at least using the numbers provided in the graphic). That is to say, the Green Party’s relative number of privately and state educated MPs is more likely to arise by chance from a random selection of the UK population than for any other party. This is the exact opposite of the implicit message one is likely to get from just glancing at the pie charts. Sample size is critical here.
To be clear, I’m not making a statement about whether an unrepresentative parliament (in terms of private/state schooling) is bad. The point is that if we ignore sample size and the high variability associated with small samples then we’re likely to come to the wrong conclusion.