Benford’s law is a well-known mathematical phenomenon describing the frequency distribution of the leading significant digits in many collections of data that span several orders of magnitude. In base 10, if a dataset conforms to Benford’s law, 1 will be the leading digit in about 30% of cases and 2 in about 18%, with the frequency decreasing to less than 5% for 9. Formally, the probability that the leading digit, D1, of a number in the dataset is d is given by

$$P(D_1 = d) = \log_{10}\left(1 + \frac{1}{d}\right), \qquad d = 1, 2, \dots, 9$$
The resultant distribution is shown below:
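For anyone who prefers numbers to charts, the same distribution can be tabulated in a few lines of Python. This is only a minimal sketch using the standard form of the law, P(d) = log10(1 + 1/d); the function name `benford_probability` is my own choice for illustration.

```python
import math


def benford_probability(d: int) -> float:
    """Probability that the leading digit is d (1 to 9) under Benford's law."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


# Print the expected share of each leading digit as a percentage.
for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.1%}")
```

The nine probabilities sum to 1, because the terms log10((d + 1) / d) telescope to log10(10).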
Why many datasets obey the equation above isn’t trivial to prove, so I won’t try to do so. The Numberphile video below presents an intuitive, if incomplete, explanation, and if you really like maths you can try this paper.
Of more concern here is which datasets should follow Benford’s law. Frequently cited examples include the front page of the Financial Times, the populations of countries and certain accounting data. In addition, some mathematical constructs, such as the Fibonacci sequence, also follow Benford’s law.
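The Fibonacci claim is easy to check for yourself. The sketch below counts the leading digits of the first 1,000 Fibonacci numbers (starting from 1, 1) and prints them next to the Benford expectation. The helper names and the choice of 1,000 terms are my own assumptions for illustration.

```python
import math
from collections import Counter


def leading_digit(n: int) -> int:
    """First decimal digit of a positive integer."""
    return int(str(n)[0])


def fibonacci(count: int):
    """Yield the first `count` Fibonacci numbers, starting 1, 1, 2, 3, ..."""
    a, b = 1, 1
    for _ in range(count):
        yield a
        a, b = b, a + b


N_TERMS = 1000  # arbitrary sample size chosen for this illustration
counts = Counter(leading_digit(f) for f in fibonacci(N_TERMS))

# Compare the observed share of each leading digit with Benford's law.
for d in range(1, 10):
    observed = counts[d] / N_TERMS
    expected = math.log10(1 + 1 / d)
    print(f"{d}: observed {observed:.3f}, Benford {expected:.3f}")
```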
Benford’s law and elections
Perhaps inspired by the successful application of Benford’s law in the analysis of tax fraud, there have been a number of attempts to use it to look for evidence of electoral fraud. These include analyses of the 2009 Iranian elections (see here, here, here, here and here, for example) and the Russian elections in 2012 (see this Guardian article, with more details here). The validity of Benford’s law for electoral data may still be up for debate, but it’s nice to see from the Guardian that Betteridge’s law is still doing well.
I recently started putting together a database of parliamentary election results to help me with future analysis of the upcoming 2015 election. In the process I thought it would be interesting to look at whether the results from 2010 conformed to Benford’s law, with the help of this spreadsheet from the Electoral Commission website.
4,150 candidates stood for election across the 650 constituencies, with tallies ranging from 17 (Godfrey Spickernell of the Blue Environment Party) to 35,471 (Stephen Timms of the Labour Party). While many analyses of electoral data with Benford’s law have concentrated on major parties/candidates separately, the size of this complete set seems, on the face of it, ideal for probing the law.
The results are shown in the chart below. The black dots mark the actual values while the green line joins the expected results from Benford’s law. The inner and outer green ranges surrounding the green line mark the one and three standard deviation limits one would expect from binomially distributed data with n = 4,150 and p given by the corresponding Benford probability. Loosely this means about 6 of the 9 real-world data points should fall within the inner green range and all 9 points should lie within the outer range if the underlying data follows Benford’s law.
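The width of those ranges follows from the binomial distribution: for each digit the expected count is n·p and the standard deviation is sqrt(n·p·(1 − p)). The sketch below computes both for n = 4,150. The function name is hypothetical, and I am assuming here that the bands are expressed as candidate counts rather than fractions.

```python
import math

N_CANDIDATES = 4150  # number of candidates quoted in the text


def benford_band(d: int, n: int = N_CANDIDATES) -> tuple[float, float]:
    """Return (expected count, one standard deviation) for leading digit d."""
    p = math.log10(1 + 1 / d)           # Benford probability for digit d
    expected = n * p                    # binomial mean
    sigma = math.sqrt(n * p * (1 - p))  # binomial standard deviation
    return expected, sigma


for d in range(1, 10):
    expected, sigma = benford_band(d)
    print(
        f"{d}: expected {expected:7.1f}, "
        f"1 sd {expected - sigma:7.1f} to {expected + sigma:7.1f}, "
        f"3 sd {expected - 3 * sigma:7.1f} to {expected + 3 * sigma:7.1f}"
    )
```

For a binomial count with n this large, roughly 68% of outcomes fall within one standard deviation of the mean, which is where the "about 6 of the 9" rule of thumb comes from.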
While the true data has a passing resemblance to expectations from Benford’s law (certainly more so than a flat distribution) there are huge discrepancies, particularly for digits 1, 3 and 4.
To determine why the data may not obey Benford’s law it is instructive to look at the actual distribution of votes. Because the data covers several orders of magnitude, it is more useful to plot the base-10 logarithm of the number of votes. The complete data is plotted as a histogram below. The lines show the distributions for candidates from the three “traditional” parties, namely the Conservative Party (blue), the Labour Party (red) and the Liberal Democrats (orange), and for all other parties combined (yellow).
Clearly, rather than a smooth distribution we have a twin-peaked one. From the overlaid lines it’s straightforward to see that the low end of the distribution is completely dominated by the “other” candidates. Their distribution peaks around a log value of 3.15 (corresponding to about 1,400 votes) and then drops off quickly (on a log scale at least). By contrast, it is only after the other candidates’ counts drop off that we start to see counts for the traditional parties. This leads to a dip in counts around 3.4 to 3.75 (corresponding to around 2,500 to 5,500 votes) and thus explains some of the deficit for leading digits 3, 4 and 5. Quirkily, around 3.65 (about 4,500 votes), the share in counts is roughly equal between all four “parties”.
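To translate between the log axis and actual vote tallies, raise 10 to the plotted value. A quick sketch of the conversions quoted above (the function name is my own):

```python
def votes_from_log10(log_votes: float) -> float:
    """Convert a base-10 log value on the histogram axis back to a vote count."""
    return 10 ** log_votes


# Log values mentioned in the text: first peak, edges of the dip, crossover.
for log_votes in (3.15, 3.4, 3.65, 3.75):
    print(f"log10 = {log_votes:.2f} -> about {votes_from_log10(log_votes):,.0f} votes")
```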
As can be seen from the chart, the second peak in the distribution comes largely from the combination of counts from the traditional parties, with the lower side augmented by the other parties. (The other parties here are predominantly those in Scotland, Wales and Northern Ireland whose major support is confined to a single nation, i.e. the SNP, Plaid Cymru, Sinn Féin, the DUP and the SDLP.) The maximum occurs between 4.2 and 4.3, corresponding to roughly 16,000 to 20,000 votes. This peak among tallies with a leading digit of 1 isn’t surprising if you take into account that constituencies are designed to be (very) roughly equally sized. To illustrate, we can use the median total vote in a constituency, which is 46,426. To get a total count between 10,000 and 19,999, a candidate has to poll ~21.5% to ~43%, which is quite possible for candidates from any of the three major parties that (in England in 2010 at least) dominated the political landscape.
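The vote-share arithmetic is a simple division by the median constituency total. As a minimal sketch (the constant and function names are mine, the figures are the ones quoted above):

```python
MEDIAN_TOTAL_VOTE = 46_426  # median total vote in a constituency, from the text


def share_needed(votes: int, total: int = MEDIAN_TOTAL_VOTE) -> float:
    """Fraction of the constituency vote required to reach `votes`."""
    return votes / total


low = share_needed(10_000)   # smallest tally with leading digit 1 in this range
high = share_needed(19_999)  # largest tally with leading digit 1 in this range
print(f"Share needed: {low:.1%} to {high:.1%}")
```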
It shouldn’t be surprising from this that Benford’s law is also a poor fit for each of the traditional parties individually (Conservative, Labour, Liberal Democrat), since each party’s data covers only roughly one order of magnitude. Things are a bit better for the other candidates combined, but there is still an excess of 1s as the leading digit. Again, this should not be particularly surprising given the aforementioned position of the first peak in the histogram.
Something that does actually obey Benford’s law
The data from the spreadsheet has 134 different party affiliations (including Independent and not declared). Summing all the votes for each of these parties (and pseudo-parties) and extracting the leading digits of the totals gives the following distribution, which does nicely follow Benford’s law, albeit with a lot less data.
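The aggregation step can be sketched as follows. The rows here are made-up placeholders purely to show the mechanics, not real results, and the (party, votes) layout is an assumption about how the spreadsheet might be loaded.

```python
from collections import Counter, defaultdict

# Hypothetical (party, votes) rows for illustration only; the real
# spreadsheet has one row per candidate.
rows = [
    ("Party A", 12_345),
    ("Party A", 8_210),
    ("Party B", 950),
    ("Party B", 1_403),
    ("Independent", 17),
]


def leading_digit(n: int) -> int:
    """First decimal digit of a positive integer."""
    return int(str(n)[0])


def party_leading_digits(rows) -> Counter:
    """Sum votes per party, then count the leading digits of the totals."""
    totals = defaultdict(int)
    for party, votes in rows:
        totals[party] += votes
    # Skip parties with zero votes, which have no leading significant digit.
    return Counter(leading_digit(t) for t in totals.values() if t > 0)


print(party_leading_digits(rows))
```

Because party totals range from a handful of votes to millions, they span far more orders of magnitude than individual candidate tallies, which is the condition under which Benford’s law tends to hold.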
Benford’s law 2: the second digit
The non-conformance to Benford’s law of most of the data presented above shouldn’t be taken as evidence of electoral fraud. There is a perfectly good explanation for this observation: roughly equal constituency sizes combined with broad support across most of them for a handful of parties. Similar conditions occur in other electoral systems. However, rather than give up on Benford completely, there has been considerable effort to apply a natural extension of the law to the second digit and use that information to hunt for electoral fraud. I’ll cover this idea in part 2.