In part 1 I showed that candidate vote counts from the 2010 election in the UK didn’t conform to Benford’s law, but that there was a perfectly reasonable explanation for this. In short: the combination of roughly equal constituency sizes and broad support for three parties led to a large number of candidates getting between ten and twenty thousand votes and relatively few getting three to four thousand. Hence there is an excess of 1s and a deficiency of 3s and 4s, as seen here.
Benford and the second digit
There is a natural extension of Benford’s law from the first digit to the second.
Recall from part 1 that Benford’s law, henceforth the Benford first-digit distribution (B1D), can be summarised by

$$P(D_1 = d) = \log_{10}\left(1 + \frac{1}{d}\right)$$

where D1 is the first digit and d is between 1 and 9. The natural extension is to apply the same formula to the leading two-digit pair, D12, too. Simply:

$$P(D_{12} = d) = \log_{10}\left(1 + \frac{1}{d}\right)$$

for d between 10 and 99. To find the probability that the second digit is (say) 7, we would then just add up the probabilities of the nine two-digit combinations whose second digit is 7 (namely 17, 27, 37, 47, 57, 67, 77, 87 and 97). Formally, the probability of the second digit, D2, being d is given by:

$$P(D_2 = d) = \sum_{k=1}^{9} \log_{10}\left(1 + \frac{1}{10k + d}\right)$$

for d between 0 and 9. Plug the numbers in and you’ll retrieve the distribution below:
Clearly this is a lot closer to being uniform than the first-digit distribution but it still favours the lower digits. To (hopefully) avoid confusion I’ll refer to this as the Benford second-digit distribution (B2D). We can summarise the results of all these formulas using the table below. The central rectangle provides the probability for each two-digit pair, the bottom row gives the first-digit probabilities (ie B1D) from the column sums and the end column gives the B2D probabilities from the row sums.
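For anyone who wants to reproduce the table, here is a minimal Python sketch of the calculation described above. The names `pair_probability`, `table`, `b1d` and `b2d` are my own choices for illustration. Rows are indexed by second digit and columns by first digit, to match the layout of the table.

```python
import math

def pair_probability(k: int, d: int) -> float:
    """Benford probability that the leading two digits are k (1-9) then d (0-9)."""
    return math.log10(1 + 1 / (10 * k + d))

# table[d][k - 1]: rows are second digits 0-9, columns are first digits 1-9.
table = [[pair_probability(k, d) for k in range(1, 10)] for d in range(10)]

# Row sums give the second-digit distribution (B2D).
b2d = [sum(row) for row in table]

# Column sums give the first-digit distribution (B1D).
b1d = [sum(table[d][k] for d in range(10)) for k in range(9)]

for d, p in enumerate(b2d):
    print(f"P(D2 = {d}) = {p:.4f}")
```

The B2D values fall gently from about 0.120 for a second digit of 0 to about 0.085 for a second digit of 9, and the column sums collapse back to log10(1 + 1/k), as they should.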
The second digit and UK election data
Debate rages as to whether B2D (and the combination of B1D and B2D) can be applied to serious searches for electoral fraud. Statements on the matter range from “[a] simple and quick general test to screen for numerical anomalies” to “essentially useless”. So how do things look for the second-digits in the 2010 election data across all candidates? This is shown below with the black data points. The B2D expectation is marked with the green line with the inner and outer green ranges indicating one and three standard deviations (assuming binomially-distributed data) from expectation, respectively. In addition, the white block encloses three standard deviations either side of a uniform (ie flat or constant) distribution.
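The one and three standard deviation ranges can be sketched as follows, assuming (as above) that the count for each digit is binomially distributed. The helper names and the total of 4000 candidates are placeholders of mine for illustration, not the real figure from the 2010 data.

```python
import math

def b2d_probability(d: int) -> float:
    """Benford second-digit probability for d in 0-9."""
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))

def binomial_band(n: int, p: float, n_sd: float) -> tuple[float, float]:
    """Limits n_sd standard deviations either side of the expected count n * p."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean - n_sd * sd, mean + n_sd * sd

n_candidates = 4000  # hypothetical total, chosen only for illustration

for d in range(10):
    p = b2d_probability(d)
    inner = binomial_band(n_candidates, p, 1)   # one standard deviation
    outer = binomial_band(n_candidates, p, 3)   # three standard deviations
    print(f"digit {d}: expected {n_candidates * p:.1f}, "
          f"1 sd {inner[0]:.1f} to {inner[1]:.1f}, "
          f"3 sd {outer[0]:.1f} to {outer[1]:.1f}")
```

The band for the uniform case comes from the same function with p fixed at 0.1 for every digit.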
On a purely visual inspection, the agreement between B2D and the real data is remarkably good. Even with the much gentler gradient than B1D, it is still clear that B2D is a better fit to the data than the naive assumption that second digits should be uniformly distributed. From this it seems that B2D may indeed be a useful indicator in the hunt for electoral fraud (at least if we assume, as seems reasonable, there was no widespread fraud in the UK elections of 2010).
A simple χ² analysis agrees with visual inspection. The formula for this test statistic is

$$\chi^2 = \sum_{i=0}^{9} \frac{(O_i - E_i)^2}{E_i}$$

where Oi is the observed number of counts for digit i and Ei is the expected number of counts for digit i (the product of the Benford probability and the total number of candidates). For the ten digits here we have 9 degrees of freedom, giving a critical value of 16.92 at the 95% confidence level. For the non-statisticians, that very roughly means that if the sum above exceeds 16.92 then we have reason to believe that the data does not conform to the specified distribution. The observed sum is 4.18. In contrast, comparing with the expectations of a uniform distribution gives a value of 56.18.
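As a rough sketch of how the statistic could be computed from a plain list of vote counts, the Python below tallies second digits and compares them with the B2D expectation. The function names are mine, and the vote counts are made up purely to show the mechanics; they are not taken from the election data. Counts below 10 have no second digit, so I assume they are simply dropped.

```python
import math
from collections import Counter

def second_digit(count: int) -> int:
    """Second significant digit of an integer vote count of at least 10."""
    return int(str(count)[1])

def b2d_probability(d: int) -> float:
    """Benford second-digit probability for d in 0-9."""
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))

def chi_square(observed, expected) -> float:
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def b2d_chi_square(vote_counts) -> float:
    """Chi-square statistic of observed second digits against B2D."""
    digits = [second_digit(v) for v in vote_counts if v >= 10]  # drop single-digit counts
    tally = Counter(digits)
    n = len(digits)
    observed = [tally.get(d, 0) for d in range(10)]
    expected = [n * b2d_probability(d) for d in range(10)]
    return chi_square(observed, expected)

CRITICAL_VALUE = 16.92  # 9 degrees of freedom, 95% level, as quoted in the text

votes = [15234, 10432, 8321, 412, 19876, 7, 12045, 3310, 20511, 954]  # made-up counts
stat = b2d_chi_square(votes)
print(f"chi-square = {stat:.2f}, exceeds critical value: {stat > CRITICAL_VALUE}")
```

A sample this small is far too short for the test to mean anything; a real analysis would pass in the full list of candidate vote counts. Swapping `b2d_probability(d)` for a constant 0.1 gives the comparison against a uniform distribution.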
If we look at only Conservative and Liberal Democrat candidates (separately) we observe similar results, although the χ² value of 8.57 for the Conservatives when comparing to a uniform distribution is also perfectly reasonable.
However, the Labour party data better matches a uniform distribution (9.46) than the expectations of Benford (21.35).
What isn’t clear to me is why the data should be expected to follow B2D, especially when it doesn’t follow the closely-related B1D. In the case of the individual parties, the data doesn’t really span much more than one order of magnitude, whereas conformity to B1D is generally expected only of data spanning several. In addition, it’s not particularly obvious why fraudulent election data might be expected to differ from B2D.
Papers from Mebane and Shikano & Mack use simulations to investigate the extent to which election data should follow B2D. They show that, under their simulations, B2D-like results do occur but with inflated χ² statistics. That is, the simulated data follows the B2D trend but there are larger discrepancies than one would expect from “pure” Benford data. Both papers then attempt to explain why simulations differ from the expectation of Benford rather than why they agree to such an extent.
One of the supposed plus points of using Benford’s law is that it is simple – all that’s required is a list of vote counts. If you have to do simulations, using underlying knowledge of the election in question, much of that simplicity is lost. Other methods, such as examining the last digit and last two digits, are also simple and – as Deckert, Myagkov & Ordeshook point out – are backed up by a solid link between discrepancies and fraudulent activity.
I created this simple Shiny application that allows easy switching between parties, election year (2010 and 2005) and digit (first or second). I’m not sure it makes things any clearer.
It has been suggested that the distribution of second digits in electoral data can be used as a possible indicator of fraud and that this circumvents the problem associated with (roughly) constant constituency/precinct size that can lead to deviations from Benford’s law for the first digit. Data from the (presumably fraud-free) UK general election of 2010 seems to follow the posited second-digit distribution closely. However, I have yet to be convinced there is any compelling reason to expect clean electoral data to follow it in general and fraudulent election results to differ. Hence I’m currently unconvinced of the usefulness of studying second digits as a tool in election forensics.
However, I am not in any way an expert on electoral forensics, just an interested observer. So feel free to point out any glaring errors or omissions and I will happily reconsider.