The demonstration discussed in this post can be found here.
I’ve frequently found that choropleth maps tend to look great but deliver little in terms of insight. So I wanted to design something that didn’t just look kind of pretty but had the potential to be a useful tool for someone interested in the data in question. With this in mind, I needed to find a suitable dataset to experiment with. Having previously worked with the 2011 England and Wales census data, I thought it might be interesting to use it again, displaying the same data in a completely different way.
The base map
Maps of the UK in the SVG (Scalable Vector Graphics) format I required are easy to find, but not ones with the 348 regions I wanted marked on them. So I sent a polite email to the folks at the Office for National Statistics (who compiled the data) asking if they had one I could use, and thankfully they were able to oblige with something that was almost perfect. A few minor alterations were required (or at least made my life easier), but mercifully each path in the markup that described an administrative region’s boundary came with a custom attribute whose value exactly matched one of the named regions in the dataset. Without this one-to-one correspondence I would have had to spend hours going through the map’s markup and working out which region each path represented.
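To illustrate why that one-to-one correspondence matters, here is a minimal TypeScript sketch of a check that compares the names read from the map with the names in the dataset. The function name `checkRegionNames` and the `MatchReport` shape are hypothetical, not taken from the demonstration, and the post does not name the custom attribute, so the sketch simply takes the names as plain arrays.

```typescript
// Hypothetical report of names that fail to line up between map and dataset.
interface MatchReport {
  missingFromData: string[]; // on the map but absent from the dataset
  missingFromMap: string[];  // in the dataset but absent from the map
}

// Compare the region names found on the SVG paths with those in the dataset.
// In a browser, pathNames would be read from each path's custom attribute.
function checkRegionNames(pathNames: string[], dataNames: string[]): MatchReport {
  const inData = new Set(dataNames);
  const onMap = new Set(pathNames);
  return {
    missingFromData: pathNames.filter((name) => !inData.has(name)),
    missingFromMap: dataNames.filter((name) => !onMap.has(name)),
  };
}
```

If both lists in the report come back empty, every path can be coloured directly from the dataset without any manual matching.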
In an entirely non-political statement (it shaved 80 kB off the file size) I removed Scotland and Northern Ireland from the map using the highly advanced technique of highlighting the relevant paths in the markup and pressing the delete key.
As with the census scatter plot, I wanted to design an interface that was flexible enough to allow the user to select any of the 89 variables in the dataset and for the display to adapt accordingly. While all the values in the dataset are expressed in percentages, the range of values covered by the different variables varies greatly. However, I still wanted to maintain some sense of consistency between variables. In the end I plumped for an encoding scheme based on the idea of a box plot with a diverging colour scheme (more). As a result, regions that were “about average” for one variable and “about average” for a second variable would appear the same colour in both regardless of any outliers in the data or where the data as a whole was situated (say near 10% or near 90%). This differs from a colour scale based on dividing the range of the chosen variable or the entire possible range of values (i.e. from 0-100%) evenly.
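As a rough sketch of what a box-plot-based encoding could look like in TypeScript, the code below sorts each value into one of six bins. The exact bin boundaries are an assumption (quartiles plus the conventional 1.5 × IQR fences, with linearly interpolated quantiles); the post does not spell out the scheme used in the demonstration, and the names `boxStats` and `binIndex` are hypothetical.

```typescript
interface BoxStats {
  q1: number;
  median: number;
  q3: number;
  lowFence: number;  // q1 - 1.5 * IQR (assumed outlier threshold)
  highFence: number; // q3 + 1.5 * IQR (assumed outlier threshold)
}

// Linearly interpolated quantile of an already sorted, non-empty array.
function quantile(sorted: number[], p: number): number {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function boxStats(values: number[]): BoxStats {
  const sorted = values.slice().sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const median = quantile(sorted, 0.5);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  return { q1, median, q3, lowFence: q1 - 1.5 * iqr, highFence: q3 + 1.5 * iqr };
}

// Bin 0 = low outlier, 1 = lower whisker, 2 and 3 = the box either side
// of the median, 4 = upper whisker, 5 = high outlier.
function binIndex(value: number, s: BoxStats): number {
  if (value < s.lowFence) return 0;
  if (value < s.q1) return 1;
  if (value < s.median) return 2;
  if (value <= s.q3) return 3;
  if (value <= s.highFence) return 4;
  return 5;
}
```

Because the bins are defined relative to each variable's own quartiles, a region near the median lands in bin 2 or 3 whether the data sits near 10% or near 90%. A diverging palette then only needs one colour per bin.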
The original map is composed of many paths that mark out the borders of the regions and has no fill colours. Where the paths of separate regions coincide (i.e. internal borders) one gets the effect of a thicker border. There is no path that marks out England and Wales as a whole, so there is no extra thickness between the land and sea. Consequently, with a transparent (effectively white) background, external borders appear lighter than internal borders. This, to me at least, appeared a bit odd.
After experimenting with line thickness and trying (and failing) to create a land/sea border from the assembled regions using Illustrator and Inkscape, I decided the easiest way to cover this slightly odd look was to add a darker background colour to create a clear border between land and sea. This was easy to implement, but the obvious colour for the background was blue, since it was largely sea (plus the bit of Scotland I had chopped off), and that clashed with my initial choice of a blue to red (via yellow) colour scale.
A bit more experimenting led me to the conclusion that combining a green to red colour scale with a blue background looked great to my eyes. But I am a trichromat – the only vision deficiency I suffer from is myopia. Those with colour vision deficiencies (particularly red-green colour blindness) would probably disagree quite strongly. So I tried a few schemes, took a few screenshots and opened up ImageJ with the Vischeck plug-in that simulates protanopia, deuteranopia (both forms of what is frequently referred to as red-green colour blindness) and tritanopia (a.k.a. blue-yellow colour blindness).
My conclusion from this was that the green to red colour scheme would, unsurprisingly, be difficult to interpret for those with red-green colour blindness. However, I still found the green to red scheme with the blue background the most aesthetically pleasing. In the end I decided to implement two colour schemes – green to red with a pale blue background and blue to red with a medium grey background – that can be switched between via a drop-down menu. If you are reading this and you do suffer from colour blindness then please let me know what you make of both; otherwise the simulations are shown below.
Enter “choropleth map” into the search bar of Google and take a look at the results under “images”. You’ll find many of the results include a key based on a stack of appropriately coloured boxes with numerical ranges on one side or the other. To me this is missing a trick when the encoding is based on an interval rather than categorical scale as the processing of information in the key is more verbal than visual. Using a number line or axis makes the ranges covered more apparent, which is especially important when the ranges covered by the boxes differ. Rather than rescaling the axis for each change of variable, however, I chose to use a single axis running from 0 to 100% and shift and resize the boxes accordingly.
The upside of this persistence is that once the user is familiar with how the axis stretches across the page, a quick glance at it is all that is required to get an idea of how the data is spread out and where the bulk of it lies (see the top three examples in the figure below). The downside is that, for a few variables (e.g. the percentage of respondents who identified themselves as Sikh, fourth example below), the boxes are squished up together at the left-hand edge. Here one probably needs to refer to the table to understand what the colour encoding really represents.
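The idea of a fixed axis with boxes that shift and resize can be sketched in a few lines of TypeScript. This is an illustration under assumptions, not the demonstration's code: `keyBoxes`, `KeyBox` and the pixel width of the axis are hypothetical, and the bin edges are taken to be percentages in ascending order.

```typescript
interface KeyBox {
  x: number;     // left edge in pixels
  width: number; // width in pixels
}

// Place one box per pair of adjacent bin edges on an axis that always
// spans 0 to 100%, so the axis itself never rescales between variables.
function keyBoxes(edges: number[], axisWidthPx: number): KeyBox[] {
  const toPx = (percent: number): number => (percent * axisWidthPx) / 100;
  const boxes: KeyBox[] = [];
  for (let i = 0; i + 1 < edges.length; i++) {
    const x = toPx(edges[i]);
    boxes.push({ x, width: toPx(edges[i + 1]) - x });
  }
  return boxes;
}
```

With edges such as 0, 0.2, 0.4 and 0.9 on a 500-pixel axis, all of the boxes fall within the first five pixels, which is exactly the squished-up case described above.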
The table is a fairly straightforward collection of region names and values. The default ordering is alphabetical so that if you have a region in mind whose value for a given variable you want to find then it’s just a case of scrolling through until you reach the appropriate letter. However, it can also be sorted into ascending or descending order by value with ties then sorted into alphabetical and reverse alphabetical order, respectively.
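The three orderings can be expressed as comparators. The TypeScript below is a minimal sketch with hypothetical names (`Row`, `SortMode`, `sortRows`); it assumes ties in the descending order are broken in reverse alphabetical order, as described above.

```typescript
interface Row {
  region: string;
  value: number;
}

type SortMode = "alphabetical" | "ascending" | "descending";

// Return a sorted copy of the rows; the input array is left untouched.
function sortRows(rows: Row[], mode: SortMode): Row[] {
  const byName = (a: Row, b: Row): number => a.region.localeCompare(b.region);
  const sorted = rows.slice();
  if (mode === "alphabetical") {
    return sorted.sort(byName);
  }
  if (mode === "ascending") {
    // Value first; equal values fall back to alphabetical order.
    return sorted.sort((a, b) => a.value - b.value || byName(a, b));
  }
  // Descending by value; equal values fall back to reverse alphabetical order.
  return sorted.sort((a, b) => b.value - a.value || byName(b, a));
}
```

A pleasant side effect of this tie-breaking is that the descending list is the exact reverse of the ascending one.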
Linking map, table and key
On their own the map and table may be useful tools – the former highlights groups of geographical regions that share similar values for a given variable, the latter provides precise values for that variable in an easy-to-look-up manner. But more power can be gained by combining them and linking them dynamically based on user interaction.
Put simply, I added the ability to click on a region of the map and highlight both it and the corresponding entry in the table; or click on a table entry for the same effect – for those who, like me, have an imperfect knowledge of the political geography of England and Wales. With the key already designed with an integrated axis it seemed logical to also add a vertical marker to it at the position corresponding to the value of the chosen variable for the selected region.
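One common way to keep several views in step is to route every click through a single shared selection that each view listens to. The TypeScript below sketches that pattern; it is an assumption about how such linking could be wired up, not a description of the demonstration's code, and `SelectionStore` is a hypothetical name.

```typescript
type SelectionListener = (region: string) => void;

// A tiny shared store: the map, table and key each subscribe once and
// are notified whenever any of them changes the selected region.
class SelectionStore {
  private selected: string | null = null;
  private listeners: SelectionListener[] = [];

  subscribe(listener: SelectionListener): void {
    this.listeners.push(listener);
  }

  select(region: string): void {
    this.selected = region;
    for (const listener of this.listeners) {
      listener(region);
    }
  }

  current(): string | null {
    return this.selected;
  }
}
```

With this in place, a click handler on a map path and a click handler on a table row would both simply call `select` with the region's name, and each view would redraw its own highlight or marker in its listener.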
I also experimented with adding a very small cumulative distribution plot to the key, with the vertical extent of the boxes representing the range from 0 to 100% of the data (rather than the 0 to 100% of the values themselves shown on the horizontal axis). For some variables this worked reasonably well (top of figure below); for others it was just too cramped to be of much use (third example below). Instead I settled for a horizontal marker, based on the same idea but specific to the selected data point (second and fourth examples below). That is to say, the height of the horizontal marker above the base of the boxes is proportional to the data value’s position in the ordered list of all values for the chosen variable.
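The two markers reduce to a pair of small calculations, sketched here in TypeScript. The names and pixel dimensions are hypothetical, and the treatment of rank is an assumption: the position in the ordered list is taken as the number of values less than or equal to the selected one.

```typescript
interface MarkerPosition {
  x: number;               // vertical marker: distance along the 0-100% axis
  heightAboveBase: number; // horizontal marker: height above the base of the boxes
}

// values holds every region's percentage for the chosen variable;
// selected is the percentage for the region the user clicked on.
function markerPosition(
  values: number[],
  selected: number,
  axisWidthPx: number,
  boxHeightPx: number
): MarkerPosition {
  const rank = values.filter((v) => v <= selected).length;
  const fraction = values.length > 0 ? rank / values.length : 0;
  return {
    x: (selected * axisWidthPx) / 100,
    heightAboveBase: fraction * boxHeightPx,
  };
}
```

The vertical marker answers "what is the value?", while the horizontal one answers "how many regions sit at or below it?", which is the single-point equivalent of reading a cumulative distribution plot.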
One persistent issue with choropleth maps is that the regions that are coloured are rarely equally sized. There isn’t, therefore, a one-to-one match between the amount of a certain colour in a map and the prevalence of whatever that colour represents. It also means that some regions may be difficult to locate. For example, when the demonstration is first loaded, the variable encoded in the map is the percentage of respondents who said they were “Economically active: Self-employed”. On the axis at the bottom the biggest box is coloured red, but it may not be obvious that there are any red regions on the map. There is, in fact, only one: the Isles of Scilly – the tiny collection of islands off the south-western extremity of England. If you browse through the variables you will find that the Isles of Scilly fall in the outlier region (high or low) quite frequently. There is a similar issue with some of the tiny regions of inner London.
To make it easier to identify where these outliers actually are I added a little interactivity to the key itself. When you click on a box, all the regions that are not the same colour as the box are set to white and their entries are filtered from the table. This makes it much easier to find small regions that are outliers.
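The filtering step can be sketched as a pure function in TypeScript. Everything named here is hypothetical (`RegionBin`, `filterByBin`, and the idea that each region carries a bin index into a palette); the demonstration may well do this differently.

```typescript
interface RegionBin {
  region: string;
  bin: number; // index of the colour box this region falls into
}

interface FilterResult {
  fills: Record<string, string>; // fill colour per region for the map
  tableRows: string[];           // regions that stay visible in the table
}

// Keep the clicked box's regions in colour, blank out the rest,
// and list only the matching regions for the table.
function filterByBin(regions: RegionBin[], palette: string[], clickedBin: number): FilterResult {
  const fills: Record<string, string> = {};
  const tableRows: string[] = [];
  for (const r of regions) {
    if (r.bin === clickedBin) {
      fills[r.region] = palette[r.bin];
      tableRows.push(r.region);
    } else {
      fills[r.region] = "#ffffff"; // white, as described above
    }
  }
  return { fills, tableRows };
}
```

Returning both the fills and the table rows from one pass keeps the map and table consistent, since they are derived from the same test.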