Showing Marginal Distributions

Taking a bottom-up approach to chart design creates a framework in which custom enhancements to standard plots are reasonably straightforward. As a base example, take the following simple scatter plot:

Simple scatter plot of UK census data.

This is 2011 census data for England & Wales (data published by the Office for National Statistics, available here). Each black circle represents a different administrative area and is positioned according to the percentage of respondents who were self-employed (horizontal axis) and unemployed (vertical axis) in that area. The scaling of the axes has been deliberately chosen so that one percentage point in the horizontal direction is equal to one percentage point in the vertical direction.
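The equal scaling of the two axes can be expressed as a simple proportion. The following is a minimal sketch, not the code behind the figure: the function name, the axis ranges and the plot width are all hypothetical, chosen only to illustrate the calculation.

```python
def plot_height_for_equal_scaling(x_range, y_range, plot_width):
    """Return the plot height that gives both axes the same length per unit.

    x_range and y_range are (low, high) pairs in percentage points;
    plot_width is in any length unit (pixels, points, ...).
    """
    x_span = x_range[1] - x_range[0]
    y_span = y_range[1] - y_range[0]
    if x_span <= 0 or y_span <= 0:
        raise ValueError("each range must have high > low")
    # One percentage point covers plot_width / x_span length units
    # horizontally, so the vertical axis needs y_span times that length.
    return plot_width * y_span / x_span


# Hypothetical axis ranges, not taken from the census figure.
height = plot_height_for_equal_scaling((0, 30), (0, 10), plot_width=600)
print(height)  # 200.0
```

With these assumed ranges the horizontal axis covers three times as many percentage points as the vertical one, so the plot area is three times as wide as it is tall.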

The plot highlights a number of aspects of the data, including (unsurprisingly) a general negative correlation between the two variables and an extreme outlier on the right, which corresponds to the Isles of Scilly administrative region.

A number of possible enhancements can be made to the plot to clarify information about the marginal distributions. Some of these are described below in order of increasing detail.

Range frames (see E. Tufte, The Visual Display of Quantitative Information for more details) graphically reinforce data about the ranges of the variables by limiting each axis line to the range of the corresponding variable:

Scatter plot with range frame.
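The only quantities a range frame needs are the minimum and maximum of each variable. A minimal sketch follows, assuming the data are held in two plain lists; the function name and the sample values are hypothetical and are not the census figures.

```python
def range_frame(xs, ys):
    """Return the end points of each axis line for a range frame.

    Each axis line runs only from the minimum to the maximum of its variable.
    """
    if not xs or not ys:
        raise ValueError("both variables need at least one value")
    return {"x": (min(xs), max(xs)), "y": (min(ys), max(ys))}


# Hypothetical percentages, not the census data.
self_employed = [6.5, 9.8, 12.1, 8.4, 25.0]
unemployed = [5.9, 4.2, 3.1, 4.8, 1.6]

frame = range_frame(self_employed, unemployed)
print(frame)  # {'x': (6.5, 25.0), 'y': (1.6, 5.9)}
```

In a bottom-up setting these pairs are simply the end points of the two axis lines. In a library such as matplotlib the same pairs could be passed to a spine's `set_bounds` method, although that is not necessarily how the figure above was produced.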

It is also straightforward to use box plots and guidelines for a more detailed summary of the marginal distributions:

Scatter plot with box plots.
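A box plot for each margin reduces the variable to a handful of summary values. The sketch below computes a five-number summary with Python's standard `statistics` module. It assumes the "inclusive" quartile convention, and whisker conventions also vary between tools, so the figure above may have been built with different definitions. The sample values are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    # n=4 splits the data into quartiles and returns the three cut points.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical unemployment percentages, not the census data.
unemployed = [3.0, 4.0, 5.0, 6.0, 7.0]
summary = five_number_summary(unemployed)
print(summary)  # (3.0, 4.0, 5.0, 6.0, 7.0)
```

The guidelines in such a plot can then be drawn from the quartile positions on each axis into the scatter plot area.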

A less abstract alternative is to bin the marginal distributions and plot the corresponding histograms:

Scatter plot with histograms.
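Binning a marginal distribution amounts to counting how many values fall into each interval along the axis. The following is a minimal sketch assuming equal-width bins over a fixed axis range; the bin count, the range and the sample values are hypothetical choices, not those used for the figure.

```python
def histogram(values, lo, hi, bins):
    """Count values in `bins` equal-width bins spanning [lo, hi].

    Values outside [lo, hi] are ignored; a value equal to `hi`
    is counted in the last bin.
    """
    if bins < 1 or hi <= lo:
        raise ValueError("need bins >= 1 and hi > lo")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v <= hi:
            index = min(int((v - lo) / width), bins - 1)
            counts[index] += 1
    return counts


# Hypothetical unemployment percentages, not the census data.
unemployed = [1.6, 3.1, 4.2, 4.8, 5.9]
counts = histogram(unemployed, lo=0, hi=8, bins=4)
print(counts)  # [1, 1, 3, 0]
```

Each count then becomes the length of one bar drawn alongside the corresponding axis, with the bin edges aligned to that axis scale.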

Finally, the exact marginal cumulative distributions can be shown together with appropriate indicators such as the positions of the first, second (i.e. median) and third quartiles (labeled Q1, Q2 and Q3, respectively):

Scatter plot with cumulative distributions.
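The exact cumulative distribution is a step function that rises by 1/n at each observed value, so no binning is needed. The sketch below assumes one particular quartile definition (the smallest observed value at which the cumulative fraction reaches 25%, 50% or 75%); other conventions interpolate between values and may give slightly different positions. The function names and data are hypothetical.

```python
def ecdf(values):
    """Return (value, cumulative fraction) pairs in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


def quantile_from_ecdf(steps, p):
    """Return the smallest value whose cumulative fraction is at least p."""
    for value, fraction in steps:
        if fraction >= p:
            return value
    raise ValueError("p must lie in (0, 1] and steps must be non-empty")


# Hypothetical unemployment percentages, not the census data.
unemployed = [4.8, 1.6, 5.9, 3.1]
steps = ecdf(unemployed)
q1, q2, q3 = (quantile_from_ecdf(steps, p) for p in (0.25, 0.5, 0.75))
print(steps)       # [(1.6, 0.25), (3.1, 0.5), (4.8, 0.75), (5.9, 1.0)]
print(q1, q2, q3)  # 1.6 3.1 4.8
```

Plotting the steps against the shared axis gives the cumulative curve, and the three quantile values mark where the Q1, Q2 and Q3 indicators would be placed.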

Each of these alternatives has its advantages and downsides. The range frame example highlights more information than the original plot without increasing the size of the figure at all. With a bottom-up approach it is a very simple change to implement, but any extra insight gained will be limited.

Conversely, the latter two examples explicitly display a lot more information that, although present in the original scatter plot, is difficult to decode from it. The downside is a larger footprint for the figure (or a smaller scatter plot). The question is thus whether the extra information conveyed is worth the “cost” of the change in dimensions.

As one might expect, it is straightforward to encode a categorical variable in the scatter plot. In the next figure administrative areas that are in Wales are marked with red circles while those in England are light blue.

Scatter plot with categorical encoding through colour.

However, it is also possible to encode categorical data in the marginal histograms. In the following case the red bars encode the data for Wales while the black bars encode the distribution for both England and Wales (as before).

Scatter plot with histograms and categorical encoding through colour.
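For the coloured bars to be comparable with the black ones, both sets of counts need to share the same bin edges. The following is a minimal sketch of that idea, assuming each area is stored as a (value, country) pair; the function names, bin settings and data are hypothetical.

```python
def bin_index(value, lo, hi, bins):
    """Return the index of the equal-width bin in [lo, hi] holding value."""
    width = (hi - lo) / bins
    return min(int((value - lo) / width), bins - 1)


def overall_and_subset_counts(rows, subset, lo, hi, bins):
    """Count all rows and the rows in `subset`, using shared bins."""
    overall = [0] * bins
    highlighted = [0] * bins
    for value, category in rows:
        if not lo <= value <= hi:
            continue  # ignore values outside the axis range
        i = bin_index(value, lo, hi, bins)
        overall[i] += 1
        if category == subset:
            highlighted[i] += 1
    return overall, highlighted


# Hypothetical unemployment percentages, not the census data.
rows = [(1.6, "England"), (3.1, "England"), (4.2, "Wales"),
        (4.8, "England"), (5.9, "Wales")]
overall, wales = overall_and_subset_counts(rows, "Wales", lo=0, hi=8, bins=4)
print(overall)  # [1, 1, 3, 0]
print(wales)    # [0, 0, 2, 0]
```

Because every subset count is at most the overall count for the same bin, the red bars can be drawn over the black ones without hiding the total.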

This figure illustrates the general relationship between unemployment and self-employment, the marginal distribution for self-employment, the marginal distribution for unemployment, the marginal distribution for self-employment in Wales and the marginal distribution for unemployment in Wales.
