3. “The Technical Tools of Statistics” read at the 125th Anniversary Meeting of the American Statistical Association,
Boston, November 1964, published in April 1965 American Statistician.
http://cm.bell-labs.com/cm/ms/departments/sia/tukey/memo/techtools.html
/via Adam Cooper, “Exploratory Data Analysis”
http://blogs.cetis.ac.uk/adam/2012/05/18/exploratory-data-analysis/
John Tukey
“journeyman
carpenter of data-analytical
tools”
4. “A Boy's Work is Never Done”, KellyB. (flickr: foreverphoto/2467694199/)
5. “Exploratory data analysis
is an attitude,
a flexibility,
and reliance on display,
not a bundle of techniques
and should be so taught.”
John Tukey
Tukey, John W. "We need both exploratory and confirmatory." The
American Statistician 34.1 (1980): 23-25.
http://www.ece.rice.edu/~fk1/classes/ELEC697/TukeyEDA.pdf
6. “I … cannot disagree strongly enough with statements
about the dangers of putting powerful tools in the
hands of novices. Computer algebra, statistics, and
graphics systems provide plenty of rope for novices to
hang themselves and may even help to inhibit the
learning of essential skills needed by researchers. The
obvious problems caused by this situation do not
justify blunting our tools, however. They require better
education in the imaginative and disciplined use of
these tools. And they call for more attention to the
way powerful and sophisticated tools are presented to
novice users.”
Leland Wilkinson, The Grammar of Graphics, Springer-Verlag, 1999,
ISBN 0-387-98774-6, p15-16.
21. See also: IPython notebook demo
http://nbviewer.ipython.org/gist/psychemedia/9c54721e853403b43d21/pivotTable_demo.ipynb
22. “There is no more reason to expect
one graph to ‘tell all’ than to expect
one number to do the same.”
-- John Tukey
23. If quantities are conserved,
can you think of them in terms of flow?
24. “[T]he picture examining eye
is the best finder we have
of the wholly unanticipated.”
Tukey, John W. "We need both exploratory and confirmatory." The
American Statistician 34.1 (1980): 23-25.
http://www.ece.rice.edu/~fk1/classes/ELEC697/TukeyEDA.pdf
John Tukey
48. “Hand-drawing of graphs, except
perhaps for reproduction in books
and in some journals, is now
economically wasteful, slow, and
on the way out.”
– John Tukey
Wikipedia – Journeyman:
“A journeyman is an individual who has completed an apprenticeship and is fully educated in a trade or craft, but not yet a master. To become a master, a journeyman has to submit a master work piece to a guild for evaluation and be admitted to the guild as a master.
“In parts of Europe, as in later medieval Germany, spending time as a wandering journeyman (Wandergeselle), moving from one town to another to gain experience of different workshops, was an important part of the training of an aspirant master. Carpenters in Germany have retained the tradition of traveling journeymen even today, although only a few still practice.”
Bar charts are a very effective way of displaying particular sorts of information, such as counts. But what other ways are there of displaying data?
Datawrapper provides a variety of chart types, including:
horizontal and vertical (column) bar charts,
grouped bars that collate different bars according to groups (for example, election on election percentage of the vote for different political parties),
stacked column charts (for example, for a selection of countries we could display a column showing the total number of medals constructed by stacking the individual gold, silver and bronze medal counts for those countries)
line charts, which are widely used for plotting some value on the vertical y-axis against time on the horizontal x-axis
pie charts, to show proportions of a whole, and variants thereof, such as the donut chart (a pie chart with the middle cut out)
simple data tables (never underestimate the power of a table – they can be really useful for showing specific values, and can be very powerful when allowing the user to sort the table either by ascending or descending values in particular columns)
maps, which as we shall see, can draw out very powerful relationships across data elements.
We’ve also seen some other “basic” charts that can be useful for displaying the distribution of data elements:
the block histogram shows a count on the y-axis of data elements falling within particular ranges of values on the x-axis
the scatterplot allows us to plot two values against each other, for example height versus weight. These charts can provide us with clues about possible correlations or relationships between the two values. Some scatterplot tools further allow us to colour each point according to group membership so that we can look to see whether numbers are clustered or grouped according to group membership.
Visualising data is a powerful way of asking questions of data – what data points you choose to display and how you display them represent the framing of the question. What the data looks like is the response, but a response that often takes careful reading. The data source has drawn you the answer – you need to turn it into words that you can use to formulate further questions to check your understanding of the answer first provided. (Each question (each chart) typically leads to another… or more than one other…)
Asking questions that have a graphical answer is one way of querying a data source – but are there other approaches?
Let’s explore that a little more – what do we mean by asking questions of data?
Custom search engines are a powerful tool for helping us develop focused web search tools that limit results to a particular part of the web we are interested in, either by location or topic.
We can also use (advanced) search limits in ‘everyday’ web queries using the major web search engines.
For example, the query shown on this slide searches for the word underspend appearing in Excel spreadsheets (filetype:xls) that can be found on UK government websites (or more specifically, websites hosted on the gov.uk domain (site:gov.uk)).
Another query limit combination I have found useful is:
confidential filetype:ppt
This can turn up presentations that have been delivered at closed corporate events but that have leaked on to the web…
Even if you don’t consider yourself a geek or database expert, writing advanced search queries using search limits is but a small step away from writing queries over databases themselves.
One of the most widely used languages for querying databases is SQL. The above slide shows a simple, made-up SQL query that, run over a very simple search engine database, could have a similar effect to the search engine query.
The idea is that we select those webPages where the text content of the webpage contains the word underspend anywhere – the % signs denote wildcard characters so the underspend word can appear preceded or followed by any number of arbitrary characters. We also want the query to be limited to pages that have a particular filetype and domain.
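The query described above can be tried out on a toy database. The following is a minimal sketch using Python's built-in sqlite3 module; the table name webPages is taken from the description, but the column names (url, domain, filetype, content) and the example rows are assumptions made up for illustration, not the schema shown on the slide. A real site:gov.uk limit would also match subdomains, which this simplified domain column does not model.

```python
import sqlite3

# Toy "search engine database" held in memory.
# Column names and rows are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE webPages (url TEXT, domain TEXT, filetype TEXT, content TEXT)"
)
conn.executemany(
    "INSERT INTO webPages VALUES (?, ?, ?, ?)",
    [
        ("http://example.gov.uk/budget.xls", "gov.uk", "xls", "Departmental underspend by quarter"),
        ("http://example.gov.uk/report.pdf", "gov.uk", "pdf", "Notes on the underspend"),
        ("http://example.com/sheet.xls", "example.com", "xls", "underspend forecast"),
        ("http://example.gov.uk/staff.xls", "gov.uk", "xls", "Staff directory"),
    ],
)

# The % signs are wildcards: 'underspend' may be preceded or followed
# by any number of arbitrary characters.
query = """
    SELECT url FROM webPages
    WHERE content LIKE '%underspend%'
      AND filetype = 'xls'
      AND domain = 'gov.uk'
"""
matches = [row[0] for row in conn.execute(query)]
print(matches)
```

Only the first row satisfies all three conditions: the others fail on filetype, on domain, or on content.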
Far more complicated queries can be written over far more complex databases. What’s important is that you develop an idea of what sorts of database structure and query are possible, not necessarily that you can run and query such databases yourself.
For more examples, see:
Asking Questions of Data – Garment Factories Data Expedition – http://schoolofdata.org/2013/05/24/asking-questions-of-data-garment-factories-data-expedition/
Asking Questions of Data – Some Simple One-Liners http://schoolofdata.org/2013/05/13/asking-questions-of-data-some-simple-one-liners/
One of the simplest, but often one of the most useful, things we can do is to count things. You just need to be creative in what you count!
One of the nice features of working with database query languages such as SQL is that we can write queries that count the number of responses and allow us to rank results on that basis. For example, in a database of public spending transactions with different companies, we could count the number of transactions with a particular company, sum the value of transactions carried out with a particular company, or find the companies with the largest total amount spent.
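As a rough sketch of counting, summing and ranking in SQL, the example below again uses Python's sqlite3 module. The transactions table, its columns and the supplier names and amounts are all invented for illustration; they are not taken from any real spending dataset.

```python
import sqlite3

# Hypothetical spending transactions (supplier, amount in pounds).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (supplier TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [
        ("Acme Books", 1200.0), ("Acme Books", 300.0),
        ("Island Telecom", 2500.0),
        ("Paper Co", 50.0), ("Paper Co", 75.0), ("Paper Co", 25.0),
    ],
)

# Count transactions and sum their value per supplier,
# then rank suppliers by total spend, largest first.
ranked = conn.execute(
    """
    SELECT supplier, COUNT(*) AS n_transactions, SUM(amount) AS total
    FROM transactions
    GROUP BY supplier
    ORDER BY total DESC
    """
).fetchall()

for supplier, n, total in ranked:
    print(supplier, n, total)
```

Changing the ORDER BY clause to n_transactions would rank by the number of payments rather than their value, which can tell quite a different story.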
This further refinement of the same graphic shows how the two values can be compared.
On the left, each column is rank ordered and lines connect similar items, offering a direct column-based comparison.
On the right, the ordering is according to the rank order of the right hand column, allowing direct comparison across the rows.
As has already been mentioned, a key part of the journalistic exercise is putting things into context.
When working with data, interpreting what the data says often depends on understanding the context and more importantly, the caveats, that arise by virtue of asking a particular question of a particular dataset that has been collected in a particular way under particular conditions.
That said, given a particular data set, are there any obvious questions we can ask of it?
When results are ranked, as for example in the case of league tables, there are often easy-picking stories to be had around the top three and bottom three positions. In national rankings, local news stories can be identified if your local schools or council appear at either of those extremes.
For contextualisation purposes, it often makes sense to look at distributions. Many summary statistics report on the mean value, but looking at measures of variation, or spread, about a mean, as well as the position of a median value, can often change the context of a story.
If the lecture room has 20 students in it, each on an income of a £6,000 maintenance loan per year, the total income is £120,000 and their mean income is £6,000. If an academic in the room is on £40,000, the total income for the room is £160,000. With 21 people in the room, the mean income is now just a little over £7,500. If we define a poverty level as a mean income below £10,000, the members of the room are, on average, in poverty. If a senior academic such as a professor on an income over £65,000 wanders into the room, the total income goes to over £225,000. With 22 people now in the room, the mean income is now over £10,000: the room is out of poverty. The median income, however, is still £6,000.
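The arithmetic in the lecture room example can be checked with a few lines of Python. This sketch uses the figures given above, taking the professor's income as exactly £65,000 (the lower bound of "over £65,000") as an assumption.

```python
from statistics import mean, median

students = [6000] * 20               # 20 students on a £6,000 maintenance loan
with_academic = students + [40000]   # an academic on £40,000 is also in the room
with_professor = with_academic + [65000]  # assumed: professor on exactly £65,000

for label, room in [
    ("students only", students),
    ("plus academic", with_academic),
    ("plus professor", with_professor),
]:
    # The mean is pulled up by each high earner; the median barely notices them.
    print(label, len(room), sum(room), round(mean(room), 2), median(room))
```

The mean crosses the £10,000 line only because of two people; the median shows that the typical person in the room is no better off.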
As well as top, bottom, mean and median, we should also look to outliers. If Bill Gates or Mark Zuckerberg walks into a bar, the average net worth of people in that bar is likely to go up to a level of previously unimagined wealth.
Here are several reasons why you should pay attention to outliers:
they may be ‘dirty’ or incorrect data points that need to be corrected and that may well raise questions about data quality;
the outlier may truly be an outlier, a remarkable point and a story in its own right;
the outlier may skew other measures, such as mean values or other summary statistics. In such cases, it may make sense to use other measures or to rerun the summary statistic without including the outlier values to get a better feel for how the other members of the distribution relate to each other.
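One way of rerunning a summary statistic without the outliers is sketched below. It flags values lying more than 1.5 times the interquartile range beyond the quartiles, a common rule of thumb that is swapped in here for illustration; the notes above do not prescribe any particular rule. The net worth figures (in thousands of pounds) are invented to mimic a very rich customer walking into the bar.

```python
from statistics import mean, quantiles

# Hypothetical net worths in a bar, in thousands of pounds.
net_worth = [30, 35, 40, 42, 45, 50, 55, 60, 100000]

# Quartiles and the interquartile range (IQR).
q1, _, q3 = quantiles(net_worth, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Split the data into flagged outliers and everything else.
outliers = [x for x in net_worth if x < low_fence or x > high_fence]
kept = [x for x in net_worth if low_fence <= x <= high_fence]

print("outliers:", outliers)
print("mean with outliers:", round(mean(net_worth), 2))
print("mean without outliers:", mean(kept))
```

Flagged values should be inspected rather than silently dropped: as noted above, an outlier may be a data error, or it may be the story.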
This rather dense graphic is a view over local council spending data in my local area as relates to spend on libraries. The separate charts show the accumulated spend over a period of time with different suppliers. The intention of the display was to provide at a glance a view of accumulated spend with different companies across different directorates and spending areas to see whether any companies had a significant spend compared to other companies.
The table at the bottom shows the top of a league table of companies with the largest accumulated spend by directorate and expense type.
At first glance, the spend on phone lines with different suppliers seems to outweigh the spend on books. How can that be? Are the librarians spending their time calling premium rate phone lines?
If we guess at 20 libraries and a 6 month spend period, then assume that the phone lines correspond to broadband data bills, do the monthly payments per library still seem outrageous? These assumptions are testable via questions to the relevant authorities, of course, but demonstrate the care we need to take when trying to understand why a number that may appear to be large is that large.
See also: Local Council Spending Data – Time Series Charts http://blog.ouseful.info/2013/11/06/local-council-spending-data-time-series-charts/
As well as looking for outliers, we should also look for similarities between things we expect to be different and differences between things we expect to be the same, or at least, similar.
Looking again at some of my local council’s spending data, I noticed a search on “music” pulled back what appeared to be a shift in responsibility between directorates for spend on school music service provision.
An obvious question that follows is: if the service did change hands (something we can check), was there a resulting difference in the way that the directorates were spending? Could we, for example, identify whether any projects got dropped (or at least, renamed out of scope!)?
This forensic approach can also be used to track the consequences of a shift in control of a service, if we know it has happened. When a service changes hands, we can keep a note of the fact and then, a year on, look for evidence of whether treatment of the service has changed, at least in its consequences for spending.
See also: What Role, If Any, Does Spending Data Have to Play in Local Council Budget Consultations? http://blog.ouseful.info/2013/11/03/what-role-if-any-does-spending-data-have-to-play-in-local-council-budget-consultations/
If you are in the position of paying energy supply bills – electricity and gas – you’ll probably be familiar with the idea that payments are set so you tend to overpay on a monthly basis. After collecting the interest on your overpayments, the utility companies may eventually get round to sending you a small repayment to cover the excess (ex- of any interest, of course…).
Is the same true at the council level?
One thing I noticed in the spend my local council made with supplier Southern Electric was that there appeared to be more than a few “negative payments”. So where were these coming from? The chart shown in this slide has positive payments made by date (not ordered on an evenly spaced timeline) in black, and the magnitude of negative payments shown in red. Where a red triangle sits over a black dot, this shows that a positive and negative payment of the same amount were made on the same day. Why’s that?
Some days show several negative payments – again, what’s happening? There’s not necessarily anything suspicious going on, but what story does this chart appear to tell us, particularly in terms of the similarities in amount of certain positive and negative spends?
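The same questions can be asked of the raw payment records rather than the chart. The sketch below looks for negative payments that exactly cancel a positive payment on the same day, and for days with several negative payments. The dates and amounts are made up for illustration and are not the council's actual figures; with real data it would be safer to hold amounts in pence or as Decimal values rather than floats.

```python
from collections import Counter

# Hypothetical (date, amount) payment records for one supplier.
payments = [
    ("2013-04-02", 1500.00),
    ("2013-04-02", -1500.00),
    ("2013-04-02", 320.50),
    ("2013-05-07", -80.00),
    ("2013-05-07", -45.25),
    ("2013-06-11", 410.00),
]

# Index positive payments by (date, amount) so matching negatives are easy to find.
positives = Counter((day, amount) for day, amount in payments if amount > 0)

# Negative payments that cancel a positive payment made on the same day.
cancelling = [
    (day, -amount)
    for day, amount in payments
    if amount < 0 and positives[(day, -amount)] > 0
]

# Days on which more than one negative payment was recorded.
negatives_per_day = Counter(day for day, amount in payments if amount < 0)
busy_days = sorted(day for day, n in negatives_per_day.items() if n > 1)

print("same-day cancelling pairs:", cancelling)
print("days with several negative payments:", busy_days)
```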
Just by the by, this chart refines the question I’m asking of the spend with Southern Electric, asking for more information about positive and negative payments made on the gas and electricity accounts separately.
As well as similarities and differences, data can tell us tales about trends…
Regular releases from the ONS – the Office for National Statistics – provide bread and butter news stories on a regular basis according to a known schedule.
For example, monthly job seeker figures get a monthly write-up in OnTheWight, the hyperlocal news blog local to me. The report makes a comparison between the current figures and figures from the previous month and from the same month of the previous year. The aim is to show how the numbers have changed month on month, and year on year.
I started to explore a simple script that would take data directly from the ONS and produce assets that could be reused in a news story – for example, to produce a table showing the change in figures over recent months.
I also started to explore ways in which we could automate the production of prose from the data [code: https://gist.github.com/psychemedia/7536017]. For example, the following phrase was generated automatically from monthly figures:
The total number of people claiming Job Seeker's Allowance (JSA) on the Isle of Wight in October was 2781, up 94 from 2687 in September, 2013, and down 377 from 3158 in October, 2012.
The words up and down were selected based on a simple if-then rule that compared figures to see which was the greater. The numbers and dates are pulled in from the data. The other words are canned phrases.
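The approach described above can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the code in the linked gist: the function names change_phrase and jsa_sentence are hypothetical, and the figures are the ones quoted in the example sentence.

```python
def change_phrase(current, previous, when):
    """Describe how `current` compares with `previous` using a simple if-then rule."""
    diff = current - previous
    if diff > 0:
        return f"up {diff} from {previous} in {when}"
    if diff < 0:
        return f"down {-diff} from {previous} in {when}"
    return f"unchanged from {previous} in {when}"


def jsa_sentence(month, current, last_month, last_month_label, last_year, last_year_label):
    """Fill a canned sentence template with numbers and dates from the data."""
    return (
        "The total number of people claiming Job Seeker's Allowance (JSA) "
        f"on the Isle of Wight in {month} was {current}, "
        f"{change_phrase(current, last_month, last_month_label)}, and "
        f"{change_phrase(current, last_year, last_year_label)}."
    )


sentence = jsa_sentence(
    "October", 2781, 2687, "September, 2013", 3158, "October, 2012"
)
print(sentence)
```

The "unchanged" branch is an addition of this sketch: a template needs a sensible phrase for every case the data can throw at it, including no change at all.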
The automated production of text from data is something that has received attention from several companies, particularly in the area of baseball reports and financial reporting. See for example: http://blog.ouseful.info/2013/05/22/notes-on-narrative-science-and-automated-insight/
Being able to define sentences and natural language constructions that can be used as templates to display data in textual form is a skill that could well feed into specialist areas of data driven reporting. Identifying the patterns in the data that can be mapped onto natural language explanations of those patterns in a reliable way is another area in which wordsmiths, statisticians and developers may have to work together in the future.
If we plot a line chart with some quantity against a time axis, we can often see increasing or decreasing trends over time. If we are looking for constant rates of increase in some value, it often makes sense to use a logarithmic (log) scale to display that value on the y-axis. Periodic trends can also be seen as ‘waves’ appearing in the line over time, but other displays can draw out periodicity or seasonality in a more visually compelling way.
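To see why a log scale suits constant rates of increase, consider a value that grows by the same percentage each period. On a linear scale the steps get bigger and bigger; on a log scale every step is the same size, so the line is straight. The sketch below uses an assumed starting value of 1000 and an assumed growth rate of 5% per period, chosen purely for illustration.

```python
import math

start, rate = 1000.0, 0.05  # hypothetical starting value and 5% growth per period
values = [start * (1 + rate) ** t for t in range(6)]

# Step sizes between consecutive periods on a linear scale...
linear_steps = [b - a for a, b in zip(values, values[1:])]
# ...and on a log (base 10) scale.
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(values, values[1:])]

print([round(s, 2) for s in linear_steps])  # steps keep growing
print([round(s, 4) for s in log_steps])     # steps are all equal
```

Each log step equals log10(1 + rate), so the slope of the straight line on a log-scaled chart can be read directly as the growth rate.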
For example, in these charts – of jobless figures on the Isle of Wight once again – we have months ordered along the horizontal x-axis and the number of job allowance claimants on the vertical y-axis. The separate coloured lines represent different years. On the left, we use a legend to identify the lines, on the right is an example of labeling the lines directly.
The lines show strong seasonality in behaviour. Being a tourist destination, job seeker figures tend to fall over the summer months. Putting lines for several years on the same axis allows us to compare annual cycles over time.
Another trend we can try to pull out is change over years for each given month. Here, the horizontal x-axis blocks out the months, as before, but within each month we have an ordered range of years. The line within each block thus represents the year-on-year change in numbers within a given month.
The step change within each month suggests that the way the figures were calculated changed significantly several years ago.
Further reading: a good guide to statistics as used by government, including a description of the way that “seasonal adjustments” are handled, is provided by the House of Commons Library’s Statistical Literacy Guide http://www.parliament.uk/business/publications/research/briefing-papers/SN04944/statistical-literacy-guide
As well as the patterns we can see over time by plotting data against a time axis, we can also look for patterns in space…
In part because they are so recognisable to the majority of people as an idea as well as an artefact, maps are widely used in many publications.
I have already mentioned how the use of a map to compare travel claims by MPs based on their constituency locations provided a way of making a particular sort of comparison between MPs (in particular, a comparison based on geographical location).
But we can take the idea of a map more generally, as a spatial distribution of points that are related in some way, with strong relations represented as spatial proximity.
Things that are close together on the page are taken to be close together in some sort of space, a space which may be conceptual or social, not just (or not even) geographic.
Take this map, for example, a map of Twitter users commonly followed by a sample of followers of @UL_journalism.
The map has been laid out so that Twitter users who are heavily interlinked are grouped closely together (for the most part, at least). A network statistic has been used in an attempt to colour clusters of nodes with high interconnection. The coloured regions thus represent a first attempt at identifying different groupings of Twitter user. You will note how the spatial layout algorithm and the grouping/colouring algorithm complement each other well – they both seem to tell a similar story, where the story is that certain groups of individuals are somehow alike.
About the technique: http://schoolofdata.org/2014/02/14/mapping-social-positioning-on-twitter/
Let’s have a closer look at some of the regions…