Conversations with data


Tony Hirst

#dalmooc 27/10/12 slides




Conversations with data

1. Conversations with Data Tony Hirst Computing and Communications, The Open University

2. (Recognising and addressing a skills gap)

3. “The Technical Tools of Statistics” read at the 125th Anniversary Meeting of the American Statistical Association, Boston, November 1964, published in April 1965 American Statistician. http://cm.bell-labs.com/cm/ms/departments/sia/tukey/memo/techtools.html /via Adam Cooper, “Exploratory Data Analysis” http://blogs.cetis.ac.uk/adam/2012/05/18/exploratory-data-analysis/ John Tukey “journeyman carpenter of data-analytical tools”

4. “A Boy's Work is Never Done”, KellyB. (flickr: foreverphoto/2467694199/)

5. “Exploratory data analysis is an attitude, a flexibility, and reliance on display, not a bundle of techniques and should be so taught.” John Tukey Tukey, John W. "We need both exploratory and confirmatory." The American Statistician 34.1 (1980): 23-25. http://www.ece.rice.edu/~fk1/classes/ELEC697/TukeyEDA.pdf

6. “I … cannot disagree strongly enough with statements about the dangers of putting powerful tools in the hands of novices. Computer algebra, statistics, and graphics systems provide plenty of rope for novices to hang themselves and may even help to inhibit the learning of essential skills needed by researchers. The obvious problems caused by this situation do not justify blunting our tools, however. They require better education in the imaginative and disciplined use of these tools. And they call for more attention to the way powerful and sophisticated tools are presented to novice users.” Leland Wilkinson, The Grammar of Graphics, Springer-Verlag, 1999, ISBN 0-387-98774-6, p15-16.

7. Data accessibility Data sensemaking

8. Clean Shape Augment Look

9. Dirty Data

10. openrefine.org

11.

12.

13.

14.

15.

16.

17.

18. Shapes…

19.

20. I see trees…

21. See also: IPython notebook demo http://nbviewer.ipython.org/gist/psychemedia/9c54721e853403b43d21/pivotTable_demo.ipynb

22. “There is no more reason to expect one graph to ‘tell all’ than to expect one number to do the same.” -- John Tukey

23. If quantities are conserved, can you think of them in terms of flow?

24. “[T]he picture examining eye is the best finder we have of the wholly unanticipated.” Tukey, John W. "We need both exploratory and confirmatory." The American Statistician 34.1 (1980): 23-25. http://www.ece.rice.edu/~fk1/classes/ELEC697/TukeyEDA.pdf John Tukey

25. How can we look at data?

26.

27.

28. How do we ask questions of data?

29. underspend filetype:xls site:gov.uk Search limits

30. Structured queries underspend filetype:xls site:gov.uk select webPages where text like “%underspend%” and filetype=“xls” and domain=“gov.uk” SQL

31. Count things Sort things

32. http://www.coolinfographics.com/blog/2014/8/29/false-visualizations-sizing-circles-in-infographics.html

33. How do we interpret the answers?

34. Look for outliers Top 3… …bottom 3

35. Outliers may be rare occurrences over time too… Streaks and runs…

36. Look for similarities & differences

37.

38.

39.

40. Look for trends

41.

42.

43.

44. Look for patterns & structure

45.

46.

47.

48. “Hand-drawing of graphs, except perhaps for reproduction in books and in some journals, is now economically wasteful, slow, and on the way out.” – John Tukey

49.

50.

51. Recording your conversations

52. Rstudio.org

53. IPython Notebook

54. “I know of no person or group that is taking nearly adequate advantage of the graphical potentialities of the computer.” – John Tukey

55. Hopefully, that contained some ouseful.info -- @psychemedia

Editor's notes

Wikipedia – Journeyman: “A journeyman is an individual who has completed an apprenticeship and is fully educated in a trade or craft, but not yet a master. To become a master, a journeyman has to submit a master work piece to a guild for evaluation and be admitted to the guild as a master. “In parts of Europe, as in later medieval Germany, spending time as a wandering journeyman (Wandergeselle), moving from one town to another to gain experience of different workshops, was an important part of the training of an aspirant master. Carpenters in Germany have retained the tradition of traveling journeymen even today, although only a few still practice.”
Bar charts are a very effective way of displaying particular sorts of information, such as counts. But what other ways are there of displaying data?
Datawrapper provides a variety of chart types, including:
horizontal and vertical (column) bar charts;
grouped bars that collate different bars according to groups (for example, election-on-election percentage of the vote for different political parties);
stacked column charts (for example, for a selection of countries we could display a column showing the total number of medals, constructed by stacking the individual gold, silver and bronze medal counts for those countries);
line charts, which are widely used for plotting some value on the vertical y-axis against time on the horizontal x-axis;
pie charts, to show proportions of a whole, and variants thereof, such as the donut chart (a pie chart with the middle cut out);
simple data tables (never underestimate the power of a table – they can be really useful for showing specific values, and can be very powerful when allowing the user to sort the table either by ascending or descending values in particular columns);
maps, which, as we shall see, can draw out very powerful relationships across data elements.
We’ve also seen some other “basic” charts that can be useful for displaying the distribution of data elements:
the block histogram shows a count on the y-axis of data elements falling within particular ranges of values on the x-axis;
the scatterplot allows us to plot two values against each other, for example height versus weight.
These charts can provide us with clues about possible correlations or relationships between the two values. Some scatterplot tools further allow us to colour each point according to group membership so that we can look to see whether numbers are clustered or grouped according to group membership.
Visualising data is a powerful way of asking questions of data – what data points you choose to display and how you display them represent the framing of the question. What the data looks like is the response, but a response that often takes careful reading. The data source has drawn you the answer – you need to turn it into words that you can use to formulate further questions to check your understanding of the answer first provided. (Each question (each chart) typically leads to another… or more than one other…) Asking questions that have a graphical answer is one way of querying a data source – but are there other approaches? Let’s explore that a little more – what do we mean by asking questions of data?
Custom search engines are a powerful tool for helping us develop focused web search tools that limit results to a particular part of the web we are interested in, either by location or topic. We can also use (advanced) search limits in ‘everyday’ web queries using the major web search engines. For example, the query shown on this slide searches for the word underspend appearing in Excel spreadsheets (filetype:xls) that can be found on UK government websites (or more specifically, websites hosted on the gov.uk domain (site:gov.uk)). Another query limit combination I have found useful is: confidential filetype:ppt
This can turn up presentations that have been delivered at closed corporate events but that have leaked on to the web…
Even if you don’t consider yourself a geek or database expert, writing advanced search queries using search limits is but a small step away from writing queries over databases themselves. One of the most widely used languages for querying databases is SQL. The above slide shows a simple, made-up SQL query that could have a similar effect to the simpler search engine query, made over a very simple search engine database. The idea is that we select those webPages where the text content of the webpage contains the word underspend anywhere – the % signs denote wildcard characters so the underspend word can appear preceded or followed by any number of arbitrary characters. We also want the query to be limited to pages that have a particular filetype and domain. Far more complicated queries can be written over far more complex databases. What’s important is that you develop an idea of what sorts of database structure and query are possible, not necessarily that you can run and query such databases yourself.
For more examples, see:
Asking Questions of Data – Garment Factories Data Expedition – http://schoolofdata.org/2013/05/24/asking-questions-of-data-garment-factories-data-expedition/
Asking Questions of Data – Some Simple One-Liners – http://schoolofdata.org/2013/05/13/asking-questions-of-data-some-simple-one-liners/
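To make the made-up query on the slide concrete, here is a minimal sketch in Python using the standard sqlite3 module. The webPages table, its columns (url, text, filetype, domain) and the sample rows are all invented for illustration; a real search engine index looks nothing like this, and matching on an exact domain value is a simplification of what site:gov.uk does.

```python
import sqlite3

def find_pages(rows, word, filetype, domain):
    """Return URLs of pages whose text contains `word`, limited by filetype and domain."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, very simple "search engine database": one row per web page.
    conn.execute(
        "CREATE TABLE webPages (url TEXT, text TEXT, filetype TEXT, domain TEXT)"
    )
    conn.executemany("INSERT INTO webPages VALUES (?, ?, ?, ?)", rows)
    # The % wildcards either side of the word let it appear anywhere in the text.
    query = """
        SELECT url FROM webPages
        WHERE text LIKE ? AND filetype = ? AND domain = ?
        ORDER BY url
    """
    result = [r[0] for r in conn.execute(query, (f"%{word}%", filetype, domain))]
    conn.close()
    return result

# Invented sample data.
SAMPLE_PAGES = [
    ("a.gov.uk/budget.xls", "Departmental underspend by quarter", "xls", "gov.uk"),
    ("a.gov.uk/report.pdf", "Notes on the underspend", "pdf", "gov.uk"),
    ("b.example.com/x.xls", "underspend figures", "xls", "example.com"),
    ("c.gov.uk/staff.xls", "Staff headcount", "xls", "gov.uk"),
]

print(find_pages(SAMPLE_PAGES, "underspend", "xls", "gov.uk"))
```

Only the first sample row satisfies all three conditions at once, which is the same narrowing effect the search limits have in the web query.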
One of the simplest, but often one of the most useful, things we can do is to count things. You just need to be creative in what you count! One of the nice features of working with database query languages such as SQL is that we can write queries that count the number of responses and allow us to rank results on that basis. For example, in a database of public spending transactions with different companies, we could count the number of transactions with a particular company, sum the value of transactions carried out with a particular company, or find the companies with the largest total amount spent.
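As a rough sketch of counting and sorting with SQL, the following Python example builds a small in-memory table of spending transactions and ranks suppliers by total spend. The spend table, its columns and the supplier names are made up for the example and do not come from any real spending dataset.

```python
import sqlite3

def supplier_league_table(transactions):
    """Return (supplier, transaction count, total amount) rows, largest total first."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: one row per payment to a supplier.
    conn.execute("CREATE TABLE spend (supplier TEXT, amount REAL)")
    conn.executemany("INSERT INTO spend VALUES (?, ?)", transactions)
    query = """
        SELECT supplier, COUNT(*) AS n, SUM(amount) AS total
        FROM spend
        GROUP BY supplier
        ORDER BY total DESC
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows

# Invented sample transactions.
SAMPLE_SPEND = [
    ("Acme Books", 120.0),
    ("Acme Books", 80.0),
    ("Telco Ltd", 500.0),
    ("Paper Co", 50.0),
    ("Telco Ltd", 250.0),
]

for supplier, n, total in supplier_league_table(SAMPLE_SPEND):
    print(supplier, n, total)
```

GROUP BY does the counting and summing per supplier, and ORDER BY turns the result into a league table.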
This further refinement of the same graphic shows how the two values can be compared. On the left, each column is rank ordered and lines connect similar items, offering a direct column-based comparison. On the right, the ordering is according to the rank order of the right-hand column, allowing direct comparison across the rows.
As has already been mentioned, a key part of the journalistic exercise is putting things into context. When working with data, interpreting what the data says often depends on understanding the context and more importantly, the caveats, that arise by virtue of asking a particular question of a particular dataset that has been collected in a particular way under particular conditions. That said, given a particular data set, are there any obvious questions we can ask of it?
When results are ranked, as for example in the case of league tables, there are often easy-pickings stories to be had around the top three and bottom three positions. In national rankings, local news stories can be identified if your local schools or council appear at either of those extremes.
For contextualisation purposes, it often makes sense to look at distributions. Many summary statistics report on the mean value, but looking at measures of variation, or spread, about a mean, as well as the position of a median value, can often change the context of a story. If the lecture room has 20 students in it, each on an income of a £6,000 maintenance loan per year, the total income is £120,000 and the mean income is £6,000. If an academic in the room is on £40,000, the total income for the room is £160,000 and the mean income is now just a little over £7,500. If we define a poverty level as a mean income below £10,000, the members of the room are, on average, in poverty. If a senior academic such as a professor on an income over £65,000 wanders into the room, the total income goes to over £225,000. With 22 people now in the room, the mean income is over £10,000: the room is out of poverty. The median income, however, is still £6,000.
As well as top, bottom, mean and median, we should also look to outliers. If Bill Gates or Mark Zuckerberg walks into a bar, the average net worth of people in that bar is likely to go up to a level of previously unimagined wealth. Here are several reasons why you should pay attention to outliers: they may be ‘dirty’ or incorrect data points that need to be corrected and that may well raise questions about data quality; the outlier may truly be an outlier, a remarkable point and a story in its own right; the outlier may skew other measures, such as mean values or other summary statistics. In such cases, it may make sense to use other measures or to rerun the summary statistic without including the outlier values to get a better feel for how the other members of the distribution relate to each other.
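The lecture room arithmetic can be checked with a few lines of Python using the standard statistics module. The professor's income is taken as exactly £65,000 here, which is an assumption: the note only says "over £65,000", so the totals and means printed are lower bounds.

```python
from statistics import mean, median

# 20 students, each on a 6,000 pound maintenance loan.
students = [6000] * 20
# An academic on 40,000 joins the room.
with_academic = students + [40000]
# A professor joins; 65,000 is assumed as the lower bound of "over 65,000".
with_professor = with_academic + [65000]

for label, incomes in [
    ("students only", students),
    ("plus academic", with_academic),
    ("plus professor", with_professor),
]:
    print(label, len(incomes), sum(incomes), round(mean(incomes)), median(incomes))
```

The mean moves from 6,000 to roughly 7,600 and then past 10,000, while the median stays at 6,000 throughout, which is why the median is the better guide to what a typical person in the room earns.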
This rather dense graphic is a view over local council spending data in my local area as relates to spend on libraries. The separate charts show the accumulated spend over a period of time with different suppliers. The intention of the display was to provide at a glance a view of accumulated spend with different companies across different directorates and spending areas to see whether any companies had a significant spend compared to other companies. The table at the bottom shows the top of a league table of companies with the largest accumulated spend by directorate and expense type. At first glance, the spend on phone lines with different suppliers seems to outweigh the spend on books. How can that be? Are the librarians spending their time calling premium rate phone lines? If we guess at 20 libraries and a 6 month spend period, then assume that the phone lines correspond to broadband data bills, do the monthly payments per library still seem outrageous? These assumptions are testable via questions to the relevant authorities, of course, but demonstrate the care we need to take when trying to understand why a number that may appear to be large is that large. See also: Local Council Spending Data – Time Series Charts http://blog.ouseful.info/2013/11/06/local-council-spending-data-time-series-charts/
As well as looking for outliers, we should also look for similarities between things we expect to be different and differences between things we expect to be the same, or at least, similar.
Looking again at some of my local council’s spending data, I noticed a search on “music” pulled back what appeared to be a shift in responsibility between directorates for spend on school music service provision. An obvious question that follows is: if the service did change hands (something we can check), was there a resulting difference in the way that the directorates were spending? Could we, for example, identify whether any projects got dropped (or at least, renamed out of scope!)? This forensic approach can also be used to track the consequences of a shift in control of a service, if we know it has happened. When a service changes hands, we can keep a note of the fact and then a year on look for evidence of whether treatment of the service has changed, at least in its consequences for spending. See also: What Role, If Any, Does Spending Data Have to Play in Local Council Budget Consultations? http://blog.ouseful.info/2013/11/03/what-role-if-any-does-spending-data-have-to-play-in-local-council-budget-consultations/
If you are in the position of paying energy supply bills – electricity and gas – you’ll probably be familiar with the idea that payments are set so you tend to overpay on a monthly basis. After collecting the interest on your overpayments, the utility companies may eventually get round to sending you a small repayment to cover the excess (ex- of any interest, of course…). Is the same true at the council level? One thing I noticed in my local council’s spend with supplier Southern Electric was that there appeared to be more than a few “negative payments”. So where were these coming from? The chart shown in this slide has positive payments made by date (not ordered on an evenly spaced timeline) in black, and the magnitude of negative payments shown in red. Where a red triangle sits over a black dot, this shows that a positive and a negative payment of the same amount were made on the same day. Why’s that? Some days show several negative payments – again, what’s happening? There’s not necessarily anything suspicious going on, but what story does this chart appear to tell us, particularly in terms of the similarities in amount of certain positive and negative spends?
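The "red triangle over a black dot" pattern can also be looked for directly in the data. This is a minimal sketch, assuming payments are available as (date, amount) pairs; the function name and sample figures are invented, and this is not the code used to draw the chart.

```python
from collections import defaultdict

def matched_reversals(payments):
    """Return sorted (date, amount) pairs where +amount and -amount occur on the same day."""
    amounts_by_day = defaultdict(set)
    for day, amount in payments:
        amounts_by_day[day].add(amount)
    return sorted(
        (day, amount)
        for day, amounts in amounts_by_day.items()
        for amount in amounts
        if amount > 0 and -amount in amounts
    )

# Invented sample payments: (ISO date, amount in pounds).
SAMPLE_PAYMENTS = [
    ("2013-04-02", 150.0),
    ("2013-04-02", -150.0),   # same day, same magnitude: a matched pair
    ("2013-04-02", -20.0),    # negative payment with no positive twin
    ("2013-05-10", 75.5),
    ("2013-06-01", -75.5),    # same magnitude but a different day: not matched
]

print(matched_reversals(SAMPLE_PAYMENTS))
```

With real data it would be safer to hold amounts as whole pence rather than floating point pounds, so that equal amounts always compare as equal.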
Just by the by, this chart refines the question I’m asking of the spend with Southern Electric, asking for more information about positive and negative payments made on the gas and electricity accounts separately.
As well as similarities and differences, data can tell us tales about trends…
Regular releases from the ONS – the Office for National Statistics – provide bread and butter news stories on a regular basis according to a known schedule. For example, monthly job seeker figures get a monthly write-up in OnTheWight, the hyperlocal news blog local to me. The report makes a comparison between the current figures and figures from the previous month and from the same month of the previous year. The aim is to see how the numbers have changed month on month, and year on year. I started to explore a simple script that would take data directly from the ONS and produce assets that could be reused in a news story – for example, to produce a table showing the change in figures over recent months. I also started to explore ways in which we could automate the production of prose from the data [code: https://gist.github.com/psychemedia/7536017]. For example, the following phrase was generated automatically from monthly figures: The total number of people claiming Job Seeker's Allowance (JSA) on the Isle of Wight in October was 2781, up 94 from 2687 in September, 2013, and down 377 from 3158 in October, 2012. The words up and down were selected based on a simple if-then rule that compared figures to see which was the greater. The numbers and dates are pulled in from the data. The other words are canned phrases. The automated production of text from data is something that has received attention from several companies, particularly in the area of baseball reports and financial reporting. See for example: http://blog.ouseful.info/2013/05/22/notes-on-narrative-science-and-automated-insight/ Being able to define sentences and natural language constructions that can be used as templates to display data in textual form is a skill that could well feed into specialist areas of data driven reporting. Identifying the patterns in the data that can be mapped onto natural language explanations of those patterns in a reliable way is another area in which wordsmiths, statisticians and developers may have to work together in the future.
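The if-then rule and canned phrases described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the code from the linked gist; the function names are invented, and the figures are the ones quoted in the example sentence.

```python
def change_phrase(current, previous, label):
    """Pick 'up', 'down' or 'unchanged' by comparing the two figures."""
    if current > previous:
        return f"up {current - previous} from {previous} in {label}"
    if current < previous:
        return f"down {previous - current} from {previous} in {label}"
    return f"unchanged from {previous} in {label}"

def jsa_sentence(month, count, prev_month, prev_month_count, prev_year, prev_year_count):
    """Fill a canned sentence template with figures and comparison phrases."""
    return (
        "The total number of people claiming Job Seeker's Allowance (JSA) "
        f"on the Isle of Wight in {month} was {count}, "
        f"{change_phrase(count, prev_month_count, prev_month)}, and "
        f"{change_phrase(count, prev_year_count, prev_year)}."
    )

sentence = jsa_sentence("October", 2781, "September, 2013", 2687, "October, 2012", 3158)
print(sentence)
```

The "unchanged" branch is an addition for the case where two figures are equal, which the up/down rule as described does not cover.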
If we plot a line chart with some quantity against a time axis, we can often see increasing or decreasing trends over time. If we are looking for constant rates of increase in some value, it often makes sense to use a log (logarithmic) scale to display that value on the y-axis. Periodic trends can also be seen as ‘waves’ appearing in the line over time, but other displays can draw out periodicity or seasonality in a more visually compelling way. For example, in these charts – of jobless figures on the Isle of Wight once again – we have months ordered along the horizontal x-axis and the number of job seeker’s allowance claimants on the vertical y-axis. The separate coloured lines represent different years. On the left, we use a legend to identify the lines; on the right is an example of labelling the lines directly. The lines show strong seasonality in behaviour. The island being a tourist destination, job seeker figures tend to fall over the summer months. Putting lines for several years on the same axis allows us to compare annual cycles over time.
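Why a log scale suits constant rates of increase can be seen numerically. In this small Python sketch (the starting value of 100 and the 10% growth rate are arbitrary choices for illustration), the raw values climb by ever larger steps, but their logarithms climb by the same step each period, so the line on a log axis is straight.

```python
import math

def growth_series(start, rate, periods):
    """Values growing by a constant proportional `rate` each period."""
    return [start * (1 + rate) ** t for t in range(periods)]

def steps(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]

values = growth_series(100, 0.10, 6)            # 10% growth per period
raw_steps = steps(values)                        # grows every period
log_steps = steps([math.log10(v) for v in values])  # constant every period

print([round(s, 2) for s in raw_steps])
print([round(s, 4) for s in log_steps])
```

Every entry in the second list equals log10(1.1), which is the slope of the straight line you would see on a logarithmic y-axis.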
Another trend we can try to pull out is change over years for each given month. Here, the horizontal x-axis blocks out the months, as before, but within each month we have an ordered range of years. The line within each block thus represents the year-on-year change in numbers within a given month. The step change within each month suggests that the way the figures were calculated changed significantly several years ago. Further reading: a good guide to statistics as used by government, including a description of the way that “seasonal adjustments” are handled, is provided by the House of Commons Library’s Statistical Literacy Guide http://www.parliament.uk/business/publications/research/briefing-papers/SN04944/statistical-literacy-guide
As well as the patterns we can see over time by plotting data against a time axis, we can also look for patterns in space…
In part because they are so recognisable to the majority of people as an idea as well as an artefact, maps are widely used in many publications. I have already mentioned how the use of a map to compare travel claims by MPs based on their constituency locations provided a way of making a particular sort of comparison between MPs (in particular, a comparison based on geographical location). But we can take the idea of a map more generally, as a spatial distribution of points that are related in some way, with strong relations represented as spatial proximity. Things that are close together on the page are taken to be close together in some sort of space, a space which may be conceptual or social, not just (or not even) geographic.
Take this map, for example, a map of Twitter users commonly followed by a sample of followers of @UL_journalism. The map has been laid out so that Twitter users who are heavily interlinked are grouped closely together (for the most part, at least). A network statistic has been used in an attempt to colour clusters of nodes with high interconnection. The coloured regions thus represent a first attempt at identifying different groupings of Twitter users. You will note how the spatial layout algorithm and the grouping/colouring algorithm complement each other well – they both seem to tell a similar story, where the story is that certain groups of individuals are somehow alike. About the technique: http://schoolofdata.org/2014/02/14/mapping-social-positioning-on-twitter/ Let’s have a closer look at some of the regions…