3. The key thing here is to learn how to solve
your own problems. Asking a tutor should be
your last resort - they will not be there for the rest
of your life!
4. 1. Coming up with a question
You need to find a data source. But where? Spend 15 minutes mapping out potential
data sources related to your field. They might be commercial or governmental; they
might need collecting or already be compiled somewhere. For example, if your field
were cycling, there would be:
● transport data
● crime data
● health data (encouraging people to cycle as part of a healthy lifestyle, for
example)
● environmental data (pollution)
● community data (things being shared online by cyclists)
Also take a look at the examples at http://delicious.com/paulb/foieg
5. 2. Use advanced search techniques to find data for a journalistic
question
There are lots of different ways to search, not just typing things
into Google.
You can limit results by file type, domain or site, and combine terms with
Boolean operators.
6. ● Limit by filetype:
○ filetype:xls will restrict results to Excel spreadsheets;
○ filetype:csv to 'comma separated values' spreadsheets;
○ filetype:doc to Word documents - often used for internal documents
○ filetype:pdf to PDFs - often used for official reports
● Limit by domain:
○ site:gov.uk will restrict results to UK government websites
○ site:ac.uk to UK educational establishments (not all of them
reputable) - the US equivalent is site:edu
○ site:org.uk to (mostly) nonprofit organisations - again, this is not
guaranteed. You can also try site:org, although this will include
results from other countries.
○ site:mod.uk - the Ministry of Defence
○ site:nhs.uk - NHS sites
○ site:dh.gov.uk - Department of Health
○ site:police.uk - police websites, including British Transport Police
and the Met
● Limit by website:
○ site:bolton.gov.uk will further limit results to just one website,
rather than all local authority websites.
○ Likewise, site:city.ac.uk would only return results from City
University's website
● You can limit your search further by using quotation marks so that
only pages containing the exact phrase are returned, e.g. "annual
report"
● You can also expand it by using 'Boolean' operators like OR, e.g.
7. Then put it all together:
e.g. "deaths in police custody" filetype:xls site:gov.uk
Try other 'operators' such as
● + before a search term to ensure it is in the pages
themselves, e.g. +custody
● phrases in quotes, e.g. "deaths in custody"
● The * wildcard, e.g. "deaths in * custody"
● The ~ operator for synonyms, e.g. ~deaths
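The way these operators combine into one search string can be sketched in Python. This is only an illustration: the helper name build_query and its parameters are invented here, and the Google search URL at the end is included just to show how the finished string would be encoded for a browser.

```python
from urllib.parse import quote_plus


def build_query(terms="", phrase=None, filetype=None, site=None):
    """Assemble a search string from the operators on the slides.

    build_query is a hypothetical helper, not part of any search API.
    """
    parts = []
    if terms:
        parts.append(terms)                   # ordinary keywords
    if phrase:
        parts.append(f'"{phrase}"')           # exact phrase in quotation marks
    if filetype:
        parts.append(f"filetype:{filetype}")  # e.g. xls, csv, doc, pdf
    if site:
        parts.append(f"site:{site}")          # a domain or a single website
    return " ".join(parts)


query = build_query(phrase="deaths in police custody", filetype="xls", site="gov.uk")
print(query)
# The same string, encoded so it can be pasted into a browser address bar
print("https://www.google.com/search?q=" + quote_plus(query))
```

Swapping site="gov.uk" for site="bolton.gov.uk" narrows the same question to one council.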
8. 3. Making sense of the data
Chances are that the data you've found will raise further questions.
There may be:
● jargon that you need to understand,
● codes that need translating,
● holes in the data,
● contextual data needed: the populations of different regions; data
for previous years; etc.
● questions about how it was gathered - the methodology
Use your journalistic skills to answer those
questions.
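One common reason for needing contextual data is that raw counts are hard to compare between regions of different sizes. A minimal sketch of turning counts into a rate per 100,000 people follows; the region names and all figures are made up purely for illustration.

```python
def rate_per_100k(count, population):
    """Express a raw count as a rate per 100,000 people."""
    return count * 100_000 / population


# Hypothetical figures: (number of incidents, population)
regions = {
    "Region A": (120, 1_500_000),
    "Region B": (45, 300_000),
}

for name, (count, population) in regions.items():
    print(name, count, "incidents =", rate_per_100k(count, population), "per 100,000")
```

In this invented example Region A has more incidents, but Region B has the higher rate once population is taken into account.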
9. Spreadsheet skills
You can also use some spreadsheet techniques to put the data into a
form that is going to be easier to interrogate - for example try the
following:
● Split addresses so that the postcode is in a separate column
(Data > Text to Columns in Excel, or =SPLIT in Google Docs) -
or separate forename and surname.
● Count how many times a value appears (=COUNTIF), or how many
values are above a certain number, e.g. =COUNTIF(D:D,">100").
● Work out the total using =SUM(D:D) if your numbers are in
column D, for example
● Work out the amount per day by using =SUM(D:D)/30 for a 30
day month, etc.
● Work out a median average by using a formula like =MEDIAN(D:D).
Compare that with other types of average like =AVERAGE(D:D)
or =MODE(D:D)
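If you prefer a script to a spreadsheet, the same steps can be sketched in Python using only the standard library. The addresses, postcodes and amounts below are invented, and split_postcode assumes the postcode is always the last two space-separated parts of the address, which real data will not always respect.

```python
import statistics

# Invented rows: (address, amount)
rows = [
    ("10 High Street, Bolton BL1 1AA", 120),
    ("2 Mill Lane, Bolton BL3 4ZZ", 80),
    ("5 Park Road, Bury BL9 0XX", 120),
    ("7 Church Walk, Bury BL9 5QQ", 400),
]


def split_postcode(address):
    """Split off the last two space-separated parts as the postcode."""
    parts = address.rsplit(" ", 2)
    return parts[0].rstrip(","), " ".join(parts[1:])


amounts = [amount for _, amount in rows]

total = sum(amounts)                          # like =SUM(D:D)
per_day = total / 30                          # like =SUM(D:D)/30
count_120 = amounts.count(120)                # like =COUNTIF(D:D,120)
above_100 = sum(1 for a in amounts if a > 100)  # like =COUNTIF(D:D,">100")
median = statistics.median(amounts)           # like =MEDIAN(D:D)
mean = statistics.mean(amounts)               # like =AVERAGE(D:D)
mode = statistics.mode(amounts)               # like =MODE(D:D)

print(split_postcode(rows[0][0]))
print(total, per_day, count_120, above_100, median, mean, mode)
```

Notice how the one large value pulls the mean well above the median, which is why it is worth comparing the different averages.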
10. 4. Basic visualisations
Find a transcript of a politician's - or two politicians' - speeches and
visualise them using Wordle, Tagxedo or ManyEyes. (The
advanced search techniques mentioned above may help.)
You can either compare one politician's speeches on a particular issue before
and after taking office - or one politician's speech with that of his or her
replacement.
Spend some time tweaking the visualisation:
● Are similar words treated differently, e.g. "patient" and "patients" or
"choice" and "options"? Should you combine the counts to clarify the
emphases? What are the ethical issues of doing so?
● Should you reduce your sample to the top 10 or 20 words or phrases to
make it clearer?
● Can you customise the words included (try copying into a text editor first),
colour scheme, arrangement, fonts, etc. to greater effect?
● Is a word cloud best - or should you use a bar chart based on word
counts?
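The word counts behind a word cloud or bar chart can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the sample sentence is invented, the STOPWORDS list is deliberately tiny, and the MERGE table simply encodes the editorial choice of treating "patients" as "patient" and "options" as "choice".

```python
import re
from collections import Counter

# A deliberately short, hypothetical list of words to ignore
STOPWORDS = {"the", "a", "and", "of", "to", "in", "for", "is", "we", "will", "more"}

# Editorial decision: which words to count together
MERGE = {"patients": "patient", "options": "choice"}


def word_counts(text, merge=None):
    """Count words, skipping stopwords and optionally merging similar words."""
    merge = merge or {}
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(merge.get(w, w) for w in words if w not in STOPWORDS)


# Invented sample text, not a real speech
speech = ("We will give patients more choice. "
          "Every patient deserves options, and choice matters.")

print(word_counts(speech).most_common(3))         # similar words kept separate
print(word_counts(speech, MERGE).most_common(3))  # similar words combined
```

Running both versions side by side shows how much the merging decision changes the apparent emphasis, which is the ethical question raised above.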
11. Advanced tutorial 1 - GDoc webscraper
Follow the tutorials tagged 'importHTML' on Excel Notes:
http://excelnotes.posterous.com/tag/importhtml
...and 'importXML' on the Online Journalism Blog:
http://onlinejournalismblog.com/tag/importxml (start from the bottom)
For a really 'live' scraper, see instructions on how to grab XML from Backtweets or
RSS from a Twitter search in this tutorial:
http://www.brelson.com/2009/11/using-google-spreadsheets-to-extract-twitter-data/
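To get a feel for what an importHTML-style formula does, here is a rough Python sketch that pulls the nth table out of a piece of HTML using only the standard library. It is not how Google's function is implemented: the class TableGrabber, the function import_html_table and the sample HTML are all invented for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect every table on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def import_html_table(html, index=1):
    """Return table number `index` (counting from 1) found in the HTML."""
    parser = TableGrabber()
    parser.feed(html)
    return parser.tables[index - 1]


# Invented sample page
SAMPLE = ("<table><tr><th>Force</th><th>Deaths</th></tr>"
          "<tr><td>Force A</td><td>3</td></tr></table>")

for row in import_html_table(SAMPLE, 1):
    print(row)
```

For a live page you would first download the HTML (for example with urllib.request.urlopen) and pass the decoded text to import_html_table.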
12. Advanced tutorial 2 - interrogating data
Follow the tutorial at http://excelnotes.posterous.com/tag/filters
And the one at http://excelnotes.posterous.com/tag/sumifs
Or if you want to play with Google Refine, search for 'Getting Started
With Local Council Spending Data' or go to
http://blog.ouseful.info/2011/01/28/getting-started-with-local-council-spending-data/
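The ideas behind filters and SUMIFS can also be sketched in Python, which may help before working through the tutorials. The spending rows, department names and supplier names below are invented, and the sumifs function is a hypothetical stand-in that only supports exact matches.

```python
# Invented council spending rows
spending = [
    {"department": "Highways", "supplier": "Supplier A", "amount": 1200},
    {"department": "Highways", "supplier": "Supplier B", "amount": 300},
    {"department": "Libraries", "supplier": "Supplier A", "amount": 450},
    {"department": "Highways", "supplier": "Supplier A", "amount": 800},
]


def sumifs(rows, sum_field, **criteria):
    """Add up sum_field for rows where every criterion matches exactly."""
    return sum(
        row[sum_field]
        for row in rows
        if all(row[field] == value for field, value in criteria.items())
    )


# A filter: keep only payments of 500 or more
large_payments = [row for row in spending if row["amount"] >= 500]

print(len(large_payments))
print(sumifs(spending, "amount", department="Highways", supplier="Supplier A"))
print(sumifs(spending, "amount", supplier="Supplier A"))
```

Each keyword argument plays the role of one criteria range and criterion pair in a spreadsheet SUMIFS formula.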
13. Advanced tutorial 3 - Scraper tools
Data can come in all sorts of forms. Based on the data you have already found, try
one or more of the following:
● Using a PDF conversion service to get to the data within - a list here:
http://helpmeinvestigate.posterous.com/tag/pdfs - also:
http://www.pdftoexcelonline.com/
● Grabbing tables from a database search: try the Firefox plugin Outwit Hub
(free version stores 100 results; buy a licence for more)